
I have variables X1, X2, and Y, whose relationship is plotted below. X2 values are used for color coding.

X1, X2, and X3 are integer variables.

[Image: plot of the relationship between X1, X2, and Y, color-coded by X2]

The observed pattern is multimodal.

What is the best way to predict Y based on X1 and X2?

Can we use non-linear or hurdle models for this?

Also, what tools are available to do this in R?

  • This will get closed as it's off-topic for SO (coding questions only here) :) But maybe try a spline or another GAM (generalized additive model).
    – DanY
    Commented Feb 14, 2022 at 20:27

1 Answer


Generally speaking, there is no need to worry about the distribution of the response. Although you are showing a bivariate plot, it is possible that the multi-modality is explained by X2 (or by other, missing variables).

It is the distribution of the model residuals that matters (if it matters at all).

If the residuals are non-normal, then certain inferences may be invalid, although this may not be a problem at all if the model is used for prediction.
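To illustrate the point about residuals, here is a minimal sketch in base R. The data are simulated (not yours), and the column names, effect sizes, and the choice to treat X2 as a factor are all assumptions made for the example. The response is strongly multimodal on its own, yet the residuals are well behaved once X2 is in the model.

```r
# Simulated stand-in data: y is multimodal marginally because of x2
set.seed(1)
n  <- 300
x1 <- sample(1:20, n, replace = TRUE)
x2 <- sample(1:3,  n, replace = TRUE)
y  <- 2 * x1 + 40 * x2 + rnorm(n, sd = 3)

# Fit a model that includes x2, then look at the residuals, not at y itself
fit <- lm(y ~ x1 + factor(x2))
res <- residuals(fit)

# Spread of the raw response vs. spread of the residuals
sd_y   <- sd(y)
sd_res <- sd(res)
print(c(sd_y = sd_y, sd_res = sd_res))

# Visual check of residual normality (only drawn in an interactive session)
if (interactive()) {
  qqnorm(res)
  qqline(res)
}
```

With real data you would replace the simulated vectors with your own columns and inspect the Q-Q plot of `res` rather than a histogram of `y`.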

If you really do have a curvilinear association then you could consider:

  • transformations
  • non-linear terms
  • splines
  • generalised additive models (GAMs)
  • non-linear models
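Since the question asks about R, here is a rough sketch of how a few of these options might be fitted and compared. It uses `splines` (part of base R) and `mgcv` (a recommended package shipped with most R installations). The data frame `d`, its simulated sine-shaped relationship, and the degrees of freedom chosen are assumptions for illustration only; transformations and fully parametric non-linear models (e.g. `nls()`) are not shown.

```r
library(splines)  # ns(): natural cubic spline basis
library(mgcv)     # gam(), s(): generalised additive models

# Simulated stand-in data with a curvilinear effect of x1 (an assumption)
set.seed(42)
n <- 400
d <- data.frame(x1 = sample(1:30, n, replace = TRUE),
                x2 = sample(1:4,  n, replace = TRUE))
d$y <- 10 * sin(d$x1 / 5) + 3 * d$x2 + rnorm(n)

fit_lin  <- lm(y ~ x1 + x2, data = d)                       # straight line in x1
fit_poly <- lm(y ~ poly(x1, 3) + x2, data = d)              # non-linear (cubic) term
fit_ns   <- lm(y ~ ns(x1, df = 4) + x2, data = d)           # natural spline
fit_gam  <- gam(y ~ s(x1) + x2, data = d, method = "REML")  # GAM with a smooth of x1

# Lower AIC suggests a better trade-off between fit and complexity
aic <- AIC(fit_lin, fit_poly, fit_ns, fit_gam)
print(aic)
```

`plot(fit_gam)` will draw the estimated smooth for `x1`, which is often the quickest way to see whether the curvature is real.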

Of course, if the underlying problem is that you have missing explanatory variables, then some of these approaches may lead to an overfitted model.
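One way to guard against this is to compare models on held-out data. The sketch below is only an illustration on simulated data: the 70/30 split, the spline degrees of freedom, and the data-generating model are all assumptions. A very flexible spline in x1 that omits x2 is compared with a simple model that includes x2.

```r
library(splines)

# Simulated stand-in data: y is linear in x1 but also depends on x2
set.seed(7)
n <- 400
d <- data.frame(x1 = sample(1:30, n, replace = TRUE),
                x2 = sample(1:3,  n, replace = TRUE))
d$y <- 2 * d$x1 + 15 * d$x2 + rnorm(n, sd = 2)

# Hold out 30% of the rows for testing
train_idx <- sample(n, size = round(0.7 * n))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

fit_flex <- lm(y ~ ns(x1, df = 12), data = train)   # flexible in x1, x2 omitted
fit_full <- lm(y ~ x1 + factor(x2), data = train)   # simple, but includes x2

# Root mean squared prediction error on the held-out rows
rmse <- function(fit, newdata) {
  sqrt(mean((newdata$y - predict(fit, newdata))^2))
}
rmse_flex <- rmse(fit_flex, test)
rmse_full <- rmse(fit_full, test)
print(c(flexible_without_x2 = rmse_flex, simple_with_x2 = rmse_full))
```

If extra flexibility does not improve the held-out error, the structure you are seeing is more likely due to a missing variable than to curvature.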
