Question 1

Use read_csv to read the CSV file into a DataFrame.

import pandas as pd

pns = pd.read_csv('persons.csv')

Then check whether there are any null values in the variable age, and drop those rows if there are. Next, convert all values in age and education from floats to integers. Finally, select all variables except wealthC and wealthI as features, and wealthC as the target.

# Drop rows with missing values only if age actually contains any nulls
check_nan = pns['age'].isnull().values.any()
if check_nan:
    pns.dropna(inplace=True)

# Convert age and education (column 'edu') from float to integer
pns['age'] = pns['age'].astype(int)
pns['edu'] = pns['edu'].astype(int)


# Features: every column except the two wealth measures; target: wealthC
X = pns.drop(["wealthC", "wealthI"], axis=1)
y = pns.wealthC

Question 2

From the above, we can conclude that although standardizing the features changes the coefficients substantially, it does not significantly improve the MSE or R^2 of the fitted model.
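Below is a minimal sketch of that comparison, assuming the X and y from Question 1; the train/test split and random_state are illustrative, not the original setup.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the raw features
lin_raw = LinearRegression().fit(X_train, y_train)
pred_raw = lin_raw.predict(X_test)

# Fit on standardized features (scaler fit on the training split only)
scaler = StandardScaler().fit(X_train)
lin_std = LinearRegression().fit(scaler.transform(X_train), y_train)
pred_std = lin_std.predict(scaler.transform(X_test))

# The coefficients differ greatly in scale, but the fit barely moves
print('raw: MSE', mean_squared_error(y_test, pred_raw), 'R^2', r2_score(y_test, pred_raw))
print('std: MSE', mean_squared_error(y_test, pred_std), 'R^2', r2_score(y_test, pred_std))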

Question 3

For all models I used the testing score to judge performance, and also checked the corresponding MSE as an additional criterion. For every model I chose 10-fold cross-validation, since in my previous experience changing k did not bring a significant difference in the results.
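As a sketch, those two criteria can be computed with an explicit 10-fold loop; LinearRegression here is only a stand-in for whichever model is being evaluated, and the shuffle/random_state settings are assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores, mses = [], []
for train_idx, test_idx in kf.split(X):
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]
    model = LinearRegression().fit(X_tr, y_tr)  # stand-in model
    scores.append(model.score(X_te, y_te))      # testing R^2
    mses.append(mean_squared_error(y_te, model.predict(X_te)))

print('mean test R^2:', np.mean(scores), 'mean test MSE:', np.mean(mses))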

Question 4

Question 5

Question 6

First of all, none of the models using wealthI as the target should be used, since they all have extremely large MSE values, even those corresponding to the optimal alpha in ridge and lasso regression.
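For reference, one way to locate such an optimal alpha is a simple grid search over cross-validated scores; the grid below is an assumption, not the one actually used, and Lasso can be swapped in for Ridge the same way.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

alphas = np.logspace(-3, 3, 50)  # assumed search grid
cv_scores = [cross_val_score(Ridge(alpha=a), X, y, cv=10).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(cv_scores))]
print('best alpha:', best_alpha)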

For models using wealthC as the target, there is no significant difference in testing score or MSE across all six model variants (three types of regression, each with and without standardized features). In fact, looking at just three digits after the decimal point, the scores are exactly the same. Thus, any of these six models can be considered the “best” model.
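That six-way comparison can be reproduced compactly with pipelines, as sketched below; default alphas stand in for the tuned ones, and the actual scores depend on the data.

from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

for name, reg in [('OLS', LinearRegression()), ('ridge', Ridge()), ('lasso', Lasso())]:
    for std in (False, True):
        model = make_pipeline(StandardScaler(), reg) if std else reg
        score = cross_val_score(model, X, y, cv=10).mean()  # mean testing R^2
        print(name, 'standardized' if std else 'raw', round(score, 3))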