{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Lab11_exercises.ipynb",
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Tree-based methods \n",
"\n",
"In this lab, we'll consider the Pima Indians dataset, which primarly objective is to diagnostically predict whether female patients suffer from diabetes or not, based on a series of medical attributes. To this end, we'll be using several tree-based methods that were covered during the course lectures. \n",
"\n",
"Here are some info regarding the dataset's attributes :\n",
" - Num_pregnant : The number of pregnancies the patient had. \n",
" - glucose_con : Patient's plasma glucsose concentration.\n",
" - blood_pressure : Patient's dialostic blood pressure (mmHg).\n",
" - triceps_thickness : Patient's triceps skin-fold thickness (mm).\n",
" - insulin : Patient's 2-h serum insulin (mu U/mL).\n",
" - bmi : Patient's body mass index (kg/m^2).\n",
" - dpf : Patient's diabetes pedigree function.\n",
" - age : Patient's age. \n",
" - diabetes : Whether the patient has diabetes (1) or not (0)."
],
"metadata": {
"id": "vP2BHEtuQ26i",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**Load the necessary libraries** "
],
"metadata": {
"id": "LWWFCoLHEeA4",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"id": "Pa-8tr1GBLlE",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import numpy as np \n",
"import pandas as pd \n",
"from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, roc_curve, roc_auc_score, mean_squared_error\n",
"from sklearn.model_selection import train_test_split, cross_validate, RandomizedSearchCV\n",
"from sklearn.impute import SimpleImputer\n",
"import matplotlib.pyplot as plt\n",
"import math\n",
"from sklearn.pipeline import Pipeline \n",
"from sklearn.tree import DecisionTreeClassifier, plot_tree, DecisionTreeRegressor\n",
"from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, GradientBoostingRegressor, BaggingClassifier, BaggingRegressor"
]
},
{
"cell_type": "markdown",
"source": [
"**1) Load the dataset, get its general information, and check for missing values.**"
],
"metadata": {
"id": "dHo7LMdIEhmQ",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 41,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Num_pregnant glucose_con blood_pressure triceps_thickness insulin \\\n",
"0 6.0 148.0 72.0 35.0 0.0 \n",
"1 1.0 85.0 66.0 29.0 0.0 \n",
"2 8.0 183.0 64.0 0.0 0.0 \n",
"3 1.0 89.0 66.0 23.0 94.0 \n",
"4 0.0 137.0 40.0 35.0 168.0 \n",
"\n",
" bmi dpf age diabetes \n",
"0 33.6 0.627 50.0 1 \n",
"1 26.6 0.351 31.0 0 \n",
"2 23.3 0.672 32.0 1 \n",
"3 28.1 0.167 21.0 0 \n",
"4 43.1 2.288 33.0 1 \n",
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 768 entries, 0 to 767\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Num_pregnant 750 non-null float64\n",
" 1 glucose_con 755 non-null float64\n",
" 2 blood_pressure 761 non-null float64\n",
" 3 triceps_thickness 757 non-null float64\n",
" 4 insulin 768 non-null float64\n",
" 5 bmi 768 non-null float64\n",
" 6 dpf 767 non-null float64\n",
" 7 age 761 non-null float64\n",
" 8 diabetes 768 non-null int64 \n",
"dtypes: float64(8), int64(1)\n",
"memory usage: 60.0 KB\n",
"None\n",
"Num_pregnant 18\n",
"glucose_con 13\n",
"blood_pressure 7\n",
"triceps_thickness 11\n",
"insulin 0\n",
"bmi 0\n",
"dpf 1\n",
"age 7\n",
"diabetes 0\n",
"dtype: int64\n"
]
}
],
"source": [
"file = '../data/pima_indians_lab.csv'\n",
"\n",
"##Read dataframe##\n",
"\n",
"df = pd.read_csv(file,index_col=0)\n",
"#df= df.astype({'quality':'category'})\n",
"\n",
"\n",
"print(df.head())\n",
"print(df.info())\n",
"print(df.isna().sum())"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"# Decision Trees"
],
"metadata": {
"id": "Mje6W8ucZYvi",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**2) Select 'diabetes' as the target variable, and all the remaining columns as predictors. Create a pipeline containing the preprocessing steps (missing values imputer, scaler,...) and a Decision Tree classifier, which maximum depth should be set to 3 (through the 'max_depth' argument). Set the entropy as the split criterion. Do you think scaling the variables is necessary ?**\n",
"\n",
"**Fit this pipeline to the data (do not split the dataset for the time being), and plot the decision tree. How do you interpret it ?** \n",
"\n",
"**You'll need the 'plot_tree' class from the sklearn library. You can access the pipeline's classifier using the 'named_steps['classifier']' attributes.**"
],
"metadata": {
"id": "qKdTxEtuE47K",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 41,
"outputs": [],
"source": [
"imp_cont = SimpleImputer(missing_values=np.nan, strategy='mean')\n",
"\n",
"cont_columns = df.select_dtypes(exclude=['category', 'int']).columns\n",
"\n",
"df[cont_columns] = imp_cont.fit_transform(df[cont_columns])\n",
"\n",
"print(df.isna().sum())"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": 44,
"outputs": [
{
"ename": "ValueError",
"evalue": "Input contains NaN, infinity or a value too large for dtype('float64').",
"output_type": "error",
"traceback": [
"\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
"\u001B[0;31mValueError\u001B[0m Traceback (most recent call last)",
"Input \u001B[0;32mIn [44]\u001B[0m, in \u001B[0;36m<module>\u001B[0;34m\u001B[0m\n\u001B[1;32m 7\u001B[0m classifier \u001B[38;5;241m=\u001B[39m DecisionTreeClassifier(max_depth\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m3\u001B[39m,criterion\u001B[38;5;241m=\u001B[39m\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mentropy\u001B[39m\u001B[38;5;124m\"\u001B[39m)\n\u001B[1;32m 8\u001B[0m model \u001B[38;5;241m=\u001B[39m Pipeline(steps\u001B[38;5;241m=\u001B[39m[(\u001B[38;5;124m'\u001B[39m\u001B[38;5;124mpreprocessor\u001B[39m\u001B[38;5;124m'\u001B[39m,imp_cont),(\u001B[38;5;124m'\u001B[39m\u001B[38;5;124mclassifier\u001B[39m\u001B[38;5;124m'\u001B[39m,classifier)])\n\u001B[0;32m----> 9\u001B[0m \u001B[43mmodel\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfit\u001B[49m\u001B[43m(\u001B[49m\u001B[43mX\u001B[49m\u001B[43m,\u001B[49m\u001B[43mY\u001B[49m\u001B[43m)\u001B[49m\n",
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/sklearn/pipeline.py:394\u001B[0m, in \u001B[0;36mPipeline.fit\u001B[0;34m(self, X, y, **fit_params)\u001B[0m\n\u001B[1;32m 392\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_final_estimator \u001B[38;5;241m!=\u001B[39m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mpassthrough\u001B[39m\u001B[38;5;124m\"\u001B[39m:\n\u001B[1;32m 393\u001B[0m fit_params_last_step \u001B[38;5;241m=\u001B[39m fit_params_steps[\u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39msteps[\u001B[38;5;241m-\u001B[39m\u001B[38;5;241m1\u001B[39m][\u001B[38;5;241m0\u001B[39m]]\n\u001B[0;32m--> 394\u001B[0m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_final_estimator\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfit\u001B[49m\u001B[43m(\u001B[49m\u001B[43mXt\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43my\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mfit_params_last_step\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 396\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\n",
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/sklearn/tree/_classes.py:937\u001B[0m, in \u001B[0;36mDecisionTreeClassifier.fit\u001B[0;34m(self, X, y, sample_weight, check_input, X_idx_sorted)\u001B[0m\n\u001B[1;32m 899\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mfit\u001B[39m(\n\u001B[1;32m 900\u001B[0m \u001B[38;5;28mself\u001B[39m, X, y, sample_weight\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mNone\u001B[39;00m, check_input\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mTrue\u001B[39;00m, X_idx_sorted\u001B[38;5;241m=\u001B[39m\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mdeprecated\u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[1;32m 901\u001B[0m ):\n\u001B[1;32m 902\u001B[0m \u001B[38;5;124;03m\"\"\"Build a decision tree classifier from the training set (X, y).\u001B[39;00m\n\u001B[1;32m 903\u001B[0m \n\u001B[1;32m 904\u001B[0m \u001B[38;5;124;03m Parameters\u001B[39;00m\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 934\u001B[0m \u001B[38;5;124;03m Fitted estimator.\u001B[39;00m\n\u001B[1;32m 935\u001B[0m \u001B[38;5;124;03m \"\"\"\u001B[39;00m\n\u001B[0;32m--> 937\u001B[0m \u001B[38;5;28;43msuper\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfit\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 938\u001B[0m \u001B[43m \u001B[49m\u001B[43mX\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 939\u001B[0m \u001B[43m \u001B[49m\u001B[43my\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 940\u001B[0m \u001B[43m \u001B[49m\u001B[43msample_weight\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43msample_weight\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 941\u001B[0m \u001B[43m \u001B[49m\u001B[43mcheck_input\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mcheck_input\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 942\u001B[0m \u001B[43m \u001B[49m\u001B[43mX_idx_sorted\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mX_idx_sorted\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 943\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 944\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\n",
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/sklearn/tree/_classes.py:165\u001B[0m, in \u001B[0;36mBaseDecisionTree.fit\u001B[0;34m(self, X, y, sample_weight, check_input, X_idx_sorted)\u001B[0m\n\u001B[1;32m 163\u001B[0m check_X_params \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mdict\u001B[39m(dtype\u001B[38;5;241m=\u001B[39mDTYPE, accept_sparse\u001B[38;5;241m=\u001B[39m\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mcsc\u001B[39m\u001B[38;5;124m\"\u001B[39m)\n\u001B[1;32m 164\u001B[0m check_y_params \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mdict\u001B[39m(ensure_2d\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mFalse\u001B[39;00m, dtype\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mNone\u001B[39;00m)\n\u001B[0;32m--> 165\u001B[0m X, y \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_validate_data\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 166\u001B[0m \u001B[43m \u001B[49m\u001B[43mX\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43my\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mvalidate_separately\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43mcheck_X_params\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mcheck_y_params\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 167\u001B[0m \u001B[43m\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 168\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m issparse(X):\n\u001B[1;32m 169\u001B[0m X\u001B[38;5;241m.\u001B[39msort_indices()\n",
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/sklearn/base.py:579\u001B[0m, in \u001B[0;36mBaseEstimator._validate_data\u001B[0;34m(self, X, y, reset, validate_separately, **check_params)\u001B[0m\n\u001B[1;32m 577\u001B[0m check_X_params, check_y_params \u001B[38;5;241m=\u001B[39m validate_separately\n\u001B[1;32m 578\u001B[0m X \u001B[38;5;241m=\u001B[39m check_array(X, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mcheck_X_params)\n\u001B[0;32m--> 579\u001B[0m y \u001B[38;5;241m=\u001B[39m \u001B[43mcheck_array\u001B[49m\u001B[43m(\u001B[49m\u001B[43my\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mcheck_y_params\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 580\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m 581\u001B[0m X, y \u001B[38;5;241m=\u001B[39m check_X_y(X, y, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mcheck_params)\n",
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/sklearn/utils/validation.py:800\u001B[0m, in \u001B[0;36mcheck_array\u001B[0;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)\u001B[0m\n\u001B[1;32m 794\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mValueError\u001B[39;00m(\n\u001B[1;32m 795\u001B[0m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mFound array with dim \u001B[39m\u001B[38;5;132;01m%d\u001B[39;00m\u001B[38;5;124m. \u001B[39m\u001B[38;5;132;01m%s\u001B[39;00m\u001B[38;5;124m expected <= 2.\u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[1;32m 796\u001B[0m \u001B[38;5;241m%\u001B[39m (array\u001B[38;5;241m.\u001B[39mndim, estimator_name)\n\u001B[1;32m 797\u001B[0m )\n\u001B[1;32m 799\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m force_all_finite:\n\u001B[0;32m--> 800\u001B[0m \u001B[43m_assert_all_finite\u001B[49m\u001B[43m(\u001B[49m\u001B[43marray\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mallow_nan\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mforce_all_finite\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m==\u001B[39;49m\u001B[43m \u001B[49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[38;5;124;43mallow-nan\u001B[39;49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[43m)\u001B[49m\n\u001B[1;32m 802\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m ensure_min_samples \u001B[38;5;241m>\u001B[39m \u001B[38;5;241m0\u001B[39m:\n\u001B[1;32m 803\u001B[0m n_samples \u001B[38;5;241m=\u001B[39m _num_samples(array)\n",
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/sklearn/utils/validation.py:114\u001B[0m, in \u001B[0;36m_assert_all_finite\u001B[0;34m(X, allow_nan, msg_dtype)\u001B[0m\n\u001B[1;32m 107\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m (\n\u001B[1;32m 108\u001B[0m allow_nan\n\u001B[1;32m 109\u001B[0m \u001B[38;5;129;01mand\u001B[39;00m np\u001B[38;5;241m.\u001B[39misinf(X)\u001B[38;5;241m.\u001B[39many()\n\u001B[1;32m 110\u001B[0m \u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m allow_nan\n\u001B[1;32m 111\u001B[0m \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m np\u001B[38;5;241m.\u001B[39misfinite(X)\u001B[38;5;241m.\u001B[39mall()\n\u001B[1;32m 112\u001B[0m ):\n\u001B[1;32m 113\u001B[0m type_err \u001B[38;5;241m=\u001B[39m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124minfinity\u001B[39m\u001B[38;5;124m\"\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m allow_nan \u001B[38;5;28;01melse\u001B[39;00m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mNaN, infinity\u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[0;32m--> 114\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mValueError\u001B[39;00m(\n\u001B[1;32m 115\u001B[0m msg_err\u001B[38;5;241m.\u001B[39mformat(\n\u001B[1;32m 116\u001B[0m type_err, msg_dtype \u001B[38;5;28;01mif\u001B[39;00m msg_dtype \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m \u001B[38;5;28;01melse\u001B[39;00m X\u001B[38;5;241m.\u001B[39mdtype\n\u001B[1;32m 117\u001B[0m )\n\u001B[1;32m 118\u001B[0m )\n\u001B[1;32m 119\u001B[0m \u001B[38;5;66;03m# for object dtype data, we only check for NaNs (GH-13254)\u001B[39;00m\n\u001B[1;32m 120\u001B[0m \u001B[38;5;28;01melif\u001B[39;00m X\u001B[38;5;241m.\u001B[39mdtype \u001B[38;5;241m==\u001B[39m np\u001B[38;5;241m.\u001B[39mdtype(\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mobject\u001B[39m\u001B[38;5;124m\"\u001B[39m) \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m allow_nan:\n",
"\u001B[0;31mValueError\u001B[0m: Input contains NaN, infinity or a value too large for dtype('float64')."
]
}
],
"source": [
"pred = list(df.columns)\n",
"pred.remove('diabetes')\n",
"X = df[['diabetes']]\n",
"Y = df[pred]\n",
"\n",
"imp_cont = SimpleImputer(missing_values=np.nan, strategy='mean')\n",
"classifier = DecisionTreeClassifier(max_depth=3,criterion=\"entropy\")\n",
"model = Pipeline(steps=[('preprocessor',imp_cont),('classifier',classifier)])\n",
"model.fit(X,Y)\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**3) Let's see how the model's performance evolve as a function of the tree's maximum depth.**\n",
"\n",
"**To this end, split the dataset into a training and a test set following a 0.8/0.2 partition. Then, for maximum depths varying from 1 to 20, fit a Decision Tree classifier to the *training* data using a 10 folds cross-validation with the AUROC as metric. Plot the the means of the training and validation AUROCS across each folds as a function of the maximum depth. Also, compute the standard error of the means at each depth, and add it to the plot as shaded grey area around the means (cfr. plot below) . What can you conclude regarding the model's performance, as well as the uncertainty for the in-sample and out-of-sample AUROC estimates ?**\n",
"\n",
"**Identify which depth would lead a priori to the best model's out-of-sample performance. Using this depth, fit a decision tree to the training split and report the training AUROC and the test AUROC.** "
],
"metadata": {
"id": "hHndFhA_J4QK",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"source": [],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 651
},
"id": "wzjb32Tqm6ij",
"outputId": "cc477bf5-fb6a-4ed2-b208-283e1bc2b605",
"pycharm": {
"name": "#%%\n"
}
},
"execution_count": null,
"outputs": []
},
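{
"cell_type": "markdown",
"source": [
"*A possible sketch for this exercise (one of many valid approaches), assuming the X and y variables defined above; random seeds are arbitrary. We rely on sklearn's 'cross_validate' with 'return_train_score=True', and estimate the standard error of the mean as the fold standard deviation divided by the square root of the number of folds.*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Split the data: 80% training, 20% test.\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)\n",
"\n",
"depths = list(range(1, 21))\n",
"train_means, train_sems, val_means, val_sems = [], [], [], []\n",
"\n",
"for depth in depths:\n",
"    tree = DecisionTreeClassifier(max_depth=depth, criterion='entropy', random_state=0)\n",
"    cv = cross_validate(tree, X_train, y_train, cv=10, scoring='roc_auc', return_train_score=True)\n",
"    # Mean and standard error of the mean across the 10 folds.\n",
"    train_means.append(cv['train_score'].mean())\n",
"    train_sems.append(cv['train_score'].std() / math.sqrt(10))\n",
"    val_means.append(cv['test_score'].mean())\n",
"    val_sems.append(cv['test_score'].std() / math.sqrt(10))\n",
"\n",
"train_means, train_sems = np.array(train_means), np.array(train_sems)\n",
"val_means, val_sems = np.array(val_means), np.array(val_sems)\n",
"\n",
"plt.plot(depths, train_means, label='training AUROC')\n",
"plt.plot(depths, val_means, label='validation AUROC')\n",
"plt.fill_between(depths, train_means - train_sems, train_means + train_sems, color='grey', alpha=0.3)\n",
"plt.fill_between(depths, val_means - val_sems, val_means + val_sems, color='grey', alpha=0.3)\n",
"plt.xlabel('maximum depth')\n",
"plt.ylabel('AUROC')\n",
"plt.legend()\n",
"plt.show()\n",
"\n",
"# Refit at the depth with the best mean validation AUROC.\n",
"best_depth = depths[int(np.argmax(val_means))]\n",
"best_tree = DecisionTreeClassifier(max_depth=best_depth, criterion='entropy', random_state=0).fit(X_train, y_train)\n",
"print('best depth:', best_depth)\n",
"print('training AUROC:', roc_auc_score(y_train, best_tree.predict_proba(X_train)[:, 1]))\n",
"print('test AUROC:', roc_auc_score(y_test, best_tree.predict_proba(X_test)[:, 1]))"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
},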
{
"cell_type": "markdown",
"source": [
"Clearly, the decision tree starts overfitting very early as the maximum depth increases, shown by the increasing gap between the training and validation curves. \n",
"\n",
"As observed, the standard error on the training AUROC is much less than the standard error on the validation AUROC, wathever the depth. This is expected as the training AUROC in each fold is computed using 10 times as much data points as the validation AUROC, resulting in a lower variance in the training AUROC across each folds, and hence, in a lower standard error compared to the validation AUROC's standard error. Furthermore, as the depth increases, the model begins to interpolate the training data in each fold, and correctly predicts each training point with probability 1. As the AUROC is equal to 1 in each training fold, the standard error is equal to 0, as observed starting from approximately depth 13.\n",
"\n",
"On the other hand, the standard error of the validation AUROC remains approximately constant throughout the different depths. "
],
"metadata": {
"id": "u4-kO2sKOSIg",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"# Bagging "
],
"metadata": {
"id": "tdvKtPEa7KRe",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**4) Implement your own bagging algorithm by fitting a decision tree to each bootstrap sample. The bootstramp samples should be drawn from and of equal size as the training set. Set the the number of bootstrap samples to 30 and the maximum_depth of each decision as the optimum depth found above.**\n",
"\n",
"**Then using the decision trees fitted on each bootsrapped training data, predict on the test set and use the majority vote strategy to get the final predictions. Redo the same by averaging the trees probabilities. Display the confusion matrix for the test set.**"
],
"metadata": {
"id": "EwkZbxViZ_Y0",
"pycharm": {
"name": "#%% md\n"
}
}
},
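{
"cell_type": "markdown",
"source": [
"*A minimal bagging sketch, assuming the 'X_train'/'X_test'/'y_train'/'y_test' split and the 'best_depth' value from the sketch above:*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Manual bagging: 30 bootstrap samples, each the size of the training set.\n",
"rng = np.random.default_rng(0)\n",
"n_trees, n_train = 30, len(X_train)\n",
"trees = []\n",
"\n",
"for _ in range(n_trees):\n",
"    # Draw a bootstrap sample (with replacement) from the training set.\n",
"    idx = rng.integers(0, n_train, size=n_train)\n",
"    tree = DecisionTreeClassifier(max_depth=best_depth, criterion='entropy')\n",
"    tree.fit(X_train.iloc[idx], y_train.iloc[idx])\n",
"    trees.append(tree)\n",
"\n",
"# Majority vote over the 30 trees (0/1 labels, so the mean vote >= 0.5 wins).\n",
"votes = np.stack([tree.predict(X_test) for tree in trees])\n",
"y_pred_vote = (votes.mean(axis=0) >= 0.5).astype(int)\n",
"\n",
"# Same, but averaging the predicted probabilities instead.\n",
"probas = np.stack([tree.predict_proba(X_test)[:, 1] for tree in trees])\n",
"y_pred_proba = (probas.mean(axis=0) >= 0.5).astype(int)\n",
"\n",
"print('majority vote accuracy:', accuracy_score(y_test, y_pred_vote))\n",
"print('averaged probabilities accuracy:', accuracy_score(y_test, y_pred_proba))\n",
"\n",
"# Confusion matrix on the test set (majority-vote predictions).\n",
"ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_vote)).plot()\n",
"plt.show()"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
},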
{
"cell_type": "markdown",
"source": [
"# Random Forest"
],
"metadata": {
"id": "WlMQavsWhGO6",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**5) Perform a random search on a specified grid of hyper-parameters to find the best hyper-parameters configuration for a RandomForestClassifier. Set the scoring function as the AUROC and limit the number of combination to try to 10.**\n",
"\n",
"**Fit the best model found in the previous procedure to the training data, and predict on the test set. Report the test AUROC and display the ROC curve.**"
],
"metadata": {
"id": "b1_648UmlJrd",
"pycharm": {
"name": "#%% md\n"
}
}
},
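{
"cell_type": "markdown",
"source": [
"*A possible sketch; the hyper-parameter grid below is purely illustrative and can be adapted:*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Random search over an illustrative hyper-parameter grid.\n",
"param_grid = {\n",
"    'n_estimators': [50, 100, 200, 500],\n",
"    'max_depth': [2, 3, 5, 10, None],\n",
"    'max_features': ['sqrt', 'log2', None],\n",
"    'min_samples_leaf': [1, 2, 5, 10],\n",
"}\n",
"\n",
"search = RandomizedSearchCV(\n",
"    RandomForestClassifier(random_state=0),\n",
"    param_distributions=param_grid,\n",
"    n_iter=10, scoring='roc_auc', cv=10, random_state=0,\n",
")\n",
"search.fit(X_train, y_train)\n",
"print('best parameters:', search.best_params_)\n",
"print('best validation AUROC:', search.best_score_)\n",
"\n",
"# The best model is refit on the training data (refit=True by default);\n",
"# predict on the test set.\n",
"proba_test = search.predict_proba(X_test)[:, 1]\n",
"print('test AUROC:', roc_auc_score(y_test, proba_test))\n",
"\n",
"# ROC curve on the test set.\n",
"fpr, tpr, _ = roc_curve(y_test, proba_test)\n",
"plt.plot(fpr, tpr)\n",
"plt.plot([0, 1], [0, 1], linestyle='--', color='grey')\n",
"plt.xlabel('false positive rate')\n",
"plt.ylabel('true positive rate')\n",
"plt.show()"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
},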
{
"cell_type": "markdown",
"source": [
"# Boosting"
],
"metadata": {
"id": "qpJYcRiqfa0R",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**7) Fit a boosting classifier to the training data, and report the training and test AUROC's. You can use the GradientBoostingClassifier from the sklearn library.**"
],
"metadata": {
"id": "xbLLa_mDmpV2",
"pycharm": {
"name": "#%% md\n"
}
}
},
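{
"cell_type": "markdown",
"source": [
"*A baseline sketch using default hyper-parameters, assuming the training/test split from above:*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Gradient boosting with default hyper-parameters.\n",
"gb = GradientBoostingClassifier(random_state=0)\n",
"gb.fit(X_train, y_train)\n",
"\n",
"print('training AUROC:', roc_auc_score(y_train, gb.predict_proba(X_train)[:, 1]))\n",
"print('test AUROC:', roc_auc_score(y_test, gb.predict_proba(X_test)[:, 1]))"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
},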
{
"cell_type": "markdown",
"source": [
"**8) For a decision tree classifier, a bagging classifier, a random forest classifier and a boosting classifier, perform a random search on a predifined grid of hyper-parameters. Amongst all models and hyper-parameters combinations, select the best model and report the best *validation* AUROC. The random search should be performed on the *training* data, and you can set the number of combinations to try per model to 5.**\n",
"\n",
"**For the best model found, report the training and test AUROC's, and display the training and test ROC curves**"
],
"metadata": {
"id": "7iMGufJ8XlEV",
"pycharm": {
"name": "#%% md\n"
}
}
},
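{
"cell_type": "markdown",
"source": [
"*A possible sketch; the per-model grids below are illustrative assumptions:*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Random search per model family over illustrative grids (5 draws each).\n",
"candidates = {\n",
"    'tree': (DecisionTreeClassifier(random_state=0),\n",
"             {'max_depth': [2, 3, 5, 10, None], 'criterion': ['gini', 'entropy']}),\n",
"    'bagging': (BaggingClassifier(random_state=0),\n",
"                {'n_estimators': [10, 30, 100], 'max_samples': [0.5, 0.8, 1.0]}),\n",
"    'random forest': (RandomForestClassifier(random_state=0),\n",
"                      {'n_estimators': [50, 100, 500], 'max_depth': [3, 5, None]}),\n",
"    'boosting': (GradientBoostingClassifier(random_state=0),\n",
"                 {'n_estimators': [50, 100, 500], 'learning_rate': [0.01, 0.1, 0.3], 'max_depth': [1, 2, 3]}),\n",
"}\n",
"\n",
"best_name, best_search = None, None\n",
"for name, (estimator, grid) in candidates.items():\n",
"    search = RandomizedSearchCV(estimator, grid, n_iter=5, scoring='roc_auc', cv=10, random_state=0)\n",
"    search.fit(X_train, y_train)\n",
"    print(name, '- validation AUROC:', search.best_score_)\n",
"    if best_search is None or search.best_score_ > best_search.best_score_:\n",
"        best_name, best_search = name, search\n",
"\n",
"print('best model:', best_name, best_search.best_params_)\n",
"\n",
"# Training and test AUROCs and ROC curves for the best model.\n",
"for split, (X_s, y_s) in {'train': (X_train, y_train), 'test': (X_test, y_test)}.items():\n",
"    proba = best_search.predict_proba(X_s)[:, 1]\n",
"    print(split, 'AUROC:', roc_auc_score(y_s, proba))\n",
"    fpr, tpr, _ = roc_curve(y_s, proba)\n",
"    plt.plot(fpr, tpr, label=split)\n",
"plt.plot([0, 1], [0, 1], linestyle='--', color='grey')\n",
"plt.xlabel('false positive rate')\n",
"plt.ylabel('true positive rate')\n",
"plt.legend()\n",
"plt.show()"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
},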
{
"cell_type": "markdown",
"source": [
"# Regression "
],
"metadata": {
"id": "jEvdTtjsrCEf",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**9) Select 'bmi' as the target variable and all the remaining columns in the dataframe as the predictors. Split your dataset into a training and test set, fit a decision tree regressor to the training data, and report the MSE on the training and test sets. What do you observe ?**"
],
"metadata": {
"id": "--wfsvsUYo3C",
"pycharm": {
"name": "#%% md\n"
}
}
},
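{
"cell_type": "markdown",
"source": [
"*A possible sketch; the 'Xr'/'yr' names are introduced here for the regression variables:*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# 'bmi' is now the target; everything else is a predictor.\n",
"pred_reg = [c for c in df.columns if c != 'bmi']\n",
"Xr = df[pred_reg]\n",
"yr = df['bmi']\n",
"\n",
"Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.2, random_state=0)\n",
"\n",
"# Unconstrained depth: the tree can grow until every leaf is pure.\n",
"reg = DecisionTreeRegressor(random_state=0)\n",
"reg.fit(Xr_train, yr_train)\n",
"\n",
"print('training MSE:', mean_squared_error(yr_train, reg.predict(Xr_train)))\n",
"print('test MSE:', mean_squared_error(yr_test, reg.predict(Xr_test)))"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
},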
{
"cell_type": "markdown",
"source": [
"As we did not impose any restrictions on the depth that the tree could reach, it grew to the point that each leaf contains a single training observation. The value predicted for a leaf being the mean of the observations contained in it, and as each leaf contains a single observation, the model exactly predicts all training points, and the MSE is null. The behaviour obviously does not generalize well, as shown by the test MSE. "
],
"metadata": {
"id": "wA09_yPrZg8g",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**10) For a decision tree regressor, a bagging regressor, a random forest regressor and a boosting regressor, perform a random search on a predifined grid of hyper-parameters. Amongst all models and hyper-parameters combinations, select the best model and report the best *validation* MSE. The random search should be performed on the *training* data, and you can set the number of combinations to try per model to 5.**\n",
"\n",
"**For the best model found, report the training and test MSE.**"
],
"metadata": {
"id": "dMUd7Kyqa6S4",
"pycharm": {
"name": "#%% md\n"
}
}
}
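,
{
"cell_type": "markdown",
"source": [
"*A possible sketch, mirroring the classification version above with MSE-based scoring (via sklearn's 'neg_mean_squared_error'); the grids are illustrative and the 'Xr_train'/'yr_train' split comes from the sketch above:*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Random search per regressor family over illustrative grids (5 draws each).\n",
"reg_candidates = {\n",
"    'tree': (DecisionTreeRegressor(random_state=0),\n",
"             {'max_depth': [2, 3, 5, 10, None], 'min_samples_leaf': [1, 5, 10]}),\n",
"    'bagging': (BaggingRegressor(random_state=0),\n",
"                {'n_estimators': [10, 30, 100], 'max_samples': [0.5, 0.8, 1.0]}),\n",
"    'random forest': (RandomForestRegressor(random_state=0),\n",
"                      {'n_estimators': [50, 100, 500], 'max_depth': [3, 5, None]}),\n",
"    'boosting': (GradientBoostingRegressor(random_state=0),\n",
"                 {'n_estimators': [50, 100, 500], 'learning_rate': [0.01, 0.1, 0.3]}),\n",
"}\n",
"\n",
"best_name, best_search = None, None\n",
"for name, (estimator, grid) in reg_candidates.items():\n",
"    search = RandomizedSearchCV(estimator, grid, n_iter=5,\n",
"                                scoring='neg_mean_squared_error', cv=10, random_state=0)\n",
"    search.fit(Xr_train, yr_train)\n",
"    # sklearn maximizes the score, so the MSE is the negated best score.\n",
"    print(name, '- validation MSE:', -search.best_score_)\n",
"    if best_search is None or search.best_score_ > best_search.best_score_:\n",
"        best_name, best_search = name, search\n",
"\n",
"print('best model:', best_name, best_search.best_params_)\n",
"print('training MSE:', mean_squared_error(yr_train, best_search.predict(Xr_train)))\n",
"print('test MSE:', mean_squared_error(yr_test, best_search.predict(Xr_test)))"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
}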
]
}