{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Lab11_exercises.ipynb",
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Tree-based methods \n",
"\n",
"In this lab, we'll consider the Pima Indians dataset, which primarly objective is to diagnostically predict whether female patients suffer from diabetes or not, based on a series of medical attributes. To this end, we'll be using several tree-based methods that were covered during the course lectures. \n",
"\n",
"Here are some info regarding the dataset's attributes :\n",
" - Num_pregnant : The number of pregnancies the patient had. \n",
" - glucose_con : Patient's plasma glucsose concentration.\n",
" - blood_pressure : Patient's dialostic blood pressure (mmHg).\n",
" - triceps_thickness : Patient's triceps skin-fold thickness (mm).\n",
" - insulin : Patient's 2-h serum insulin (mu U/mL).\n",
" - bmi : Patient's body mass index (kg/m^2).\n",
" - dpf : Patient's diabetes pedigree function.\n",
" - age : Patient's age. \n",
" - diabetes : Whether the patient has diabetes (1) or not (0)."
],
"metadata": {
"id": "vP2BHEtuQ26i",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**Load the necessary libraries** "
],
"metadata": {
"id": "LWWFCoLHEeA4",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"id": "Pa-8tr1GBLlE",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import numpy as np \n",
"import pandas as pd \n",
"from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, roc_curve, roc_auc_score, mean_squared_error\n",
"from sklearn.model_selection import train_test_split, cross_validate, RandomizedSearchCV\n",
"from sklearn.impute import SimpleImputer\n",
"import matplotlib.pyplot as plt\n",
"import math\n",
"from sklearn.pipeline import Pipeline \n",
"from sklearn.tree import DecisionTreeClassifier, plot_tree, DecisionTreeRegressor\n",
"from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, GradientBoostingRegressor, BaggingClassifier, BaggingRegressor"
]
},
{
"cell_type": "markdown",
"source": [
"**1) Load the dataset, get its general information, and check for missing values.**"
],
"metadata": {
"id": "dHo7LMdIEhmQ",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 41,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Num_pregnant glucose_con blood_pressure triceps_thickness insulin \\\n",
"0 6.0 148.0 72.0 35.0 0.0 \n",
"1 1.0 85.0 66.0 29.0 0.0 \n",
"2 8.0 183.0 64.0 0.0 0.0 \n",
"3 1.0 89.0 66.0 23.0 94.0 \n",
"4 0.0 137.0 40.0 35.0 168.0 \n",
"\n",
" bmi dpf age diabetes \n",
"0 33.6 0.627 50.0 1 \n",
"1 26.6 0.351 31.0 0 \n",
"2 23.3 0.672 32.0 1 \n",
"3 28.1 0.167 21.0 0 \n",
"4 43.1 2.288 33.0 1 \n",
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 768 entries, 0 to 767\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Num_pregnant 750 non-null float64\n",
" 1 glucose_con 755 non-null float64\n",
" 2 blood_pressure 761 non-null float64\n",
" 3 triceps_thickness 757 non-null float64\n",
" 4 insulin 768 non-null float64\n",
" 5 bmi 768 non-null float64\n",
" 6 dpf 767 non-null float64\n",
" 7 age 761 non-null float64\n",
" 8 diabetes 768 non-null int64 \n",
"dtypes: float64(8), int64(1)\n",
"memory usage: 60.0 KB\n",
"None\n",
"Num_pregnant 18\n",
"glucose_con 13\n",
"blood_pressure 7\n",
"triceps_thickness 11\n",
"insulin 0\n",
"bmi 0\n",
"dpf 1\n",
"age 7\n",
"diabetes 0\n",
"dtype: int64\n"
]
}
],
"source": [
"file = '../data/pima_indians_lab.csv'\n",
"\n",
"##Read dataframe##\n",
"\n",
"df = pd.read_csv(file,index_col=0)\n",
"#df= df.astype({'quality':'category'})\n",
"\n",
"\n",
"print(df.head())\n",
"print(df.info())\n",
"print(df.isna().sum())"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"# Decision Trees"
],
"metadata": {
"id": "Mje6W8ucZYvi",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**2) Select 'diabetes' as the target variable, and all the remaining columns as predictors. Create a pipeline containing the preprocessing steps (missing values imputer, scaler,...) and a Decision Tree classifier, which maximum depth should be set to 3 (through the 'max_depth' argument). Set the entropy as the split criterion. Do you think scaling the variables is necessary ?**\n",
"\n",
"**Fit this pipeline to the data (do not split the dataset for the time being), and plot the decision tree. How do you interpret it ?** \n",
"\n",
"**You'll need the 'plot_tree' class from the sklearn library. You can access the pipeline's classifier using the 'named_steps['classifier']' attributes.**"
],
"metadata": {
"id": "qKdTxEtuE47K",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 41,
"outputs": [],
"source": [
"imp_cont = SimpleImputer(missing_values=np.nan, strategy='mean')\n",
"\n",
"cont_columns = df.select_dtypes(exclude=['category', 'int']).columns\n",
"\n",
"df[cont_columns] = imp_cont.fit_transform(df[cont_columns])\n",
"\n",
"print(df.isna().sum())"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": 44,
"outputs": [
{
"ename": "ValueError",
"evalue": "Input contains NaN, infinity or a value too large for dtype('float64').",
"output_type": "error",
"traceback": [
"\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
"\u001B[0;31mValueError\u001B[0m Traceback (most recent call last)",
"Input \u001B[0;32mIn [44]\u001B[0m, in \u001B[0;36m<module>\u001B[0;34m\u001B[0m\n\u001B[1;32m 7\u001B[0m classifier \u001B[38;5;241m=\u001B[39m DecisionTreeClassifier(max_depth\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m3\u001B[39m,criterion\u001B[38;5;241m=\u001B[39m\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mentropy\u001B[39m\u001B[38;5;124m\"\u001B[39m)\n\u001B[1;32m 8\u001B[0m model \u001B[38;5;241m=\u001B[39m Pipeline(steps\u001B[38;5;241m=\u001B[39m[(\u001B[38;5;124m'\u001B[39m\u001B[38;5;124mpreprocessor\u001B[39m\u001B[38;5;124m'\u001B[39m,imp_cont),(\u001B[38;5;124m'\u001B[39m\u001B[38;5;124mclassifier\u001B[39m\u001B[38;5;124m'\u001B[39m,classifier)])\n\u001B[0;32m----> 9\u001B[0m \u001B[43mmodel\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfit\u001B[49m\u001B[43m(\u001B[49m\u001B[43mX\u001B[49m\u001B[43m,\u001B[49m\u001B[43mY\u001B[49m\u001B[43m)\u001B[49m\n",
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/sklearn/pipeline.py:394\u001B[0m, in \u001B[0;36mPipeline.fit\u001B[0;34m(self, X, y, **fit_params)\u001B[0m\n\u001B[1;32m 392\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_final_estimator \u001B[38;5;241m!=\u001B[39m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mpassthrough\u001B[39m\u001B[38;5;124m\"\u001B[39m:\n\u001B[1;32m 393\u001B[0m fit_params_last_step \u001B[38;5;241m=\u001B[39m fit_params_steps[\u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39msteps[\u001B[38;5;241m-\u001B[39m\u001B[38;5;241m1\u001B[39m][\u001B[38;5;241m0\u001B[39m]]\n\u001B[0;32m--> 394\u001B[0m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_final_estimator\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfit\u001B[49m\u001B[43m(\u001B[49m\u001B[43mXt\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43my\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mfit_params_last_step\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 396\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\n",
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/sklearn/tree/_classes.py:937\u001B[0m, in \u001B[0;36mDecisionTreeClassifier.fit\u001B[0;34m(self, X, y, sample_weight, check_input, X_idx_sorted)\u001B[0m\n\u001B[1;32m 899\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mfit\u001B[39m(\n\u001B[1;32m 900\u001B[0m \u001B[38;5;28mself\u001B[39m, X, y, sample_weight\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mNone\u001B[39;00m, check_input\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mTrue\u001B[39;00m, X_idx_sorted\u001B[38;5;241m=\u001B[39m\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mdeprecated\u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[1;32m 901\u001B[0m ):\n\u001B[1;32m 902\u001B[0m \u001B[38;5;124;03m\"\"\"Build a decision tree classifier from the training set (X, y).\u001B[39;00m\n\u001B[1;32m 903\u001B[0m \n\u001B[1;32m 904\u001B[0m \u001B[38;5;124;03m Parameters\u001B[39;00m\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 934\u001B[0m \u001B[38;5;124;03m Fitted estimator.\u001B[39;00m\n\u001B[1;32m 935\u001B[0m \u001B[38;5;124;03m \"\"\"\u001B[39;00m\n\u001B[0;32m--> 937\u001B[0m \u001B[38;5;28;43msuper\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfit\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 938\u001B[0m \u001B[43m \u001B[49m\u001B[43mX\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 939\u001B[0m \u001B[43m \u001B[49m\u001B[43my\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 940\u001B[0m \u001B[43m \u001B[49m\u001B[43msample_weight\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43msample_weight\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 941\u001B[0m \u001B[43m \u001B[49m\u001B[43mcheck_input\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mcheck_input\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 942\u001B[0m \u001B[43m \u001B[49m\u001B[43mX_idx_sorted\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mX_idx_sorted\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 943\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 944\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\n",
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/sklearn/tree/_classes.py:165\u001B[0m, in \u001B[0;36mBaseDecisionTree.fit\u001B[0;34m(self, X, y, sample_weight, check_input, X_idx_sorted)\u001B[0m\n\u001B[1;32m 163\u001B[0m check_X_params \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mdict\u001B[39m(dtype\u001B[38;5;241m=\u001B[39mDTYPE, accept_sparse\u001B[38;5;241m=\u001B[39m\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mcsc\u001B[39m\u001B[38;5;124m\"\u001B[39m)\n\u001B[1;32m 164\u001B[0m check_y_params \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mdict\u001B[39m(ensure_2d\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mFalse\u001B[39;00m, dtype\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mNone\u001B[39;00m)\n\u001B[0;32m--> 165\u001B[0m X, y \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_validate_data\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 166\u001B[0m \u001B[43m \u001B[49m\u001B[43mX\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43my\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mvalidate_separately\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43mcheck_X_params\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mcheck_y_params\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 167\u001B[0m \u001B[43m\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 168\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m issparse(X):\n\u001B[1;32m 169\u001B[0m X\u001B[38;5;241m.\u001B[39msort_indices()\n",
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/sklearn/base.py:579\u001B[0m, in \u001B[0;36mBaseEstimator._validate_data\u001B[0;34m(self, X, y, reset, validate_separately, **check_params)\u001B[0m\n\u001B[1;32m 577\u001B[0m check_X_params, check_y_params \u001B[38;5;241m=\u001B[39m validate_separately\n\u001B[1;32m 578\u001B[0m X \u001B[38;5;241m=\u001B[39m check_array(X, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mcheck_X_params)\n\u001B[0;32m--> 579\u001B[0m y \u001B[38;5;241m=\u001B[39m \u001B[43mcheck_array\u001B[49m\u001B[43m(\u001B[49m\u001B[43my\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mcheck_y_params\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 580\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m 581\u001B[0m X, y \u001B[38;5;241m=\u001B[39m check_X_y(X, y, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mcheck_params)\n",
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/sklearn/utils/validation.py:800\u001B[0m, in \u001B[0;36mcheck_array\u001B[0;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)\u001B[0m\n\u001B[1;32m 794\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mValueError\u001B[39;00m(\n\u001B[1;32m 795\u001B[0m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mFound array with dim \u001B[39m\u001B[38;5;132;01m%d\u001B[39;00m\u001B[38;5;124m. \u001B[39m\u001B[38;5;132;01m%s\u001B[39;00m\u001B[38;5;124m expected <= 2.\u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[1;32m 796\u001B[0m \u001B[38;5;241m%\u001B[39m (array\u001B[38;5;241m.\u001B[39mndim, estimator_name)\n\u001B[1;32m 797\u001B[0m )\n\u001B[1;32m 799\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m force_all_finite:\n\u001B[0;32m--> 800\u001B[0m \u001B[43m_assert_all_finite\u001B[49m\u001B[43m(\u001B[49m\u001B[43marray\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mallow_nan\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mforce_all_finite\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m==\u001B[39;49m\u001B[43m \u001B[49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[38;5;124;43mallow-nan\u001B[39;49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[43m)\u001B[49m\n\u001B[1;32m 802\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m ensure_min_samples \u001B[38;5;241m>\u001B[39m \u001B[38;5;241m0\u001B[39m:\n\u001B[1;32m 803\u001B[0m n_samples \u001B[38;5;241m=\u001B[39m _num_samples(array)\n",
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/sklearn/utils/validation.py:114\u001B[0m, in \u001B[0;36m_assert_all_finite\u001B[0;34m(X, allow_nan, msg_dtype)\u001B[0m\n\u001B[1;32m 107\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m (\n\u001B[1;32m 108\u001B[0m allow_nan\n\u001B[1;32m 109\u001B[0m \u001B[38;5;129;01mand\u001B[39;00m np\u001B[38;5;241m.\u001B[39misinf(X)\u001B[38;5;241m.\u001B[39many()\n\u001B[1;32m 110\u001B[0m \u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m allow_nan\n\u001B[1;32m 111\u001B[0m \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m np\u001B[38;5;241m.\u001B[39misfinite(X)\u001B[38;5;241m.\u001B[39mall()\n\u001B[1;32m 112\u001B[0m ):\n\u001B[1;32m 113\u001B[0m type_err \u001B[38;5;241m=\u001B[39m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124minfinity\u001B[39m\u001B[38;5;124m\"\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m allow_nan \u001B[38;5;28;01melse\u001B[39;00m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mNaN, infinity\u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[0;32m--> 114\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mValueError\u001B[39;00m(\n\u001B[1;32m 115\u001B[0m msg_err\u001B[38;5;241m.\u001B[39mformat(\n\u001B[1;32m 116\u001B[0m type_err, msg_dtype \u001B[38;5;28;01mif\u001B[39;00m msg_dtype \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m \u001B[38;5;28;01melse\u001B[39;00m X\u001B[38;5;241m.\u001B[39mdtype\n\u001B[1;32m 117\u001B[0m )\n\u001B[1;32m 118\u001B[0m )\n\u001B[1;32m 119\u001B[0m \u001B[38;5;66;03m# for object dtype data, we only check for NaNs (GH-13254)\u001B[39;00m\n\u001B[1;32m 120\u001B[0m \u001B[38;5;28;01melif\u001B[39;00m X\u001B[38;5;241m.\u001B[39mdtype \u001B[38;5;241m==\u001B[39m np\u001B[38;5;241m.\u001B[39mdtype(\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mobject\u001B[39m\u001B[38;5;124m\"\u001B[39m) \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m allow_nan:\n",
"\u001B[0;31mValueError\u001B[0m: Input contains NaN, infinity or a value too large for dtype('float64')."
]
}
],
"source": [
"pred = list(df.columns)\n",
"pred.remove('diabetes')\n",
"X = df[['diabetes']]\n",
"Y = df[pred]\n",
"\n",
"imp_cont = SimpleImputer(missing_values=np.nan, strategy='mean')\n",
"classifier = DecisionTreeClassifier(max_depth=3,criterion=\"entropy\")\n",
"model = Pipeline(steps=[('preprocessor',imp_cont),('classifier',classifier)])\n",
"model.fit(X,Y)\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**3) Let's see how the model's performance evolve as a function of the tree's maximum depth.**\n",
"\n",
"**To this end, split the dataset into a training and a test set following a 0.8/0.2 partition. Then, for maximum depths varying from 1 to 20, fit a Decision Tree classifier to the *training* data using a 10 folds cross-validation with the AUROC as metric. Plot the the means of the training and validation AUROCS across each folds as a function of the maximum depth. Also, compute the standard error of the means at each depth, and add it to the plot as shaded grey area around the means (cfr. plot below) . What can you conclude regarding the model's performance, as well as the uncertainty for the in-sample and out-of-sample AUROC estimates ?**\n",
"\n",
"**Identify which depth would lead a priori to the best model's out-of-sample performance. Using this depth, fit a decision tree to the training split and report the training AUROC and the test AUROC.** "
],
"metadata": {
"id": "hHndFhA_J4QK",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"source": [],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 651
},
"id": "wzjb32Tqm6ij",
"outputId": "cc477bf5-fb6a-4ed2-b208-283e1bc2b605",
"pycharm": {
"name": "#%%\n"
}
},
"execution_count": null,
"outputs": []
},
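{
"cell_type": "markdown",
"source": [
"*A possible sketch for this exercise (one of many valid approaches), assuming the X and y variables defined above; random seeds are arbitrary. We rely on sklearn's 'cross_validate' with 'return_train_score=True', and estimate the standard error of the mean as the fold standard deviation divided by the square root of the number of folds.*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Split the data: 80% training, 20% test.\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)\n",
"\n",
"depths = list(range(1, 21))\n",
"train_means, train_sems, val_means, val_sems = [], [], [], []\n",
"\n",
"for depth in depths:\n",
"    tree = DecisionTreeClassifier(max_depth=depth, criterion='entropy', random_state=0)\n",
"    cv = cross_validate(tree, X_train, y_train, cv=10, scoring='roc_auc', return_train_score=True)\n",
"    # Mean and standard error of the mean across the 10 folds.\n",
"    train_means.append(cv['train_score'].mean())\n",
"    train_sems.append(cv['train_score'].std() / math.sqrt(10))\n",
"    val_means.append(cv['test_score'].mean())\n",
"    val_sems.append(cv['test_score'].std() / math.sqrt(10))\n",
"\n",
"train_means, train_sems = np.array(train_means), np.array(train_sems)\n",
"val_means, val_sems = np.array(val_means), np.array(val_sems)\n",
"\n",
"plt.plot(depths, train_means, label='training AUROC')\n",
"plt.plot(depths, val_means, label='validation AUROC')\n",
"plt.fill_between(depths, train_means - train_sems, train_means + train_sems, color='grey', alpha=0.3)\n",
"plt.fill_between(depths, val_means - val_sems, val_means + val_sems, color='grey', alpha=0.3)\n",
"plt.xlabel('maximum depth')\n",
"plt.ylabel('AUROC')\n",
"plt.legend()\n",
"plt.show()\n",
"\n",
"# Refit at the depth with the best mean validation AUROC.\n",
"best_depth = depths[int(np.argmax(val_means))]\n",
"best_tree = DecisionTreeClassifier(max_depth=best_depth, criterion='entropy', random_state=0).fit(X_train, y_train)\n",
"print('best depth:', best_depth)\n",
"print('training AUROC:', roc_auc_score(y_train, best_tree.predict_proba(X_train)[:, 1]))\n",
"print('test AUROC:', roc_auc_score(y_test, best_tree.predict_proba(X_test)[:, 1]))"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
},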
{
"cell_type": "markdown",
"source": [
"Clearly, the decision tree starts overfitting very early as the maximum depth increases, shown by the increasing gap between the training and validation curves. \n",
"\n",
"As observed, the standard error on the training AUROC is much less than the standard error on the validation AUROC, wathever the depth. This is expected as the training AUROC in each fold is computed using 10 times as much data points as the validation AUROC, resulting in a lower variance in the training AUROC across each folds, and hence, in a lower standard error compared to the validation AUROC's standard error. Furthermore, as the depth increases, the model begins to interpolate the training data in each fold, and correctly predicts each training point with probability 1. As the AUROC is equal to 1 in each training fold, the standard error is equal to 0, as observed starting from approximately depth 13.\n",
"\n",
"On the other hand, the standard error of the validation AUROC remains approximately constant throughout the different depths. "
],
"metadata": {
"id": "u4-kO2sKOSIg",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"# Bagging "
],
"metadata": {
"id": "tdvKtPEa7KRe",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**4) Implement your own bagging algorithm by fitting a decision tree to each bootstrap sample. The bootstramp samples should be drawn from and of equal size as the training set. Set the the number of bootstrap samples to 30 and the maximum_depth of each decision as the optimum depth found above.**\n",
"\n",
"**Then using the decision trees fitted on each bootsrapped training data, predict on the test set and use the majority vote strategy to get the final predictions. Redo the same by averaging the trees probabilities. Display the confusion matrix for the test set.**"
],
"metadata": {
"id": "EwkZbxViZ_Y0",
"pycharm": {
"name": "#%% md\n"
}
}
},
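{
"cell_type": "markdown",
"source": [
"*A minimal bagging sketch, assuming the 'X_train'/'X_test'/'y_train'/'y_test' split and the 'best_depth' value from the sketch above:*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Manual bagging: 30 bootstrap samples, each the size of the training set.\n",
"rng = np.random.default_rng(0)\n",
"n_trees, n_train = 30, len(X_train)\n",
"trees = []\n",
"\n",
"for _ in range(n_trees):\n",
"    # Draw a bootstrap sample (with replacement) from the training set.\n",
"    idx = rng.integers(0, n_train, size=n_train)\n",
"    tree = DecisionTreeClassifier(max_depth=best_depth, criterion='entropy')\n",
"    tree.fit(X_train.iloc[idx], y_train.iloc[idx])\n",
"    trees.append(tree)\n",
"\n",
"# Majority vote over the 30 trees (0/1 labels, so the mean vote >= 0.5 wins).\n",
"votes = np.stack([tree.predict(X_test) for tree in trees])\n",
"y_pred_vote = (votes.mean(axis=0) >= 0.5).astype(int)\n",
"\n",
"# Same, but averaging the predicted probabilities instead.\n",
"probas = np.stack([tree.predict_proba(X_test)[:, 1] for tree in trees])\n",
"y_pred_proba = (probas.mean(axis=0) >= 0.5).astype(int)\n",
"\n",
"print('majority vote accuracy:', accuracy_score(y_test, y_pred_vote))\n",
"print('averaged probabilities accuracy:', accuracy_score(y_test, y_pred_proba))\n",
"\n",
"# Confusion matrix on the test set (majority-vote predictions).\n",
"ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_vote)).plot()\n",
"plt.show()"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
},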
{
"cell_type": "markdown",
"source": [
"# Random Forest"
],
"metadata": {
"id": "WlMQavsWhGO6",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**5) Perform a random search on a specified grid of hyper-parameters to find the best hyper-parameters configuration for a RandomForestClassifier. Set the scoring function as the AUROC and limit the number of combination to try to 10.**\n",
"\n",
"**Fit the best model found in the previous procedure to the training data, and predict on the test set. Report the test AUROC and display the ROC curve.**"
],
"metadata": {
"id": "b1_648UmlJrd",
"pycharm": {
"name": "#%% md\n"
}
}
},
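{
"cell_type": "markdown",
"source": [
"*A possible sketch; the hyper-parameter grid below is purely illustrative and can be adapted:*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Random search over an illustrative hyper-parameter grid.\n",
"param_grid = {\n",
"    'n_estimators': [50, 100, 200, 500],\n",
"    'max_depth': [2, 3, 5, 10, None],\n",
"    'max_features': ['sqrt', 'log2', None],\n",
"    'min_samples_leaf': [1, 2, 5, 10],\n",
"}\n",
"\n",
"search = RandomizedSearchCV(\n",
"    RandomForestClassifier(random_state=0),\n",
"    param_distributions=param_grid,\n",
"    n_iter=10, scoring='roc_auc', cv=10, random_state=0,\n",
")\n",
"search.fit(X_train, y_train)\n",
"print('best parameters:', search.best_params_)\n",
"print('best validation AUROC:', search.best_score_)\n",
"\n",
"# The best model is refit on the training data (refit=True by default);\n",
"# predict on the test set.\n",
"proba_test = search.predict_proba(X_test)[:, 1]\n",
"print('test AUROC:', roc_auc_score(y_test, proba_test))\n",
"\n",
"# ROC curve on the test set.\n",
"fpr, tpr, _ = roc_curve(y_test, proba_test)\n",
"plt.plot(fpr, tpr)\n",
"plt.plot([0, 1], [0, 1], linestyle='--', color='grey')\n",
"plt.xlabel('false positive rate')\n",
"plt.ylabel('true positive rate')\n",
"plt.show()"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
},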
{
"cell_type": "markdown",
"source": [
"# Boosting"
],
"metadata": {
"id": "qpJYcRiqfa0R",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**7) Fit a boosting classifier to the training data, and report the training and test AUROC's. You can use the GradientBoostingClassifier from the sklearn library.**"
],
"metadata": {
"id": "xbLLa_mDmpV2",
"pycharm": {
"name": "#%% md\n"
}
}
},
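{
"cell_type": "markdown",
"source": [
"*A baseline sketch using default hyper-parameters, assuming the training/test split from above:*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Gradient boosting with default hyper-parameters.\n",
"gb = GradientBoostingClassifier(random_state=0)\n",
"gb.fit(X_train, y_train)\n",
"\n",
"print('training AUROC:', roc_auc_score(y_train, gb.predict_proba(X_train)[:, 1]))\n",
"print('test AUROC:', roc_auc_score(y_test, gb.predict_proba(X_test)[:, 1]))"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
},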
{
"cell_type": "markdown",
"source": [
"**8) For a decision tree classifier, a bagging classifier, a random forest classifier and a boosting classifier, perform a random search on a predifined grid of hyper-parameters. Amongst all models and hyper-parameters combinations, select the best model and report the best *validation* AUROC. The random search should be performed on the *training* data, and you can set the number of combinations to try per model to 5.**\n",
"\n",
"**For the best model found, report the training and test AUROC's, and display the training and test ROC curves**"
],
"metadata": {
"id": "7iMGufJ8XlEV",
"pycharm": {
"name": "#%% md\n"
}
}
},
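{
"cell_type": "markdown",
"source": [
"*A possible sketch; the per-model grids below are illustrative assumptions:*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Random search per model family over illustrative grids (5 draws each).\n",
"candidates = {\n",
"    'tree': (DecisionTreeClassifier(random_state=0),\n",
"             {'max_depth': [2, 3, 5, 10, None], 'criterion': ['gini', 'entropy']}),\n",
"    'bagging': (BaggingClassifier(random_state=0),\n",
"                {'n_estimators': [10, 30, 100], 'max_samples': [0.5, 0.8, 1.0]}),\n",
"    'random forest': (RandomForestClassifier(random_state=0),\n",
"                      {'n_estimators': [50, 100, 500], 'max_depth': [3, 5, None]}),\n",
"    'boosting': (GradientBoostingClassifier(random_state=0),\n",
"                 {'n_estimators': [50, 100, 500], 'learning_rate': [0.01, 0.1, 0.3], 'max_depth': [1, 2, 3]}),\n",
"}\n",
"\n",
"best_name, best_search = None, None\n",
"for name, (estimator, grid) in candidates.items():\n",
"    search = RandomizedSearchCV(estimator, grid, n_iter=5, scoring='roc_auc', cv=10, random_state=0)\n",
"    search.fit(X_train, y_train)\n",
"    print(name, '- validation AUROC:', search.best_score_)\n",
"    if best_search is None or search.best_score_ > best_search.best_score_:\n",
"        best_name, best_search = name, search\n",
"\n",
"print('best model:', best_name, best_search.best_params_)\n",
"\n",
"# Training and test AUROCs and ROC curves for the best model.\n",
"for split, (X_s, y_s) in {'train': (X_train, y_train), 'test': (X_test, y_test)}.items():\n",
"    proba = best_search.predict_proba(X_s)[:, 1]\n",
"    print(split, 'AUROC:', roc_auc_score(y_s, proba))\n",
"    fpr, tpr, _ = roc_curve(y_s, proba)\n",
"    plt.plot(fpr, tpr, label=split)\n",
"plt.plot([0, 1], [0, 1], linestyle='--', color='grey')\n",
"plt.xlabel('false positive rate')\n",
"plt.ylabel('true positive rate')\n",
"plt.legend()\n",
"plt.show()"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
},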
{
"cell_type": "markdown",
"source": [
"# Regression "
],
"metadata": {
"id": "jEvdTtjsrCEf",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**9) Select 'bmi' as the target variable and all the remaining columns in the dataframe as the predictors. Split your dataset into a training and test set, fit a decision tree regressor to the training data, and report the MSE on the training and test sets. What do you observe ?**"
],
"metadata": {
"id": "--wfsvsUYo3C",
"pycharm": {
"name": "#%% md\n"
}
}
},
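{
"cell_type": "markdown",
"source": [
"*A possible sketch; the 'Xr'/'yr' names are introduced here for the regression variables:*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# 'bmi' is now the target; everything else is a predictor.\n",
"pred_reg = [c for c in df.columns if c != 'bmi']\n",
"Xr = df[pred_reg]\n",
"yr = df['bmi']\n",
"\n",
"Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.2, random_state=0)\n",
"\n",
"# Unconstrained depth: the tree can grow until every leaf is pure.\n",
"reg = DecisionTreeRegressor(random_state=0)\n",
"reg.fit(Xr_train, yr_train)\n",
"\n",
"print('training MSE:', mean_squared_error(yr_train, reg.predict(Xr_train)))\n",
"print('test MSE:', mean_squared_error(yr_test, reg.predict(Xr_test)))"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
},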
{
"cell_type": "markdown",
"source": [
"As we did not impose any restrictions on the depth that the tree could reach, it grew to the point that each leaf contains a single training observation. The value predicted for a leaf being the mean of the observations contained in it, and as each leaf contains a single observation, the model exactly predicts all training points, and the MSE is null. The behaviour obviously does not generalize well, as shown by the test MSE. "
],
"metadata": {
"id": "wA09_yPrZg8g",
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**10) For a decision tree regressor, a bagging regressor, a random forest regressor and a boosting regressor, perform a random search on a predifined grid of hyper-parameters. Amongst all models and hyper-parameters combinations, select the best model and report the best *validation* MSE. The random search should be performed on the *training* data, and you can set the number of combinations to try per model to 5.**\n",
"\n",
"**For the best model found, report the training and test MSE.**"
],
"metadata": {
"id": "dMUd7Kyqa6S4",
"pycharm": {
"name": "#%% md\n"
}
}
}
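,
{
"cell_type": "markdown",
"source": [
"*A possible sketch, mirroring the classification version above with MSE-based scoring (via sklearn's 'neg_mean_squared_error'); the grids are illustrative and the 'Xr_train'/'yr_train' split comes from the sketch above:*"
],
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Random search per regressor family over illustrative grids (5 draws each).\n",
"reg_candidates = {\n",
"    'tree': (DecisionTreeRegressor(random_state=0),\n",
"             {'max_depth': [2, 3, 5, 10, None], 'min_samples_leaf': [1, 5, 10]}),\n",
"    'bagging': (BaggingRegressor(random_state=0),\n",
"                {'n_estimators': [10, 30, 100], 'max_samples': [0.5, 0.8, 1.0]}),\n",
"    'random forest': (RandomForestRegressor(random_state=0),\n",
"                      {'n_estimators': [50, 100, 500], 'max_depth': [3, 5, None]}),\n",
"    'boosting': (GradientBoostingRegressor(random_state=0),\n",
"                 {'n_estimators': [50, 100, 500], 'learning_rate': [0.01, 0.1, 0.3]}),\n",
"}\n",
"\n",
"best_name, best_search = None, None\n",
"for name, (estimator, grid) in reg_candidates.items():\n",
"    search = RandomizedSearchCV(estimator, grid, n_iter=5,\n",
"                                scoring='neg_mean_squared_error', cv=10, random_state=0)\n",
"    search.fit(Xr_train, yr_train)\n",
"    # sklearn maximizes the score, so the MSE is the negated best score.\n",
"    print(name, '- validation MSE:', -search.best_score_)\n",
"    if best_search is None or search.best_score_ > best_search.best_score_:\n",
"        best_name, best_search = name, search\n",
"\n",
"print('best model:', best_name, best_search.best_params_)\n",
"print('training MSE:', mean_squared_error(yr_train, best_search.predict(Xr_train)))\n",
"print('test MSE:', mean_squared_error(yr_test, best_search.predict(Xr_test)))"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
}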
]
}