TP-ML/ex/Lab7_exercises.ipynb

{
 "nbformat": 4,
 "nbformat_minor": 0,
 "metadata": {
  "colab": {
   "name": "Lab7_exercises",
   "provenance": []
  },
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "**In this lab about linear regression, we'll be working with the library [StatsModel](https://www.statsmodels.org/stable/index.html), which provides numerous classes and functions for the estimation of statistical models.**\n",
    "\n",
    "**The dataset that we'll be considering is 'diamond.csv' [[1]](https://www.kaggle.com/datasets/shivam2503/diamonds), which contains several information about diverse diamonds, such as their dimensions, the quality of their cuts, their prices, etc... The goal of the lab will be to define linear regression models to best estimate diamonds prices using a bunch of predictor variables, and to understand the meaning of the obtained coefficients.**\n",
    "\n",
    "**Dataset's column information :**\n",
    "\n",
    "*   'price' : price in US dollars.\n",
    "*   'carat' : weight of the diamond. \n",
    "*   'cut' : quality of the cut (Fair, Good, Very Good, Premium, Ideal)\n",
    "*   'color' : diamond's color's, from J (worst) to D (best).\n",
    "*  'clarity' : how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))\n",
    "* 'x' : length in mm.\n",
    "* 'y' : width in mm. \n",
    "* 'z' : height in mm.\n",
    "* 'table' : width of top of the diamond relative to its widest point. \n",
    "\n"
   ],
   "metadata": {
    "id": "5FnAdl_9MJ7-",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "**Import necessary libraries**"
   ],
   "metadata": {
    "id": "u6jRRhX7PZQD",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "id": "ujsqLRDOa1-s",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import statsmodels.api as sm\n",
    "import numpy as np\n",
    "import pandas as pd \n",
    "from patsy import dmatrices\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib notebook"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "**1) Load the dataset, take a look at its properties (shape, data type, etc...). Be careful to set the dataframe incides correctly. Check for missing values, and replace them appropriately if any are present.**\n",
    "\n",
    "**Drop the column 'depth'**"
   ],
   "metadata": {
    "id": "Nd9BGtrOPfq3",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "source": [
    "file = '../data/diamonds.csv'\n",
    "\n",
    "##Read dataframe##\n",
    "\n",
    "df = pd.read_csv(file,index_col=0)\n",
    "df= df.astype({'cut':'category','color':'category','clarity':'category'})\n",
    "df.drop(['depth'],axis=1,inplace=True)\n",
    "\n",
    "print(df.head())\n",
    "print(df.info())\n",
    "print(df.isna().sum())"
   ],
   "metadata": {
    "id": "z0QYojPwf0-s",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 16,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   carat      cut color clarity  table  price     x     y     z\n",
      "1   0.23    Ideal     E     SI2   55.0    326  3.95  3.98  2.43\n",
      "2   0.21  Premium     E     SI1   61.0    326  3.89  3.84  2.31\n",
      "3   0.23     Good     E     VS1   65.0    327  4.05  4.07  2.31\n",
      "4   0.29  Premium     I     VS2   58.0    334  4.20  4.23  2.63\n",
      "5   0.31     Good     J     SI2   58.0    335  4.34  4.35  2.75\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 53940 entries, 1 to 53940\n",
      "Data columns (total 9 columns):\n",
      " #   Column   Non-Null Count  Dtype   \n",
      "---  ------   --------------  -----   \n",
      " 0   carat    53940 non-null  float64 \n",
      " 1   cut      53940 non-null  category\n",
      " 2   color    53940 non-null  category\n",
      " 3   clarity  53940 non-null  category\n",
      " 4   table    53940 non-null  float64 \n",
      " 5   price    53940 non-null  int64   \n",
      " 6   x        53940 non-null  float64 \n",
      " 7   y        53940 non-null  float64 \n",
      " 8   z        53940 non-null  float64 \n",
      "dtypes: category(3), float64(5), int64(1)\n",
      "memory usage: 3.0 MB\n",
      "None\n",
      "carat      0\n",
      "cut        0\n",
      "color      0\n",
      "clarity    0\n",
      "table      0\n",
      "price      0\n",
      "x          0\n",
      "y          0\n",
      "z          0\n",
      "dtype: int64\n"
     ]
    }
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "**2) Generate scatter plots of the variable 'price' against the variables 'x', 'y' and 'z'. Do you notice anything strange ? How would you handle such cases ?** "
   ],
   "metadata": {
    "id": "an3oV-dUZtP-",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "source": [
    "x1 = df['x'].values\n",
    "x2 = df['y'].values\n",
    "x3 = df['z'].values\n",
    "\n",
    "y = df['price'].values\n",
    "\n",
    "fig = plt.figure()\n",
    "ax = plt.axes(projection ='3d')\n",
    "ax.scatter(x1,x2,x3)\n",
    "ax.set_xlabel('x')\n",
    "ax.set_ylabel('y')\n",
    "ax.set_zlabel('z')\n",
    "img = ax.scatter(x1, x2, x3, c=y, cmap='YlOrRd', alpha=1)\n",
    "\n",
    "plt.show()\n"
   ],
   "metadata": {
    "id": "AE9Eo-7_Z_0G",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 17,
   "outputs": [
    {
     "data": {
      "text/plain": "<IPython.core.display.Javascript object>",
      "application/javascript": "/* Put everything inside the global mpl namespace */\n/* global mpl */\nwindow.mpl = {};\n\nmpl.get_websocket_type = function () {\n    if (typeof WebSocket !== 'undefined') {\n        return WebSocket;\n    } else if (typeof MozWebSocket !== 'undefined') {\n        return MozWebSocket;\n    } else {\n        alert(\n            'Your browser does not have WebSocket support. ' +\n                'Please try Chrome, Safari or Firefox ≥ 6. ' +\n                'Firefox 4 and 5 are also supported but you ' +\n                'have to enable WebSockets in about:config.'\n        );\n    }\n};\n\nmpl.figure = function (figure_id, websocket, ondownload, parent_element) {\n    this.id = figure_id;\n\n    this.ws = websocket;\n\n    this.supports_binary = this.ws.binaryType !== undefined;\n\n    if (!this.supports_binary) {\n        var warnings = document.getElementById('mpl-warnings');\n        if (warnings) {\n            warnings.style.display = 'block';\n            warnings.textContent =\n                'This browser does not support binary websocket messages. ' +\n                'Performance may be slow.';\n        }\n    }\n\n    this.imageObj = new Image();\n\n    this.context = undefined;\n    this.message = undefined;\n    this.canvas = undefined;\n    this.rubberband_canvas = undefined;\n    this.rubberband_context = undefined;\n    this.format_dropdown = undefined;\n\n    this.image_mode = 'full';\n\n    this.root = document.createElement('div');\n    this.root.setAttribute('style', 'display: inline-block');\n    this._root_extra_style(this.root);\n\n    parent_element.appendChild(this.root);\n\n    this._init_header(this);\n    this._init_canvas(this);\n    this._init_toolbar(this);\n\n    var fig = this;\n\n    this.waiting = false;\n\n    this.ws.onopen = function () {\n        fig.send_message('supports_binary', { value: fig.supports_binary });\n        fig.send_message('send_image_mode', {});\n        if (fig.ratio !== 1) {\n            fig.send_message('set_device_pixel_ratio', {\n                device_pixel_ratio: fig.ratio,\n            });\n        }\n        fig.send_message('refresh', {});\n    };\n\n    this.imageObj.onload = function () {\n        if (fig.image_mode === 'full') {\n            // Full images could contain transparency (where diff images\n            // almost always do), so we need to clear the canvas so that\n            // there is no ghosting.\n            fig.context.clearRect(0, 0, fig.canvas.width, fig.canvas.height);\n        }\n        fig.context.drawImage(fig.imageObj, 0, 0);\n    };\n\n    this.imageObj.onunload = function () {\n        fig.ws.close();\n    };\n\n    this.ws.onmessage = this._make_on_message_function(this);\n\n    this.ondownload = ondownload;\n};\n\nmpl.figure.prototype._init_header = function () {\n    var titlebar = document.createElement('div');\n    titlebar.classList =\n        'ui-dialog-titlebar ui-widget-header ui-corner-all ui-helper-clearfix';\n    var titletext = document.createElement('div');\n    titletext.classList = 'ui-dialog-title';\n    titletext.setAttribute(\n        'style',\n        'width: 100%; text-align: center; padding: 3px;'\n    );\n    titlebar.appendChild(titletext);\n    this.root.appendChild(titlebar);\n    this.header = titletext;\n};\n\nmpl.figure.prototype._canvas_extra_style = function (_canvas_div) {};\n\nmpl.figure.prototype._root_extra_style = function (_canvas_div) {};\n\nmpl.figure.prototype._init_canvas = function () {\n    var fig = this;\n\n    var canvas_div = (this.canvas_div = document.createElement('div'));\n    canvas_div.setAttribute(\n        'style',\n        'border: 1px solid #ddd;' +\n            'box-sizing: content-box;' +\n            'clear: both;' +\n            'min-height: 1px;' +\n            'min-width: 1px;' +\n            'outline: 0;' +\n            'overflow: hidden;' +\n            'position: relative;' +\n            'resize: both;'\n    );\n\n    function on_keyboard_event_closure(name) {\n        return function (event) {\n            return fig.key_event(event, name);\n        };\n    }\n\n    canvas_div.addEventListener(\n        'keydown',\n        on_keyboard_event_closure('key_press')\n    );\n    canvas_div.addEventListener(\n        'keyup',\n        on_keyboard_event_closure('key_release')\n    );\n\n    this._canvas_extra_style(canvas_div);\n    this.root.appendChild(canvas_div);\n\n    var canvas = (this.canvas = document.createElement('canvas'));\n    canvas.classList.add('mpl-canvas');\n    canvas.setAttribute('style', 'box-sizing: content-box;');\n\n    this.context = canvas.getContext('2d');\n\n    var backingStore =\n        this.context.backingStorePixelRatio ||\n        this.context.webkitBackingStorePixelRatio ||\n        this.context.mozBackingStorePixelRatio ||\n        this.context.msBackingStorePixelRatio ||\n        this.context.oBackingStorePixelRatio ||\n        this.context.backingStorePixelRatio ||\n        1;\n\n    this.ratio = (window.devicePixelRatio || 1) / backingStore;\n\n    var rubberband_canvas = (this.rubberband_canvas = document.createElement(\n        'canvas'\n    ));\n    rubberband_canvas.setAttribute(\n        'style',\n        'box-sizing: content-box; position: absolute; left: 0; top: 0; z-index: 1;'\n    );\n\n    // Apply a ponyfill if ResizeObserver is not implemented by browser.\n    if (this.ResizeObserver === undefined) {\n        if (window.ResizeObserver !== undefined) {\n            this.ResizeObserver = window.ResizeObserver;\n        } else {\n            var obs = _JSXTOOLS_RESIZE_OBSERVER({});\n            this.ResizeObserver = obs.ResizeObserver;\n        }\n    }\n\n    this.resizeObserverInstance = new this.ResizeObserver(function (entries) {\n        var nentries = entries.length;\n        for (var i = 0; i < nentries; i++) {\n            var entry = entries[i];\n            var width, height;\n            if (entry.contentBoxSize) {\n                if (entry.contentBoxSize instanceof Array) {\n                    // Chrome 84 implements new version of spec.\n                    width = entry.contentBoxSize[0].inlineSize;\n                    height = entry.contentBoxSize[0].blockSize;\n                } else {\n                    // Firefox implements old version of spec.\n                    width = entry.contentBoxSize.inlineSize;\n                    height = entry.contentBoxSize.blockSize;\n                }\n            } else {\n                // Chrome <84 implements even older version of spec.\n                width = entry.contentRect.width;\n                height = entry.contentRect.height;\n            }\n\n            // Keep the size of the canvas and rubber band canvas in sync with\n            // the canvas container.\n            if (entry.devicePixelContentBoxSize) {\n                // Chrome 84 implements new version of spec.\n                canvas.setAttribute(\n                    'width',\n                    entry.devicePixelContentBoxSize[0].inlineSize\n                );\n                canvas.setAttribute(\n                    'height',\n                    entry.devicePixelContentBoxSize[0].blockSize\n                );\n            } else {\n                canvas.setAttribute('width', width * fig.ratio);\n                canvas.setAttribute('height', height * fig.ratio);\n            }\n            canvas.setAttribute(\n                'style',\n                'width: ' + width + 'px; height: ' + height + 'px;'\n            );\n\n            rubberband_canvas.setAttribute('width', width);\n            rubberband_canvas.setAttribute('height', height);\n\n            // And update the size in Python. We ignore the initial 0/0 size\n            // that occurs as the element is placed into the DOM, which should\n            // otherwise not happen due to the minimum size styling.\n            if (fig.ws.readyState == 1 && width != 0 && height != 0) {\n                fig.request_resize(width, height);\n            }\n        }\n    });\n    this.resizeObserverInstance.observe(canvas_div);\n\n    function on_mouse_event_closure(name) {\n        return function (event) {\n            return fig.mouse_event(event, name);\n        };\n    }\n\n    rubberband_canvas.addEventListener(\n        'mousedown',\n        on_mouse_event_closure('button_press')\n    );\n    rubberband_canvas.addEventListener(\n        'mouseup',\n        on_mouse_event_closure('button_release')\n    );\n    rubberband_canvas.addEventListener(\n        'dblclick',\n        on_mouse_event_closure('dblclick')\n    );\n    // Throttle sequential mouse events to 1 every 20ms.\n    rubberband_canvas.addEventListener(\n        'mousemove',\n        on_mouse_event_closure('motion_notify')\n    );\n\n    rubberband_canvas.addEventListener(\n        'mouseenter',\n        on_mouse_event_closure('figure_enter')\n    );\n    rubberband_canvas.addEventListener(\n        'mouseleave',\n        on_mouse_event_closure('figure_leave')\n    );\n\n    canvas_div.addEventListener('wheel', function (event) {\n        if (event.deltaY < 0) {\n            event.step = 1;\n        } else {\n            event.step = -1;\n        }\n        on_mouse_event_closure('scroll')(event);\n    });\n\n    canvas_div.appendChild(canvas);\n    canvas_div.appendChild(rubberband_canvas);\n\n    this.rubberband_context = rubberband_canvas.getContext('2d');\n    this.rubberband_context.strokeStyle = '#000000';\n\n    this._resize_canvas = function (width, height, forward) {\n        if (forward) {\n            canvas_div.style.width = width + 'px';\n            canvas_div.style.height = height + 'px';\n        }\n    };\n\n    // Disable right mouse context menu.\n    this.rubberband_canvas.addEventListener('contextmenu', function (_e) {\n        event.preventDefault();\n        return false;\n    });\n\n    function set_focus() {\n        canvas.focus();\n        canvas_div.focus();\n    }\n\n    window.setTimeout(set_focus, 100);\n};\n\nmpl.figure.prototype._init_toolbar = function () {\n    var fig = this;\n\n    var toolbar = document.createElement('div');\n    toolbar.classList = 'mpl-toolbar';\n    this.root.appendChild(toolbar);\n\n    function on_click_closure(name) {\n        return function (_event) {\n            return fig.toolbar_button_onclick(name);\n        };\n    }\n\n    function on_mouseover_closure(tooltip) {\n        return function (event) {\n            if (!event.currentTarget.disabled) {\n                return fig.toolbar_button_onmouseover(tooltip);\n            }\n        };\n    }\n\n    fig.buttons = {};\n    var buttonGroup = document.createElement('div');\n    buttonGroup.classList = 'mpl-button-group';\n    for (var toolbar_ind in mpl.toolbar_items) {\n        var name = mpl.toolbar_items[toolbar_ind][0];\n        var tooltip = mpl.toolbar_items[toolbar_ind][1];\n        var image = mpl.toolbar_items[toolbar_ind][2];\n        var method_name = mpl.toolbar_items[toolbar_ind][3];\n\n        if (!name) {\n            /* Instead of a spacer, we start a new button group. */\n            if (buttonGroup.hasChildNodes()) {\n                toolbar.appendChild(buttonGroup);\n            }\n            buttonGroup = document.createElement('div');\n            buttonGroup.classList = 'mpl-button-group';\n            continue;\n        }\n\n        var button = (fig.buttons[name] = document.createElement('button'));\n        button.classList = 'mpl-widget';\n        button.setAttribute('role', 'button');\n        button.setAttribute('aria-disabled', 'false');\n        button.addEventListener('click', on_click_closure(method_name));\n        button.addEventListener('mouseover', on_mouseover_closure(tooltip));\n\n        var icon_img = document.createElement('img');\n        icon_img.src = '_images/' + image + '.png';\n        icon_img.srcset = '_images/' + image + '_large.png 2x';\n        icon_img.alt = tooltip;\n        button.appendChild(icon_img);\n\n        buttonGroup.appendChild(button);\n    }\n\n    if (buttonGroup.hasChildNodes()) {\n        toolbar.appendChild(buttonGroup);\n    }\n\n    var fmt_picker = document.createElement('select');\n    fmt_picker.classList = 'mpl-widget';\n    toolbar.appendChild(fmt_picker);\n    this.format_dropdown = fmt_picker;\n\n    for (var ind in mpl.extensions) {\n        var fmt = mpl.extensions[ind];\n        var option = document.createElement('option');\n        option.selected = fmt === mpl.default_extension;\n        option.innerHTML = fmt;\n        fmt_picker.appendChild(option);\n    }\n\n    var status_bar = document.createElement('span');\n    status_bar.classList = 'mpl-message';\n    toolbar.appendChild(status_bar);\n    this.message = status_bar;\n};\n\nmpl.figure.prototype.request_resize = function (x_pixels, y_pixels) {\n    // Request matplotlib to resize the figure. Matplotlib will then trigger a resize in the client,\n    // which will in turn request a refresh of the image.\n    this.send_message('resize', { width: x_pixels, height: y_pixels });\n};\n\nmpl.figure.prototype.send_message = function (type, properties) {\n    properties['type'] = type;\n    properties['figure_id'] = this.id;\n    this.ws.send(JSON.stringify(properties));\n};\n\nmpl.figure.prototype.send_draw_message = function () {\n    if (!this.waiting) {\n        this.waiting = true;\n        this.ws.send(JSON.stringify({ type: 'draw', figure_id: this.id }));\n    }\n};\n\nmpl.figure.prototype.handle_save = function (fig, _msg) {\n    var format_dropdown = fig.format_dropdown;\n    var format = format_dropdown.options[format_dropdown.selectedIndex].value;\n    fig.ondownload(fig, format);\n};\n\nmpl.figure.prototype.handle_resize = function (fig, msg) {\n    var size = msg['size'];\n    if (size[0] !== fig.canvas.width || size[1] !== fig.canvas.height) {\n        fig._resize_canvas(size[0], size[1], msg['forward']);\n        fig.send_message('refresh', {});\n    }\n};\n\nmpl.figure.prototype.handle_rubberband = function (fig, msg) {\n    var x0 = msg['x0'] / fig.ratio;\n    var y0 = (fig.canvas.height - msg['y0']) / fig.ratio;\n    var x1 = msg['x1'] / fig.ratio;\n    var y1 = (fig.canvas.height - msg['y1']) / fig.ratio;\n    x0 = Math.floor(x0) + 0.5;\n    y0 = Math.floor(y0) + 0.5;\n    x1 = Math.floor(x1) + 0.5;\n    y1 = Math.floor(y1) + 0.5;\n    var min_x = Math.min(x0, x1);\n    var min_y = Math.min(y0, y1);\n    var width = Math.abs(x1 - x0);\n    var height = Math.abs(y1 - y0);\n\n    fig.rubberband_context.clearRect(\n        0,\n        0,\n        fig.canvas.width / fig.ratio,\n        fig.canvas.height / fig.ratio\n    );\n\n    fig.rubberband_context.strokeRect(min_x, min_y, width, height);\n};\n\nmpl.figure.prototype.handle_figure_label = function (fig, msg) {\n    // Updates the figure title.\n    fig.header.textContent = msg['label'];\n};\n\nmpl.figure.prototype.handle_cursor = function (fig, msg) {\n    fig.rubberband_canvas.style.cursor = msg['cursor'];\n};\n\nmpl.figure.prototype.handle_message = function (fig, msg) {\n    fig.message.textContent = msg['message'];\n};\n\nmpl.figure.prototype.handle_draw = function (fig, _msg) {\n    // Request the server to send over a new figure.\n    fig.send_draw_message();\n};\n\nmpl.figure.prototype.handle_image_mode = function (fig, msg) {\n    fig.image_mode = msg['mode'];\n};\n\nmpl.figure.prototype.handle_history_buttons = function (fig, msg) {\n    for (var key in msg) {\n        if (!(key in fig.buttons)) {\n            continue;\n        }\n        fig.buttons[key].disabled = !msg[key];\n        fig.buttons[key].setAttribute('aria-disabled', !msg[key]);\n    }\n};\n\nmpl.figure.prototype.handle_navigate_mode = function (fig, msg) {\n    if (msg['mode'] === 'PAN') {\n        fig.buttons['Pan'].classList.add('active');\n        fig.buttons['Zoom'].classList.remove('active');\n    } else if (msg['mode'] === 'ZOOM') {\n        fig.buttons['Pan'].classList.remove('active');\n        fig.buttons['Zoom'].classList.add('active');\n    } else {\n        fig.buttons['Pan'].classList.remove('active');\n        fig.buttons['Zoom'].classList.remove('active');\n    }\n};\n\nmpl.figure.prototype.updated_canvas_event = function () {\n    // Called whenever the canvas gets updated.\n    this.send_message('ack', {});\n};\n\n// A function to construct a web socket function for onmessage handling.\n// Called in the figure constructor.\nmpl.figure.prototype._make_on_message_function = function (fig) {\n    return function socket_on_message(evt) {\n        if (evt.data instanceof Blob) {\n            var img = evt.data;\n            if (img.type !== 'image/png') {\n                /* FIXME: We get \"Resource interpreted as Image but\n                 * transferred with MIME type text/plain:\" errors on\n                 * Chrome.  But how to set the MIME type?  It doesn't seem\n                 * to be part of the websocket stream */\n                img.type = 'image/png';\n            }\n\n            /* Free the memory for the previous frames */\n            if (fig.imageObj.src) {\n                (window.URL || window.webkitURL).revokeObjectURL(\n                    fig.imageObj.src\n                );\n            }\n\n            fig.imageObj.src = (window.URL || window.webkitURL).createObjectURL(\n                img\n            );\n            fig.updated_canvas_event();\n            fig.waiting = false;\n            return;\n        } else if (\n            typeof evt.data === 'string' &&\n            evt.data.slice(0, 21) === 'data:image/png;base64'\n        ) {\n            fig.imageObj.src = evt.data;\n            fig.updated_canvas_event();\n            fig.waiting = false;\n            return;\n        }\n\n        var msg = JSON.parse(evt.data);\n        var msg_type = msg['type'];\n\n        // Call the  \"handle_{type}\" callback, which takes\n        // the figure and JSON message as its only arguments.\n        try {\n            var callback = fig['handle_' + msg_type];\n        } catch (e) {\n            console.log(\n                \"No handler for the '\" + msg_type + \"' message type: \",\n                msg\n            );\n            return;\n        }\n\n        if (callback) {\n            try {\n                // console.log(\"Handling '\" + msg_type + \"' message: \", msg);\n                callback(fig, msg);\n            } catch (e) {\n                console.log(\n                    \"Exception inside the 'handler_\" + msg_type + \"' callback:\",\n                    e,\n                    e.stack,\n                    msg\n                );\n            }\n        }\n    };\n};\n\n// from https://stackoverflow.com/questions/1114465/getting-mouse-location-in-canvas\nmpl.findpos = function (e) {\n    //this section is from http://www.quirksmode.org/js/events_properties.html\n    var targ;\n    if (!e) {\n        e = window.event;\n    }\n    if (e.target) {\n        targ = e.target;\n    } else if (e.srcElement) {\n        targ = e.srcElement;\n    }\n    if (targ.nodeType === 3) {\n        // defeat Safari bug\n        targ = targ.parentNode;\n    }\n\n    // pageX,Y are the mouse positions relative to the document\n    var boundingRect = targ.getBoundingClientRect();\n    var x = e.pageX - (boundingRect.left + document.body.scrollLeft);\n    var y = e.pageY - (boundingRect.top + document.body.scrollTop);\n\n    return { x: x, y: y };\n};\n\n/*\n * return a copy of an object with only non-object keys\n * we need this to avoid circular references\n * https://stackoverflow.com/a/24161582/3208463\n */\nfunction simpleKeys(original) {\n    return Object.keys(original).reduce(function (obj, key) {\n        if (typeof original[key] !== 'object') {\n            obj[key] = original[key];\n        }\n        return obj;\n    }, {});\n}\n\nmpl.figure.prototype.mouse_event = function (event, name) {\n    var canvas_pos = mpl.findpos(event);\n\n    if (name === 'button_press') {\n        this.canvas.focus();\n        this.canvas_div.focus();\n    }\n\n    var x = canvas_pos.x * this.ratio;\n    var y = canvas_pos.y * this.ratio;\n\n    this.send_message(name, {\n        x: x,\n        y: y,\n        button: event.button,\n        step: event.step,\n        guiEvent: simpleKeys(event),\n    });\n\n    /* This prevents the web browser from automatically changing to\n     * the text insertion cursor when the button is pressed.  We want\n     * to control all of the cursor setting manually through the\n     * 'cursor' event from matplotlib */\n    event.preventDefault();\n    return false;\n};\n\nmpl.figure.prototype._key_event_extra = function (_event, _name) {\n    // Handle any extra behaviour associated with a key event\n};\n\nmpl.figure.prototype.key_event = function (event, name) {\n    // Prevent repeat events\n    if (name === 'key_press') {\n        if (event.key === this._key) {\n            return;\n        } else {\n            this._key = event.key;\n        }\n    }\n    if (name === 'key_release') {\n        this._key = null;\n    }\n\n    var value = '';\n    if (event.ctrlKey && event.key !== 'Control') {\n        value += 'ctrl+';\n    }\n    else if (event.altKey && event.key !== 'Alt') {\n        value += 'alt+';\n    }\n    else if (event.shiftKey && event.key !== 'Shift') {\n        value += 'shift+';\n    }\n\n    value += 'k' + event.key;\n\n    this._key_event_extra(event, name);\n\n    this.send_message(name, { key: value, guiEvent: simpleKeys(event) });\n    return false;\n};\n\nmpl.figure.prototype.toolbar_button_onclick = function (name) {\n    if (name === 'download') {\n        this.handle_save(this, null);\n    } else {\n        this.send_message('toolbar_button', { name: name });\n    }\n};\n\nmpl.figure.prototype.toolbar_button_onmouseover = function (tooltip) {\n    this.message.textContent = tooltip;\n};\n\n///////////////// REMAINING CONTENT GENERATED BY embed_js.py /////////////////\n// prettier-ignore\nvar _JSXTOOLS_RESIZE_OBSERVER=function(A){var t,i=new WeakMap,n=new WeakMap,a=new WeakMap,r=new WeakMap,o=new Set;function s(e){if(!(this instanceof s))throw new TypeError(\"Constructor requires 'new' operator\");i.set(this,e)}function h(){throw new TypeError(\"Function is not a constructor\")}function c(e,t,i,n){e=0 in arguments?Number(arguments[0]):0,t=1 in arguments?Number(arguments[1]):0,i=2 in arguments?Number(arguments[2]):0,n=3 in arguments?Number(arguments[3]):0,this.right=(this.x=this.left=e)+(this.width=i),this.bottom=(this.y=this.top=t)+(this.height=n),Object.freeze(this)}function d(){t=requestAnimationFrame(d);var s=new WeakMap,p=new Set;o.forEach((function(t){r.get(t).forEach((function(i){var r=t instanceof window.SVGElement,o=a.get(t),d=r?0:parseFloat(o.paddingTop),f=r?0:parseFloat(o.paddingRight),l=r?0:parseFloat(o.paddingBottom),u=r?0:parseFloat(o.paddingLeft),g=r?0:parseFloat(o.borderTopWidth),m=r?0:parseFloat(o.borderRightWidth),w=r?0:parseFloat(o.borderBottomWidth),b=u+f,F=d+l,v=(r?0:parseFloat(o.borderLeftWidth))+m,W=g+w,y=r?0:t.offsetHeight-W-t.clientHeight,E=r?0:t.offsetWidth-v-t.clientWidth,R=b+v,z=F+W,M=r?t.width:parseFloat(o.width)-R-E,O=r?t.height:parseFloat(o.height)-z-y;if(n.has(t)){var k=n.get(t);if(k[0]===M&&k[1]===O)return}n.set(t,[M,O]);var S=Object.create(h.prototype);S.target=t,S.contentRect=new c(u,d,M,O),s.has(i)||(s.set(i,[]),p.add(i)),s.get(i).push(S)}))})),p.forEach((function(e){i.get(e).call(e,s.get(e),e)}))}return s.prototype.observe=function(i){if(i instanceof window.Element){r.has(i)||(r.set(i,new Set),o.add(i),a.set(i,window.getComputedStyle(i)));var n=r.get(i);n.has(this)||n.add(this),cancelAnimationFrame(t),t=requestAnimationFrame(d)}},s.prototype.unobserve=function(i){if(i instanceof window.Element&&r.has(i)){var n=r.get(i);n.has(this)&&(n.delete(this),n.size||(r.delete(i),o.delete(i))),n.size||r.delete(i),o.size||cancelAnimationFrame(t)}},A.DOMRectReadOnly=c,A.ResizeObserver=s,A.ResizeObserverEntry=h,A}; // eslint-disable-line\nmpl.toolbar_items = [[\"Home\", \"Reset original view\", \"fa fa-home icon-home\", \"home\"], [\"Back\", \"Back to previous view\", \"fa fa-arrow-left icon-arrow-left\", \"back\"], [\"Forward\", \"Forward to next view\", \"fa fa-arrow-right icon-arrow-right\", \"forward\"], [\"\", \"\", \"\", \"\"], [\"Pan\", \"Left button pans, Right button zooms\\nx/y fixes axis, CTRL fixes aspect\", \"fa fa-arrows icon-move\", \"pan\"], [\"Zoom\", \"Zoom to rectangle\\nx/y fixes axis\", \"fa fa-square-o icon-check-empty\", \"zoom\"], [\"\", \"\", \"\", \"\"], [\"Download\", \"Download plot\", \"fa fa-floppy-o icon-save\", \"download\"]];\n\nmpl.extensions = [\"eps\", \"jpeg\", \"pgf\", \"pdf\", \"png\", \"ps\", \"raw\", \"svg\", \"tif\"];\n\nmpl.default_extension = \"png\";/* global mpl */\n\nvar comm_websocket_adapter = function (comm) {\n    // Create a \"websocket\"-like object which calls the given IPython comm\n    // object with the appropriate methods. Currently this is a non binary\n    // socket, so there is still some room for performance tuning.\n    var ws = {};\n\n    ws.binaryType = comm.kernel.ws.binaryType;\n    ws.readyState = comm.kernel.ws.readyState;\n    function updateReadyState(_event) {\n        if (comm.kernel.ws) {\n            ws.readyState = comm.kernel.ws.readyState;\n        } else {\n            ws.readyState = 3; // Closed state.\n        }\n    }\n    comm.kernel.ws.addEventListener('open', updateReadyState);\n    comm.kernel.ws.addEventListener('close', updateReadyState);\n    comm.kernel.ws.addEventListener('error', updateReadyState);\n\n    ws.close = function () {\n        comm.close();\n    };\n    ws.send = function (m) {\n        //console.log('sending', m);\n        comm.send(m);\n    };\n    // Register the callback with on_msg.\n    comm.on_msg(function (msg) {\n        //console.log('receiving', msg['content']['data'], msg);\n        var data = msg['content']['data'];\n        if (data['blob'] !== undefined) {\n            data = {\n                data: new Blob(msg['buffers'], { type: data['blob'] }),\n            };\n        }\n        // Pass the mpl event to the overridden (by mpl) onmessage function.\n        ws.onmessage(data);\n    });\n    return ws;\n};\n\nmpl.mpl_figure_comm = function (comm, msg) {\n    // This is the function which gets called when the mpl process\n    // starts-up an IPython Comm through the \"matplotlib\" channel.\n\n    var id = msg.content.data.id;\n    // Get hold of the div created by the display call when the Comm\n    // socket was opened in Python.\n    var element = document.getElementById(id);\n    var ws_proxy = comm_websocket_adapter(comm);\n\n    function ondownload(figure, _format) {\n        window.open(figure.canvas.toDataURL());\n    }\n\n    var fig = new mpl.figure(id, ws_proxy, ondownload, element);\n\n    // Call onopen now - mpl needs it, as it is assuming we've passed it a real\n    // web socket which is closed, not our websocket->open comm proxy.\n    ws_proxy.onopen();\n\n    fig.parent_element = element;\n    fig.cell_info = mpl.find_output_cell(\"<div id='\" + id + \"'></div>\");\n    if (!fig.cell_info) {\n        console.error('Failed to find cell for figure', id, fig);\n        return;\n    }\n    fig.cell_info[0].output_area.element.on(\n        'cleared',\n        { fig: fig },\n        fig._remove_fig_handler\n    );\n};\n\nmpl.figure.prototype.handle_close = function (fig, msg) {\n    var width = fig.canvas.width / fig.ratio;\n    fig.cell_info[0].output_area.element.off(\n        'cleared',\n        fig._remove_fig_handler\n    );\n    fig.resizeObserverInstance.unobserve(fig.canvas_div);\n\n    // Update the output cell to use the data from the current canvas.\n    fig.push_to_output();\n    var dataURL = fig.canvas.toDataURL();\n    // Re-enable the keyboard manager in IPython - without this line, in FF,\n    // the notebook keyboard shortcuts fail.\n    IPython.keyboard_manager.enable();\n    fig.parent_element.innerHTML =\n        '<img src=\"' + dataURL + '\" width=\"' + width + '\">';\n    fig.close_ws(fig, msg);\n};\n\nmpl.figure.prototype.close_ws = function (fig, msg) {\n    fig.send_message('closing', msg);\n    // fig.ws.close()\n};\n\nmpl.figure.prototype.push_to_output = function (_remove_interactive) {\n    // Turn the data on the canvas into data in the output cell.\n    var width = this.canvas.width / this.ratio;\n    var dataURL = this.canvas.toDataURL();\n    this.cell_info[1]['text/html'] =\n        '<img src=\"' + dataURL + '\" width=\"' + width + '\">';\n};\n\nmpl.figure.prototype.updated_canvas_event = function () {\n    // Tell IPython that the notebook contents must change.\n    IPython.notebook.set_dirty(true);\n    this.send_message('ack', {});\n    var fig = this;\n    // Wait a second, then push the new image to the DOM so\n    // that it is saved nicely (might be nice to debounce this).\n    setTimeout(function () {\n        fig.push_to_output();\n    }, 1000);\n};\n\nmpl.figure.prototype._init_toolbar = function () {\n    var fig = this;\n\n    var toolbar = document.createElement('div');\n    toolbar.classList = 'btn-toolbar';\n    this.root.appendChild(toolbar);\n\n    function on_click_closure(name) {\n        return function (_event) {\n            return fig.toolbar_button_onclick(name);\n        };\n    }\n\n    function on_mouseover_closure(tooltip) {\n        return function (event) {\n            if (!event.currentTarget.disabled) {\n                return fig.toolbar_button_onmouseover(tooltip);\n            }\n        };\n    }\n\n    fig.buttons = {};\n    var buttonGroup = document.createElement('div');\n    buttonGroup.classList = 'btn-group';\n    var button;\n    for (var toolbar_ind in mpl.toolbar_items) {\n        var name = mpl.toolbar_items[toolbar_ind][0];\n        var tooltip = mpl.toolbar_items[toolbar_ind][1];\n        var image = mpl.toolbar_items[toolbar_ind][2];\n        var method_name = mpl.toolbar_items[toolbar_ind][3];\n\n        if (!name) {\n            /* Instead of a spacer, we start a new button group. */\n            if (buttonGroup.hasChildNodes()) {\n                toolbar.appendChild(buttonGroup);\n            }\n            buttonGroup = document.createElement('div');\n            buttonGroup.classList = 'btn-group';\n            continue;\n        }\n\n        button = fig.buttons[name] = document.createElement('button');\n        button.classList = 'btn btn-default';\n        button.href = '#';\n        button.title = name;\n        button.innerHTML = '<i class=\"fa ' + image + ' fa-lg\"></i>';\n        button.addEventListener('click', on_click_closure(method_name));\n        button.addEventListener('mouseover', on_mouseover_closure(tooltip));\n        buttonGroup.appendChild(button);\n    }\n\n    if (buttonGroup.hasChildNodes()) {\n        toolbar.appendChild(buttonGroup);\n    }\n\n    // Add the status bar.\n    var status_bar = document.createElement('span');\n    status_bar.classList = 'mpl-message pull-right';\n    toolbar.appendChild(status_bar);\n    this.message = status_bar;\n\n    // Add the close button to the window.\n    var buttongrp = document.createElement('div');\n    buttongrp.classList = 'btn-group inline pull-right';\n    button = document.createElement('button');\n    button.classList = 'btn btn-mini btn-primary';\n    button.href = '#';\n    button.title = 'Stop Interaction';\n    button.innerHTML = '<i class=\"fa fa-power-off icon-remove icon-large\"></i>';\n    button.addEventListener('click', function (_evt) {\n        fig.handle_close(fig, {});\n    });\n    button.addEventListener(\n        'mouseover',\n        on_mouseover_closure('Stop Interaction')\n    );\n    buttongrp.appendChild(button);\n    var titlebar = this.root.querySelector('.ui-dialog-titlebar');\n    titlebar.insertBefore(buttongrp, titlebar.firstChild);\n};\n\nmpl.figure.prototype._remove_fig_handler = function (event) {\n    var fig = event.data.fig;\n    if (event.target !== this) {\n        // Ignore bubbled events from children.\n        return;\n    }\n    fig.close_ws(fig, {});\n};\n\nmpl.figure.prototype._root_extra_style = function (el) {\n    el.style.boxSizing = 'content-box'; // override notebook setting of border-box.\n};\n\nmpl.figure.prototype._canvas_extra_style = function (el) {\n    // this is important to make the div 'focusable\n    el.setAttribute('tabindex', 0);\n    // reach out to IPython and tell the keyboard manager to turn it's self\n    // off when our div gets focus\n\n    // location in version 3\n    if (IPython.notebook.keyboard_manager) {\n        IPython.notebook.keyboard_manager.register_events(el);\n    } else {\n        // location in version 2\n        IPython.keyboard_manager.register_events(el);\n    }\n};\n\nmpl.figure.prototype._key_event_extra = function (event, _name) {\n    // Check for shift+enter\n    if (event.shiftKey && event.which === 13) {\n        this.canvas_div.blur();\n        // select the cell after this one\n        var index = IPython.notebook.find_cell_index(this.cell_info[0]);\n        IPython.notebook.select(index + 1);\n    }\n};\n\nmpl.figure.prototype.handle_save = function (fig, _msg) {\n    fig.ondownload(fig, null);\n};\n\nmpl.find_output_cell = function (html_output) {\n    // Return the cell and output element which can be found *uniquely* in the notebook.\n    // Note - this is a bit hacky, but it is done because the \"notebook_saving.Notebook\"\n    // IPython event is triggered only after the cells have been serialised, which for\n    // our purposes (turning an active figure into a static one), is too late.\n    var cells = IPython.notebook.get_cells();\n    var ncells = cells.length;\n    for (var i = 0; i < ncells; i++) {\n        var cell = cells[i];\n        if (cell.cell_type === 'code') {\n            for (var j = 0; j < cell.output_area.outputs.length; j++) {\n                var data = cell.output_area.outputs[j];\n                if (data.data) {\n                    // IPython >= 3 moved mimebundle to data attribute of output\n                    data = data.data;\n                }\n                if (data['text/html'] === html_output) {\n                    return [cell, data, j];\n                }\n            }\n        }\n    }\n};\n\n// Register the function which deals with the matplotlib target/channel.\n// The kernel may be null if the page has been refreshed.\nif (IPython.notebook.kernel !== null) {\n    IPython.notebook.kernel.comm_manager.register_target(\n        'matplotlib',\n        mpl.mpl_figure_comm\n    );\n}\n"
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": "<IPython.core.display.HTML object>",
      "text/html": "<div id='06580a48-44f0-4a6f-914a-1c20b67bfd99'></div>"
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ]
  },
  {
   "cell_type": "code",
   "source": [
    "df = df[df['x'] != 0]\n",
    "df = df[df['y'] != 0]\n",
    "df = df[df['z'] != 0]"
   ],
   "metadata": {
    "id": "Okib0LkJaDtw",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 18,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8\n",
      "0\n"
     ]
    }
   ]
  },
  {
   "cell_type": "code",
   "source": [],
   "metadata": {
    "id": "F9bG8nfEaGFv",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 1,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [],
   "metadata": {
    "id": "-XWKqWmJn-CV",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 1,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "source": [
    "**3)Select 'price' a the target variable and 'x' as the predictor. Fit a linear regression model to the data, and output the model's summary.**\n",
    "\n",
    "**3.1) Is there evidence of a linear relationship between the target and the predictor variables ? What can you say regrading the statistical significance of the estimated coefficients ?**\n",
    "\n",
    "**3.2) How do you interpret the value of the coefficients ?**\n",
    "\n",
    "**3.3) What are the estimates' 95% confidence intervals, and how do you interpret them ?**"
   ],
   "metadata": {
    "id": "_s5Umk7Yh7YC",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "source": [
    "y, X = dmatrices('price ~ x', data=df, return_type='dataframe')\n",
    "\n",
    "mod = sm.OLS(y,X)\n",
    "\n",
    "res = mod.fit()\n",
    "\n",
    "print(res.summary())"
   ],
   "metadata": {
    "id": "R38E-trKm5SJ",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 23,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                  price   R-squared:                       0.787\n",
      "Model:                            OLS   Adj. R-squared:                  0.787\n",
      "Method:                 Least Squares   F-statistic:                 1.994e+05\n",
      "Date:                Tue, 22 Mar 2022   Prob (F-statistic):               0.00\n",
      "Time:                        14:09:44   Log-Likelihood:            -4.8184e+05\n",
      "No. Observations:               53920   AIC:                         9.637e+05\n",
      "Df Residuals:                   53918   BIC:                         9.637e+05\n",
      "Df Model:                           1                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "==============================================================================\n",
      "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "Intercept  -1.418e+04     41.327   -343.177      0.000   -1.43e+04   -1.41e+04\n",
      "x           3160.2360      7.077    446.578      0.000    3146.366    3174.106\n",
      "==============================================================================\n",
      "Omnibus:                    12156.158   Durbin-Watson:                   0.416\n",
      "Prob(Omnibus):                  0.000   Jarque-Bera (JB):            34567.308\n",
      "Skew:                           1.190   Prob(JB):                         0.00\n",
      "Kurtosis:                       6.119   Cond. No.                         31.3\n",
      "==============================================================================\n",
      "\n",
      "Notes:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
     ]
    }
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "A unit increase of 'x' results in an increase of the averaged price by 3160.23$. "
   ],
   "metadata": {
    "id": "djjiFHMmheXw",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "**4) Add 'y' as another predictor variable, fit the model and output its summary.**\n",
    "\n",
    "**4.1) Is there still evidence of a linear relationship between the target and predictor variables ?**\n",
    "\n",
    "**4.2) How do you interpret the coefficients ?**\n",
    "\n"
   ],
   "metadata": {
    "id": "839JBIPRjiAm",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "source": [
    "y, X = dmatrices('price ~ x+y', data=df, return_type='dataframe')\n",
    "\n",
    "mod = sm.OLS(y,X)\n",
    "\n",
    "res = mod.fit()\n",
    "\n",
    "print(res.summary())"
   ],
   "metadata": {
    "id": "8MLESTI2nGCX",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 27,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                  price   R-squared:                       0.787\n",
      "Model:                            OLS   Adj. R-squared:                  0.787\n",
      "Method:                 Least Squares   F-statistic:                 9.981e+04\n",
      "Date:                Tue, 22 Mar 2022   Prob (F-statistic):               0.00\n",
      "Time:                        14:32:08   Log-Likelihood:            -4.8182e+05\n",
      "No. Observations:               53920   AIC:                         9.636e+05\n",
      "Df Residuals:                   53917   BIC:                         9.637e+05\n",
      "Df Model:                           2                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "==============================================================================\n",
      "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "Intercept  -1.419e+04     41.333   -343.338      0.000   -1.43e+04   -1.41e+04\n",
      "x           2957.9034     31.783     93.064      0.000    2895.608    3020.199\n",
      "y            203.7694     31.206      6.530      0.000     142.605     264.934\n",
      "==============================================================================\n",
      "Omnibus:                    12156.483   Durbin-Watson:                   0.415\n",
      "Prob(Omnibus):                  0.000   Jarque-Bera (JB):            34658.814\n",
      "Skew:                           1.189   Prob(JB):                         0.00\n",
      "Kurtosis:                       6.126   Cond. No.                         47.3\n",
      "==============================================================================\n",
      "\n",
      "Notes:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
     ]
    }
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "**5) Add an interaction term between 'x' and 'y', refit the model and output its summary.**\n",
    "\n",
    "**5.1) Is there still evidence of a linear relationship between the target and predictor variables ?**\n",
    "\n",
    "**5.2) Does the model seems to be a better fit compared to the one with only 'x' and 'y' ?** \n",
    "\n",
    "**5.3) How do you interpret the coefficients ?**"
   ],
   "metadata": {
    "id": "L3KEoQ0dmJci",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "source": [
    "y, X = dmatrices('price ~ x*y', data=df, return_type='dataframe')\n",
    "\n",
    "mod = sm.OLS(y,X)\n",
    "\n",
    "res = mod.fit()\n",
    "\n",
    "print(res.summary())"
   ],
   "metadata": {
    "id": "f_RuymGVncgj",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 28,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                  price   R-squared:                       0.855\n",
      "Model:                            OLS   Adj. R-squared:                  0.855\n",
      "Method:                 Least Squares   F-statistic:                 1.064e+05\n",
      "Date:                Tue, 22 Mar 2022   Prob (F-statistic):               0.00\n",
      "Time:                        14:36:15   Log-Likelihood:            -4.7141e+05\n",
      "No. Observations:               53920   AIC:                         9.428e+05\n",
      "Df Residuals:                   53916   BIC:                         9.429e+05\n",
      "Df Model:                           3                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "==============================================================================\n",
      "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "Intercept   1.193e+04    167.407     71.257      0.000    1.16e+04    1.23e+04\n",
      "x           -500.3794     34.024    -14.707      0.000    -567.067    -433.692\n",
      "y          -5433.2565     43.740   -124.217      0.000   -5518.987   -5347.526\n",
      "x:y          762.9945      4.788    159.364      0.000     753.610     772.378\n",
      "==============================================================================\n",
      "Omnibus:                    20570.159   Durbin-Watson:                   1.266\n",
      "Prob(Omnibus):                  0.000   Jarque-Bera (JB):          1646113.034\n",
      "Skew:                           0.947   Prob(JB):                         0.00\n",
      "Kurtosis:                      30.002   Cond. No.                         993.\n",
      "==============================================================================\n",
      "\n",
      "Notes:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
     ]
    }
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "**6) Generate dummy variables out of the variables 'cut', 'color' and 'clarity'. Make sure that for each of those variables, one level was selected as the reference level (and consequently, that this level is not represented by a dummy variable).**\n",
    "\n",
    "**Why do we need to have k-1 dummy variables, when k is the number of levels ?**"
   ],
   "metadata": {
    "id": "Js9HhxqHsCNs",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "source": [
    "cat_var =['cut','color','clarity']\n",
    "df_dum = pd.get_dummies(df, columns=cat_var, drop_first=False)\n",
    "df_cont = df.drop(cat_var,axis=1)\n",
    "df_cat = pd.concat([df_dum,df_cont],axis=1)\n",
    "df_cat.rename('cut_Very Good','cut_Very_Good',axis=1,inplace=True)"
   ],
   "metadata": {
    "id": "bd9A56vMrfo5",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 29,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       carat  table  price     x     y     z  cut_Fair  cut_Good  cut_Ideal  \\\n",
      "1       0.23   55.0    326  3.95  3.98  2.43         0         0          1   \n",
      "2       0.21   61.0    326  3.89  3.84  2.31         0         0          0   \n",
      "3       0.23   65.0    327  4.05  4.07  2.31         0         1          0   \n",
      "4       0.29   58.0    334  4.20  4.23  2.63         0         0          0   \n",
      "5       0.31   58.0    335  4.34  4.35  2.75         0         1          0   \n",
      "...      ...    ...    ...   ...   ...   ...       ...       ...        ...   \n",
      "53936   0.72   57.0   2757  5.75  5.76  3.50         0         0          1   \n",
      "53937   0.72   55.0   2757  5.69  5.75  3.61         0         1          0   \n",
      "53938   0.70   60.0   2757  5.66  5.68  3.56         0         0          0   \n",
      "53939   0.86   58.0   2757  6.15  6.12  3.74         0         0          0   \n",
      "53940   0.75   55.0   2757  5.83  5.87  3.64         0         0          1   \n",
      "\n",
      "       cut_Premium  ...  clarity_VS1  clarity_VS2  clarity_VVS1  clarity_VVS2  \\\n",
      "1                0  ...            0            0             0             0   \n",
      "2                1  ...            0            0             0             0   \n",
      "3                0  ...            1            0             0             0   \n",
      "4                1  ...            0            1             0             0   \n",
      "5                0  ...            0            0             0             0   \n",
      "...            ...  ...          ...          ...           ...           ...   \n",
      "53936            0  ...            0            0             0             0   \n",
      "53937            0  ...            0            0             0             0   \n",
      "53938            0  ...            0            0             0             0   \n",
      "53939            1  ...            0            0             0             0   \n",
      "53940            0  ...            0            0             0             0   \n",
      "\n",
      "       carat  table  price     x     y     z  \n",
      "1       0.23   55.0    326  3.95  3.98  2.43  \n",
      "2       0.21   61.0    326  3.89  3.84  2.31  \n",
      "3       0.23   65.0    327  4.05  4.07  2.31  \n",
      "4       0.29   58.0    334  4.20  4.23  2.63  \n",
      "5       0.31   58.0    335  4.34  4.35  2.75  \n",
      "...      ...    ...    ...   ...   ...   ...  \n",
      "53936   0.72   57.0   2757  5.75  5.76  3.50  \n",
      "53937   0.72   55.0   2757  5.69  5.75  3.61  \n",
      "53938   0.70   60.0   2757  5.66  5.68  3.56  \n",
      "53939   0.86   58.0   2757  6.15  6.12  3.74  \n",
      "53940   0.75   55.0   2757  5.83  5.87  3.64  \n",
      "\n",
      "[53920 rows x 32 columns]\n"
     ]
    }
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "**7) Refit the model using the dummy variables obtained from the variable 'color', and output its summary.**\n",
    "\n",
    "**1) How do you interpret the coefficients ?** \n",
    "\n",
    "**2) Are all coefficients significant ? if not, what does it mean ?**\n",
    "\n",
    "**3) Does the model seem to be a good fit ?**\n",
    "\n"
   ],
   "metadata": {
    "id": "j16LXvghtAPt",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "source": [
    "y, X = dmatrices('price ~ color', data=df, return_type='dataframe')\n",
    "\n",
    "mod = sm.OLS(y,X)\n",
    "\n",
    "res = mod.fit()\n",
    "\n",
    "print(res.summary())"
   ],
   "metadata": {
    "id": "wC8JCYqMtYgS",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 30,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                  price   R-squared:                       0.031\n",
      "Model:                            OLS   Adj. R-squared:                  0.031\n",
      "Method:                 Least Squares   F-statistic:                     289.8\n",
      "Date:                Tue, 22 Mar 2022   Prob (F-statistic):               0.00\n",
      "Time:                        14:48:44   Log-Likelihood:            -5.2270e+05\n",
      "No. Observations:               53920   AIC:                         1.045e+06\n",
      "Df Residuals:                   53913   BIC:                         1.045e+06\n",
      "Df Model:                           6                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "==============================================================================\n",
      "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "Intercept   3168.1064     47.685     66.438      0.000    3074.643    3261.570\n",
      "color[T.E]   -91.3540     62.017     -1.473      0.141    -212.908      30.201\n",
      "color[T.F]   556.9738     62.361      8.931      0.000     434.746     679.201\n",
      "color[T.G]   828.7701     60.324     13.739      0.000     710.535     947.005\n",
      "color[T.H]  1312.8357     64.266     20.428      0.000    1186.873    1438.798\n",
      "color[T.I]  1921.8676     71.521     26.871      0.000    1781.685    2062.050\n",
      "color[T.J]  2155.7116     88.088     24.472      0.000    1983.059    2328.364\n",
      "==============================================================================\n",
      "Omnibus:                    14688.045   Durbin-Watson:                   0.058\n",
      "Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33017.839\n",
      "Skew:                           1.575   Prob(JB):                         0.00\n",
      "Kurtosis:                       5.185   Cond. No.                         8.55\n",
      "==============================================================================\n",
      "\n",
      "Notes:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
     ]
    }
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "**8) Refit the model using this time all predictor variables (at the exception of price, of course), and output its summary.**\n",
    "\n",
    "**What do you observe ? Does the model seem to be a better fit compared to the previous ones ? Are all coefficients still significant ?**"
   ],
   "metadata": {
    "id": "XjZ2AF9OdOg_",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "source": [
    "y, X = dmatrices('price ~ x+y+z+color+carat+cut+clarity', data=df, return_type='dataframe')\n",
    "\n",
    "mod = sm.OLS(y,X)\n",
    "\n",
    "res = mod.fit()\n",
    "\n",
    "print(res.summary())"
   ],
   "metadata": {
    "id": "fTr9nQ-8r4jI",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 32,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                  price   R-squared:                       0.920\n",
      "Model:                            OLS   Adj. R-squared:                  0.920\n",
      "Method:                 Least Squares   F-statistic:                 2.943e+04\n",
      "Date:                Tue, 22 Mar 2022   Prob (F-statistic):               0.00\n",
      "Time:                        14:53:42   Log-Likelihood:            -4.5553e+05\n",
      "No. Observations:               53920   AIC:                         9.111e+05\n",
      "Df Residuals:                   53898   BIC:                         9.113e+05\n",
      "Df Model:                          21                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "====================================================================================\n",
      "                       coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------------\n",
      "Intercept        -3198.8593     96.089    -33.291      0.000   -3387.194   -3010.525\n",
      "color[T.E]        -208.6919     17.884    -11.669      0.000    -243.744    -173.639\n",
      "color[T.F]        -268.2409     18.087    -14.831      0.000    -303.691    -232.790\n",
      "color[T.G]        -480.9586     17.704    -27.167      0.000    -515.658    -446.259\n",
      "color[T.H]        -985.2696     18.822    -52.347      0.000   -1022.161    -948.378\n",
      "color[T.I]       -1475.8881     21.151    -69.777      0.000   -1517.345   -1434.431\n",
      "color[T.J]       -2381.3789     26.120    -91.169      0.000   -2432.575   -2330.183\n",
      "cut[T.Good]        665.0712     33.020     20.141      0.000     600.352     729.791\n",
      "cut[T.Ideal]      1018.0683     30.261     33.643      0.000     958.757    1077.379\n",
      "cut[T.Premium]     896.6116     30.694     29.212      0.000     836.452     956.771\n",
      "cut[T.Very Good]   856.2029     30.821     27.780      0.000     795.794     916.612\n",
      "clarity[T.IF]     5378.3577     51.002    105.454      0.000    5278.393    5478.322\n",
      "clarity[T.SI1]    3687.2883     43.699     84.379      0.000    3601.638    3772.939\n",
      "clarity[T.SI2]    2730.3778     43.879     62.226      0.000    2644.375    2816.380\n",
      "clarity[T.VS1]    4608.3357     44.584    103.364      0.000    4520.951    4695.720\n",
      "clarity[T.VS2]    4292.1990     43.905     97.760      0.000    4206.144    4378.254\n",
      "clarity[T.VVS1]   5032.2098     47.172    106.678      0.000    4939.753    5124.667\n",
      "clarity[T.VVS2]   4976.1138     45.879    108.461      0.000    4886.190    5066.038\n",
      "x                 -937.2817     31.955    -29.332      0.000    -999.913    -874.651\n",
      "y                   51.0354     19.390      2.632      0.008      13.031      89.040\n",
      "z                 -331.9381     32.729    -10.142      0.000    -396.086    -267.790\n",
      "carat             1.139e+04     50.675    224.793      0.000    1.13e+04    1.15e+04\n",
      "==============================================================================\n",
      "Omnibus:                    14598.936   Durbin-Watson:                   1.196\n",
      "Prob(Omnibus):                  0.000   Jarque-Bera (JB):           598338.864\n",
      "Skew:                           0.580   Prob(JB):                         0.00\n",
      "Kurtosis:                      19.278   Cond. No.                         239.\n",
      "==============================================================================\n",
      "\n",
      "Notes:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
     ]
    }
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "**9) We will now select candidate features to fit our model using a forward selection strategy. To this end, we will define different entering criteria for our candidate features :**\n",
    "* Does the introduction of the feature decreases the MSE ? \n",
    "* Does the introduction of the feature decreases the AIC ? \n",
    "* Does the introduction of the feature decreases the BIC ? \n",
    " \n",
    "**To this end, define two new functions : neg_AIC(y_true, y_pred, n, k) and neg_BIC(y_true, y_pred, n, k) that respectively compute the negative AIC and BIC given the ground truth y values (y_true), the predicted y values (y_pred), the number of samples (n) and the number of predictors (k). The AIC and BIC can be computed as such :**\n",
    "\n",
    "* AIC = 2*k + n*log(mse) \n",
    "* BIC = n*log(mse) + k*log(n)"
   ],
   "metadata": {
    "id": "BJBVO1ojhWDy",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "source": [
    "import math\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import cross_val_score, cross_validate\n",
    "from sklearn.metrics import make_scorer\n",
    "from sklearn.metrics import mean_squared_error, r2_score\n",
    "\n",
    "def neg_AIC(y_true, y_pred, n, k):\n",
    "  mse = mean_squared_error(y_pred,y_true)\n",
    "  return -(2*k + n*math.log(mse))\n",
    "\n",
    "def neg_BIC(y_true, y_pred, n, k):\n",
    "  mse = mean_squared_error(y_pred,y_true)\n",
    "  return -(n*math.log(mse) + k*math.log(n))\n",
    "\n",
    "def forward_selection(df, model, target_column, columns, scoring_rule):\n",
    "  features_to_keep  = []\n",
    "  features_to_try = []  \n",
    "  best_score = -np.inf\n",
    "  cond = True\n",
    "  y = df[target_column].values\n",
    "  while len(columns) != 0 and cond is True:\n",
    "    cond = False\n",
    "    best_feat = None \n",
    "    for col in columns:  \n",
    "      features_to_try = features_to_keep + [col]   \n",
    "      X = get_predictors(df, features_to_try)\n",
    "      n, k = X.shape[0], X.shape[1]\n",
    "      if scoring_rule == 'aic':\n",
    "        scorer = make_scorer(neg_AIC, n=n, k=k, greater_is_better=True)\n",
    "      elif scoring_rule == 'bic':\n",
    "        scorer = make_scorer(neg_BIC, n=n, k=k, greater_is_better=True)\n",
    "      else:\n",
    "        scorer = scoring_rule\n",
    "      cv_results = cross_validate(model, X, y, scoring=scorer, cv=10)\n",
    "      score = cv_results['test_score'].mean()\n",
    "      if score > best_score: \n",
    "        best_feat = col\n",
    "        cond = True\n",
    "        best_score = score\n",
    "    if best_feat != None:\n",
    "      columns.remove(best_feat)\n",
    "      features_to_keep.append(best_feat)\n",
    "  return features_to_keep, best_score \n",
    "\n",
    "def get_predictors(df, cols):\n",
    "  cat_pred = []\n",
    "  cont_pred = []\n",
    "  for col in cols:\n",
    "    if isinstance(df[col].values[0], str):\n",
    "      cat_pred.append(col)\n",
    "    else:\n",
    "      cont_pred.append(col)\n",
    "    if len(cat_pred) != 0:\n",
    "      df_dummies = pd.get_dummies(df[cat_pred], drop_first=True)\n",
    "    else:\n",
    "      df_dummies  = pd.DataFrame() \n",
    "  df_cont = df[cont_pred] \n",
    "  df_cat = pd.concat([df_dummies, df_cont], axis=1)\n",
    "  \n",
    "  return df_cat.values  \n"
   ],
   "metadata": {
    "id": "EyQkfpnZud5e",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 33,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "source": [
    "**10) Use the function forward_selection() and the functions neg_AIC() and neg_BIC() to perform a forward selection on the dataframe features to see which subset of features is best to fit the target variable 'price'. Also, do a forward selection with an entering criterion defined as the MSE.**\n",
    "\n",
    "**For each selection, report the best subset of features obtained, as well as the score obtained. What do you observe ?**  "
   ],
   "metadata": {
    "id": "AS0VH-6mjFXV",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "source": [
    "forward_selection(df, LinearRegression(), ['price'], df_cat.columns.copy(), 'bic')\n"
   ],
   "metadata": {
    "id": "GNKAsJxVu8jO",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 44,
   "outputs": [
    {
     "ename": "KeyError",
     "evalue": "'cut_Fair'",
     "output_type": "error",
     "traceback": [
      "\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
      "\u001B[0;31mKeyError\u001B[0m                                  Traceback (most recent call last)",
      "File \u001B[0;32m~/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py:3621\u001B[0m, in \u001B[0;36mIndex.get_loc\u001B[0;34m(self, key, method, tolerance)\u001B[0m\n\u001B[1;32m   3620\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m-> 3621\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_engine\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mget_loc\u001B[49m\u001B[43m(\u001B[49m\u001B[43mcasted_key\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m   3622\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mKeyError\u001B[39;00m \u001B[38;5;28;01mas\u001B[39;00m err:\n",
      "File \u001B[0;32m~/.local/lib/python3.10/site-packages/pandas/_libs/index.pyx:136\u001B[0m, in \u001B[0;36mpandas._libs.index.IndexEngine.get_loc\u001B[0;34m()\u001B[0m\n",
      "File \u001B[0;32m~/.local/lib/python3.10/site-packages/pandas/_libs/index.pyx:163\u001B[0m, in \u001B[0;36mpandas._libs.index.IndexEngine.get_loc\u001B[0;34m()\u001B[0m\n",
      "File \u001B[0;32mpandas/_libs/hashtable_class_helper.pxi:5198\u001B[0m, in \u001B[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001B[0;34m()\u001B[0m\n",
      "File \u001B[0;32mpandas/_libs/hashtable_class_helper.pxi:5206\u001B[0m, in \u001B[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001B[0;34m()\u001B[0m\n",
      "\u001B[0;31mKeyError\u001B[0m: 'cut_Fair'",
      "\nThe above exception was the direct cause of the following exception:\n",
      "\u001B[0;31mKeyError\u001B[0m                                  Traceback (most recent call last)",
      "Input \u001B[0;32mIn [44]\u001B[0m, in \u001B[0;36m<module>\u001B[0;34m\u001B[0m\n\u001B[1;32m      1\u001B[0m model \u001B[38;5;241m=\u001B[39m LinearRegression(fit_intercept\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mTrue\u001B[39;00m)\n\u001B[0;32m----> 2\u001B[0m \u001B[43mforward_selection\u001B[49m\u001B[43m(\u001B[49m\u001B[43mdf\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mmodel\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43m[\u001B[49m\u001B[38;5;124;43m'\u001B[39;49m\u001B[38;5;124;43mprice\u001B[39;49m\u001B[38;5;124;43m'\u001B[39;49m\u001B[43m]\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mdf_cat\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcolumns\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcopy\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;124;43m'\u001B[39;49m\u001B[38;5;124;43mneg_mean_squared_error\u001B[39;49m\u001B[38;5;124;43m'\u001B[39;49m\u001B[43m)\u001B[49m\n",
      "Input \u001B[0;32mIn [33]\u001B[0m, in \u001B[0;36mforward_selection\u001B[0;34m(df, model, target_column, columns, scoring_rule)\u001B[0m\n\u001B[1;32m     22\u001B[0m \u001B[38;5;28;01mfor\u001B[39;00m col \u001B[38;5;129;01min\u001B[39;00m columns:  \n\u001B[1;32m     23\u001B[0m   features_to_try \u001B[38;5;241m=\u001B[39m features_to_keep \u001B[38;5;241m+\u001B[39m [col]   \n\u001B[0;32m---> 24\u001B[0m   X \u001B[38;5;241m=\u001B[39m \u001B[43mget_predictors\u001B[49m\u001B[43m(\u001B[49m\u001B[43mdf\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mfeatures_to_try\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m     25\u001B[0m   n, k \u001B[38;5;241m=\u001B[39m X\u001B[38;5;241m.\u001B[39mshape[\u001B[38;5;241m0\u001B[39m], X\u001B[38;5;241m.\u001B[39mshape[\u001B[38;5;241m1\u001B[39m]\n\u001B[1;32m     26\u001B[0m   \u001B[38;5;28;01mif\u001B[39;00m scoring_rule \u001B[38;5;241m==\u001B[39m \u001B[38;5;124m'\u001B[39m\u001B[38;5;124maic\u001B[39m\u001B[38;5;124m'\u001B[39m:\n",
      "Input \u001B[0;32mIn [33]\u001B[0m, in \u001B[0;36mget_predictors\u001B[0;34m(df, cols)\u001B[0m\n\u001B[1;32m     45\u001B[0m cont_pred \u001B[38;5;241m=\u001B[39m []\n\u001B[1;32m     46\u001B[0m \u001B[38;5;28;01mfor\u001B[39;00m col \u001B[38;5;129;01min\u001B[39;00m cols:\n\u001B[0;32m---> 47\u001B[0m   \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(\u001B[43mdf\u001B[49m\u001B[43m[\u001B[49m\u001B[43mcol\u001B[49m\u001B[43m]\u001B[49m\u001B[38;5;241m.\u001B[39mvalues[\u001B[38;5;241m0\u001B[39m], \u001B[38;5;28mstr\u001B[39m):\n\u001B[1;32m     48\u001B[0m     cat_pred\u001B[38;5;241m.\u001B[39mappend(col)\n\u001B[1;32m     49\u001B[0m   \u001B[38;5;28;01melse\u001B[39;00m:\n",
      "File \u001B[0;32m~/.local/lib/python3.10/site-packages/pandas/core/frame.py:3505\u001B[0m, in \u001B[0;36mDataFrame.__getitem__\u001B[0;34m(self, key)\u001B[0m\n\u001B[1;32m   3503\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mcolumns\u001B[38;5;241m.\u001B[39mnlevels \u001B[38;5;241m>\u001B[39m \u001B[38;5;241m1\u001B[39m:\n\u001B[1;32m   3504\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_getitem_multilevel(key)\n\u001B[0;32m-> 3505\u001B[0m indexer \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcolumns\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mget_loc\u001B[49m\u001B[43m(\u001B[49m\u001B[43mkey\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m   3506\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m is_integer(indexer):\n\u001B[1;32m   3507\u001B[0m     indexer \u001B[38;5;241m=\u001B[39m [indexer]\n",
      "File \u001B[0;32m~/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py:3623\u001B[0m, in \u001B[0;36mIndex.get_loc\u001B[0;34m(self, key, method, tolerance)\u001B[0m\n\u001B[1;32m   3621\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_engine\u001B[38;5;241m.\u001B[39mget_loc(casted_key)\n\u001B[1;32m   3622\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mKeyError\u001B[39;00m \u001B[38;5;28;01mas\u001B[39;00m err:\n\u001B[0;32m-> 3623\u001B[0m     \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mKeyError\u001B[39;00m(key) \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01merr\u001B[39;00m\n\u001B[1;32m   3624\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mTypeError\u001B[39;00m:\n\u001B[1;32m   3625\u001B[0m     \u001B[38;5;66;03m# If we have a listlike key, _check_indexing_error will raise\u001B[39;00m\n\u001B[1;32m   3626\u001B[0m     \u001B[38;5;66;03m#  InvalidIndexError. Otherwise we fall through and re-raise\u001B[39;00m\n\u001B[1;32m   3627\u001B[0m     \u001B[38;5;66;03m#  the TypeError.\u001B[39;00m\n\u001B[1;32m   3628\u001B[0m     \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_check_indexing_error(key)\n",
      "\u001B[0;31mKeyError\u001B[0m: 'cut_Fair'"
     ]
    }
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "**11) Looking at the scatter plot of the variable 'price' against the variable 'x', a linear model might not be the best fit to explain the relation between the two variables.  Using a transformation of the variable 'x', try to obtain a better fit. Plot the linear regression line and the one obtained using the transformation of 'x'.**"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "source": [],
   "metadata": {
    "id": "xMrxh-fuBh7Y",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": 2,
   "outputs": []
  }
 ]
}