895 lines
74 KiB
Plaintext
{
|
|
"nbformat": 4,
|
|
"nbformat_minor": 0,
|
|
"metadata": {
|
|
"colab": {
|
|
"name": "Lab7_exercises",
|
|
"provenance": []
|
|
},
|
|
"kernelspec": {
|
|
"name": "python3",
|
|
"display_name": "Python 3"
|
|
},
|
|
"language_info": {
|
|
"name": "python"
|
|
}
|
|
},
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
"**In this lab on linear regression, we'll work with the [statsmodels](https://www.statsmodels.org/stable/index.html) library, which provides numerous classes and functions for estimating statistical models.**\n",
|
|
"\n",
|
"**The dataset we'll consider is 'diamonds.csv' [[1]](https://www.kaggle.com/datasets/shivam2503/diamonds), which contains information about a variety of diamonds, such as their dimensions, the quality of their cut, and their price. The goal of the lab is to define linear regression models that best estimate diamond prices from a set of predictor variables, and to understand the meaning of the resulting coefficients.**\n",
|
|
"\n",
|
"**Dataset column information:**\n",
|
|
"\n",
|
|
"* 'price' : price in US dollars.\n",
|
|
"* 'carat' : weight of the diamond. \n",
|
|
"* 'cut' : quality of the cut (Fair, Good, Very Good, Premium, Ideal)\n",
|
"* 'color' : diamond color, from J (worst) to D (best).\n",
|
|
"* 'clarity' : how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))\n",
|
|
"* 'x' : length in mm.\n",
|
|
"* 'y' : width in mm. \n",
|
|
"* 'z' : height in mm.\n",
|
|
"* 'table' : width of top of the diamond relative to its widest point. \n",
|
|
"\n"
|
|
],
|
|
"metadata": {
|
|
"id": "5FnAdl_9MJ7-",
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"**Import necessary libraries**"
|
|
],
|
|
"metadata": {
|
|
"id": "u6jRRhX7PZQD",
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"metadata": {
|
|
"id": "ujsqLRDOa1-s",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import statsmodels.api as sm\n",
|
|
"import numpy as np\n",
|
|
"import pandas as pd \n",
|
|
"from patsy import dmatrices\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"%matplotlib notebook"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
"**1) Load the dataset and take a look at its properties (shape, data types, etc.). Be careful to set the dataframe index correctly. Check for missing values, and replace them appropriately if any are present.**\n",
|
|
"\n",
|
|
"**Drop the column 'depth'**"
|
|
],
|
|
"metadata": {
|
|
"id": "Nd9BGtrOPfq3",
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [
|
|
"file = '../data/diamonds.csv'\n",
|
|
"\n",
|
|
"##Read dataframe##\n",
|
|
"\n",
|
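"# The first CSV column holds the row labels, so use it as the index\n",
"df = pd.read_csv(file, index_col=0)\n",
"# Store the three qualitative variables as categoricals\n",
"df = df.astype({'cut': 'category', 'color': 'category', 'clarity': 'category'})\n",
"# Drop the 'depth' column, as the exercise asks\n",
"df = df.drop(columns=['depth'])\n",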
|
|
"\n",
|
|
"print(df.head())\n",
|
|
"print(df.info())\n",
|
|
"print(df.isna().sum())"
|
|
],
|
|
"metadata": {
|
|
"id": "z0QYojPwf0-s",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 16,
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" carat cut color clarity table price x y z\n",
|
|
"1 0.23 Ideal E SI2 55.0 326 3.95 3.98 2.43\n",
|
|
"2 0.21 Premium E SI1 61.0 326 3.89 3.84 2.31\n",
|
|
"3 0.23 Good E VS1 65.0 327 4.05 4.07 2.31\n",
|
|
"4 0.29 Premium I VS2 58.0 334 4.20 4.23 2.63\n",
|
|
"5 0.31 Good J SI2 58.0 335 4.34 4.35 2.75\n",
|
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|
"Int64Index: 53940 entries, 1 to 53940\n",
|
|
"Data columns (total 9 columns):\n",
|
|
" # Column Non-Null Count Dtype \n",
|
|
"--- ------ -------------- ----- \n",
|
|
" 0 carat 53940 non-null float64 \n",
|
|
" 1 cut 53940 non-null category\n",
|
|
" 2 color 53940 non-null category\n",
|
|
" 3 clarity 53940 non-null category\n",
|
|
" 4 table 53940 non-null float64 \n",
|
|
" 5 price 53940 non-null int64 \n",
|
|
" 6 x 53940 non-null float64 \n",
|
|
" 7 y 53940 non-null float64 \n",
|
|
" 8 z 53940 non-null float64 \n",
|
|
"dtypes: category(3), float64(5), int64(1)\n",
|
|
"memory usage: 3.0 MB\n",
|
|
"None\n",
|
|
"carat 0\n",
|
|
"cut 0\n",
|
|
"color 0\n",
|
|
"clarity 0\n",
|
|
"table 0\n",
|
|
"price 0\n",
|
|
"x 0\n",
|
|
"y 0\n",
|
|
"z 0\n",
|
|
"dtype: int64\n"
|
|
]
|
|
}
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
"**2) Generate scatter plots of the variable 'price' against the variables 'x', 'y' and 'z'. Do you notice anything strange? How would you handle such cases?**"
|
|
],
|
|
"metadata": {
|
|
"id": "an3oV-dUZtP-",
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [
|
|
"x1 = df['x'].values\n",
|
|
"x2 = df['y'].values\n",
|
|
"x3 = df['z'].values\n",
|
|
"\n",
|
|
"y = df['price'].values\n",
|
|
"\n",
|
|
"fig = plt.figure()\n",
|
|
"ax = plt.axes(projection ='3d')\n",
|
"ax.set_xlabel('x')\n",
"ax.set_ylabel('y')\n",
"ax.set_zlabel('z')\n",
"# Colour each point by its price and add a colour bar as a legend\n",
"img = ax.scatter(x1, x2, x3, c=y, cmap='YlOrRd', alpha=1)\n",
"fig.colorbar(img, ax=ax, label='price')\n",
|
|
"\n",
|
|
"plt.show()\n"
|
|
],
|
|
"metadata": {
|
|
"id": "AE9Eo-7_Z_0G",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 17,
|
|
"outputs": [
|
|
{
|
|
"data": {
|
"text/plain": "<IPython.core.display.Javascript object>"
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": "<IPython.core.display.HTML object>",
|
|
"text/html": "<div id='06580a48-44f0-4a6f-914a-1c20b67bfd99'></div>"
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [
|
|
"df = df[df['x'] != 0]\n",
|
|
"df = df[df['y'] != 0]\n",
|
|
"df = df[df['z'] != 0]"
|
|
],
|
|
"metadata": {
|
|
"id": "Okib0LkJaDtw",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 18,
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"8\n",
|
|
"0\n"
|
|
]
|
|
}
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [],
|
|
"metadata": {
|
|
"id": "F9bG8nfEaGFv",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 1,
|
|
"outputs": []
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [],
|
|
"metadata": {
|
|
"id": "-XWKqWmJn-CV",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 1,
|
|
"outputs": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"**3)Select 'price' a the target variable and 'x' as the predictor. Fit a linear regression model to the data, and output the model's summary.**\n",
|
|
"\n",
|
|
"**3.1) Is there evidence of a linear relationship between the target and the predictor variables ? What can you say regrading the statistical significance of the estimated coefficients ?**\n",
|
|
"\n",
|
|
"**3.2) How do you interpret the value of the coefficients ?**\n",
|
|
"\n",
|
|
"**3.3) What are the estimates' 95% confidence intervals, and how do you interpret them ?**"
|
|
],
|
|
"metadata": {
|
|
"id": "_s5Umk7Yh7YC",
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [
|
|
"y, X = dmatrices('price ~ x', data=df, return_type='dataframe')\n",
|
|
"\n",
|
|
"mod = sm.OLS(y,X)\n",
|
|
"\n",
|
|
"res = mod.fit()\n",
|
|
"\n",
|
|
"print(res.summary())"
|
|
],
|
|
"metadata": {
|
|
"id": "R38E-trKm5SJ",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 23,
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" OLS Regression Results \n",
|
|
"==============================================================================\n",
|
|
"Dep. Variable: price R-squared: 0.787\n",
|
|
"Model: OLS Adj. R-squared: 0.787\n",
|
|
"Method: Least Squares F-statistic: 1.994e+05\n",
|
|
"Date: Tue, 22 Mar 2022 Prob (F-statistic): 0.00\n",
|
|
"Time: 14:09:44 Log-Likelihood: -4.8184e+05\n",
|
|
"No. Observations: 53920 AIC: 9.637e+05\n",
|
|
"Df Residuals: 53918 BIC: 9.637e+05\n",
|
|
"Df Model: 1 \n",
|
|
"Covariance Type: nonrobust \n",
|
|
"==============================================================================\n",
|
|
" coef std err t P>|t| [0.025 0.975]\n",
|
|
"------------------------------------------------------------------------------\n",
|
|
"Intercept -1.418e+04 41.327 -343.177 0.000 -1.43e+04 -1.41e+04\n",
|
|
"x 3160.2360 7.077 446.578 0.000 3146.366 3174.106\n",
|
|
"==============================================================================\n",
|
|
"Omnibus: 12156.158 Durbin-Watson: 0.416\n",
|
|
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 34567.308\n",
|
|
"Skew: 1.190 Prob(JB): 0.00\n",
|
|
"Kurtosis: 6.119 Cond. No. 31.3\n",
|
|
"==============================================================================\n",
|
|
"\n",
|
|
"Notes:\n",
|
|
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
|
|
]
|
|
}
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"A unit increase of 'x' results in an increase of the averaged price by 3160.23$. "
|
|
],
|
|
"metadata": {
|
|
"id": "djjiFHMmheXw",
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"**4) Add 'y' as another predictor variable, fit the model and output its summary.**\n",
|
|
"\n",
|
|
"**4.1) Is there still evidence of a linear relationship between the target and predictor variables ?**\n",
|
|
"\n",
|
|
"**4.2) How do you interpret the coefficients ?**\n",
|
|
"\n"
|
|
],
|
|
"metadata": {
|
|
"id": "839JBIPRjiAm",
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [
|
|
"y, X = dmatrices('price ~ x+y', data=df, return_type='dataframe')\n",
|
|
"\n",
|
|
"mod = sm.OLS(y,X)\n",
|
|
"\n",
|
|
"res = mod.fit()\n",
|
|
"\n",
|
|
"print(res.summary())"
|
|
],
|
|
"metadata": {
|
|
"id": "8MLESTI2nGCX",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 27,
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" OLS Regression Results \n",
|
|
"==============================================================================\n",
|
|
"Dep. Variable: price R-squared: 0.787\n",
|
|
"Model: OLS Adj. R-squared: 0.787\n",
|
|
"Method: Least Squares F-statistic: 9.981e+04\n",
|
|
"Date: Tue, 22 Mar 2022 Prob (F-statistic): 0.00\n",
|
|
"Time: 14:32:08 Log-Likelihood: -4.8182e+05\n",
|
|
"No. Observations: 53920 AIC: 9.636e+05\n",
|
|
"Df Residuals: 53917 BIC: 9.637e+05\n",
|
|
"Df Model: 2 \n",
|
|
"Covariance Type: nonrobust \n",
|
|
"==============================================================================\n",
|
|
" coef std err t P>|t| [0.025 0.975]\n",
|
|
"------------------------------------------------------------------------------\n",
|
|
"Intercept -1.419e+04 41.333 -343.338 0.000 -1.43e+04 -1.41e+04\n",
|
|
"x 2957.9034 31.783 93.064 0.000 2895.608 3020.199\n",
|
|
"y 203.7694 31.206 6.530 0.000 142.605 264.934\n",
|
|
"==============================================================================\n",
|
|
"Omnibus: 12156.483 Durbin-Watson: 0.415\n",
|
|
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 34658.814\n",
|
|
"Skew: 1.189 Prob(JB): 0.00\n",
|
|
"Kurtosis: 6.126 Cond. No. 47.3\n",
|
|
"==============================================================================\n",
|
|
"\n",
|
|
"Notes:\n",
|
|
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
|
|
]
|
|
}
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"**5) Add an interaction term between 'x' and 'y', refit the model and output its summary.**\n",
|
|
"\n",
|
|
"**5.1) Is there still evidence of a linear relationship between the target and predictor variables ?**\n",
|
|
"\n",
|
|
"**5.2) Does the model seems to be a better fit compared to the one with only 'x' and 'y' ?** \n",
|
|
"\n",
|
|
"**5.3) How do you interpret the coefficients ?**"
|
|
],
|
|
"metadata": {
|
|
"id": "L3KEoQ0dmJci",
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [
|
|
"y, X = dmatrices('price ~ x*y', data=df, return_type='dataframe')\n",
|
|
"\n",
|
|
"mod = sm.OLS(y,X)\n",
|
|
"\n",
|
|
"res = mod.fit()\n",
|
|
"\n",
|
|
"print(res.summary())"
|
|
],
|
|
"metadata": {
|
|
"id": "f_RuymGVncgj",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 28,
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" OLS Regression Results \n",
|
|
"==============================================================================\n",
|
|
"Dep. Variable: price R-squared: 0.855\n",
|
|
"Model: OLS Adj. R-squared: 0.855\n",
|
|
"Method: Least Squares F-statistic: 1.064e+05\n",
|
|
"Date: Tue, 22 Mar 2022 Prob (F-statistic): 0.00\n",
|
|
"Time: 14:36:15 Log-Likelihood: -4.7141e+05\n",
|
|
"No. Observations: 53920 AIC: 9.428e+05\n",
|
|
"Df Residuals: 53916 BIC: 9.429e+05\n",
|
|
"Df Model: 3 \n",
|
|
"Covariance Type: nonrobust \n",
|
|
"==============================================================================\n",
|
|
" coef std err t P>|t| [0.025 0.975]\n",
|
|
"------------------------------------------------------------------------------\n",
|
|
"Intercept 1.193e+04 167.407 71.257 0.000 1.16e+04 1.23e+04\n",
|
|
"x -500.3794 34.024 -14.707 0.000 -567.067 -433.692\n",
|
|
"y -5433.2565 43.740 -124.217 0.000 -5518.987 -5347.526\n",
|
|
"x:y 762.9945 4.788 159.364 0.000 753.610 772.378\n",
|
|
"==============================================================================\n",
|
|
"Omnibus: 20570.159 Durbin-Watson: 1.266\n",
|
|
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 1646113.034\n",
|
|
"Skew: 0.947 Prob(JB): 0.00\n",
|
|
"Kurtosis: 30.002 Cond. No. 993.\n",
|
|
"==============================================================================\n",
|
|
"\n",
|
|
"Notes:\n",
|
|
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
|
|
]
|
|
}
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"**6) Generate dummy variables out of the variables 'cut', 'color' and 'clarity'. Make sure that for each of those variables, one level was selected as the reference level (and consequently, that this level is not represented by a dummy variable).**\n",
|
|
"\n",
|
|
"**Why do we need to have k-1 dummy variables, when k is the number of levels ?**"
|
|
],
|
|
"metadata": {
|
|
"id": "Js9HhxqHsCNs",
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [
|
|
"cat_var =['cut','color','clarity']\n",
|
|
"df_dum = pd.get_dummies(df, columns=cat_var, drop_first=False)\n",
|
|
"df_cont = df.drop(cat_var,axis=1)\n",
|
|
"df_cat = pd.concat([df_dum,df_cont],axis=1)\n",
|
|
"df_cat.rename('cut_Very Good','cut_Very_Good',axis=1,inplace=True)"
|
|
],
|
|
"metadata": {
|
|
"id": "bd9A56vMrfo5",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 29,
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" carat table price x y z cut_Fair cut_Good cut_Ideal \\\n",
|
|
"1 0.23 55.0 326 3.95 3.98 2.43 0 0 1 \n",
|
|
"2 0.21 61.0 326 3.89 3.84 2.31 0 0 0 \n",
|
|
"3 0.23 65.0 327 4.05 4.07 2.31 0 1 0 \n",
|
|
"4 0.29 58.0 334 4.20 4.23 2.63 0 0 0 \n",
|
|
"5 0.31 58.0 335 4.34 4.35 2.75 0 1 0 \n",
|
|
"... ... ... ... ... ... ... ... ... ... \n",
|
|
"53936 0.72 57.0 2757 5.75 5.76 3.50 0 0 1 \n",
|
|
"53937 0.72 55.0 2757 5.69 5.75 3.61 0 1 0 \n",
|
|
"53938 0.70 60.0 2757 5.66 5.68 3.56 0 0 0 \n",
|
|
"53939 0.86 58.0 2757 6.15 6.12 3.74 0 0 0 \n",
|
|
"53940 0.75 55.0 2757 5.83 5.87 3.64 0 0 1 \n",
|
|
"\n",
|
|
" cut_Premium ... clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2 \\\n",
|
|
"1 0 ... 0 0 0 0 \n",
|
|
"2 1 ... 0 0 0 0 \n",
|
|
"3 0 ... 1 0 0 0 \n",
|
|
"4 1 ... 0 1 0 0 \n",
|
|
"5 0 ... 0 0 0 0 \n",
|
|
"... ... ... ... ... ... ... \n",
|
|
"53936 0 ... 0 0 0 0 \n",
|
|
"53937 0 ... 0 0 0 0 \n",
|
|
"53938 0 ... 0 0 0 0 \n",
|
|
"53939 1 ... 0 0 0 0 \n",
|
|
"53940 0 ... 0 0 0 0 \n",
|
|
"\n",
|
|
" carat table price x y z \n",
|
|
"1 0.23 55.0 326 3.95 3.98 2.43 \n",
|
|
"2 0.21 61.0 326 3.89 3.84 2.31 \n",
|
|
"3 0.23 65.0 327 4.05 4.07 2.31 \n",
|
|
"4 0.29 58.0 334 4.20 4.23 2.63 \n",
|
|
"5 0.31 58.0 335 4.34 4.35 2.75 \n",
|
|
"... ... ... ... ... ... ... \n",
|
|
"53936 0.72 57.0 2757 5.75 5.76 3.50 \n",
|
|
"53937 0.72 55.0 2757 5.69 5.75 3.61 \n",
|
|
"53938 0.70 60.0 2757 5.66 5.68 3.56 \n",
|
|
"53939 0.86 58.0 2757 6.15 6.12 3.74 \n",
|
|
"53940 0.75 55.0 2757 5.83 5.87 3.64 \n",
|
|
"\n",
|
|
"[53920 rows x 32 columns]\n"
|
|
]
|
|
}
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"**7) Refit the model using the dummy variables obtained from the variable 'color', and output its summary.**\n",
|
|
"\n",
|
|
"**1) How do you interpret the coefficients ?** \n",
|
|
"\n",
|
|
"**2) Are all coefficients significant ? if not, what does it mean ?**\n",
|
|
"\n",
|
|
"**3) Does the model seem to be a good fit ?**\n",
|
|
"\n"
|
|
],
|
|
"metadata": {
|
|
"id": "j16LXvghtAPt",
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [
|
|
"y, X = dmatrices('price ~ color', data=df, return_type='dataframe')\n",
|
|
"\n",
|
|
"mod = sm.OLS(y,X)\n",
|
|
"\n",
|
|
"res = mod.fit()\n",
|
|
"\n",
|
|
"print(res.summary())"
|
|
],
|
|
"metadata": {
|
|
"id": "wC8JCYqMtYgS",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 30,
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" OLS Regression Results \n",
|
|
"==============================================================================\n",
|
|
"Dep. Variable: price R-squared: 0.031\n",
|
|
"Model: OLS Adj. R-squared: 0.031\n",
|
|
"Method: Least Squares F-statistic: 289.8\n",
|
|
"Date: Tue, 22 Mar 2022 Prob (F-statistic): 0.00\n",
|
|
"Time: 14:48:44 Log-Likelihood: -5.2270e+05\n",
|
|
"No. Observations: 53920 AIC: 1.045e+06\n",
|
|
"Df Residuals: 53913 BIC: 1.045e+06\n",
|
|
"Df Model: 6 \n",
|
|
"Covariance Type: nonrobust \n",
|
|
"==============================================================================\n",
|
|
" coef std err t P>|t| [0.025 0.975]\n",
|
|
"------------------------------------------------------------------------------\n",
|
|
"Intercept 3168.1064 47.685 66.438 0.000 3074.643 3261.570\n",
|
|
"color[T.E] -91.3540 62.017 -1.473 0.141 -212.908 30.201\n",
|
|
"color[T.F] 556.9738 62.361 8.931 0.000 434.746 679.201\n",
|
|
"color[T.G] 828.7701 60.324 13.739 0.000 710.535 947.005\n",
|
|
"color[T.H] 1312.8357 64.266 20.428 0.000 1186.873 1438.798\n",
|
|
"color[T.I] 1921.8676 71.521 26.871 0.000 1781.685 2062.050\n",
|
|
"color[T.J] 2155.7116 88.088 24.472 0.000 1983.059 2328.364\n",
|
|
"==============================================================================\n",
|
|
"Omnibus: 14688.045 Durbin-Watson: 0.058\n",
|
|
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 33017.839\n",
|
|
"Skew: 1.575 Prob(JB): 0.00\n",
|
|
"Kurtosis: 5.185 Cond. No. 8.55\n",
|
|
"==============================================================================\n",
|
|
"\n",
|
|
"Notes:\n",
|
|
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
|
|
]
|
|
}
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"**8) Refit the model using this time all predictor variables (at the exception of price, of course), and output its summary.**\n",
|
|
"\n",
|
|
"**What do you observe ? Does the model seem to be a better fit compared to the previous ones ? Are all coefficients still significant ?**"
|
|
],
|
|
"metadata": {
|
|
"id": "XjZ2AF9OdOg_",
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [
|
|
"y, X = dmatrices('price ~ x+y+z+color+carat+cut+clarity', data=df, return_type='dataframe')\n",
|
|
"\n",
|
|
"mod = sm.OLS(y,X)\n",
|
|
"\n",
|
|
"res = mod.fit()\n",
|
|
"\n",
|
|
"print(res.summary())"
|
|
],
|
|
"metadata": {
|
|
"id": "fTr9nQ-8r4jI",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 32,
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" OLS Regression Results \n",
|
|
"==============================================================================\n",
|
|
"Dep. Variable: price R-squared: 0.920\n",
|
|
"Model: OLS Adj. R-squared: 0.920\n",
|
|
"Method: Least Squares F-statistic: 2.943e+04\n",
|
|
"Date: Tue, 22 Mar 2022 Prob (F-statistic): 0.00\n",
|
|
"Time: 14:53:42 Log-Likelihood: -4.5553e+05\n",
|
|
"No. Observations: 53920 AIC: 9.111e+05\n",
|
|
"Df Residuals: 53898 BIC: 9.113e+05\n",
|
|
"Df Model: 21 \n",
|
|
"Covariance Type: nonrobust \n",
|
|
"====================================================================================\n",
|
|
" coef std err t P>|t| [0.025 0.975]\n",
|
|
"------------------------------------------------------------------------------------\n",
|
|
"Intercept -3198.8593 96.089 -33.291 0.000 -3387.194 -3010.525\n",
|
|
"color[T.E] -208.6919 17.884 -11.669 0.000 -243.744 -173.639\n",
|
|
"color[T.F] -268.2409 18.087 -14.831 0.000 -303.691 -232.790\n",
|
|
"color[T.G] -480.9586 17.704 -27.167 0.000 -515.658 -446.259\n",
|
|
"color[T.H] -985.2696 18.822 -52.347 0.000 -1022.161 -948.378\n",
|
|
"color[T.I] -1475.8881 21.151 -69.777 0.000 -1517.345 -1434.431\n",
|
|
"color[T.J] -2381.3789 26.120 -91.169 0.000 -2432.575 -2330.183\n",
|
|
"cut[T.Good] 665.0712 33.020 20.141 0.000 600.352 729.791\n",
|
|
"cut[T.Ideal] 1018.0683 30.261 33.643 0.000 958.757 1077.379\n",
|
|
"cut[T.Premium] 896.6116 30.694 29.212 0.000 836.452 956.771\n",
|
|
"cut[T.Very Good] 856.2029 30.821 27.780 0.000 795.794 916.612\n",
|
|
"clarity[T.IF] 5378.3577 51.002 105.454 0.000 5278.393 5478.322\n",
|
|
"clarity[T.SI1] 3687.2883 43.699 84.379 0.000 3601.638 3772.939\n",
|
|
"clarity[T.SI2] 2730.3778 43.879 62.226 0.000 2644.375 2816.380\n",
|
|
"clarity[T.VS1] 4608.3357 44.584 103.364 0.000 4520.951 4695.720\n",
|
|
"clarity[T.VS2] 4292.1990 43.905 97.760 0.000 4206.144 4378.254\n",
|
|
"clarity[T.VVS1] 5032.2098 47.172 106.678 0.000 4939.753 5124.667\n",
|
|
"clarity[T.VVS2] 4976.1138 45.879 108.461 0.000 4886.190 5066.038\n",
|
|
"x -937.2817 31.955 -29.332 0.000 -999.913 -874.651\n",
|
|
"y 51.0354 19.390 2.632 0.008 13.031 89.040\n",
|
|
"z -331.9381 32.729 -10.142 0.000 -396.086 -267.790\n",
|
|
"carat 1.139e+04 50.675 224.793 0.000 1.13e+04 1.15e+04\n",
|
|
"==============================================================================\n",
|
|
"Omnibus: 14598.936 Durbin-Watson: 1.196\n",
|
|
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 598338.864\n",
|
|
"Skew: 0.580 Prob(JB): 0.00\n",
|
|
"Kurtosis: 19.278 Cond. No. 239.\n",
|
|
"==============================================================================\n",
|
|
"\n",
|
|
"Notes:\n",
|
|
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
|
|
]
|
|
}
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"**9) We will now select candidate features to fit our model using a forward selection strategy. To this end, we will define different entering criteria for our candidate features :**\n",
|
|
"* Does the introduction of the feature decreases the MSE ? \n",
|
|
"* Does the introduction of the feature decreases the AIC ? \n",
|
|
"* Does the introduction of the feature decreases the BIC ? \n",
|
|
" \n",
|
|
"**To this end, define two new functions : neg_AIC(y_true, y_pred, n, k) and neg_BIC(y_true, y_pred, n, k) that respectively compute the negative AIC and BIC given the ground truth y values (y_true), the predicted y values (y_pred), the number of samples (n) and the number of predictors (k). The AIC and BIC can be computed as such :**\n",
|
|
"\n",
|
|
"* AIC = 2*k + n*log(mse) \n",
|
|
"* BIC = n*log(mse) + k*log(n)"
|
|
],
|
|
"metadata": {
|
|
"id": "BJBVO1ojhWDy",
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [
|
|
"import math\n",
|
|
"from sklearn.linear_model import LinearRegression\n",
|
|
"from sklearn.model_selection import cross_val_score, cross_validate\n",
|
|
"from sklearn.metrics import make_scorer\n",
|
|
"from sklearn.metrics import mean_squared_error, r2_score\n",
|
|
"\n",
|
|
"def neg_AIC(y_true, y_pred, n, k):\n",
|
|
" mse = mean_squared_error(y_pred,y_true)\n",
|
|
" return -(2*k + n*math.log(mse))\n",
|
|
"\n",
|
|
"def neg_BIC(y_true, y_pred, n, k):\n",
|
|
" mse = mean_squared_error(y_pred,y_true)\n",
|
|
" return -(n*math.log(mse) + k*math.log(n))\n",
|
|
"\n",
|
|
"def forward_selection(df, model, target_column, columns, scoring_rule):\n",
|
|
" features_to_keep = []\n",
|
|
" features_to_try = [] \n",
|
|
" best_score = -np.inf\n",
|
|
" cond = True\n",
|
|
" y = df[target_column].values\n",
|
|
" while len(columns) != 0 and cond is True:\n",
|
|
" cond = False\n",
|
|
" best_feat = None \n",
|
|
" for col in columns: \n",
|
|
" features_to_try = features_to_keep + [col] \n",
|
|
" X = get_predictors(df, features_to_try)\n",
|
|
" n, k = X.shape[0], X.shape[1]\n",
|
|
" if scoring_rule == 'aic':\n",
|
|
" scorer = make_scorer(neg_AIC, n=n, k=k, greater_is_better=True)\n",
|
|
" elif scoring_rule == 'bic':\n",
|
|
" scorer = make_scorer(neg_BIC, n=n, k=k, greater_is_better=True)\n",
|
|
" else:\n",
|
|
" scorer = scoring_rule\n",
|
|
" cv_results = cross_validate(model, X, y, scoring=scorer, cv=10)\n",
|
|
" score = cv_results['test_score'].mean()\n",
|
|
" if score > best_score: \n",
|
|
" best_feat = col\n",
|
|
" cond = True\n",
|
|
" best_score = score\n",
|
|
" if best_feat != None:\n",
|
|
" columns.remove(best_feat)\n",
|
|
" features_to_keep.append(best_feat)\n",
|
|
" return features_to_keep, best_score \n",
|
|
"\n",
|
|
"def get_predictors(df, cols):\n",
|
|
" cat_pred = []\n",
|
|
" cont_pred = []\n",
|
|
" for col in cols:\n",
|
|
" if isinstance(df[col].values[0], str):\n",
|
|
" cat_pred.append(col)\n",
|
|
" else:\n",
|
|
" cont_pred.append(col)\n",
|
|
" if len(cat_pred) != 0:\n",
|
|
" df_dummies = pd.get_dummies(df[cat_pred], drop_first=True)\n",
|
|
" else:\n",
|
|
" df_dummies = pd.DataFrame() \n",
|
|
" df_cont = df[cont_pred] \n",
|
|
" df_cat = pd.concat([df_dummies, df_cont], axis=1)\n",
|
|
" \n",
|
|
" return df_cat.values \n"
|
|
],
|
|
"metadata": {
|
|
"id": "EyQkfpnZud5e",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 33,
|
|
"outputs": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"**10) Use the function forward_selection() and the functions neg_AIC() and neg_BIC() to perform a forward selection on the dataframe features to see which subset of features is best to fit the target variable 'price'. Also, do a forward selection with an entering criterion defined as the MSE.**\n",
|
|
"\n",
|
|
"**For each selection, report the best subset of features obtained, as well as the score obtained. What do you observe ?** "
|
|
],
|
|
"metadata": {
|
|
"id": "AS0VH-6mjFXV",
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [
|
|
"forward_selection(df, LinearRegression(), ['price'], df_cat.columns.copy(), 'bic')\n"
|
|
],
|
|
"metadata": {
|
|
"id": "GNKAsJxVu8jO",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 44,
|
|
"outputs": [
|
|
{
|
|
"ename": "KeyError",
|
|
"evalue": "'cut_Fair'",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
|
|
"\u001B[0;31mKeyError\u001B[0m Traceback (most recent call last)",
|
|
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py:3621\u001B[0m, in \u001B[0;36mIndex.get_loc\u001B[0;34m(self, key, method, tolerance)\u001B[0m\n\u001B[1;32m 3620\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m-> 3621\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_engine\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mget_loc\u001B[49m\u001B[43m(\u001B[49m\u001B[43mcasted_key\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 3622\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mKeyError\u001B[39;00m \u001B[38;5;28;01mas\u001B[39;00m err:\n",
|
|
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/pandas/_libs/index.pyx:136\u001B[0m, in \u001B[0;36mpandas._libs.index.IndexEngine.get_loc\u001B[0;34m()\u001B[0m\n",
|
|
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/pandas/_libs/index.pyx:163\u001B[0m, in \u001B[0;36mpandas._libs.index.IndexEngine.get_loc\u001B[0;34m()\u001B[0m\n",
|
|
"File \u001B[0;32mpandas/_libs/hashtable_class_helper.pxi:5198\u001B[0m, in \u001B[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001B[0;34m()\u001B[0m\n",
|
|
"File \u001B[0;32mpandas/_libs/hashtable_class_helper.pxi:5206\u001B[0m, in \u001B[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001B[0;34m()\u001B[0m\n",
|
|
"\u001B[0;31mKeyError\u001B[0m: 'cut_Fair'",
|
|
"\nThe above exception was the direct cause of the following exception:\n",
|
|
"\u001B[0;31mKeyError\u001B[0m Traceback (most recent call last)",
|
|
"Input \u001B[0;32mIn [44]\u001B[0m, in \u001B[0;36m<module>\u001B[0;34m\u001B[0m\n\u001B[1;32m 1\u001B[0m model \u001B[38;5;241m=\u001B[39m LinearRegression(fit_intercept\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mTrue\u001B[39;00m)\n\u001B[0;32m----> 2\u001B[0m \u001B[43mforward_selection\u001B[49m\u001B[43m(\u001B[49m\u001B[43mdf\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mmodel\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43m[\u001B[49m\u001B[38;5;124;43m'\u001B[39;49m\u001B[38;5;124;43mprice\u001B[39;49m\u001B[38;5;124;43m'\u001B[39;49m\u001B[43m]\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mdf_cat\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcolumns\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcopy\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;124;43m'\u001B[39;49m\u001B[38;5;124;43mneg_mean_squared_error\u001B[39;49m\u001B[38;5;124;43m'\u001B[39;49m\u001B[43m)\u001B[49m\n",
|
|
"Input \u001B[0;32mIn [33]\u001B[0m, in \u001B[0;36mforward_selection\u001B[0;34m(df, model, target_column, columns, scoring_rule)\u001B[0m\n\u001B[1;32m 22\u001B[0m \u001B[38;5;28;01mfor\u001B[39;00m col \u001B[38;5;129;01min\u001B[39;00m columns: \n\u001B[1;32m 23\u001B[0m features_to_try \u001B[38;5;241m=\u001B[39m features_to_keep \u001B[38;5;241m+\u001B[39m [col] \n\u001B[0;32m---> 24\u001B[0m X \u001B[38;5;241m=\u001B[39m \u001B[43mget_predictors\u001B[49m\u001B[43m(\u001B[49m\u001B[43mdf\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mfeatures_to_try\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 25\u001B[0m n, k \u001B[38;5;241m=\u001B[39m X\u001B[38;5;241m.\u001B[39mshape[\u001B[38;5;241m0\u001B[39m], X\u001B[38;5;241m.\u001B[39mshape[\u001B[38;5;241m1\u001B[39m]\n\u001B[1;32m 26\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m scoring_rule \u001B[38;5;241m==\u001B[39m \u001B[38;5;124m'\u001B[39m\u001B[38;5;124maic\u001B[39m\u001B[38;5;124m'\u001B[39m:\n",
|
|
"Input \u001B[0;32mIn [33]\u001B[0m, in \u001B[0;36mget_predictors\u001B[0;34m(df, cols)\u001B[0m\n\u001B[1;32m 45\u001B[0m cont_pred \u001B[38;5;241m=\u001B[39m []\n\u001B[1;32m 46\u001B[0m \u001B[38;5;28;01mfor\u001B[39;00m col \u001B[38;5;129;01min\u001B[39;00m cols:\n\u001B[0;32m---> 47\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(\u001B[43mdf\u001B[49m\u001B[43m[\u001B[49m\u001B[43mcol\u001B[49m\u001B[43m]\u001B[49m\u001B[38;5;241m.\u001B[39mvalues[\u001B[38;5;241m0\u001B[39m], \u001B[38;5;28mstr\u001B[39m):\n\u001B[1;32m 48\u001B[0m cat_pred\u001B[38;5;241m.\u001B[39mappend(col)\n\u001B[1;32m 49\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n",
|
|
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/pandas/core/frame.py:3505\u001B[0m, in \u001B[0;36mDataFrame.__getitem__\u001B[0;34m(self, key)\u001B[0m\n\u001B[1;32m 3503\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mcolumns\u001B[38;5;241m.\u001B[39mnlevels \u001B[38;5;241m>\u001B[39m \u001B[38;5;241m1\u001B[39m:\n\u001B[1;32m 3504\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_getitem_multilevel(key)\n\u001B[0;32m-> 3505\u001B[0m indexer \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcolumns\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mget_loc\u001B[49m\u001B[43m(\u001B[49m\u001B[43mkey\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 3506\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m is_integer(indexer):\n\u001B[1;32m 3507\u001B[0m indexer \u001B[38;5;241m=\u001B[39m [indexer]\n",
|
|
"File \u001B[0;32m~/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py:3623\u001B[0m, in \u001B[0;36mIndex.get_loc\u001B[0;34m(self, key, method, tolerance)\u001B[0m\n\u001B[1;32m 3621\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_engine\u001B[38;5;241m.\u001B[39mget_loc(casted_key)\n\u001B[1;32m 3622\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mKeyError\u001B[39;00m \u001B[38;5;28;01mas\u001B[39;00m err:\n\u001B[0;32m-> 3623\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mKeyError\u001B[39;00m(key) \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01merr\u001B[39;00m\n\u001B[1;32m 3624\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mTypeError\u001B[39;00m:\n\u001B[1;32m 3625\u001B[0m \u001B[38;5;66;03m# If we have a listlike key, _check_indexing_error will raise\u001B[39;00m\n\u001B[1;32m 3626\u001B[0m \u001B[38;5;66;03m# InvalidIndexError. Otherwise we fall through and re-raise\u001B[39;00m\n\u001B[1;32m 3627\u001B[0m \u001B[38;5;66;03m# the TypeError.\u001B[39;00m\n\u001B[1;32m 3628\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_check_indexing_error(key)\n",
|
|
"\u001B[0;31mKeyError\u001B[0m: 'cut_Fair'"
|
|
]
|
|
}
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"**11) Looking at the scatter plot of the variable 'price' against the variable 'x', a linear model might not be the best fit to explain the relation between the two variables. Using a transformation of the variable 'x', try to obtain a better fit. Plot the linear regression line and the one obtained using the transformation of 'x'.**"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"source": [],
|
|
"metadata": {
|
|
"id": "xMrxh-fuBh7Y",
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"execution_count": 2,
|
|
"outputs": []
|
|
}
|
|
]
|
|
} |