diff --git a/docs/example.ipynb b/docs/example.ipynb index e864d9e..7e1b572 100644 --- a/docs/example.ipynb +++ b/docs/example.ipynb @@ -11,9 +11,17 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.1.0\n" + ] + } + ], "source": [ "import linreg_ally\n", "\n", @@ -22,7 +30,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -38,7 +46,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -47,7 +55,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -56,7 +64,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -76,7 +84,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -92,7 +100,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 7, "metadata": {}, "outputs": [ { @@ -208,7 +216,7 @@ "4 140.0 3449 10.5 1970-01-01 USA " ] }, - "execution_count": 5, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } @@ -231,7 +239,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -251,7 +259,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -299,13 +307,13 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ - "
Pipeline(steps=[('preprocessor',\n",
+       "
Pipeline(steps=[('preprocessor',\n",
        "                 ColumnTransformer(transformers=[('standardscaler',\n",
        "                                                  StandardScaler(),\n",
        "                                                  ['Miles_per_Gallon',\n",
@@ -730,7 +738,7 @@
        "                                                 ('onehotencoder',\n",
        "                                                  OneHotEncoder(), ['Origin']),\n",
        "                                                 ('drop', 'drop', ['Name'])])),\n",
-       "                ('model', LinearRegression())])
" ], "text/plain": [ "Pipeline(steps=[('preprocessor',\n", @@ -761,7 +769,7 @@ " ('model', LinearRegression())])" ] }, - "execution_count": 18, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } @@ -774,12 +782,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "`scores` gives the R2 and negative mean squared error scores that we are interested to find out in order to understand how the model performs on test data." + "Scores give the R² and negative mean squared error scores that we are interested in finding out in order to understand how the model performs on the test data." ] }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 11, "metadata": {}, "outputs": [ { @@ -788,7 +796,7 @@ "{'r2': 0.8463952369304465}" ] }, - "execution_count": 19, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } @@ -813,13 +821,181 @@ "This is the end of this tutorial where you have seen how we use the `run_linear_regression` function in our package to preprocess data, run linear regression and output with scoring metrics." ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Checking Normality and Homoscedasticity of Residuals\n", + "\n", + "A linear regression model assumes that residuals are normally distributed and have constant variance (homoscedasticity). To check whether these assumptions are met, we use the `qq_and_residuals_plot` function. This function generates:\n", + "\n", + "1. A Quantile-Quantile (Q-Q) plot to assess the normality of residuals.\n", + "2. A Residuals vs. Fitted Values plot to check for homoscedasticity." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `qq_and_residuals_plot` function takes two parameters: `y_actual` and `y_predicted`. These values were extracted from the linear regression model we previously created." 
+ ] + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ - "# Merari - plot" + "# y_actual is y_test (true labels)\n", + "y_actual = y_test\n", + "\n", + "# y_predicted is obtained by predicting on X_test\n", + "y_predicted = best_model.predict(X_test)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that `y_actual` and `y_predicted` have been extracted, let's pass these parameters to the `qq_and_residuals_plot` function." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "
\n", + "" + ], + "text/plain": [ + "alt.HConcatChart(...)" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from linreg_ally.plotting import qq_and_residuals_plot\n", + "\n", + "qq_and_residuals_plot(y_actual, y_predicted)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Interpreting the Q-Q Plot\n", + "\n", + "If the Q-Q plot shows a significant deviation from the red dashed line (which represents perfect normality), the residuals are not normally distributed. In our plot, a few points deviate from the line at the tails, suggesting potential skewness or outliers. However, since these deviations are minor, we can conclude that the residuals are approximately normal." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Interpreting the Residuals vs. Fitted Values Plot\n", + "\n", + "For the homoscedasticity assumption to hold, residuals should be randomly scattered around the red dashed line in the Residuals vs. Fitted Values plot. This would indicate that residual variance remains constant across all fitted values (homoscedasticity).\n", + "\n", + "However, in our case, the residuals cluster at different fitted value ranges, and the spread increases as the fitted values increase, suggesting that the variance is not constant (heteroscedasticity)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Implications of Assumption Violations\n", + "\n", + "If the normality assumption is violated:\n", + "Ordinary Least Squares (OLS) regression still produces best linear unbiased estimates (BLUE) as long as independence and homoscedasticity hold. However, hypothesis tests and confidence intervals may be misleading if residuals deviate significantly from normality.\n", + "\n", + "If the homoscedasticity assumption is violated:\n", + "You can still fit a linear regression model, but you should interpret results with caution. 
The estimated coefficients remain unbiased, but standard errors and p-values become unreliable, affecting statistical inference." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "The `qq_and_residuals_plot` function is a valuable tool for assessing the normality and homoscedasticity assumptions in linear regression. If these assumptions are violated, you should consider corrective measures such as:\n", + "\n", + "- Transforming variables (e.g., logarithmic transformation),\n", + "- Using robust standard errors, or\n", + "- Exploring alternative models (e.g., weighted least squares, generalized least squares)." ] } ], @@ -839,7 +1015,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.8" + "version": "3.11.11" } }, "nbformat": 4,