class: center, middle, inverse, title-slide

.title[
# Logistic Regression Basics
]
.author[
### Justin Post
]

---
layout: false
class: title-slide-section-red, middle

# Logistic Regression Basics
Justin Post

---
layout: true

<div class="my-footer"><img src="img/logo.png" style="height: 60px;"/></div>

---

# Logistic Regression Model

Used when you have a **binary** response variable

- Consider just a binary response

    + What is the mean of the response?

<!-- Here write out some 0 and 1's as the population values. Look at probability as the mean -->

---

# Logistic Regression Model

Suppose you have a predictor variable as well, call it `\(x\)`

- Given two values of `\(x\)`, we could model separate proportions

`$$E(Y|x=x_1) = P(Y=1|x = x_1)$$`
`$$E(Y|x=x_2) = P(Y=1|x = x_2)$$`

---

# Logistic Regression Model

Suppose you have a predictor variable as well, call it `\(x\)`

- Given two values of `\(x\)`, we could model separate proportions

`$$E(Y|x=x_1) = P(Y=1|x = x_1)$$`
`$$E(Y|x=x_2) = P(Y=1|x = x_2)$$`

- For a continuous `\(x\)`, we could consider an SLR model

`$$E(Y|x) = P(Y=1|x) = \beta_0+\beta_1x$$`

---

# Linear Regression Isn't Appropriate

- Consider data about [water potability](https://www.kaggle.com/code/leabenzvi/water-potability-classification)

```python
import pandas as pd
water = pd.read_csv("data/water_potability.csv")
water.head()
```

```
##          ph    Hardness        Solids  ...  Trihalomethanes  Turbidity  Potability
## 0       NaN  204.890455  20791.318981  ...        86.990970   2.963135           0
## 1  3.716080  129.422921  18630.057858  ...        56.329076   4.500656           0
## 2  8.099124  224.236259  19909.541732  ...        66.420093   3.055934           0
## 3  8.316766  214.373394  22018.417441  ...       100.341674   4.628771           0
## 4  9.092223  181.101509  17978.986339  ...        31.997993   4.075075           0
## 
## [5 rows x 10 columns]
```

---

# Potability Summary

- Summarize water potability

```python
water.Potability.value_counts()
```

```
## 0    1998
## 1    1278
## Name: Potability, dtype: int64
```

```python
water.groupby("Potability")[["Hardness", "Chloramines"]].describe()
```

```
##            Hardness                          ...  Chloramines                     
##               count        mean        std  ...          50%       75%        max
## Potability                                  ...                                  
## 0            1998.0  196.733292  31.057540  ...     7.090334  8.066462  12.653362
## 1            1278.0  195.800744  35.547041  ...     7.215163  8.199261  13.127000
## 
## [2 rows x 16 columns]
```
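---

# The Mean of a Binary Response Is a Proportion

- Since `Potability` is coded 0/1, its sample mean is the observed proportion of potable samples

A minimal check using the `water` data loaded above:

```python
# mean of a 0/1 column = proportion of 1s, here the proportion of potable samples
print(water["Potability"].mean())
# equivalent view as relative frequencies
print(water["Potability"].value_counts(normalize = True))
```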
---

# Linear Regression Isn't Appropriate

- Plot the SLR model fit

```python
import seaborn as sns
sns.regplot(x = water["Hardness"], y = water["Potability"])
```

<img src="03-Logistic_Regression_Basics_files/figure-html/unnamed-chunk-5-1.svg" width="400px" style="display: block; margin: auto;" />

---

# Linear Regression Isn't Appropriate

- Plot the SLR model fit with jittered points

```python
import seaborn as sns
sns.regplot(x = water["Hardness"], y = water["Potability"], y_jitter = 0.1)
```

<img src="03-Logistic_Regression_Basics_files/figure-html/unnamed-chunk-6-3.svg" width="400px" style="display: block; margin: auto;" />

---

# Linear Regression Isn't Appropriate

```python
import matplotlib.pyplot as plt

# bin Hardness into width-10 intervals, then find the proportion potable
# and the number of observations in each bin
water["Hardnessgroups"] = pd.cut(water['Hardness'], range(45, 335, 10))
props = water[["Hardnessgroups", "Potability"]] \
    .groupby("Hardnessgroups") \
    .agg(prop = ('Potability', 'mean'), counts = ('Potability', 'count'))
# plot each bin's proportion at its midpoint, sized by bin count
sc = plt.scatter(pd.Series(range(50, 330, 10)), props.prop, s = props.counts)
plt.xlabel("Hardness")
plt.ylabel("Proportion of Potable Water")
plt.ylim([-0.1, 1.1])
plt.legend(*sc.legend_elements("sizes", num=5, color = "blue"))
plt.show()
```

---

# Linear Regression Isn't Appropriate

```
## Text(0.5, 0, 'Hardness')
## Text(0, 0.5, 'Proportion of Potable Water')
## (-0.1, 1.1)
```

<img src="03-Logistic_Regression_Basics_files/figure-html/unnamed-chunk-8-5.svg" width="450px" style="display: block; margin: auto;" />

---

# Logistic Regression

- If the response is success/failure, then the mean response for a given `\(x\)` is a probability!

    + predictions should never go below 0
    + predictions should never go above 1

- Basic logistic regression models the success probability using the *logistic function*

`$$P(Y=1|x) = P(success|x) = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}$$`

---

# Logistic Regression

<img src="03-Logistic_Regression_Basics_files/figure-html/unnamed-chunk-9-7.svg" width="500px" style="display: block; margin: auto;" />

---

# Logistic Regression

`$$P(Y=1|x) = P(success|x) = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}$$`

- The logistic regression model doesn't have a closed-form solution (maximum likelihood is often used to fit the parameters)

---

# Logistic Regression

`$$P(Y=1|x) = P(success|x) = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}$$`

- The logistic regression model doesn't have a closed-form solution (maximum likelihood is often used to fit the parameters)

- Back-solving shows that the *logit* or *log-odds* of success is linear in the parameters

`$$\log\left(\frac{P(success|x)}{1-P(success|x)}\right) = \beta_0+\beta_1 x$$`

---

# Logistic Regression

`$$P(Y=1|x) = P(success|x) = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}$$`

- The logistic regression model doesn't have a closed-form solution (maximum likelihood is often used to fit the parameters)

- Back-solving shows that the *logit* or *log-odds* of success is linear in the parameters

`$$\log\left(\frac{P(success|x)}{1-P(success|x)}\right) = \beta_0+\beta_1 x$$`

- Coefficient interpretation changes greatly from the linear regression model!

- `\(\beta_1\)` represents the change in the log-odds of success for a one-unit increase in `\(x\)`

---

# Hypotheses of Interest

For inference, what do you think would indicate that `\(x\)` is related to the probability of success here?
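---

# Checking the Logit Relationship Numerically

A minimal sketch (coefficient values made up for illustration) verifying that applying the log-odds transform to the logistic function recovers the linear predictor:

```python
import numpy as np

beta0, beta1 = -0.28, -0.001            # hypothetical coefficient values
x = np.array([50.0, 150.0, 300.0])

# logistic function: P(Y = 1 | x)
p = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))

# log-odds of success: should match beta0 + beta1 * x
print(np.log(p / (1 - p)))
print(beta0 + beta1 * x)
```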
---

# Fitting a Logistic Regression Model in Python

- Use `sklearn` to fit the model

```python
from sklearn.linear_model import LogisticRegression
```

- Similar to fitting an MLR model, we create an instance and then use the `.fit()` method

---

# Fitting a Logistic Regression Model in Python

- Use `sklearn` to fit the model

```python
from sklearn.linear_model import LogisticRegression
```

- Similar to fitting an MLR model, we create an instance and then use the `.fit()` method

```python
log_reg = LogisticRegression(penalty = None)  # turn off sklearn's default regularization
log_reg.fit(X = water["Hardness"].values.reshape(-1,1), y = water["Potability"].values)
```

```python
print(log_reg.intercept_, log_reg.coef_)
```

```
## [-0.27748213] [[-0.00086296]]
```

---

# Prediction with a Logistic Regression Model

- Still use the `.predict()` method to predict success or failure

```python
import numpy as np
log_reg.predict(np.array([[50], [150], [200], [250], [300]]))
```

```
## array([0, 0, 0, 0, 0], dtype=int64)
```

---

# Prediction with a Logistic Regression Model

- Still use the `.predict()` method to predict success or failure

```python
import numpy as np
log_reg.predict(np.array([[50], [150], [200], [250], [300]]))
```

```
## array([0, 0, 0, 0, 0], dtype=int64)
```

- Also have `.predict_log_proba()` and `.predict_proba()` to obtain log probabilities and probabilities, respectively

```python
log_reg.predict_proba(np.array([[50], [150], [200], [250], [300]]))
# returns P(Y=0), P(Y=1) estimates for each value
```

```
## array([[0.57947776, 0.42052224],
##        [0.60035045, 0.39964955],
##        [0.61065667, 0.38934333],
##        [0.62086496, 0.37913504],
##        [0.63096734, 0.36903266]])
```

---

# Plotting the Fit

```python
# binned proportions (sized by count) with fitted probabilities overlaid
sc = plt.scatter(pd.Series(range(50, 330, 10)), props.prop, s = props.counts)
preds = log_reg.predict_proba(np.array(range(50, 330)).reshape(-1,1))
plt.scatter(x = np.array(range(50, 330)), y = preds[:,1])
plt.ylim([-0.1, 1.1]); plt.xlabel("Hardness"); plt.ylabel("Proportion of Potable Water"); plt.show()
```

<img src="03-Logistic_Regression_Basics_files/figure-html/unnamed-chunk-18-1.svg" width="400px" style="display: block; margin: auto;" />

---

# Truly a Sigmoid-Type Function!

```python
# predict over a much wider (extrapolated) Hardness range to see the full S-shape
preds = log_reg.predict_proba(np.array(range(-2500, 2500)).reshape(-1,1))
plt.scatter(pd.Series(range(50, 330, 10)), props.prop, s = props.counts)
plt.scatter(x = np.array(range(-2500, 2500)), y = preds[:,1])
plt.ylim([-0.1, 1.1]); plt.xlabel("Hardness"); plt.ylabel("Proportion of Potable Water"); plt.show()
```

<img src="03-Logistic_Regression_Basics_files/figure-html/unnamed-chunk-19-5.svg" width="400px" style="display: block; margin: auto;" />
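---

# Interpreting the Slope as an Odds Ratio

Since the log-odds are linear in `\(x\)`, a one-unit increase in Hardness multiplies the odds of potability by `\(e^{\beta_1}\)`. A minimal sketch using the `sklearn` fit from the earlier slides:

```python
import numpy as np

b1 = log_reg.coef_[0][0]   # estimated slope on the log-odds scale
print(np.exp(b1))          # estimated multiplicative change in odds per unit of Hardness
```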
---

# Inference with a Logistic Regression Model

- Not implemented in `sklearn`... we can use the `statsmodels` package!

```python
import statsmodels.api as sm
# note: no intercept is included here; wrap the predictor with
# sm.add_constant() if an intercept term is desired
glm_mod = sm.GLM(water["Potability"], water["Hardness"], family=sm.families.Binomial())
res = glm_mod.fit()
print(res.summary())
```

```
##                  Generalized Linear Model Regression Results                  
## ==============================================================================
## Dep. Variable:             Potability   No. Observations:                 3276
## Model:                            GLM   Df Residuals:                     3275
## Model Family:                Binomial   Df Model:                            0
## Link Function:                  Logit   Scale:                          1.0000
## Method:                          IRLS   Log-Likelihood:                -2191.5
## Date:                Sat, 15 Mar 2025   Deviance:                       4383.0
## Time:                        22:31:50   Pearson chi2:                 3.28e+03
## No. Iterations:                     4   Pseudo R-squ. (CS):         -0.0003092
## Covariance Type:            nonrobust
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Hardness      -0.0022      0.000    -12.421      0.000      -0.003      -0.002
## ==============================================================================
```

---

# Recap

- Logistic regression is often a reasonable model for a binary response

- Uses a sigmoid function to ensure valid predictions

- Can predict success or failure using the estimated probabilities

    + Usually predict success if the probability is `\(>\)` 0.5 (made explicit on the next slide)
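---

# Classifying with a Threshold

The default 0.5 cutoff used by `.predict()` can be made explicit by thresholding the estimated probabilities from the `sklearn` fit (a minimal sketch; other cutoffs could be substituted):

```python
import numpy as np

probs = log_reg.predict_proba(np.array([[50], [150], [300]]))[:, 1]  # P(Y = 1 | x)
preds = (probs > 0.5).astype(int)  # same rule .predict() applies by default
print(probs, preds)
```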