Justin Post
Used when you have a binary response variable
Consider just a binary response
Suppose you have a predictor variable as well, call it x
E(Y|x=x1)=P(Y=1|x=x1) E(Y|x=x2)=P(Y=1|x=x2)
Suppose you have a predictor variable as well, call it x
E(Y|x=x1)=P(Y=1|x=x1) E(Y|x=x2)=P(Y=1|x=x2)
E(Y|x)=P(Y=1|x)=β0+β1x
import pandas as pdwater = pd.read_csv("data/water_potability.csv")water.head()
## ph Hardness Solids ... Trihalomethanes Turbidity Potability## 0 NaN 204.890455 20791.318981 ... 86.990970 2.963135 0## 1 3.716080 129.422921 18630.057858 ... 56.329076 4.500656 0## 2 8.099124 224.236259 19909.541732 ... 66.420093 3.055934 0## 3 8.316766 214.373394 22018.417441 ... 100.341674 4.628771 0## 4 9.092223 181.101509 17978.986339 ... 31.997993 4.075075 0## ## [5 rows x 10 columns]
water.Potability.value_counts()
## 0 1998## 1 1278## Name: Potability, dtype: int64
water.groupby("Potability")[["Hardness", "Chloramines"]].describe()
## Hardness ... Chloramines ## count mean std ... 50% 75% max## Potability ... ## 0 1998.0 196.733292 31.057540 ... 7.090334 8.066462 12.653362## 1 1278.0 195.800744 35.547041 ... 7.215163 8.199261 13.127000## ## [2 rows x 16 columns]
import seaborn as snssns.regplot(x = water["Hardness"], y = water["Potability"])
import seaborn as snssns.regplot(x = water["Hardness"], y = water["Potability"], y_jitter = 0.1)
import matplotlib.pyplot as pltimport matplotlib.patches as mpatcheswater["Hardnessgroups"] = pd.cut(water['Hardness'], range(45, 335, 10))props = water[["Hardnessgroups", "Potability"]] \ .groupby("Hardnessgroups") \ .agg(prop = ('Potability', 'mean'), counts = ('Potability', 'count'))sc = plt.scatter(pd.Series(range(50,330,10)), props.prop, s = props.counts)plt.xlabel("Hardness")plt.ylabel("Proportion of Potable Water")plt.ylim([-0.1, 1.1])plt.legend(*sc.legend_elements("sizes", num=5, color = "blue"))plt.show()
## Text(0.5, 0, 'Hardness')## Text(0, 0.5, 'Proportion of Potable Water')## (-0.1, 1.1)
Response = success/failure, then modeling average number of successes for a given x is a probability!
Basic Logistic Regression models success probability using the logistic function
P(Y=1|x)=P(success|x)=eβ0+β1x1+eβ0+β1x
P(Y=1|x)=P(success|x)=eβ0+β1x1+eβ0+β1x
P(Y=1|x)=P(success|x)=eβ0+β1x1+eβ0+β1x
The logistic regression model doesn't have a closed form solution (maximum likelihood often used to fit parameters)
Back-solving shows the logit or log-odds of success is linear in the parameters
log(P(success|x)1−P(success|x))=β0+β1x
P(Y=1|x)=P(success|x)=eβ0+β1x1+eβ0+β1x
The logistic regression model doesn't have a closed form solution (maximum likelihood often used to fit parameters)
Back-solving shows the logit or log-odds of success is linear in the parameters
log(P(success|x)1−P(success|x))=β0+β1x
Coefficient interpretation changes greatly from linear regression model!
β1 represents a change in the log-odds of success
For inference, what do you think would indicate that x is related to the probability of success here?
sklearn
to fit modelfrom sklearn.linear_model import LogisticRegression
.fit()
methodsklearn
to fit modelfrom sklearn.linear_model import LogisticRegression
.fit()
methodlog_reg = LogisticRegression(penalty = 'none')log_reg.fit(X = water["Hardness"].values.reshape(-1,1), y = water["Potability"].values)
print(log_reg.intercept_, log_reg.coef_)
## [-0.27748213] [[-0.00086296]]
.predict()
method to predict success or failureimport numpy as nplog_reg.predict(np.array([[50], [150], [200], [250], [300]]))
## array([0, 0, 0, 0, 0], dtype=int64)
.predict()
method to predict success or failureimport numpy as nplog_reg.predict(np.array([[50], [150], [200], [250], [300]]))
## array([0, 0, 0, 0, 0], dtype=int64)
.predict_log_proba()
and .predict_proba()
to obtain log probabilities and probabilities, respectivelylog_reg.predict_proba(np.array([[50], [150], [200], [250], [300]])) #returns P(Y=0), P(Y=1) estimates for each value
## array([[0.57947776, 0.42052224],## [0.60035045, 0.39964955],## [0.61065667, 0.38934333],## [0.62086496, 0.37913504],## [0.63096734, 0.36903266]])
sc = plt.scatter(pd.Series(range(50,330,10)), props.prop, s = props.counts)preds = log_reg.predict_proba(np.array(range(50,330)).reshape(-1,1))plt.scatter(x = np.array(range(50,330)), y = preds[:,1])plt.ylim([-0.1,1.1]); plt.xlabel("Hardness"); plt.ylabel("Proportion of Potable Water"); plt.show()
preds = log_reg.predict_proba(np.array(range(-2500,2500)).reshape(-1,1))plt.scatter(pd.Series(range(50,330,10)), props.prop, s = props.counts)plt.scatter(x = np.array(range(-2500,2500)), y = preds[:,1])plt.ylim([-0.1,1.1]); plt.xlabel("Hardness"); plt.ylabel("Proportion of Potable Water"); plt.show()
sklearn
... can use statsmodels
package!import statsmodels.api as smlog_reg = sm.GLM(water["Potability"], water["Hardness"], family=sm.families.Binomial())res = log_reg.fit()print(res.summary())
## Generalized Linear Model Regression Results ## ==============================================================================## Dep. Variable: Potability No. Observations: 3276## Model: GLM Df Residuals: 3275## Model Family: Binomial Df Model: 0## Link Function: Logit Scale: 1.0000## Method: IRLS Log-Likelihood: -2191.5## Date: Sat, 15 Mar 2025 Deviance: 4383.0## Time: 22:31:50 Pearson chi2: 3.28e+03## No. Iterations: 4 Pseudo R-squ. (CS): -0.0003092## Covariance Type: nonrobust ## ==============================================================================## coef std err z P>|z| [0.025 0.975]## ------------------------------------------------------------------------------## Hardness -0.0022 0.000 -12.421 0.000 -0.003 -0.002## ==============================================================================
Logistic regression often a reasonable model for a binary response
Uses a sigmoid function to ensure valid predictions
Can predict success or failure using estimated probabilities
Justin Post
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |