Friday, January 29, 2016

Coursera Regression Course - Assignment 2

A major life depression is associated with
a higher daily consumption of alcohol

All data from the NESARC Study

Categorical Explanatory Variable

The explanatory variable is categorical and coded No=0, Yes=1, so subjects with no major depression in their lifetime form the baseline (0).
3549-3549
MAJORDEPLIFE
MAJOR DEPRESSION - LIFETIME (NON-HIERARCHICAL)
----------------------------------------------
35254 0. No
7839 1. Yes
----------------------------------------------

Quantitative Response Variable

The response variable is the average daily volume of ethanol (alcohol) consumed in the past year.
3675-3682
ETOTLCA2
AVERAGE DAILY VOLUME OF ETHANOL CONSUMED IN PAST YEAR, FROM ALL TYPES OF ALCOHOLIC BEVERAGES COMBINED
(NOTE: Users may wish to exclude outliers)
--------------------------------------------------
0.0003 - 219.9555 Ounces of ethanol
Blank Unknown
--------------------------------------------------

We see that a large number of people in the dataset are abstinent. We do not yet know whether abstinence is associated with depression; it would be interesting to see which of the two groups the abstinent subjects fall into more often.
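One way to look at this, sketched on a tiny made-up table rather than the real NESARC file: flag the rows where ETOTLCA2 is blank (the codebook labels these Unknown, and this post treats them as abstinent) and compare the share of such rows between the two depression groups. The values in `toy` and the column name `no_volume` are invented for illustration; with the real data the same two steps would be applied to the `data` frame loaded in the program code.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; NaN stands for a blank ETOTLCA2 entry.
toy = pd.DataFrame({
    "MAJORDEPLIFE": [0, 0, 0, 0, 1, 1, 1, 1],
    "ETOTLCA2": [0.5, np.nan, np.nan, 1.2, 0.7, np.nan, 0.1, 2.0],
})

# Flag rows without a recorded ethanol volume.
toy["no_volume"] = toy["ETOTLCA2"].isna()

# Share of blank rows within each depression group.
blank_share = toy.groupby("MAJORDEPLIFE")["no_volume"].mean()
print(blank_share)
```

If the shares differ clearly between the groups, the subjects dropped from the regression are not spread evenly over the depression categories.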

Linear Regression Test

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               ETOTLCA2   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     18.86
Date:                Fri, 29 Jan 2016   Prob (F-statistic):           1.42e-05
Time:                        15:57:58   Log-Likelihood:                -59022.
No. Observations:               26655   AIC:                         1.180e+05
Df Residuals:                   26653   BIC:                         1.181e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept        0.5366      0.015     35.377      0.000         0.507     0.566
MAJORDEPLIFE     0.1474      0.034      4.342      0.000         0.081     0.214
==============================================================================
Omnibus:                    84158.156   Durbin-Watson:                   2.010
Prob(Omnibus):                  0.000   Jarque-Bera (JB):      20279387185.774
Skew:                          50.216   Prob(JB):                         0.00
Kurtosis:                    4274.926   Cond. No.                         2.62
==============================================================================

The R-squared value is very low, suggesting that only about 0.1% of the variation in ethanol intake can be explained by a major depression in life. Still, the p-value of the F-statistic (1.42e-05) is so low that we can consider the small effect statistically significant.
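The rounded R-squared of 0.001 can be cross-checked against the other numbers in the table. For a regression with a single predictor, R-squared equals F / (F + Df Residuals), and the F-statistic equals the squared t-value of the slope. The short sketch below only redoes that arithmetic with the values printed in the summary; the variable names are my own.

```python
# Values copied from the OLS summary table.
f_stat = 18.86       # F-statistic
df_resid = 26653     # Df Residuals
t_slope = 4.342      # t-value of MAJORDEPLIFE

# With one predictor: F = R2 * df_resid / (1 - R2), solved for R2.
r_squared = f_stat / (f_stat + df_resid)

print(f"R-squared recovered from F: {r_squared:.6f}")
print(f"Squared t-value of the slope: {t_slope ** 2:.2f}")
```

The recovered R-squared is roughly 0.0007, so the explained share of variation is even a little smaller than the rounded 0.001 in the table suggests.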
With such a low R-squared I would normally not go further, but taking the model at face value, the coefficient says that a major depression in a lifetime is associated with an increase of Beta=0.1474 ounces in average daily ethanol intake, with high statistical significance (p<0.001).
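Because MAJORDEPLIFE is coded 0/1, the fitted line reduces to two predicted group means: the intercept for subjects without depression, and intercept plus coefficient for subjects with depression. A minimal sketch of that arithmetic, using the coefficients from the summary table (the variable names are mine):

```python
# Coefficients copied from the OLS summary table (ounces of ethanol per day).
intercept = 0.5366
beta = 0.1474

# Predicted mean intake for each value of the 0/1 explanatory variable.
mean_no_depression = intercept + beta * 0
mean_depression = intercept + beta * 1

# Size of the difference relative to the baseline group.
relative_increase = beta / intercept

print(f"No depression: {mean_no_depression:.2f} oz")
print(f"Depression:    {mean_depression:.2f} oz")
print(f"Relative increase: {relative_increase:.1%}")
```

The two predicted values match the group means of 0.54 and 0.68 printed in the program output, and the relative increase over the baseline is a bit above a quarter.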

Graph: a major life depression is associated with an increase in mean daily ethanol intake from 0.54 to 0.68 ounces

Program Code and Output

In [1]:
%matplotlib inline

import numpy
import pandas
import statsmodels.api as sm
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf 
In [2]:
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%.2f'%x)
In [3]:
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
In [4]:
# convert the variables used in the model to numeric
data['MAJORDEPLIFE'] = pandas.to_numeric(data['MAJORDEPLIFE'], errors='coerce')
In [5]:
data['ETOTLCA2'] = pandas.to_numeric(data['ETOTLCA2'], errors='coerce')
In [6]:
# distribution of ethanol volume among subjects drinking more than 10 ounces per day
subGroupEthanol = list(filter(lambda x: x > 10.0, data['ETOTLCA2'].dropna()))
seaborn.distplot(subGroupEthanol)
In [7]:
reg1 = smf.ols('ETOTLCA2 ~ MAJORDEPLIFE', data=data).fit()
print (reg1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               ETOTLCA2   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     18.86
Date:                Fri, 29 Jan 2016   Prob (F-statistic):           1.42e-05
Time:                        17:11:29   Log-Likelihood:                -59022.
No. Observations:               26655   AIC:                         1.180e+05
Df Residuals:                   26653   BIC:                         1.181e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept        0.5366      0.015     35.377      0.000         0.507     0.566
MAJORDEPLIFE     0.1474      0.034      4.342      0.000         0.081     0.214
==============================================================================
Omnibus:                    84158.156   Durbin-Watson:                   2.010
Prob(Omnibus):                  0.000   Jarque-Bera (JB):      20279387185.774
Skew:                          50.216   Prob(JB):                         0.00
Kurtosis:                    4274.926   Cond. No.                         2.62
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [8]:
# listwise deletion for calculating means for regression model observations
sub1 = data[['ETOTLCA2', 'MAJORDEPLIFE']].dropna()

# group means & sd
print ("Mean")
ds1 = sub1.groupby('MAJORDEPLIFE').mean()
print (ds1)
print ("Standard deviation")
ds2 = sub1.groupby('MAJORDEPLIFE').std()
print (ds2)

# bivariate bar graph
seaborn.factorplot(x="MAJORDEPLIFE", y="ETOTLCA2", data=sub1, kind="bar", ci=None)
plt.xlabel('Major Life Depression')
plt.ylabel('Mean average daily volume of ethanol consumed in past year (ounces)')
Mean
              ETOTLCA2
MAJORDEPLIFE          
0                 0.54
1                 0.68
Standard deviation
              ETOTLCA2
MAJORDEPLIFE          
0                 1.44
1                 4.03