Fitting a regression model
Consider the following regression model:
Lung_i = α + β × CIG_i
We expect the incidence of lung cancer to rise with cigarette consumption, as is popularly believed. Mathematically, we want to test whether β > 0 in the regression model above. Such a test matters because it shows how strong the effect of cigarettes on lung cancer is.
Let’s define our hypotheses formally:
H_0: Smoking doesn’t have any effect on the likelihood of lung cancer
vs
H_a: H_0 is false, and smoking has a significant effect on lung cancer
Data used in the analysis
For this statistical analysis, our SPSS expert selected data on the number of people who have lung cancer and the number of people who smoke cigarettes. The dataset is available here. It has one categorical variable, state, and two numeric variables, CIG and LUNG, with 44 data points in total. LUNG is the response variable (the variable of interest), and CIG is the explanatory variable.

Figure 1: Pictorial representation of data
Descriptive Statistics
Before the full analysis, it’s good to summarize the data and see how it is distributed.
First, let’s have a look at the boxplot of the two quantitative variables.

Here we see a few outliers at the upper end of the CIG variable. The LUNG variable has no outliers.
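A boxplot usually flags a point as an outlier when it lies more than 1.5 interquartile ranges beyond the quartiles. The sketch below illustrates that rule in Python. The values are made up for illustration and are not the CIG data, and the quartile convention used by `statistics.quantiles` may differ slightly from the one SPSS applies.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical values, NOT the actual CIG data.
sample = [14, 18, 20, 21, 22, 23, 24, 25, 26, 28, 42]
print(iqr_outliers(sample))
```

Only points above the upper fence would appear "on the top side" of the box, which is the pattern described for CIG.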
Now let’s have a look at the scatter plot of the two quantitative variables.

The plot already has a fitted linear regression line. The data points are not perfectly linear, but they are close to it, so a linear regression study should make sense.
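The fitted line in a scatter plot like this comes from ordinary least squares. As a rough illustration of what happens behind the plot, the sketch below computes the intercept and slope from the usual closed-form formulas. The data points are hypothetical and are not taken from the dataset, and the function name `fit_line` is our own.

```python
def fit_line(x, y):
    """Ordinary least squares estimates for y = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = sum of cross-deviations / sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta = sxy / sxx
    # The fitted line passes through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical points, NOT the actual CIG/LUNG data.
cig = [18.0, 22.0, 25.0, 28.0, 31.0]
lung = [15.5, 18.0, 20.1, 21.0, 23.4]
alpha, beta = fit_line(cig, lung)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
```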
We’ll also look at the normal Q-Q plot of variables to see if they are normally distributed.


Here are the normal Q-Q plots for the two variables. The plots look reasonably straight, and our expert concluded that the two variables are approximately normally distributed.
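For readers who want to see what a normal Q-Q plot is made of, here is a small Python sketch. It pairs each sorted observation with a theoretical standard-normal quantile and then measures how straight the resulting points are with a correlation coefficient. The plotting positions (i - 0.5) / n are one common convention and are an assumption here (SPSS may use a different one), and the sample values are hypothetical.

```python
import math
from statistics import NormalDist

def qq_points(values):
    """Pair each sorted value with its theoretical standard-normal quantile."""
    n = len(values)
    ordered = sorted(values)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def straightness(points):
    """Pearson correlation of the Q-Q points; near 1 means nearly straight."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, NOT the actual data.
sample = [15.2, 17.1, 18.4, 19.0, 19.9, 20.3, 21.5, 22.8, 24.6]
print(round(straightness(qq_points(sample)), 3))
```

A value close to 1 corresponds to the "reasonably straight" plot described above. It is a rough visual aid, not a formal normality test.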
Correlation analysis
Here our experts performed the regression analysis to see whether there is a significant relationship between CIG and LUNG. The model summary is shown below:
Model Summary
Model | R | R² | Adjusted R² | Std. Error of the Estimate |
1 | .697 | .486 | .474 | 3.06607 |
The R² of the model is 0.486, meaning that CIG explains about 48.6% of the variation in LUNG. The model is statistically significant at any reasonable level of significance, since the p-value is ~0.
ANOVA
Model | Sum of Squares | df | Mean Square | F | Sig. |
Regression | 373.878 | 1 | 373.878 | 39.771 | |
Residual | 394.833 | 42 | 9.401 | | |
Total | 768.712 | 43 | | | |
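The figures in the model summary and the ANOVA table are tied together by standard formulas, so they can be cross-checked. The Python sketch below recomputes R², adjusted R², the residual mean square, the F statistic, and the standard error of the estimate from the reported sums of squares and degrees of freedom. The variable names are ours. The inputs are the rounded values printed in the tables, so the results agree with the reported ones only up to rounding.

```python
import math

# Values copied from the ANOVA table.
ss_regression = 373.878
ss_residual = 394.833
ss_total = 768.712
df_regression, df_residual, df_total = 1, 42, 43

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * df_total / df_residual
ms_residual = ss_residual / df_residual
f_stat = (ss_regression / df_regression) / ms_residual
std_error_estimate = math.sqrt(ms_residual)

print(f"R squared          = {r_squared:.3f}")           # table reports .486
print(f"Adjusted R squared = {adj_r_squared:.3f}")       # table reports .474
print(f"Residual MS        = {ms_residual:.3f}")         # table reports 9.401
print(f"F                  = {f_stat:.3f}")              # table reports 39.771
print(f"Std. error of est. = {std_error_estimate:.5f}")  # table reports 3.06607
```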
Coefficients
Model | Unstandardized B | Std. Error | Standardized Beta (β) | t | Sig. |
(Constant) | 6.472 | 2.141 | | 3.023 | .004 |
CIG | .529 | .084 | .697 | 6.306 | .000 |
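Each t value in the coefficients table is the coefficient divided by its standard error. In a regression with a single predictor, the t statistic of the slope is also the square root of the F statistic from the ANOVA table. The short Python check below uses the rounded table values, so the slope's t comes out slightly different from the reported 6.306. The variable names are ours.

```python
import math

# Values copied from the coefficients and ANOVA tables (rounded as printed).
b_const, se_const = 6.472, 2.141
b_cig, se_cig = 0.529, 0.084
f_stat = 39.771

t_const = b_const / se_const
t_cig = b_cig / se_cig

print(f"t (Constant)        = {t_const:.3f}  (table: 3.023)")
print(f"t (CIG), rounded in = {t_cig:.3f}  (table: 6.306)")
print(f"sqrt(F)             = {math.sqrt(f_stat):.3f}")
```

The small gap for CIG arises because SPSS computes t from unrounded values of B and its standard error, while sqrt(F) recovers the reported figure.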
The residuals are plotted against the predicted values. There is no visible pattern, and the two quantities look uncorrelated.
The regression equation is:
Lung=6.472+0.529×CIG
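To show how the fitted equation is used, the sketch below turns it into a small Python function for predicted values and residuals. The CIG inputs and the observed LUNG value are hypothetical and chosen only for illustration. The function names are ours.

```python
ALPHA = 6.472  # estimated intercept from the coefficients table
BETA = 0.529   # estimated slope for CIG

def predict_lung(cig):
    """Predicted LUNG value for a given CIG value."""
    return ALPHA + BETA * cig

def residual(cig, observed_lung):
    """Observed minus predicted LUNG."""
    return observed_lung - predict_lung(cig)

# Hypothetical CIG values, NOT rows from the dataset.
for cig in (20, 25, 30):
    print(cig, round(predict_lung(cig), 3))

# Hypothetical observation: CIG = 25, LUNG = 21.0
print(round(residual(25, 21.0), 3))
```

Each one-unit increase in CIG raises the predicted LUNG value by 0.529, which is the slope interpretation used in the rest of the analysis.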

So, in terms of the slope, we have the following hypotheses.
H_0: β = 0 vs H_a: β ≠ 0
The p-value for the CIG coefficient is ~0 (SPSS reports it as .000, i.e., below 0.0005), which suggests rejecting H_0. We therefore reject the null hypothesis in favor of the alternative, since the data provide significant evidence against it.
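SPSS prints the p-value only to three decimals. As a rough check of how small it is, the Python sketch below computes a two-sided p-value for t = 6.306 with 42 residual degrees of freedom by numerically integrating the Student t density. This is our own illustration using only the standard library, not the method SPSS uses internally, and the function names are ours.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t_stat, df, steps=20000):
    """Two-sided p-value using Simpson's rule on the density over [0, |t|]."""
    b = abs(t_stat)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # probability of 0 < T < |t|
    return max(0.0, 2 * (0.5 - central_area))

p_value = two_sided_p(6.306, 42)
print(f"two-sided p-value = {p_value:.2e}")
```

Because the estimated slope is positive, the one-sided p-value for the directional alternative β > 0 is half the two-sided value, so the conclusion is the same either way.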
Conclusion
After doing our regression study, our online SPSS tutors concluded that there’s significant evidence of an effect of CIG on LUNG. The coefficient is positive, and the p-value of the test is very small, indicating a strong relationship between the two.
Of course, correlation is never proof of causation, and we do not have individual-level data to support the stronger claim (the people who have lung cancer are not necessarily the same people who smoke). So, the next step for research could be to collect a longitudinal dataset of individuals and observe them over a longer period to reach a firmer conclusion.