The primary regression
The year 1987 was chosen because this is a primary regression analysis to find the variables that have the greatest impact on the scrap and rework rates.
Results of regression
Dependent variable   Independent variable   Coefficient   Intercept   R-squared
scrap                hrsemp                 0.0980794     5.148057    0.0354
scrap                lhrsemp                0.9011656     5.578901    0.0398
lscrap               lhrsemp                0.4329011     1.194021    0.1286
scrap                tothrs                 0.0031348     4.366962    0.0001
Rework               hrsemp                 0.0267415     4.025201    0.0046
Rework rate          lhrsemp                0.3159775     4.229571    0.0081
Rework rate          lhrsemp                0.2784642     0.7324393   0.0429
Rework rate          tothrs                 0.0331067     2.965147    0.0337
The variables hrsemp, lhrsemp and tothrs were chosen for the primary regression because they can be used as a proxy for employee knowledge (from training) and so have a logical relationship with the best working practices, which would positively impact the scrap rate and rework rate.
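For reference, the coefficient, intercept and R-squared of a one-variable regression like those in the results table can be computed as in the following minimal sketch. The function name simple_ols and the toy numbers are assumptions made for illustration; they are not the 1987 firm data and do not reproduce the reported values.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (slope, intercept, r_squared) for the fit y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical toy data (NOT the report's data set)
hrsemp = [0.0, 5.0, 10.0, 20.0, 40.0]   # training hours per employee
scrap = [6.0, 5.5, 4.0, 4.5, 2.0]       # scrap rate per 100 pieces

slope, intercept, r2 = simple_ols(hrsemp, scrap)
print(slope, intercept, r2)
```

Replacing either list with its logarithm gives the log-scale versions (lhrsemp, lscrap) of the same fit.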
Looking at the relationship between the dependent variables and the hours of training (in both level and log scales), the scrap and rework rates per 100 pieces decrease as the training hours per employee increase, as expected. But the positive coefficient on total training hours goes against common sense and may be due to specification errors.
Misspecification in the regression results can be due to multiple factors such as nonlinearity, heteroscedasticity, the choice of independent variables, omitted variable bias, and so on. So, to test for robustness, a logarithmic scale and correlated variables were tried to see if we get a better R-squared value.
The coefficient for lhrsemp changed significantly for the regression with rework. This may be because the introduction of tothrs as an additional variable might have eliminated an omitted variable bias.
The Final model
lscrap = 0.7324393 + 0.4329011 lhrsemp
lrework = 4.08778 + 1.886967 lhrsemp + 0.1045574 tothrs
The variable tothrs was left out of the independent variables affecting lscrap, because the models without it yielded a better R-squared value.
Grant
Again, the year 1987 was chosen by our Stata experts, as the government won't have the 1988 (current) data and only has the data from the previous year when it is evaluating whether to provide grants to the firms.
Looking at the probability that the effect on the grant is random (the p-values), union and avgsal are the most significant of the three variables in explaining the grant for employee training, as they have the lowest p-values.
A coefficient of 0.1609977 on the union dummy means that unionised firms have a probability of getting the grant for employee training that is about 16.1 percentage points higher than that of firms that are not unionised.
In multiple linear regression, collinearity between independent variables and the omission of influential variables (omitted variable bias) are problems, as these might affect the error term as well as the estimates. We are running an ordinary multiple linear regression, but the dependent variable is a binary dummy variable. So the usual interpretation of a coefficient as an incremental change in the level of the dependent variable doesn't work here; the coefficient is read instead as a change in the probability of receiving the grant.
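To illustrate how a coefficient is read when the dependent variable is a 0/1 dummy, the sketch below fits OLS with a single dummy regressor and checks that the slope equals the difference in grant shares between the two groups. The data and the helper name ols_fit are hypothetical, and the report's own regression uses three regressors rather than one, so this only shows the interpretation, not the reported estimate.

```python
from statistics import mean

def ols_fit(x, y):
    """Return (intercept, slope) of the OLS fit y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical firms: union = 1 if unionised, grant = 1 if the firm got a grant
union = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
grant = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

intercept, slope = ols_fit(union, grant)

# Share of firms with a grant in each group
share_union = mean([g for g, u in zip(grant, union) if u == 1])
share_nonunion = mean([g for g, u in zip(grant, union) if u == 0])

# The slope is the gap in grant probability between the two groups
print(slope, share_union - share_nonunion)
```

With one dummy regressor the intercept is the grant share among non-unionised firms, and the slope is the gap in shares, measured in probability units.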
For omitted variable bias, we simply try including other variables to see if they have a significant impact, and for multicollinearity, one or more variables can be removed from the model. (For a categorical dependent variable, a logistic regression can be used.)
Variables such as the total number of employees and total training hours in 1987 can be used in estimating whether the government would give the firm grants in 1988 for employee training.
Efficacy of grant
Regression model (1): clscrap = -0.3170579 grant - 0.0574357
Regression model (2): lscrap = -0.2539697 grant + 0.8311606 lscrap_1 + 0.021237
Here, in model (1), the change in the scrap rate was used instead of the actual scrap rate, as this helps avoid any selection bias that would arise if the grants went to firms with lower scrap rates. Alternatively, model (2), which takes the actual scrap rate as the dependent variable, can also be used, but the lagged scrap rate is included in the model to eliminate the selection bias.
As mentioned above, in model (2) the lagged scrap rate was used in addition to the treatment dummy variable grant, so that the government's selection of firms does not bias the estimated treatment effect.
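The link between the two specifications can be written out explicitly. The block below assumes, as the variable names suggest, that clscrap is the change in the log scrap rate (lscrap minus lscrap_1); the symbols beta_0, beta_1, rho and u are notation introduced here for illustration.

```latex
\begin{aligned}
\text{Model (2):}\quad & \mathit{lscrap} = \beta_0 + \beta_1\,\mathit{grant} + \rho\,\mathit{lscrap\_1} + u \\
\text{Model (1):}\quad & \mathit{clscrap} = \mathit{lscrap} - \mathit{lscrap\_1} = \beta_0 + \beta_1\,\mathit{grant} + u
  \quad \text{(model (2) with } \rho = 1\text{)}
\end{aligned}
```

Under that assumption, model (1) is the special case of model (2) in which the coefficient on the lagged scrap rate is fixed at one, whereas model (2) lets the data choose it (the estimate reported above is 0.8311606).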
The assumption behind the model is that, by getting the government grant, the firms actually trained the employees, which in turn affects the scrap rate and rework rate. To feel more confident, we can add the variables lscrap_1 and chrsemp to the model.
The regression results show a reduction of about 0.25 in the log scrap rate (roughly 25%) for the firms that received the grant, significant at about the 9% level (91% confidence). So, I think that the grant program was fairly effective in reducing the scrap and rework rates.
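Because the dependent variable is in logs, the grant coefficient is only an approximate percentage change; the exact figure comes from exponentiating it. The short sketch below does that arithmetic, taking the grant coefficient of model (2) as negative (a reduction); the function name is an assumption introduced for illustration.

```python
import math

def log_coef_to_percent(coef):
    """Exact percentage change in y implied by a dummy coefficient in a log(y) model."""
    return (math.exp(coef) - 1.0) * 100.0

# Grant coefficient from model (2), taken as negative (a reduction in log scrap)
grant_coef = -0.2539697

pct_change = log_coef_to_percent(grant_coef)
print(f"{pct_change:.1f}%")  # prints -22.4%
```

So the "25%" reading is the usual log approximation; the implied exact reduction in the scrap rate is a little smaller.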
Passenger demand regression model
All the variables (fare, dist, dist^2, and year) are significant in explaining the passenger demand per day. Passenger demand is quadratic in distance (-0.5766912 distance + 0.0002252 distance^2): it decreases as the distance increases, at a diminishing rate, meaning that passengers prefer shorter distances more. A positive coefficient (31.86536) for the year suggests that the passenger demand is increasing on average every year, and hence the market is growing.
The coefficient of the fare variable on passengers per day is -1.515604. The negative coefficient implies that demand decreases as the fare increases, which is what economic theory predicts.
As the fare increases, the number of passengers who are willing to pay that fare for a particular travel decreases as per economic theory. This negative relationship between fare and the number of passengers naturally results in a negative coefficient in a regression analysis.
Regression model: passen = -1.515604 fare - 0.5766912 distance + 0.0002252 distance^2 + 31.86536 year - 62509.17
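The quadratic in distance implies a marginal effect that changes with distance. As a rough sketch, using the two distance coefficients with the linear term taken as negative, the marginal effect and the distance at which it reaches zero can be computed as follows. The constant and function names are assumptions for illustration, and distance is in whatever unit the data set uses.

```python
# Distance coefficients from the passenger demand model
B_DIST = -0.5766912    # coefficient on distance
B_DIST2 = 0.0002252    # coefficient on distance^2

def marginal_effect_of_distance(dist):
    """Derivative of predicted passen with respect to distance."""
    return B_DIST + 2.0 * B_DIST2 * dist

def turning_point():
    """Distance at which the marginal effect of distance is zero."""
    return -B_DIST / (2.0 * B_DIST2)

print(marginal_effect_of_distance(500.0))  # negative: demand still falling
print(round(turning_point(), 1))           # about 1280.4
```

Below the turning point, an extra unit of distance lowers predicted demand; the fitted curve flattens as distance approaches it, so the "prefer shorter distances" reading applies to routes shorter than that.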
2SLS
Regression model (First stage):
fare = a1 + b1*bmktshr + c1*dist + d1*dist_2 + e1*year + u1
The variables dist, dist_2 (= dist^2), and year have to be included along with the instrumental variable bmktshr, because all the exogenous variables in the second stage have to be included in the first stage. If they were left out, the first-stage residual (the part of fare not captured by farehat) could be correlated with them, and since that residual ends up in the second-stage error term, the second-stage estimates would be biased.
Regression model (Second stage):
passen = a2 + b2*farehat + c2*dist + d2*dist_2 + e2*year + u2
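The mechanics of the two stages can be sketched on a stripped-down case with one instrument and no other regressors (dist, dist_2 and year are omitted here purely to keep the example short). The toy numbers and helper names are hypothetical. The sketch checks that regressing passen on the first-stage fitted values gives the same slope as the simple instrumental-variables ratio cov(z, y) / cov(z, x).

```python
from statistics import mean

def cov(a, b):
    """Population covariance of two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / len(a)

def ols(x, y):
    """Return (intercept, slope) of the OLS fit y = intercept + slope * x."""
    slope = cov(x, y) / cov(x, x)
    return mean(y) - slope * mean(x), slope

# Hypothetical toy data (NOT the airline data set)
bmktshr = [0.2, 0.4, 0.5, 0.7, 0.9, 0.3]              # instrument
fare = [210.0, 190.0, 185.0, 160.0, 150.0, 205.0]     # endogenous regressor
passen = [500.0, 560.0, 540.0, 640.0, 690.0, 520.0]   # outcome

# First stage: fare on the instrument, then keep the fitted values
a1, b1 = ols(bmktshr, fare)
farehat = [a1 + b1 * z for z in bmktshr]

# Second stage: passen on the fitted fare
a2, b2 = ols(farehat, passen)

# Simple IV ratio for comparison
iv_ratio = cov(bmktshr, passen) / cov(bmktshr, fare)
print(b2, iv_ratio)
```

Running the second stage by hand like this gives the right coefficient but not the right standard errors, which is one reason a dedicated 2SLS routine is normally used.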
Our Stata tutor was particularly interested in the coefficient of farehat, as the objective of the regression analysis was to find the price elasticity of demand.
The estimates would have a higher standard error because, in the second stage, the error term also contains the coefficient * (the difference between the actual and predicted values of fare). This results in higher standard errors for the estimates.
Yes. The coefficient of airfare changed significantly in the 2SLS model when compared to the OLS model.
To be a valid instrument, the market share shouldn't impact the demand directly, only indirectly through the fare, and it should have a reasonable association with the fare. We can only test the second requirement, using the correlation coefficient between them. It's difficult to test the instrument exogeneity rule, as it is difficult to exclude the indirect effect during analysis (even including both variables in an OLS model would create a multicollinearity problem).
The correlation coefficient of -0.1907 does not indicate a very strong relationship between bmktshr and fare, and hence it might be a weak instrument. But the coefficient being negative indicates that a higher market share would mean a lower fare, which is as expected. Also, logically speaking, the market share of the biggest airline can't have any direct correlation with the demand for that airline. So, both the requirements are fairly satisfied, and bmktshr may be used as an instrument for fare in the estimation of the passenger demand.
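One way to make the weak-instrument concern more concrete is the first-stage F statistic, for which a common rule of thumb treats values below 10 as a sign of a weak instrument. In the special case of a single instrument and no other regressors, F can be written directly from the correlation, as in the sketch below. The sample size is not stated in this report, so n_obs is a hypothetical input, and since the actual first stage also includes dist, dist_2 and year, the simple correlation is only a rough guide.

```python
def first_stage_f(corr, n_obs):
    """F statistic of a one-regressor first stage, from the sample correlation.

    For a regression with one regressor and an intercept,
    F = r^2 * (n - 2) / (1 - r^2), which equals the squared t statistic.
    """
    r2 = corr ** 2
    return r2 * (n_obs - 2) / (1.0 - r2)

# Correlation between bmktshr and fare from the report; n_obs is HYPOTHETICAL
f_stat = first_stage_f(-0.1907, n_obs=1000)
print(f_stat, "weak by the F < 10 rule" if f_stat < 10 else "passes the F > 10 rule")
```

A small correlation can still pass this check when the sample is large, so the size of the data set matters as much as the correlation itself.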
