# Linear and predictive modelling

In this article, we take you step by step through fitting a linear model, from cleaning the data to making predictions and drawing conclusions. Be prepared for technical content.

## Data cleaning

We have four data sets: emissions, buildings, population growth, and economic growth rate. We start by cleaning the emissions data set, which is disaggregated by facility and emission type. There are three sources of emission in the data set (air, water, and land), and we sum the three to get the total emissions for each record. The reporting year is written in the form 2018/2019, so we create a new variable, "year", which extracts the second year (2019 from 2018/2019), and we aggregate total emissions over this variable to get one total for each year. The result is yearly data from 1999 to 2019. The code and a sample of the final data are shown below.
library(stringr)
##DATA CLEANING
emission\$totalemission=rowSums(emission[,c("air_total_emission_kg","water_emission_kg","land_emission_kg")],na.rm=T)
emission\$year=as.numeric(str_sub(emission\$report_year,6,9))
emission<-emission[,c("totalemission","year")]
emission<-aggregate(emission,by=list(emission\$year),FUN=sum)
emission\$year=emission\$Group.1
emission=emission[,2:3]
##totalemission year
##1163652271 1999
##3241358298 2000
##3249103564 2001
##3952112597 2002
##3908653089 2003
##3968629002 2004
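
The aggregation step can be tried on a small made-up data frame. The sketch below is only an illustration: the rows in `toy` are invented, it uses base R's `substr()` in place of `str_sub()` so that it runs without extra packages, and it uses the formula interface of `aggregate()`, which sums only the emission column and so does not need the year to be restored afterwards.

```r
# Toy stand-in for the emissions data; all values are made up for illustration.
toy <- data.frame(
  report_year           = c("1998/1999", "1998/1999", "1999/2000", "1999/2000"),
  air_total_emission_kg = c(10, 20, 5, NA),
  water_emission_kg     = c(1, NA, 2, 3),
  land_emission_kg      = c(0, 4, NA, 1)
)

# Total emissions per record; na.rm = TRUE treats a missing source as zero.
sources <- c("air_total_emission_kg", "water_emission_kg", "land_emission_kg")
toy$totalemission <- rowSums(toy[, sources], na.rm = TRUE)

# Characters 6 to 9 of "1998/1999" give the second year, "1999".
toy$year <- as.numeric(substr(toy$report_year, 6, 9))

# Sum total emissions within each year.
yearly <- aggregate(totalemission ~ year, data = toy, FUN = sum)
print(yearly)
```

Because only `totalemission` appears on the left of the formula, the year column is used purely as the grouping key and is not summed.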
Next, we clean the buildings data set, which consists of monthly data from 1983 to 2020. First, we remove the first nine rows, which are not observations but information about the data, and extract the variable that we need. Then we average within each year to obtain a yearly time series from 1983 to 2020. The code and a sample of the final data are shown below.
buildings<-buildings[10:450,]
buildings\$dwellings=as.numeric(buildings[,3])
buildings\$year=as.numeric(str_sub(buildings\$X,5,8))
buildings=buildings[,c(23,24)]
buildings=na.omit(buildings)
buildings=aggregate(buildings,by=list(buildings\$year),FUN=mean)
buildings=buildings[,c(2,3)]
## dwellings year
##1 9656.667 1983
##2 10250.667 1984
##3 10009.333 1985
##4 8072.583 1986
##5 8364.750 1987
##6 11301.417 1988
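
The same pattern (drop the leading metadata rows, convert the text column to numbers, then average by year) can be sketched on a toy table. Everything in `raw` below is invented, including the `Mon-YYYY` date format and the series identifier, and base R's `substr()` stands in for `str_sub()`.

```r
# Toy stand-in for the raw buildings file: two metadata rows followed by
# monthly observations. All values are made up for illustration.
raw <- data.frame(
  X     = c("Unit", "Series ID", "Jan-1983", "Feb-1983", "Jan-1984", "Feb-1984"),
  value = c("Number", "A000000X", "100", "200", "300", "500"),
  stringsAsFactors = FALSE
)

n_meta <- 2                     # leading metadata rows (nine in the article's file)
obs <- raw[-seq_len(n_meta), ]  # keep only the observation rows

# The column was read as text because of the metadata rows, so convert it.
obs$dwellings <- as.numeric(obs$value)

# Characters 5 to 8 of "Jan-1983" give the year.
obs$year <- as.numeric(substr(obs$X, 5, 8))

# Average the monthly values within each year.
yearly <- aggregate(dwellings ~ year, data = obs, FUN = mean)
print(yearly)
```

Dropping rows by count keeps the sketch close to the article's approach; it relies on the number of metadata rows being known in advance.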
Next, we clean the population growth data set, which consists of quarterly data from 1981 to 2019. First, we remove the first nine rows, which are not observations but information about the data, and extract the variable that we need. Then we average within each year to obtain a yearly time series from 1981 to 2019. The code and a sample of the final data are shown below.
pop=pop[10:163,]
pop\$population=as.numeric(pop\$Estimated.Resident.Population..ERP.....Australia..)
pop\$year=as.numeric(str_sub(pop\$X,5,8))
pop=pop[,16:17]
pop<-aggregate(pop,by=list(pop\$year),FUN=mean)
pop<-pop[,2:3]
## population year
##1 14988.70 1981
##2 15208.52 1982
##3 15415.55 1983
##4 15604.17 1984
##5 15816.33 1985
##6 16048.42 1986
There is not much to clean in the economic growth rate data set. We only create a year variable to denote the year of each observation, taking the first four characters of the date and adding one. The code and a sample of the final data are shown below.
growth\$year=as.numeric(str_sub(growth\$Date,1,4))+1
growth=growth[,2:3]
## Growth.in.real.GDP.per.capita year
## 1 0.2 1961
## 2 -1.1 1962
## 3 4.2 1963
## 4 5.0 1964
## 5 3.9 1965
## 6 0.4 1966
All the data sets are now cleaned, but they cover different periods. Observations for the years 1999 to 2017 are available for all variables, so we restrict the sample to this period and combine the data sets. We then take the log of total emissions, population, and the number of approved buildings; we do not take the log of growth because it contains negative values.
growth=growth[39:57,]
pop=pop[19:37,]
emission=emission[1:19,]
buildings=buildings[17:35,]
data<-data.frame(growth,pop,emission,buildings)
data<-data[,c(2,1,3,5,7)]
data\$ltotalemission=log(data\$totalemission)
data\$ldwellings=log(data\$dwellings)
data\$lpopulation=log(data\$population)
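
Selecting rows by position works when each series starts in a known year and has no gaps. As an alternative sketch, and not the approach used above, the series can be matched on the year itself with `merge()`. The four toy data frames below are invented for illustration.

```r
# Toy yearly series with different coverage; all values are made up.
growth_toy    <- data.frame(year = 1997:2002, growth = c(2.1, -0.5, 3.0, 1.2, -1.1, 0.8))
pop_toy       <- data.frame(year = 1998:2003, population = c(100, 102, 104, 106, 108, 110))
emission_toy  <- data.frame(year = 1999:2002, totalemission = c(50, 80, 75, 90))
buildings_toy <- data.frame(year = 1996:2001, dwellings = c(10, 12, 11, 13, 12, 14))

# merge() keeps only the years present in both inputs (an inner join),
# so chaining it with Reduce() leaves the years common to all four series.
combined <- Reduce(function(a, b) merge(a, b, by = "year"),
                   list(growth_toy, pop_toy, emission_toy, buildings_toy))

# Log the strictly positive variables. Growth stays in levels because
# log() of a negative number is undefined (R returns NaN with a warning).
combined$ltotalemission <- log(combined$totalemission)
combined$lpopulation    <- log(combined$population)
combined$ldwellings     <- log(combined$dwellings)
print(combined)
```

Here only 1999 to 2001 appear in every toy series, so the combined table has three rows.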

### Exploratory data analysis

We have not fitted the model yet. Here we are simply trying to get more information about the data, since data exploration is a vital first step in any analysis.
##time series plot of variables
par(mfrow=c(2,2))
plot(data\$year,data\$Growth.in.real.GDP.per.capita,ylab="GDP growth",type="l",col=1)
plot(data\$year,data\$population,ylab="Population",type="l",col=2)
plot(data\$year,data\$totalemission,ylab="total emissions (kg)",type="l",col=3)
plot(data\$year,data\$dwellings,ylab="Total approved Buildings",type="l",col=4)
The time series plots show that total emissions jumped sharply after the initial period, reached a peak around 2008, and have fallen very slowly since then. Total approved buildings have fluctuated over time, and so has economic growth. The population, however, has increased throughout the period.
##scatterplot
par(mfrow=c(2,2))
plot(data\$Growth.in.real.GDP.per.capita,data\$totalemission,xlab="GDP growth", ylab="total emission",col=1)
plot(data\$population,data\$totalemission,xlab="Population", ylab="total emission",col=2)
plot(data\$dwellings,data\$totalemission,xlab="Total approved buildings ", ylab="total emission",col=3)
The scatterplots of total emissions against each independent variable are shown above. We see hardly any clear relationship between total emissions and the independent variables.
library(Hmisc)
corr.mat<-rcorr(as.matrix(data[,2:5]))
corr.mat
##                               Growth.in.real.GDP.per.capita population
## Growth.in.real.GDP.per.capita                          1.00      -0.62
## population                                            -0.62       1.00
## totalemission                                         -0.43       0.32
## dwellings                                             -0.01       0.09
##                               totalemission dwellings
## Growth.in.real.GDP.per.capita         -0.43     -0.01
## population                             0.32      0.09
## totalemission                          1.00     -0.15
## dwellings                             -0.15      1.00
##
## n= 19
##
##
## P
##                               Growth.in.real.GDP.per.capita population
## Growth.in.real.GDP.per.capita                               0.0044
## population                    0.0044
## totalemission                 0.0632                        0.1818
## dwellings                     0.9821                        0.7074
##                               totalemission dwellings
## Growth.in.real.GDP.per.capita 0.0632        0.9821
## population                    0.1818        0.7074
## totalemission                               0.5439
## dwellings                     0.5439
The results show a moderate negative correlation between GDP growth and total emissions (r=-0.43), which is significant at the 10% level (p=0.0632). There is a moderate positive but insignificant correlation between total emissions and population (r=0.32, p=0.1818>0.05), and a weak negative and insignificant correlation between total emissions and the number of approved buildings (r=-0.15, p=0.5439>0.05).
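
To see where such p-values come from, the sketch below applies the standard t test for a Pearson correlation to the rounded coefficients in the matrix, with n = 19. It assumes the p-values reported by `rcorr` are based on this test; `cor_p_value` is a hypothetical helper written for this illustration, not part of Hmisc.

```r
# Two-sided p-value for a Pearson correlation r from n paired observations,
# using t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom.
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

n <- 19  # number of yearly observations reported in the output

# Rounded correlations with total emissions, read from the matrix above.
r <- c(growth = -0.43, population = 0.32, dwellings = -0.15)
p <- cor_p_value(r, n)
print(round(p, 3))
```

Because the correlations are rounded to two decimals, the computed p-values agree with the reported ones only approximately, but each variable is paired with the same p-value as in the matrix.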

### Model fitting for prediction

In this section, we fit a multiple linear regression model to the data and interpret the results. The key quantities we use to draw conclusions are the coefficient estimates, their p-values, and the adjusted R-squared.
model<-lm(ltotalemission~Growth.in.real.GDP.per.capita+lpopulation+ldwellings,data=data)
summary(model)
##
## Call:
## lm(formula = ltotalemission ~ Growth.in.real.GDP.per.capita +
##     lpopulation + ldwellings, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.76158 -0.10547  0.00244  0.15319  0.32270
##
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   24.44494   10.78909   2.266   0.0387 *
## Growth.in.real.GDP.per.capita -0.11043    0.07472  -1.478   0.1601
## lpopulation                    0.34170    0.91284   0.374   0.7134
## ldwellings                    -0.62367    0.70515  -0.884   0.3904
## ---
##
## Residual standard error: 0.2597 on 15 degrees of freedom
## Multiple R-squared:  0.2755, Adjusted R-squared:  0.1305
## F-statistic: 1.901 on 3 and 15 DF,  p-value: 0.1729
The regression output above gives the estimated model ltotalemission = 24.44 - 0.11 × growth + 0.34 × lpopulation - 0.62 × ldwellings. Contrary to expectation, the number of approved buildings and GDP growth both have a negative estimated effect on total emissions. Specifically, a 1% increase in the number of approved buildings is associated with a 0.62% reduction in emissions, while a one percentage point increase in GDP growth is associated with a reduction in total emissions of about 11.04%. Finally, a 1% increase in population is associated with a 0.34% increase in total emissions. However, none of these variables is statistically significant, as all have p-values greater than 0.05. The adjusted R-squared shows that the independent variables explain only 13.05% of the variation in log total emissions.
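
The arithmetic behind these interpretations can be checked directly from the reported coefficients. The sketch below is illustrative only: the coefficients are copied from the output, while the inputs used for the point prediction (growth of 1, a population of 20000 in the units of the data, and 12000 approved dwellings) are hypothetical values chosen for the example.

```r
# Coefficients copied from the regression output above.
b <- c(intercept = 24.44494, growth = -0.11043,
       lpopulation = 0.34170, ldwellings = -0.62367)

# Log-level term: a one-unit (one percentage point) rise in growth changes
# emissions by 100 * (exp(b) - 1) percent; 100 * b is the usual approximation.
growth_effect_pct <- 100 * (exp(b[["growth"]]) - 1)

# Log-log terms: a 1% rise in x changes emissions by 100 * (1.01^b - 1) percent,
# which is close to b itself when b is small.
dwellings_effect_pct  <- 100 * (1.01^b[["ldwellings"]] - 1)
population_effect_pct <- 100 * (1.01^b[["lpopulation"]] - 1)

# Adjusted R-squared from R-squared, n observations and k predictors.
n <- 19; k <- 3; r2 <- 0.2755
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Point prediction for hypothetical inputs (made up for illustration).
log_pred <- b[["intercept"]] + b[["growth"]] * 1 +
  b[["lpopulation"]] * log(20000) + b[["ldwellings"]] * log(12000)
pred_kg <- exp(log_pred)  # simple back-transform from the log scale

print(round(c(growth = growth_effect_pct, dwellings = dwellings_effect_pct,
              population = population_effect_pct, adj_r2 = adj_r2), 4))
print(pred_kg)
```

The exact effect of a one percentage point rise in growth is a reduction of roughly 10.5%, slightly smaller than the 11.04% given by the 100 × coefficient approximation. Exponentiating the predicted log value is the simplest way to return to kilograms and ignores any retransformation adjustment.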