# Data Analysis: Issues Affecting the World Today

Data analysis can be defined as the statistical process of exploring, analyzing, cleaning, and modeling data with the intent of identifying important information that can be used to support decision-making or arrive at useful conclusions. This unique process can also be used to identify the problems affecting the current world. With the right data analysis tools, what was once volumes of ambiguous information becomes a simple and clear decision point.

## DATA CLEANING

I have four data sets that are on emissions, buildings, population growth, and economic growth rate. I start with the cleaning emissions data set. The data set is disaggregated by facilities and emission types. I will aggregate it by summing over the years so that I can have the total for each year. There are three sources of emission in the data set (air, water, and land), I sum the three to have total emission. The year in the data set was written as 2018/2019, I create a new variable “year” which extract the lead year (2019 from 2018/2019) and aggregated over this year variable. Finally, I have yearly data from 1999 to 2019. The code and the sample of final data is shown in the chunk below

library(stringr)

##DATA CLEANING

emission<-read.csv("emissions.csv")

emission$totalemission=rowSums(emission[,c("air_total_emission_kg","water_emission_kg","land_emission_kg")],na.rm=T)

emission$year=as.numeric(str_sub(emission$report_year,6,9))

emission<-emission[,c("totalemission","year")]

emission<-aggregate(emission,by=list(emission$year),FUN=sum)

emission$year=emission$Group.1

emission=emission[,2:3]

head(emission)

## totalemission year

## 1 1163652271 1999

## 2 3241358298 2000

## 3 3249103564 2001

## 4 3952112597 2002

## 5 3908653089 2003

## 6 3968629002 2004

Next, I clean the building's data set which is made up of monthly data from 1983 to 2020. First I remove the first 9 rows that are not observations but provide information about the data and then extract the variable that I need. Then I average over years to give a yearly time series data from 1983 to 2020. The code and the sample of final data are shown in the chunk below.

buildings<-read.csv("buildings.csv",stringsAsFactors = F)

buildings<-buildings[10:450,]

buildings$dwellings=as.numeric(buildings[,3])

buildings$year=as.numeric(str_sub(buildings$X,5,8))

buildings=buildings[,c(23,24)]

buildings=na.omit(buildings)

buildings=aggregate(buildings,by=list(buildings$year),FUN=mean)

buildings=buildings[,c(2,3)]

head(buildings)

## dwellings year

## 1 9656.667 1983

## 2 10250.667 1984

## 3 10009.333 1985

## 4 8072.583 1986

## 5 8364.750 1987

## 6 11301.417 1988

Next, I clean the population growth data set which is made up of quarterly data from 1981 to 2019. First I remove the first 9 rows that are not observations but provide information about the data and then extract the variable that I need. Then I average over years to give a yearly time series data from 1983 to 2020. The code and the sample of final data are shown in the chunk below.

pop<-read.csv("population_growth.csv",stringsAsFactors=F)

pop=pop[10:163,]

pop$population=as.numeric(pop$Estimated.Resident.Population..ERP.....Australia..)

pop$year=as.numeric(str_sub(pop$X,5,8))

pop=pop[,16:17]

pop<-aggregate(pop,by=list(pop$year),FUN=mean)

pop<-pop[,2:3]

head(pop)

## population year

## 1 14988.70 1981

## 2 15208.52 1982

## 3 15415.55 1983

## 4 15604.17 1984

## 5 15816.33 1985

## 6 16048.42 1986

There is no much to clean in the economic growth rate data set. I only create a year variable to denote the year of the observations. The code and the sample of final data are shown in the chunk below.

growth<-read.csv("growth.csv")

growth$year=as.numeric(str_sub(growth$Date,1,4))+1

growth=growth[,2:3]

head(growth)

## Growth.in.real.GDP.per.capita year

## 1 0.2 1961

## 2 -1.1 1962

## 3 4.2 1963

## 4 5.0 1964

## 5 3.9 1965

## 6 0.4 1966

I have all my data cleaned but the lengths are different. I see that observations for the years 1999 to 2017 are available for all variables, thus I restrict my sample to this number and combine the data set. Moreover,I took the log of total emission, population, and a number of approved buildings while I did not take that of growth because it contains negative values.

growth=growth[39:57,]

pop=pop[19:37,]

emission=emission[1:19,]

buildings=buildings[17:35,]

data<-data.frame(growth,pop,emission,buildings)

data<-data[,c(2,1,3,5,7)]

data$ltotalemission=log(data$totalemission)

data$ldwellings=log(data$dwellings)

data$lpopulation=log(data$population)

## Time series analysis

##time series plot of variables

par(mfrow=c(2,2))

plot(data$year,data$Growth.in.real.GDP.per.capita,ylab="GDP growth",type="l",col=1)

plot(data$year,data$population,ylab="Population",type="l",col=2)

plot(data$year,data$totalemission,ylab="total emissions (kg)",type="l",col=3)

plot(data$year,data$dwellings,ylab="Total approved Buildings",type="l",col=4)

We see from the plot that total emissions jumped drastically from the initial period and reached a peak around 2008 then falls very slowly till date. The total of approved buildings has fluctuated over time. The same can be seen for economic growth. However, the population has been increasing throughout time.

##scatterplot

par(mfrow=c(2,2))

plot(data$Growth.in.real.GDP.per.capita,data$totalemission,xlab="GDP growth", ylab="total emission",col=1)

plot(data$population,data$totalemission,xlab="Population", ylab="total emission",col=2)

plot(data$dwellings,data$totalemission,xlab="Total approved buildings ", ylab="total emission",col=3)

The scatterplot of total emissions against the independent variables is shown above. We hardly see any pattern of relationship between total emission and the independent variables.

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

##

## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':

##

## format.pval, units

corr.mat<-rcorr(as.matrix(data[,2:5]))

corr.mat

## Growth.in.real.GDP.per.capita population

## Growth.in.real.GDP.per.capita 1.00 -0.62

## population -0.62 1.00

## totalemission -0.43 0.32

## dwellings -0.01 0.09

## totalemission dwellings

## Growth.in.real.GDP.per.capita -0.43 -0.01

## population 0.32 0.09

## totalemission 1.00 -0.15

## dwellings -0.15 1.00

##

## n= 19

##

##

## P

## Growth.in.real.GDP.per.capita population

## Growth.in.real.GDP.per.capita 0.0044

## population 0.0044

## totalemission 0.0632 0.1818

## dwellings 0.9821 0.7074

## totalemission dwellings

## Growth.in.real.GDP.per.capita 0.0632 0.9821

## population 0.1818 0.7074

## totalemission 0.5439

## dwellings 0.5439

The result shows that there is a negative medium correlation between GDP growth and total emissions (r=-0.43) and is significant at a 10% alpha level (p=0.0632). there is negative weak (r=-0.15) but insignificant (p=0.1818>0.05) between total emissions and the number of approved buildings. There exist positive medium (r=0.32) but insignificant (p=0.5439>0.05) relationship between total emissions and dwellings. For professional assistance on this topic, take our time series assignment help.

## Linear Programming

model<-lm(ltotalemission~Growth.in.real.GDP.per.capita+lpopulation+ldwellings,data=data)

summary(model)

##

## Call:

## lm(formula = ltotalemission ~ Growth.in.real.GDP.per.capita +

## lpopulation + ldwellings, data = data)

##

## Residuals:

## Min 1Q Median 3Q Max

## -0.76158 -0.10547 0.00244 0.15319 0.32270

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 24.44494 10.78909 2.266 0.0387 *

## Growth.in.real.GDP.per.capita -0.11043 0.07472 -1.478 0.1601

## lpopulation 0.34170 0.91284 0.374 0.7134

## ldwellings -0.62367 0.70515 -0.884 0.3904

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 0.2597 on 15 degrees of freedom

## Multiple R-squared: 0.2755, Adjusted R-squared: 0.1305

## F-statistic: 1.901 on 3 and 15 DF, p-value: 0.1729

The regression result above shows that the estimated regression model is ltotalemissions=24.44-0.11growth+0.34lpopulation-0.62ldwellings. The result shows that contrary to expectation, the number of approved buildings and growth in GDP hasa negative effect on total emission. Specifically, a 1% increase in the number of approved buildings reduces emission by 0.62% reduction in emission while a 1% increase in GDP growth reduces total emission by 11.04%. Finally, a 1% increase in population increases totals emission by 0.34%. However, all these variables are not significant as all have p-values greater than 0.05. The adjusted R-squared shows that 13.05% of the variation is explained by the independent variables. ###Predictive Modelling.

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

##

## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':

##

## margin

set.seed(58)

rfm<-randomForest(ltotalemission~.,data=data[,c(2,6,7,8)],ntree=500,importance=T)

print(rfm)

##

## Call:

## randomForest(formula = ltotalemission ~ ., data = data[, c(2, 6, 7, 8)], ntree = 500, importance = T)

## Type of random forest: regression

## Number of trees: 500

## No. of variables tried at each split: 1

##

## Mean of squared residuals: 0.07360822

## % Var explained: -0.19

The mean of squared residuals is 0.0736 with 0.19 percent of variance explained.

importance(rfm)

## %IncMSE IncNodePurity

## Growth.in.real.GDP.per.capita 0.91483892 0.4931572

## ldwellings -0.09948159 0.1740888

## lpopulation 9.85744741 0.4757064

varImpPlot(rfm)

The result above shows that the population is the most important variable explaining total emissions as it has the highest mean decrease accuracy. Get assistance on this topic by contacting our

**linear programming assignment help**providers.### Conclusion

We have examined whether economic growth and structure contribute to environmental pollution using yearly time series data from 1999 to 2017. The result of the multiple linear regression model shows that none of the considered economic variables are significant which means there is no effect of economic growth and structure on the emission of a toxic substance to the environment. The limitation of this is that we cannot rule out the possibility of omitted variable bias given that the omnibus F-test is not significant which means the model needs to be improved upon by finding variables that may be significant but have been left out. However, this provides a good starting point in investigating this important topic.