Problem Description:
This assignment uses R to analyze a dataset of 2400 responses to 10 interview questions. We first set up the environment by loading the required packages, then clean the data by removing invalid responses. Key variables (age, education, employment, and religious inclination) are categorized, descriptive statistics are generated, and the normality of the data is assessed. Finally, a linear regression is fitted to explore the significance of selected variables within the dataset.
Step 1: Setting Up the Environment
In R, the first step is to make sure the required packages are installed and then load them with library(). This project relies on janitor, dplyr, tidyverse, psych, and readxl.
R Code
# Load required packages
library(janitor)
library(dplyr)
library(tidyverse)
library(psych)
library(readxl)
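A call to library() stops with an error if the package has not been installed. As a minimal sketch of a pre-flight check, the hypothetical helper missing_packages() below (an illustrative name, not part of any package) reports which of the required packages are absent, using only base R:

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("janitor", "dplyr", "tidyverse", "psych", "readxl")
to_install <- missing_packages(required)

# Report rather than install automatically; if anything is listed,
# install.packages(to_install) can be run manually.
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
}
```

Reporting instead of installing keeps the script free of side effects; whether to install automatically is a choice left to the analyst.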
Step 2: Data Import and Cleaning
The dataset consists of 2400 responses to 10 interview questions. Before analysis, the data must be cleaned by removing invalid responses: "Don't Know" answers, missing data, and refusals to answer. The subset() function in R handles this, and in this dataset it removes 323 data points with problematic responses.
R Code
# Import the Excel dataset
data <- read_excel("your_dataset.xlsx")
# Clean the data by removing invalid responses
# (Question is a placeholder for the response column being screened)
data_cleaned <- data %>%
  subset(!is.na(Question) &
           !(Question %in% c("Don't Know", "Missing", "Refused to Answer")))
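When several question columns need screening at once, the same idea extends to a row-wise flag. The sketch below uses a small made-up data frame; the column names Q1 and Q2, the helper is_bad(), and the values are assumptions for illustration, not the survey data:

```r
# Toy data: Q1 and Q2 are hypothetical question columns.
responses <- data.frame(
  id = 1:6,
  Q1 = c("Yes", "Don't Know", "No", NA, "Yes", "No"),
  Q2 = c("No", "Yes", "Refused to Answer", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

invalid <- c("Don't Know", "Missing", "Refused to Answer")
question_cols <- c("Q1", "Q2")

# Flag a row if any question column is NA or holds an invalid label.
is_bad <- function(x) is.na(x) | x %in% invalid
bad_row <- Reduce(`|`, lapply(responses[question_cols], is_bad))

cleaned <- subset(responses, !bad_row)

# Number of rows dropped by the cleaning step (3 in this toy example)
n_removed <- nrow(responses) - nrow(cleaned)
print(n_removed)
```

Comparing nrow() before and after cleaning is a quick way to confirm how many responses were removed.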
Step 3: Variable Categorization
Selected variables, including age, education, employment, and religious inclination, need to be categorized for analysis. This is accomplished using the cut() function.
R Code
# Categorize selected variables
data_cleaned <- data_cleaned %>%
  mutate(
    # include.lowest = TRUE keeps values equal to the first break (e.g., Age 18)
    Age_Group = cut(Age, breaks = c(18, 25, 35, 45, 55, 65, Inf),
                    labels = c("18-25", "26-35", "36-45", "46-55", "56-65", "66+"),
                    include.lowest = TRUE),
    Education_Level = cut(Education, breaks = c(0, 8, 12, 16, 20, Inf),
                          labels = c("Primary", "High School", "Bachelor's", "Master's", "PhD"),
                          include.lowest = TRUE),
    Employment_Status = cut(Employment, breaks = c(0, 1, 2, 3, Inf),
                            labels = c("Unemployed", "Part-time", "Full-time", "Self-employed")),
    Religious_Level = cut(Religious, breaks = c(0, 1, 2, 3, Inf),
                          labels = c("Low", "Moderate", "High", "Very High"))
  )
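One detail worth checking with cut() is how it treats boundary values. By default its intervals are open on the left and closed on the right, so a value equal to the lowest break is assigned NA unless include.lowest = TRUE is set. The following sketch, with made-up ages, illustrates the difference:

```r
ages <- c(18, 25, 26, 70)
breaks <- c(18, 25, 35, 45, 55, 65, Inf)
labels <- c("18-25", "26-35", "36-45", "46-55", "56-65", "66+")

# Default: intervals are (18,25], (25,35], ... so 18 falls outside and becomes NA.
default_groups <- cut(ages, breaks = breaks, labels = labels)

# include.lowest = TRUE closes the first interval on the left: [18,25].
inclusive_groups <- cut(ages, breaks = breaks, labels = labels,
                        include.lowest = TRUE)

as.character(default_groups)    # NA, "18-25", "26-35", "66+"
as.character(inclusive_groups)  # "18-25", "18-25", "26-35", "66+"
```

Running table(is.na(...)) on each new categorical variable is a simple way to spot values that fell outside the breaks.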
Step 4: Descriptive Statistics
Descriptive statistics summarize the characteristics of each variable. Base R's summary() reports the minimum, quartiles, mean, and maximum, while describe() from the psych package adds further measures such as the standard deviation, skewness, and kurtosis.
R Code
# Generate descriptive statistics
descriptive_stats <- describe(data_cleaned)
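To compare the two functions side by side, here is a small sketch on a made-up numeric vector (the values are illustrative only). The describe() call is guarded so the example still runs where psych is not installed:

```r
# Toy numeric variable standing in for a survey item
x <- c(2, 4, 4, 5, 7, 9)

# Base R: minimum, quartiles, median, mean, and maximum
base_summary <- summary(x)
print(base_summary)

# psych::describe() additionally reports n, sd, skew, kurtosis, and se.
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(x))
}
```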
Step 5: Normality Testing
To assess normality, we apply the Shapiro-Wilk test via shapiro.test(). Its null hypothesis is that the data are normally distributed, so a small p-value (e.g., below 0.05) indicates a departure from normality.
R Code
# Perform normality test
normality_test_result <- shapiro.test(data_cleaned$Variable_of_Interest)
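As a rough illustration of how the test result can be read, the sketch below runs shapiro.test() on two simulated samples, one drawn from a normal distribution and one from a skewed (exponential) distribution. The data, the seed, and the 0.05 threshold are assumptions for the example, not values from the assignment:

```r
set.seed(42)  # reproducible simulated data; not the survey data

normal_like <- rnorm(200, mean = 50, sd = 10)
skewed <- rexp(200, rate = 1)

normal_test <- shapiro.test(normal_like)
skewed_test <- shapiro.test(skewed)

alpha <- 0.05
# TRUE means the test rejects normality at the chosen alpha
reject_normality <- c(
  normal_like = normal_test$p.value < alpha,
  skewed = skewed_test$p.value < alpha
)
print(reject_normality)
```

Note that shapiro.test() accepts between 3 and 5000 non-missing values, so it can be applied to a sample of this dataset's size. A Q-Q plot (qqnorm() followed by qqline()) is a useful visual complement.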
Step 6: Linear Regression Analysis
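For the linear regression analysis, we fit a model with lm() to a selected set of variables and inspect the coefficient estimates with summary(). Y_Variable, X1, X2, and X3 in the code are placeholders for the chosen outcome and predictors.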
R Code
# Perform linear regression analysis
linear_model <- lm(Y_Variable ~ X1 + X2 + X3, data = data_cleaned)
# View the results of the linear regression
summary(linear_model)
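To show what the summary() output contains, here is a self-contained sketch on simulated data. The variable names (x1, x2, y), the seed, and the true coefficients are assumptions chosen for illustration:

```r
set.seed(123)  # simulated data for illustration only
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# True relationship: y depends on x1 (slope 3) but not on x2
sim$y <- 2 + 3 * sim$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)

# Coefficient table: estimate, standard error, t value, two-sided p-value
coef_table <- coef(summary(fit))
print(round(coef_table, 3))

# Share of variance in y explained by the model
r_squared <- summary(fit)$r.squared
print(r_squared)
```

The "Pr(>|t|)" column of the coefficient table holds the two-sided p-value for each predictor, which is the quantity used when judging significance.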
Step 7: Significance Testing
Using a two-tailed t-test, we assess whether specific variables (e.g., Q52J, Q19A, Q1, and Q101) are significant, that is, whether their test statistics fall within the rejection region. Further variable testing can be conducted as needed.
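The rejection-region logic can be sketched as follows. The helper two_tailed_decision() is a hypothetical name introduced for illustration, and the t statistics and degrees of freedom are made-up values rather than results from the survey data:

```r
# Hypothetical helper: two-tailed t-test decision for a given t statistic.
two_tailed_decision <- function(t_stat, df, alpha = 0.05) {
  critical <- qt(1 - alpha / 2, df)    # boundary of the rejection region
  p_value <- 2 * pt(-abs(t_stat), df)  # two-sided p-value
  list(critical = critical, p_value = p_value,
       reject = abs(t_stat) > critical)
}

# Illustrative values only; not results from the survey data
two_tailed_decision(t_stat = 2.5, df = 100)$reject  # TRUE: |2.5| exceeds ~1.98
two_tailed_decision(t_stat = 1.0, df = 100)$reject  # FALSE
```

Comparing the absolute t statistic with the critical value is equivalent to comparing the two-sided p-value with alpha, so either form of the rule leads to the same decision.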