Data Analysis and Linear Regression with R

Problem Description:

In this R programming assignment, we analyze a dataset of 2400 responses to 10 interview questions. We begin by preparing the environment and loading the necessary packages. Next, we clean the data by removing invalid responses. Key variables (age, education, employment, and religious inclination) are then categorized. We generate descriptive statistics and assess the normality of the data. Finally, we fit a linear regression model and test the significance of selected variables.

Step 1: Setting Up the Environment

In R, the first step is to make sure that all the necessary packages are installed and then loaded with library(). For this project, we rely on janitor, dplyr, tidyverse, psych, and readxl. Note that loading tidyverse already attaches dplyr, so the separate library(dplyr) call is harmless but redundant.

R Code

# Load required packages
library(janitor)
library(dplyr)
library(tidyverse)
library(psych)
library(readxl)
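
If one of these packages is not installed, library() stops with an error. The following is a minimal sketch of one way to check beforehand. The helper name missing_packages is hypothetical (it is not part of the original assignment), and the install.packages() call is left commented out so that nothing is installed unintentionally.

```r
# Packages used in this walkthrough
required_packages <- c("janitor", "dplyr", "tidyverse", "psych", "readxl")

# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required_packages)
to_install

# Uncomment to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```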

Step 2: Data Import and Cleaning

The dataset consists of 2400 responses to 10 interview questions. Before analysis, invalid responses must be removed: "Don't Know" answers, missing data, and refusals to answer. We can do this with the subset() function in R, which removes 323 data points with problematic responses.

R Code

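# Import the Excel dataset (replace the file name with the path to your data)
data <- read_excel("your_dataset.xlsx")
# Clean the data by removing invalid responses
# ("Question" stands for the column being checked; NA values are dropped too)
data_cleaned <- data %>%
  subset(!is.na(Question) &
           !(Question %in% c("Don't Know", "Missing", "Refused to Answer")))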
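
When invalid codes can appear in any of several question columns, the filter has to be applied across columns, row by row. Below is a minimal sketch in base R on a made-up two-question data frame. The column names Q1 and Q2, their values, and the helper is_invalid are illustrative assumptions, not taken from the survey.

```r
# Toy data: five respondents, two questions (illustrative values only)
responses <- data.frame(
  Q1 = c("Yes", "No", "Don't Know", "Yes", NA),
  Q2 = c("Agree", "Refused to Answer", "Agree", "Disagree", "Agree"),
  stringsAsFactors = FALSE
)

invalid_codes <- c("Don't Know", "Missing", "Refused to Answer")

# Hypothetical helper: TRUE where a value is NA or one of the invalid codes
is_invalid <- function(x) is.na(x) | x %in% invalid_codes

# A row is dropped if any of its columns holds an invalid value
row_has_invalid <- Reduce(`|`, lapply(responses, is_invalid))
responses_cleaned <- responses[!row_has_invalid, , drop = FALSE]

# Number of respondents removed by the cleaning step
n_removed <- nrow(responses) - nrow(responses_cleaned)
n_removed
```

Comparing the row counts before and after cleaning is a quick way to confirm how many responses were removed.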

Step 3: Variable Categorization

Selected variables, including age, education, employment, and religious inclination, need to be categorized for analysis. This is accomplished using the cut() function, which splits a numeric variable into labeled intervals.

R Code

# Categorize selected variables
# include.lowest = TRUE keeps values equal to the first break (e.g., Age == 18),
# which cut() would otherwise turn into NA
data_cleaned <- data_cleaned %>%
  mutate(
    Age_Group = cut(Age, breaks = c(18, 25, 35, 45, 55, 65, Inf),
                    labels = c("18-25", "26-35", "36-45", "46-55", "56-65", "66+"),
                    include.lowest = TRUE),
    Education_Level = cut(Education, breaks = c(0, 8, 12, 16, 20, Inf),
                          labels = c("Primary", "High School", "Bachelor's", "Master's", "PhD"),
                          include.lowest = TRUE),
    Employment_Status = cut(Employment, breaks = c(0, 1, 2, 3, Inf),
                            labels = c("Unemployed", "Part-time", "Full-time", "Self-employed")),
    Religious_Level = cut(Religious, breaks = c(0, 1, 2, 3, Inf),
                          labels = c("Low", "Moderate", "High", "Very High"))
  )
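
One detail worth checking with cut() is how values on the interval boundaries are handled. By default the intervals are open on the left and closed on the right, so a value equal to the lowest break falls outside every interval and becomes NA unless include.lowest = TRUE is set. The short sketch below uses a few made-up ages to show the difference.

```r
# Made-up ages, chosen to sit on or near the interval boundaries
ages <- c(18, 25, 26, 70)

age_breaks <- c(18, 25, 35, 45, 55, 65, Inf)
age_labels <- c("18-25", "26-35", "36-45", "46-55", "56-65", "66+")

# Default behaviour: intervals are (18, 25], (25, 35], ... so 18 is excluded
default_groups <- cut(ages, breaks = age_breaks, labels = age_labels)

# With include.lowest = TRUE the first interval becomes [18, 25]
inclusive_groups <- cut(ages, breaks = age_breaks, labels = age_labels,
                        include.lowest = TRUE)

data.frame(age = ages, default = default_groups, inclusive = inclusive_groups)

# Count respondents per group; useNA = "ifany" exposes unintended NAs
table(inclusive_groups, useNA = "ifany")
```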

Step 4: Descriptive Statistics

Descriptive statistics provide insight into the characteristics of the variables. To obtain them, we can use summary() from base R or describe() from the psych package. describe() offers more detail, including the standard deviation, skewness, and kurtosis of each variable.

R Code

# Generate descriptive statistics
descriptive_stats <- describe(data_cleaned)
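
To make concrete what these functions report, here is a small base R sketch on a made-up numeric vector. The values and the name score_stats are illustrative only, and the hand-computed quantities cover just a few of the statistics that describe() returns.

```r
# Made-up numeric scores, for illustration only
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Base R summary: minimum, quartiles, median, mean, maximum
summary(scores)

# A few of the additional quantities that describe() also reports
score_stats <- c(
  n      = length(scores),
  mean   = mean(scores),
  sd     = sd(scores),      # sample standard deviation (divides by n - 1)
  median = median(scores)
)
score_stats
```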

Step 5: Normality Testing

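To assess normality, we apply the Shapiro-Wilk test with shapiro.test(). The null hypothesis is that the data come from a normal distribution, so a small p-value (for example, below 0.05) indicates a departure from normality. Note that shapiro.test() accepts between 3 and 5000 non-missing observations.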

R Code

# Perform normality test
normality_test_result <- shapiro.test(data_cleaned$Variable_of_Interest)
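
As an illustration of how to read the result, the sketch below runs the test on two simulated samples, one drawn from a normal distribution and one from a strongly skewed distribution. The samples, the seed, and the 5% threshold are assumptions made for this example and are not part of the survey data.

```r
set.seed(42)  # fixed seed so the simulated samples are reproducible

normal_sample <- rnorm(500, mean = 50, sd = 10)  # drawn from a normal distribution
skewed_sample <- rexp(500, rate = 1)             # strongly right-skewed

normal_result <- shapiro.test(normal_sample)
skewed_result <- shapiro.test(skewed_sample)

# The result is a list; the W statistic and p-value can be extracted directly
normal_result$statistic
normal_result$p.value
skewed_result$p.value

# Decision rule at the 5% level: reject normality when p < 0.05
reject_normality <- skewed_result$p.value < 0.05
reject_normality
```

With large samples the test can flag even minor deviations, so it is common to look at a histogram or a Q-Q plot (qqnorm() and qqline()) alongside the p-value.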

Step 6: Linear Regression Analysis

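For the linear regression analysis, we select a subset of the data and fit the model with lm(). Calling summary() on the fitted model reports the coefficient estimates, their standard errors, t-statistics, and p-values.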

R Code

# Perform linear regression analysis
linear_model <- lm(Y_Variable ~ X1 + X2 + X3, data = data_cleaned)
# View the results of the linear regression
summary(linear_model)
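
Since the placeholders above cannot be run as they stand, here is a self-contained sketch on simulated data that shows what the fitted model and its coefficient table look like. The variables x1, x2, and y, the sample size, and the true coefficients are all assumptions chosen for illustration.

```r
set.seed(123)  # reproducible simulation
n <- 200

# Simulated predictors; y depends on x1 only, x2 is unrelated noise
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 + 3 * sim$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)

# Coefficient table: estimate, standard error, t value, and p-value per term
coef_table <- summary(fit)$coefficients
coef_table

# Proportion of variance in y explained by the model
summary(fit)$r.squared
```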

Step 7: Significance Testing

Using a two-tailed t-test, we assess whether specific variables (e.g., Q52J, Q19A, Q1, and Q101) are significant, that is, whether their t-statistics fall within the rejection region. Further variables can be tested in the same way as needed.
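
The rejection-region logic can be sketched as follows, again on simulated data because the survey variables are not available here. The variable names, the seed, and the 5% significance level are assumptions for this example. A coefficient is significant when the absolute value of its t-statistic exceeds the critical value of the t distribution with the model's residual degrees of freedom.

```r
set.seed(2024)  # reproducible simulation
n <- 150

# Simulated data: y depends on x1, while x2 has no true effect
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)
coefs <- summary(fit)$coefficients

alpha <- 0.05                      # assumed significance level
df_resid <- df.residual(fit)       # n minus number of estimated coefficients

# Two-tailed critical value: alpha is split equally between both tails
t_crit <- qt(1 - alpha / 2, df = df_resid)

# A term lies in the rejection region when |t| exceeds the critical value
t_values <- coefs[, "t value"]
in_rejection_region <- abs(t_values) > t_crit

# Equivalent two-tailed p-values computed by hand
p_manual <- 2 * pt(-abs(t_values), df = df_resid)

data.frame(t_value = t_values, p_value = p_manual,
           significant = in_rejection_region)
```

The hand-computed p-values should agree with the Pr(>|t|) column of summary(), which is a useful cross-check that the test is set up as two-tailed.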
