# Using R-Programming for Statistical Data Analysis Assignments

Data structures form the foundation of data analysis in R. Vectors are the most fundamental one: they are homogeneous and one-dimensional. The functions c(), rep() and seq() create vectors: c() combines values into a single vector (coercing them to a common type), rep() repeats elements, and seq() generates a sequence. Square brackets [] are used for accessing, or subsetting, a vector, either by specifying positions (such as a range of indices) or by conditional selection. R can also import data from, and export data to, other formats. read.csv() reads CSV files, while read.delim() reads tab-delimited files; the more general read.table() takes a sep argument for other delimiters such as spaces. Additional importing tools can be found in the readr package.

## How is data stored in R?
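
Before moving on to data frames, the vector and import functions described in the introduction can be sketched as follows. This is a minimal illustration rather than a prescribed workflow: the variable names and the temporary CSV file are made up for the example.

```r
# Create vectors with c(), seq() and rep()
scores <- c(72, 85, 91, 64)              # combine values of one type
idx    <- seq(1, 10, by = 3)             # sequence from 1 to 10 in steps of 3
flags  <- rep(c("a", "b"), times = 2)    # repeat the pair twice

# Subset by position (a range of indices) and by condition
first_two <- scores[1:2]
high      <- scores[scores > 80]

# Round-trip a small data set through a CSV file.
# csv_path is a hypothetical temporary location, not a path from the article.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:4, score = scores), csv_path, row.names = FALSE)
imported <- read.csv(csv_path)
```

Writing to tempfile() keeps the example self-contained; in an assignment the path would point at the actual data file.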

Datasets in R are stored in a rectangular format known as a data frame, which resembles a matrix. Unlike a matrix, a data frame can hold columns of different types, but all columns must be of equal length. The View() function opens a data set in a viewer, while head() and tail() show its first and last rows respectively. Data frames are subset using the matrix notation [rows, columns]. The \$ operator selects a single column by name, and [] can be used for further subsetting of the result. colnames(data_frame) yields the column names, and new names can be assigned to columns through the same function. dim() and str() return the number of rows and columns, and the object structure, respectively. Variables are added to a data frame by assigning them to a new column, for example data_frame\$new_var <- values.
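
The data frame operations above can be tried on a small example. The students data frame and its column names are invented for illustration only.

```r
# A small data frame with columns of different types but equal length
students <- data.frame(
  name  = c("Ana", "Ben", "Cy", "Dee"),
  group = c("A", "B", "A", "B"),
  score = c(72, 85, 91, 64)
)

head(students, 2)      # first two rows
tail(students, 2)      # last two rows
dim(students)          # number of rows and columns
str(students)          # column types and a preview of the values
colnames(students)     # column names

# Matrix-style subsetting: [rows, columns]
students[1:2, c("name", "score")]

# Conditional row selection, and $ followed by [] for further subsetting
group_a     <- students[students$group == "A", ]
high_scores <- students$score[students$score > 80]

# Add a new variable as a column, then rename an existing column
students$passed <- students$score >= 70
colnames(students)[3] <- "exam_score"
```

View(students) is left out of the sketch because it opens an interactive viewer, which is only useful in a session such as RStudio.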

## What packages are essential when analyzing data in R?

The dplyr package provides data management functions used to prepare data for analysis. filter() subsets rows based on a condition. select() keeps the variables needed in a dataset. Base R's rbind() appends data frames row-wise as long as the variables are the same between the datasets (dplyr's counterpart is bind_rows()). dplyr's inner_join() and base R's merge() combine the columns of two data frames by matching rows on key variables. The by= argument specifies the key variable(s) to match on. NA represents missing values. Many summary functions, such as mean(), skip missing values when the argument na.rm is set to TRUE.
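
A short sketch of these data management steps is given below. It assumes the dplyr package is installed; the two data frames and their columns are hypothetical examples.

```r
library(dplyr)

# Hypothetical example data
students <- data.frame(
  id    = 1:4,
  group = c("A", "B", "A", "B"),
  score = c(72, NA, 91, 64)
)
contacts <- data.frame(
  id    = c(1L, 2L, 4L),
  email = c("ana@example.org", "ben@example.org", "dee@example.org")
)

# dplyr: keep rows meeting a condition, keep only the needed columns
group_a <- filter(students, group == "A")
slim    <- select(students, id, score)

# Base R: append rows from a data frame with the same variables
more         <- data.frame(id = 5:6, group = c("A", "B"), score = c(80, 77))
all_students <- rbind(students, more)

# Merge columns from two data frames, matching rows on the key "id"
joined <- inner_join(students, contacts, by = "id")   # dplyr
merged <- merge(students, contacts, by = "id")        # base R

# Missing values propagate unless na.rm = TRUE is set
mean(students$score)                  # NA, because one score is missing
mean(students$score, na.rm = TRUE)    # mean of the non-missing scores
```

Both inner_join() and merge() with default settings keep only the ids present in both data frames, so the student with id 3 drops out of the result.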

mean(), sd(), var() and median() return the mean, standard deviation, variance and median respectively. The summary() function, when applied to a numeric vector, returns the minimum, first quartile, median, mean, third quartile and maximum. cor() returns a correlation coefficient for two vectors, or a correlation matrix for a data frame or matrix of numeric columns, and can be used to assess whether two continuous variables are linearly related. table() produces a frequency table of a categorical variable, and prop.table() expresses those frequencies as proportions. Given two categorical variables, table() and prop.table() produce cross-tabulations, which serve as tools for exploring the relationships between them.
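
The descriptive functions above can be illustrated with a few made-up vectors. The values and variable names are assumptions chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical measurements for five people
height <- c(150, 160, 165, 170, 180)
weight <- c(50, 58, 63, 70, 79)
smoker <- c("no", "yes", "no", "no", "yes")
sex    <- c("F", "F", "M", "M", "M")

# Numeric summaries
mean(height)
sd(height)
var(height)
median(height)
summary(height)        # min, 1st quartile, median, mean, 3rd quartile, max

# Linear association between two continuous variables
r <- cor(height, weight)

# Frequencies and proportions for one categorical variable
freq <- table(smoker)
prop.table(freq)

# Cross-tabulation of two categorical variables, with row proportions
cross     <- table(sex, smoker)
row_props <- prop.table(cross, margin = 1)
```

The margin argument of prop.table() controls whether proportions are taken within rows (1), within columns (2), or over the whole table (when omitted).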

The stats package provides a set of tools for statistical analysis. chisq.test() tests the independence of two categorical variables. In a model formula, the symbol * requests both the main effects and the interaction between two variables, for example y ~ a*b. The independent-samples t-test, which compares the mean of a normally distributed variable between two groups, is conducted using the function t.test(). lm() fits a linear regression model. Extractor functions, such as coef() for coefficients, pull out specific pieces of information rather than the full model output. anova() compares the fit of nested models, allowing one to judge whether adding or removing variables is worthwhile; it uses an F-test for lm() models by default and can perform a likelihood ratio test for glm() models with test = "Chisq". Calling plot() on a fitted model returns regression diagnostics (residuals vs fitted, normal Q-Q plot of residuals, scale-location, and residuals vs leverage) that can be used to check regression assumptions. The glm() function fits generalized linear models, including logistic regression.
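
The tests and models above can be sketched on R's built-in mtcars data set. The choice of mtcars and of these particular variables is an assumption made for illustration; it is not taken from the article.

```r
data(mtcars)   # built-in example data: fuel economy and design of 32 cars

# Chi-squared test of independence between two categorical variables
chi <- chisq.test(table(mtcars$am, mtcars$vs))

# Independent-samples t-test: mean mpg by transmission type (am = 0 or 1).
# t.test() uses the Welch version unless var.equal = TRUE is set.
tt <- t.test(mpg ~ am, data = mtcars)

# Linear regression; wt * hp expands to wt + hp + wt:hp
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_full  <- lm(mpg ~ wt * hp, data = mtcars)

coef(fit_full)                 # extract only the coefficients
anova(fit_small, fit_full)     # compare the nested models (F-test)

# Four diagnostic plots for the fitted model
old_par <- par(mfrow = c(2, 2))
plot(fit_full)
par(old_par)

# Logistic regression as a generalized linear model
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
anova(logit_fit, test = "Chisq")   # likelihood ratio test for the wt term
```

Storing the results in objects such as chi and tt makes it possible to pull out single values later, for example tt$p.value.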

## Data Visualization in R

Data visualization is often the last step of statistical data analysis. Data visualization functions include plot() for scatter plots, hist() for histograms, boxplot() for boxplots of the variable on the left side of a formula by the one on the right side, and barplot() to show the frequencies of a variable. The ggplot2 package can be used to generate publication-worthy graphics. ggplot2 uses variations of the syntax ggplot(dataset, aes(x = xvar, y = yvar)) + geom_function().
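
As a rough illustration of these plotting functions, the sketch below draws each plot type from the built-in mtcars data set. It assumes the ggplot2 package is installed, and the data set and variables are illustrative choices rather than ones named in the article.

```r
library(ggplot2)
data(mtcars)

# Base R graphics
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")   # scatter plot
hist(mtcars$mpg, main = "Distribution of mpg")                # histogram
boxplot(mpg ~ cyl, data = mtcars)      # mpg (left side) by cyl (right side)
barplot(table(mtcars$cyl))             # frequencies of a categorical variable

# ggplot2: data set, aesthetic mapping, then a geom layer
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
print(p)
```

Swapping geom_point() for another geom, such as geom_smooth() or geom_boxplot(), changes the plot type while the ggplot() call stays the same.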