## Survival Analysis with R

### 1. Data Modification

### 2. Plotting and Modelling with Parametric Setup

### Kaplan-Mayer estimation of Survival Curve

#### Cox-Regression

## An Illustrated example with R

Step 1:

You observe that we have a variable ‘time’. This is our variable of interest that we wish to study. Look at another variable, ‘status.’ This is important while estimating parameters or survival function, as this indicates whether the observation is censored or not (1 indicates uncensored).

Step 2:

hist(veteran$time)

The above function helps us to plot the data and look at the distribution.

You can observe the sharp edge at the lowest values, which is an indication of the highly skewed data. Thus in this situation, it is wise to fit an Exponential distribution.

Next we get the estimated λ and the corresponding estimated survival curve.

lambda=sum(veteran$status==1)/sum(veteran$time)

mean=1/lambda

xx=sort(veteran$time)

ss=sort(exp(-lambda*xx),decreasing=TRUE)

plot(xx,ss,"l",xlab="Time",ylab="S(t)",main="Plot of Survival Function",col="blue",lwd=2)

You observe that the data fitted well and you can find the quantiles and mean to interpret on the time variable. You may like to point out some qauntiles using the abline function.

abline(h=0.4,lty=2)

Step 3:

This is where we will discuss about non-parametric estimation of Kaplam-Mayer.

km_fit <- survfit(Surv(time,status) ~1 , data=veteran)

summary(km_fit, times = c(0,1,20,30,40,50,60,70,80,90*(1:10)))

plot(km_fit)

The function survfit is an inbuilt function of the package survival. The first element requires the time variable and the status of being censored or not. The second part, of course, needs the name of the dataset. You can find a summary of the result, as presented below.

This gives us a detailed analysis of the number of items at risk at a particular point, the survival probabilities, their standard errors, and the confidence intervals of the survival probabilities.

Next, you get the K-M plot.

This is the non-parametric version of the estimated Survival function. You see that the 95% confidence interval is also plotted to get a better understanding. Also, you can point out the quantiles like the parametric estimation again with the help of the abline function.

If you want to plot the K-M function for two different groups on the same plot, you can modify the survfit function as follows, km_fit<- survfit(Surv(time,status) ~group_variable , data=veteran).

For an interactive plot, you can install ggplot and use the function autoplot.

Step 4:

Here we discuss the regression part of survival analysis.

cox <- coxph(Surv(time, status) ~ celltype+karno+age , data = veteran)

summary(cox)

cox_fit <- survfit(cox)

plot(cox_fit)

We chose celltype, Karno, and age as the explanatory variables and wished to fit the hazard function based on them. We get the summary here as follows.

Celltype was a categorical variable, and coefficients were obtained for each category. We observe the p-values under column Pr(>|z|) to determine the important variables (those whose p-values are very low). Another table is shown to give the exponential of the coefficients whose interpretation is given before. The test result given at the end indicates the overall performance of the model (lower the p-value, better is the model).

We also obtain the survival curve based on the estimated model.

This plot is also accompanied by the confidence intervals. You can make visually attractive by using autoplot.

That’s the basics about Survival Analysis in R, and with this, you have successfully completed the first survival analysis.