Survival Analysis with R 

The type of data analysis problem we encounter daily may sometimes include data that records time to occurrence of a certain event. For example, say we are testing the lifetime of a certain category of bulb light. That is, we want to record how much time the bulb light functions fully. To do this, we arrange 500 bulb lights, lit them at the same time, and record the time when any bulb stops working. It may happen that we do not have an infinite amount of time and hence want to stop, say after 5 days. The bulbs that are still working after 5 days were not recorded till they go off. The corresponding time of observations for these bulbs is recorded as 5 and is called censored observations. The variable of interest here is thus the time that we recorded for each of these 500 bulbs. We might use all these observations to predict the lifetime of a new bulb or just simply observe the mean, median, or any other quantile of the lifetimes.
Let us now see the different methods to work on the survival data.

1. Data Modification

To work in Survival Analysis in R, you should need to create a column indicating whether the observation is censored or not. Usually, the column is named ‘Status,’, and 1 indicates uncensored data, while 0 indicates censored. If your data contain more than 20% of censored observations, it is advisable not to proceed with the analysis further.

2. Plotting and Modelling with Parametric Setup

Generally, survival data can be modelled as Exponential, Weibull, or Gamma distribution. It is always advisable to first plot the data and look at the distribution. If it is extremely positively skewed, feel assured to fit an Exponential distribution. If otherwise, try with Weibull or Gamma.
After you detect the distribution, you can go for the parameter estimation. If the distribution is Exponential, the sole parameter λ can be estimated as
Where d is the number of observed data, x’s are the observed data, T is the time of censoring, and n is the total number of observations (censored or not).
For Weibull and Gamma distribution, more complex and iterative methods are used for estimation. It is to be noted that the R package does not provide any method of parametric estimation. Hence everything needs to be coded.
The parametric method is particularly useful if you wish to know the mean-time of survival, which is 1/λ for the case of Exponential distribution. Similarly, one can find the survival function as e^(-λ) and plot and compare it with the original plot. From the survival curve, it is easy to determine the median and other quantiles.

 Kaplan-Mayer estimation of Survival Curve 

Kaplan-Mayer estimation is the non-parametric method of estimation of the survival function. It is typically a decreasing function of time, which tells us the probability of survival after time t. It is denoted by S(t) and is very helpful in answering questions like ‘After what time did only 5% of the units survived?’ or ‘Within what time did 80% of the units ceased to exist?’.
Another way of working with K-M estimation is to compare groups. Suppose you wish to compare the survival time of men and women. You create two different K-M plots for each gender and plot them on the same graph. If the curve for Females looks higher, one can conclude that on overall analysis, women tend to have greater survival time.

 Cox-Regression

Another important thing to do about survival analysis is to find the important factors that affect survival time. This is simply the regression technique for Survival data.
Here instead of modelling the time variable itself, we model the hazard rate, which can be roughly translated as the probability of survival just at that time point. The model can be briefly stated as

Here, of course, z is the explanatory variable(s), and λ0(t) is just a baseline hazard function, which is not of much importance for analysis. β’s are coefficients that are estimated, and they have a good interpretation. The e^βis the amount of change in hazard function with a unit change in the covariate, z. Thus the statistical significance of β is very important to conclude on its influence on the hazard function.

An Illustrated example with R 

Before going into an analysis make sure that you install the survival package of R. It has many inbuilt tools that will help you with the analysis.
We will illustrate with the help of an inbuilt R dataset veteran which is under the survival package
library(survival)
head(veteran)


Step 1:

You observe that we have a variable ‘time’. This is our variable of interest that we wish to study. Look at another variable, ‘status.’ This is important while estimating parameters or survival function, as this indicates whether the observation is censored or not (1 indicates uncensored).

Step 2:

hist(veteran$time)

The above function helps us to plot the data and look at the distribution.

You can observe the sharp edge at the lowest values, which is an indication of the highly skewed data. Thus in this situation, it is wise to fit an Exponential distribution.

Next we get the estimated λ and the corresponding estimated survival curve.

lambda=sum(veteran$status==1)/sum(veteran$time)

mean=1/lambda

xx=sort(veteran$time)

ss=sort(exp(-lambda*xx),decreasing=TRUE)

plot(xx,ss,"l",xlab="Time",ylab="S(t)",main="Plot of Survival Function",col="blue",lwd=2)


You observe that the data fitted well and you can find the quantiles and mean to interpret on the time variable. You may like to point out some qauntiles using the abline function.

abline(h=0.4,lty=2)

Step 3:

This is where we will discuss about non-parametric estimation of Kaplam-Mayer.

km_fit <- survfit(Surv(time,status) ~1 , data=veteran)

summary(km_fit, times = c(0,1,20,30,40,50,60,70,80,90*(1:10)))

plot(km_fit)

The function survfit is an inbuilt function of the package survival. The first element requires the time variable and the status of being censored or not. The second part, of course, needs the name of the dataset. You can find a summary of the result, as presented below.


This gives us a detailed analysis of the number of items at risk at a particular point, the survival probabilities, their standard errors, and the confidence intervals of the survival probabilities.

Next, you get the K-M plot.


This is the non-parametric version of the estimated Survival function. You see that the 95% confidence interval is also plotted to get a better understanding. Also, you can point out the quantiles like the parametric estimation again with the help of the abline function.

If you want to plot the K-M function for two different groups on the same plot, you can modify the survfit function as follows, km_fit<- survfit(Surv(time,status) ~group_variable , data=veteran).

For an interactive plot, you can install ggplot and use the function autoplot.

Step 4:

Here we discuss the regression part of survival analysis.

cox <- coxph(Surv(time, status) ~ celltype+karno+age , data = veteran)

summary(cox)

cox_fit <- survfit(cox)

plot(cox_fit)

We chose celltype, Karno, and age as the explanatory variables and wished to fit the hazard function based on them. We get the summary here as follows.


Celltype was a categorical variable, and coefficients were obtained for each category. We observe the p-values under column Pr(>|z|) to determine the important variables (those whose p-values are very low). Another table is shown to give the exponential of the coefficients whose interpretation is given before. The test result given at the end indicates the overall performance of the model (lower the p-value, better is the model).

We also obtain the survival curve based on the estimated model.


This plot is also accompanied by the confidence intervals. You can make visually attractive by using autoplot.

That’s the basics about Survival Analysis in R, and with this, you have successfully completed the first survival analysis.