Predict survival on the Titanic using Random Forest

May 16, 2019

In this post, we will walk step by step through using a random forest machine learning model to predict survival on the Titanic training data.

Importing libraries: we will use the following libraries
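A minimal sketch of the imports, assuming pandas for data handling and scikit-learn for the random forest. The exact set is an assumption based on the steps that follow:

```python
# Assumed imports: inferred from the pd.read_csv call and the random forest
# model mentioned in the text, not confirmed by the original post.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
```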

Load data: let's load the data by passing each file path to pd.read_csv and saving the results as train and test.
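As a rough sketch of this step, the snippet below reads two tiny in-memory CSVs. The sample rows are made up so the example runs without the real files; with the actual data you would pass the file paths instead (something like "train.csv" and "test.csv", which are assumed names).

```python
import io
import pandas as pd

# Hypothetical stand-ins for the real files. With the actual data you would
# write pd.read_csv("train.csv") and pd.read_csv("test.csv") (assumed paths).
train_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Cabin,Embarked\n"
    "1,0,3,male,22,,S\n"
    "2,1,1,female,38,C85,C\n"
    "3,1,3,female,,,Q\n"
)
test_csv = io.StringIO(
    "PassengerId,Pclass,Sex,Age,Cabin,Embarked\n"
    "4,3,male,,,Q\n"
    "5,1,female,47,B45,S\n"
)

train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)

# The test set has no Survived column: that is what we want to predict.
print(train.shape, test.shape)
```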

Let's check our data using .info(). There are some missing values in the Embarked, Cabin, and Age columns.
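To illustrate what this check reports, here is a small sketch on a made-up frame (the values are illustrative, not the real Titanic rows). .info() prints each column's dtype and non-null count, and .count() gives the same non-null counts as a Series.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in the same three columns as the real dataset.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# Prints dtypes and non-null counts; a count below len(train) means
# that column has missing values.
train.info()

non_null = train.count()
print(len(train) - non_null)  # missing values per column
```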

It's time to do some data cleaning. Let's drop the two rows with missing Embarked values from our train data.
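One way to do this, sketched on toy data, is dropna with the subset argument, which removes only the rows where Embarked is null and ignores gaps in other columns:

```python
import pandas as pd

# Toy frame: two of the four rows have no Embarked value.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Embarked": ["S", None, "C", None],
})

# Drop rows where Embarked is missing; other columns are not considered.
train = train.dropna(subset=["Embarked"]).reset_index(drop=True)
print(train)
```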

Now it's time to check the null values in the Age column of both the train and test data.
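A short sketch of that check, using made-up ages: isnull() flags each missing entry and sum() counts them.

```python
import numpy as np
import pandas as pd

# Toy Age columns for illustration only.
train = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan]})
test = pd.DataFrame({"Age": [34.5, np.nan, 62.0]})

train_missing = train["Age"].isnull().sum()
test_missing = test["Age"].isnull().sum()
print("train:", train_missing, "test:", test_missing)
```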

Since imputation is not the focus here and doing it properly would take a long time, let's simply set all missing ages to 999, so that later on we can tell which values we filled in.
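A minimal sketch of the sentinel fill on toy data. The constant name SENTINEL_AGE is introduced here for illustration only:

```python
import numpy as np
import pandas as pd

SENTINEL_AGE = 999  # hypothetical name; the value 999 comes from the text

train = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan]})
test = pd.DataFrame({"Age": [34.5, np.nan, 62.0]})

# Apply the same fill to both sets so they stay consistent.
train["Age"] = train["Age"].fillna(SENTINEL_AGE)
test["Age"] = test["Age"].fillna(SENTINEL_AGE)

# Because no real passenger is 999 years old, filled rows are easy to find.
filled_rows = train[train["Age"] == SENTINEL_AGE]
print(len(filled_rows))
```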

Since the Cabin column has so many missing values, let’s binarize it: 1 if there was a value and 0 if it was null.
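One possible way to binarize the column, shown on made-up cabin values: notnull() gives True or False per row, and astype(int) turns that into 1 or 0.

```python
import pandas as pd

# Toy Cabin column: two known cabins, two missing.
train = pd.DataFrame({"Cabin": ["C85", None, "E46", None]})

# 1 where a cabin was recorded, 0 where it was null.
train["Cabin"] = train["Cabin"].notnull().astype(int)
print(train["Cabin"].tolist())
```

The same line would be applied to the test data so that both sets share the encoding.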

Let's check our data again.

Now that our data looks clean, we can start feature engineering. In this case, we use dummy variables: we call the pd.get_dummies function and pass it the columns we want to encode.

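Let’s dummy the Sex and Embarked columns. pd.get_dummies replaces each original column with new 0/1 dummy columns, and drop_first=True drops the first category of each (for example, Sex_female), since it is implied by the remaining ones.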
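Here is a small sketch of the dummy encoding on toy data. Categories are sorted alphabetically, so with drop_first=True the dropped levels are female for Sex and C for Embarked:

```python
import pandas as pd

# Toy frame containing every category of Sex and Embarked.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# The listed columns are replaced by dummy columns; the first level of
# each (female, C) is dropped because the others already imply it.
train = pd.get_dummies(train, columns=["Sex", "Embarked"], drop_first=True)
print(train.columns.tolist())
```

A row with Embarked_Q and Embarked_S both false is therefore a passenger who embarked at C.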

Now it’s time to prepare our data for modeling. Our features (X) will be all the remaining columns, and our target (y) will be the Survived column. We can add as many features as we want to X, as long as they are useful for predicting y.
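A minimal sketch of the split, assuming a frame that has already been cleaned and encoded as in the earlier steps (the rows here are made up). The target is removed from X so the model cannot see the answer it is supposed to predict.

```python
import pandas as pd

# Toy frame standing in for the cleaned, fully numeric train data.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, 999.0, 35.0],
    "Cabin": [0, 1, 0, 0],
    "Sex_male": [1, 0, 0, 1],
})

# Features: every column except the target.
X = train.drop(columns=["Survived"])
# Target: the column we want to predict.
y = train["Survived"]

print(X.shape, y.shape)
```

With the real dataset, any remaining text columns would need to be dropped or encoded first, since scikit-learn's random forest expects numeric input.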