Python is one of many programming languages in use today, and it is commonly used to build websites and software or to analyse data. As a general-purpose language, it can be used for many types of programming, not just web development: backend development, building software and writing scripts, for example. This versatility makes it an attractive option. It is also open source, so anyone can access and use it for free.
In this blog article, we are going to show you how to present a story and draw conclusions from a dataset by building data visualizations with Python. We are going to use data to tell a story about the Titanic and answer the question “What sort of people were likely to survive?” It seems some groups were more likely to survive than others, so we will look for patterns in the dataset to help answer this question. We will work with two libraries in today’s workshop, Matplotlib and Seaborn, and we will share some other useful materials with you at the end. All images are screenshots from the workshop, so please feel free to watch along for full details.
What is a dataset?
A dataset is simply a collection of data, usually displayed as a table. The dataset we will be using today is a collection of data about the passengers of the Titanic:
Each row of our dataset table represents an individual passenger of the Titanic. Each column represents a feature of that passenger. There are 15 columns in total in our dataset, but we will not use all of them in our workshop today.
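To make the table structure concrete, here is a minimal sketch that builds a tiny table with pandas. The rows are made up for illustration and are not real passenger records; only the column names (`survived`, `pclass`, `sex`, `age`, `who`) follow the columns mentioned in this article. We are assuming the workshop data matches the Titanic sample dataset that Seaborn can download with `sns.load_dataset("titanic")`.

```python
import pandas as pd

# Made-up rows for illustration only (NOT real passenger records).
# Assumption: the full workshop dataset could instead be loaded with
#   import seaborn as sns
#   titanic = sns.load_dataset("titanic")  # needs an internet connection
titanic = pd.DataFrame(
    {
        "survived": [1, 0, 1, 0, 0, 0, 1, 0],
        "pclass": [1, 1, 2, 2, 3, 3, 3, 3],
        "sex": ["female", "male", "female", "male", "male", "male", "female", "male"],
        "age": [38.0, 54.0, 27.0, None, 22.0, 35.0, 4.0, None],
        "who": ["woman", "man", "woman", "man", "man", "man", "child", "man"],
    }
)

n_rows, n_cols = titanic.shape  # rows = passengers, columns = features
print(f"{n_rows} passengers, {n_cols} features")
print(titanic.head())  # first five rows of the table
```

With the real dataset, the same `shape` check is a quick way to confirm the number of passengers and the 15 columns.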
Important functions to note
This function gives us a concise overview of all the data in the dataset. This is very useful when doing exploratory analysis because it shows how many non-missing values each column has, so we can spot columns with missing data. Using the info function we can see, for example, that our dataset only has age data for 714 of our passengers.
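As a small sketch of the info function described above, the example below calls pandas' `DataFrame.info()` on a made-up table in which two ages are missing on purpose. The values are illustrative only.

```python
import pandas as pd

# Made-up rows; two of the four ages are missing on purpose.
titanic = pd.DataFrame(
    {"pclass": [1, 2, 3, 3], "age": [38.0, None, 22.0, None]}
)

# Prints one line per column with its non-null count and data type.
titanic.info()

# The same non-null count, computed directly for the age column.
age_known = int(titanic["age"].count())
print(age_known)
```

A non-null count lower than the number of rows is the signal that a column has missing data.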
This function calculates summary statistics, such as the percentiles, mean and standard deviation (std), for the numerical columns of the dataset.
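The function is not named in the text above. The statistics it lists match pandas' `DataFrame.describe()`, so the sketch below assumes that is the function meant. The rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; one age is missing.
titanic = pd.DataFrame(
    {"pclass": [1, 2, 3, 3], "age": [20.0, 30.0, 40.0, None]}
)

# One column of statistics per numerical column:
# count, mean, std, min, 25%, 50%, 75%, max.
stats = titanic.describe()
print(stats)

# Individual statistics can be read out by row label and column name.
mean_age = stats.loc["mean", "age"]
print(mean_age)
```

Missing values are left out of the calculation, which is why the count for `age` is lower than the number of rows.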
Analyzing the data
Let's start by analyzing the class of the passengers. There were three classes of passengers on the Titanic and taking a look at the numbers of each might give us some insight as to which was more likely to survive...
We call the dataframe “titanic” and add the ‘pclass’ column in square brackets to display the passenger class column.
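The square-bracket selection described above can be sketched as follows, again on a few made-up rows rather than the real data:

```python
import pandas as pd

# Made-up rows for illustration only.
titanic = pd.DataFrame(
    {
        "pclass": [1, 1, 2, 2, 3, 3, 3, 3],
        "age": [38.0, 54.0, 27.0, None, 22.0, 35.0, 4.0, None],
    }
)

# Square brackets with a column name return that single column (a Series).
pclass = titanic["pclass"]
print(pclass)

# The distinct passenger classes found in the column.
classes = sorted(pclass.unique().tolist())
print(classes)
```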
Visualise the data
Next, we want to plot some graphs and visualise the passenger class data using Matplotlib. The arguments we pass to the plotting command determine which values are plotted, as well as the format and colour of our data points.
Even with the correct syntax entered, we can see that a scatter graph is not the right type of graph to display our data. Because the passenger class column contains only three distinct values, the points of the scatter graph are restricted to just a few lines.
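The exact command from the workshop is not reproduced here, so the following is only a sketch of one possible first attempt. It assumes we plot each row's position against its `pclass` value as red circles, using made-up rows. The `Agg` backend and the file name are our own choices so the figure is written to a file instead of a window.

```python
import matplotlib

matplotlib.use("Agg")  # draw to a file instead of opening a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only.
titanic = pd.DataFrame({"pclass": [1, 1, 2, 2, 3, 3, 3, 3]})

fig, ax = plt.subplots()
# x = row position, y = passenger class; color and marker set the style.
points = ax.scatter(titanic.index, titanic["pclass"], color="red", marker="o")
ax.set_xlabel("row number")
ax.set_ylabel("pclass")
fig.savefig("pclass_scatter.png")  # hypothetical output file name
```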
Now that we know a scatter plot is not useful for our needs, let's try a different method. To count the number of passengers in each class, maybe a bar graph will give us a better visualisation.
Great! Now we can see the number of passengers in each class and we can start to draw some insights: most of our passengers are from the 3rd class, and the fewest passengers were in the 2nd class. But wait, maybe this still isn't clear enough for someone looking at our graph. We can fix that by editing the parameters in our plotting command to increase the size and add a title and even a border to our graph:
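A bar graph needs one number per class, so a sketch of this step first counts the rows in each class with `value_counts()` and then passes the counts to Matplotlib's `bar`. The rows are made up, and the output file name is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # draw to a file instead of opening a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only.
titanic = pd.DataFrame({"pclass": [1, 1, 2, 2, 3, 3, 3, 3]})

# Count passengers per class and sort by class number (1, 2, 3).
counts = titanic["pclass"].value_counts().sort_index()
print(counts)

fig, ax = plt.subplots()
# One bar per class; class labels are converted to text for the x-axis.
bars = ax.bar(counts.index.astype(str), counts.values)
ax.set_xlabel("pclass")
ax.set_ylabel("number of passengers")
fig.savefig("pclass_bar.png")  # hypothetical output file name
```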
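The styling changes mentioned above might look like the sketch below. The specific size, title text and border colour are our own assumptions, not the values used in the workshop, and the rows are made up.

```python
import matplotlib

matplotlib.use("Agg")  # draw to a file instead of opening a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only.
titanic = pd.DataFrame({"pclass": [1, 1, 2, 2, 3, 3, 3, 3]})
counts = titanic["pclass"].value_counts().sort_index()

# figsize sets the width and height of the figure in inches.
fig, ax = plt.subplots(figsize=(8, 5))
# edgecolor and linewidth draw a border around each bar.
ax.bar(counts.index.astype(str), counts.values, edgecolor="black", linewidth=1.5)
ax.set_title("Passengers per class")
ax.set_xlabel("pclass")
ax.set_ylabel("number of passengers")
fig.savefig("pclass_bar_styled.png")  # hypothetical output file name
```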
That's much better. Now we have a clear visualisation of the distribution of passengers in each class. Next we want to analyze the gender of our passengers, so we will use the same method to create a bar chart of each gender:
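For the gender chart, one option is Seaborn's `countplot`, which counts the rows per category and draws the bars in a single call. Whether the workshop used `countplot` or the Matplotlib approach from before is an assumption on our part, and the rows are made up.

```python
import matplotlib

matplotlib.use("Agg")  # draw to a file instead of opening a window
import pandas as pd
import seaborn as sns

# Made-up rows for illustration only.
titanic = pd.DataFrame(
    {"sex": ["female", "male", "female", "male", "male", "male", "female", "male"]}
)

# The numbers behind the chart.
sex_counts = titanic["sex"].value_counts()
print(sex_counts)

# countplot draws one bar per distinct value in the "sex" column.
ax = sns.countplot(data=titanic, x="sex")
ax.set_title("Passengers by gender")
ax.figure.savefig("sex_count.png")  # hypothetical output file name
```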
We can see that most of the passengers are male. So we can already form an assumption that the largest group of passengers was males in the 3rd class. We can use the “who” column in our dataset to filter the passengers by “man”, “woman” or “child”:
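Filtering on the “who” column can be sketched with a boolean condition inside square brackets. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
titanic = pd.DataFrame(
    {
        "who": ["woman", "man", "woman", "man", "man", "man", "child", "man"],
        "pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    }
)

# Keep only the rows where the "who" column equals "man".
men = titanic[titanic["who"] == "man"]
print(men)

# Or count all three groups at once.
who_counts = titanic["who"].value_counts()
print(who_counts)
```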
So let's check our previous assumption that the largest group of passengers is males in the 3rd class. To do this we can make some simple adjustments to our graph plot command to show the three “who” groups (man, woman, child) along with the class they are in:
We can clearly see our first assumption was correct: the largest group of passengers on the Titanic was males with a 3rd class ticket! Let's now look at the age of the passengers to further our investigation:
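A sketch of this grouping, assuming Seaborn's `countplot` with the `hue` parameter is used to split each class bar by the “who” column. The rows are made up, so only the technique carries over to the real data, not the numbers.

```python
import matplotlib

matplotlib.use("Agg")  # draw to a file instead of opening a window
import pandas as pd
import seaborn as sns

# Made-up rows for illustration only.
titanic = pd.DataFrame(
    {
        "pclass": [1, 1, 2, 2, 3, 3, 3, 3],
        "who": ["woman", "man", "woman", "man", "man", "man", "child", "man"],
    }
)

# Size of every (class, who) combination, and the largest one.
group_sizes = titanic.groupby(["pclass", "who"]).size()
print(group_sizes)
largest_group = group_sizes.idxmax()
print(largest_group)

# One cluster of bars per class, one coloured bar per "who" group.
ax = sns.countplot(data=titanic, x="pclass", hue="who")
ax.set_title("Passengers per class, split by who")
ax.figure.savefig("class_by_who.png")  # hypothetical output file name
```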
Unlike our previous categories, there are far more possible values when we look at our passengers by age. This makes the data harder to read and we need to find a way to simplify our visualisation. The answer is to create a histogram. A histogram is an approximate representation of the distribution of numerical data: it groups the ages into ranges (bins) and counts the passengers in each, which gives us a much nicer visualization of our data. Let's try it out on our age query:
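To show what the binning does, the sketch below uses Matplotlib's `hist` on a few made-up ages, with eight bins of ten years each. The bin settings are our own choice for illustration, not the workshop's.

```python
import matplotlib

matplotlib.use("Agg")  # draw to a file instead of opening a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages for illustration only; one value is missing.
titanic = pd.DataFrame(
    {"age": [4, 19, 22, 24, 27, 30, 35, 38, 54, 62, None]}
)
ages = titanic["age"].dropna()  # drop missing ages before binning

fig, ax = plt.subplots()
# Eight bins between 0 and 80, so each bin covers ten years.
counts, edges, _ = ax.hist(ages, bins=8, range=(0, 80), edgecolor="black")
ax.set_xlabel("age")
ax.set_ylabel("number of passengers")
fig.savefig("age_hist.png")  # hypothetical output file name

print(counts)  # passengers per bin
print(edges)   # bin boundaries
```

Each bar now stands for a ten-year age range instead of a single age, which is what makes the distribution readable.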
We can see that most of the people are between 18 and 40, young adults. We are now starting to form a clearer picture of the most common passenger persona, but we still need to investigate further to confirm it. Let's look at the distribution of age per class. We can do this by adding the hue parameter to our plotting command; this will divide our data into subgroups and display our histogram by age and class:
We can clearly see that most of the younger passengers are in the 3rd class whereas the majority of older passengers are in the 1st. So the final step in our passenger class investigation is to put our findings together and plot a graph that shows us the age distribution for each gender.
We now know that the most common passenger profile on the Titanic was male, between 18 and 40, and in the 3rd class. So let’s use the skills we have learned in plotting graphs to start answering the question of who was most likely to survive.
1) How many passengers died?
It is clear that fewer people survived than died.
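The counts behind this kind of statement can be computed from the `survived` column, which we assume holds 0 for passengers who died and 1 for survivors, as in Seaborn's Titanic sample dataset. The rows below are made up.

```python
import pandas as pd

# Made-up rows for illustration only (0 = died, 1 = survived).
titanic = pd.DataFrame({"survived": [1, 0, 1, 0, 0, 0, 1, 0]})

# Replace the 0/1 codes with readable labels, then count each label.
outcome = titanic["survived"].map({0: "died", 1: "survived"})
counts = outcome.value_counts()
print(counts)
```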
2) Which gender has a greater survival rate?
Males have a much lower survival rate than females.
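Because `survived` is coded as 0 or 1 (an assumption based on Seaborn's sample dataset), the mean of the column within a group is that group's survival rate. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration only (0 = died, 1 = survived).
titanic = pd.DataFrame(
    {
        "sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
        "survived": [1, 1, 0, 0, 0, 1, 0, 0],
    }
)

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
rate_by_sex = titanic.groupby("sex")["survived"].mean()
print(rate_by_sex)
```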
3) Survival level of each class. Money can't buy everything?
Sadly, yes it can. It is clear from our plot that third class passengers were much more likely to die.
After analysing our data we can see that passengers in 1st class had the highest survival rate, which suggests they were given a high priority when it came to rescue.
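The same grouping idea gives the survival rate per class, which can then be drawn as a bar chart. This is a sketch on made-up rows, with `survived` again assumed to be coded as 0 or 1; the figures it prints say nothing about the real passengers.

```python
import matplotlib

matplotlib.use("Agg")  # draw to a file instead of opening a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only (0 = died, 1 = survived).
titanic = pd.DataFrame(
    {
        "pclass": [1, 1, 2, 2, 3, 3, 3, 3],
        "survived": [1, 1, 1, 0, 0, 0, 1, 0],
    }
)

# Share of survivors within each passenger class.
rate_by_class = titanic.groupby("pclass")["survived"].mean()
print(rate_by_class)

fig, ax = plt.subplots()
ax.bar(rate_by_class.index.astype(str), rate_by_class.values, edgecolor="black")
ax.set_title("Survival rate per class")
ax.set_xlabel("pclass")
ax.set_ylabel("share of passengers who survived")
fig.savefig("survival_by_class.png")  # hypothetical output file name
```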
In our webinar our expert Karina takes the investigation even further by analyzing our dataset to discover what specific factors were more likely to help each passenger survive. We recommend watching along to see how these factors can be investigated and for some extra materials to help you on your data analytics journey.
Watch it here: