For illustration, I'm going to use a synthetic dataset with the contact information of 500 fictitious subjects from the US. Let's imagine that this is our client base. Here's what the dataset looks like:
clients.head()
As you can see, it includes information on each person's first name, last name, company name, address, city, county, state, zip code, phone numbers, email, and web address. Our first task is to check for missing data. You can use clients.info() to get an overview of the number of complete entries in each of the columns. However, if you want a clearer picture, here's how you can get the percentage of missing entries for each of the features in descending order:
# Getting percentange of missing data for each column
(clients.isnull().sum()/clients.isnull().count()).sort_values(ascending=False)
As you may recall, isnull() returns an array of True and False values that indicate whether a given entry is present or missing, respectively. In addition, True is considered as 1 and False is considered as 0 when we pass this boolean object to mathematical operations. Thus, clients.isnull().sum() gives us the number of missing values in each of the columns (the number of True values), while clients.isnull().count() is the total number of values in each column. After we divide the first value by the second and sort our results in descending order, we get the percentage of missing data entries for each column, starting with the column that has the most missing values. In our example, we see that we miss the second phone number for 51.6% of our clients.
There's a standard way to get a list of unique values for a particular column: clients['state'].unique(). However, if you have a huge dataset with millions of entries, you might prefer a much faster option:
# Checking unique values efficiently
clients['state'].drop_duplicates(keep="first", inplace=False).sort_values()
This way, you drop all the duplicates and keep only the first occurrence of each value. We've also sorted the results to check that each state is indeed mentioned only once.
Often, you might need to join several columns with a specific separator. Here's an easy way to do this:
# Joining columns with first and last name
clients['name'] = clients['first_name'] + ' ' + clients['last_name']
clients['name'].head()
As you can see, we combined the first_name and last_name columns into the name column, where the first and last names are separated by a space.
And what if we need to split columns instead? Here's an efficient way to split one column into two columns using the first space character in a data entry:
# Getting first name from the 'name' column
clients['f_name'] = clients['name'].str.split(' ', expand = True)[0]
# Getting last name from the 'name' column
clients['l_name'] = clients['name'].str.split(' ', expand = True)[1]
Now we save the first part of the name as the f_name column and the second part of the name as a separate l_name column.
Since we've practiced joining and splitting columns, you might have noticed that we now have two columns with the first name (first_name and f_name) and two columns with the last name (last_name and l_name). Let's quickly check if these columns are identical. First, note that you can use equals() to check the equality of columns or even entire datasets:
# Checking if two columns are identical with .equals()
clients['first_name'].equals(clients['f_name'])
True
You'll get a True or False answer. But what if you get False and want to know how many entries don't match? Here's a simple way to get this information:
# Checking how many entries in the initial column match the entries in the new column
(clients['first_name'] == clients['f_name']).sum()
500
We've started with getting the number of entries that do match. Here, we again utilize the fact that True is considered as 1 in our calculations. We see that 500 entries from the first_name column match the entries in the f_name column. You may recall that 500 is the total number of rows in our dataset, so this means all entries match. However, you may not always remember (or know) the total number of entries in your dataset. So, for our second example, we get the number of entries that do not match by subtracting the number of matching entries from the total number of entries:
# Checking how many entries in the initial column DO NOT match the entries in the new column
clients['last_name'].count() - (clients['last_name'] == clients['l_name']).sum()
0
To demonstrate how we can group data efficiently in pandas, let's first create a new column with the providers of email services. Here, we can use the trick for splitting columns that you're already familiar with:
# Creating new columb with the email service providers
clients['email_provider'] = clients['email'].str.split('@', expand = True)[1]
clients['email_provider'].head()
Now let's group the clients by state and email_provider:
# Grouping clients by state and email provider
clients.groupby('state')['email_provider'].value_counts()
We've now got a data frame that uses several levels of indexing to provide access to each observation (known as multi-indexing).
Sometimes, you'll prefer to transform one level of the index (like email_provider) into the columns of your data frame. That's exactly what unstack() does. It's better to explain this with an example. So, let's unstack our code above:
# Moving 'Mail providers' to the column names
clients.groupby('state')['email_provider'].value_counts().unstack().fillna(0)
As you can see, the values for the email service providers are now the columns of our data frame. Now it's time to move on to some other general Python tricks beyond pandas.
Try to leave a function as soon as you know it can do no more meaningful work. Doing this reduces the indentation of your program and makes it more readable. It also allows you to avoid nested if statements.
if positive_case:
if particular_example:
do_something
else:
raise exception
You can test the input in a few ways before carrying out your actions. Another approach is to raise the exception early and to carry out the main action in the else part of the loop.
if not positive_case:
raise exception
if not particular_example:
raise exception
do_something
Now you can see what this block of code is trying to achieve at first glance. You don’t need to follow the chain of logic in the conditionals. Also, you can clearly see when this function would raise an exception.
If you are a maths freak, you would surely love this next tip. You may have used sets in your lower classes. Remember something? Yeah, exactly, Unions and stuff. So, there are people like me who don’t like to use automated softwares sometimes. The reason for that is Security. Let’s take a simple example of Microsoft Excel. Some people tend to use excel only to the group and create a database. They just need that and good security for that. They are not interested in formatting the text, colour and stuff. So, what I do at those times, is I create my own python Programming software stack and create my own database. For some of my security reasons, I prefer Python over MYSql. So, coming back to my point of sets, Sets are extremely useful when creating databases. Especially when you want to find matches, create groups and other similar tasks. Following is a simple example of that.
A = {1, 2, 3, 3} A set([1, 2, 3]) B = {3, 4, 5, 6, 7} B set([3, 4, 5, 6, 7]) A | B set([1, 2, 3, 4, 5, 6, 7]) A & B set([3]) A - B set([1, 2]) B - A set([4, 5, 6, 7]) A ^ B set([1, 2, 4, 5, 6, 7]) (A ^ B) == ((A - B) | (B - A)) True