Back when I spent a lot of my time cleaning datasets, I had a bright yellow sticky note on my desk with lines of numpy and pandas syntax that I used all the time. This post is an aggregation of useful numpy and pandas data manipulations that can serve as a quick reference if you’re already experienced with these libraries, or as an introduction to new functionality if you’re just getting started.
Introduction
Quick intro if you’re new: numpy and pandas go hand-in-hand for tabular data cleaning and manipulation in Python. numpy is for array manipulation. pandas is built on top of numpy and enables you to label the rows and columns of an array and manipulate the labeled arrays in fancy ways. A dataframe is a labeled array, typically abbreviated df.
Imports
import numpy as np
import pandas as pd
Numpy – collecting a lot of arrays into one bigger array
In general, do NOT use np.concatenate, because it leads to a ton of memory fragmentation. I worked for years developing a convolutional neural network PyTorch repository and only late in the game realized that some very odd memory fragmentation and Python garbage collection issues were occurring due to use of np.concatenate. As soon as I updated the code to fill up a pre-allocated array instead of using np.concatenate, the performance improved so dramatically that I felt almost embarrassed.
For np.concatenate, usage went something like this (the examples assume 1d arrays; if they’re 2d you just need to calculate the pre-allocated array shape more carefully, but that’s it):
for idx in range(len(filenames)):
    new_file = np.load(open(filenames[idx],'rb'))
    if idx==0:
        #initialize the aggregated variable using the first file
        all_files_agg = new_file
    else:
        #concatenate subsequent files onto the first array
        all_files_agg = np.concatenate((all_files_agg,new_file))
Whereas if you use a pre-allocated numpy array, you can do this instead, which is way faster and avoids a ton of memory issues:
for idx in range(len(filenames)):
    new_file = np.load(open(filenames[idx],'rb'))
    if idx==0:
        #initialize the aggregated variable using zeros, based on the total size it will be
        shape = len(new_file)*len(filenames)
        all_files_agg = np.zeros(shape)
    #fill in the slice of the pre-allocated array corresponding to this file
    start_idx = len(new_file)*idx
    stop_idx = start_idx + len(new_file)
    all_files_agg[start_idx:stop_idx] = new_file
Creating DataFrames
When you create a dataframe, you can specify the data inside, the column labels, and the index labels:
pd.DataFrame(data = some_array, columns = ['feature_a', 'feature_b', 'feature_c'], index = ['example1', 'example2'])
You can also create an empty dataframe that just has column specifications, and fill in data later:
pd.DataFrame(columns = some_list)
Or you can specify only the index:
pd.DataFrame(index = some_list)
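If you make an empty dataframe like this, one minimal way to fill in rows later (just a sketch, with made-up labels and values) is to assign to df.loc with a new index label:

#empty dataframe with only column labels specified
df = pd.DataFrame(columns = ['feature_a', 'feature_b'])

#add rows one at a time by index label using .loc
df.loc['example1'] = [1.0, 2.0]
df.loc['example2'] = [3.0, 4.0]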
Reading in DataFrames
My most common way to read in a CSV would be by specifying the file path and that the first row was the header:
pd.read_csv('some/path/to/some_filename.csv', header=0)
You can use the sep argument to specify a different separator character than a comma, if you want. There’s also a function called read_table that assumes you have a tab separator.
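For example, reading a pipe-separated file or a tab-separated file might look like this (the file paths are made up):

pd.read_csv('some/path/to/pipe_separated_file.txt', sep='|', header=0)
pd.read_table('some/path/to/tab_separated_file.tsv', header=0)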
pd.read_excel is also fantastic! You can read an Excel file into a pandas DataFrame.
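A minimal sketch (the file and sheet names are made up; reading .xlsx files also requires an engine like openpyxl to be installed):

pd.read_excel('some/path/to/some_workbook.xlsx', sheet_name='Sheet1', header=0)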
Saving DataFrames
The most common way I saved dataframes to a file was with:
df.to_csv('path/to/somewhere.csv', header=True, index=False)
This means the header (column labels) will get saved in CSV, but the index (row labels) will not.
I also sometimes used:
df.to_csv('path/to/somewhere.csv', header=True, index=True)
if I cared about saving the index labels.
You don’t need to include the header and index arguments at all if you just want to implicitly use pandas’ default behavior for these.
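To the best of my recollection, the defaults write both the column labels and the index, so this:

df.to_csv('path/to/somewhere.csv')

behaves like passing header=True, index=True.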
Filtering by one column
There are a TON of different ways to filter data in pandas. Here are a few useful ones:
Filtering a dataframe to pick out only the rows where a certain column takes on value ‘thing’:
df = df[df['Colname']=='thing']
Filtering a dataframe to pick out only the rows where a certain column does NOT take on value ‘thing’:
df = df[df['Colname']!='thing']
Filtering a dataframe to pick out only the rows where a certain column is greater than 0:
df = df[df['Colname']>0]
(you can also do >=, <=, <, and so on…)
Here’s one of my favorites for working with text data. You can filter on whether a value CONTAINS a string:
df = df[df['colname'].str.contains('substring')]
So for example, if we had:
df = df[df['tweet_text'].str.contains('happy')]
we’d be picking out all of the rows where column ‘tweet_text’ contains the word ‘happy’. That means it’ll match ‘happy’, ‘I am happy’, ‘Are you happy?’ and so on.
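A couple of handy options on str.contains, sketched with the same made-up column: case=False makes the match case-insensitive, and na=False treats NaN entries as non-matches instead of blowing up when you use the result as a filter. (The pattern is treated as a regular expression by default; pass regex=False if you want a literal substring match.)

df = df[df['tweet_text'].str.contains('happy', case=False, na=False)]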
Filtering by multiple columns
To filter by two columns at once, combine the conditions with & and wrap each condition in parentheses (there are plenty of Stack Overflow posts on this if you want more detail):
young_males = df[(df['Gender']=='Male') & (df['Year']==2014)]
Here’s another example:
values = all_pfts[(all_pfts['COMPONENT_NAME']==comp_name) & (all_pfts['REFERENCE_UNIT']==ref_units)]
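The parentheses around each condition are required because & and | bind more tightly than comparisons like == in Python. You can also use | for “or” and ~ for “not”; here is a sketch with made-up column names:

either = df[(df['Colname1']=='thing') | (df['Colname2']>0)]
neither = df[~((df['Colname1']=='thing') | (df['Colname2']>0))]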
Detecting NaNs
The pandas and numpy functions for detecting NaNs have similar names:
pandas isna
numpy isnan
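Both produce a boolean mask you can use for filtering or counting. A minimal sketch with a made-up column, assuming the imports above:

#True wherever 'Colname' is NaN
nan_mask = df['Colname'].isna()

#equivalent, using the top-level pandas function
nan_mask = pd.isna(df['Colname'])

#numpy version, for plain arrays
nan_mask = np.isnan(some_array)

#count how many NaNs there are
num_nans = nan_mask.sum()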
Dealing with NaNs
If you want to get rid of NaNs, you have different options:
pandas dropna will enable you to get rid of entire rows or entire columns based on presence of NaNs;
pandas fillna will enable you to fill in NaNs with some other value (like 0, or 19384, or “hello” – whatever you want).
Example use of dropna:
result_val_df = all_pfts[['RESULT_VALUE','RESULT_VALUE_NUM']].dropna(axis='index',how='any')
Note that in pandas and numpy, NaNs are NOT considered equal to each other! For example np.nan != np.nan. That means if you are doing any kind of “==” check with a column that may have NaNs in it, you need to be careful.
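You can see this for yourself in the interpreter:

np.nan == np.nan #False - a NaN is not equal to anything, not even itself
np.nan != np.nan #True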
So, for example, you must fill the NaNs before you can check whether certain rows or columns are equal. E.g. you must first do this:
all_pfts['REFERENCE_UNIT'] = all_pfts['REFERENCE_UNIT'].fillna(value='NO_UNITS')
before you can do this:
relevant = all_pfts[(all_pfts['COMPONENT_NAME']==comp_name) & (all_pfts['REFERENCE_UNIT']==ref_units)]
(in the codebase from which this was pulled, ref_units might be NaN; the equality check above will never match if the missing value is still represented as np.nan instead of the replacement string ‘NO_UNITS’)
Replacing Values
You can replace values using pandas.DataFrame.replace.
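A minimal sketch with made-up column names and values; you can pass a single old/new pair, or a dict of replacements:

#replace one value with another across the whole dataframe
df = df.replace('old_value', 'new_value')

#replace several values at once in a single column
df['Colname'] = df['Colname'].replace({'old_a':'new_a', 'old_b':'new_b'})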
Modifying String Content (or, more generally, applying functions to contents of a dataframe)
You can apply functions to the contents of a dataframe. Example:
all_df_melted['Label'] = all_df_melted['Label'].apply(lambda x: x.replace('_',' '))
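For this particular case, I believe the built-in string methods do the same thing:

all_df_melted['Label'] = all_df_melted['Label'].str.replace('_', ' ', regex=False)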
Sorting DataFrames
To sort a dataframe, use pandas.DataFrame.sort_values. To me this has been particularly useful with the ignore_index=True option.
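A minimal sketch with a made-up column name:

#sort rows by 'Colname', largest first, and renumber the index 0, 1, 2, ...
df = df.sort_values(by='Colname', ascending=False, ignore_index=True)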
That’s all, folks! Happy data cleaning.
Note: the featured image is modified from a Creative Commons-licensed Wikipedia image of a mosaic created with Post-it notes.