Data wrangling with Pandas

Pandas (pandas) provides a high-level interface to working with “labeled” or “relational” data. This is in contrast to Numpy that deals with raw matrices / arrays, and leaves any tracking of “labeling” up to the developer.

More specifically, pandas is well suited for:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

Full tutorial resources

This page provides a brief introduction to Pandas.

There are many more exhaustive online tutorials available.

A good collection can be found at
the official tutorial page

You are encouraged to check out the material there once you have briefly acquainted yourself with pandas from the information on this page.

Series and Data-frames

To deal with these datasets, Pandas uses two main objects: Series (1d) and DataFrame (2d).

R users familiar with the data.frame concept will find that the Pandas DataFrame provides the same functionality plus more.

Reading data from CSV or Excel files

To create a DataFrame object from data in a CSV (Comma Separated Values) file, Pandas, provides a simple loading function:


df = pandas.read_csv('data.csv')

Similarly a table of data can be read from an Excel Worksheet using

df = pandas.read_excel('workbook.xlsx')

with optional (keyword) arguments like sheetname, header, skiprows to handle specifying the sheet to load, where column labels (headers) should be read from, and how many rows to skip before reading (for tables that start part-way down a spreadsheet).

Getting and Inspecting data

DataFrames contain utility member functions such as head, tail, and describe to provide a view of the initial rows, last rows, and a summary of the data they contain.

Specific rows/columns can be queried using column headers, e.g. if a DataFrame contains the headers ['A', 'B', 'C', 'D'] then column A can be returned as a 1d Series using

df['A']

Standard Python slice notation may also be used to select rows, e.g.

df[:3]

selects the first 3 rows of the DataFrame.

Exercise : Loading and inspecting data

Load and inspect the sample data from the following file

Excel data

Strings

Pandas can handle strings in tables of data, as well as numerical values.

For example if we have information such as

Filename,   Cells,  keywords
file1.tif,    120,  normal, rod-shaped
file2.tif,     98,  lysed
file3.tif,     40,  large, rod-shaped
file4.tif,    101,  spotty, rod-shaped
...

in an Excel spreadsheet, this could be loaded into a DataFrame as described above, and subsequently we can access the keywords column and query which files contained “rod-shaped” cells:

keywords = pd['keywords']       # Select keywords column as Series
is_rod_shaped = keywords.str.contains("rod-shaped") 

Slicing the first few rows would give:

print(is_rod_shaped[:3])

outputs

Filename
file1.tif,   True 
file2.tif,   False
file3.tif,   True 
Name: keywords, dtype: bool

Using string methods like contains allows us to easily incorporate string information from tables of data.

For further examples of the types of string operations available see here (a table of available methods is included at the bottom of the linked page).

Cleaning messy data

When working with data that has been entered by using third-party software, or via manual entry, we often encounter entries that are not readily usable within a numerical workflow.

Examples include numbers that are parsed as strings because of mistakes such as accidentally being entered as 10.0.1 (extra “.” at the end), and “NaN” values that are represented using other notation (e.g. “NA”, “??”).

Luckily Pandas import functions like read_csv include keyword arguments for specifying what nan values look like (na_values).

For example, if the data includes “NA” and “?”, we would use something like

df = pandas.read_csv("file1.csv", na_values=["NA", "??"])

If we need to deal with erroneous representation of numbers, we can first load the data as strings, and then parse the strings into the correct format;

df = pandas.read_csv("file1.csv", na_values=["NA", "??"], 
        dtype={"Column 10":str})

specifies that the series labeled “Column 10” should be handled as strings, and subsequently

num_dots = df['Column 10'].str.count("\.")      # Need backslash because count treats "." as regex wildcard otherwise!
wrong_strings = df['Column 10'][num_dots > 1]
right_strings = wrong_strings.str[-1::-1].str.replace("\.", "", n=1).str[-1::-1]
df['Column 10'][num_dots > 1] = right_strings

The unwieldly looking second-last expression is so long because the string method replace only acts from left to right, so we needed to sandwich it in the str[-1::-1] bits to reverse the string and then unreverse it, as we stated above that the first “.” was the correct one!

Timestamps

When working with system timestamps expressed in terns of seconds (e.g. Unix timestamps), e.g.

Filename,   Creation time
data1.xlsx,    1387295797
data2.xlsx,    1387295796
data3.xlsx,    1387295743
data4.xlsx,    1387295743

we can easily convert these into datetime data types using:

df['Creation time'] = pandas.to_datetime( df['Creation time'], unit='s')

to produce

Filename,   Creation time
data1.xlsx,    2013-12-17 15:56:37 
data2.xlsx,    2013-12-17 15:56:36
data3.xlsx,    2013-12-17 15:55:43
data4.xlsx,    2013-12-17 15:55:43

We can even now use string-like comparison to compare dates!

df = df[df['Creation time'] > '1970-01-01']

(this example selects all rows as they were all created since 1970!).

Plotting a column

Once loaded as described in the previous section, a dataframe can be plotted using pandas.DataFrame interface to matplotlib;

df.plot()

plots each series (column) of the DataFrame as a separate line.

Exericse : Plot the loaded data

Plot the data that was loaded in the previous exercise.

Saving to CSV or Excel

Much in the same way as a DataFrame be easily loaded from a CSV or Excel file, it can be written back just as easily:

df.to_csv('foo.csv')

Note: to_csv is a member-function of the DataFrame object, while read_csv was a module-level function of the pandas module.

Additional exercise resources

If you would like to further practice Pandas, please visit this resource.