Pandas (pandas) provides a high-level interface for working with "labeled" or "relational" data. This is in contrast to NumPy, which deals with raw matrices/arrays and leaves any tracking of "labeling" up to the developer.
In particular, pandas is well suited to working with such labeled, tabular datasets.
This page provides a brief introduction to Pandas. There are many more exhaustive online tutorials available; a good collection can be found at the official tutorial page. You are encouraged to check out the material there once you have briefly acquainted yourself with pandas from the information on this page.
To deal with these datasets, Pandas uses two main objects: Series (1d) and DataFrame (2d). R users familiar with the data.frame concept will find that the Pandas DataFrame provides the same functionality plus more.
To create a DataFrame object from data in a CSV (Comma Separated Values) file, Pandas provides a simple loading function:
df = pandas.read_csv('data.csv')
Similarly, a table of data can be read from an Excel worksheet using
df = pandas.read_excel('workbook.xlsx')
with optional keyword arguments like sheet_name, header, and skiprows to specify which sheet to load, where column labels (headers) should be read from, and how many rows to skip before reading (for tables that start part-way down a spreadsheet).
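As a minimal, self-contained sketch of these keyword arguments, the example below uses read_csv on an in-memory "file" (a StringIO buffer standing in for a real file on disk); read_excel accepts the same header and skiprows arguments, plus sheet_name for choosing the worksheet. The file contents here are invented for illustration.

```python
import io

import pandas

# A small in-memory CSV standing in for a real file; the table starts
# two rows down, mimicking a spreadsheet with notes above the data.
raw = io.StringIO(
    "Some notes about this file\n"
    "Exported by other software\n"
    "A,B\n"
    "1,4\n"
    "2,5\n"
    "3,6\n"
)

# skiprows=2 skips the two note lines; header=0 then reads the column
# labels from the first remaining row.
df = pandas.read_csv(raw, skiprows=2, header=0)
print(list(df.columns))  # column labels read from the header row
print(len(df))           # number of data rows
```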
DataFrames contain utility member functions such as head, tail, and describe to provide a view of the initial rows, last rows, and a summary of the data they contain.
Specific rows/columns can be queried using column headers, e.g. if a DataFrame contains the headers ['A', 'B', 'C', 'D'], then column A can be returned as a 1d Series using
df['A']
Standard Python slice notation may also be used to select rows, e.g.
df[:3]
selects the first 3 rows of the DataFrame.
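These inspection and selection operations can be sketched on a small hypothetical DataFrame (the column names match the ['A', 'B', 'C', 'D'] example above; the values are made up):

```python
import pandas

# A small hypothetical DataFrame with the headers used above.
df = pandas.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [0.1, 0.2, 0.3, 0.4, 0.5],
    'D': [5, 4, 3, 2, 1],
})

print(df.head(2))     # first two rows
print(df.tail(2))     # last two rows
print(df.describe())  # count/mean/std/min/quartiles/max per column

col_a = df['A']       # column A as a 1d Series
first3 = df[:3]       # first three rows, still a DataFrame
print(type(col_a).__name__, type(first3).__name__)
```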
Load and inspect the sample data from the following file. Use the pandas.read_excel function to load the spreadsheet, and then call the head, tail, and describe functions.
Pandas can handle strings in tables of data, as well as numerical values.
For example, if we have information such as
Filename, Cells, keywords
file1.tif, 120, normal, rod-shaped
file2.tif, 98, lysed
file3.tif, 40, large, rod-shaped
file4.tif, 101, spotty, rod-shaped
...
in an Excel spreadsheet, this could be loaded into a DataFrame as described above, and subsequently we can access the keywords column and query which files contained "rod-shaped" cells:
keywords = df['keywords']  # Select keywords column as Series
is_rod_shaped = keywords.str.contains("rod-shaped")
Slicing the first few rows would give:
print(is_rod_shaped[:3])
outputs
Filename
file1.tif     True
file2.tif    False
file3.tif     True
Name: keywords, dtype: bool
Using string methods like contains allows us to easily incorporate string information from tables of data. For further examples of the types of string operations available, see here (a table of available methods is included at the bottom of the linked page).
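The keywords example above can be made self-contained by rebuilding the table in code (the filenames and keywords are the illustrative values from the table, with Filename used as the index):

```python
import pandas

# Rebuild the example table in code, with Filename as the index.
df = pandas.DataFrame({
    'Cells': [120, 98, 40, 101],
    'keywords': ['normal, rod-shaped', 'lysed',
                 'large, rod-shaped', 'spotty, rod-shaped'],
}, index=pandas.Index(['file1.tif', 'file2.tif',
                       'file3.tif', 'file4.tif'], name='Filename'))

is_rod_shaped = df['keywords'].str.contains('rod-shaped')
print(is_rod_shaped[:3])

# The boolean Series can then select the matching rows:
rod_files = df[is_rod_shaped]
print(list(rod_files.index))
```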
When working with data that has been entered via third-party software, or by manual entry, we often encounter entries that are not readily usable within a numerical workflow. Examples include numbers that are parsed as strings because of mistakes such as accidentally being entered as 10.0.1 (an extra "."), and "NaN" values that are represented using other notation (e.g. "NA", "??").
Luckily, Pandas import functions like read_csv include a keyword argument for specifying what nan values look like (na_values). For example, if the data includes "NA" and "??", we would use something like
df = pandas.read_csv("file1.csv", na_values=["NA", "??"])
If we need to deal with erroneous representations of numbers, we can first load the data as strings, and then parse the strings into the correct format;
df = pandas.read_csv("file1.csv", na_values=["NA", "??"],
                     dtype={"Column 10": str})
specifies that the series labeled "Column 10" should be handled as strings, and subsequently
num_dots = df['Column 10'].str.count(r"\.")  # Raw string with backslash: count treats "." as a regex wildcard otherwise!
wrong_strings = df['Column 10'][num_dots > 1]
right_strings = wrong_strings.str[::-1].str.replace(r"\.", "", n=1, regex=True).str[::-1]
df.loc[num_dots > 1, 'Column 10'] = right_strings
The unwieldy-looking expression for right_strings is so long because the string method replace only acts from left to right, so we need to sandwich it between str[::-1] calls to reverse the string and then un-reverse it, as we stated above that the first "." was the correct one!
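A minimal runnable sketch of this fix, using a short hypothetical column of mis-entered values rather than a file on disk:

```python
import pandas

# Hypothetical column of numbers mis-entered as strings; '10.0.1'
# has a spurious second '.' that we want to remove.
col = pandas.Series(['1.5', '10.0.1', '3.25'], dtype=str)

num_dots = col.str.count(r'\.')  # raw string: '.' is a regex wildcard
bad = col[num_dots > 1]

# Reverse each string, remove the first '.' found (i.e. the LAST '.'
# in the original), then reverse back; regex=True because the pattern
# is a regular expression.
fixed = bad.str[::-1].str.replace(r'\.', '', n=1, regex=True).str[::-1]
col[num_dots > 1] = fixed

print(list(col))  # ['1.5', '10.01', '3.25']
```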
When working with system timestamps expressed in terms of seconds (e.g. Unix timestamps), such as
Filename, Creation time
data1.xlsx, 1387295797
data2.xlsx, 1387295796
data3.xlsx, 1387295743
data4.xlsx, 1387295743
we can easily convert these into datetime data types using:
df['Creation time'] = pandas.to_datetime(df['Creation time'], unit='s')
to produce
Filename, Creation time
data1.xlsx, 2013-12-17 15:56:37
data2.xlsx, 2013-12-17 15:56:36
data3.xlsx, 2013-12-17 15:55:43
data4.xlsx, 2013-12-17 15:55:43
We can even now use string-like comparison to compare dates!
df = df[df['Creation time'] > '1970-01-01']
(this example selects all rows as they were all created since 1970!).
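Rebuilding the creation-time table in code gives a self-contained sketch of the conversion and the date comparison:

```python
import pandas

# The creation-time table above, rebuilt in code.
df = pandas.DataFrame({
    'Filename': ['data1.xlsx', 'data2.xlsx', 'data3.xlsx', 'data4.xlsx'],
    'Creation time': [1387295797, 1387295796, 1387295743, 1387295743],
})

# unit='s' interprets the integers as seconds since the Unix epoch.
df['Creation time'] = pandas.to_datetime(df['Creation time'], unit='s')
print(df)

# Comparing against a date string selects rows after that date:
recent = df[df['Creation time'] > '1970-01-01']
print(len(recent))  # all four rows were created after 1970
```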
Once loaded as described in the previous section, a DataFrame can be plotted using the pandas.DataFrame interface to matplotlib;
df.plot()
plots each series (column) of the DataFrame as a separate line.
Plot the data that was loaded in the previous exercise.
Much in the same way as a DataFrame can be easily loaded from a CSV or Excel file, it can be written back just as easily:
df.to_csv('foo.csv')
Note: to_csv is a member function of the DataFrame object, while read_csv was a module-level function of the pandas module.
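A round trip makes this distinction concrete. The sketch below writes to an in-memory buffer so it is self-contained; passing a filename string like 'foo.csv' works the same way.

```python
import io

import pandas

df = pandas.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# DataFrame member function: write the table out.
buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False omits the row-label column

# Module-level function: read it back in.
buf.seek(0)
df2 = pandas.read_csv(buf)
print(df.equals(df2))  # the round trip preserved the data
```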
If you would like to further practice Pandas, please visit this resource.