Lirico Blog - Data Science

What is a dataframe

Sun 24 April 2016

Back when Big Data was a genuinely new concept and Data Scientist's weren't rock stars, the SciPy ecosystem was beginning to mature. It felt as though Python deserved to dominate the data analysis world but uptake was still slow outside the realm of the physical sciences and engineering. Something seemed to be missing; and there was a new language on the block which was steeling all the attention: R. As well as having more mature statistical libraries, R had a killer feature for anyone doing ad hoc data invstigation: the Data Frame.

I remember the feeling of relief and optimism when this missing feature was addressed by the arival of Pandas, bringing data frames to Python. Finally we had a tool which could bring the power of Numpy to a more general audience and stem the flow of attention towards R.

Pandas data frames have the flexibility of R but used Numpy arrays underneath to give big efficiency gains for bulk computations. Pandas' success is based on how it manages to hide this implementation from the user, allowing them to see them in terms of what they are familiar with. To a software engineer they look like database tables; to a physicist they may look like annotated matrices; to many with a general numerate background they look like spreadsheets.

I want to demonstrate that data frames are an amalgamation of all of these different models with some clever manipulation under the hood to paper over the cracks. Even though a data frame seems completely general, there are a few restrictions that are not obvious. Understanding the data frame's restrictions also helps us see through its usability tricks to the true data model underneath.

Hidden typing

We know a Pandas data frame uses Numpy arrays underneath to optimise operations on its elements. We also know that Numpy arrays are homogenous arrays of a single type, usually a fixed-size numeric type such as np.int64 or np.float32. Although this could be limiting when dealing with messy data, Numpy has one "get out of jail free" feature here: the type can also be a pointer to an arbitrary Python object, therefore numpy arrays can be heterogenous arrays of arbitrary objects, with the caviat that object arrays throw away a lot of the optimisation goodness you get with homogenous arrays.

So how does Pandas deal with heterogeneous data without abandoning the efficiency of numpy homogeneous arrays?

import pandas as pd
import numpy as np
import IPython.display as D

Let's play around with a few simple data frames and see what types Pandas uses. We start with a data frame of 2 columns with different types, e.g. float and int. Pandas displays both columns in the same way but the underlying types are retained as can be seen using the dtypes attribute.

df = pd.DataFrame({'a': np.arange(0, 5).astype(np.float64), 'b': range(0, 5)})
print df.dtypes
df

a    float64
b      int64
dtype: object

	a	b
0	0	0
1	1	1
2	2	2
3	3	3
4	4	4

So far no surprises. Each column is a separate dtype. What happens if we swap the rows and columns by using the transpose() method or attribute T? We might hope that each row will have a separate dtype but Pandas seems to store dtypes by column so it's going to have to do something clever.

print df.T.dtypes
df.T

0    float64
1    float64
2    float64
3    float64
4    float64
dtype: object

	0	1	2	3	4
a	0	1	2	3	4
b	0	1	2	3	4

Pandas keeps storing dtype by column and therefore has cast all columns to float. In Pandas each column must have a single type: you can't have a mix of int and float in a single column. This isn't at all apparent from what is displayed since trailing zeros are trimmed from floats but you can imagine getting some unexpected behaviour if you assume some values are int.

This behaviour is more pronounced if we add a column of a completely different type. What happens if there are strings in the mix?

df2 = pd.concat([df, pd.DataFrame({'c': list('abcde')})], axis=1)
print df2.dtypes
df2

a    float64
b      int64
c     object
dtype: object

	a	b	c
0	0	0	a
1	1	1	b
2	2	2	c
3	3	3	d
4	4	4	e

print df2.T.dtypes
df2.T

0    object
1    object
2    object
3    object
4    object
dtype: object

	0	1	2	3	4
a	0	1	2	3	4
b	0	1	2	3	4
c	a	b	c	d	e

All columns have been cast to type object, i.e. generic Python objects. This is bad news if you want to efficiently perform a computation on the data all the advantages of using Numpy arrays have evapourated in a single operation. When we are processing large data frames it can be important to realise what dtypes are being used and why. We'll look into the efficiency implications a bit more later but first let's go one level deeper into how Pandas uses numpy arrays under the hood.

Inside the data frame

When I want to understand the implementation of a data structure I often mess arount on the IPython prompt to find out what attributes and methods an object has. Using IPython autocomplete I can see our data frames have an attribute _data which looks promising. What do they look like?

print '== DF1 : 2 cols, float, int'
print df._data, '\n'
print '== DF1.T'
print df.T._data, '\n'
print '== DF2 : 3 cols, float, int, string'
print df2._data, '\n'
print '== DF2.T'
print df2.T._data, '\n'

== DF1 : 2 cols, float, int
BlockManager
Items: Index([u'a', u'b'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
FloatBlock: slice(0, 1, 1), 1 x 5, dtype: float64
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64

== DF1.T
BlockManager
Items: Int64Index([0, 1, 2, 3, 4], dtype='int64')
Axis 1: Index([u'a', u'b'], dtype='object')
FloatBlock: slice(0, 5, 1), 5 x 2, dtype: float64

== DF2 : 3 cols, float, int, string
BlockManager
Items: Index([u'a', u'b', u'c'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
FloatBlock: slice(0, 1, 1), 1 x 5, dtype: float64
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
ObjectBlock: slice(2, 3, 1), 1 x 5, dtype: object

== DF2.T
BlockManager
Items: Int64Index([0, 1, 2, 3, 4], dtype='int64')
Axis 1: Index([u'a', u'b', u'c'], dtype='object')
ObjectBlock: slice(0, 5, 1), 5 x 3, dtype: object

It's worth taking a few moments to examine these structures. Each BlockManager contains an Items index, a Axis index and one or more typed blocks. The blocks appear to be a numpy array with a slice object which defines which of the frame's columns are stored in that block. Importantly:

Each block stores one or more columns but must span all rows
Each block has one dtype

With this structure Pandas should be able to store multiple columns in one block. Let's check this is the case.

df3 = pd.DataFrame({'a': np.arange(0, 5), 'b': range(0, 5)})
df3._data

BlockManager
Items: Index([u'a', u'b'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(0, 2, 1), 2 x 5, dtype: int64

Intermission: NetCDF

If you are familiar with atmospheric sciencies or oceanography you are probably getting a sense of deja vu right now. The BlockManager looks suspiciously like how the CF-NetCF conventions are used to encode observation and model data in the environmental sciences. In both cases variables are stored in homogenous arrays where their domain is represented by separate coordinate axis variables. When variables share a domain they are associated with the same coordinate axes, similar to how we might align columns in a spreadsheet next to a common coordinate column.

Both Pandas and CF-NetCDF store data in a fundamentally different way to most relational databases which store rows consecutively. I will save full exploration of this for another post but I will try to wet your apetite by pointing out a further similarity with the emerging trend of Columnar Databases. In fact, it seems no coincidence that the inventor of Pandas has recently announced involvement in a new columnar technology Apache Arrow

Performance and dtypes

We know that what makes Pandas efficient is it's use of numpy arrays underneath, so it should be no surprise that it can be slower to calculate across collumns than across rows. To demonstrate this let's do a simple calculation on a column and compare it to the same data after it has been cast to object by transposing. We'll capture these different representations of the same data as a1 and a2.

df = pd.DataFrame({'a': np.linspace(0, 100, 10000)})
df['b'] = df['a'].astype(np.str)

a1 = df['a']
a2 = df.T.ix[0]
print 'Dtypes: a1 =', a1.dtype, ' a2 =', a2.dtype
print 'a1 == a2:', np.all(a1 == a2)  # Verify the data is the same

Dtypes: a1 = float64  a2 = object
a1 == a2: True

print '== a1'
%timeit np.sum(a1 * a1)
print '== a2'
%timeit np.sum(a2 * a2)

== a1
10000 loops, best of 3: 160 µs per loop
== a2
1000 loops, best of 3: 1.09 ms per loop

There is about 7-fold degradation in performance when processing the object array a2 compared to the float array a1 even for this trivial calculation.

Panels

Finally we should take a quick peek at Pandas' lesser known data structures that expand the data frame concept to more dimensions. Pandas defines Panel for 3D data and Panel4D for 4D data. The question is which Pandas concept expands increases in dimensionality: columns or rows?

wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
              major_axis=pd.date_range('1/1/2000', periods=5),
              minor_axis=['A', 'B', 'C', 'D'])

wp._data

BlockManager
Items: Index([u'Item1', u'Item2'], dtype='object')
Axis 1: <class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01, ..., 2000-01-05]
Length: 5, Freq: D, Timezone: None
Axis 2: Index([u'A', u'B', u'C', u'D'], dtype='object')
FloatBlock: slice(0, 2, 1), 2 x 5 x 4, dtype: float64

Summing up

I like to think of a Pandas DataFrame as a collection of 1D arrays (the columns) whose elements are alligned along a common axis (the index). The implementation of DataFrames stores columns together rather than rows and is therefore more similar to columnar databases or NetCDF than it is to traditional relational databases.

Pandas takes great efforts to provide a symetric interface with respect to the columns and rows, upcasting to Python Objects where necessary. This symetry breaks down when you are concerned about the types of your data, for instance when efficiency is important or when you want to enforce uniformity to ensure clean data.

The less common Pandas objects Panel and Panel4D extend the dimensionality of columns to multi-dimensional items, thus Panels are conceptually collections of 2D arrays with 2 common axes. This is made less obvious by changing the terminology from columns to items in the Panel interface.

Category: Data Science Tagged: pandas python

comments

Example of using esgf-pyclient to plot via OPeNDAP

Sun 24 February 2013

First import the required modules and select the CEDA ESGF-index node's search service.

from pyesgf.search import SearchConnection
from netCDF4 import Dataset
from mpl_toolkits.basemap import Basemap

CEDA_SERVICE = 'http://esgf-index1.ceda.ac.uk/esg-search/search'

Let's ask the question "How many datasets from the HadCM3 decadal2000 simulations do we have …

Category: Data Science Tagged: esgf opendap

comments

Wed 06 February 2013

This notebook describes the analysis of a snapshot of all replicas in the CMIP5 archive on February 6th 2013. A list of all dataset versions at each of the 3 replicating data centres is read, analysed and compared with the total number of datasets listed in the CMIP5 archive.

This …

Category: Data Science Tagged: esgf cmip5

comments

Tue 15 January 2013

I am increasingly excited about distributed version control and how it enables easy collaboration between software developers without technical and social barriers such as synchronisation of work and maintenance of control.</p>

The obvious question is how the DVC systems can be applied to scientific collaboration and my particular specialism …

Category: Data Science Tagged: esgf cmip5

comments

What is a dataframe

Sun 24 April 2016

Hidden typing

Inside the data frame

Intermission: NetCDF

Performance and dtypes

Panels

Summing up

Example of using esgf-pyclient to plot via OPeNDAP

Sun 24 February 2013

Analysis of replicated CMIP5 datasets

Wed 06 February 2013

Bringing Git to data archival

Tue 15 January 2013

What is a dataframe

Sun 24 April 2016

Hidden typing

Inside the data frame

Intermission: NetCDF

Performance and dtypes

Panels

Summing up

Example of using esgf-pyclient to plot via OPeNDAP

Sun 24 February 2013

Analysis of replicated CMIP5 datasets

Wed 06 February 2013

Bringing Git to data archival

Tue 15 January 2013

Page 1 of 1