A Detailed Overview of Pandas

29.09.2020 - Jay M. Patel - Reading time ~11 Minutes

Introduction

This is a first draft of an introductory-level tutorial on using the Pandas library (versions higher than 0.20) in Python 3.x. We have already covered the core Python language in our one-hour Python tutorial.

We are also going to stay away from machine learning and data science topics such as natural language processing algorithms, but you are encouraged to check out those tutorials by going to the menu and navigating to the tutorial of your choice.

Main Content

1.0 Introduction to Pandas Library

The easiest way to start working with Pandas is to install the Anaconda distribution of Python for your operating system, which ships with Pandas included.

Pandas gives you the ability to work with complex data structures such as nested arrays without having to worry about low-level implementation details. Its abstractions make working with mixed data types a breeze, similar to storing data in an Excel spreadsheet where you may have one column for integers, another for dates, and several columns for strings. All of these can be packaged and manipulated in one simple data structure called a DataFrame.
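As a minimal sketch of the spreadsheet analogy above, the snippet below builds a small DataFrame with one integer column, one date column, and two string columns. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical spreadsheet-like data: integers, dates and strings side by side
df = pd.DataFrame({
    "quantity": [3, 10, 7],
    "order_date": pd.to_datetime(["2020-09-01", "2020-09-15", "2020-09-29"]),
    "product": ["apple", "banana", "cherry"],
    "region": ["north", "south", "east"],
})

# Each column keeps its own data type
print(df.dtypes)
```

Printing df.dtypes shows that every column carries its own dtype, which is what lets mixed data live in a single object.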

Data manipulation in Pandas is pretty fast, since most of the low-level computations happen in compiled C (and Cython) code, and Python simply provides the interface.

From the early days of Pandas, its creator Wes McKinney made the wise decision to link it closely with other scientific Python packages such as NumPy. As a result, it's generally possible to pass a Pandas DataFrame directly to machine learning algorithms in scikit-learn (sklearn) rather than converting it to a NumPy array first. This compatibility also extends to how Pandas handles missing data, using NaN (not a number) from the NumPy library.
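The NaN-based handling of missing data mentioned above can be sketched as follows. The values are made up; the point is only that the missing entry is a NumPy NaN that Pandas recognizes and skips in aggregations by default.

```python
import numpy as np
import pandas as pd

# A Series with one missing value, represented by NumPy's NaN
s = pd.Series([1.0, np.nan, 3.0])

# isna() flags the missing entry
print(s.isna())

# Aggregations skip NaN by default, so the mean is taken over 1.0 and 3.0
print(s.mean())

# The same data viewed as a plain NumPy array still carries the NaN
arr = np.asarray(s)
print(arr)
```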

1.1 Maximum Limits on Reading Files as Pandas Objects

Pandas can be a very efficient and quick way to read structured data such as CSV files and SQL tables. However, like all data types in Python, a Pandas object is loaded into your computer's memory; as a rule of thumb, the largest file you can comfortably read with Pandas should be about 4-6X smaller than the total memory available on your computer.

The official Pandas documentation offers some very good suggestions for handling big-data-scale files in Pandas. In this tutorial we will assume you are able to fit all the data in memory without resorting to any of the workarounds described in that article.
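To check how much memory a DataFrame actually occupies once loaded, one option is DataFrame.memory_usage. The sketch below uses a small made-up DataFrame; deep=True asks Pandas to also count the memory held by Python string objects.

```python
import pandas as pd

# Hypothetical data: 1000 rows with an integer column and a string column
df = pd.DataFrame({"id": list(range(1000)), "label": ["pandas"] * 1000})

# Bytes used per column (the index is reported as well)
bytes_per_column = df.memory_usage(deep=True)
print(bytes_per_column)

# Approximate total footprint in megabytes
total_mb = bytes_per_column.sum() / 1024 ** 2
print(f"approximate in-memory size: {total_mb:.3f} MB")
```

Comparing this number with the size of the file on disk gives a rough feel for how much headroom a larger file would need.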

2.0 Pandas Series

A one-dimensional labeled array in Pandas is known as a Series. It can hold any data type: integers, strings, floating point numbers, Python objects, and so on. It can be created from a scalar, an ndarray, a dict, etc. The axis labels are collectively referred to as the index.

Unlike R, Pandas supports non-unique index values; however, an exception is raised if an operation is attempted that requires unique index values. The dtype is inferred, and the name is also assigned automatically in many cases.
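A minimal sketch of the non-unique index behavior, using a made-up Series: selecting a repeated label returns all matching rows, while reindex is used here as one example of an operation that requires unique labels.

```python
import pandas as pd

# Hypothetical Series with a repeated label 'a'
s = pd.Series([10, 20, 30], index=["a", "a", "b"])

# Selecting a repeated label returns every matching row
print(s["a"])

# reindex needs unique labels, so it raises on this Series
try:
    s.reindex(["a", "b", "c"])
    reindex_failed = False
except ValueError as err:
    reindex_failed = True
    print("reindex failed:", err)
```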

NewSeries=pd.Series(data=None, index=None, dtype=None, name=None)

If data is an ndarray, then index must be the same length as data; if no explicit index is passed, then one will automatically be created with values 0, ..., len(data)-1.

If data is a dict and an index is passed, the values in data corresponding to the labels in the index are pulled out; otherwise, an index is constructed from the keys of the dict (sorted in older Pandas versions; in insertion order from Pandas 0.23 on Python 3.6+). Unlike dicts, pd.Series supports slice operations, and any index label that has no matching key in the dict gets the value NaN.

A pd.Series is also very similar to a one-dimensional ndarray; however, a Series lets the user specify the index explicitly, and it can consist of non-sequential integers or strings, unlike an ndarray, where the implicit index is always a sequence of integers.
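The ndarray case can be sketched like this, with made-up values: one Series gets the automatic 0..len(data)-1 index, one gets explicit labels, and a third attempt passes an index of the wrong length.

```python
import numpy as np
import pandas as pd

data = np.array([1.5, 2.5, 3.5])

# No index passed: labels 0, 1, 2 are created automatically
default_index = pd.Series(data)
print(list(default_index.index))

# Explicit index of the same length as data
labelled = pd.Series(data, index=["x", "y", "z"])
print(labelled["y"])

# An index of a different length is rejected
try:
    pd.Series(data, index=["x", "y"])
    length_error = False
except ValueError as err:
    length_error = True
    print("length mismatch:", err)
```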

import numpy as np
import pandas as pd

#creating a sample dict

SampleDict = {'d':'bear', 'z':'lion', 'c':'tiger', 'a':'bat'}
print('-SampleDict is:\n', SampleDict)

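# converting it to a pd.Series; sort_index() keeps the labels in sorted order
# (newer Pandas versions preserve the dict's insertion order instead)

SampleDictSeries = pd.Series(SampleDict).sort_index()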
print ('-pd.series created from SampleDict is:\n',SampleDictSeries)
print('-Result of a slice operation a:c on SampleDictSeries is:\n', SampleDictSeries['a':'c'])

# creating another sample dict

SampleDict2={0 : 'p', 1 : 'a', 2 : 'n'}

# converting it into a pd.Series, but with more index labels than the original dict has keys

SampleDict2Series=pd.Series(SampleDict2, index=[0,1,2,3,4])

print ('-The new series is:\n',SampleDict2Series)
#Output
-SampleDict is:
 {'d': 'bear', 'z': 'lion', 'c': 'tiger', 'a': 'bat'}
-pd.series created from SampleDict is:
 a      bat
c    tiger
d     bear
z     lion
dtype: object
-Result of a slice operation a:c on SampleDictSeries is:
 a      bat
c    tiger
dtype: object
-The new series is:
 0      p
1      a
2      n
3    NaN
4    NaN
dtype: object

3.0 Pandas DataFrame

A DataFrame is a 2-dimensional, size-mutable, labeled data structure with columns of potentially different types. It is analogous to R's data.frame and is intuitively similar to a SQL table or a dict of Pandas Series objects. A DataFrame is not intended to work exactly like a 2-D ndarray. Both axes are labeled with Index objects: the row labels are referred to as the index, while the column labels are referred to as the columns.

NewDataframe = pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

data can be:

  • 2-D NumPy ndarray (structured or homogeneous)
  • Dict of 1D ndarrays, lists, dicts or Series
  • List of dicts, Series etc.
  • A pandas Series
  • Another DataFrame

index will default to the integers 0 to n-1 (np.arange(n)) if no indexing information is provided. Similarly, if no column labels are provided, then columns will default to np.arange(n).
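A quick sketch of these defaults, using a made-up 2x3 array with no labels supplied:

```python
import numpy as np
import pandas as pd

# No index or columns passed, so both axes get integer labels starting at 0
df = pd.DataFrame(np.zeros((2, 3)))

print(list(df.index))    # [0, 1]
print(list(df.columns))  # [0, 1, 2]
```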

3.1 Creating a DataFrame

When a DataFrame is created by combining several Series, the resulting index will be the union of the indexes of the individual Series. Nested dicts are first converted to Series.

You will get a ValueError if you try to create a DataFrame from a dict of scalar values without specifying an index. Also, passing column labels to DataFrame overrides the keys in the dict: only the listed columns are kept, and a column is filled with NaN values if its label can't be matched with a key in the dict.
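The index union can be sketched with two made-up Series that share only the label 'b'; the nested-dict version is included to show that it lines up the same way.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["b", "c"])

# The row index becomes the union of both Series indexes;
# positions with no data are filled with NaN
df = pd.DataFrame({"s1": s1, "s2": s2})
print(df)

# A nested dict with the same keys is aligned the same way
df_nested = pd.DataFrame({"s1": {"a": 1, "b": 2}, "s2": {"b": 3, "c": 4}})
print(df_nested)
```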

SampleDict3={0 : 'p', 1 : 'a', 2 : 'n', 3 : 'd',4 : 'a', 5 : 's'}
# this will give an error : print (pd.DataFrame(SampleDict3))
print (pd.DataFrame(SampleDict3, index=[0]))
#Output
   0  1  2  3  4  5
0  p  a  n  d  a  s

Alternatively, you can either wrap the dict in a list and pass that, or use a Series.

SampleDict3={0 : 'p', 1 : 'a', 2 : 'n', 3 : 'd',4 : 'a', 5 : 's'}
print (pd.DataFrame([SampleDict3]))
#Output
   0  1  2  3  4  5
0  p  a  n  d  a  s
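The effect of passing column labels together with a dict, mentioned earlier in this section, can be sketched as follows. The dict and the label 'c' are made up for illustration.

```python
import pandas as pd

data = {"a": [1, 2], "b": [3, 4]}

# Only the requested columns are kept: 'a' is matched to a dict key,
# 'b' is dropped, and 'c' has no matching key so it is filled with NaN
df = pd.DataFrame(data, columns=["a", "c"])
print(df)
```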

You can create a pd.DataFrame from a 2-D NumPy array; note, however, that indexing an ndarray with [0] returns the first row, whereas indexing a Pandas DataFrame with [0] returns the column labeled 0 (here, the first column).

import numpy as np
import pandas as pd

a = [['p','a', 'n'], [4, 5, 6]]
np_a = np.array(a)
pd_a = pd.DataFrame(np_a)

print('np_a is:\n', np_a)
print('pd_a is:\n', pd_a)

print('first index of np_a is the entire first row:\n', np_a[0])
print('first index of pd_a is the entire first column:\n',pd_a[0])
#Output
np_a is:
 [['p' 'a' 'n']
 ['4' '5' '6']]
pd_a is:
    0  1  2
0  p  a  n
1  4  5  6
first index of np_a is the entire first row:
 ['p' 'a' 'n']
first index of pd_a is the entire first column:
 0    p
1    4
Name: 0, dtype: object

Create a Dataframe from a list of dicts:

SampleDict2={0 : 'p', 1 : 'a', 2 : 'n'}
SampleDict3={0 : 'p', 1 : 'a', 2 : 'n', 3 : 'd',4 : 'a', 5 : 's'}
print (pd.DataFrame([SampleDict2,SampleDict3]))
#Output
   0  1  2    3    4    5
0  p  a  n  NaN  NaN  NaN
1  p  a  n    d    a    s

Create a dataframe from a list of Series:

x=pd.Series({0 : 'p', 1 : 'a', 2 : 'n', 3 : 'd',4 : 'a', 5 : 's'})
print (pd.DataFrame([x]))
#Output
   0  1  2  3  4  5
0  p  a  n  d  a  s

To reiterate, this is conceptually very different from creating a DataFrame from a single Series, where the values are read into one column with the label 0. The output is practically the same as the Series itself, except that the column gets the label 0.

x=pd.Series({0 : 'p', 1 : 'a', 2 : 'n', 3 : 'd',4 : 'a', 5 : 's'})
print (x)

y = pd.DataFrame(x)+pd.DataFrame(x)
print(y)

#Output

0    p
1    a
2    n
3    d
4    a
5    s
dtype: object
    0
0  pp
1  aa
2  nn
3  dd
4  aa
5  ss

4.0 Selection and Slicing

4.1 By label (.loc)

We can select data by label using .loc. For clarity, let's change our example so that some keys are strings; our column names are now composed of both letters and integers.

Integers are valid labels too; however, with .loc they refer to the label and not to the position.

When using .loc to slice by label, both the start and the stop are included, contrary to usual Python slices. Read the official documentation for more information.

You can also pass a list or array of labels, e.g. ['a', 'b', ..].

It throws an error if you try to slice columns with ":" from a string label to an integer label (note that we have mixed labels here).

x=pd.Series({'A' : 'n', 'B' : 'u', 2 : 'm', 3 : 'p', 4 : 'y'})
y=pd.Series({'A' : 'p', 'B' : 'a', 2 : 'n', 3 : 'd',4 : 'a', 5 : 's'})

NewPandasFrame=pd.DataFrame([x,y])

print ('NewPandasFrame\n', NewPandasFrame)

print ("NewPandasFrame.loc[1,'A']\n", NewPandasFrame.loc[1,'A'])

# Unlike regular indexing, ':1' means include 1 (so print both index 0 and 1 rows)

print ('NewPandasFrame.loc[:1,:]\n', NewPandasFrame.loc[:1,:])

# you can specify which columns to print

print ("NewPandasFrame.loc[:,['A','B',3]]\n", NewPandasFrame.loc[:,['A','B',3]])

print ("this statement: NewPandasFrame.loc[:1,'A':4] throws an error: TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [4] of <class 'int'>")

# you can either slice with only characters such as below

print ("NewPandasFrame.loc[0:,'A':'B']\n", NewPandasFrame.loc[0:,'A':'B'])

# or just use integer labels

print ("NewPandasFrame.loc[:1,3]\n", NewPandasFrame.loc[:1,3])
#Output
NewPandasFrame
    A  B  2  3  4    5
0  n  u  m  p  y  NaN
1  p  a  n  d  a    s
NewPandasFrame.loc[1,'A']
 p
NewPandasFrame.loc[:1,:]
    A  B  2  3  4    5
0  n  u  m  p  y  NaN
1  p  a  n  d  a    s
NewPandasFrame.loc[:,['A','B',3]]
    A  B  3
0  n  u  p
1  p  a  d
this statement: NewPandasFrame.loc[:1,'A':4] throws an error: TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [4] of <class 'int'>
NewPandasFrame.loc[0:,'A':'B']
    A  B
0  n  u
1  p  a
NewPandasFrame.loc[:1,3]
 0    p
1    d
Name: 3, dtype: object

4.2 By integer-based location (.iloc)

This uses integer position-based indexing, irrespective of the labels. Unlike .loc, and like slicing of other Python objects, the stop of a start:stop slice is not included with .iloc.

x=pd.Series({0 : 'n', 1 : 'u', 2 : 'm', 3 : 'p', 4 : 'y'})
y=pd.Series({0 : 'p', 1 : 'a', 2 : 'n', 3 : 'd',4 : 'a', 5 : 's'})

z = pd.DataFrame([x,y])

print("dataframe z:\n", z)

print("z.loc[:,0:3]\n",z.loc[ : ,0:3])

print("z.iloc[:,0:3]\n", z.iloc[: ,0:3])
#Output
dataframe z:
    0  1  2  3  4    5
0  n  u  m  p  y  NaN
1  p  a  n  d  a    s
z.loc[:,0:3]
    0  1  2  3
0  n  u  m  p
1  p  a  n  d
z.iloc[:,0:3]
    0  1  2
0  n  u  m
1  p  a  n

4.3 .ix

Important: Starting in 0.20.0, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers; it was removed entirely in Pandas 1.0.
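For code that previously relied on .ix, a minimal sketch of the equivalent .loc and .iloc calls is shown below. The DataFrame is made up; the last two lines show one way to mix a row position with a column label, which is the case .ix used to handle implicitly.

```python
import pandas as pd

# Hypothetical DataFrame with string row labels
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

# Purely label-based selection
by_label = df.loc["y", "B"]

# Purely position-based selection
by_position = df.iloc[1, 1]

# Mixed selection: row by position, column by label
mixed = df.loc[df.index[1], "B"]
mixed_alt = df.iloc[1, df.columns.get_loc("B")]

print(by_label, by_position, mixed, mixed_alt)
```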

5.0 Basic DataFrame manipulations

New columns can be created simply with a dict-like assignment.

z = pd.DataFrame([x,y])
print ("Original Z is:\n", z)
z['6'] = ['a', 'b']
print("New Z is:\n", z)
z['sumof0&1'] = z[0]+z[1]
print("New Z with sumof0&1 column is:\n", z)

#Output

Original Z is:
    0  1  2  3  4    5
0  n  u  m  p  y  NaN
1  p  a  n  d  a    s
New Z is:
    0  1  2  3  4    5  6
0  n  u  m  p  y  NaN  a
1  p  a  n  d  a    s  b
New Z with sumof0&1 column is:
    0  1  2  3  4    5  6 sumof0&1
0  n  u  m  p  y  NaN  a       nu
1  p  a  n  d  a    s  b       pa

We can convert a DataFrame to an ndarray using the DataFrame.values attribute; it will assume the broadest possible data type (object) when more than one type is present in the DataFrame. You can also use the built-in method astype to convert the DataFrame to a specific dtype.

# dataframe z, rebuilt from the x and y Series defined in section 4.2
z = pd.DataFrame([x, y])
print("dataframe Z\n", z.loc[:,0:4])
print(z.dtypes)
print ("ndarray from dataframe using .values \n", z.values)

x=pd.Series({0 : 34, 1 : 42, 2 : 455.9})
y=pd.Series({0 : 23, 1 : 45, 2 : 69})

df3 = pd.DataFrame([x,y])
print(df3.dtypes)
print ("ndarray from dataframe using .values \n", df3.values)
df4 = df3.astype('int64')
print("convert dtype of df3 to int64 \n", df4)
print(df4.dtypes)
#Output

dataframe Z
    0  1  2  3  4
0  n  u  m  p  y
1  p  a  n  d  a
0    object
1    object
2    object
3    object
4    object
5    object
dtype: object
ndarray from dataframe using .values 
 [['n' 'u' 'm' 'p' 'y' nan]
 ['p' 'a' 'n' 'd' 'a' 's']]
0    float64
1    float64
2    float64
dtype: object
ndarray from dataframe using .values 
 [[ 34.   42.  455.9]
 [ 23.   45.   69. ]]
convert dtype of df3 to int64 
     0   1    2
0  34  42  455
1  23  45   69
0    int64
1    int64
2    int64
dtype: object
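As a small follow-up sketch, astype also accepts a dict mapping column labels to dtypes, so only selected columns are converted. The example also uses DataFrame.to_numpy(), which assumes Pandas 0.24 or newer; on older versions .values serves the same purpose. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.7, 2.2], "b": [3.0, 4.0]})

# Convert only column 'a'; the fractional part is truncated, 'b' stays float
converted = df.astype({"a": "int64"})
print(converted.dtypes)

# to_numpy() returns the values as an ndarray (assumes Pandas >= 0.24)
arr = df.to_numpy()
print(arr)
```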

5.1 Combining DataFrames by using join

todo

5.2 Combining DataFrames by using merge

todo

5.3 Combining Dataframes by using concat

todo

6.0 Loading Data into Pandas/ Export/Import of Data

todo

6.1 CSV

todo

6.2 JSON

todo

6.3 SQL

7.0 Advanced Pandas Functions

7.1 Describe and Info

todo

7.2 Transpose, Melt, Stack, Unstack

todo

7.3 Apply

todo

7.4 Groupby

todo

7.5 Pivot table

todo
