# Pandas Introduction 

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

You can think of pandas as an extremely powerful version of Excel.

In this section we are going to talk about:

* Introduction to pandas
* Series
* DataFrames
* Missing Data
* Merging, Joining, and Concatenating
* Operations
* Data Input and Output

## Series 

First, we are going to talk about the Series data type.

A Series is very similar to a NumPy array; in fact, it is built on top of the NumPy array object.
Unlike an array, a Series can have axis labels, meaning it can be indexed by label instead of just by integer position.

Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index.

In [1]:
# Let's import numpy and pandas
import numpy as np
import pandas as pd

In [2]:
# We can convert a list, a NumPy array, or a dict to a Series

labels = ['Shivendra','Ragavendra','Narendra']
my_list= [21,25,30]
arr=np.array([10,20,30])
d={'Shivendra':21,'Raghavendra':25,'Narendra':30}

In [4]:
# Using List 
pd.Series(data=my_list)

0    21
1    25
2    30
dtype: int64

In [5]:
pd.Series (data=my_list,index=labels )

Shivendra     21
Ragavendra    25
Narendra      30
dtype: int64

In [5]:
pd.Series(my_list,labels)

Shivendra     21
Ragavendra    25
Narendra      30
dtype: int64

In [7]:
# NumPy Array
pd.Series(arr)

0    10
1    20
2    30
dtype: int32

In [8]:
pd.Series (data=arr,index=labels )

Shivendra     10
Ragavendra    20
Narendra      30
dtype: int32

In [9]:
# Dictionary
pd.Series (d)

Shivendra      21
Raghavendra    25
Narendra       30
dtype: int64

### Data In A Series 

A pandas Series can hold a variety of object types, not just numbers:

In [10]:
pd.Series (data=labels )

0     Shivendra
1    Ragavendra
2      Narendra
dtype: object

## Using an Index

The key to using a Series is understanding its index. Pandas uses these index names or numbers to allow fast lookups of information (it works like a hash table or dictionary).

Let's see some examples of how to grab information from a Series. Let us create two Series, ser1 and ser2:

In [11]:
ser1= pd.Series ([1,2,3,4], index =['Chennai','Bihar','West Bengal','Rajasthan'])

In [13]:
ser1

Chennai        1
Bihar          2
West Bengal    3
Rajasthan      4
dtype: int64

In [14]:
ser2=pd.Series ([1,2,5,4],index=['Chennai','Bihar','Assam','Rajasthan'])

In [15]:
ser2

Chennai      1
Bihar        2
Assam        5
Rajasthan    4
dtype: int64

In [16]:
ser1['Chennai']

1

In [18]:
# Operations are then also done based off of index:
ser1+ser2

Assam          NaN
Bihar          4.0
Chennai        2.0
Rajasthan      8.0
West Bengal    NaN
dtype: float64
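
The NaN entries above appear because `Assam` and `West Bengal` each exist in only one of the two Series. A minimal sketch, assuming you would rather treat a missing label as 0, uses `Series.add` with `fill_value` (the variable name `total` is only illustrative):

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['Chennai', 'Bihar', 'West Bengal', 'Rajasthan'])
ser2 = pd.Series([1, 2, 5, 4], index=['Chennai', 'Bihar', 'Assam', 'Rajasthan'])

# Labels present in only one Series are treated as 0 instead of producing NaN
total = ser1.add(ser2, fill_value=0)
print(total)

# Label membership checks work like dictionary key checks
print('Assam' in ser1)
```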

## DataFrames
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It has three principal components: the data, the rows, and the columns.

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
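
To make the "bunch of Series sharing one index" idea concrete, here is a small sketch. The names `age`, `score`, and `people`, and the score values, are made up for illustration:

```python
import pandas as pd

# Two Series with the same labels (values are invented for this example)
age = pd.Series([21, 25, 30], index=['Shivendra', 'Raghavendra', 'Narendra'])
score = pd.Series([88.5, 92.0, 79.5], index=['Shivendra', 'Raghavendra', 'Narendra'])

# Each dict key becomes a column; the shared labels become the row index
people = pd.DataFrame({'age': age, 'score': score})
print(people)

# Pulling a column back out gives a Series again
print(type(people['age']))
```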

In [19]:
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101)

In [20]:
df=pd.DataFrame (randn(5,5),index='Chennai Bihar UtterPredesh Delhi Mumbai'.split(),columns ='SRM NIT_PATNA BHU IIT_DELHI IIT_Bombay'.split())

In [21]:
df

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay
Chennai,2.70685,0.628133,0.907969,0.503826,0.651118
Bihar,-0.319318,-0.848077,0.605965,-2.018168,0.740122
UtterPredesh,0.528813,-0.589001,0.188695,-0.758872,-0.933237
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509
Mumbai,0.302665,1.693723,-1.706086,-1.159119,-0.134841


## Selection and Indexing

Let's learn the various methods to grab data from a DataFrame

In [22]:
df['SRM']

Chennai         2.706850
Bihar          -0.319318
UtterPredesh    0.528813
Delhi           0.955057
Mumbai          0.302665
Name: SRM, dtype: float64

In [23]:
# We can pass a list of columns names 
df[['SRM' , 'BHU']]

Unnamed: 0,SRM,BHU
Chennai,2.70685,0.907969
Bihar,-0.319318,0.605965
UtterPredesh,0.528813,0.188695
Delhi,0.955057,1.978757
Mumbai,0.302665,-1.706086


In [25]:
df.SRM # Attribute-style access; works only when the column name is a valid Python identifier

Chennai         2.706850
Bihar          -0.319318
UtterPredesh    0.528813
Delhi           0.955057
Mumbai          0.302665
Name: SRM, dtype: float64

In [26]:
# DataFrame columns are just Series

In [27]:
type(df['SRM'])

pandas.core.series.Series

In [30]:
# Creating a new column
df['UPES']=df['SRM']

In [35]:
df['Harshita']=df['SRM'] + df['BHU']

In [36]:
df

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay,UPES,Harshita
Chennai,2.70685,0.628133,0.907969,0.503826,0.651118,2.70685,3.614819
Bihar,-0.319318,-0.848077,0.605965,-2.018168,0.740122,-0.319318,0.286647
UtterPredesh,0.528813,-0.589001,0.188695,-0.758872,-0.933237,0.528813,0.717509
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509,0.955057,2.933814
Mumbai,0.302665,1.693723,-1.706086,-1.159119,-0.134841,0.302665,-1.40342


In [32]:
df.drop('UPES',axis=1) # Axis = 1 for column

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay
Chennai,2.70685,0.628133,0.907969,0.503826,0.651118
Bihar,-0.319318,-0.848077,0.605965,-2.018168,0.740122
UtterPredesh,0.528813,-0.589001,0.188695,-0.758872,-0.933237
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509
Mumbai,0.302665,1.693723,-1.706086,-1.159119,-0.134841


In [27]:
df # drop() returned a new DataFrame, so UPES is still here; use inplace=True (or reassign the result) to remove it permanently

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay,UPES
Chennai,2.70685,0.628133,0.907969,0.503826,0.651118,2.70685
Bihar,-0.319318,-0.848077,0.605965,-2.018168,0.740122,-0.319318
UtterPredesh,0.528813,-0.589001,0.188695,-0.758872,-0.933237,0.528813
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509,0.955057
Mumbai,0.302665,1.693723,-1.706086,-1.159119,-0.134841,0.302665


In [38]:
df.drop('UPES',axis=1,inplace =True )

In [40]:
df.drop('Harshita',axis=1,inplace = True)

In [41]:
df

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay
Chennai,2.70685,0.628133,0.907969,0.503826,0.651118
Bihar,-0.319318,-0.848077,0.605965,-2.018168,0.740122
UtterPredesh,0.528813,-0.589001,0.188695,-0.758872,-0.933237
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509
Mumbai,0.302665,1.693723,-1.706086,-1.159119,-0.134841


In [44]:
df.drop('Delhi',axis=0)

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay
Chennai,2.70685,0.628133,0.907969,0.503826,0.651118
Bihar,-0.319318,-0.848077,0.605965,-2.018168,0.740122
UtterPredesh,0.528813,-0.589001,0.188695,-0.758872,-0.933237
Mumbai,0.302665,1.693723,-1.706086,-1.159119,-0.134841
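
Because `drop` returns a new DataFrame by default, an alternative to `inplace=True` is to reassign the result. A small sketch with a made-up frame (`demo` and its columns are hypothetical, not part of the notebook above):

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}, index=['x', 'y'])

# drop returns a new object; demo itself still has column 'b'
without_b = demo.drop('b', axis=1)
print(list(demo.columns))

# Reassigning keeps the change; columns= and index= avoid remembering axis numbers
demo = demo.drop(columns=['b'])
demo = demo.drop(index=['y'])
print(demo)
```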


In [45]:
df.loc['Chennai']

SRM           2.706850
NIT_PATNA     0.628133
BHU           0.907969
IIT_DELHI     0.503826
IIT_Bombay    0.651118
Name: Chennai, dtype: float64

In [46]:
# We can also select a row by its integer position
df.iloc[1]

SRM          -0.319318
NIT_PATNA    -0.848077
BHU           0.605965
IIT_DELHI    -2.018168
IIT_Bombay    0.740122
Name: Bihar, dtype: float64

In [47]:
df.loc['Bihar','NIT_PATNA']

-0.8480769834036315

In [48]:
df.loc[['Bihar','Mumbai'],['NIT_PATNA','IIT_DELHI']]

Unnamed: 0,NIT_PATNA,IIT_DELHI
Bihar,-0.848077,-2.018168
Mumbai,1.693723,-1.159119


In [49]:
df

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay
Chennai,2.70685,0.628133,0.907969,0.503826,0.651118
Bihar,-0.319318,-0.848077,0.605965,-2.018168,0.740122
UtterPredesh,0.528813,-0.589001,0.188695,-0.758872,-0.933237
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509
Mumbai,0.302665,1.693723,-1.706086,-1.159119,-0.134841


In [50]:
df>0

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay
Chennai,True,True,True,True,True
Bihar,False,False,True,False,True
UtterPredesh,True,False,True,False,False
Delhi,True,True,True,True,True
Mumbai,True,True,False,False,False


In [51]:
df[df>0]

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay
Chennai,2.70685,0.628133,0.907969,0.503826,0.651118
Bihar,,,0.605965,,0.740122
UtterPredesh,0.528813,,0.188695,,
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509
Mumbai,0.302665,1.693723,,,


In [53]:
df[df['SRM']>0] # Bihar is filtered out because its SRM value is negative

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay
Chennai,2.70685,0.628133,0.907969,0.503826,0.651118
UtterPredesh,0.528813,-0.589001,0.188695,-0.758872,-0.933237
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509
Mumbai,0.302665,1.693723,-1.706086,-1.159119,-0.134841


In [55]:
df[df['SRM']>0]['BHU'] # Bihar is excluded because its SRM value is negative

Chennai         0.907969
UtterPredesh    0.188695
Delhi           1.978757
Mumbai         -1.706086
Name: BHU, dtype: float64

In [56]:
df[df['BHU']>0]['SRM'] # Mumbai is excluded because its BHU value is negative

Chennai         2.706850
Bihar          -0.319318
UtterPredesh    0.528813
Delhi           0.955057
Name: SRM, dtype: float64

In [58]:
df[df['SRM']>0][['BHU','IIT_DELHI']]

Unnamed: 0,BHU,IIT_DELHI
Chennai,0.907969,0.503826
UtterPredesh,0.188695,-0.758872
Delhi,1.978757,2.605967
Mumbai,-1.706086,-1.159119


In [59]:
df[(df['SRM']>0.955) & (df['BHU']>0)] # Each condition needs its own parentheses

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay
Chennai,2.70685,0.628133,0.907969,0.503826,0.651118
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509
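
When combining conditions, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` in Python. The sketch below uses a small made-up frame (`demo`, with rounded illustrative values) to show AND, OR, and NOT masks:

```python
import pandas as pd

demo = pd.DataFrame({'SRM': [2.7, -0.3, 0.5], 'BHU': [0.9, 0.6, -1.7]},
                    index=['Chennai', 'Bihar', 'Mumbai'])

# AND: both conditions must hold
both = demo[(demo['SRM'] > 0) & (demo['BHU'] > 0)]

# OR: at least one condition holds
either = demo[(demo['SRM'] > 0) | (demo['BHU'] > 0)]

# NOT: invert a mask with ~
not_positive = demo[~(demo['SRM'] > 0)]

print(both.index.tolist(), either.index.tolist(), not_positive.index.tolist())
```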


## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [60]:
df

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay
Chennai,2.70685,0.628133,0.907969,0.503826,0.651118
Bihar,-0.319318,-0.848077,0.605965,-2.018168,0.740122
UtterPredesh,0.528813,-0.589001,0.188695,-0.758872,-0.933237
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509
Mumbai,0.302665,1.693723,-1.706086,-1.159119,-0.134841


In [62]:
df.reset_index()

Unnamed: 0,index,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay
0,Chennai,2.70685,0.628133,0.907969,0.503826,0.651118
1,Bihar,-0.319318,-0.848077,0.605965,-2.018168,0.740122
2,UtterPredesh,0.528813,-0.589001,0.188695,-0.758872,-0.933237
3,Delhi,0.955057,0.190794,1.978757,2.605967,0.683509
4,Mumbai,0.302665,1.693723,-1.706086,-1.159119,-0.134841


In [64]:
newind ='Tamil_Nadu BIHAR UP Delhi Maharastra'.split()

In [65]:
df['States']=newind

In [66]:
df

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay,States
Chennai,2.70685,0.628133,0.907969,0.503826,0.651118,Tamil_Nadu
Bihar,-0.319318,-0.848077,0.605965,-2.018168,0.740122,BIHAR
UtterPredesh,0.528813,-0.589001,0.188695,-0.758872,-0.933237,UP
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509,Delhi
Mumbai,0.302665,1.693723,-1.706086,-1.159119,-0.134841,Maharastra


In [67]:
df.set_index('States')

States,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay
Tamil_Nadu,2.70685,0.628133,0.907969,0.503826,0.651118
BIHAR,-0.319318,-0.848077,0.605965,-2.018168,0.740122
UP,0.528813,-0.589001,0.188695,-0.758872,-0.933237
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509
Maharastra,0.302665,1.693723,-1.706086,-1.159119,-0.134841


In [68]:
df

Unnamed: 0,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay,States
Chennai,2.70685,0.628133,0.907969,0.503826,0.651118,Tamil_Nadu
Bihar,-0.319318,-0.848077,0.605965,-2.018168,0.740122,BIHAR
UtterPredesh,0.528813,-0.589001,0.188695,-0.758872,-0.933237,UP
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509,Delhi
Mumbai,0.302665,1.693723,-1.706086,-1.159119,-0.134841,Maharastra


In [69]:
df.set_index('States',inplace=True)

In [70]:
df

States,SRM,NIT_PATNA,BHU,IIT_DELHI,IIT_Bombay
Tamil_Nadu,2.70685,0.628133,0.907969,0.503826,0.651118
BIHAR,-0.319318,-0.848077,0.605965,-2.018168,0.740122
UP,0.528813,-0.589001,0.188695,-0.758872,-0.933237
Delhi,0.955057,0.190794,1.978757,2.605967,0.683509
Maharastra,0.302665,1.693723,-1.706086,-1.159119,-0.134841


## Multi-Index and Index Hierarchy

Let us go over how to work with a Multi-Index. First, we'll create a quick example of what a Multi-Indexed DataFrame looks like:

In [71]:
# index Levels 
outside =['Big_data','Big_data','Big_data','AI','AI','AI']
inside =[1,2,3,1,2,3]
hier_index=list(zip(outside,inside))
hier_index=pd.MultiIndex.from_tuples(hier_index)

In [72]:
hier_index

MultiIndex([('Big_data', 1),
            ('Big_data', 2),
            ('Big_data', 3),
            (      'AI', 1),
            (      'AI', 2),
            (      'AI', 3)],
           )

In [74]:
df=pd.DataFrame(np.random.rand (6,2),index=hier_index,columns=['Core','volunteers'])

In [75]:
df

Unnamed: 0,Unnamed: 1,Core,volunteers
Big_data,1,0.701371,0.487635
Big_data,2,0.680678,0.521548
Big_data,3,0.043397,0.223937
AI,1,0.575205,0.120434
AI,2,0.500117,0.13801
AI,3,0.052808,0.178277


Now let's index it. For rows in an index hierarchy we use df.loc[]; if the hierarchy were on the columns axis, you would use normal bracket notation, df[]. Selecting a value from the outer level returns the sub-DataFrame:

In [77]:
df.loc['Big_data']

Unnamed: 0,Core,volunteers
1,0.701371,0.487635
2,0.680678,0.521548
3,0.043397,0.223937


In [78]:
df.loc['Big_data'].loc[1]

Core          0.701371
volunteers    0.487635
Name: 1, dtype: float64

In [79]:
df.index.names

FrozenList([None, None])

In [80]:
df.index.names=['Domain','S.NO']

In [81]:
df

Domain,S.NO,Core,volunteers
Big_data,1,0.701371,0.487635
Big_data,2,0.680678,0.521548
Big_data,3,0.043397,0.223937
AI,1,0.575205,0.120434
AI,2,0.500117,0.13801
AI,3,0.052808,0.178277


In [82]:
df.xs('Big_data')

S.NO,Core,volunteers
1,0.701371,0.487635
2,0.680678,0.521548
3,0.043397,0.223937
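
Two further ways to reach into a Multi-Index are a tuple passed to `loc` and `xs` with a `level` argument. The sketch below is an illustration on a small made-up frame (`demo`, filled with consecutive integers), not the random data shown above:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('AI', 1), ('AI', 2), ('Big_data', 1), ('Big_data', 2)],
    names=['Domain', 'S.NO'])
demo = pd.DataFrame(np.arange(8).reshape(4, 2), index=index,
                    columns=['Core', 'volunteers'])

# A tuple selects one row across both levels in a single step
row = demo.loc[('Big_data', 1)]

# xs with level= takes a cross-section on the inner level
first_of_each = demo.xs(1, level='S.NO')

print(row)
print(first_of_each)
```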


In [83]:
weather_data = {
    'day': ['1/1/2017','1/2/2017','1/3/2017','1/4/2017','1/5/2017','1/6/2017'],
    'temperature': [32,35,28,24,32,31],
    'windspeed': [6,7,2,7,4,2],
    'event': ['Rain', 'Sunny', 'Snow','Snow','Rain', 'Sunny']
}

In [84]:
df=pd.DataFrame(weather_data)

In [85]:
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


In [86]:
df.shape # rows, columns = df.shape

(6, 4)

In [87]:
df.head() # df.head(3)

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain


In [88]:
df.tail() # df.tail(2)

Unnamed: 0,day,temperature,windspeed,event
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


In [89]:
df[1:3]

Unnamed: 0,day,temperature,windspeed,event
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow


## Columns

In [90]:
df.columns

Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')

In [91]:
df['day']

0    1/1/2017
1    1/2/2017
2    1/3/2017
3    1/4/2017
4    1/5/2017
5    1/6/2017
Name: day, dtype: object

In [92]:
type(df['day'])

pandas.core.series.Series

In [94]:
df[['day','temperature']]

Unnamed: 0,day,temperature
0,1/1/2017,32
1,1/2/2017,35
2,1/3/2017,28
3,1/4/2017,24
4,1/5/2017,32
5,1/6/2017,31


## Operations On DataFrame

In [97]:
df['temperature'].max()

35

In [98]:
df[df['temperature']>32]

Unnamed: 0,day,temperature,windspeed,event
1,1/2/2017,35,7,Sunny


In [101]:
df['day'][df['temperature'] == df['temperature'].max()] # Like a SQL query: the day(s) with the highest temperature

1    1/2/2017
Name: day, dtype: object
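
The same lookup can also be written with `idxmax`, which returns the index label of the first maximum. This is a sketch on the first three rows of the weather data; the names `weather`, `hottest_label`, and `hottest_day` are illustrative:

```python
import pandas as pd

weather = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017'],
    'temperature': [32, 35, 28],
})

# idxmax gives the row label of the (first) highest temperature
hottest_label = weather['temperature'].idxmax()

# loc then picks the 'day' value on that row
hottest_day = weather.loc[hottest_label, 'day']
print(hottest_day)
```

Unlike the boolean-mask version, this returns a single value rather than a Series, and it reports only the first row if several days tie for the maximum.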

In [102]:
df['temperature'].std()

3.8297084310253524

In [103]:
df['event'].max() # But mean() won't work since data type is string

'Sunny'

In [78]:
df.describe()

Unnamed: 0,temperature,windspeed
count,6.0,6.0
mean,30.333333,4.666667
std,3.829708,2.33809
min,24.0,2.0
25%,28.75,2.5
50%,31.5,5.0
75%,32.0,6.75
max,35.0,7.0


# Missing Data

Let's show a few convenient methods to deal with Missing Data in pandas:

In [104]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})

In [105]:
df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


In [106]:
df.dropna()

Unnamed: 0,A,B,C
0,1.0,5.0,1


In [107]:
df.dropna(axis=1)

Unnamed: 0,C
0,1
1,2
2,3


In [83]:
df.dropna(thresh=2)

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2


In [108]:
df.fillna(value='shivendra')

Unnamed: 0,A,B,C
0,1,5,1
1,2,shivendra,2
2,shivendra,shivendra,3


In [109]:
df['A'].fillna(value=df['A'].mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
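
Before dropping or filling, it often helps to count how many values are missing. The sketch below rebuilds the same small frame, counts the gaps per column, and then fills each column with its own mean; `missing_counts` and `filled` are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Count missing values per column
missing_counts = df.isna().sum()
print(missing_counts)

# Fill each column's gaps with that column's own mean
filled = df.fillna(df.mean())
print(filled)
```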

# Groupby

The groupby method allows you to group rows of data together by the values in a column and then call aggregate functions on each group.

In [86]:
import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Shivendra','Abhishek','Sowjanya','Manish','Mini','Satya'],
       'Sales':[200,120,340,124,243,350]}

In [87]:
df = pd.DataFrame(data)

In [88]:
df

Unnamed: 0,Company,Person,Sales
0,GOOG,Shivendra,200
1,GOOG,Abhishek,120
2,MSFT,Sowjanya,340
3,MSFT,Manish,124
4,FB,Mini,243
5,FB,Satya,350


**Now you can use the .groupby() method to group rows together based on a column name. For instance, let's group by Company. This creates a DataFrameGroupBy object:**

In [89]:
df.groupby('Company')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F12523CB08>

In [90]:
#You can save this object as a new variable:
by_comp = df.groupby("Company")
#And then call aggregate methods off the object:
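# numeric_only=True skips the text column Person (needed in pandas 2.0+, where mean() raises an error on it)
by_comp.mean(numeric_only=True)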


Company,Sales
FB,296.5
GOOG,160.0
MSFT,232.0


In [91]:
df.groupby('Company').mean(numeric_only=True)

Company,Sales
FB,296.5
GOOG,160.0
MSFT,232.0


In [92]:
#More examples of aggregate methods:
by_comp.std(numeric_only=True) # skip the text column Person

Company,Sales
FB,75.660426
GOOG,56.568542
MSFT,152.735065


In [93]:
by_comp.max()

Company,Person,Sales
FB,Satya,350
GOOG,Shivendra,200
MSFT,Sowjanya,340


In [94]:
by_comp.min()

Company,Person,Sales
FB,Mini,243
GOOG,Abhishek,120
MSFT,Manish,124


In [95]:
by_comp.describe()

,Sales,Sales,Sales,Sales,Sales,Sales,Sales,Sales
Company,count,mean,std,min,25%,50%,75%,max
FB,2.0,296.5,75.660426,243.0,269.75,296.5,323.25,350.0
GOOG,2.0,160.0,56.568542,120.0,140.0,160.0,180.0,200.0
MSFT,2.0,232.0,152.735065,124.0,178.0,232.0,286.0,340.0


In [96]:
by_comp.describe().transpose()

Unnamed: 0,Company,FB,GOOG,MSFT
Sales,count,2.0,2.0,2.0
Sales,mean,296.5,160.0,232.0
Sales,std,75.660426,56.568542,152.735065
Sales,min,243.0,120.0,124.0
Sales,25%,269.75,140.0,178.0
Sales,50%,296.5,160.0,232.0
Sales,75%,323.25,180.0,286.0
Sales,max,350.0,200.0,340.0


In [97]:
by_comp.describe().transpose()['GOOG']

Sales  count      2.000000
       mean     160.000000
       std       56.568542
       min      120.000000
       25%      140.000000
       50%      160.000000
       75%      180.000000
       max      200.000000
Name: GOOG, dtype: float64
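
If only a few aggregates are needed, selecting the numeric column first and passing a list to `agg` is a compact alternative to `describe()`. A minimal sketch on the same sales data (`summary` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
                   'Person': ['Shivendra', 'Abhishek', 'Sowjanya', 'Manish', 'Mini', 'Satya'],
                   'Sales': [200, 120, 340, 124, 243, 350]})

# Select the numeric column, then request several aggregates at once
summary = df.groupby('Company')['Sales'].agg(['sum', 'mean', 'max'])
print(summary)
```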

# Merging, Joining, and Concatenating

There are three main ways of combining DataFrames: merging, joining, and concatenating. The examples below build three small DataFrames and concatenate them.

____

In [98]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

In [99]:
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 

In [100]:
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])

In [101]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [102]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [103]:
df3

Unnamed: 0,A,B,C,D
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


## Concatenation

Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use **pd.concat** and pass in a list of DataFrames to concatenate together:

In [104]:
pd.concat([df1,df2,df3])

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [105]:
pd.concat([df1,df2,df3],axis=1)

Unnamed: 0,A,B,C,D,A,B,C,D,A,B,C,D
0,A0,B0,C0,D0,,,,,,,,
1,A1,B1,C1,D1,,,,,,,,
2,A2,B2,C2,D2,,,,,,,,
3,A3,B3,C3,D3,,,,,,,,
4,,,,,A4,B4,C4,D4,,,,
5,,,,,A5,B5,C5,D5,,,,
6,,,,,A6,B6,C6,D6,,,,
7,,,,,A7,B7,C7,D7,,,,
8,,,,,,,,,A8,B8,C8,D8
9,,,,,,,,,A9,B9,C9,D9
10,,,,,,,,,A10,B10,C10,D10
11,,,,,,,,,A11,B11,C11,D11
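
Merging and joining, the other two techniques named in this section's title, match rows by key instead of stacking them. The following is a minimal sketch on two made-up frames; `left`, `right`, and the `key` column are hypothetical and not part of the data above:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': ['B0', 'B1', 'B3']})

# merge matches rows on a column; how='inner' keeps keys found in both frames
merged = pd.merge(left, right, on='key', how='inner')
print(merged)

# join matches rows on the index; how='left' keeps every row of the left frame
left_idx = left.set_index('key')
right_idx = right.set_index('key')
joined = left_idx.join(right_idx, how='left')
print(joined)
```

In the joined result, `K2` has no match in `right`, so its `B` value is NaN.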


# Operations

There are lots of operations with pandas that will be really useful to you, but don't fall into any distinct category. Let's show them here in this lecture:

In [106]:
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()

Unnamed: 0,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


In [107]:
df['col2'].unique()

array([444, 555, 666], dtype=int64)

In [108]:
df['col2'].nunique()

3

In [109]:
df['col2'].value_counts()

444    2
555    1
666    1
Name: col2, dtype: int64

In [110]:
#Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]

In [111]:
newdf

Unnamed: 0,col1,col2,col3
3,4,444,xyz


In [112]:
# Applying Functions
def times2(x):
    return x*2

In [113]:
df['col1'].apply(times2)

0    2
1    4
2    6
3    8
Name: col1, dtype: int64

In [114]:
df['col3'].apply(len)

0    3
1    3
2    3
3    3
Name: col3, dtype: int64
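
For one-off transformations, `apply` also accepts a lambda, and many simple cases have a vectorized equivalent. A short sketch on the same columns (the result names `doubled`, `doubled_vectorized`, and `upper` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col3': ['abc', 'def', 'ghi', 'xyz']})

# A lambda avoids defining a named function for a one-off transformation
doubled = df['col1'].apply(lambda x: x * 2)

# For simple arithmetic, the vectorized form gives the same result
doubled_vectorized = df['col1'] * 2

# String methods are available through the .str accessor
upper = df['col3'].str.upper()

print(doubled.tolist(), upper.tolist())
```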

In [115]:
df['col1'].sum()

10

**Permanently Removing a Column**

In [116]:
del df['col1']

In [117]:
df

Unnamed: 0,col2,col3
0,444,abc
1,555,def
2,666,ghi
3,444,xyz


In [118]:
# get columns and index names 
df.columns 

Index(['col2', 'col3'], dtype='object')

In [119]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [120]:
df

Unnamed: 0,col2,col3
0,444,abc
1,555,def
2,666,ghi
3,444,xyz


In [121]:
df.sort_values(by='col2') #inplace=False by default

Unnamed: 0,col2,col3
0,444,abc
3,444,xyz
1,555,def
2,666,ghi


In [123]:
# Check whether there are any null values
df.isnull()

Unnamed: 0,col2,col3
0,False,False
1,False,False
2,False,False
3,False,False


In [124]:
# Drop rows with NaN Values
df.dropna()

Unnamed: 0,col2,col3
0,444,abc
1,555,def
2,666,ghi
3,444,xyz


In [125]:
df = pd.DataFrame({'col1':[1,2,3,np.nan],
                   'col2':[np.nan,555,666,444],
                   'col3':['abc','def','ghi','xyz']})
df.head()

Unnamed: 0,col1,col2,col3
0,1.0,,abc
1,2.0,555.0,def
2,3.0,666.0,ghi
3,,444.0,xyz


In [126]:
df.fillna('FILL')

Unnamed: 0,col1,col2,col3
0,1,FILL,abc
1,2,555,def
2,3,666,ghi
3,FILL,444,xyz


In [127]:
data = {'A':['foo','foo','foo','bar','bar','bar'],
        'B':['one','one','two','two','one','one'],
        'C':['x','y','x','y','x','y'],
        'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)

In [128]:
df

Unnamed: 0,A,B,C,D
0,foo,one,x,1
1,foo,one,y,3
2,foo,two,x,2
3,bar,two,y,5
4,bar,one,x,4
5,bar,one,y,1


# Great