# What the f*ck is Pandas? - A Complete Tutorial

*Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data.*

## Table of contents

- 01. Introduction to Pandas
- 02. Why do I even care about Pandas?
- 03. Series
- 04. DataFrames
- 04.01. Structure of a DataFrame
- 04.02. Creating a DataFrame
- 04.03. Creating a DataFrame using Excel and CSV Files
- 04.04. Head and Tail
- 04.05. Descriptive Statistics
- 04.06. DataFrame Slicing
- 04.07. Indexing Operators
- 04.08. Remove Duplicates
- 04.09. Dropping from DataFrame
- 04.10. Counting Values
- 04.11. Sorting Values
- 04.12. Fill NaN Values

Pandas is a Python library, much like NumPy. While Pandas adopts many coding idioms from NumPy, the biggest difference is that Pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.

Before reading this blog, I assume that you have read my NumPy blog post, because Pandas requires that you know the basics of NumPy. If you haven't read my NumPy blog, click here to read it.

Just like we imported NumPy, we will import Pandas.

```
import pandas as pd
```

### 01. Introduction to Pandas

Pandas is a Python library used for data manipulation and analysis. Pandas provides a convenient way to analyze and clean data. The Pandas library introduces two new data structures to Python - Series and DataFrame, both of which are built on top of NumPy.

Pandas is a powerful library generally used for:

- Data Cleaning
- Data Transformation
- Data Analysis
- Machine Learning
- Data Visualization

### 02. Why do I even care about Pandas?

Some of the reasons why we should use Pandas are as follows:

**Handle Large Data Efficiently**

Pandas is designed for handling large datasets. It provides powerful tools that simplify tasks like data filtering, transforming, and merging.

It also provides built-in functions to work with formats like **CSV**, **JSON**, **TXT**, **Excel**, and **SQL** databases.

**Tabular Data Representation**

Pandas DataFrames, the primary data structure of Pandas, handle data in tabular format. This allows easy indexing, selecting, replacing, and slicing of data.

**Data Cleaning and Preprocessing**

Data cleaning and preprocessing are essential steps in the data analysis pipeline, and Pandas provides powerful tools to facilitate these tasks. It has methods for handling missing values, removing duplicates, handling outliers, data normalization, etc.

**Time Series Functionality**

Pandas contains an extensive set of tools for working with dates, times, and time-indexed data, as it was initially developed for financial modeling.

**Free and Open-Source**

Pandas follows the same principles as Python, allowing you to use and distribute it for free, even for commercial use.

### 03. Series

To get started with Pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications. We will be starting with Series.

A Pandas Series is like a column in a table: a one-dimensional array that can hold data of any type. It is very similar to a 1-D NumPy array.

```
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
```

The output of this would be:

One difference between a NumPy array and a Pandas Series, as we can see, is that a Series has index labels. By default, the index labels are 0, 1, 2, and so on. However, we can also change them.

```
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a,index=("apple","banana","mango"))
print(myvar)
```

Now the output of this would be:

Pandas is very flexible here: instead of a tuple, you can also give a list or even a NumPy array as input.

```
import pandas as pd
import numpy as np
a = np.array([1, 7, 2])
myvar = pd.Series(a,index=np.array(["apple","banana","mango"]))
print(myvar)
```

The output of this would remain the same.

In Pandas, you can access an element either by its index label or by its integer position. For example, suppose I have to access the element 7; then I can use one of two ways:

```
myvar["banana"]
```

```
myvar[1]
```

Both of them would give the same output, which is 7. But what if we want to access the index values? Well... for that, we can use the `.index` attribute, which returns an array of indices.

```
import pandas as pd
import numpy as np
a = np.array([1, 7, 2])
myvar = pd.Series(a,index=np.array(["apple","banana","mango"]))
print(myvar.index)
```

The output of this would be:

We can then access individual elements using NumPy indexing and slicing. (If you do not understand what Indexing and Slicing are, check out my blog post on NumPy).

Now that we have had some introduction to Series, let us dive deep into this data structure.

#### 03.01. Creating a Series

There are several methods of creating a Series. We can create it using a List, Tuple or a NumPy Array.

```
import pandas as pd
import numpy as np
np_array = np.array([1,2,3,5,6,7,9,1,10])
data = pd.Series(np_array)
print(data)
```

The output of this would be:

By default, the index values in all three methods are 0, 1, 2, 3, and so on. If we want to give our own indices, we can pass a list, tuple, or ndarray to the `pd.Series( )` function using the `index=` parameter.

Now, there's another way to create a Pandas Series, which is using a Dictionary.

```
import pandas as pd
import numpy as np
dictionary = {'a':1,'b':2,'c':3, 'd':4}
data = pd.Series(dictionary)
print(data)
```

Here, the keys of the dictionary are converted to index labels, so we do not have to specify the index parameter separately. However, if we do pass the index parameter, we get an interesting result.

```
import pandas as pd
import numpy as np
dictionary = {'a':1,'b':2,'c':3, 'd':4}
data = pd.Series(dictionary,index=['a',6,'c',8])
print(data)
```

Let's see the output first.

As we can see, it creates a new Series object with the index values. If the values corresponding to the indices are present in the dictionary, those values are used, else NaN is used.

Another interesting observation is that when we pass a scalar value in the Series( ) method with index values, it just copies the scalar value for every index.

```
import pandas as pd
import numpy as np
data = pd.Series(2,index=['a',6,'c',8])
print(data)
```

The output of this is:

#### 03.02. Accessing Elements of Series

There are primarily two ways to retrieve an element from a Pandas Series object: by position or by label. Let us go through them one by one, starting with how we may retrieve an element by its position.

To retrieve an element of a Pandas Series by position, we use the index operator `[]` with an integer index. To retrieve multiple entries from a Series, we use the slice operation, which returns a sub-Series.

```
import pandas as pd
import numpy as np
np_array = np.array([1,2,3,5,6,7,9,1,10])
data = pd.Series(np_array)
print("data :\n", data)
print("data present at index 1 is :",data[1])
print("First 5 data elements : ")
print(data[:5])
```

The output of this would be:

If we have predetermined index labels, we may use them to access the Pandas Series object. You may picture the Series as a fixed-size dictionary where you can retrieve and modify values by index label. Please note that when we slice using index labels, the end value is also included. For example,

```
import pandas as pd
import numpy as np
np_array = np.array([1,2,3,5,6,7,9,1,10])
data = pd.Series(np_array,index='a,b,c,d,e,f,g,h,i'.split(","))
print("data :")
print(data)
print("First 5 values :")
print(data["a":"e"]) #Includes e
```

The output of this would be:

#### 03.03. Indexing Operators

Indexing can be done using two indexers: `.loc[ ]` and `.iloc[ ]`. Notice that unlike normal functions, loc and iloc use square brackets instead of round brackets.

`.loc[ ]` is a label-based selection method, which means that we pass the name of the row or column we want to select. Unlike `.iloc[ ]`, it includes the last element of the range passed to it, and it also accepts boolean data.

`.iloc[ ]` is an integer-position-based selection method, which means that we pass an integer index to select a specific row/column. Unlike `.loc[ ]`, it does not include the last element of the range passed to it, and it does not accept boolean data.

Let's look at an example. Suppose we have a Series and we have to find all values greater than 4. For that, we can simply use the `.loc[ ]` indexer, as shown below:

```
import pandas as pd
import numpy as np
a = pd.Series(np.array((1,2,3,4,5,6)),index='a,b,c,d,e,f'.split(','))
print(a.loc[a>4])
```

As expected, the output of this would be:

We can also use `.loc[ ]` to extract values based on index labels.

```
import pandas as pd
import numpy as np
a = pd.Series(np.array((1,2,3,4,5,6)),index='a,b,c,d,e,f'.split(','))
print(a.loc['a':'d'])
```

As we just learnt, this method includes the last value of the range that is passed to it. Hence, the output of this would be:

If we have to use integer indexes instead of boolean values or index labels, we use `.iloc[ ]`. This does not include the last value of the range passed to it.

```
import pandas as pd
import numpy as np
a = pd.Series(np.array((1,2,3,4,5,6)),index='a,b,c,d,e,f'.split(','))
print(a.iloc[3:6])
```

The output of this would be:

Slicing is not the only option: wherever we use a slice, we can also pass a list of positions. This means that fancy indexing works in Pandas as well. Suppose I want to take the Series and get element 1, element 4, and element 0, in that order; then I can do the following:

```
import pandas as pd
import numpy as np
a = pd.Series(np.array((1,2,3,4,5,6)),index='a,b,c,d,e,f'.split(','))
print(a.iloc[[1,4,0]])
```

#### 03.04. Series Attributes

Several attributes can be used with a Series object. These are as follows:

**series.values:** returns the underlying data array, containing all of the Series' values. `print("Series Values :\n", series.values)`

**series.index:** returns the index array, containing all of the Series' index labels. `print("Series Index :\n", series.index)`

**series.index.size:** size is a property of a NumPy array that returns the number of items contained in it; here it gives the number of index labels.

**series.values.itemsize:** itemsize is a property of a NumPy array that returns the memory size, in bytes, of one element.

**series.dtype:** the datatype of the elements of the Series object.

**series.shape:** a tuple describing the shape of the object; for a Series it is always (n,), where n is the number of elements.

**series.ndim:** the number of dimensions of the object. A Pandas Series is always a 1-dimensional object.

**series.nbytes:** the total number of bytes consumed by the underlying data of the Series object.

**series.size:** the number of elements in the Series.

```
import pandas as pd
import numpy as np
a = pd.Series(np.array((1,2,3,4,5,6)),index='a,b,c,d,e,f'.split(','))
print(a.shape)
print(a.dtype)
print(a.ndim)
print(a.size)
```

The output of this would be:
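The remaining attributes can be inspected the same way; here is a minimal sketch (the values and labels are made up for illustration):

```python
import pandas as pd
import numpy as np

s = pd.Series(np.array([1, 2, 3], dtype=np.int64), index=['a', 'b', 'c'])
print(s.values)           # the underlying NumPy array
print(s.index)            # the index labels
print(s.nbytes)           # 3 int64 elements -> 24 bytes
print(s.values.itemsize)  # 8 bytes per int64 element
```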

#### 03.05. isna( )

If you are familiar with `np.nan`, then you must know that two NaN values are never equal. This implies that we cannot use an equality comparison to check whether NaN values are present in a Series object. Instead, we have a method called isna( ) that returns a Series with True wherever it finds NaN values, and False elsewhere.

```
import pandas as pd
import numpy as np
a = pd.Series(np.array((1,2,3,4,np.nan,6)),index='a,b,c,d,e,f'.split(','))
print(a.loc[a.isna()])
```

The output of this would be:

#### 03.06. Head and Tail

The head() and tail() functions retrieve the first and last n rows, respectively. If we don't specify a number for n, it defaults to 5. They are handy for quickly inspecting data, such as when we have a large amount of it.

```
series.head( )  # first n rows; n defaults to 5
series.tail( )  # last n rows; n defaults to 5
```

#### 03.07. Unique Values in Series

The unique() and nunique() functions return the distinct values and the count of distinct values, respectively. These methods are useful when we want to check how many groups our data may have been divided into.

```
import pandas as pd
import numpy as np
series = pd.Series(['a','b','c','d','c','d','b','a'])
print("Unique Values : ",series.unique())
print("Number of Unique Values : ",series.nunique())
```

The output of this would be:

We have the value_counts() function in addition to the unique() and nunique() functions. The **value_counts()** function outputs the number of times each unique value appears in a Series. It is helpful for understanding the Series object's value distribution.

```
series.value_counts()
```

The output of this would be:

#### 03.08. Performing Statistical Calculations

On a Series object's values, we may execute **statistical operations** such as mean(), sum(), product(), max(), min(), median(), and so on. For example:

```
import pandas as pd
import numpy as np
series = pd.Series([2,3,5,2,5,6,1,3,0])
print("Sum of all the elements of series : ",series.sum())
print("Mean of all the elements of series : ",series.mean())
print("Product of all the elements of series : ",series.product())
print("Standard deviation of all the elements of series : ",series.std())
print("Max elements of series : ",series.max())
print("Smallest elements of series : ",series.min())
print("Median of all the elements of series : ",series.median())
```

The output of this would be:

If we need numerous statistical operations to be performed at once, we may pass them to the agg() function as a list.

```
import pandas as pd
import numpy as np
series = pd.Series([2,3,5,2,5,6,1,3,0])
print(series.agg(['sum','mean','product','std','max','min','median']))
```

The output of this would be:

### 04. DataFrames

A DataFrame is a data structure that, like a spreadsheet, arranges information into a 2-dimensional table of rows and columns. Because they provide a versatile and easy manner of storing and interacting with data, DataFrames are among the most often utilized data structures in modern data analytics.

Understanding the DataFrame is critical because it is one of Pandas' fundamental data structures. In short, a Pandas DataFrame is a Python equivalent of an Excel sheet, containing tables, rows, and columns, as well as several functions that make it an excellent framework for data processing, examination, and modification.

Some of the common functionalities provided by Pandas DataFrame are as follows:

- Pandas DataFrame has several adjustable features and parameters. They're great productivity enhancers, since they allow you to completely customize your Pandas DataFrame environment.
- Each column in a Pandas DataFrame must include data of the same type, although separate columns might contain data of different types.
- Pandas DataFrame is value mutable and size mutable; size mutable in this context refers to the number of columns.
- There are two axes in a Pandas DataFrame object: "axis 0" and "axis 1". Rows are represented by "axis 0", while columns are represented by "axis 1".
- We can execute mathematical operations on the rows and columns of a Pandas DataFrame.
- We can stack two separate DataFrames horizontally or vertically.
- Our DataFrame may be reshaped, merged, and transposed.

#### 04.01. Structure of a DataFrame

A Pandas DataFrame is a two-dimensional data structure made up of rows and columns, similar to an Excel sheet or a database table (like SQL). Each column of a Pandas DataFrame is a Pandas Series. These columns should all be the same length, although they may be of distinct data types, such as float, int, or bool. DataFrames can change their values as well as their size, so we can change the values stored in the DataFrame or add/remove columns from it.

Pandas DataFrames are typically made up of Values, a row index, and a column index. There are two index arrays in the DataFrame. The functions of the first index array are substantially similar to those of the index array in series. In reality, each label corresponds to every value in the row. The second array has a sequence of labels, each connected with a certain column. A Pandas DataFrame object has two axes: "axis 0" and "axis 1." The "axis 0" indicates rows, whereas the "axis 1" indicates columns.
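The two index arrays described above can be inspected directly. A minimal sketch (the column names and values here are made up):

```python
import pandas as pd

# A tiny made-up DataFrame to inspect the two index arrays
df = pd.DataFrame({'name': ['Ann', 'Bob'], 'age': [30, 25]})
print(df.index)    # row labels (axis 0)
print(df.columns)  # column labels (axis 1)
print(df.dtypes)   # each column keeps its own data type
```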

#### 04.02. Creating a DataFrame

The most basic DataFrame that can be formed is an empty one.

```
import pandas as pd
df = pd.DataFrame()
print(df)
```

We can create a DataFrame using Ndarrays, List or Tuples.

```
import pandas as pd
import numpy as np
data = [[1,2,3],
[4,5,6],
[7,8,9]]
df1 = pd.DataFrame(data, columns = ['a','b','c']) # Using List
print('DataFrame 1: \n',df1)
df2 = pd.DataFrame(np.array(data), columns = ['a','b','c']) # Using NumPy array
print('DataFrame 2: \n',df2)
```

The output of this would be as follows:

In the preceding code example, we supplied a 2D NumPy array / list; each inner array corresponds to a row in the DataFrame and must be of equal length. If we additionally supply an index, it must have the same length as the number of rows. Let's supply index values. This is done in the same way as we did for Series.

```
import pandas as pd
import numpy as np
data = [[1,2,3],
[4,5,6],
[7,8,9]]
df1 = pd.DataFrame(data, columns = ['a','b','c'],index=['p','q','r'])
print('DataFrame: \n',df1)
```

The output of this would be:

We can use a dictionary to create a DataFrame as well. It is done in the same way as for a Series, with one minor difference: when creating a Series from a dictionary, the keys become the row labels; here, the keys become the column labels.

Let's take an example.

```
import pandas as pd
import numpy as np
data = {'a' : [1,2,3],
'b' : [4,5,6],
'c' : [7,8,9]}
df = pd.DataFrame(data)
print('DataFrame : \n',df)
```

The output of this would be:

Suppose that we want just the reverse: we want a, b, c to be our index (or row) labels. Then we can simply do what we do in NumPy and transpose our DataFrame, using the `transpose( )` method.

```
import pandas as pd
data = {'a' : [1,2,3],
'b' : [4,5,6],
'c' : [7,8,9]}
df = pd.DataFrame(data)
print('DataFrame : \n',df.transpose())
```

The output of this would be:

There is one more way we can create a DataFrame using dictionaries. In the examples above, we used a dictionary of lists. However, we can also use a list of dictionaries.

```
import pandas as pd
import numpy as np
data = [{'a': 1, 'b': 2, 'c':3},
{'a': 4, 'b': 5, 'c':6},
{'a': 9, 'b': 8, 'c':7}]
df = pd.DataFrame(data)
print('DataFrame : \n',df)
```

This method is viable, although not recommended as it is too verbose. The output of this would be:

Please note that we can also use NumPy Arrays or Tuples instead of Lists. There are two more very common methods of creating a DataFrame. However, they're so common that they deserve a separate subtopic.
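For example, a tuple of tuples works just like the list of lists we used earlier (the data here is made up):

```python
import pandas as pd

# A tuple of tuples behaves the same as a list of lists
data = ((1, 2), (3, 4))
df = pd.DataFrame(data, columns=('a', 'b'))
print(df)
```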

#### 04.03. Creating a DataFrame using Excel and CSV Files

We can create a DataFrame from an MS Excel or CSV file. What is a CSV file? CSV is an abbreviation for comma-separated values, one of the most widely used file formats. An example of a CSV file is https://bit.ly/PandasCSVFile. We shall be performing all operations from now on with this CSV file, so I recommend that you download it if you are following along.

We can create a DataFrame from a CSV file with the help of the `read_csv( )` method. It takes one argument, which is the path or the URL of the CSV file.

```
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/lightlessdays/What-the-Fck/main/is%20Pandas/05.03/sample.csv')
print('DataFrame : \n',df.head())
```

As we saw in the Series section, `.head( )` prints the first five rows of the DataFrame.

The output of this would be:

The `...` means that some values are hidden, because the display width is limited. But we know that 5 rows and 5 columns are present in this output, as Python has written it at the bottom of the DataFrame.

Anyways, we have now learned how to convert a CSV file to a DataFrame. But what about Excel files? Well... if you have a file with a .xlsx or .xls extension, then you can use a method known as `.read_excel( )`. It works the same way as `.read_csv( )`, with the only difference being that instead of the path of a CSV file, you pass the path of an Excel file as the argument.
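A minimal round-trip sketch (the file name `sample.xlsx` is made up, and reading/writing .xlsx files requires an Excel engine such as openpyxl to be installed):

```python
import pandas as pd

# Write a tiny DataFrame to an Excel file, then read it back.
pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).to_excel('sample.xlsx', index=False)
df = pd.read_excel('sample.xlsx')
print(df)
```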

#### 04.04. Head and Tail

We just saw what the `head( )` function does in a DataFrame. Much like with Series, it displays the first five rows of the DataFrame. Similarly, the `tail( )` function displays the last five rows. What I did not discuss in the Series section is that we can pass a numerical argument to both `head( )` and `tail( )`. Let's say that we call `df.head(10)`: the first ten rows will be returned. Similarly, `df.tail(8)` returns the last 8 rows.

Let's see an example.

```
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/lightlessdays/What-the-Fck/main/is%20Pandas/05.03/sample.csv')
print('DataFrame : \n',df.head(10))
```

The output of this would be:

#### 04.05. Descriptive Statistics

describe() generates descriptive statistics for the data in a Pandas DataFrame, excluding NaN values. It describes the dataset's central tendency and dispersion, providing a summary of the dataset. To understand it better, consider the following example:

```
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/lightlessdays/What-the-Fck/main/is%20Pandas/05.03/sample.csv')
print('DataFrame : \n',df.describe())
```

The output of this would be:

In the last code example, we observed that describe() returned several descriptive statistics for each numerical column in our dataset. By setting the `include` parameter to `'all'`, we can force describe() to incorporate all columns, even those with categorical data.

Please note that the value returned here is a DataFrame and if we have to find one particular value, say mean, then we can use slicing, as we will see later on in this article.
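Since describe() returns a DataFrame, a single statistic can be pulled out with label-based indexing. A sketch with made-up stand-in data (the column name `Leave` mirrors the sample CSV):

```python
import pandas as pd

# Made-up stand-in for the sample data
df = pd.DataFrame({'Leave': [1, 3, 2, 4]})
stats = df.describe()              # a DataFrame of statistics
print(stats.loc['mean', 'Leave'])  # pick out a single statistic by label
```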

#### 04.06. DataFrame Slicing

We can slice a DataFrame in the same way we slice NumPy 2-D arrays. Suppose that we have to find the second and third rows of a DataFrame. Then we can use the slicing operator.

```
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/lightlessdays/What-the-Fck/main/is%20Pandas/05.03/sample.csv')
print('DataFrame : \n',df[1:3])
```

The output of this would be:

However, one problem with DataFrames is that we cannot use column indices for slicing a DataFrame directly. For that, we must use `.loc[ ]` and `.iloc[ ]`.

#### 04.07. Indexing Operators

As we have already seen in the Series section, there are two indexing operators: `.loc[ ]` and `.iloc[ ]`.

The `.loc[ ]` indexer makes obtaining data values from a Pandas DataFrame object simple. We may retrieve the data values in a group of rows and columns using `.loc[ ]`, depending on the index values given to it. This indexer retrieves data by using the explicit index, and it can also be employed to select data subsets. For example:

```
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/lightlessdays/What-the-Fck/main/is%20Pandas/05.03/sample.csv')
print('DataFrame : \n',df.loc[1:4,'Company Name':'Leave'])
```

The output of this would be:

As we can see, this is very similar to what we have seen with Series. Pandas' `DataFrame.iloc` offers purely integer-location-based indexing for position selection on a DataFrame object. It enables us to gather data based on its location, so we need to identify the positions of the data we want. The `.iloc[ ]` indexer is somewhat similar to `.loc[ ]`, except that it only accepts integer positions. Consider the following code example:

```
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/lightlessdays/What-the-Fck/main/is%20Pandas/05.03/sample.csv')
print('DataFrame : \n',df.iloc[1:4,2:6])
```

The output of this would be:

#### 04.08. Remove Duplicates

drop_duplicates() produces a Pandas DataFrame that is free of duplicate rows. You may choose whether to keep the first or the last occurrence of each duplicate. In addition, the `inplace` and `ignore_index` parameters can be specified; the modifications are incorporated into the original dataset when the `inplace` option is set to True. We can see the difference by comparing the shapes of the original and updated datasets (after removing the duplicates). Examine the following code example to understand it better:

```
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/lightlessdays/What-the-Fck/main/is%20Pandas/05.03/sample.csv')
print("shape before: ",df.shape)
df.drop_duplicates(inplace=True)
print("shape after: ",df.shape)
# In our dataset there were no duplicate rows, hence the shape remained the same
```

The output of this would be:
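To actually see a row disappear, here is a sketch with made-up data containing one duplicated row:

```python
import pandas as pd

# Made-up data: the first two rows are identical
df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
deduped = df.drop_duplicates(keep='first', ignore_index=True)
print(df.shape, '->', deduped.shape)  # the repeated row is dropped
```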

#### 04.09. Dropping from DataFrame

Pandas provides data analysts a way to delete and filter DataFrames using the `dataframe.drop()` method. Rows or columns can be removed by index label or column name using this method. Let's drop all employees who have taken more than 2 days of leave. First of all, let us get all the employees who took more than 2 days of leave.

```
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/lightlessdays/What-the-Fck/main/is%20Pandas/05.03/sample.csv')
print(df.loc[df['Leave']>2])
```

The output of this would be:

Next, we can get the index values with the help of the `.index` attribute.

```
print(df.loc[df['Leave']>2].index)
```

Now we only need to drop these values.

```
df.drop(df.loc[df['Leave']>2].index,inplace=True)
```

This will drop all the employees who have more than 2 days of leave. What about columns? How do we drop them?

Let's drop the column with Company Names.

```
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/lightlessdays/What-the-Fck/main/is%20Pandas/05.03/sample.csv')
df.drop(inplace=True,columns=["Company Name"])
print(df)
```

The output of this would be:

Please note that a single one-line `drop( )` call traditionally removes either rows or columns. In recent versions of Pandas, however, you can pass both the `index=` and `columns=` parameters in one call to drop rows and columns together.

#### 04.10. Counting Values

The value_counts() function outputs the number of times each unique value appears in a column. It is helpful for understanding the value distribution of a Pandas DataFrame column. Consider the following example:

```
df['Employee Markme'].value_counts()
```

The output of this would be:

This is a Series, and we can access individual values by slicing.

#### 04.11. Sorting Values

sort_values() is employed to sort a Pandas DataFrame in either ascending or descending order by the values of one or more columns. We may modify the original DataFrame directly by setting the `inplace` parameter to True. Consider the following example:

```
df.sort_values(by='Leave', inplace =True)
print(df.head())
```

The output of this would be:

If we want to arrange in descending order, we can use slicing (or, more idiomatically, pass `ascending=False` to sort_values()).

```
print(df[::-1].head())
```

The output of this would be:
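The `ascending=False` route looks like this; a sketch with made-up stand-in data (the column name `Leave` mirrors the sample CSV):

```python
import pandas as pd

# Made-up stand-in data
df = pd.DataFrame({'Leave': [2, 5, 1]})
df = df.sort_values(by='Leave', ascending=False)
print(df['Leave'].tolist())  # [5, 2, 1]
```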

#### 04.12. Fill NaN Values

In a huge dataset, we may typically discover multiple records that Python has marked as NaN. NaN is an abbreviation for "not a number" and denotes items not filled out in the original database. fillna() replaces all NaN values in a DataFrame with more suitable values. The mean, median, mode, or some other constant might be used to fill in the blanks.

First of all, let us find out the number of NaN Values in our DataFrame.

```
print(df.isna().sum())
```

Please note that we can also use `isnull( )` instead of `isna( )`. The output of this is:

If we have to find the total number of NaN values, we use `.sum( )` twice: the first `.sum( )` gives the count per column, and the second sums up those results.
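A sketch of the double-sum trick with made-up data containing three NaN values:

```python
import pandas as pd
import numpy as np

# Made-up data: column 'a' has one NaN, column 'b' has two
df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, np.nan]})
print(df.isna().sum())        # per-column counts: a -> 1, b -> 2
print(df.isna().sum().sum())  # total count: 3
```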

Please note that if we change the axis to 1, it will give the count per row; however, that value is rarely used in data analysis. Now let us replace the NaN values with a string, "NULL".

```
df = df.fillna('NULL')  # fillna returns a new DataFrame unless inplace=True
```
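Filling with a statistic instead of a constant works the same way; a sketch with a made-up column (the name `Leave` mirrors the sample CSV):

```python
import pandas as pd
import numpy as np

# Made-up column with one missing value
df = pd.DataFrame({'Leave': [1.0, np.nan, 3.0]})
df['Leave'] = df['Leave'].fillna(df['Leave'].mean())  # mean of 1.0 and 3.0
print(df['Leave'].tolist())  # [1.0, 2.0, 3.0]
```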

When it comes to Pandas DataFrame methods, this is only the tip of the iceberg. There are many more methods, but I shall not be discussing them in this blog. This was it for Pandas. See ya.