Pandas is an open-source Python library for handling and analyzing data, widely used in machine learning and data science.
First, import pandas under an alias; pd is the conventional choice and is used here.
One of the most frequently used features of Pandas is reading data from a CSV file, a JSON file, or an SQL table.
For example, we can read a CSV file using the following syntax:
data_frame = pd.read_csv("location_of_the_file")
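To make the reading step concrete, here is a minimal sketch that runs without any file on disk. The CSV and JSON contents are made-up sample data, and io.StringIO is used only as a stand-in for a real file path.

```python
import io
import pandas as pd

# Hypothetical sample data; in practice this would live in a file on disk.
csv_text = "Names,Lectures,Grades\nJohn,Mathematics,90\nDan,Chemistry,54\n"
json_text = '[{"Names": "John", "Grades": 90}, {"Names": "Dan", "Grades": 54}]'

# read_csv and read_json accept a path or a file-like object,
# so StringIO can stand in for a file here.
data_frame = pd.read_csv(io.StringIO(csv_text))
json_frame = pd.read_json(io.StringIO(json_text))

print(data_frame)
print(json_frame)
```

With a real file, the StringIO wrapper would simply be replaced by the path to that file.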
A Series is a one-dimensional labeled Pandas array that can hold any kind of data, including NaN (missing) values.
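As a small illustration of labels and missing values in a Series, consider the sketch below. The name sample and its values are invented for this example.

```python
import numpy as np
import pandas as pd

# A Series with custom labels; the Physics entry is missing (NaN).
sample = pd.Series([90, 54, np.nan], index=["Mathematics", "Chemistry", "Physics"])

print(sample["Chemistry"])   # access a value by its label
print(sample.isna().sum())   # count the missing values
print(sample.mean())         # NaN values are skipped by default
```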
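import pandas as pd
import numpy as np

# Six lectures for each of three students gives 18 rows in total.
lectures = pd.Series(["Mathematics", "Chemistry", "Physics", "History", "Geography", "German"] * 3)
# The sixth grade (60) is an assumed placeholder so that every lecture has a grade.
grades = pd.Series([90, 54, 77, 22, 25, 60] * 3)
classes = pd.Series(['A', 'B', 'C'] * 6)
credits = pd.Series(['1', '2', '6'] * 6)
names = np.array([["John"] * 6, ["Dan"] * 6, ["Zac"] * 6]).flatten()
retake = np.array(['Yes', 'No'] * 9)

# Every column has 18 entries, so the DataFrame has no missing values.
df = pd.DataFrame({"Names": names, "Lectures": lectures, "Grades": grades, "Classes": classes, "Credits": credits, "Retake": retake})
print(df.to_string(index=False))  # show the whole DataFrame without the index column
print(df.head(7))  # show the first seven rows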
head() is a method that returns the first rows of a DataFrame. By default it returns the first five rows, but you can pass an integer n to get the first n rows instead.
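The default and the explicit argument can be compared in a short sketch. The frame numbers is a made-up example with ten rows.

```python
import pandas as pd

# Hypothetical ten-row frame used only to show how head() behaves.
numbers = pd.DataFrame({"n": range(10)})

first_five = numbers.head()     # no argument: first five rows
first_three = numbers.head(3)   # explicit argument: first three rows

print(len(first_five), len(first_three))
```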
- DataFrames closely resemble tabular data such as an Excel sheet, a CSV file, or an SQL table.
- Besides being read from a file, a DataFrame can also be created from Pandas Series.
- Pandas provides DataFrame slicing through the loc and iloc indexers.
print(df.loc[:10, ['Names', 'Lectures']])  # rows labeled 0 through 10 (loc includes the end label, so 11 rows), Names and Lectures columns only
With iloc, the arguments must be integer positions. Column names such as Names and Lectures will not work; pass their positions (0 and 1) instead, otherwise iloc raises an error.
print(df.iloc[5:10, 1:3])  # rows at positions 5 to 9 and columns at positions 1 and 2 (Lectures and Grades); the end of each range is excluded
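The difference between label-based and position-based slicing is easy to get wrong, so here is a minimal sketch on a small, made-up frame called demo. Note that loc includes the end label while iloc excludes the end position.

```python
import pandas as pd

# Small illustrative frame with the default integer index 0..4.
demo = pd.DataFrame({
    "Names": ["John", "Dan", "Zac", "John", "Dan"],
    "Lectures": ["Mathematics", "Chemistry", "Physics", "History", "Geography"],
    "Grades": [90, 54, 77, 22, 25],
})

by_label = demo.loc[:2, ["Names", "Lectures"]]   # labels 0, 1, 2 (end label included)
by_position = demo.iloc[:2, 0:2]                 # positions 0, 1 (end position excluded)

print(len(by_label), len(by_position))
```

Both selections return the same two columns, but the loc version has one more row.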
Let's say John's parents want to learn more about their son's performance at school. They want to see his lectures, the grades for those lectures, the number of credits earned, and whether he needs to retake any exam. We can simply slice the DataFrame of academic records (for instance, one loaded from a grades.csv file covering all students) and extract the information we need. For example:
Grades = df.loc[(df["Names"] == "John"), ["Lectures","Grades","Credits","Retake"]]
In the above code, we retrieve only the rows in which the Names column equals the given name, and keep only the four listed columns.
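It can help to look at the intermediate boolean mask that drives this kind of filtering. The sketch below uses a tiny, made-up frame called records. The comparison produces one True or False per row, and loc keeps the rows marked True.

```python
import pandas as pd

# Hypothetical records used only for illustration.
records = pd.DataFrame({
    "Names": ["John", "Dan", "John"],
    "Lectures": ["Mathematics", "Chemistry", "Physics"],
    "Grades": [90, 54, 77],
})

mask = records["Names"] == "John"                    # boolean Series, one entry per row
john_rows = records.loc[mask, ["Lectures", "Grades"]]

print(mask.tolist())
print(john_rows)
```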
You can also use the loc and iloc indexers to access individual rows of a Pandas DataFrame.
print(df.iloc[0])
This line returns the first row of the DataFrame as a Series, with the column names as its labels.
The Pandas groupby method lets you split data into groups based on some criteria. DataFrames are typically grouped by the values in one or more columns, which splits the rows into groups; column-wise grouping (axis=1) is deprecated as of pandas 2.1.
print(df.groupby(["Lectures","Names"]).first())
The above code divides the data into groups by the Lectures and Names columns: first by Lectures (level 1), then by Names (level 2). Calling first() returns the first row of each group.
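Grouping is most often followed by an aggregation. As a sketch, assuming a small made-up frame called scores, the code below computes the mean grade per student and shows the two-level index produced by grouping on two columns.

```python
import pandas as pd

# Hypothetical scores used only for illustration.
scores = pd.DataFrame({
    "Names": ["John", "John", "Dan", "Dan"],
    "Lectures": ["Mathematics", "Chemistry", "Mathematics", "Chemistry"],
    "Grades": [90, 54, 77, 22],
})

# One group per student, reduced to the mean of the Grades column.
mean_by_name = scores.groupby("Names")["Grades"].mean()

# Grouping on two columns yields a two-level (Lectures, Names) index.
first_rows = scores.groupby(["Lectures", "Names"]).first()

print(mean_by_name)
print(first_rows)
```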
We can also iterate over a grouped object. The code below groups the rows by Classes and prints only the group for class A.
grouped_obj = df.groupby("Classes")  # group the rows by class
for key, item in grouped_obj:
    if key == 'A':
        print("Key is: " + str(key))
        print(str(item), "\n\n")
You can also save data to a CSV file in the local directory using the code below.
df.to_csv('file1.csv')  # to_csv writes the DataFrame to file1.csv in the current working directory
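By default, to_csv also writes the row index as an extra first column. A minimal round-trip sketch is shown below. It assumes a made-up frame called frame and a temporary directory, and passes index=False to leave the index out of the file.

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame used only for illustration.
frame = pd.DataFrame({"Names": ["John", "Dan"], "Grades": [90, 54]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file1.csv")
    frame.to_csv(path, index=False)   # index=False omits the row-index column
    restored = pd.read_csv(path)      # read the file back while it still exists

print(restored)
```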
Some of the important uses of Pandas are:
- Data cleansing
- Filling in missing data
- Data normalization
- Merges and joins
- Data visualization
- Statistical analysis
- Data inspection
- Loading and saving data
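A few of these uses can be sketched together in a short example. The two frames below are invented for illustration, and min-max scaling is just one possible way to normalize a column.

```python
import numpy as np
import pandas as pd

# Hypothetical tables used only for illustration.
grade_table = pd.DataFrame({"Names": ["John", "Dan", "Zac"], "Grades": [90.0, np.nan, 50.0]})
class_table = pd.DataFrame({"Names": ["John", "Dan", "Zac"], "Classes": ["A", "B", "C"]})

# Filling missing data: replace the missing grade with the column mean.
grade_table["Grades"] = grade_table["Grades"].fillna(grade_table["Grades"].mean())

# Normalization: min-max scale the grades into the range [0, 1].
low, high = grade_table["Grades"].min(), grade_table["Grades"].max()
grade_table["Scaled"] = (grade_table["Grades"] - low) / (high - low)

# Merge: join the two tables on the shared Names column.
merged = grade_table.merge(class_table, on="Names")

# Inspection and statistics: summary of the numeric columns.
print(merged.describe())
```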