Data Analysis

Pandas

Pandas is an open-source Python library for data handling tasks in machine learning and data science.

First, import pandas under an alias; pd is the conventional choice and is used here (import pandas as pd).

One of the most frequently used features of Pandas is reading data from a CSV file, a JSON file, or an SQL table.

For example, we can read a CSV file using the following syntax:

data_frame = pd.read_csv("location_of_the_file")
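As a minimal sketch of the same call, the snippet below reads CSV text from memory instead of a file on disk, so it runs anywhere. The CSV content and the name csv_text are hypothetical and serve only as illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content, held in memory so no file on disk is needed.
csv_text = "Names,Lectures,Grades\nJohn,Mathematics,90\nDan,Chemistry,54\n"

# read_csv accepts a file path or any file-like object.
data_frame = pd.read_csv(io.StringIO(csv_text))

print(data_frame.shape)   # (rows, columns)
print(data_frame.dtypes)  # the inferred type of each column
```

With a real file, the io.StringIO(...) argument is simply replaced by the file's location.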

 

Series are one-dimensional labeled Pandas arrays that can hold any kind of data, including NaN values.
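A small sketch of a labeled Series containing a missing value may make this concrete. The lecture names and scores are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the labels are supplied through `index`.
scores = pd.Series([90, np.nan, 77], index=["Mathematics", "Chemistry", "Physics"])

print(scores["Physics"])    # access by label -> 77.0
print(scores.isna().sum())  # number of missing values -> 1
print(scores.mean())        # NaN is skipped by default -> 83.5
```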

import pandas as pd
import numpy as np

lectures = pd.Series(["Mathematics", "Chemistry", "Physics", "History", "Geography", "German"] * 3)
# Only 15 grades for 18 rows: the DataFrame aligns on the index, so rows 15 to 17 get NaN in Grades.
grades = pd.Series([90, 54, 77, 22, 25] * 3)
classes = pd.Series(['A', 'B', 'C'] * 6)
credits = pd.Series(['1', '2', '6'] * 6)
names = np.array([["John"] * 6, ["Dan"] * 6, ["Zac"] * 6]).flatten()
retake = np.array(['Yes', 'No'] * 9)
# grades is passed as-is: grades*3 would triple every value rather than repeat the Series.
df = pd.DataFrame({"Names": names, "Lectures": lectures, "Grades": grades, "Classes": classes, "Credits": credits, "Retake": retake})
print(df.to_string(index=False))  # show the DataFrame without the index column

 

print(df.head(7)) 

head() returns the first rows of the DataFrame. By default it returns the first five rows; pass an integer to get a different number, so head(7) above returns the first seven rows.
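The behaviour of head() for different arguments can be checked on a throwaway frame. The frame numbers below is hypothetical and exists only for this sketch.

```python
import pandas as pd

# Hypothetical frame with ten rows.
numbers = pd.DataFrame({"n": range(10)})

print(len(numbers.head()))    # default: first 5 rows
print(len(numbers.head(7)))   # first 7 rows
print(len(numbers.head(-2)))  # negative n: every row except the last 2
```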

 

  • A DataFrame is a two-dimensional table, much like an Excel sheet, a CSV file, or an SQL table.
  • Besides being read from a file, a DataFrame can also be created from one or more Pandas Series.
  • Pandas provides DataFrame slicing through the “loc” and “iloc” indexers.
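A minimal sketch of building a DataFrame from Series follows. The values are illustrative, and one Series is deliberately shorter than the other to show how Pandas aligns columns on the index.

```python
import pandas as pd

lectures = pd.Series(["Mathematics", "Chemistry", "Physics"])
grades = pd.Series([90, 54])  # deliberately one value short

# Columns are aligned on the index; the missing label 2 becomes NaN.
table = pd.DataFrame({"Lectures": lectures, "Grades": grades})
print(table)

# A single Series can also be promoted to a one-column DataFrame.
single = lectures.to_frame(name="Lectures")
print(single)
```

Because of this alignment, Series of unequal length do not raise an error; the gaps are filled with NaN instead.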

 

print(df.loc[:10, ['Names', 'Lectures']])  # loc slices by label and includes the end label: rows 0 to 10 (eleven rows), Names and Lectures columns only

iloc selects by integer position, so column names such as Names and Lectures won’t work with it; pass their positions (0 and 1) in the list instead, otherwise Pandas raises an error.

 

print(df.iloc[5:10, 1:3])  # iloc excludes the end position: rows 5 to 9 and columns 1 and 2 (Lectures and Grades)
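The difference in how the two indexers treat the end of a slice is easy to verify on a small frame. The frame below is a hypothetical stand-in with the default integer index.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..11.
frame = pd.DataFrame({
    "Names": ["John"] * 12,
    "Lectures": ["Mathematics"] * 12,
    "Grades": range(12),
})

by_label = frame.loc[:10, ["Names", "Lectures"]]  # labels 0..10, end label included
by_position = frame.iloc[5:10, 1:3]               # positions 5..9, columns 1 and 2

print(by_label.shape)             # (11, 2)
print(by_position.shape)          # (5, 2)
print(list(by_position.columns))  # ['Lectures', 'Grades']
```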

 

Let’s say John’s parents want to learn more about their son’s performance at school. They want to see his lectures, the grades for those lectures, the number of credits earned, and whether he will need to retake an exam. We can slice the DataFrame created from the grades.csv file (which holds all the students’ academic records) and extract just the information we need. For example:

Grades = df.loc[(df["Names"] == "John"), ["Lectures","Grades","Credits","Retake"]] 




The code above keeps only the rows whose “Names” column equals the given name, and only the four listed columns.
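The row filter works through a boolean mask, which can be inspected on its own. The sketch below uses a small hypothetical table and also shows how two conditions might be combined.

```python
import pandas as pd

# Hypothetical records for illustration.
records = pd.DataFrame({
    "Names": ["John", "Dan", "John", "Zac"],
    "Lectures": ["Mathematics", "Mathematics", "Physics", "Physics"],
    "Grades": [90, 54, 77, 22],
})

# The comparison yields a boolean Series with one entry per row.
mask = records["Names"] == "John"
print(mask.tolist())  # [True, False, True, False]

# Conditions combine with & and |; each one needs its own parentheses.
strong = records.loc[
    (records["Names"] == "John") & (records["Grades"] > 80),
    ["Lectures", "Grades"],
]
print(strong)
```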

You can use the loc and iloc indexers to access individual rows of a Pandas DataFrame.

print(df.iloc[0]) 

This line returns the first row of the DataFrame as a Series.

The Pandas groupby method lets you split data into groups based on some criteria. Grouping is normally done over rows; DataFrames can also be split by columns, although column-wise grouping is deprecated in recent Pandas versions.

print(df.groupby(["Lectures","Names"]).first()) 

With the code above, the data is split into groups by the Lectures and Names columns: Lectures forms the first level of the resulting index and Names the second. first() then returns the first row of each group.
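A sketch on a small hypothetical table shows the two-level index that results, along with a per-group aggregate as a second common use of groupby.

```python
import pandas as pd

# Hypothetical records for illustration.
records = pd.DataFrame({
    "Lectures": ["Mathematics", "Mathematics", "Physics", "Physics"],
    "Names": ["John", "Dan", "John", "John"],
    "Grades": [90, 54, 77, 22],
})

# Grouping by two columns gives a two-level index: Lectures, then Names.
firsts = records.groupby(["Lectures", "Names"]).first()
print(firsts)
print(list(firsts.index.names))  # ['Lectures', 'Names']

# Aggregating one column per group.
means = records.groupby("Lectures")["Grades"].mean()
print(means)
```

Here the (Physics, John) group has two rows, and first() keeps the earlier one.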





We can also iterate over a grouped object, as in the code below, which groups by Classes.

grouped_obj = df.groupby("Classes")  # assumed definition; the original snippet did not show it
for key, item in grouped_obj:
    if key == 'A':
        print("Key is: " + str(key))
        print(str(item), "\n\n")
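Iteration yields one (key, sub-DataFrame) pair per group. The sketch below, on hypothetical data, collects the group sizes and then uses get_group as a more direct way to pull out a single group.

```python
import pandas as pd

# Hypothetical records for illustration.
records = pd.DataFrame({"Classes": ["A", "B", "A", "C"], "Grades": [90, 54, 77, 22]})
grouped = records.groupby("Classes")

# Each iteration gives the group key and the rows belonging to it.
sizes = {}
for key, item in grouped:
    sizes[key] = len(item)
print(sizes)  # {'A': 2, 'B': 1, 'C': 1}

# get_group fetches one group without looping over all of them.
class_a = grouped.get_group("A")
print(class_a)
```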

Pandas can also save data to a CSV file in the local directory, using the code below.

df.to_csv('file1.csv')  # file1.csv is the output file name; pass index=False to leave out the index column
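To see what to_csv actually writes, the sketch below saves a small hypothetical frame to an in-memory buffer instead of a file and reads it back. The buffer stands in for file1.csv.

```python
import io

import pandas as pd

# Hypothetical records for illustration.
records = pd.DataFrame({"Names": ["John", "Dan"], "Grades": [90, 54]})

# Write to an in-memory buffer; index=False omits the index column.
buffer = io.StringIO()
records.to_csv(buffer, index=False)
text = buffer.getvalue()
print(text)

# Read the same text back into a new DataFrame.
restored = pd.read_csv(io.StringIO(text))
print(restored)
```

Without index=False, the saved file gains an extra unnamed first column holding the index, which then shows up as a regular column when the file is read back.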

 

Some of the important uses of Pandas are:

  • Data cleansing
  • Filling in missing data
  • Data normalization
  • Merges and joins
  • Data visualization
  • Statistical analysis
  • Data inspection
  • Loading and saving data
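A few of these uses can be sketched in a handful of lines. The two tables below are hypothetical, and min-max scaling is just one possible choice of normalization.

```python
import numpy as np
import pandas as pd

# Hypothetical tables for illustration.
grades = pd.DataFrame({"Names": ["John", "Dan", "Zac"], "Grades": [90.0, np.nan, 30.0]})
classes = pd.DataFrame({"Names": ["John", "Dan", "Zac"], "Classes": ["A", "B", "C"]})

# Data fill: replace the missing grade with the column mean (60.0 here).
grades["Grades"] = grades["Grades"].fillna(grades["Grades"].mean())

# Normalization: min-max scaling to the range [0, 1].
low, high = grades["Grades"].min(), grades["Grades"].max()
grades["Scaled"] = (grades["Grades"] - low) / (high - low)

# Merge: join the two tables on their shared Names column.
merged = grades.merge(classes, on="Names", how="inner")
print(merged)

# Statistical analysis and inspection: summary statistics of one column.
print(merged["Grades"].describe())
```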
