Pandas
Pandas For Data Analysis
Ultimate Guide for Python Engineer
-
Intro - What is Pandas?
In this walkthrough, you will learn how to analyze and visualize data using Pandas. You will also pick up various tips and tricks for using Pandas in data analysis and data science projects.
According to Wikipedia, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.
-
History of Pandas and Its Creator
Pandas was originally created by Wes McKinney, who started building what would become pandas at AQR Capital Management while he was a researcher there from 2007 to 2010.
Wes is an American software developer. He is the creator and "Benevolent Dictator for Life" (BDFL) of the open-source pandas package for data analysis in Python and has also authored multiple editions of the reference book Python for Data Analysis. As a businessman, he was the CEO and founder of the technology startup Datapad.
Wes McKinney graduated from MIT with a B.S. in 2007, after which he started working on pandas. In 2010, he began a Ph.D. program in Statistics at Duke University, but went on leave in 2011.
You can find out more on his website at https://wesmckinney.com/
- In 2008, pandas development began at AQR Capital Management. By the end of 2009 it had been open sourced, and is actively supported today by a community of like-minded individuals around the world who contribute their valuable time and energy to help make open source pandas possible.
- Pandas 1.0 was released in January 2020. As a major release, it removed a number of previously deprecated features, so it is not completely backward-compatible with earlier versions.
Mission of Pandas
The aim of pandas is to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.
-
Benefits & Usefulness of Pandas
Before Pandas was created, developers and data scientists relied on other tools such as Excel, Pascal and R data frames to perform analysis. The creation of Pandas has brought many benefits to engineers, including the following.
Usefulness
- Data cleansing
- Data fill
- Data normalization
- Merges and joins
- Data visualization
- Statistical analysis
- Data inspection
- Loading and saving data
- Vast Libraries and Package Ecosystem
- Open Source
- Great Documentation and Numerous Tutorials to learn from
- Large Community
- It is mature (development began in 2008)
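A few of the items above (data fill, data normalization, merges and joins, statistical analysis) can be sketched in a handful of lines. This is only a minimal illustration: the `sales` and `cities` tables and their values are made up for the example.

```python
import pandas as pd

# Hypothetical sales data with one missing revenue value.
sales = pd.DataFrame({
    "store": ["A", "B", "C"],
    "revenue": [100.0, None, 300.0],
})

# Data fill: replace the missing revenue with the column mean.
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].mean())

# Data normalization: min-max scale revenue into the range [0, 1].
rev = sales["revenue"]
sales["revenue_scaled"] = (rev - rev.min()) / (rev.max() - rev.min())

# Merges and joins: attach a (made-up) city to each store.
cities = pd.DataFrame({
    "store": ["A", "B", "C"],
    "city": ["Accra", "Lagos", "Nairobi"],
})
merged = sales.merge(cities, on="store", how="left")

# Statistical analysis: summary statistics for the numeric columns.
summary = merged.describe()

print(merged)
print(summary)
```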
Disadvantages
- Not ideal for 3D matrices
-
Installing - Pandas
To install Pandas, you may need to download the most recent stable version. This is the one with the highest number that isn't marked as an alpha or beta release.
Installation Guide For Packages
You can install Pandas on your system from PyPI using pip, or from the Anaconda repositories using conda.
Using Pip
To install Pandas with pip (or pip3), run the command below. If you use conda, run `conda install pandas` instead.
pip install pandas
-
Pandas and Jupyter Notebook
Pandas, like any Python package, can be used inside Jupyter Notebooks as well as in any IDE or REPL, such as those below.
Python IDE - Integrated Development Environment
- VS Code
- Sublime Text
- PyCharm
- Atom and Brackets
- Notepad++
Python REPL & Notebooks
- IPython
- BPython
- Jupyter Notebooks
- JupyterLab
- etc
-
Getting Started with Pandas
Let us start with how to use Pandas to perform data analysis from end to end. By the end of this you will have an in-depth understanding of Pandas in relation to data analysis.
To work with Pandas you will need to import it. There is a common convention among data scientists to import pandas under the alias `pd`, as below
import pandas as pd
You can check the installed version via the `__version__` attribute
import pandas as pd
pd.__version__
-
Reading Various Data Formats
One of the features that makes pandas stand out is its ability to read various file formats, ranging from CSV to Parquet. Let us see how to read the various file formats. Pandas provides a simple API to read the respective file formats.
The functions follow the pattern `pd.read_*`, where `*` is the file format, such as csv, excel, html or parquet.
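As a minimal sketch of the `pd.read_*` pattern, the snippet below reads the same records from CSV and from JSON. To keep it self-contained, the data lives in in-memory strings (made up for this example) instead of files on disk. In practice you would usually pass a file path.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
csv_text = "name,age\nAda,36\nLinus,28\n"
json_text = '[{"name": "Ada", "age": 36}, {"name": "Linus", "age": 28}]'

# pd.read_csv and pd.read_json accept a path or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```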
[Figure: summary of the `pd.read_*` functions. Source: Pandas Official Website]
Certain file formats require optional dependencies (for example, openpyxl for Excel files or pyarrow for Parquet files). These are not always installed along with pandas, so you may need to install them separately.
-
Reading CSV Files
CSV stands for Comma Separated Values. It is a plain text file with a specific format that allows data to be saved in a table structure. It uses a comma `,` to separate the data. Each line of the file is a data record, and each record consists of one or more fields separated by commas.
It is one of the most common and beginner-friendly file formats to work with, just like JSON.
As a delimited file format, the separator or delimiter can be changed to a character other than a comma. For example, if we change it from a comma to whitespace or a tab, it becomes a WSV or TSV file.
Pandas allows you to read these delimited formats, or variants of CSV, by specifying the separator or delimiter as a parameter.
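Here is a small sketch of reading the same data with different separators through the `sep` parameter of `pd.read_csv`. The sample records are made up, and in-memory strings are used in place of real files.

```python
import io
import pandas as pd

# Hypothetical sample: the same two records with three different delimiters.
comma_text = "item,price\npen,2\nbook,15\n"
tab_text = comma_text.replace(",", "\t")
semicolon_text = comma_text.replace(",", ";")

# sep="," is the default, so it can be omitted for plain CSV.
df_comma = pd.read_csv(io.StringIO(comma_text))

# Tab-separated (TSV) and semicolon-separated variants.
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

print(df_tab)
```

All three calls should produce the same DataFrame, since only the delimiter differs.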
In summary, the Pandas `read_*` functions have several optional params for different use cases. To read a tab-, whitespace- or semicolon-separated file with `pd.read_csv`, set `sep` (which defaults to `','`) or its alias `delimiter` as needed, for example `sep='\t'` or `sep=';'`.
-
Pandas Basics
With Pandas you can preview a DataFrame using the head and tail methods, just as you would use the head and tail commands in a Linux terminal.
- df.head(): view the first 5 rows (the default value of n)
- df.head(10): view the first 10 rows/datapoints
- df.tail(): view the last 5 rows (the default value of n)
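The calls above can be tried on a small DataFrame. The 20-row table below is made up for the example.

```python
import pandas as pd

# Hypothetical DataFrame with 20 rows.
df = pd.DataFrame({"n": range(20), "square": [i * i for i in range(20)]})

first_five = df.head()     # n defaults to 5
first_ten = df.head(10)    # first 10 rows
last_five = df.tail()      # n defaults to 5
last_three = df.tail(3)    # last 3 rows

print(first_five)
print(last_three)
```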
-
Writing to Various Data Formats
With Pandas you can write or save your DataFrame to various file formats, ranging from CSV to Parquet. Let us see how to write to the various file formats. Pandas provides a simple API to write the respective file formats.
The methods follow the pattern `df.to_*`, where `df` is your DataFrame and `*` is the file format, such as csv, excel, html or parquet.
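As a rough sketch of the `to_*` pattern, the snippet below writes a small made-up DataFrame to CSV and JSON. When no path is given, `to_csv` and `to_json` return the output as a string, which keeps the example self-contained. The file name in the comment is hypothetical.

```python
import io
import json
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Linus"], "age": [36, 28]})

# With no path argument, the output is returned as a string.
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

# Passing a path (hypothetical file name) writes to disk instead:
# df.to_csv("people.csv", index=False)

# Check the results by parsing them back.
records = json.loads(json_text)
round_trip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(records)
```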
[Figure: summary of the `to_*` methods. Source: Pandas Official Website]
As with reading, certain file formats require optional dependencies that may need to be installed separately.
The `.to_*()` methods also accept several arguments and optional params to suit your needs.
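A few of the commonly used `to_csv` parameters are shown below as an illustration. The DataFrame is made up, and the output is kept as strings rather than written to disk.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ada", "Linus"],
    "age": [36, 28],
    "city": ["London", "Helsinki"],
})

# index=False leaves out the row index, sep changes the delimiter,
# and columns selects which columns to write and in what order.
tsv_text = df.to_csv(index=False, sep="\t", columns=["name", "age"])

# header=False leaves out the column names.
no_header = df.to_csv(index=False, header=False)

print(tsv_text)
print(no_header)
```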
-
Selecting Rows and Columns
-
Reshaping Data with Pandas
-
Statistics with Pandas
Tasks
Practical Task on Pandas
-
Coming Soon