In this article, I am going to explain how to use Pandas in Python. Pandas is one of the most popular Python modules for data manipulation and analysis. It provides an easy interface for working with data and applying transformations to it on the go. The module is released under the BSD license and is free to use. You can download it from the project website or install it through the Python package manager.
Pandas provides a range of data analysis options: reading data from files and databases, applying transformations within data frames, slicing and dicing the data, and then writing it back to a database or preparing it for a visualization tool. Pandas can also visualize data within the Python environment with the help of another module, Matplotlib. However, for the scope of this article, we will stick to Pandas itself. According to Wikipedia, “The name Pandas is derived from the term ‘panel data’, an econometrics term for data sets that include observations over multiple time periods for the same individuals”. Over the last few years, the module has been gaining popularity, which is visible in the search trends from Stack Overflow.
Figure 1 – Pandas popularity from Stack Overflow
As the graph above shows, interest in Pandas has grown steeply in recent years, and it is now one of the most commonly used modules in the data science community.
What can be done with Pandas in Python?
You can consider it the bread and butter of your data applications. Whenever you think about working with data in Python, Pandas is usually the first tool to reach for. You can start by cleaning the data to remove unwanted information, then transform it by applying business logic, and finally prepare it for visualization.
For example, suppose you want to read data from a CSV file that is either on your machine or on a shared network location. With Pandas, you can easily extract the information from the CSV file into a data frame within the Python environment. Once the data is there, you can apply many operations to it, some of which are listed below.
- Calculate basic statistics of your dataset, such as the mean, the median, the minimum, and the maximum values
- Find the correlation between two or more columns in the dataset
- Clean the data by removing missing or blank values and filtering records based on a criterion
- Visualize the data by using other modules like seaborn, matplotlib, etc.
- Save the cleaned data frame into a CSV file or a database of your choice
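The operations listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the CSV content, the column names and the variable names are hypothetical, and the data is read from an in-memory string so that the snippet runs without a file on disk. The visualization step is left out because it needs an additional module.

```python
import io

import pandas as pd  # "pd" is the conventional alias for the pandas module

# Hypothetical CSV content standing in for a file on disk or a network share.
csv_text = """employee,department,age,salary
Bob,IT,34,72000
Jack,Customer Service,29,48000
Alice,IT,41,
Maria,Finance,38,65000
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Basic statistics for the numeric columns (count, mean, min, max, quartiles).
print(df[["age", "salary"]].describe())
print("Median salary:", df["salary"].median())

# Correlation between two numeric columns.
print("Correlation:", df["age"].corr(df["salary"]))

# Data cleaning: drop rows with a missing salary, then filter on a criterion.
cleaned = df.dropna(subset=["salary"])
it_staff = cleaned[cleaned["department"] == "IT"]
print(it_staff)

# Save the cleaned data frame. Without a path, to_csv returns the CSV text;
# passing a file name such as "cleaned.csv" would write it to disk instead.
cleaned_csv = cleaned.to_csv(index=False)
print(cleaned_csv)
```

In a real project, the string buffer would be replaced with the path to your own CSV file.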
How does it fit into the data world?
If you are working as a Data Engineer or a Data Scientist, you might already have come across Pandas while developing applications. If you are a beginner, I would suggest first getting a basic understanding of how Python works, including its data structures such as lists, dictionaries, and tuples, and concepts such as iteration.
The Pandas module is built on top of another popular module, NumPy, so many of the data structures in the two modules are similar. Data held in Pandas can be passed to other packages, such as SciPy for scientific analysis or Matplotlib for visualization. It can also serve as a source for machine learning modules like Scikit-learn.
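To make the relationship with NumPy a little more concrete, here is a minimal sketch of moving data between the two modules. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data for illustration.
df = pd.DataFrame({
    "height_cm": [170.0, 182.5, 165.0],
    "weight_kg": [68.0, 85.0, 59.5],
})

# to_numpy() exposes the values of the data frame as a NumPy array.
values = df.to_numpy()
print(type(values), values.shape)

# NumPy functions can be applied to a single column directly.
mean_height = np.mean(df["height_cm"])
print(mean_height)

# Going the other way: wrap a NumPy array in a data frame with column labels.
arr = np.array([[1, 2], [3, 4]])
from_array = pd.DataFrame(arr, columns=["a", "b"])
print(from_array)
```

Packages that expect NumPy arrays can usually consume the result of to_numpy(), which is one reason Pandas fits well alongside the other libraries mentioned above.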
Installing and setting up Pandas
So far, we have learned what the Pandas library is and where it fits. Let us now install it on our machine and start using it. Open the command prompt on your machine and type the following command.
pip install pandas
As soon as you hit Enter, the library starts downloading and will be installed on your machine shortly. The download is around 9 MB (the exact size varies by version and platform), and the installation should finish within a minute or so.
Figure 2 – Installing Pandas in Python
If you are using Anaconda, then you can install Pandas by running the following command.
conda install pandas
Now that we have installed Pandas, let us print the version information of the module. In your command prompt window, type “python” and hit Enter. This starts the Python interpreter within the command prompt window.
Figure 3 – Starting the python execution in command prompt
Once the Python shell is up and running, we need to import the Pandas module into our Python environment. Type the following command and hit Enter.
import pandas
This imports the Pandas module so that we can use it in our code. Next, run the command below to print the version of Pandas that we just installed.
print(pandas.__version__)
Once you run the above command, the Pandas version will be printed on the screen as follows.
Figure 4 – Printing Pandas version information
Creating Data Frames using Pandas in Python
The basic structure of the Pandas library is the data frame. A data frame is a two-dimensional, labeled data structure with rows and columns, similar to a 2-D array, except that each column can hold a different data type. You can also think of it as an in-memory table on which you can perform all the operations discussed earlier. Whenever we work with Pandas, we should try to fit the data into a data frame so that we can apply the built-in methods directly.
There are a number of ways to create a data frame. For this article, let us create one from a dictionary of lists. Suppose we have a list of employees and their corresponding departments. We can create a simple dictionary containing two lists, one per column. You can use the code below to create the dictionary.
data = {
    'employees': ['Bob', 'Jack'],
    'department': ['IT', 'Customer Service']
}
Figure 5 – Creating the Dictionary object
Once the dictionary object has been created, let us pass it to Pandas to create a data frame.
empDf = pandas.DataFrame(data)
Figure 6 – Converting the dictionary to a Pandas Data Frame
As you can see in the figure above, the dictionary object has been transformed into a Pandas data frame, which can now be used for data analysis and other operations. In my next article, I will show how we can read data from a CSV file and apply transformations using the data frame.
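For readers who want to run the whole example as a script rather than in the interactive shell, the steps can be put together as follows. This is a small sketch that uses the same dictionary as above. The variable name emp_df and the inspection calls are additions for illustration, and the pd alias is a common convention rather than a requirement.

```python
import pandas as pd

# The same dictionary of lists as above: one key per column.
data = {
    "employees": ["Bob", "Jack"],
    "department": ["IT", "Customer Service"],
}

emp_df = pd.DataFrame(data)

# The table itself, shown with a default integer index starting at 0.
print(emp_df)

# Number of rows and columns, the column labels, and the column types.
print(emp_df.shape)
print(list(emp_df.columns))
print(emp_df.dtypes)

# Select one column by name, and one row by its index label.
print(emp_df["department"])
print(emp_df.loc[0])
```

Each key of the dictionary becomes a column, and the lists must all have the same length so that every row is complete.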
Conclusion
In this article, we have seen what Pandas is and how we can install it on our machine. We have also learned about some of the important operations that the Pandas library supports. In day-to-day analysis, Pandas plays a very important role in transforming raw datasets and applying operations to them as required. You can sort the data, filter it, add new columns based on existing values, and so on. This makes it a very popular module that is heavily used in data science and machine learning work.
To learn more about the Pandas library, you can follow the official documentation on the Pandas website. There is also a very good book on Pandas in Python that you can purchase from Amazon; it describes the methods in more detail and is quite helpful for beginners. If you prefer to learn Python and Pandas through video tutorials, Python for Everybody on Coursera is a good place to start.
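As a brief illustration of sorting, filtering, and adding a column, consider the sketch below. It assumes a hypothetical salary column that is not part of the earlier example, and the variable names are chosen only for readability.

```python
import pandas as pd

# Hypothetical data; the salary column is an assumption made for this sketch.
df = pd.DataFrame({
    "employees": ["Bob", "Jack", "Alice"],
    "department": ["IT", "Customer Service", "IT"],
    "salary": [72000, 48000, 81000],
})

# Sort the rows by salary, highest first.
sorted_df = df.sort_values("salary", ascending=False)

# Filter: keep only the rows whose department is IT.
it_only = df[df["department"] == "IT"]

# Add a new column derived from an existing one; assign returns a new data frame.
with_monthly = df.assign(monthly_salary=df["salary"] / 12)

print(sorted_df)
print(it_only)
print(with_monthly)
```

None of these calls modify df itself. Each one returns a new data frame, which makes it easy to chain several steps together.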