Five and a Half Lines of Code for Understanding Your Data with Pandas

Dec 15, 2020

Because you can’t analyze data without first figuring out what you’re working with

Before you start working with any dataset, it’s important to know exactly what kind of data you’re dealing with. In this piece, we’ll discuss how you can do that with just a few lines of code.

As we’ll see in a bit, Pandas is incredibly powerful not only for manipulating data but also for exploring it. However, it’s not enough to just run the code. These lines of code are only useful if you’re able to interpret and use the results that come back. For example, if you were to find that some data is missing from certain columns, this is a sign that you should investigate how much of this data is missing, why it’s missing, and how you should treat missing data in your analysis.

These techniques won’t give you all the answers for how to process and analyze your data, but they’ll give you the information needed to make good decisions moving forward.

Examples in this piece will use some old Tesla stock price data from Yahoo Finance. Feel free to run the code below if you want to follow along. It would be great if you had a Jupyter Notebook open as well so you could run the code and test things out for yourself.
import pandas as pd

# Read in the table of Tesla stock prices from Yahoo Finance
df = pd.read_html("")[0]
df = df.head(11).sort_values(by="Date")
df = df.astype({"Open": "float",
                "High": "float",
                "Low": "float",
                "Close*": "float",
                "Adj Close**": "float",
                "Volume": "float"})
df["Gain"] = df["Close*"] - df["Open"]
remarkable_filter = (df["Volume"] > 30000000) | (df["Gain"] > 0)
df["Remarkable"] = False
df.loc[remarkable_filter, ["Remarkable"]] = True
If you want to learn more about the basic data collection and processing in the code above (like with pd.read_html, df.sort_values, and implementing filters), check out the links at the end of this piece. Otherwise, just run the code above, and let’s get started!
1. Check out the first few rows
Pretty much everyone knows that you should do this, but it’s always good to take a look at your DataFrame (and how Pandas read it). You’ve probably already checked it out in Excel, but it’s possible that Pandas may have read columns incorrectly or added blank columns. As such, we’ll run the following code to take a peek at our DataFrame right after reading it in:
df.head(5)
The head method simply returns the first n rows of your DataFrame. It defaults to the first 5 rows (so the 5 in the code above is actually unnecessary), but you can specify more or fewer rows if you’d like. You can also set n to a negative number, which makes head return all rows except the last n.
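As a quick illustration on a toy DataFrame (made up for this sketch, not the Tesla data above):

```python
import pandas as pd

# A small toy DataFrame to illustrate head's behavior
toy = pd.DataFrame({"x": range(10)})

first_three = toy.head(3)        # the first 3 rows
default_five = toy.head()        # no argument: defaults to the first 5 rows
all_but_last_two = toy.head(-2)  # negative n: every row except the last 2
```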
2. Take a look at your DataFrame’s columns
So you’ve taken a look at the first few rows in your DataFrame, but what about the values across all columns? Especially when you’re working with a DataFrame with dozens of columns, it’ll be helpful to get a summary table to give you basic information about what’s in your DataFrame. To do so, we run the following:
df.info()
info returns a basic summary breakdown of your DataFrame’s columns. It shows how many columns there are, how many non-null values exist for each column, and what data types exist in your columns. This can be helpful to understand how Pandas has read in your DataFrame.
For example, above we see that every column has a non-null count equal to the number of rows, so the DataFrame contains no null values at all. However, if that’s not the case, you may need to do some work to either fill in the missing values with something or drop the rows with missing values, depending on what kind of analysis you’re doing.
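If info did reveal missing values, the two usual remedies look something like this (a minimal sketch on made-up numbers, not the author’s dataset):

```python
import numpy as np
import pandas as pd

# Toy data with one missing Open price
toy = pd.DataFrame({"Open": [1.0, np.nan, 3.0], "Volume": [10, 20, 30]})

# Option 1: fill the gap, here with the column mean
filled = toy.fillna({"Open": toy["Open"].mean()})

# Option 2: drop any row that contains a missing value
dropped = toy.dropna()
```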
Furthermore, we see that there are seven columns of type float64, which makes sense given how we used astype on the DataFrame columns at the very beginning. You’ll also see that there is a bool column, which comes from the Remarkable column containing only True and False values. You can change the data type of certain columns as needed, and then re-run the info method to make sure that you’ve altered the correct columns.
3. Examine the basic descriptive statistics of your DataFrame
When working with numerical datasets, it’s helpful to take a high-level look at the central tendency, dispersion, and shape of the data you will be working with. Luckily, Pandas will automatically calculate some basic summary statistics for you and organize the results in a table. To do so, we use the following method:
df.describe()
The describe method returns a table with all the numerical columns in your DataFrame along with their associated summary statistics. For each column, the table includes the count (number of values), mean (average value), standard deviation (a measure of how spread out the data is), 25%/50%/75% (values that occur at the quartiles and midpoint), and minimum/maximum (lowest and highest data points).
Depending on the type of data you’re working with, any of these statistics can be helpful for processing your data. For example, if you see the maximum value of a specific column is much higher compared to the mean, you might need to consider removing outliers from that column.
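One crude way to act on that observation is to trim values far from the mean (a sketch on made-up numbers, with an arbitrary one-standard-deviation cutoff):

```python
import pandas as pd

prices = pd.Series([10, 12, 11, 13, 500])  # 500 is an obvious outlier

mean, std = prices.mean(), prices.std()
# Keep only the values within one standard deviation of the mean
trimmed = prices[(prices - mean).abs() <= std]
```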
In the title of this piece, I wrote “and a half” lines of code. The half I was talking about is an addendum to the describe method. By default, it will only return a table with the numerical columns, which makes sense because you can’t really calculate a “mean” for a column with text values. But what if you wanted a kind of summary for your categorical variables? To do so, try the following code below:
df.describe(include="all")
Passing “all” into the include parameter will give you a few additional rows that only apply to the categorical variables. These are unique (the number of unique values in the column), top (the mode, i.e. the value that occurs most often in the column), and freq (the number of times the mode occurs).
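As a quick illustration on a made-up categorical column:

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "red", "green"], name="Color")

# For an object column, describe reports count, unique, top, and freq
summary = colors.describe()
```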
4. Check the pairwise correlation of your DataFrame’s numerical variables
Another way to look at your numerical columns is by taking a look at the pairwise correlation between all the variables. This is another descriptive statistic that can help you understand how your variables are related to one another. Correlation values range from -1 to 1, where -1 means two variables are perfectly negatively correlated and 1 means they are perfectly positively correlated. For example, if two columns have a correlation value close to 1, then when one variable has a higher value, the other variable is also likely to have a higher value. On the other hand, a value close to -1 indicates that the variables move in opposite directions.
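To see the two extremes, here is a tiny made-up example where one column moves exactly with another and a third moves exactly against it:

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],   # b = 2 * a, so perfectly positively correlated with a
    "c": [8, 6, 4, 2],   # c falls as a rises, so perfectly negatively correlated
})

# Pairwise correlation matrix of all numerical columns
corr = toy.corr()
```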
To create this table, we run the following code:
df.corr()
5. Look at the categorical variables in your DataFrame
Finally, if you want to take a closer look at your categorical variables, it would be useful to know what categories exist in the column. Earlier, in the describe summary table we saw that the Remarkable column had two unique values. To take a look at what those two values are and how many times each occurs, we run the following code:
df["Remarkable"].value_counts()
As you can see, our table is quite simple, and if you ran the pre-processing code earlier, these results should not be surprising to you. However, if you were working with a fresh dataset that had many categorical columns, it may be useful to run value_counts on each of the columns, so you can take a look at how many different values are in the non-numerical columns you’re working with. If you have a text column with 300 unique values, it’s possible that for your analysis you’d want to group certain values together so you could have a categorical column with only 10 unique values. Using value_counts would easily help you identify which columns you’d need to adjust.
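A quick sketch of both ideas on made-up data (the City column and the region grouping here are hypothetical, not part of the Tesla dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "City": ["NYC", "LA", "NYC", "SF", "LA", "NYC"],
    "Remarkable": [True, False, True, True, False, False],
})

# Run value_counts on every non-numerical column in one pass
counts = {col: toy[col].value_counts()
          for col in toy.select_dtypes(exclude="number").columns}

# Group many categories into a coarser set, e.g. cities into regions
region_map = {"NYC": "East", "LA": "West", "SF": "West"}
toy["Region"] = toy["City"].map(region_map)
```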
And that’s all!
I hope you found these five tips helpful for when you get started working with a new dataset. I find that running these lines of code before doing anything else helps me figure out how to move forward with my data analysis. Knowing how the data is structured and its basic descriptive statistics will help you determine what steps are necessary for further preprocessing.
Good luck with your future Pandas adventures!
More by me:
- Conditional Selection and Assignment With .loc in Pandas
- 2 Easy Ways to Get Tables From a Website With Pandas
- 4 Different Ways to Efficiently Sort a Pandas DataFrame
- Top 4 Repositories on GitHub to Learn Pandas
- Learning to Forecast With Tableau in 5 Minutes Or Less