Pandas DataFrames

Getting to Know DataFrames in Pandas

Working with Pandas DataFrames

Pandas uses the DataFrame class for working with tabular data. DataFrames allow you to read, sort, manipulate, and display data. In this exercise, you’ll create your first DataFrame and use it to do some initial exploration of a data set.

Objectives

When you finish this lab, you should be comfortable with:

  • Loading data into a DataFrame from a CSV file
  • Printing out the first or last five rows
  • Viewing DataFrame attributes like its columns and shape
  • Accessing individual columns
  • Manipulating the data set including sorting, ranking, and grouping
  • Adding new columns and populating them with generated data
  • Changing the data type of a column
  • Filtering data by column values using loc and numerical or qualitative values

We’ll build on these further in the next lab.

Resources

Exercise Files

You can view this lab as an interactive Jupyter notebook here.

You can use right-click + “save link as” to download these files. You can also access them from GitHub at kpendolino/PandasDataFrames

Pandas Documentation

Setup

We’ve started you off with some basic setup, including importing pandas and defining some useful constants.

Example

# SETUP
import pandas as pd

# We can use these constants to refer to columns in the DataFrame. 
COMPANY="Company"
COUNTRY="Country"
RATING="Rating"
RATERS="Raters"
TYPE="Type"

OVERALLRANK="Overall Rank"
COUNTRYRANK="Rank In Country"
INTRATING="Integer Rating"

Create Your DataFrame

To create your DataFrame, you’ll use pd.read_csv(<filename>). This reads a CSV file and loads it into a DataFrame using the first row as column headers. To take a quick peek at your data, you can use DataFrame.head() to print out the first five rows or DataFrame.tail() to print out the last five. Take a look at this in action:

Example

chips =  pd.read_csv('snackdata.csv')
print(chips.head())
print()
print(chips.tail())

Output

     Company     Flavor    Style  Rating
0  Old Dutch      Plain  Regular   6.307
1  Old Dutch      Plain   Kettle   5.887
2  Old Dutch      Plain   Canned   5.771
3  Old Dutch      Plain   Ridges   5.152
4  Old Dutch  Bar-B-Que  Regular   6.837

                Company                Flavor    Style  Rating
175  President's Choice       Salt and Pepper   Ridges   6.076
176  President's Choice  Sour Cream and Onion  Regular   4.734
177  President's Choice  Sour Cream and Onion   Kettle   6.014
178  President's Choice  Sour Cream and Onion   Canned   5.232
179  President's Choice  Sour Cream and Onion   Ridges   7.059

Load the data from ‘chocolate_data.csv’ into pandas as a DataFrame called chocolate.

Print the first five rows and the last five rows.

Working With Columns

Pandas allows you to access column datas in several ways. First, you can use square brackets to access column data the same way you would use them to access dictionary data by key, for example, df['Column Name'] returns the data in the Column Name column. For simple column names, you can also access that data as an attribute, for example, df.ColName returns the data in the ColName column. We’ll see another method later.

Example

print(chips[STYLE].head())
print()
print(chips.Flavor.tail())

Output

0    Regular
1     Kettle
2     Canned
3     Ridges
4    Regular
Name: Style, dtype: object

175         Salt and Pepper
176    Sour Cream and Onion
177    Sour Cream and Onion
178    Sour Cream and Onion
179    Sour Cream and Onion
Name: Flavor, dtype: object

Print the first five values in the Company column using square brackets.

Print the last five values in the Rating column by accessing it as an attribute.

Accessing Multiple Columns

To access a subset of the DataFrame that includes multiple columns, you can use df[[column1, column2, ...]].

Example

print(chips[[COMPANY,STYLE,FLAVOR]].head())

Output

     Company    Style     Flavor
0  Old Dutch  Regular      Plain
1  Old Dutch   Kettle      Plain
2  Old Dutch   Canned      Plain
3  Old Dutch   Ridges      Plain
4  Old Dutch  Regular  Bar-B-Que

Print the first five rows of chocolate, showing only the Company, Country, and Type columns.

DataFrame Attributes

DataFrames have several built-in attributes that you can use to view summary information. For example, you can view a list of columns in a DataFrame using df.columns. You can view the dimensions of the DataFrame - the number of rows and columns in it - using df.shape.

Example

print(chips.columns)
print()
print(chips.shape)

Output

Index([u'Company', u'Flavor', u'Style', u'Rating'], dtype='object')

(180, 4)

Print the list of columns in chocolate.

Print the dimensions of chocolate.

Manipulating Data

DataFrames have several methods that you can use to manipulate the data. For example, you can use df.sort_values(<column(s)>, ascending=<direction(s)>) to sort a DataFrame. You can use df.rank(method=method, ascending=<boolean>) to calculate ranks for each row. We’ll use the min ranking method; this means that records with the same value will all share the rank on the lower end of the tied range - if three records are tied for second place, they will all receive rank 2, and the next record will start at rank 5. These methods do not modify the DataFrame in place, but you can save the results as a new DataFrame or a new column in the DataFrame. To save results as a new column, use square brackets - just as you would to add a new element to a dictionary.

Example

# Sort the rows by Rating, then print the best five
print(chips.sort_values(RATING,ascending=False).head())
print()

# Calculate overall rankings
chips[OVERALLRANK]=chips[RATING].rank(method="min",ascending=False)

print(chips[[COMPANY,STYLE,FLAVOR,OVERALLRANK]].head())

Output

              Company                Flavor   Style  Rating
94           Cape Cod  Sour Cream and Onion  Kettle   9.525
24             Kettle  Sour Cream and Onion  Kettle   9.455
17          Old Dutch  Sour Cream and Onion  Kettle   9.415
23             Kettle       Salt and Pepper  Kettle   9.343
159  Jay's of Chicago  Sour Cream and Onion  Ridges   8.874

     Company    Style     Flavor  Overall Rank
0  Old Dutch  Regular      Plain          74.0
1  Old Dutch   Kettle      Plain         105.0
2  Old Dutch   Canned      Plain         116.0
3  Old Dutch   Ridges      Plain         143.0
4  Old Dutch  Regular  Bar-B-Que          51.0

Print the first five rows of chocolate after sorting by Rating, descending.

Add a new column to chocolate using the OVERALLRANK constant as the label. Populate it with the rank of each record by Rating, descending. Use the “min” ranking method.

Print the first five rows of chocolate, showing only Company, Country, Type, and Rating.

Modifying Data Types

You can use df[column].astype(type) to cast the values in a column to another data type. For example, you can use this to change integers into floats or strings.

Example

chips[INTRATING]=chips[RATING].astype(int)
print(chips[[COMPANY,STYLE,RATING,INTRATING]].head())

Output

     Company    Style  Rating  Integer Rating
0  Old Dutch  Regular   6.307               6
1  Old Dutch   Kettle   5.887               5
2  Old Dutch   Canned   5.771               5
3  Old Dutch   Ridges   5.152               5
4  Old Dutch  Regular   6.837               6

Add a new column to chocolate using the INTRATING constant as the column label. Populate the column with values from RATING, cast as integers.

Print the Company, Type, RATING, and INTRATING values for the first five records.

Grouping Data

You can use df.groupby(column(s)) to access the data grouped by the indicated column. This can be handy for manipulating sub-sets of the data, such as to calculate ranks or statistics by groups.

Example

# Calculate ranks within each Style
chips[STYLERANK]=chips.groupby(STYLE)[RATING].rank(method="min",ascending=False)
print(chips[[STYLE,RATING,OVERALLRANK,STYLERANK]].head())
print()

# Cast ranks as integers
chips[STYLERANK]=chips[STYLERANK].astype(int)
print(chips[[STYLE,RATING,OVERALLRANK,STYLERANK]].head())

Output

     Style  Rating  Overall Rank  Rank In Style
0  Regular   6.307          74.0            7.0
1   Kettle   5.887         105.0           37.0
2   Canned   5.771         116.0           27.0
3   Ridges   5.152         143.0           37.0
4  Regular   6.837          51.0            4.0

     Style  Rating  Overall Rank  Rank In Style
0  Regular   6.307          74.0              7
1   Kettle   5.887         105.0             37
2   Canned   5.771         116.0             27
3   Ridges   5.152         143.0             37
4  Regular   6.837          51.0              4

Calculate ranks for chocolates within each Country using the “min” ranking method in descending order. Save them as a new column on chocolate using the COUNTRYRANK constant as a label.

Print out the top five records, showing the Country, Type, RATING, OVERALLLRATING, and COUNTRYRANK.

Save the ranks as integers.

Print out the top five records again, showing the Country, Type, RATING, OVERALLLRATING, and COUNTRYRANK.

Filtering Columns

You can view data at a specific location using the df.loc() method. The location identifier can be a simple column label or row index, a combination of labels and indices, or even logical operators on column labels, row indices, or cell values.

Example

tenbest=chips.loc[chips[OVERALLRANK] <= 10].sort_values(RATING,ascending=False)
print(tenbest[[COMPANY,FLAVOR,STYLE,RATING,OVERALLRANK]])
print()


bestinstyle=chips.loc[chips[STYLERANK] == 1].sort_values(RATING,ascending=False)
print(bestinstyle[[STYLE,COMPANY,FLAVOR,RATING,OVERALLRANK,STYLERANK]])

Output

              Company                Flavor   Style  Rating  Overall Rank
94           Cape Cod  Sour Cream and Onion  Kettle   9.525           1.0
24             Kettle  Sour Cream and Onion  Kettle   9.455           2.0
17          Old Dutch  Sour Cream and Onion  Kettle   9.415           3.0
23             Kettle       Salt and Pepper  Kettle   9.343           4.0
159  Jay's of Chicago  Sour Cream and Onion  Ridges   8.874           5.0
11          Old Dutch      Salt and Vinegar  Ridges   8.754           6.0
130            Zapp's      Salt and Vinegar  Canned   8.725           7.0
13          Old Dutch       Salt and Pepper  Kettle   8.686           8.0
150  Jay's of Chicago      Salt and Vinegar  Canned   8.440           9.0
22             Kettle      Salt and Vinegar  Kettle   8.340          10.0

       Style           Company  ... Overall Rank  Rank In Style
94    Kettle          Cape Cod  ...          1.0              1
159   Ridges  Jay's of Chicago  ...          5.0              1
130   Canned            Zapp's  ...          7.0              1
144  Regular  Jay's of Chicago  ...         39.0              1

Create a new DataFrame called tenbest that contains only the ten highest-rated chocolate bars in chocolate. Print out only the Company, Country, Type, Rating, and OVERALLRANK values for tenbest.

Create a new DataFrame called bestincountry that contains only the highest rated chocolate bar from each Country. Print out only the Company, Country, Type, Rating, Overall Rank, and COUNTRYRANK values for bestincountry.

Filtering Columns, Continued

You can also use df.isin(values) to filter a DataFrame to only records that include values in the provided list. Usually, you’ll use this with a subset of the full DataFrame within a call to df.loc[]. For example, df.loc[chips.Style.isin(["Kettle","Ridges","Canned"])] will return only records from chips where the Style column contains either “Kettle”, “Ridges”, or “Canned”.

Example

# Get the 5 best-ranked Kettle and Ridges chips across all companies and flavors
crunch=chips.loc[chips[STYLE].isin(["Kettle","Ridges"]) &
     (chips[STYLERANK]<=5)].sort_values([STYLE,RATING],ascending=[True,False])
print(crunch[[STYLE,COMPANY,FLAVOR,STYLERANK]])

Output

      Style           Company                Flavor  Rank In Style
94   Kettle          Cape Cod  Sour Cream and Onion              1
24   Kettle            Kettle  Sour Cream and Onion              2
17   Kettle         Old Dutch  Sour Cream and Onion              3
23   Kettle            Kettle       Salt and Pepper              4
13   Kettle         Old Dutch       Salt and Pepper              5
159  Ridges  Jay's of Chicago  Sour Cream and Onion              1
11   Ridges         Old Dutch      Salt and Vinegar              2
111  Ridges       Better Made      Salt and Vinegar              3
15   Ridges         Old Dutch       Salt and Pepper              4
61   Ridges              Lays      Salt and Vinegar              5

Create a new DataFrame called usa_uk that includes only chocolate bars made in “America” or “United Kingdom” that are in the top 5 best chocolate bars from their country.

Sort the values by Country, ascending, then by Rating, descending.

Print usa_uk, showing only the Type, Company, Country, and COUNTRYRANK.

The Big Picture

You’ve now explored the basics of creating and manipulating DataFrames in Pandas. You can pair DataFrames with standard python methods that act on lists to take them even further. Take a few minutes to think through what kinds of insights these tools can provide.

Which country has the most chocolate bars in the top ten overall?

Which company has the greatest range in the ratings of their bars? Which company is the most consistently rated?

What do you think the rating scale used in this data set was?

Congratulations on completing this lab exercise!


© 2022 Kendra Pendolino

 Share!