1) Examining the dataset
Throughout this course, you'll be analyzing a dataset of traffic stops in Rhode Island that was collected by the Stanford Open Policing Project.
Before beginning your analysis, it's important that you familiarize yourself with the dataset. In this exercise, you'll read the dataset into pandas, examine the first few rows, and then count the number of missing values.
- Import pandas using the alias pd.
- Read the file police.csv into a DataFrame named ri.
- Examine the first 5 rows of the DataFrame (known as the "head").
- Count the number of missing values in each column: Use .isnull() to check which DataFrame elements are missing, and then take the .sum() to count the number of True values in each column.
# Import the pandas library as pd
import pandas as pd
# Read 'police.csv' into a DataFrame named ri
ri = pd.read_csv('police.csv')
# Examine the head of the DataFrame
print(ri.head())
# Count the number of missing values in each column
print(ri.isnull().sum())
2) Dropping columns
Often, a DataFrame will contain columns that are not useful to your analysis. Such columns should be dropped from the DataFrame to make it easier for you to focus on the remaining columns.
In this exercise, you'll drop the county_name column because it only contains missing values, and the state column because all of the traffic stops took place in one state (Rhode Island). Neither column contains useful information. The number of missing values in each column has been printed for you.
- Examine the DataFrame's .shape to find out the number of rows and columns.
- Drop both the county_name and state columns by passing the column names to the .drop() method as a list of strings.
- Examine the .shape again to verify that there are now two fewer columns.
# Examine the shape of the DataFrame
print(ri.shape)
# Drop the 'county_name' and 'state' columns
ri.drop(['county_name', 'state'], axis='columns', inplace=True)
# Examine the shape of the DataFrame (again)
print(ri.shape)
3) Dropping rows
When you know that a specific column will be critical to your analysis, and only a small fraction of rows are missing a value in that column, it often makes sense to remove those rows from the dataset.
During this course, the driver_gender column will be critical to many of your analyses. Because only a small fraction of rows are missing driver_gender, we'll drop those rows from the dataset.
- Count the number of missing values in each column.
- Drop all rows that are missing driver_gender by passing the column name to the subset parameter of .dropna().
- Count the number of missing values in each column again, to verify that none of the remaining rows are missing driver_gender.
- Examine the DataFrame's .shape to see how many rows and columns remain.
# Count the number of missing values in each column
print(ri.isnull().sum())
# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=['driver_gender'], inplace=True)
# Count the number of missing values in each column (again)
print(ri.isnull().sum())
# Examine the shape of the DataFrame
print(ri.shape)
4) Finding an incorrect data type
The dtypes attribute of the ri DataFrame has been printed for you. Your task is to explore the ri DataFrame to determine which column's data type should be changed.
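A minimal way to explore this yourself (assuming ri has been prepared as in the earlier exercises) is to compare each column's values against its reported data type:
# Inspect the data type of each column
print(ri.dtypes)
# Compare a suspect column's values with its dtype
print(ri.is_arrested.head())
Comparing the output shows that is_arrested holds True and False values but is stored as object, which the next exercise confirms.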
5) Fixing a data type
We saw in the previous exercise that the is_arrested column currently has the object data type. In this exercise, we'll change the data type to bool, which is the most suitable type for a column containing True and False values.
Fixing the data type will enable us to use mathematical operations on the is_arrested column that would not be possible otherwise.
- Examine the head of the is_arrested column to verify that it contains True and False values and to check the column's data type.
- Use the .astype() method to convert is_arrested to a bool column.
- Check the new data type of is_arrested to confirm that it is now a bool column.
# Examine the head of the 'is_arrested' column
print(ri.is_arrested.head())
# Change the data type of 'is_arrested' to 'bool'
ri['is_arrested'] = ri.is_arrested.astype(bool)
# Check the data type of 'is_arrested'
print(ri['is_arrested'].dtype)
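As a quick optional illustration of the point above (mirroring calculations used later in the course), a boolean column supports mathematical operations directly:
# True counts as 1 and False as 0, so these aggregations now work
print(ri.is_arrested.sum())   # total number of arrests
print(ri.is_arrested.mean())  # overall arrest rate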
6) Combining object columns
Currently, the date and time of each traffic stop are stored in separate object columns: stop_date and stop_time.
In this exercise, you'll combine these two columns into a single column, and then convert it to datetime format. This will give you access to convenient date-based attributes that we'll use later in the course.
- Use a string method to concatenate stop_date and stop_time (separated by a space), and store the result in combined.
- Convert combined to datetime format, and store the result in a new column named stop_datetime.
- Examine the DataFrame .dtypes to confirm that stop_datetime is a datetime column.
# Concatenate 'stop_date' and 'stop_time' (separated by a space)
combined = ri.stop_date + ' ' + ri.stop_time
# Convert 'combined' to datetime format
ri['stop_datetime'] = pd.to_datetime(combined)
# Examine the data types of the DataFrame
print(ri.dtypes)
7) Setting the index
The last step that you'll take in this chapter is to set the stop_datetime column as the DataFrame's index. By replacing the default index with a DatetimeIndex, you'll make it easier to analyze the dataset by date and time, which will come in handy later in the course!
- Set stop_datetime as the DataFrame index.
- Examine the index to verify that it is a DatetimeIndex.
- Examine the DataFrame columns to confirm that stop_datetime is no longer one of the columns.
# Set 'stop_datetime' as the index
ri.set_index('stop_datetime', inplace=True)
# Examine the index
print(ri.index)
# Examine the columns
print(ri.columns)
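With a DatetimeIndex in place, convenient date-based attributes become available directly on the index. A small optional check, not part of the exercise:
# Access datetime attributes directly from the index
print(ri.index.hour)
print(ri.index.month)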
8) Examining traffic violations
Before comparing the violations being committed by each gender, you should examine the violations committed by all drivers to get a baseline understanding of the data.
In this exercise, you'll count the unique values in the violation column, and then separately express those counts as proportions.
- Count the unique values in the violation column of the ri DataFrame, to see what violations are being committed by all drivers.
- Express the violation counts as proportions of the total.
# Count the unique values in the 'violation' column
violation_counts = ri.violation.value_counts()
print(violation_counts)
# Express the counts as proportions of the total
violation_props = ri.violation.value_counts(normalize=True)
print(violation_props)
9) Comparing violations by gender
The question we're trying to answer is whether male and female drivers tend to commit different types of traffic violations.
In this exercise, you'll first create a DataFrame for each gender, and then analyze the violations in each DataFrame separately.
- Create a DataFrame, female, that only contains rows in which driver_gender is 'F'.
- Create a DataFrame, male, that only contains rows in which driver_gender is 'M'.
- Count the violations committed by female drivers and express them as proportions.
- Count the violations committed by male drivers and express them as proportions.
# Create a DataFrame of female drivers
female = ri[ri.driver_gender == 'F']
# Create a DataFrame of male drivers
male = ri[ri.driver_gender == 'M']
# Compute the violations by female drivers (as proportions)
print(female.violation.value_counts(normalize=True))
# Compute the violations by male drivers (as proportions)
print(male.violation.value_counts(normalize=True))
10) Filtering by multiple conditions
Which command would filter the ri DataFrame so that it only includes female drivers who were stopped for a speeding violation?
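Both conditions must be wrapped in parentheses and combined with the & operator; the correct pattern (also used in the next exercise) is:
# Filter for female drivers stopped for speeding
ri[(ri.driver_gender == 'F') & (ri.violation == 'Speeding')]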
11) Comparing speeding outcomes by gender
When a driver is pulled over for speeding, many people believe that gender has an impact on whether the driver will receive a ticket or a warning. Can you find evidence of this in the dataset?
First, you'll create two DataFrames of drivers who were stopped for speeding: one containing females and the other containing males.
Then, for each gender, you'll use the stop_outcome column to calculate what percentage of stops resulted in a "Citation" (meaning a ticket) versus a "Warning".
- Create a DataFrame, female_and_speeding, that only includes female drivers who were stopped for speeding.
- Create a DataFrame, male_and_speeding, that only includes male drivers who were stopped for speeding.
- Count the stop outcomes for the female drivers and express them as proportions.
- Count the stop outcomes for the male drivers and express them as proportions.
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri.driver_gender == 'F') & (ri.violation == 'Speeding')]
# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri.driver_gender == 'M') & (ri.violation == 'Speeding')]
# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding.stop_outcome.value_counts(normalize=True))
# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding.stop_outcome.value_counts(normalize=True))
12) Calculating the search rate
During a traffic stop, the police officer sometimes conducts a search of the vehicle. In this exercise, you'll calculate the percentage of all stops in the ri DataFrame that result in a vehicle search, also known as the search rate.
- Check the data type of search_conducted to confirm that it's a Boolean Series.
- Calculate the search rate by counting the Series values and expressing them as proportions.
- Calculate the search rate by taking the mean of the Series. (It should match the proportion of True values calculated above.)
# Check the data type of 'search_conducted'
print(ri['search_conducted'].dtype)
# Calculate the search rate by counting the values
print(ri['search_conducted'].value_counts(normalize=True))
# Calculate the search rate by taking the mean
print(ri['search_conducted'].mean())
13) Comparing search rates by gender
In this exercise, you'll compare the rates at which female and male drivers are searched during a traffic stop. Remember that the vehicle search rate across all stops is about 3.8%.
First, you'll filter the DataFrame by gender and calculate the search rate for each group separately. Then, you'll perform the same calculation for both genders at once using a .groupby().
- Filter the DataFrame to only include female drivers, and then calculate the search rate by taking the mean of search_conducted.
# Calculate the search rate for female drivers
print(ri[ri['driver_gender'] == 'F'].search_conducted.mean())
- Filter the DataFrame to only include male drivers, and then repeat the search rate calculation.
# Calculate the search rate for male drivers
print(ri[ri['driver_gender'] == 'M'].search_conducted.mean())
- Group by driver gender to calculate the search rate for both groups simultaneously. (It should match the previous results.)
# Calculate the search rate for both groups simultaneously
print(ri.groupby('driver_gender').search_conducted.mean())
14) Adding a second factor to the analysis
Even though the search rate for males is much higher than for females, it's possible that the difference is mostly due to a second factor.
For example, you might hypothesize that the search rate varies by violation type, and the difference in search rate between males and females is because they tend to commit different violations.
You can test this hypothesis by examining the search rate for each combination of gender and violation. If the hypothesis were true, you would find that males and females are searched at about the same rate for each violation. Let's find out if that's the case!
- Use a .groupby() to calculate the search rate for each combination of gender and violation. Are males and females searched at about the same rate for each violation?
# Calculate the search rate for each combination of gender and violation
print(ri.groupby(['driver_gender', 'violation']).search_conducted.mean())
- Reverse the ordering to group by violation before gender. The results may be easier to compare when presented this way.
# Reverse the ordering to group by violation before gender
print(ri.groupby(['violation', 'driver_gender']).search_conducted.mean())
15) Counting protective frisks
During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a "protective frisk."
In this exercise, you'll first check to see how many times "Protective Frisk" was the only search type. Then, you'll use a string method to locate all instances in which the driver was frisked.
- Count the search_type values in the ri DataFrame to see how many times "Protective Frisk" was the only search type.
- Create a new column, frisk, that is True if search_type contains the string "Protective Frisk" and False otherwise.
- Check the data type of frisk to confirm that it's a Boolean Series.
- Take the sum of frisk to count the total number of frisks.
# Count the 'search_type' values
print(ri.search_type.value_counts())
# Check if 'search_type' contains the string 'Protective Frisk' ('na=False' treats missing values as False)
ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False)
# Check the data type of 'frisk'
print(ri.frisk.dtype)
# Take the sum of 'frisk'
print(ri.frisk.sum())
16) Comparing frisk rates by gender
In this exercise, you'll compare the rates at which female and male drivers are frisked during a search. Are males frisked more often than females, perhaps because police officers consider them to be higher risk?
Before doing any calculations, it's important to filter the DataFrame to only include the relevant subset of data, namely stops in which a search was conducted.
- Create a DataFrame, searched, that only contains rows in which search_conducted is True.
- Take the mean of the frisk column to find out what percentage of searches included a frisk.
- Calculate the frisk rate for each gender using a .groupby().
# Create a DataFrame of stops in which a search was conducted
searched = ri[ri.search_conducted]  # equivalent to ri[ri.search_conducted == True]
# Calculate the overall frisk rate by taking the mean of 'frisk'
print(searched.frisk.mean())
# Calculate the frisk rate for each gender
print(searched.groupby('driver_gender').frisk.mean())
17) Calculating the hourly arrest rate
When a police officer stops a driver, a small percentage of those stops ends in an arrest. This is known as the arrest rate. In this exercise, you'll find out whether the arrest rate varies by time of day.
First, you'll calculate the arrest rate across all stops in the ri DataFrame. Then, you'll calculate the hourly arrest rate by using the hour attribute of the index. The hour ranges from 0 to 23, in which:
- 0 = midnight
- 12 = noon
- 23 = 11 PM
- Take the mean of the is_arrested column to calculate the overall arrest rate.
- Group by the hour attribute of the DataFrame index to calculate the hourly arrest rate.
- Save the hourly arrest rate Series as a new object, hourly_arrest_rate.
# Calculate the overall arrest rate
print(ri.is_arrested.mean())
# Calculate the hourly arrest rate
print(ri.groupby(ri.index.hour).is_arrested.mean())
# Save the hourly arrest rate
hourly_arrest_rate = ri.groupby(ri.index.hour).is_arrested.mean()
18) Plotting the hourly arrest rate
In this exercise, you'll create a line plot from the hourly_arrest_rate object. A line plot is appropriate in this case because you're showing how a quantity changes over time.
This plot should help you to spot some trends that may not have been obvious when examining the raw numbers!
- Import matplotlib.pyplot using the alias plt.
- Create a line plot of hourly_arrest_rate using the .plot() method.
- Label the x-axis as 'Hour', label the y-axis as 'Arrest Rate', and title the plot 'Arrest Rate by Time of Day'.
- Display the plot using the .show() function.
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Create a line plot of 'hourly_arrest_rate'
hourly_arrest_rate.plot()
# Add the xlabel, ylabel, and title
plt.xlabel('Hour')
plt.ylabel('Arrest Rate')
plt.title('Arrest Rate by Time of Day')
# Display the plot
plt.show()
19) Plotting drug-related stops
In a small portion of traffic stops, drugs are found in the vehicle during a search. In this exercise, you'll assess whether these drug-related stops are becoming more common over time.
The Boolean column drugs_related_stop indicates whether drugs were found during a given stop. You'll calculate the annual drug rate by resampling this column, and then you'll use a line plot to visualize how the rate has changed over time.
- Calculate the annual rate of drug-related stops by resampling the drugs_related_stop column (on the 'A' frequency) and taking the mean.
- Save the annual drug rate Series as a new object, annual_drug_rate.
- Create a line plot of annual_drug_rate using the .plot() method.
- Display the plot using the .show() function.
# Calculate the annual rate of drug-related stops
print(ri.drugs_related_stop.resample('A').mean())
# Save the annual rate of drug-related stops
annual_drug_rate = ri.drugs_related_stop.resample('A').mean()
# Create a line plot of 'annual_drug_rate'
annual_drug_rate.plot()
# Display the plot (matplotlib.pyplot was already imported as plt above)
plt.show()
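Note: 'A' is the annual (year-end) resampling frequency. Recent pandas releases have renamed this alias, so if 'A' raises an error in your environment, substituting 'YE' should work.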
20) Comparing drug and search rates
As you saw in the last exercise, the rate of drug-related stops increased significantly between 2005 and 2015. You might hypothesize that the rate of vehicle searches was also increasing, which would have led to an increase in drug-related stops even if more drivers were not carrying drugs.
You can test this hypothesis by calculating the annual search rate, and then plotting it against the annual drug rate. If the hypothesis is true, then you'll see both rates increasing over time.
- Calculate the annual search rate by resampling the search_conducted column, and save the result as annual_search_rate.
- Concatenate annual_drug_rate and annual_search_rate along the columns axis, and save the result as annual.
- Create subplots of the drug and search rates from the annual DataFrame.
- Display the subplots.
# Calculate and save the annual search rate
annual_search_rate = ri.search_conducted.resample('A').mean()
# Concatenate 'annual_drug_rate' and 'annual_search_rate'
annual = pd.concat([annual_drug_rate, annual_search_rate], axis='columns')
# Create subplots from 'annual'
annual.plot(subplots=True)
# Display the subplots
plt.show()
21) Tallying violations by district
The state of Rhode Island is broken into six police districts, also known as zones. How do the zones compare in terms of what violations are caught by police?
In this exercise, you'll create a frequency table to determine how many violations of each type took place in each of the six zones. Then, you'll filter the table to focus on the "K" zones, which you'll examine further in the next exercise.
- Create a frequency table from the ri DataFrame's district and violation columns using the pd.crosstab() function.
- Save the frequency table as a new object, all_zones.
- Select rows 'Zone K1' through 'Zone K3' from all_zones using the .loc[] accessor.
- Save the smaller table as a new object, k_zones.
# Create a frequency table of districts and violations
print(pd.crosstab(ri['district'], ri['violation']))
# Save the frequency table as 'all_zones'
all_zones = pd.crosstab(ri['district'], ri['violation'])
# Select rows 'Zone K1' through 'Zone K3'
print(all_zones.loc['Zone K1':'Zone K3'])
# Save the smaller table as 'k_zones'
k_zones = all_zones.loc['Zone K1':'Zone K3']
22) Plotting violations by district
Now that you've created a frequency table focused on the "K" zones, you'll visualize the data to help you compare what violations are being caught in each zone.
First you'll create a bar plot, which is an appropriate plot type since you're comparing categorical data. Then you'll create a stacked bar plot in order to get a slightly different look at the data. Which plot do you find to be more insightful?
- Create a bar plot of k_zones.
- Display the plot and examine it. What do you notice about each of the zones?
# Create a bar plot of 'k_zones'
k_zones.plot(kind='bar')
# Display the plot
plt.show()
- Create a stacked bar plot of k_zones.
- Display the plot and examine it. Do you notice anything different about the data than you did previously?
# Create a stacked bar plot of 'k_zones'
k_zones.plot(kind='bar', stacked=True)
# Display the plot
plt.show()
23) Converting stop durations to numbers
In the traffic stops dataset, the stop_duration column tells you approximately how long the driver was detained by the officer. Unfortunately, the durations are stored as strings, such as '0-15 Min'. How can you make this data easier to analyze?
In this exercise, you'll convert the stop durations to integers. Because the precise durations are not available, you'll have to estimate the numbers using reasonable values:
- Convert '0-15 Min' to 8
- Convert '16-30 Min' to 23
- Convert '30+ Min' to 45
- Print the unique values in the stop_duration column. (This has been done for you.)
- Create a dictionary called mapping that maps the stop_duration strings to the specified integers.
- Convert the stop_duration strings to integers using the mapping, and store the results in a new column called stop_minutes.
- Print the unique values in the stop_minutes column, to verify that the durations were properly converted to integers.
# Print the unique values in 'stop_duration'
print(ri.stop_duration.unique())
# Create a dictionary that maps strings to integers
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}
# Convert the 'stop_duration' strings to integers using the 'mapping'
ri['stop_minutes'] = ri.stop_duration.map(mapping)
# Print the unique values in 'stop_minutes'
print(ri.stop_minutes.unique())
24) Plotting stop length
If you were stopped for a particular violation, how long might you expect to be detained?
In this exercise, you'll visualize the average length of time drivers are stopped for each type of violation. Rather than using the violation column in this exercise, you'll use violation_raw since it contains more detailed descriptions of the violations.
- For each value in the ri DataFrame's violation_raw column, calculate the mean number of stop_minutes that a driver is detained.
- Save the resulting Series as a new object, stop_length.
- Sort stop_length by its values, and then visualize it using a horizontal bar plot.
- Display the plot.
# Calculate the mean 'stop_minutes' for each value in 'violation_raw' and save it as 'stop_length'
stop_length = ri.groupby('violation_raw').stop_minutes.mean()
# Examine the resulting Series
print(stop_length)
# Sort 'stop_length' by its values and create a horizontal bar plot
stop_length.sort_values().plot(kind='barh', figsize=(8, 10))
# Display the plot
plt.show()
25) Plotting the temperature
In this exercise, you'll examine the temperature columns from the weather dataset to assess whether the data seems trustworthy. First you'll print the summary statistics, and then you'll visualize the data using a box plot.
When deciding whether the values seem reasonable, keep in mind that the temperature is measured in degrees Fahrenheit, not Celsius!
- Read weather.csv into a DataFrame named weather.
- Select the temperature columns (TMIN, TAVG, TMAX) and print their summary statistics using the .describe() method.
- Create a box plot to visualize the temperature columns.
- Display the plot.
# Read 'weather.csv' into a DataFrame named 'weather'
weather = pd.read_csv('weather.csv')
# Describe the temperature columns
print(weather[['TMIN', 'TAVG', 'TMAX']].describe())
# Create a box plot of the temperature columns
weather[['TMIN', 'TAVG', 'TMAX']].plot(kind='box')
# Display the plot
plt.show()
26) Plotting the temperature difference
In this exercise, you'll continue to assess whether the dataset seems trustworthy by plotting the difference between the maximum and minimum temperatures.
What do you notice about the resulting histogram? Does it match your expectations, or do you see anything unusual?
- Create a new column in the weather DataFrame named TDIFF that represents the difference between the maximum and minimum temperatures.
- Print the summary statistics for TDIFF using the .describe() method.
- Create a histogram with 20 bins to visualize TDIFF.
- Display the plot.
# Create a 'TDIFF' column that represents temperature difference
weather['TDIFF'] = weather['TMAX'] - weather['TMIN']
# Describe the 'TDIFF' column
print(weather['TDIFF'].describe())
# Create a histogram with 20 bins to visualize 'TDIFF'
weather['TDIFF'].plot(kind='hist', bins=20)
# Display the plot
plt.show()
27) Counting bad weather conditions
The weather DataFrame contains 20 columns that start with 'WT', each of which represents a bad weather condition. For example:
- WT05 indicates "Hail"
- WT11 indicates "High or damaging winds"
- WT17 indicates "Freezing rain"
For every row in the dataset, each WT column contains either a 1 (meaning the condition was present that day) or NaN (meaning the condition was not present).
In this exercise, you'll quantify "how bad" the weather was each day by counting the number of 1 values in each row.
- Copy the columns WT01 through WT22 from weather to a new DataFrame named WT.
- Calculate the sum of each row in WT, and store the results in a new weather column named bad_conditions.
- Replace any missing values in bad_conditions with a 0. (This has been done for you.)
- Create a histogram to visualize bad_conditions, and then display the plot.
# Copy 'WT01' through 'WT22' to a new DataFrame
WT = weather.loc[:, 'WT01':'WT22']
# Calculate the sum of each row in 'WT'
weather['bad_conditions'] = WT.sum(axis='columns')
# Replace missing values in 'bad_conditions' with '0'
weather['bad_conditions'] = weather.bad_conditions.fillna(0).astype('int')
# Create a histogram to visualize 'bad_conditions'
weather.bad_conditions.plot(kind='hist')
# Display the plot
plt.show()
28) Rating the weather conditions
In the previous exercise, you counted the number of bad weather conditions each day. In this exercise, you'll use the counts to create a rating system for the weather.
The counts range from 0 to 9, and should be converted to ratings as follows:
- Convert 0 to 'good'
- Convert 1 through 4 to 'bad'
- Convert 5 through 9 to 'worse'
- Count the unique values in the bad_conditions column and sort the index. (This has been done for you.)
- Create a dictionary called mapping that maps the bad_conditions integers to the specified strings.
- Convert the bad_conditions integers to strings using the mapping, and store the results in a new column called rating.
- Count the unique values in rating to verify that the integers were properly converted to strings.
# Count the unique values in 'bad_conditions' and sort the index
print(weather.bad_conditions.value_counts().sort_index())
# Create a dictionary that maps integers to strings
mapping = {0: 'good', 1: 'bad', 2: 'bad', 3: 'bad', 4: 'bad', 5: 'worse', 6: 'worse', 7: 'worse', 8: 'worse', 9: 'worse'}
# Convert the 'bad_conditions' integers to strings using the 'mapping'
weather['rating'] = weather.bad_conditions.map(mapping)
# Count the unique values in 'rating'
print(weather.rating.value_counts())
29) Changing the data type to category
Since the rating column only has a few possible values, you'll change its data type to category in order to store the data more efficiently. You'll also specify a logical order for the categories, which will be useful for future exercises.
- Create a CategoricalDtype object called cats that lists the weather ratings in a logical order: 'good', 'bad', 'worse'. Make sure to specify that the categories should be treated as ordered.
- Use the cats object to change the data type of the rating column from object to category.
- Examine the head of the rating column to confirm that the categories are logically ordered.
# Specify the logical order of the weather ratings
cats = pd.CategoricalDtype(['good', 'bad', 'worse'], ordered=True)
# Change the data type of 'rating' to category
weather['rating'] = weather.rating.astype(cats)
# Examine the head of 'rating'
print(weather['rating'].head())
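If you'd like to verify the efficiency claim, one optional check (assuming the conversion above has run) is to compare the column's memory usage under each data type:
# Memory usage with the values stored as plain object strings
print(weather.rating.astype('object').memory_usage(deep=True))
# Memory usage with the values stored as an ordered category
print(weather.rating.memory_usage(deep=True))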
30) Preparing the DataFrames
In this exercise, you'll prepare the traffic stop and weather rating DataFrames so that they're ready to be merged:
- With the ri DataFrame, you'll move the stop_datetime index to a column since the index will be lost during the merge.
- With the weather DataFrame, you'll select the DATE and rating columns and put them in a new DataFrame.
- Reset the index of the ri DataFrame.
- Examine the head of ri to verify that stop_datetime is now a DataFrame column, and the index is now the default integer index.
- Create a new DataFrame named weather_rating that contains only the DATE and rating columns from the weather DataFrame.
- Examine the head of weather_rating to verify that it contains the proper columns.
# Reset the index of 'ri'
ri.reset_index(inplace=True)
# Examine the head of 'ri'
print(ri.head())
# Create a DataFrame from the 'DATE' and 'rating' columns
weather_rating = weather[['DATE', 'rating']]
# Examine the head of 'weather_rating'
print(weather_rating.head())
31) Merging the DataFrames
In this exercise, you'll merge the ri and weather_rating DataFrames into a new DataFrame, ri_weather.
The DataFrames will be joined using the stop_date column from ri and the DATE column from weather_rating. Thankfully the date formatting matches exactly, which is not always the case!
Once the merge is complete, you'll set stop_datetime as the index, which is the column you saved in the previous exercise.
- Examine the shape of the ri DataFrame.
- Merge the ri and weather_rating DataFrames using a left join.
- Examine the shape of ri_weather to confirm that it has two more columns but the same number of rows as ri.
- Replace the index of ri_weather with the stop_datetime column.
# Examine the shape of 'ri'
print(ri.shape)
# Merge 'ri' and 'weather_rating' using a left join
ri_weather = pd.merge(left=ri, right=weather_rating, left_on='stop_date', right_on='DATE', how='left')
# Examine the shape of 'ri_weather'
print(ri_weather.shape)
# Set 'stop_datetime' as the index of 'ri_weather'
ri_weather.set_index('stop_datetime', inplace=True)
32) Comparing arrest rates by weather rating
Do police officers arrest drivers more often when the weather is bad? Let's find out!
- First, you'll calculate the overall arrest rate.
- Then, you'll calculate the arrest rate for each of the weather ratings you previously assigned.
- Finally, you'll add violation type as a second factor in the analysis, to see if that accounts for any differences in the arrest rate.
Since you previously defined a logical order for the weather categories, good < bad < worse, they will be sorted that way in the results.
- Calculate the overall arrest rate by taking the mean of the is_arrested Series.
# Calculate the overall arrest rate
print(ri_weather.is_arrested.mean())
- Calculate the arrest rate for each weather rating using a .groupby().
# Calculate the arrest rate for each 'rating'
print(ri_weather.groupby('rating').is_arrested.mean())
- Calculate the arrest rate for each combination of violation and rating. How do the arrest rates differ by group?
# Calculate the arrest rate for each 'violation' and 'rating'
print(ri_weather.groupby(['violation', 'rating']).is_arrested.mean())
33) Selecting from a multi-indexed Series
The output of a single .groupby() operation on multiple columns is a Series with a MultiIndex. Working with this type of object is similar to working with a DataFrame:
- The outer index level is like the DataFrame rows.
- The inner index level is like the DataFrame columns.
In this exercise, you'll practice accessing data from a multi-indexed Series using the .loc[] accessor.
- Save the output of the .groupby() operation from the last exercise as a new object, arrest_rate. (This has been done for you.)
- Print the arrest_rate Series and examine it.
- Print the arrest rate for moving violations in bad weather.
- Print the arrest rates for speeding violations in all three weather conditions.
# Save the output of the groupby operation from the last exercise
arrest_rate = ri_weather.groupby(['violation', 'rating']).is_arrested.mean()
# Print the 'arrest_rate' Series
print(arrest_rate)
# Print the arrest rate for moving violations in bad weather
print(arrest_rate.loc['Moving violation', 'bad'])
# Print the arrest rates for speeding violations in all three weather conditions
print(arrest_rate.loc['Speeding'])
34) Reshaping the arrest rate data
In this exercise, you'll start by reshaping the arrest_rate Series into a DataFrame. This is a useful step when working with any multi-indexed Series, since it enables you to access the full range of DataFrame methods.
Then, you'll create the exact same DataFrame using a pivot table. This is a great example of how pandas often gives you more than one way to reach the same result!
- Unstack the arrest_rate Series to reshape it into a DataFrame.
- Create the exact same DataFrame using a pivot table! Each of the three .pivot_table() parameters should be specified as one of the ri_weather columns.
# Unstack the 'arrest_rate' Series into a DataFrame
print(arrest_rate.unstack())
# Create the same DataFrame using a pivot table
print(ri_weather.pivot_table(index='violation', columns='rating', values='is_arrested'))