
MSM6353 Week-4 Python Lab 2

 1)What does time.time() measure?

What does the time.time() function exactly measure?
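For reference, time.time() returns the current time as a floating-point number of seconds elapsed since the epoch (00:00:00 UTC on January 1, 1970, on Unix systems). A minimal illustration:

import time

print(time.time())  # e.g. 1712345678.123456 -- seconds since the epoch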

2)Measuring time I

In the lecture slides, you saw how the time.time() function can be imported and used to assess the time required to perform a basic mathematical operation.

Now, you will use the same strategy to assess two different methods for solving a similar problem: calculate the sum of all the positive integers from 1 to 1 million (1,000,000).

Similar to what you saw in the video, you will compare two methods: one that uses brute force and one that is more mathematically sophisticated.

In the function formula(), we use the standard closed-form formula

result = N * (N + 1) / 2

where N = 1,000,000.

In the function brute_force(), we loop over each number from 1 to 1 million and add it to the result.
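Both helpers are pre-defined in the exercise environment; a minimal sketch consistent with the descriptions above (assumed here, not the course's exact code) could be:

def formula(N):
    # Closed-form sum of the first N positive integers
    return N * (N + 1) / 2

def brute_force(N):
    # Loop over each number from 1 to N and accumulate the sum
    res = 0
    for i in range(1, N + 1):
        res += i
    return res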

  • Calculate the result of the problem using the formula() function.
  • Print the time required to calculate the result using the formula() function.
  • Calculate the result of the problem using the brute_force() function.
  • Print the time required to calculate the result using the brute_force() function.

# Calculate the result of the problem using formula() and print the time required

import time

N = 1000000

fm_start_time = time.time()

first_method = formula(N)

print("Time using formula: {} sec".format(time.time() - fm_start_time))

 

# Calculate the result of the problem using brute_force() and print the time required

sm_start_time = time.time()

second_method = brute_force(N)

print("Time using the brute force: {} sec".format(time.time() - sm_start_time))

 

3)Measuring time II

As we discussed in the lectures, in the majority of cases, a list comprehension is faster than a for loop.

In this demonstration, you will see a case where a list comprehension and a for loop have such a small difference in efficiency that either method will perform this simple task almost instantly.

In the list words, there are random words downloaded from the Internet. We want to create another list called letlist in which we only keep the words that start with the letter b.

In case you are not familiar with handling strings in Python, note that each string has a .startswith() method, which returns True or False depending on whether the string starts with a specific letter/phrase.
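For example, a quick check in the console:

print('banana'.startswith('b'))  # True

print('apple'.startswith('b'))   # False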

  • Assign the time before the execution of the list comprehension
  • Assign the time after the execution of the list comprehension

# Store the time before the execution

start_time = time.time()

 

# Execute the operation

letlist = [wrd for wrd in words if wrd.startswith('b')]

 

# Store and print the difference between the start and the current time

total_time_lc = time.time() - start_time

print('Time using list comprehension: {} sec'.format(total_time_lc))

 

  • Assign the time before the execution of the for loop
  • Assign the time after the execution of the for loop

# Store the time before the execution

start_time = time.time()

 

# Execute the operation

letlist = [wrd for wrd in words if wrd.startswith('b')]

 

# Store and print the difference between the start and the current time

total_time_lc = time.time() - start_time

print('Time using list comprehension: {} sec'.format(total_time_lc))

 

# Store the time before the execution of the for loop

start_time = time.time()

 

# Execute the operation

letlist = []

for wrd in words:

    if wrd.startswith('b'):

        letlist.append(wrd)

        

# Print the difference between the start and the current time

total_time_fl = time.time() - start_time

print('Time using for loop: {} sec'.format(total_time_fl))

 

4)Row selection: loc[] vs iloc[]

A big part of working with DataFrames is to locate specific entries in the dataset. You can locate rows in two ways:

  • By a specific value of a column (feature).
  • By the index of the rows (index).

In this exercise, we will focus on the second way.

If you have previous experience with pandas, you should be familiar with the .loc and .iloc indexers, which stand for 'location' and 'integer location' respectively. In most cases, the indices will be the same as the position of each row in the DataFrame (e.g. the row with index 13 will be the 14th entry).
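A quick sketch of the difference on a toy DataFrame (made up here, not part of the exercise):

import pandas as pd

df = pd.DataFrame({'val': [10, 20, 30]}, index=[100, 101, 102])

print(df.loc[101])   # selects by index label 101
print(df.iloc[1])    # selects by position 1 -- the same row in this case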

While we can use both functions to perform the same task, we are interested in which is the most efficient in terms of speed.

  • Store the indices of the first 1000 rows in row_nums.
  • Use the .loc[] indexer to select the first 1000 rows of poker_hands, and record the times before and after that operation.
  • Print the time it took to select the rows.

# Define the range of rows to select: row_nums

row_nums = range(0, 1000)

 

# Select the rows using .loc[] and row_nums and record the time before and after

loc_start_time = time.time()

rows = poker_hands.loc[row_nums]

loc_end_time = time.time()

 

# Print the time it took to select the rows using .loc[]

print("Time using .loc[]: {} sec".format(loc_end_time - loc_start_time))

 

  • Use the .iloc[] indexer with row_nums to select the first 1000 rows of the DataFrame poker_hands, and print how much time it took (as the difference between the time after the selection and the time before the selection).

# Define the range of rows to select: row_nums

row_nums = range(0, 1000)

 

# Select the rows using .loc[] and row_nums and record the time before and after

loc_start_time = time.time()

rows = poker_hands.loc[row_nums]

loc_end_time = time.time()

 

# Print the time it took to select the rows using .loc

print("Time using .loc[]: {} sec".format(loc_end_time - loc_start_time))

 

# Select the rows using .iloc[] and row_nums and record the time before and after

iloc_start_time = time.time()

rows = poker_hands.iloc[row_nums]

iloc_end_time = time.time()

 

# Print the time it took to select the rows using .iloc

print("Time using .iloc[]: {} sec".format(iloc_end_time - iloc_start_time))

 

 

Question

If you need to select specific rows of a DataFrame, which function is more efficient in terms of speed?

Possible answers

.loc[]

.iloc[]

 

5)Column selection: .iloc[] vs by name

In the previous exercise, you saw how the .loc[] and .iloc[] functions can be used to locate specific rows of a DataFrame (based on the index). It turns out that the .iloc[] function performs a lot faster (~2 times) for this task!

Another important task is to find the faster function to select the targeted features (columns) of a DataFrame. In this exercise, we will compare the following:

  • using the index locator .iloc[]
  • using the names of the columns

While we can use both functions to perform the same task, we are interested in which is the most efficient in terms of speed.

In this exercise, you will continue working with the poker data which is stored in poker_hands. Take a second to examine the structure of this DataFrame by calling poker_hands.head() in the console!

  • Use the .iloc[] indexer to select the first, fourth, fifth, seventh and eighth columns ('S1', 'R2', 'S3', 'S4', 'R4') of the DataFrame poker_hands by their positional indices, and find the time it took.

# Use .iloc to select the first, fourth, fifth, seventh and eighth column and record the times before and after

iloc_start_time = time.time()

cols = poker_hands.iloc[:, [0, 3, 4, 6, 7]]

iloc_end_time = time.time()

 

# Print the time it took

print("Time using .iloc[] : {} sec".format(iloc_end_time - iloc_start_time))

 

  • Select the first, third, fourth, sixth and seventh columns ('S1', 'S2', 'R2', 'R3', 'S4') of the DataFrame poker_hands by their names and time this operation.

# Use .iloc to select the first, fourth, fifth, seventh and eighth column and record the times before and after

iloc_start_time = time.time()

cols = poker_hands.iloc[:,[0,3,4,6,7]]

iloc_end_time = time.time()

 

# Print the time it took

print("Time using .iloc[] : {} sec".format(iloc_end_time - iloc_start_time))

 

# Use simple column selection to select the first, third, fourth, sixth and seventh column by name and record the times before and after

names_start_time = time.time()

cols = poker_hands[['S1', 'S2', 'R2', 'R3', 'S4']]

names_end_time = time.time()

 

# Print the time it took

print("Time using selection by name : {} sec".format(names_end_time - names_start_time))

 

Question

If you need to select a specific column (or columns) of a DataFrame, which function is more efficient in terms of speed?

Possible answers

.iloc[]

Simple column selection

 

6)Random row selection

In this exercise, you will compare the two methods described for selecting random rows (entries) with replacement in a pandas DataFrame:

  • The built-in pandas function .sample()
  • The NumPy random integer number generator np.random.randint()

Generally, in the fields of statistics and machine learning, when we need to train an algorithm, we train it on 75% of the available data and then test its performance on the remaining 25%.

For this exercise, we will randomly sample 75% of all the played poker hands available, using each of the above methods, and check which method is more efficient in terms of speed.

  • Randomly select 75% of the rows of the poker dataset using the np.random.randint() method.

# Extract number of rows in dataset

N = poker_hands.shape[0]

 

# Select and time the selection of the 75% of the dataset's rows

rand_start_time = time.time()

poker_hands.iloc[np.random.randint(low=0, high=N, size=int(0.75 * N))]

print("Time using Numpy: {} sec".format(time.time() - rand_start_time))

 

  • Randomly select 75% of the rows of the poker dataset using the .sample() method. Make sure to specify the axis correctly!

# Extract number of rows in dataset

N=poker_hands.shape[0]

 

# Select and time the selection of the 75% of the dataset's rows

rand_start_time = time.time()

poker_hands.iloc[np.random.randint(low=0, high=N, size=int(0.75 * N))]

print("Time using Numpy: {} sec".format(time.time() - rand_start_time))

 

# Select and time the selection of the 75% of the dataset's rows using sample()

samp_start_time = time.time()

poker_hands.sample(int(0.75 * N), axis=0, replace=True)

print("Time using .sample: {} sec".format(time.time() - samp_start_time))

 

Question

Between np.random.randint() and .sample(), which one is faster when selecting random rows from a pandas DataFrame?

Possible answers

np.random.randint()

.sample()

 

7)Random column selection

In the previous exercise, we examined two ways to select random rows from a pandas DataFrame. We can use the same functions to randomly select columns in a pandas DataFrame.

To randomly select 4 columns out of the poker dataset, you will use the following two functions:

  • The built-in pandas function .sample()
  • The NumPy random integer number generator np.random.randint()
  • Randomly select 4 columns from the poker_hands dataset using np.random.randint() .

# Extract number of columns in dataset

D = poker_hands.shape[1]

 

# Select and time the selection of 4 of the dataset's columns using NumPy

np_start_time = time.time()

poker_hands.iloc[:, np.random.randint(low=0, high=D, size=4)]

print("Time using NumPy's random.randint(): {} sec".format(time.time() - np_start_time))

 

  • Randomly select 4 columns from the poker_hands dataset using the .sample() method.

# Extract number of columns in dataset

D=poker_hands.shape[1]

 

# Select and time the selection of 4 of the dataset's columns using NumPy

np_start_time = time.time()

poker_hands.iloc[:,np.random.randint(low=0, high=D, size=4)]

print("Time using NymPy's random.randint(): {} sec".format(time.time() - np_start_time))

 

# Select and time the selection of 4 of the dataset's columns using pandas

pd_start_time = time.time()

poker_hands.sample(4, axis=1)

print("Time using panda's .sample(): {} sec".format(time.time() - pd_start_time))

 

Question

Between np.random.randint() and .sample(), which one is faster when selecting random columns from a pandas DataFrame?

Possible answers

numpy.random.randint()

.sample()

 

8)Replacing scalar values I

In this exercise, we will replace a list of values in our dataset by using the .replace() method with another list of desired values.

We will apply the functions in the poker_hands DataFrame. Remember that in the poker_hands DataFrame, each row of columns R1 to R5 represents the rank of each card from a player's poker hand spanning from 1 (Ace) to 13 (King). The Class feature classifies each hand as a category, and the Explanation feature briefly explains each hand.

The poker_hands DataFrame is already loaded for you, and you can explore the features Class and Explanation.

Remember you can always explore the dataset and see how it changes in the IPython Shell.

  • Replace every hand (row) of the DataFrame listed as Class 1 (One Pair) with -2 and each hand listed as Class 2 (Two Pairs) with -3.

# Replace Class 1 to -2 

poker_hands['Class'].replace(1, -2, inplace=True)

# Replace Class 2 to -3

poker_hands['Class'].replace(2, -3, inplace=True)

 

print(poker_hands[['Class', 'Explanation']])

 

9)Replace scalar values II

As discussed in the video, in a pandas DataFrame it is possible to replace values in a very intuitive way: we locate the position (row and column) in the DataFrame and assign the new value we want. In a more pandas-friendly way, the .replace() function is available to perform the same task.

You will be using the names DataFrame which includes, among others, the most popular names in the US by year, gender and ethnicity.

Your task is to replace all the babies that are classified as 'FEMALE' with 'GIRL' using the following methods:

  • intuitive scalar replacement
  • using the .replace() function
  • Replace all the babies that are classified as 'FEMALE' with 'GIRL' as described above.

start_time = time.time()

 

# Replace all the entries that have 'FEMALE' as a gender with 'GIRL'

names.loc[names['Gender'] == 'FEMALE', 'Gender'] = 'GIRL'

 

print("Time using .loc[]: {} sec".format(time.time() - start_time))

 

  • Replace all the babies that are classified as 'FEMALE' with 'GIRL' using the .replace() function. Set inplace to True to assign the result back to the original DataFrame.

start_time = time.time()

 

# Replace all the entries that have 'FEMALE' as a gender with 'GIRL'

names['Gender'].replace('FEMALE', 'GIRL', inplace=True)

 

print("Time using .replace(): {} sec".format(time.time() - start_time))

 

Question

Which of the two methods presented in the previous exercises is the most efficient when replacing a scalar value?

Possible answers

Using .replace() was faster.

Using intuitive replacement (with the .loc[] function) was faster.

Both methods present the same performance.

 

10)Replace multiple values I

In this exercise, you will apply the .replace() function for the task of replacing multiple values with one or more values. You will again use the names dataset, which contains, among others, the most popular names in the US by year, gender and ethnicity.

This time you want to replace all ethnicities classified as black or white non-Hispanic with 'NON HISPANIC'. Remember, the ethnicities are stated in the dataset as follows: ['BLACK NON HISP', 'BLACK NON HISPANIC', 'WHITE NON HISP', 'WHITE NON HISPANIC'] and should be replaced with 'NON HISPANIC'.

  • Replace all the ethnicities that are not Hispanic in the dataset with 'NON HISPANIC' using the .loc[] indexer.

start_time = time.time()

 

# Replace all non-Hispanic ethnicities with 'NON HISPANIC' (a boolean mask with .loc avoids chained assignment)

names.loc[(names["Ethnicity"] == 'BLACK NON HISP') |

          (names["Ethnicity"] == 'BLACK NON HISPANIC') |

          (names["Ethnicity"] == 'WHITE NON HISP') |

          (names["Ethnicity"] == 'WHITE NON HISPANIC'), 'Ethnicity'] = 'NON HISPANIC'

 

print("Time using .loc[]: {} sec".format(time.time() - start_time))

 

  • Replace all the ethnicities that are not Hispanic in the dataset with 'NON HISPANIC' using the .replace() function.

start_time = time.time()

 

# Replace all non-Hispanic ethnicities with 'NON HISPANIC'

names['Ethnicity'].replace(['BLACK NON HISP', 'BLACK NON HISPANIC', 

                            'WHITE NON HISP', 'WHITE NON HISPANIC'], 

                           'NON HISPANIC', inplace=True)

 

print("Time using .replace(): {} sec".format(time.time() - start_time))

 

11)Replace multiple values II

As discussed in the video, instead of using the .replace() function multiple times to replace multiple values, you can use lists to map the elements you want to replace one to one with those you want to replace them with.

As you have seen in our popular names dataset, there are two names for the same ethnicity. We want to standardize the naming of each ethnicity by replacing

  • 'ASIAN AND PACI' to 'ASIAN AND PACIFIC ISLANDER'
  • 'BLACK NON HISP' to 'BLACK NON HISPANIC'
  • 'WHITE NON HISP' to 'WHITE NON HISPANIC'

In the DataFrame names, you are going to replace all the values on the left by the values on the right.

  • Replace all the ethnicities by their respective alternative, as indicated above.

start_time = time.time()

 

# Replace ethnicities as instructed

names['Ethnicity'].replace(

    ['ASIAN AND PACI', 'BLACK NON HISP', 'WHITE NON HISP'],

    ['ASIAN AND PACIFIC ISLANDER', 'BLACK NON HISPANIC', 'WHITE NON HISPANIC'],

    inplace=True

)

 

print("Time using .replace(): {} sec".format(time.time() - start_time))

 

12)Replace single values I

In this exercise, we will apply the technique of replacing multiple values using dictionaries to a different dataset.

We will apply the functions to the poker_hands DataFrame. Each row represents the rank of 5 cards from a playing card deck, spanning from 1 (Ace) to 13 (King) (features R1, R2, R3, R4, R5). The feature 'Class' classifies each row into a category (from 0 to 9) and the feature 'Explanation' gives a brief explanation of what each class represents.

The purpose of this exercise is to categorize the two types of flush in the game ('Royal flush' and 'Straight flush') under the 'Flush' name.

  • Replace every row of the DataFrame listed as 'Royal flush' or 'Straight flush' in the 'Explanation' column with 'Flush'.

# Replace Royal flush or Straight flush to Flush

poker_hands.replace({'Royal flush': 'Flush', 'Straight flush': 'Flush'}, inplace=True)

print(poker_hands['Explanation'].head())

 

13)Replace single values II

For this exercise, we will be using the names DataFrame. In this dataset, the column 'Rank' shows the ranking of each name by year. You will use dictionaries to replace the rank of the most popular name of every year with 'FIRST', the second name with 'SECOND' and the third name with 'THIRD'.

You will use dictionaries to replace one single value per key.

You can already see the first 5 names of the data, which correspond to the 5 most popular names for all the females belonging to the 'ASIAN AND PACIFIC ISLANDER' ethnicity in 2011.

  • Replace the ranks, indicated in numbers, by strings, following the pattern given above. Don't hesitate to explore your dataset in the Console after replacing values to see how it changed.

# Replace the number rank by a string

names['Rank'].replace({1: 'FIRST', 2: 'SECOND', 3: 'THIRD'}, inplace=True)

print(names.head())

 

  • Replace the ranks of the first three ranked names of every year with 'MEDAL'.
  • Replace the ranks of the fourth and fifth ranked names of every year with 'ALMOST MEDAL'.

# Replace the rank of the first three ranked names to 'MEDAL'

names.replace({'Rank': {1: 'MEDAL', 2: 'MEDAL', 3: 'MEDAL'}}, inplace=True)

 

# Replace the rank of the 4th and 5th ranked names to 'ALMOST MEDAL'

names.replace({'Rank': {4: 'ALMOST MEDAL', 5: 'ALMOST MEDAL'}}, inplace=True)

 

print(names.head())

 

14)Most efficient method for scalar replacement

If you want to replace a scalar value with another scalar value, which technique is the most efficient?

15)Create a generator for a pandas DataFrame

As you've seen in the video, you can easily create a generator out of a pandas DataFrame. Each time you iterate through it, it will yield two elements:

  • the index of the respective row
  • a pandas Series with all the elements of that row

You are going to create a generator over the poker dataset, imported as poker_hands. Then, you will print all the elements of the 2nd row, using the generator.

Remember you can always explore the dataset and see how it changes in the IPython Shell.

  • Assign a generator over the rows of the poker_hands dataset to the variable generator.
  • Print the 2nd element (row) yielded by the created generator.

# Create a generator over the rows

generator = poker_hands.iterrows()

 

# Access the elements of the 2nd row

first_element = next(generator)  # skip the first row

second_element = next(generator)  # get the second row

 

print(first_element, second_element)

 

16)The iterrows() function for looping

You just saw how to create a generator out of a pandas DataFrame. You will now use this generator and see how to take advantage of that method of looping through a pandas DataFrame, still using the poker_hands dataset.

Specifically, we want the sum of the ranks of all the cards, if the index of the hand is an odd number. The ranks of the cards are located in the odd columns of the DataFrame.

  • Check if the hand index is an odd number.
  • If it is, calculate the sum of the ranks of all the cards in that hand. It could take a little longer than usual to compute the results.

data_generator = poker_hands.iterrows()

 

for index, values in data_generator:

    # Check if index is odd

    if index % 2 == 1:

        # Sum the ranks of all the cards (R1-R5 sit at positional columns 1, 3, 5, 7, 9)

        hand_sum = sum([values.iloc[1], values.iloc[3], values.iloc[5], values.iloc[7], values.iloc[9]])

        print(f"Index {index}, Hand sum: {hand_sum}")

 

17).apply() function in every cell

As you saw in the lesson, you can use .apply() to map a function to every cell of the DataFrame, regardless of the column or row.

You're going to try it out on the poker_hands dataset. You will use .apply() to square every cell of the DataFrame. The native Python way to square a number n is n**2.

  • Define the lambda transformation for the square.
  • Apply the transformation using the .apply() function.

# Define the lambda transformation

get_square = lambda x: x**2

 

# Apply the transformation to every cell (.applymap() works element-wise; it was renamed to DataFrame.map() in pandas 2.1)

data_sum = poker_hands.applymap(get_square)

print(data_sum.head())

 

18).apply() for rows iteration

.apply() is a very useful way to iterate through the rows of a DataFrame and apply a specific function to each row.

You will work on a subset of the poker_hands dataset, which includes only the ranks of all five cards of each hand in each row (this subset is generated for you in the script). You're going to get the variance of every hand across all ranks, and of every rank across all hands.

  • Define a lambda function to return the variance, using the numpy package.
  • Apply the transformation for every row.

import numpy as np

 

# Define the lambda transformation

get_variance = lambda x: np.var(x)

 

# Apply the transformation

data_tr = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].apply(get_variance, axis=1)

print(data_tr.head())

 

  • Modify the script to apply the function on every rank.

import numpy as np

 

# Define the lambda transformation

get_variance = lambda x: np.var(x)

 

# Apply the transformation on every column (rank)

data_tr = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].apply(get_variance, axis=0)

print(data_tr.head())

 

19)Why is vectorization in pandas so fast?

As you probably noticed in this lesson, we achieved a massive improvement using some form of vectorization: vectorized operations run on whole arrays in optimized, pre-compiled code instead of looping over the rows in Python.

20)pandas vectorization in action

In this exercise, you will apply vectorization over pandas series to:

  • calculate the mean rank of all the cards in each hand (row)
  • calculate the mean rank of each of the 5 cards in each hand (column)

You will use the poker_hands dataset once again to compare both methods' efficiency.

  • Calculate the mean rank in each hand.
  • Calculate the mean rank of each of the 5 cards in all hands.

# Calculate the mean rank in each hand

row_start_time = time.time()

mean_r = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].mean(axis=1)

print("Time using pandas vectorization for rows: {} sec".format(time.time() - row_start_time))

print(mean_r.head())

 

# Calculate the mean rank of each of the 5 cards in all hands

col_start_time = time.time()

mean_c = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].mean(axis=0)

print("Time using pandas vectorization for columns: {} sec".format(time.time() - col_start_time))

print(mean_c.head())

 

21)Best method of vectorization

So far, you have encountered two vectorization methods:

  • Vectorization over pandas Series
  • Vectorization over Numpy ndarrays

While these two methods outperform all the other methods, when can vectorization over NumPy ndarrays be used to replace vectorization over pandas Series?

22)Vectorization methods for looping a DataFrame

Now that you're familiar with vectorization in pandas and NumPy, you're going to compare their respective performances yourself.

Your task is to calculate the variance of the ranks of all the cards in each hand using vectorization over pandas Series, and then modify your code to use vectorization over NumPy ndarrays.

  • Calculate the variance of the ranks of all the cards in each hand using vectorization with pandas.

# Calculate the variance in each hand

start_time = time.time()

poker_var = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].var(axis=1)

print("Time using pandas vectorization: {} sec".format(time.time() - start_time))

print(poker_var.head())

 

  • Calculate the variance of the ranks of all the cards in each hand using vectorization with NumPy.

# Calculate the variance in each hand

start_time = time.time()

poker_var = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].values.var(axis=1, ddof=1)  # ddof=1 matches pandas' sample variance

print("Time using NumPy vectorization: {} sec".format(time.time() - start_time))

print(poker_var[0:5])

 

23)The min-max normalization using .transform()

A very common operation is the min-max normalization. It consists of rescaling the value of interest by subtracting the minimum value and dividing the result by the difference between the maximum and the minimum value. For example, to rescale students' weight data spanning from 160 pounds to 200 pounds, you subtract 160 from each student's weight and divide the result by 40 (200 - 160).
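A small illustration of the arithmetic on made-up weights (not the restaurant data):

import pandas as pd

weights = pd.Series([160, 170, 185, 200])

min_max = (weights - weights.min()) / (weights.max() - weights.min())

print(min_max)  # 0.0, 0.25, 0.625, 1.0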

You're going to define and apply the min-max normalization to all the numerical variables in the restaurant data. You will first group the entries by the time the meal took place (Lunch or Dinner) and then apply the normalization to each group separately.

Remember you can always explore the dataset and see how it changes in the IPython Shell.

  • Define the min-max normalization using the lambda method.
  • Group the data according to the time the meal took place.
  • Apply the transformation to the grouped data.

# Define the min-max transformation

min_max_tr = lambda x: (x - x.min()) / (x.max() - x.min())

 

# Group the data according to the time

restaurant_grouped = restaurant_data.groupby('time')

 

# Apply the transformation

restaurant_min_max_group = restaurant_grouped.transform(min_max_tr)

print(restaurant_min_max_group.head())

 

24)Transforming values to probabilities

In this exercise, we will apply a probability distribution function to a pandas DataFrame with group-related parameters, transforming the tip variable into probabilities.

The transformation will be an exponential transformation. The exponential distribution is defined as

P(x) = λ * e^(-λx)

where λ (lambda) is the mean of the group that the observation x belongs to.

You're going to apply the exponential distribution transformation to the tip of each table in the dataset, after grouping the data according to the time of the day the meal took place. Remember to use each group's mean for the value of λ.

In Python, you can use the exponential as np.exp() from the NumPy library and the mean value as .mean().
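As a quick sketch on made-up tips (λ taken as the sample mean, mirroring the transformation below):

import numpy as np
import pandas as pd

tips = pd.Series([1.0, 2.0, 3.0])

lam = tips.mean()  # lambda = 2.0

print(lam * np.exp(-lam * tips))  # P(x) for each tip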

  • Define the exponential distribution transformation exp_tr.
  • Group the data according to the time the meal took place.
  • Apply the transformation to the grouped data.

# Define the exponential transformation

exp_tr = lambda x: np.exp(-x.mean()*x) * x.mean()

 

# Group the data according to the time

restaurant_grouped = restaurant_data.groupby('time')

 

# Apply the transformation

restaurant_exp_group = restaurant_grouped['tip'].transform(exp_tr)

print(restaurant_exp_group.head())

 

25)Validation of normalization

For this exercise, we will perform a z-score normalization and verify that it was performed correctly.

A distinct characteristic of normalized values is that they have a mean equal to zero and standard deviation equal to one.

After you apply the normalization transformation, you can group again on the same variable, and then check the mean and the standard deviation of each group.

You will apply the normalization transformation to every numeric variable in the poker_grouped dataset, which is the poker_hands dataset grouped by Class.

  • Apply the normalization transformation to the grouped object poker_grouped.

zscore = lambda x: (x - x.mean()) / x.std()

 

# Apply the transformation

poker_trans = poker_grouped.transform(zscore)

print(poker_trans.head())

 

  • Group poker_trans by 'Class' and print the mean and standard deviation of each group to validate that the normalization was done correctly.

zscore = lambda x: (x - x.mean()) / x.std()

 

# Apply the transformation

poker_trans = poker_grouped.transform(zscore)

 

# Re-group the grouped object and print each group's means and standard deviation

poker_regrouped = poker_trans.groupby(poker_hands['Class'])

 

print(np.round(poker_regrouped.mean(), 3))

print(poker_regrouped.std())

 

26)When to use transform()?

The .transform() function applies a function to all members of each group. Which of the following transformations would produce the same results in the whole dataset regardless of groupings?

27)Identifying missing values

The first step before missing value imputation is to identify if there are missing values in our data, and if so, from which group they arise.

For the same restaurant_data you encountered in the lesson, an employee mistakenly erased the tips left at 65 tables. The question at stake is how many missing entries came from tables where smokers were present versus tables with no smokers present.

Your task is to group both datasets according to the smoker variable, count the number of present values and then calculate the difference.

We're imputing tips to get you to practice the concepts taught in the lesson. From an ethical standpoint, you should not impute financial data in real life, as it could be considered fraud.

  • Group the data according to smoking status.
  • Calculate the number of non-missing values in each group.
  • Print the number of missing values in each group.

# Group both objects according to smoke condition

restaurant_nan_grouped = restaurant_nan.groupby('smoker')

 

# Store the number of present values

restaurant_nan_nval = restaurant_nan_grouped['tip'].count()

 

# Print the group-wise missing entries

print(restaurant_nan_grouped['total_bill'].count() - restaurant_nan_nval)

 

28)Missing value imputation

As the majority of real-world data contains missing entries, replacing these entries with sensible values can increase the insight you can get from the data.

In the restaurant dataset, the "total_bill" column has some missing entries, meaning that you have not recorded how much some tables have paid. Your task in this exercise is to replace the missing entries with the median value of the amount paid, according to whether the entry was recorded on lunch or dinner (time variable).

  • Define the lambda function that fills missing values with the median.

# Define the lambda function

missing_trans = lambda x: x.fillna(x.median())

 

  • Group the data according to the time of each entry.
  • Apply and print the pre-defined transformation to impute the missing values in the restaurant_data dataset.

# Define the lambda function

missing_trans = lambda x: x.fillna(x.median())

 

# Group the data according to time

restaurant_grouped = restaurant_data.groupby('time')

 

# Apply the transformation

restaurant_impute = restaurant_grouped.transform(missing_trans)

print(restaurant_impute.head())

 

29)When to use filtration?

When applying the filter() function on a grouped object, what can you use as a criterion for filtering?

30)Data filtration

As you noticed in the video lesson, you may need to filter your data for various reasons.

In this exercise, you will use filtering to select a specific part of our DataFrame:

  • by the number of entries recorded in each day of the week
  • by the mean amount of money the customers paid to the restaurant each day of the week
  • Create a new DataFrame containing only the days when the count of total_bill is greater than 40.

# Filter the days where the count of total_bill is greater than 40

total_bill_40 = restaurant_data.groupby('day').filter(lambda x: x['total_bill'].count() > 40)

 

# Print the number of tables where total_bill is greater than 40

print('Number of tables where total_bill is greater than $40:', total_bill_40.shape[0])

 

  • From the total_bill_40 DataFrame, select only the entries that have a mean total_bill greater than $20, grouped by day.

# Filter the days where the count of total_bill is greater than 40

total_bill_40 = restaurant_data.groupby('day').filter(lambda x: x['total_bill'].count() > 40)

 

# Select only the entries that have a mean total_bill greater than $20

total_bill_20 = total_bill_40.groupby('day').filter(lambda x: x['total_bill'].mean() > 20)

 

# Print days of the week that have a mean total_bill greater than $20

print('Days of the week that have a mean total_bill greater than $20:', total_bill_20.day.unique())

 

Question

  • After applying the .filter() operation in Step 2 in the Console, how many entries (rows) does the last DataFrame you created (total_bill_20) have?

Possible answers

183

78

163

 
