IS5213 Data Science and Big Data Solutions

WEEK 2

code

install.packages("dplyr")

library(dplyr)

Rajeshdf = read.csv('c:\\Insurance.csv')

str(Rajeshdf)

summary(Rajeshdf)

# Count the records in each JOB category
agg_tbl <- Rajeshdf %>%
  group_by(JOB) %>%
  summarise(total_count = n(), .groups = 'drop')

agg_tbl

# Median HOME_VAL for each CAR_TYPE, ignoring missing values
a = aggregate(x = Rajeshdf$HOME_VAL, by = list(CAR_TYPE = Rajeshdf$CAR_TYPE), FUN = median, na.rm = TRUE)

a

QUIZ

What famous literary detective solved a crime because a dog did not bark at the criminal?

A). Sherlock Holmes

1. In the Insurance data set, how many Lawyers are there?

A). 1031

3. What two prefixes does the instructor use for variables when fixing the missing values? Select all that apply.

A). IMP_

4. What is the median Home Value of a person who drives a Van?

A). 204139

5. In the insurance data set, how many missing (NA) values does the variable AGE have?

A) 7
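
The NA-count questions in this quiz can be answered with `is.na()`. A minimal sketch follows; it uses the built-in `airquality` data set as a stand-in, since the course's Insurance.csv is not reproduced here.

```r
# Count missing values in one column of the built-in airquality data
ozone_na <- sum(is.na(airquality$Ozone))
print(ozone_na)

# Count missing values in every column at once
na_counts <- colSums(is.na(airquality))
print(na_counts)
```

With the insurance data loaded as `Rajeshdf` (as in the code at the top of this post), the same pattern would be `sum(is.na(Rajeshdf$AGE))`.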

1. What is the process called where missing data is fixed?

a). Imputing
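
As a rough illustration of imputing, the sketch below fills missing values with the median of the observed ones. The toy data, the choice of the median, and the flag column name `AGE_WAS_MISSING` are assumptions made for this example; only the `IMP_` prefix comes from the quiz. Dropping the original column at the end follows the quiz item that says the unfixed variable should be removed.

```r
# Toy data with two missing ages (made-up values)
df <- data.frame(AGE = c(34, NA, 51, 45, NA, 29))

# Flag which rows were missing before filling them in (hypothetical column name)
df$AGE_WAS_MISSING <- is.na(df$AGE) + 0

# Copy the variable, then replace the NAs with the median of the observed values
df$IMP_AGE <- df$AGE
df$IMP_AGE[is.na(df$IMP_AGE)] <- median(df$AGE, na.rm = TRUE)

# Drop the original variable now that the fixed copy exists
df$AGE <- NULL

print(df)
```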

2. According to the instructor, approximately what percentage of the analytic time is spent on data preparation?

a). 90%

3. In the Insurance data set, how many Blue Collar workers are there?

a). 2288

4. What is the median Home Value of a person who drives a Panel Truck?

A). 220541

5. In the insurance data set, how many missing (NA) values does the variable KIDSDRIV have?

A). 0

In the Insurance data set, how many Doctors are there?

A). 321

In the insurance data set, how many missing (NA) values does the variable CAR_AGE have?

A). 639

What is the median Home Value of a person who drives a Pickup?

A). 151061

In the insurance data set, how many missing (NA) values does the variable AGE have?

A). 7

What is the process called that converts categorical variables into flag variables?

A). One Hot Encoding
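
A small sketch of one hot encoding in base R: one 1/0 flag column is created per category level. The toy values and the `FLAG_` column prefix are assumptions for illustration, not names taken from the course.

```r
# Toy categorical variable (made-up values)
df <- data.frame(CAR_TYPE = c("Van", "SUV", "Pickup", "SUV"))

# Build one 1/0 flag column per category level
for (level in unique(df$CAR_TYPE)) {
  df[[paste0("FLAG_", level)]] <- (df$CAR_TYPE == level) + 0
}

print(df)
```

Each row ends up with exactly one flag set to 1, which is where the name "one hot" comes from.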

In the insurance data set, how many missing (NA) values does the variable KIDSDRIV have?

A). 0

In the R programming language, what is one method for converting a TRUE/FALSE variable into a 1/0 variable?

A). Add the number zero (0) to the TRUE/FALSE variable.
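
The "add zero" trick works because R coerces logical values to numbers in arithmetic. A short sketch, with `is_van` as a made-up example vector:

```r
is_van <- c(TRUE, FALSE, TRUE)

# Adding zero forces the logical vector to numeric 1/0
flag_num <- is_van + 0
print(flag_num)

# as.integer() is an alternative that gives the same 1/0 values as integers
flag_int <- as.integer(is_van)
print(flag_int)
```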

What is the median Home Value of a person who drives an SUV?

A). 140927

According to the instructor, after a variable with missing values is "fixed", it is a good idea to remove the variable from the data set.

A). True

What is the median Home Value of a person who drives a Minivan?

A). 172269

In the insurance data set, how many missing (NA) values does the variable YOJ have?

A). 548

In the Insurance data set, how many Home Makers are there?

a). 843

In the Insurance data set, how many Clerical workers are there?

a). 1590

In the insurance data set, how many missing (NA) values does the variable CAR_AGE have?

A). 639

WEEK 5 QUIZ

1. Random Forests and the Gradient Boosting models will usually be more accurate than Decision Tree models.

A. True

2. Which of these modelling techniques is not adversely affected by outliers?

a. All of these

3. Gradient Boosting models are easy to interpret.

A. False

4. Which of these modelling techniques trains many trees, with each tree built on a random subset of variables?

A. Random Forests

5. Which of these modelling techniques tends to use many small trees?

A. Gradient Boosting
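
To make "many small trees" concrete, here is a minimal sketch of boosting for a regression problem: each round fits a one-split tree (a stump) to the current residuals and adds a shrunken copy of its prediction. This is a simplified illustration, not the course's implementation; the `cars` data, the 50 rounds, and the 0.1 learning rate are all assumptions, and the `rpart` package is assumed to be installed (it ships with standard R distributions).

```r
library(rpart)

# Start from a constant prediction: the mean stopping distance
pred <- rep(mean(cars$dist), nrow(cars))
baseline_mse <- mean((cars$dist - pred)^2)

learning_rate <- 0.1
n_rounds <- 50

for (i in seq_len(n_rounds)) {
  # Fit a stump (maxdepth = 1) to what the current model still gets wrong
  d <- data.frame(speed = cars$speed, resid = cars$dist - pred)
  stump <- rpart(resid ~ speed, data = d, control = rpart.control(maxdepth = 1))

  # Add a small fraction of the stump's prediction to the running model
  pred <- pred + learning_rate * predict(stump, newdata = d)
}

boosted_mse <- mean((cars$dist - pred)^2)
print(c(baseline = baseline_mse, boosted = boosted_mse))
```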

6. Which of these modelling techniques is usually the easiest to interpret?

A. Decision Trees

7. Random Forests are easy to interpret.

A. False

8. In the United States, it is probably against the law to use a Gradient Boosting model for Marketing models.

A. False

9. Gradient Boosting models are based on Decision Trees.

A. True

10. Which of these modelling techniques is usually the fastest to train?

A. Decision Trees

11. Random Forests and the Gradient Boosting models will always be more accurate than Decision Tree models.

A. False

12. A Random Forest is more sensitive to a small input change than a Decision Tree.

A. False

13. Which of these modelling techniques trains many trees, with each tree built on a random subset of records?

A. Random Forests
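
The idea of training many trees on random subsets of records and variables can be sketched by hand. The code below is a simplified illustration, not the `randomForest` package's algorithm (which samples variables at each split rather than once per tree). The `iris` data, 25 trees, and 2 variables per tree are assumptions, and `rpart` is assumed to be installed.

```r
library(rpart)

set.seed(42)
predictors <- setdiff(names(iris), "Species")
n_trees <- 25

# Train each tree on a bootstrap sample of records and a random subset of variables
trees <- lapply(seq_len(n_trees), function(i) {
  rows <- sample(nrow(iris), replace = TRUE)
  vars <- sample(predictors, 2)
  rpart(reformulate(vars, response = "Species"),
        data = iris[rows, ], method = "class")
})

# Every tree votes on every record; the most common class wins
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, newdata = iris, type = "class"))
})
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

accuracy <- mean(ensemble_pred == iris$Species)
print(accuracy)
```

Because the final answer is a vote across 25 trees, no single tree can be read off as the model, which is why the ensemble is harder to interpret than one Decision Tree.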

14. Random Forests are based on Decision Trees.

A. True

15. A Gradient Boosting model is less sensitive to a small input change than a Decision Tree.

A. True

16. In the United States, it may be against the law to use a Gradient Boosting model for Credit or Auto Insurance models.

A. True

17. Which of these modelling techniques alters the data in order to oversample records that it incorrectly classified?

A. Gradient Boosting

18. Which of these modelling techniques is usually the easiest to convert into IF-THEN-ELSE rules?

A. Decision Trees
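
A single tree prints as a set of nested split conditions, which is why it converts so directly into IF-THEN-ELSE rules. A minimal sketch, assuming the `rpart` package is installed and using the built-in `iris` data as an example:

```r
library(rpart)

# Fit one classification tree on two petal measurements
fit <- rpart(Species ~ Petal.Length + Petal.Width, data = iris, method = "class")

# Each indented line of the printout is a split condition, i.e. one branch of an IF-THEN-ELSE rule
print(fit)
```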

19. In the United States, it may be against the law to use a Random Forest for Credit or Auto Insurance models.

A. True

20. In the United States, it is probably against the law to use a Random Forest for Marketing models.

A. False

1. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor local aspects of the data. High numbers will tend to favor global data.

A. True

2. Principal Components are always Orthogonal to one another.

A. True

3. When doing tSNE analysis, setting the Perplexity to a high number will tend to have less well defined groupings.

A. False

4. When doing tSNE analysis, setting the Perplexity to a high number will tend to have more well defined groupings.

A. False

5. In PCA analysis, the vectors represent a LINEAR relationship in the data.

A. True

6. Assume that you have 3 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 3

7. Principal Components are always Independent to one another.

A. True

8. In the R programming language, the "prcomp" function allows for scoring data using the "predict" command.

A. True

9. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 8

10. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor global aspects of the data. High numbers will tend to favor local data.

A. False

11. Assume that you have 3 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 3

12. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor global aspects of the data. High numbers will tend to favor local data.

A. False

13. When doing tSNE analysis, setting the Perplexity to a high number will tend to have more well defined groupings.

A. False

14. Assume that you have 2 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 2

15. tSNE vectors are always Orthogonal to one another.

A. False

16. tSNE vectors are always Orthogonal to one another.

A. False

17. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 8

18. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a tSNE Analysis using Rtsne?

A. 2 or 3

19. In PCA analysis, the vectors represent a NON LINEAR relationship in the data.

A. False

20. Assume that an input data set has four variables: A, B, C, D and they are used to create four Principal Components: PC1, PC2, PC3, and PC4. If A, B, C, D are all highly correlated, then what do you know about the correlation of PC1, PC2, PC3, and PC4?

A. PC1, PC2, PC3, and PC4 are completely uncorrelated from one another.

21. In tSNE analysis, the vectors represent a LINEAR relationship in the data.

A. False

22. Given the following Scree Plot, how many Principal Components should be used?

A. 2 or possibly 3 Principal Components

23. In the R programming language, the "Rtsne" function allows for scoring data using the "predict" command.

A. False

To answer this question, please refer to the CRAN Packages web page referred to in the course material.

Which of these packages are used for Optical Character Recognition?

A. abbyyR

Using the iris data set in R, generate a box plot by Species of the variable Petal Length.

Using the iris data set in R, generate a box plot by Species of the variable Petal Width.

Using the iris data set in R, generate a box plot by Species of the variable Sepal Width.

Using the iris data set in R, generate a box plot by Species of the variable Sepal Length.
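
One way to produce the four box plots asked for above is the formula interface of `boxplot()`. This is a sketch rather than the course's reference solution; the plot titles and the loop are choices made for this example.

```r
# The four iris measurements named in the exercises above
measures <- c("Petal.Length", "Petal.Width", "Sepal.Width", "Sepal.Length")

# Draw one box plot per measurement, grouped by Species
plots <- lapply(measures, function(m) {
  boxplot(reformulate("Species", response = m), data = iris,
          main = paste(m, "by Species"), xlab = "Species", ylab = m)
})
```

A single plot can also be written directly, for example `boxplot(Petal.Length ~ Species, data = iris)`.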

1. What are the two commands that will return the first and last six rows of a Data Frame?

A. head, tail
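
A quick sketch of both commands on the built-in `iris` data; each returns six rows by default.

```r
# First six rows of the data frame
first_rows <- head(iris)
print(first_rows)

# Last six rows of the data frame
last_rows <- tail(iris)
print(last_rows)
```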

2. The R programming language has data sets that are pre-loaded. One of these data sets is the "iris" data set. What command will give you information about this data set?

A. iris

3. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=1 ?

A. 88.0

4. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=2 ?

A. 104.5

5. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=3 ?

A. 125.5

6. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=4 ?

A. 129.5
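
The four ChickWeight questions can be answered in one call. A minimal sketch using the formula interface of `aggregate()`, which returns one row per Diet to compare against the answers above:

```r
# Median chick weight for each Diet in the built-in ChickWeight data
medians <- aggregate(weight ~ Diet, data = ChickWeight, FUN = median)
print(medians)
```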

7. To answer this question, please refer to the CRAN Packages web page referred to in the course material.

Which of these packages are used for Reliability and Scoring Routines?

A. ATtools

8. How many records are in the predefined data set named "trees"

A. 31

9. There is no guarantee that an R Package included in CRAN will be maintained and "up to date".

A. True

10. Which of these packages are used for Combining Multidimensional Arrays?

A. abind

11. How many records are in the predefined data set named "cars"

A. 50
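
The record-count questions for the predefined data sets can be checked with `nrow()`:

```r
# Number of records (rows) in two built-in data sets
n_trees_rows <- nrow(trees)
n_cars_rows <- nrow(cars)

print(c(trees = n_trees_rows, cars = n_cars_rows))
```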

12. Which of these packages are used for Bayesian approximation?

A. abc

13. If an R Package is included in CRAN it is guaranteed to be regularly updated, and will always be "up to date".

A. False

WEEK 7 QUIZ

1. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor local aspects of the data. High numbers will tend to favor global data.

A. True

2. In the R programming language, the "Rtsne" function allows for scoring data using the "predict" command.

A. False

3. Assume that an input data set has four variables: A,B,C,D and they are used to create four Principal Components: PC1, PC2, PC3, and PC4. If A,B,C,D are all highly correlated, then what do you know about the correlation of PC1, PC2, PC3, and PC4?

A. PC1, PC2, PC3, and PC4 are completely uncorrelated from one another.
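
Several of the PCA answers in this quiz can be checked numerically. The sketch below runs `prcomp` on the four numeric columns of the built-in `iris` data (an example data set chosen here, not one from the course) and inspects the number of components, the correlation of the component scores, and the orthogonality of the loading vectors.

```r
# PCA on four correlated measurements
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# One principal component is created per input variable
n_components <- ncol(pca$rotation)
print(n_components)

# The component scores are uncorrelated: off-diagonal entries are zero up to rounding
score_cor <- cor(pca$x)
print(round(score_cor, 10))

# The loading vectors are orthogonal unit vectors: their cross-product is the identity matrix
loading_dot <- crossprod(pca$rotation)
print(round(loading_dot, 10))
```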

4. tSNE vectors are always Independent to one another.

A. False

5. Assume that you have 3 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 3

6. Principal Components are always Orthogonal to one another

A. True

7. tSNE vectors are always Orthogonal to one another.

A. False

8. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor global aspects of the data. High numbers will tend to favor local data.

A. False

9. Assume that you have 2 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 2

10. Principal Components are always Orthogonal to one another.

A. True

11. In the R programming language, the "Rtsne" function allows for scoring data using the "predict" command.

A. False

12. In tSNE analysis, the vectors represent a LINEAR relationship in the data.

A. False

13. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a tSNE Analysis using Rtsne?

A. 2 or 3
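
With `Rtsne`, the number of output dimensions is set by the `dims` argument rather than by the number of input variables. A hedged sketch, assuming the `Rtsne` package is installed (the code does nothing if it is not); the `iris` data and a perplexity of 30 are example choices.

```r
embedding <- NULL

if (requireNamespace("Rtsne", quietly = TRUE)) {
  # Rtsne rejects duplicate rows by default, so keep only the unique ones
  x <- unique(as.matrix(iris[, 1:4]))

  set.seed(1)
  fit <- Rtsne::Rtsne(x, dims = 2, perplexity = 30)

  # fit$Y holds the embedding: one row per record, one column per requested dimension
  embedding <- fit$Y
  print(dim(embedding))
}
```

Changing `perplexity` and re-running is the usual way to see how low and high values shift the emphasis between local and global structure.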

14. In tSNE analysis, the vectors represent a NON LINEAR relationship in the data.

A. True

15. When doing tSNE analysis, setting the Perplexity to a high number will tend to have less well defined groupings.

A. False

16. In the R programming language, the "prcomp" function allows for scoring data using the "predict" command.

A. True
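
Scoring new data with a fitted `prcomp` model looks like this. The split of `iris` into the first 100 rows for fitting and the last 50 rows for scoring is an arbitrary choice made for illustration.

```r
# Fit the PCA on one set of records
train <- iris[1:100, 1:4]
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Score records the model has not seen, using the stored centering, scaling, and loadings
new_records <- iris[101:150, 1:4]
new_scores <- predict(pca, newdata = new_records)

print(head(new_scores))
```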

17. In PCA analysis, the vectors represent a NON LINEAR relationship in the data.

A. False

18. In PCA analysis, the vectors represent a LINEAR relationship in the data.

A. True

20. When doing tSNE analysis, setting the Perplexity to a high number will tend to have more well defined groupings.

A. True

21. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 8

22. Given the following Scree Plot, how many Principal Components should be used?

A. 1 or possibly 2 Principal Components
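
The scree plot from the quiz is not reproduced in this post. As a sketch of how such a plot is made and read, the code below draws one for the built-in `iris` data (an example data set, so the answer it suggests is unrelated to the quiz's plot) and computes the share of variance each component explains.

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Share of total variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scree plot: look for the "elbow" where the curve flattens out
screeplot(pca, type = "lines", main = "Scree plot (iris example)")
```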

23. Principal Components are always Independent to one another.

A. True

US-India Student Support Services
