
IS5213 Data Science and Big Data Solutions


WEEK-2


code

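# Install dplyr once, then load it
install.packages("dplyr")

library(dplyr)

# Read the Insurance data set (adjust the path to where your copy is stored)
Rajeshdf <- read.csv('c:\\Insurance.csv')

# Inspect the structure and summary statistics; summary() also reports NA counts for numeric columns
str(Rajeshdf)

summary(Rajeshdf)

# Count the records in each JOB category
agg_tbl <- Rajeshdf %>% group_by(JOB) %>%
  summarise(total_count = n(),
            .groups = 'drop')

agg_tbl

# Median HOME_VAL for each CAR_TYPE, ignoring missing values
a <- aggregate(x = Rajeshdf$HOME_VAL, by = list(Rajeshdf$CAR_TYPE), FUN = median, na.rm = TRUE)

a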


QUIZ


1. In the Insurance data set, how many Lawyers are there?

A). 1031

2. What famous literary detective solved a crime because a dog did not bark at the criminal?

A). Sherlock Holmes

3. What two prefixes does the instructor use for variables when fixing the missing values? Select all that apply.

A). IMP_ and M_

4. What is the median Home Value of a person who drives a Van?

A). 204139

5. In the insurance data set, how many missing (NA) values does the variable AGE have?

A) 7
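One way to obtain NA counts like the one above is to sum `is.na()` per column. This is a minimal sketch on a small made-up data frame; the name `toy` and its values are hypothetical and are not taken from Insurance.csv.

```r
# Hypothetical stand-in for the Insurance data; the values are made up
toy <- data.frame(
  AGE      = c(45, NA, 38, NA, 52),
  KIDSDRIV = c(0, 1, 0, 0, 2)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
na_counts

# A single column can be checked the same way
sum(is.na(toy$AGE))
```

Applied to the real data frame, the same call would be `colSums(is.na(Rajeshdf))`.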

 

1. What is the process called where missing data is fixed?


a). Imputing
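The imputing step might look like the sketch below. It assumes that `M_` marks a missing-value flag and `IMP_` holds the filled-in copy, and it uses the median as the fill value. The quiz only names the two prefixes, so these roles, the median choice, and the toy data are all assumptions.

```r
# Hypothetical data; not taken from Insurance.csv
toy <- data.frame(AGE = c(45, NA, 38, NA, 52))

# M_AGE: 1 where the original value was missing, 0 otherwise (assumed meaning of M_)
toy$M_AGE <- is.na(toy$AGE) + 0

# IMP_AGE: copy of AGE with missing values replaced by the median of the observed values
toy$IMP_AGE <- ifelse(is.na(toy$AGE), median(toy$AGE, na.rm = TRUE), toy$AGE)

# Once the fixed copy exists, the original column can be dropped
toy$AGE <- NULL
toy
```

Keeping the flag column preserves the information that a value was originally missing, which the filled-in column alone would hide.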

 

2. According to the instructor, approximately what percentage of the analytic time is spent on data preparation?


a). 90%

 

3. In the Insurance data set, how many Blue Collar workers are there?

 


a). 2288

 

 

4. What is the median Home Value of a person who drives a Panel Truck?

 


A). 220541

 

 

5. In the insurance data set, how many missing (NA) values does the variable KIDSDRIV have?

 


A). 0

 

In the Insurance data set, how many Doctors are there?

 

A). 321

In the insurance data set, how many missing (NA) values does the variable CAR_AGE have?

A). 639

 

What is the median Home Value of a person who drives a Pickup?

 

A). 151061

In the insurance data set, how many missing (NA) values does the variable AGE have?

 

A). 7

What is the process called that converts categorical variables into flag variables?

 

A). One Hot Encoding
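As an illustration of One Hot Encoding in base R, the sketch below turns a categorical column into one 0/1 flag column per level. The data frame and the column name `FLAG_SUV` are hypothetical, and `model.matrix()` is only one of several ways to do this.

```r
# Hypothetical data; not taken from Insurance.csv
toy <- data.frame(CAR_TYPE = c("Van", "SUV", "Pickup", "SUV"))

# model.matrix() expands a categorical variable into 0/1 columns;
# "- 1" drops the intercept so that every level gets its own flag column
flags <- model.matrix(~ CAR_TYPE - 1, data = toy)
flags

# The same idea by hand for a single level
toy$FLAG_SUV <- (toy$CAR_TYPE == "SUV") + 0
```

Each row of `flags` contains exactly one 1, in the column for that row's category.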

 

In the insurance data set, how many missing (NA) values does the variable KIDSDRIV have?

A). 0

 

In the R programming language, what is one method for converting a TRUE/FALSE variable into a 1/0 variable?

 

A). Add the number zero (0) to the TRUE/FALSE variable.
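A short sketch of that conversion, using a made-up logical vector named `flag`:

```r
flag <- c(TRUE, FALSE, TRUE)

# Arithmetic coerces logical values to numbers: TRUE becomes 1, FALSE becomes 0
flag + 0

# as.integer() is an equivalent, more explicit alternative
as.integer(flag)
```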

 

 

 

 

What is the median Home Value of a person who drives an SUV?

A). 140927

 

According to the instructor, after a variable with missing values is "fixed", it is a good idea to remove the variable from the data set.

 

A). True

 

What is the median Home Value of a person who drives a Minivan?

 

A). 172269

 

In the insurance data set, how many missing (NA) values does the variable YOJ have?

 

A). 548

 

 

 

In the Insurance data set, how many Home Makers are there?

 

a). 843

In the Insurance data set, how many Clerical workers are there?

 

a). 1590

 

In the insurance data set, how many missing (NA) values does the variable CAR_AGE have?

A).  639


WEEK-5 QUIZ

1. Random Forests and the Gradient Boosting models will usually be more accurate than Decision Tree models.

A. True

2. Which of these modelling techniques is not adversely affected by outliers?

A. All of these

3. Gradient Boosting models are easy to interpret.

A. False

4. Which of these modelling techniques trains many trees, with each tree built on a random subset of variables?

A. Random Forests

5. Which of these modelling techniques tends to use many small trees?

A. Gradient Boosting

6. Which of these modelling techniques is usually the easiest to interpret?

A. Decision Trees

7. Random Forests are easy to interpret.

A. False

8. In the United States, it is probably against the law to use a Gradient Boosting model for Marketing models.

A. False

9. Gradient Boosting models are based on Decision Trees.

A. True

10. Which of these modelling techniques is usually the fastest to train?

A. Decision Trees

 

11. Random Forests and the Gradient Boosting models will always be more accurate than Decision Tree models.

A. False

 

12. A Random Forest is more sensitive to a small input change than a Decision Tree

A. False


13. Which of these modelling techniques trains many trees, with each tree built on a random subset of records?

A. Random Forests

 

14. Random Forests are based on Decision Trees.

A. True

 

15. A Gradient Boosting model is less sensitive to a small input change than a Decision Tree.

A. True

 

16. In the United States, it may be against the law to use a Gradient Boosting model for Credit or Auto Insurance models.

A. True

 

17. Which of these modelling techniques alters the data in order to oversample records that it incorrectly classified?

A. Gradient Boosting

 

18. Which of these modelling techniques is usually the easiest to convert into IF-THEN-ELSE rules?

A. Decision Trees

 

19. In the United States, it may be against the law to use a Random Forest for Credit or Auto Insurance models.

A. True

 

20. In the United States, it is probably against the law to use a Random Forest for Marketing models.

A. False
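To illustrate why a single Decision Tree is the easiest of the three techniques to read as IF-THEN-ELSE rules, here is a minimal sketch using the `rpart` package (one of the recommended packages bundled with standard R installations) on the built-in `iris` data. The choice of package and data set is an assumption made for illustration; the quiz does not name either.

```r
library(rpart)

# Fit a small classification tree
fit <- rpart(Species ~ ., data = iris, method = "class")

# Printing the fit lists each split as a readable condition
print(fit)

# Predicted classes for the training rows, and the share predicted correctly
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)
```

A Random Forest or Gradient Boosting model combines many such trees, so no single printed rule set describes its predictions.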

 

 

 

 

1. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor local aspects of the data. High numbers will tend to favor global data.

A. True

2. Principal Components are always Orthogonal to one another.

A. True

3. When doing tSNE analysis, setting the Perplexity to a high number will tend to have less well defined groupings.

A. False

4. When doing tSNE analysis, setting the Perplexity to a high number will tend to have more well defined groupings.

A. False

5. In PCA analysis, the vectors represent a LINEAR relationship in the data.

A. True

6. Assume that you have 3 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 3

7. Principal Components are always Independent to one another.

A. True

8. In the R programming language, the "prcomp" function allows for scoring data using the "predict" command.

A. True

9. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 8

10. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor global aspects of the data. High numbers will tend to favor local data.

A. False

11. Assume that you have 3 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 3

12. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor global aspects of the data. High numbers will tend to favor local data.

A. False

13. When doing tSNE analysis, setting the Perplexity to a high number will tend to have more well defined groupings.

A. False

14. Assume that you have 2 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 2

15. tSNE vectors are always Orthogonal to one another.

A. False

16. tSNE vectors are always Orthogonal to one another.

A. False

17. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 8

18. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a tSNE Analysis using Rtsne?

A. 2 or 3

19. In PCA analysis, the vectors represent a NON LINEAR relationship in the data.

A. False

20. Assume that an input data set has four variables: A,B,C,D and they are used to create four Principal Components: PC1, PC2, PC3, and PC4. If A,B,C,D are all highly correlated, then what do you know about the correlation of PC1, PC2, PC3, and PC4?

A. PC1, PC2, PC3, and PC4 are completely uncorrelated from one another.

21. In tSNE analysis, the vectors represent a LINEAR relationship in the data.

A. False

22. Given the following Scree Plot, how many Principal Components should be used?

A. 2 or possibly 3 Principal Components

23. In the R programming language, the "Rtsne" function allows for scoring data using the "predict" command.

A. False

 

To answer this question, please refer to the CRAN Packages web page referred to in the course material.

Which of these packages are used for Optical Character Recognition?

A. abbyyR




Using the iris data set in R, generate a box plot by Species of the variable Petal Length.



Using the iris data set in R, generate a box plot by Species of the variable Petal Width.


Using the iris data set in R, generate a box plot by Species of the variable Sepal Width.




Using the iris data set in R, generate a box plot by Species of the variable Sepal Length.
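One way to produce these plots is the formula interface of base R's `boxplot()`. The sketch below draws Petal Length; the titles and axis labels are illustrative choices.

```r
# Box plot of one iris measurement by Species; swap the left-hand side of the
# formula for Petal.Width, Sepal.Width, or Sepal.Length to get the other plots
bp <- boxplot(Petal.Length ~ Species, data = iris,
              main = "Petal Length by Species",
              xlab = "Species", ylab = "Petal Length")
```

`boxplot()` also returns the plotted statistics invisibly, so `bp$stats` holds the whisker, hinge, and median values for each species.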




1. What are the two commands that will return the first and last six rows of a Data Frame?

A. head, tail
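A quick sketch of both commands on the built-in `iris` data (the variable name `first3` is just for illustration):

```r
# head() and tail() return the first and last six rows by default
head(iris)
tail(iris)

# The n argument changes how many rows are returned
first3 <- head(iris, n = 3)
```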

 

2. The R programming language has data sets that are pre-loaded. One of these data sets is the "iris" data set. What command will give you information about this data set?

A. iris

 

3.  In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=1 ?

A. 88.0

4. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=2 ?

A.104.5

5. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=3 ?

A. 125.5

6. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=4 ?

A. 129.5
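The median by Diet can be computed with `aggregate()` and a formula. This is a minimal sketch; `med_by_diet` is an illustrative name.

```r
# Median weight for each Diet level in the built-in ChickWeight data
med_by_diet <- aggregate(weight ~ Diet, data = ChickWeight, FUN = median)
med_by_diet
```

The result has one row per Diet, which can be compared against the answers above.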

 

7. To answer this question, please refer to the CRAN Packages web page referred to in the course material.

Which of these packages are used for Reliability and Scoring Routines?

A.  ATtools

 

8. How many records are in the predefined data set named "trees"

A.  31

 

9. There is no guarantee that an R Package included in CRAN will be maintained and "up to date". 

A. True

 

10. Which of these packages are used for Combining Multidimensional Arrays?

A.  abind

 

11. How many records are in the predefined data set named "cars"

  A.  50 
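Record counts for the built-in data sets can be checked with `nrow()`, for example:

```r
# nrow() returns the number of records (rows) in a data frame
nrow(trees)
nrow(cars)

# dim() gives rows and columns together
dim(trees)
```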

 

12. Which of these packages are used for Bayesian approximation?



A.  abc

13. If an R Package is included in CRAN it is guaranteed to be regularly updated, and will always be "up to date". 


A. False

R

WEEK-7 QUIZ


1. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor local aspects of the data. High numbers will tend to favor global data.

A. True

2. In the R programming language, the "Rtsne" function allows for scoring data using the "predict" command.

A. False

3. Assume that an input data set has four variables: A,B,C,D and they are used to create four Principal Components: PC1, PC2, PC3, and PC4. If A,B,C,D are all highly correlated, then what do you know about the correlation of PC1, PC2, PC3, and PC4?

A. PC1, PC2, PC3, and PC4 are completely uncorrelated from one another.

4. tSNE vectors are always Independent to one another.

A. False

5. Assume that you have 3 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 3

6. Principal Components are always Orthogonal to one another

A. True

7. tSNE vectors are always Orthogonal to one another.

A. False

8. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor global aspects of the data. High numbers will tend to favor local data.

A. False

9. Assume that you have 2 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

A. 2

 

10. Principal Components are always Orthogonal to one another.

A. True

 

11. In the R programming language, the "Rtsne" function allows for scoring data using the "predict" command. 

False

 

12. In tSNE analysis, the vectors represent a LINEAR relationship in the data.

A. False


13. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a tSNE Analysis using Rtsne?

A. 2 or 3

 

14. In tSNE analysis, the vectors represent a NON LINEAR relationship in the data.

True

 

15. When doing tSNE analysis, setting the Perplexity to a high number will tend to have less well defined groupings.

False

 

 

16. In the R programming language, the "prcomp" function allows for scoring data using the "predict" command. 

True

 

17. In PCA analysis, the vectors represent a NON LINEAR relationship in the data.

False

 

18. In PCA analysis, the vectors represent a LINEAR relationship in the data.

A. True

 

20. When doing tSNE analysis, setting the Perplexity to a high number will tend to have more well defined groupings.

A. True

 

21. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

8

22. Given the following Scree Plot, how many Principal Components should be used?

1 or possibly 2 Principal Components

23. Principal Components are always Independent to one another.

True
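Several of the PCA answers above can be checked directly with `prcomp()`. The sketch below uses the four numeric columns of the built-in `iris` data, which is an illustrative choice and not the data set from the course.

```r
# PCA on the four numeric iris columns, centered and scaled
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# One Principal Component is created per input variable
n_pc <- ncol(pca$rotation)

# Loadings are orthogonal: t(R) %*% R is the identity matrix (up to rounding)
ortho <- crossprod(pca$rotation)

# Component scores are uncorrelated, even though the inputs are correlated
score_cor <- cor(pca$x)

# predict() scores new rows with the fitted centering, scaling, and rotation
new_scores <- predict(pca, newdata = iris[1:5, 1:4])
```

Here `n_pc` is 4 because there are four input variables, and scoring the first five rows with `predict()` reproduces the first five rows of `pca$x`.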

 


 

 

 


