IS-5113 Data Mining and Data Visualization [Jan 25 Syllabus]

 

Lesson 1 Quiz

One of the key differences between business analytics and data science is their primary focus either on business problems or on mathematical algorithms.

True

Analytics and analysis are essentially the same thing; they both focus on the granular level representation of complex problems through decomposition of the whole into its lower-level parts.

False

If a data scientist is analyzing historical data to identify problems and root causes, he/she is essentially conducting descriptive analytics.

True

ERP stands for enterprise resource planning and is used for the integration of company-wide data.

True

The most important driver behind business analytics popularity is the need for business managers to make experience and intuition driven business decisions.

False

Business analytics and data science have the same purpose: to convert data into actionable insight through an algorithm-based discovery process.

True

Major commercial business intelligence products and services were established in the early 1970s.

False

If I am distributing funds to different financial products to maximize return, I am essentially doing descriptive analytics.

False

Today, analytics can be defined simply as "the discovery of information/knowledge/insight in data."

True

Business intelligence is a broad concept that also includes business analytics within its simple taxonomy.

False

Analytics is the art and science of discovering insight to support accurate and timely decision making.

True

Business analytics is the process of developing computer code and novel IT frameworks.

False

Organizations apply analytics to business problems to identify problems, foresee future trends, and make the best possible decisions.

True

DeepQA is a massively parallel, web mining focused, probabilistic computational algorithm developed by the SAS Institute.

False

Descriptive analytics, also called business intelligence, is the entry level in the analytics taxonomy.

True


Lesson 1 Post Assessment

What are the main roadblocks to the adoption of analytics?

All of these

Jim, the marketing manager in the company, is interested in the sales numbers in the south region by each product type for the last six months. What type of analytics would you use to help him?

Descriptive

Which of the following developments is not contributing to facilitating the growth of decision support and analytics?

Locally concentrated workforces

What type of analytics seeks to identify the courses of action to achieve the best performance possible?

Prescriptive

If Jack is interested in identifying the optimal quantity of purchase orders in order to minimize the overall cost, which of the following types of analytics should he use?

Prescriptive

Firms have used analytics to enhance which of the following business activities?

All of these

Which of the following is not commonly used as an enabler of descriptive analytics?

Data mining

Lesson 2 Quiz

1. Association patterns can include capturing the sequence of events and things.

True

2. Cubes in OLAP are defined as a multidimensional representation of the data stored in and retrieved from data warehouses.

True

3. Prediction modeling is often classified under the unsupervised machine learning methods.

False

4. Data mining can be used to predict the result of sporting events to identify means to decrease the odds of winning against a specific opponent.

False

5. In banking and finance, data mining is often used to manage microeconomics movements and overall cash flow outcomes.

False

6. One of the most pronounced reasons for the increasing popularity of data mining is that there are fewer suppliers than corresponding demand in the business marketplace.

False

7. Novel is a key term in the definition of data mining, which means that the patterns are known by the user within the context of the system being analyzed.

False

8. Segmentation and outlier analysis are part of classification modeling.

False

9. Data mining is primarily concerned with mining (that is, digging out data) from a variety of disparate data sources.

False

10. In the retail industry, association rule mining is frequently called market-basket analysis.

True

11. CRM aims to create one-on-one relationships with customers by developing an intimate understanding of their needs and wants.

True

12. Data mining leverages capabilities of statistics, artificial intelligence, machine learning, management science, information systems, and databases in a systematic and synergistic way.

True

13. The original terminology of data mining commonly refers to discovering known patterns in large and structured data sets.

False

14. Manufacturers use data mining to classify anomalies and commonalities in the production system to improve the manufacturing system.

True

15. Information warfare often refers to identifying and stopping malicious attacks on critical information infrastructures in literally any and every organization and business.

True

Lesson 2 Post Assessment

In data mining, clustering is classified further into:

segmentation and outlier analysis.

Which of the following is the most commonly used clustering algorithm?

k-means
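A minimal sketch of the k-means idea named in the answer above, assuming one-dimensional points and caller-supplied starting centroids. The function name `k_means_1d` and the sample values are hypothetical, and real implementations handle initialization and convergence checks more carefully.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assignment step: index of the closest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


centroids, clusters = k_means_1d([1, 2, 3, 10, 11, 12], [1, 12])
print(centroids, clusters)
```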

What kinds of patterns can data mining discover?

 Each correct answer represents a complete solution. Choose all that apply.

Clustering

Classification

Optimization

Forecasting

Association

What are the most common reasons why data mining has gained overwhelming attention in the business world?

All of these

In retailing, data mining is most commonly used to:

predict future sales.

Which of the following statements is true about clustering?

Assigns customers to different segments

What is the primary difference between statistics and data mining?

Statistics starts with a well-defined proposition and hypothesis, whereas data mining starts with a loosely defined discovery statement.

Lesson 3 Quiz

The important part of the KDD process is the feedback loop that allows the process flow to redirect backward, from any step to any other previous steps, for rework and readjustments.

True

A centralized data repository that combines data sources to support managerial decisions is known as a data warehouse.

True

In the SEMMA process, the accuracy and usefulness of the models are evaluated in the Assess step.

True

In the SEMMA process, visualization and description of the data are carried out in the Modify step.

False

The CRISP-DM methodology was proposed by Fayyad et al., in the year 1996.

False

In the model building task, both the CRISP-DM and SEMMA methodologies build and test various models.

True

Define, Explore, Measure, and Assess are the steps involved in the Six Sigma process.

False

In the testing and evaluation step of the CRISP-DM methodology, monitoring and maintenance of the models are important.

False

During the model building step in the CRISP-DM process, the data mining methods and algorithms are applied to the current data set.

True

The Six Sigma process promotes an error-free/perfect business execution.

True

The Modify step in Six Sigma involves the process of assessing the mapping between organizational data repositories and the business problem.

False

In the project finalization task, both the CRISP-DM and SEMMA methodologies prescribe deploying the results.

False

Identifying the most pressing problem and defining the goals and objectives can be done in the Define step of the Six Sigma process.

True

When compared with all other methodologies, CRISP-DM is the most popular data mining process that is being used in data analytics.

True

In the CRISP-DM process, it is not important or necessary to follow the sequential order of each step. That is, the steps can be executed in an arbitrary sequence.

False


Lesson 3 Post Assessment

During which step of the SEMMA process does the analyst search for unanticipated trends and anomalies to gain a better understanding of the data set?

Explore

Which of the following steps of the CRISP-DM process is commonly called the data preprocessing step that produces the data identified in the data understanding step for analysis?

Data preparation

Which of the following is the most relevant methodology that is used to implement data science and business analytics projects?

CRISP-DM

During which step of the Six Sigma process are the identified data sources consolidated and transformed into a format that is amenable to machine processing?

Measure

Which of the following steps of the CRISP-DM process identifies the relevant data from different sources?

Data understanding

Which of the following substeps are involved in the Sample step of the SEMMA process?

Training, validation, and test

Which of the following steps of the CRISP-DM process identifies the goals, purpose, and requirements of the customers?

Business understanding

The customer credit ratings like bad, fair, and excellent are considered as what type of data?

Ordinal

Lesson 4 Quiz


The ratio of accurately classified instances (positives and negatives) divided by the total number of instances is defined as the overall accuracy metric.

True

 Handling the missing values in the data is typically performed in the data consolidation phase.

False 

F1 metric is simply the harmonic mean of precision and recall.

True
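As a minimal sketch of the F1 statement above, the Python below computes precision, recall, and their harmonic mean from confusion-matrix counts. The function name `f1_score` and the counts are hypothetical, chosen only for illustration, and the sketch assumes the denominators are non-zero.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # correct positives / all predicted positives
    recall = tp / (tp + fn)     # correct positives / all actual positives
    return 2 * precision * recall / (precision + recall)


# Illustrative counts: 8 true positives, 2 false positives, 4 false negatives.
print(round(f1_score(8, 2, 4), 4))
```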

A typical example of interval scale measurement is the temperature on the Celsius scale.

True

Apriori and FP-Growth algorithms are part of the association type data mining tasks.

True

The ratio of correctly classified positives divided by the total positive count is defined as a precision metric.

 

False 

If a classification problem is not binary, you cannot use a confusion matrix to tabulate prediction outcomes.

False 
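A confusion matrix extends naturally beyond two classes: it simply gets one row and one column per class. The sketch below tabulates a three-class example; the helper name `confusion_matrix`, the class labels, and the sample predictions are all hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return a nested dict: matrix[actual_label][predicted_label] -> count."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}


labels = ["bad", "fair", "excellent"]
actual = ["bad", "fair", "fair", "excellent", "bad", "excellent"]
predicted = ["bad", "fair", "bad", "excellent", "fair", "excellent"]

matrix = confusion_matrix(actual, predicted, labels)
for a in labels:
    # One row per actual class, one column per predicted class.
    print(a, [matrix[a][p] for p in labels])
```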

k-means algorithm is a part of prediction data mining method.

False

The bootstrapping methodology is similar to the leave-one-out methodology, where it can be used to calculate accuracy by leaving out one sample at each iteration of the estimation process.

False

Balancing skewed data means oversampling the more represented class records and undersampling the less represented class records.

False

Decision trees are part of the regression type prediction methods.

False

The multi split methodology partitions data into exactly two mutually exclusive subsets called training set and test set.

False

The purpose of data preparation (commonly called data preprocessing) is to eliminate the possibility of GIGO errors.

True

How and what the model concludes on certain predictions is obtained by the interpretability characteristic of the prediction method.

True

The area under the ROC curve is a graphical assessment technique for binary classification problems, in which sensitivity is plotted on the y-axis and the specificity is plotted on the x-axis.

False

Lesson 4 Post Assessment

Which clustering method is based on the basic idea that nearby objects are more related to each other than are those that are farther away from each other?

Hierarchical

Which cross-validation methodology achieves random sampling of a fixed number of instances from the original data with replacement to construct the training data set?

Bootstrapping
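A rough sketch of bootstrap sampling as described in the question above: a fixed number of instances is drawn with replacement to form the training set. Using the never-drawn instances as the test set is an assumption of this sketch, and the function name `bootstrap_sample` and the seed are hypothetical.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) instances with replacement; never-drawn ones form the test set."""
    n = len(data)
    picked = [rng.randrange(n) for _ in range(n)]
    train = [data[i] for i in picked]
    chosen = set(picked)
    test = [data[i] for i in range(n) if i not in chosen]
    return train, test


rng = random.Random(42)  # fixed seed so the sketch is repeatable
train, test = bootstrap_sample(list(range(10)), rng)
print(train, test)
```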

Which classification method use(s) conditional probabilities to build classification models?

Bayesian classifiers

Which of the following is defined as the ratio of correctly classified negatives divided by the total negative count?

Specificity
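To make the definition concrete, here is a small sketch that computes specificity alongside its positive-class counterpart, sensitivity, from confusion-matrix counts. The function names and the counts are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Correctly classified positives divided by the total positive count."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Correctly classified negatives divided by the total negative count."""
    return tn / (tn + fp)


# Illustrative counts: 90 true negatives, 10 false positives,
# 8 true positives, 4 false negatives.
print(specificity(90, 10), sensitivity(8, 4))
```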

Which of the following factors refers to a model's ability to make reasonably accurate predictions, given noisy data or data with missing and erroneous values?

Robustness

Which method takes into account the partial membership of class labels to predefined categories while building models for classification problems?

 Rough sets


Lesson 5 Quiz

Time series is a sequence of data points of interest measured and represented at consecutive and regular time intervals.

True

In linear regression, the independence of errors assumption is also known as homoscedasticity.

False

 Multicollinearity can be triggered by having two or more perfectly correlated explanatory variables present in the model.

True

In linear regression, hypothesis testing reveals the existence of relationships between explanatory variables.

False 

The Naive Bayes method requires output variables to have numeric values

False

 In prediction, linear regression uses a mathematical equation to identify additive mathematical relationships between explanatory variables and the response variable

True

 In the normality of error assumption of linear regression, the response variables' values are expected to be randomly distributed.

False

 In time-series forecasting, an estimator's mean squared error measures the average absolute error between the estimated and the actual values.

False
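The distinction in the statement above can be seen side by side: mean squared error averages the squared errors, while mean absolute error averages their absolute values. The sketch below uses hypothetical function names and made-up numbers.

```python
def mean_squared_error(actual, forecast):
    """Average of the squared differences between actual and forecast values."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def mean_absolute_error(actual, forecast):
    """Average of the absolute differences between actual and forecast values."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


actual = [10, 12, 14, 16]
forecast = [11, 12, 12, 19]
print(mean_squared_error(actual, forecast))   # squared errors: 1, 0, 4, 9
print(mean_absolute_error(actual, forecast))  # absolute errors: 1, 0, 2, 3
```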

Correlation is meant to represent the linear relationships between two nominal input variables.

False 

k-NN is a prediction method used not only for classification but also for regression-type prediction problems.

True
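A minimal sketch of how the same nearest-neighbor search can serve both problem types: a majority vote for classification and an average for regression. It assumes a single numeric input and absolute distance; the function names and data are hypothetical.

```python
from collections import Counter


def k_nearest(train, query, k):
    """train: list of (x, target) pairs; return the targets of the k closest x values."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - query))
    return [target for _, target in ranked[:k]]


def knn_classify(train, query, k=3):
    """Majority vote among the k nearest neighbors."""
    return Counter(k_nearest(train, query, k)).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Average target value of the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(neighbors) / len(neighbors)


labeled = [(1.0, "low"), (1.5, "low"), (3.0, "high"), (3.5, "high"), (4.0, "high")]
numeric = [(1.0, 10.0), (1.5, 12.0), (3.0, 30.0), (3.5, 34.0), (4.0, 41.0)]
print(knn_classify(labeled, 3.2))
print(knn_regress(numeric, 3.2))
```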

 To deploy a developed SVM model, the model coefficients can be extracted and integrated directly into the decision support system.

True

 Logistic regression is like linear regression where both of them are used to predict a numeric target variable.

False

 Linear regression aims to capture the functional relationships between one or more numeric input variables and a categorical output variable.

False

Homoscedasticity states that the response variables must have the same variance in their error, regardless of the explanatory variables' values.

True

 In the SVM model, normalization's main benefit is to avoid having attributes in greater numeric ranges dominate those in smaller numeric ranges.

True 
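One common way to normalize is min-max scaling, shown in the sketch below as an assumed choice (the quiz does not name a specific method). It maps each attribute to the range 0 to 1 so that a large-valued attribute such as income no longer swamps a small-valued one such as age. The function name and values are hypothetical, and the sketch assumes each attribute has at least two distinct values.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


income = [20000, 50000, 110000]  # large numeric range
age = [25, 40, 55]               # small numeric range

print(min_max_scale(income))
print(min_max_scale(age))
```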


Lesson 6 Quiz

In prediction analytics, variance refers to the error, and bias refers to the consistency in the predictive accuracy of models applied to other data sets.

False


A data set is imbalanced when the distribution of different classes in the input variables are significantly dissimilar.

False


 Overfitting is the notion of making the model too specific to the training data to capture not only the signal but also the noise in the data set.

True


Information fusion type model ensembles utilize meta-modeling called super learners.

False


Bias is often defined as the difference between a model's prediction output and the actual values for a given prediction problem.

True


Model ensembles are known to be more robust against outliers and noise in the data compared to individual models.

True

 

Bagging type ensembles can be used in both regression and classification type prediction problems.

True


In explainable AI, the LIME and SHAP methods are considered as global interpreters.

False


 Sensitivity analysis based on the leave-one-out methodology can be applied to any predictive analytics method because of its model agnostic implementation methodology.

True


A model with low variance is the one that captures both noise and generalized patterns in the data and therefore produces an overfit model.

False

 

In ensemble modeling, bagging uses the bootstrap sampling of cases to create a collection of decision trees.

True


Model ensembles are much easier and faster to develop than individual models.

False


In ensemble modeling, boosting builds several independent simple trees for the resultant prediction model.

False


Underfitting is mainly characterized on the bias–variance trade-off continuum as low-bias/low-variance outcome.

False


Sensitivity analysis based on input value perturbation is often used in trained feed-forward neural network modeling, where all of the input variables are numeric and standardized.

True

Lesson 7 Quiz

Clustering is a supervised learning process in which objects are assigned to pre-determined number of artificial groups called clusters.

False 

Text-to-speech is a text processing function that can read textual content and detect and correct syntactic and semantic errors.

False 

In the context of the text mining process, both structured and unstructured data are extracted from the data sources and converted into context-specific knowledge.

True 

SCM and ERP are the first two beneficiaries of the NLP and WordNet.

False 

 In marketing applications, text mining can be used to assess and help predict a customer's propensity to attrite.

True 

 Singular value decomposition helps reduce the overall structure of the term-document matrix to a lower dimensional space for further pattern/knowledge discovery.

True 

 A polygraph is a non-intrusive deception-detection technique commonly used to assess the level of truthfulness in the textual content.

False 

 In the first task of the text mining process, the data is structured and preprocessed to achieve hidden patterns and knowledge nuggets.

False 

 In text mining, associations refer to direct relationships between terms or sets of concepts.

True

Automatic summarization is a program that is used to assign documents into a predefined set of categories.

False 

 The main aim of NLP is to move away from word counting to a real understanding and processing of natural human language.

True 

 Tokenizing refers to the process of breaking sentences into blocks of text that perform a specific linguistic function.

True 

In the context of text mining, lemmatization is a process of syntactically reducing words to their stem/root form.

False 

 In the term-by-document matrix, the columns represent the terms and the rows represent the documents, and the cells represent the variances.

False 
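A minimal sketch of how such a matrix can be built, assuming whitespace tokenization and the documents-as-rows orientation used in the statement above. The cells here hold raw occurrence counts, which is the usual content of a term-by-document matrix rather than variances. The function name and sample documents are hypothetical.

```python
from collections import Counter


def term_document_counts(docs):
    """Return (terms, rows): one row per document, one column per term, cells = counts."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    rows = [[c[t] for t in terms] for c in counts]
    return terms, rows


docs = ["data mining finds patterns", "text mining mines text"]
terms, rows = term_document_counts(docs)
print(terms)
for row in rows:
    print(row)
```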

 In the context of text mining, structured data is for humans to process, while unstructured data is for computers to process and understand.

False 

 

Lesson 7 Post Assessment


In the context of text mining, which of the following is a part of NLP that studies the internal structure of words (that is, the patterns of word formation within a language or across languages)?

Morphology

Which of the following are the most commonly used normalization methods?

Log, binary, and inverse document frequencies
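The three weighting schemes named in the answer can be sketched as follows. The exact formulas vary between textbooks and tools, so the ones below (natural logarithm, and an inverse document frequency that multiplies the log frequency by log(N / df)) are an assumed formulation; the function names are hypothetical.

```python
import math


def log_weight(tf):
    """Log frequency: dampens large raw counts."""
    return 1 + math.log(tf) if tf > 0 else 0.0


def binary_weight(tf):
    """Binary frequency: records only whether the term occurs."""
    return 1 if tf > 0 else 0


def idf_weight(tf, n_docs, doc_freq):
    """Inverse document frequency: down-weights terms that appear in many documents."""
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(n_docs / doc_freq)


print(log_weight(5), binary_weight(5), idf_weight(5, n_docs=10, doc_freq=2))
```

Under this formulation, a term that appears in every document gets a weight of zero, since log(N / df) is log(1).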

Which of the following are the best options available to manage the TDM matrix size?

Labor-intensive process, eliminate terms, and singular value decomposition

Which of the following are the common challenges that are associated with the implementation of NLP?

All of these

Which of the following is not among the steps involved in sentiment analysis?

Latent Dirichlet allocation

In the knowledge extraction method of the text mining process, ____________ refers to the natural grouping, analysis, and navigation of large text collections, such as web pages.

Clustering

Which of the following applications utilize the capabilities of text mining?

Marketing applications

Security applications

Biomedical applications

In which of the following categories of knowledge extraction method is the task of text categorization achieved?

Classification

Lesson 8 Quiz

Hadoop is an open-source framework for processing, storing, and analyzing massive amounts of distributed, wide variety of data.

True 

The term velocity in big data analytics refers to how fast digitized data is created and processed.

True 

Big data comes from a variety of sources within an organization, including marketing and sales transactions, inventory records, financial transactions, and human resources and accounting records.

False 

Hadoop is a batch-oriented computing framework, which implies it does not support real-time data processing and analysis.

True 

A stream in stream analytics is defined as a discrete and aggregated level of data elements.

False 

MapReduce is a contemporary programming language designed to be used by computer programmers.

False 
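MapReduce is generally described as a programming model rather than a language. The sketch below imitates its map, shuffle, and reduce phases in plain Python on a single machine. It is only an illustration of the idea, not how Hadoop executes jobs, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


lines = ["big data big insight", "data stream"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(word_counts)
```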

Among the variety of factors, the key driver for big data analytics is the business needs at any level, including strategic, tactical, or operational.

True 

Grid computing increases efficiency, lowers total cost, and enhances production by processing computational jobs in a shared, centrally managed ordinary pool of computing resources.

True 

HDFS (Hadoop Distributed File System) was invented before Google developed MapReduce. Hence, the early versions of MapReduce relied on HDFS.

False 

The main benefit of Hadoop is that it allows enterprises to process and analyze large volumes of structured and semi-structured data on specialized hardware.

False 

Hadoop is not just about the volume but also processing of diversity of data types.

True 

A data scientist's main objective is to organize and analyze large amounts of data, to solve complex problems, often using software specifically designed for the task.

True 

The term veracity in big data analytics refers to the processing of different types and formats of data, structured and unstructured.

False 

Hadoop is a replacement for a data warehouse which stores and processes large amounts of structured data.

False 

In typical data stream mining applications, the purpose is to predict the class or value of new instances in the data stream, given some knowledge about the class membership or values of previous instances in the data stream.

True 


Lesson 9 Quiz

The main characteristic of deep learning solutions is that they use AI (artificial intelligence) to understand and organize data, predict the intent of a search query, improve the relevancy of results, and automatically tune the relevancy of results over time.

False 

 Human–computer interaction is a critical component of cognitive systems that allows users to interact with cognitive machines and define their needs.

True

Deep learning analytics is a term that refers to the computing-branded technology platforms, such as IBM Watson, that specialize in processing and analyzing large, unstructured data sets.

False 

In a typical neural network, the goal of the testing process is to adjust the network weights and biases such that the network output for each set of inputs is adequately close to its corresponding target value.

False 

Connection weights are the key elements of an artificial neural network (ANN). They produce the final value through the summation and transfer function.

False 
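As a rough illustration of how a single processing element produces its output, the sketch below applies a summation function (weighted sum plus bias) followed by a transfer function. The connection weights only scale the inputs; the output comes from the neuron's own computation. The sigmoid is an assumed choice of transfer function, and the names and values are hypothetical.

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of inputs (summation function) passed through a sigmoid (transfer function)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))


# Illustrative values: 1*0.5 + 2*(-0.25) + 0 = 0, and sigmoid(0) = 0.5.
print(neuron_output([1, 2], [0.5, -0.25], 0.0))
```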

AI (artificial intelligence) has the capability to find hidden patterns in a variety of data sources to identify problems and provide potential solutions.

True

Cognitive computing has the capability to simulate human thought processes to assist humans in finding solutions to complex problems.

True

In artificial neural networks, neurons are processing units, also called processing elements, that perform predefined mathematical operations on the numeric values from the input variables or the other neuron outputs to create and push out their own outputs.

True

The term long short-term memory network refers to a network that is used to remember what happened in the past for a long enough time that it can be leveraged in accomplishing the task when needed.

True

Multilayer perceptron type deep networks are also known as feedforward networks because the flow of information that goes through them is always forwarding, and no feedback connections are allowed.

True

In representation learning, the emphasis is on automatically discovering the features to be used for analytics purposes.

True

Delta (or an error) is defined as the difference between the network weights in two consecutive iterations. 

False 

The purpose of artificial intelligence is to augment human capability.

False 

The main characteristic of the convolutional networks is having at least one layer involving a convolution weight function instead of general matrix multiplication.

True

Deep learning is an extension of neural networks that deal with more complicated tasks with a higher level of sophistication by employing many layers of connected neurons.

True


