Answer the following questions
1. What is data mining?
Ans. : A large amount of data is available in different industries and organizations. This huge volume of data is of no use unless it is converted into valuable information; otherwise, we are drowning in data but starving for knowledge. The solution to this problem is data mining, which is the extraction of useful information from the huge amount of data that is available.
Data mining is defined as: "Data mining, also known as Knowledge Discovery in Data (KDD), is the process of uncovering patterns and other valuable information from large data sets".
2. What is data warehousing?
Ans. : A Data Warehouse, also known as a DWH, is a system used for reporting and data analysis. A data warehouse supports decision support systems by merging a large amount of data. It is a repository that sits on top of multiple databases, and it can be defined as a process for collecting and managing data from varied sources to provide meaningful business insights.
3. Explain Spark with its features?
Ans. : Apache Spark has the following features:
Speed: The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application. Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk.
Multiple language support: Spark supports multiple languages. It provides APIs written in Java, Scala, Python and R.
Multiple platform support: Spark runs on multiple platforms without sacrificing processing speed. It runs on Hadoop, Kubernetes, Mesos, Standalone, and even within the cloud.
Advanced analytics: Spark supports not only 'Map' and 'Reduce' but also SQL queries, streaming data, Machine Learning (ML), and graph algorithms.
4. What is AI? Explain its applications?
Ans. : AI is the science and engineering of making machines intelligent.
1. Gaming: AI plays a crucial role in strategic games such as Chess, Poker, Tic-Tac-Toe, etc., where machines can think through a large number of possible positions based on heuristic knowledge.
2. Robotics: Robotics is a branch of AI which combines Electrical Engineering, Mechanical Engineering and Computer Science for the design, construction and application of robots.
3. Cognitive Science: It is the interdisciplinary and scientific study of human behavior and intelligence, with a focus on how information is perceived, processed and transformed.
5. Define Search Strategy?
Ans. : The word 'search' refers to the search for a solution in a problem space.
Search proceeds with different types of search control strategies. A strategy is defined by picking the order in which the nodes are expanded.
So far, we have not given much attention to the question of how to decide which rule to apply next during the process of searching for a solution to a problem. This question arises when more than one rule has its left side match the current state.
In a search method or technique, we first select one option and set the other options aside. If this option is our final goal, then we stop the search; otherwise we continue selecting, testing and expanding until either a solution is found or there are no more states to be expanded.
Depth-first search and breadth-first search are the two common search strategies.
6. Explain BFS?
Ans. : BFS (Breadth-First Search)
Breadth-first search is performed by exploring all nodes at a given depth before proceeding to the next level. This means that all immediate children of a node are explored before any of those children's children are considered. This process is called Breadth First Search.
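As a minimal illustrative sketch (not part of the original answer), BFS can be written in Python with a FIFO queue; the `neighbors` callback is an assumed helper that returns a node's children:

```python
from collections import deque

def bfs(start, goal, neighbors):
    """Explore all nodes at one depth before moving to the next level."""
    frontier = deque([[start]])            # FIFO queue of paths
    visited = {start}
    while frontier:
        path = frontier.popleft()          # oldest (shallowest) path first
        node = path[-1]
        if node == goal:
            return path                    # shortest path in edge count
        for child in neighbors(node):      # all immediate children first
            if child not in visited:
                visited.add(child)
                frontier.append(path + [child])
    return None                            # no solution found

# Illustrative toy graph:
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs("A", "D", graph.__getitem__))   # ['A', 'B', 'D']
```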
7. Explain DFS?
Ans. : Depth-first search is performed by diving downward into a tree as quickly as possible. It follows a single branch of the tree until it produces a solution or until a decision to terminate the path is made. It makes sense to terminate a path if it reaches a dead end, produces a previous state, or becomes longer than some limit; in such cases backtracking occurs. This search procedure with backtracking is known as Depth First Search.
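A minimal Python sketch, assuming the same `neighbors` helper as in the BFS sketch; swapping the FIFO queue for a LIFO stack is what turns BFS into DFS:

```python
def dfs(start, goal, neighbors):
    """Follow one branch as deep as possible; backtrack at dead ends."""
    stack = [[start]]                      # LIFO stack of paths
    visited = {start}
    while stack:
        path = stack.pop()                 # newest (deepest) path first
        node = path[-1]
        if node == goal:
            return path
        for child in neighbors(node):
            if child not in visited:       # skip previously seen states
                visited.add(child)
                stack.append(path + [child])
    return None                            # every branch hit a dead end
```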
8. Explain DLS?
Ans. : Depth-limited search is an uninformed search algorithm. The unbounded-tree problem that arises in the depth-first search algorithm can be fixed by imposing a limit on the depth of the search domain. The Depth Limited Search (DLS) method is almost identical to Depth First Search (DFS), but DLS can work on infinite state-space problems because it bounds the depth of the search tree with a predetermined limit L. Nodes at this depth limit are treated as if they had no successors.
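A minimal recursive sketch in Python; the predetermined limit L is passed in as the `limit` parameter, and nodes at the limit are simply not expanded:

```python
def dls(node, goal, neighbors, limit):
    """Depth-first search cut off at a predetermined depth limit L."""
    if node == goal:
        return node
    if limit == 0:
        return None                        # node at the limit: treat as a leaf
    for child in neighbors(node):
        found = dls(child, goal, neighbors, limit - 1)
        if found is not None:
            return found
    return None                            # goal not found within the limit
```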
9. Explain Uniform Cost Search?
Ans. : Uniform Cost Search is a searching algorithm used for traversing a weighted tree or graph. This algorithm comes into play when a different cost is available for each edge. The primary goal of Uniform Cost Search is to find a path to the goal node which has the lowest cumulative cost. Uniform Cost Search expands nodes according to their path costs from the root node. It can be used to solve any graph/tree where the optimal cost is in demand. The Uniform Cost Search algorithm is implemented using a priority queue.
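A minimal Python sketch using the standard-library `heapq` module as the priority queue; here `neighbors` is assumed to yield `(child, edge_cost)` pairs:

```python
import heapq

def ucs(start, goal, neighbors):
    """Expand nodes in order of cumulative path cost from the root."""
    frontier = [(0, start)]                # priority queue keyed by path cost
    best_cost = {start: 0}
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == goal:
            return cost                    # lowest cumulative cost to the goal
        if cost > best_cost.get(node, float("inf")):
            continue                       # stale queue entry, skip it
        for child, edge in neighbors(node):
            new_cost = cost + edge
            if new_cost < best_cost.get(child, float("inf")):
                best_cost[child] = new_cost
                heapq.heappush(frontier, (new_cost, child))
    return None                            # goal unreachable
```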
10. Iterative Deepening Search
Ans. : The iterative deepening algorithm is a combination of the DFS and BFS algorithms. This search algorithm finds the best depth limit by gradually increasing the limit until a goal is found: it performs Depth First Search up to a certain "depth limit" and keeps increasing the depth limit after each iteration until the goal node is found. This search algorithm combines the benefits of Breadth First Search's fast search and Depth First Search's memory efficiency.
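A minimal sketch that reuses the `dls` function from the DLS sketch above; the `max_depth` safety cap is an illustrative assumption, not part of the textbook algorithm:

```python
def ids(start, goal, neighbors, max_depth=50):
    """Depth-limited search with a gradually increasing limit."""
    for limit in range(max_depth + 1):     # limit 0, 1, 2, ... until goal found
        found = dls(start, goal, neighbors, limit)  # dls() from the DLS sketch
        if found is not None:
            return found                   # goal found at this depth limit
    return None                            # no goal within max_depth
```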
11. Write the Hill Climbing algorithm?
Ans. : The algorithm for Hill Climbing is as follows:
Step 1: Evaluate the initial state. If it is a goal state then stop and return success. Otherwise, make the initial state the current state.
Step 2: Loop until the solution state is found or there are no new operators left which can be applied to the current state.
(a) Select an operator that has not yet been applied to the current state and apply it to produce a new state.
(b) Evaluate the new state:
(i) If it is a goal state, then stop and return success.
(ii) If it is better than the current state, then make it the current state and proceed further.
(iii) If it is not better than the current state, then continue in the loop until a solution is found.
Step 3: Exit
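A minimal Python sketch of these steps, assuming illustrative helper functions `evaluate` (scores a state), `successors` (yields the states each unused operator produces), and `is_goal`:

```python
def hill_climbing(initial, evaluate, successors, is_goal):
    """Direct translation of the steps above (simple hill climbing)."""
    current = initial
    if is_goal(current):                        # Step 1
        return current
    while True:                                 # Step 2
        improved = False
        for new_state in successors(current):   # (a) apply an unused operator
            if is_goal(new_state):              # (b)(i) goal reached
                return new_state
            if evaluate(new_state) > evaluate(current):  # (b)(ii) better state
                current = new_state
                improved = True
                break
        if not improved:                        # no operator improves the state
            return current                      # Step 3: exit (local maximum)
```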
12. Write a note on OLTP databases?
Ans. : OLTP stands for Online Transaction Processing. It manages transaction-oriented applications. It is an online database-modifying system whose basic focus is on manipulating the database. The queries are short and simple, and the modeling of OLTP is industry oriented.
The main purpose is to control day-to-day transactions in the database. A smaller amount of data is accessed. Relational databases are created for Online Transaction Processing (OLTP).
13. Write a note on OLAP databases?
Ans. : OLAP stands for Online Analytical Processing. It manages reporting for multi-dimensional analytical queries. It is an online query-answering system whose main focus is to analyze and extract the data for strategic decision making. The queries are long and complex. The design of OLAP is subject or domain specific. Its main purpose is to find hidden data and support decision making. A large amount of data is accessed. Data warehouses are designed for Online Analytical Processing (OLAP).
14. List the major components of a Data Warehouse system?
Ans. : CRM, Billing, ETL, Flat Files,
Data Warehouse, Reporting, Data Mining
15. List out the steps of the KDD (Knowledge Discovery in Data) process?
Ans. : 1. Selection: The data which is to be mined may not necessarily come from a single source; it may have many heterogeneous origins. This data needs to be obtained from various data sources and files. Data selection is based on your mining goal: data relevant to the mining task is selected from various sources.
2. Pre-processing: Pre-processing involves cleaning the data and integrating the data. The data selected for mining purposes may have some incorrect or irrelevant values which lead to unwanted results, and some values may be missing or erroneous. Also, when data is collected from heterogeneous sources, it may involve varying data types and metrics. So this data needs to be cleaned and integrated to eliminate noise and inconsistency.
3. Transformation: Data transformation is the process of converting the data into a format suitable for processing. Here, data is put into the form required by the data mining process.
4. Data Mining: The data mining step uses methods and techniques to extract the patterns present in the data. It involves transforming relevant data records into patterns, for example using classification, by applying various data mining algorithms to the transformed data. This step generates the desired results for which the whole KDD process is undertaken.
5. Visualization/Interpretation: This is the last step in the KDD process. In this step, the data is presented to the user in the form of reports, tables or graphs. The presentation of the data to the users directly affects the usefulness of the results.
16. What is Prediction?
Ans. : Prediction is a classification task. Prediction discovers the relationship between dependent and independent variables, and it can also be viewed as estimation. A prediction is based on the data in hand, and future trends of a phenomenon can be predicted using predictive algorithms. The best example of prediction is the profit that could be gained from a sale. Prediction is the technique of identifying unavailable numerical data for a new process. Prediction applications include flooding, speech recognition, machine learning, and pattern recognition.
17. Define Predictive Data mining ?
Ans. : Predictive data mining tasks make predictions based on the available data set in hand. These tasks build a model from the data and predict future trends related to that data, or unknown values that may be of interest in the future. Examples of predictive tasks include predicting the future value of gold according to the current market trend, and predicting a high or low value of a share in the share market based on its previous growth. Predictive data mining includes Classification, Regression, Prediction and Time Series Analysis.
18. Define Descriptive Data mining?
Ans. : Descriptive data mining tasks analyze the available data for patterns or models in order to find new, interesting and significant information in the data set. An example of a descriptive data mining task is rearranging product placement in a supermarket according to the purchase patterns of the customers. Descriptive data mining includes Clustering, Summarization, Association Rules and Sequence Discovery.
19. What is data integration?
Ans. : Data integration is the process of combining data from disparate sources into a meaningful and valuable data set for the purpose of analysis. In this step, a logical data source is prepared by collecting and integrating data from multiple sources such as databases, legacy systems, flat files and data cubes.
20. What is graph mining?
Ans. : Graph mining is the set of tools and techniques used to:
(a) analyze the properties of real-world graphs;
(b) predict how the structure and properties of a given graph might affect some application;
(c) develop models that can generate realistic graphs that match the patterns found in real-world graphs of interest.
21. Explain Web Mining?
Ans. : As a tremendous amount of data is generated daily on the web, mining this data is essential. Web mining refers to the mining of data related to the World Wide Web. This covers the actual data present on the web as well as data related to the web. Web data can be classified into the following categories: the content of actual web pages; inter-page structure, i.e. the actual linkage structure between web pages; intra-page structure, i.e. the HTML or XML code of a page; web page access logs; and user profiles.
22. Explain spatial mining ?
Ans. : Spatial data are data about objects that are located in a physical space; this includes data related to space, including maps. Spatial mining is the application of data mining to spatial data, where geographic or spatial information is used to produce the results. Spatial mining extracts knowledge, spatial relationships and other interesting patterns stored in spatial databases. It is applied to understanding spatial records, discovering spatial relationships and relationships between spatial and non-spatial records, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries.
23. Explain Temporal Mining?
Ans. : A temporal database stores data relating to time instances. Temporal data mining is a single step in the process of Knowledge Discovery in Temporal Databases that enumerates structures (temporal patterns or models) over the temporal data; any algorithm that enumerates temporal patterns from, or fits models to, temporal data is a temporal data mining algorithm. Temporal data mining often involves processing time series, typically sequences of data which measure values of the same attribute at a sequence of different time points.
24. Difference Between Verification
& Discovery?
Ans. : Verification :
1. It takes a hypothesis from the user and tests its validity against the data.
2. The emphasis is on the user, who is responsible for formulating the hypotheses and issuing the queries on the data to affirm or negate each hypothesis.
3. No new information is created in the retrieval process.
4. The search process is iterative: the output is reviewed, a new set of questions or hypotheses is formulated to refine the search, and the whole process is repeated.
Discovery :
1. Knowledge discovery is the concept of analyzing large amounts of data and drawing out relevant information, leading to the extraction of meaningful rules, patterns and models from the data.
2. The discovery model differs in its emphasis in that it is the system that automatically discovers important information hidden in the data.
3. The discovery or data mining tools aim to reveal many facts about the data.
4. The data is sifted in search of frequently occurring patterns, trends and generalizations, without intervention or guidance from the user.
25. List the software used for data
mining?
Ans. : R is an open-source programming tool developed by Bell Laboratories. R is a programming language and an environment for statistical computing and graphics. It is compatible with UNIX platforms, FreeBSD, Linux, macOS and Windows operating systems. R is popular for data mining because it can run a variety of statistical analyses, such as time-series analysis, clustering, and linear and non-linear modelling, and it supplies excellent data mining packages. R also offers graphical facilities for data analysis, and its applications include statistical computing, analytics and machine learning tasks.
Weka
Weka is a collection of machine learning algorithms for data mining tasks. It is open-source software that provides tools for data pre-processing and implementations of several machine learning algorithms. The algorithms can either be applied directly to a data set or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules and visualization, and it is also well suited to developing new machine learning schemes. Weka is comprehensive software that lets you pre-process big data, apply different machine learning algorithms to it and compare the various outputs, which makes it easy to work with big data and train a machine using machine learning algorithms.
26. What is data cleaning?
Ans. : The first step in data
pre-processing is data cleaning. It is also known as scrubbing.
Data cleaning includes handling missing
data and noisy data.
(a) Missing data: Missing data is the case wherein some of the attributes or attribute values are missing or the data is not normalized. This situation can be handled by either ignoring the values or filling in the missing values.
(b) Noisy data: This is data with errors, or data which has no meaning at all. This type of data can either lead to invalid results or create problems for the mining process itself. The problem of noisy data can be solved with binning methods, regression and clustering.
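As an illustrative sketch, both cases can be handled with pandas in Python; the toy `age` column, the median fill, and the equal-width bins are all assumptions made for the example:

```python
import pandas as pd

# Illustrative data with a missing value and a noisy outlier.
df = pd.DataFrame({"age": [23, 25, None, 24, 250]})

# Missing data: fill the gap with the column median (or drop the row).
df["age"] = df["age"].fillna(df["age"].median())

# Noisy data: equal-width binning smooths values into coarse intervals.
df["age_bin"] = pd.cut(df["age"], bins=3, labels=["low", "mid", "high"])
print(df)
```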
27. What is RDD (Resilient Distributed
Datasets)?
Ans. : RDD is the fundamental data structure of Apache Spark. It is an immutable collection of objects which is computed on the different nodes of the cluster. Decomposing the name RDD:
Resilient: i.e. fault-tolerant; with the help of the RDD lineage graph, Spark is able to recompute missing or damaged partitions due to node failures.
Distributed: since the data resides on multiple nodes.
Dataset: it represents the records of the data you work with. The user can load the data set externally, which can be a JSON file, CSV file, text file or a database via JDBC, with no specific data structure.
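A minimal PySpark sketch showing an RDD being created, transformed and acted upon; the local master URL and the sample numbers are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# An RDD is immutable: transformations return a new RDD rather than
# modifying the original, and lost partitions are recomputed via lineage.
numbers = sc.parallelize([1, 2, 3, 4, 5])      # distribute a local collection
squares = numbers.map(lambda x: x * x)          # lazy transformation
print(squares.collect())                        # action: [1, 4, 9, 16, 25]

# Datasets can also be loaded externally, e.g. from a text file:
# lines = sc.textFile("data.txt")
sc.stop()
```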
28. Define the functions of Spark Core?
Ans. : Spark Core is the base engine for distributed data processing and large-scale parallel processing. It is also known as the distributed execution engine, while the Java, Scala and Python APIs offer a platform for ETL applications. Spark Core performs functions such as monitoring jobs, storage-system interactions, job scheduling, memory management and fault tolerance. Further, it supports streaming workloads, machine learning and SQL.
Spark Core is also responsible for:
1. Monitoring, scheduling and distributing jobs on a cluster
2. Fault recovery and memory management
3. Ecosystem interactions
29. Components of the Spark Ecosystem?
Ans. : The main components of Apache Spark are as follows:
Spark Core: Spark Core is the underlying general execution engine for the Spark platform upon which all other functionality is built. It provides in-memory computing and referencing of datasets in external storage systems.
Spark SQL: Spark SQL is a component on top of Spark Core. It contains a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data. It supports many data sources, including Hive tables, Parquet and JSON.
Spark Streaming: Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.
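A minimal sketch of this mini-batch model using the classic DStream API in `pyspark.streaming`; the socket source on `localhost:9999` and the 5-second batch interval are illustrative assumptions:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# local[2]: one thread for the receiver, one for processing.
sc = SparkContext("local[2]", "streaming-demo")
ssc = StreamingContext(sc, 5)                   # 5-second mini-batches

lines = ssc.socketTextStream("localhost", 9999) # ingest from a TCP source
words = lines.flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()                                 # print each mini-batch's counts

ssc.start()
ssc.awaitTermination()
```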
30. List the functions of Spark SQL
Ans. : API: When writing and executing Spark SQL from Scala, Java, Python or R, a SparkSession is still the entry point. Once a SparkSession has been established, a DataFrame or a Dataset needs to be created on the data before Spark SQL can be executed.
Spark SQL CLI: The Spark SQL command-line interface is a lifesaver for writing and testing SQL. However, the SQL is executed against Hive, so make sure test data exists in some capacity.
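A minimal PySpark sketch of this flow: establish a SparkSession, create a DataFrame, register it as a view, and only then execute SQL; the sample rows and the view name are illustrative:

```python
from pyspark.sql import SparkSession

# SparkSession is the entry point for Spark SQL.
spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# A DataFrame must exist before SQL can run against it.
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")            # register a SQL-visible view

spark.sql("SELECT name FROM people WHERE age > 30").show()
spark.stop()
```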