Answer the following questions
1. Explain the advantages and disadvantages of data mining?
Ans. : Advantages of Data Mining
1. Data mining helps financial institutions and banks to identify probable defaulters, and hence helps them decide whether to issue a credit card, loan, etc. This is done based on past transactions, user behavior and data patterns.
2. Data mining based methods are cost-effective and efficient compared to other statistical data applications.
3. It has been used in many different
areas or domains viz. bioinformatics, medicine, genetics, education,
agriculture, law enforcement, e-marketing, electrical power engineering etc.
For example, in genetics it helps in predicting risk of diseases based on DNA
sequence of individuals.
Disadvantages of Data Mining
1. The information obtained based on
data mining by companies can be misused against a group of people.
2. The data mining techniques are not
100% accurate and may cause serious consequences in certain conditions.
3. Different data mining tools work in
different manners due to different algorithms employed in their design. Hence
the selection of the right data mining tool is a tedious and cumbersome task as
one needs to obtain knowledge of algorithms, features etc. of various available
tools.
2. Describe the advantages and disadvantages of data warehousing?
Ans. : Advantages of Data Warehousing
1. A Data Warehouse allows business users to quickly access critical data from several sources, all in one place.
2. A Data Warehouse stores a large amount of historical data that helps users analyze different time periods and trends to make future predictions.
3. A Data Warehouse helps to reduce the total turnaround time for analysis and reporting.
4. Data Warehouse provides consistent
information on various cross-functional activities.
5. Data Warehouse helps to integrate
many sources of data to reduce stress on the production system.
Disadvantages of Data Warehousing
1. Not an ideal option for unstructured
data.
2. Difficult to make changes in data
types and ranges, data source schema, indexes, and queries.
3. A data warehouse may seem easy, but it is too complex for the average user.
3. What
are the advantages & disadvantages of AI?
Ans. : Advantages of AI
1. Reduction in Human Error: Humans make mistakes from time to time, but computers don't if they are programmed properly. With AI, decisions are taken from previously gathered information by applying a certain set of algorithms.
2. Available 24/7: An average human
will work for 6-8 hours a day excluding the breaks. But by using AI we can make
machines work 24x7 without any breaks.
3. Digital Assistance: Highly advanced organizations use digital assistants to interact with users, which saves on human resources. Digital assistants are also used on many websites to provide what the user wants. We can chat with them about what we are looking for. Some chatbots are designed in such a way that it becomes hard to determine whether we are chatting with a chatbot or a human being.
4. Faster Decisions: While making a decision, humans analyze many factors both emotionally and practically, but AI-powered machines work on what is programmed and deliver results faster.
Disadvantages of AI
1. Making Humans Idle: AI is making humans idle with its applications automating most of the work. Humans tend to get addicted to these inventions, which can cause problems for future generations.
2. Unemployment: As AI is replacing
most of the repetitive tasks and other work with robots, human interference is
becoming less, which will cause a major problem in the employment standards.
Every organization is looking to replace the minimum qualified individuals with
AI robots which can do similar work with more efficiency.
3. High Costs of Creation: As AI is updated every day, the hardware and software need to be updated over time to meet the latest requirements. Machines need repair and maintenance, which involve considerable cost. Their creation itself requires huge costs, as they are very complex machines.
4. No Emotions: There is no doubt that machines are much better when it comes to working efficiently, but they cannot replace the human connection that makes a team. Machines cannot develop a bond with humans, which is an essential attribute when it comes to team management.
4. What
is Spark with Advantages And Disadvantages?
Ans. : Advantages of Spark
1. When it comes to Big Data, processing speed always matters. Spark is wildly popular with data scientists because of its speed. Spark can be up to 100x faster than Hadoop for large-scale data processing, because Spark uses an in-memory (RAM) computing system whereas Hadoop reads from and writes to disk. Spark can handle multiple petabytes of clustered data across more than 8,000 nodes at a time.
2. Spark carries easy-to-use APIs for
operating on large datasets. It offers over 80 high-level operators that make
it easy to build parallel apps.
3. Spark not only supports 'MAP' and
'reduce'. It also supports Machine learning (ML), Graph algorithms, Streaming
data, SQL queries, etc.
4. Spark can handle many analytics challenges because of its low-latency in-memory data processing capability. It has well-built libraries for graph analytics.
Disadvantages of Spark
1. In the case of Apache Spark, you
need to optimize the code manually since it doesn't have any automatic code
optimization process. This will turn into a disadvantage when all the other
technologies and platforms are moving towards automation.
2. Apache Spark doesn't come with its
own file management system. It depends on some other platforms like Hadoop or
other cloud-based platforms.
3. Apache Spark's machine learning library, MLlib, contains fewer algorithms than more mature frameworks, so it lags in terms of the number of available algorithms.
5. Write the applications of data mining?
Ans. : Data mining is used by many
organizations to improve the customer base. They focus on customer behavioral
patterns, market analysis, profit areas and product improvement. The essential
areas where data mining is used are as follows:
Applications of Data Mining
(a) Education: Educational data mining deals with developing methods to discover knowledge from the education field. It is used to project students' areas of interest, future learning capacities and other aspects. Educational institutions can apply different data mining techniques and take appropriate, accurate decisions based on the outcome of the mining process. Slow and fast learners can also be analyzed and the teaching pattern adjusted accordingly.
(b) Health and Medicine: Data mining can effectively be used in health care systems. During the Covid-19 pandemic, predictions of the Covid-19 waves and patient volumes were made using data mining. In genetics too, data mining helps in determining the sequence of genes and future trends.
(c) Market Analysis: Market analysis is
based on a particular pattern of purchase followed by customers. These patterns
help the shop owner to understand the buying pattern of customers and
accordingly useful decisions can be implemented so as to increase the profit of
the store. Also, the market analysis helps to find out the different
methodologies to retain the existing customers and gain new ones.
(d) Fraud Detection: A fraud detection system helps in finding out the pattern of fraud, detecting its potential attackers/criminals and determining possible solutions using different data mining algorithms. These data mining methods provide timely and efficient solutions for the detection and prevention of fraud. Intrusion and lie detection can also be addressed by these mechanisms.
6. Give
state space representation for “Water Jug Problem”?
Ans. : In this problem, we have two jugs, a four-liter and a three-liter jug; the first holds a maximum of four liters of water and the second a maximum of three liters. There is a pump that can be used to fill the jugs with water. How can we get exactly two liters of water in the four-liter jug?
State Space Representation:
The state space is a set of ordered pairs giving the number of liters of water in the pair of jugs at any time, i.e., (x, y), where
x= 0, 1, 2, 3 or 4.
y = 0, 1, 2 or 3.
x - Represents the number of liters of water in the 4-liter jug.
y - Represents the number of liters of water in the 3-liter jug.
The start state is (0, 0) and the goal state is (2, n), where n may be any value from 0 to 3, since the 3-liter jug can hold anything from empty up to 3 liters of water.
Here x and y identify the jugs, and the numerical values give the amount of water in each jug while solving the water jug problem.
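As a sketch, the state space above can be searched mechanically. The following is an illustrative breadth-first solver for the (4, 3) jugs with the goal of 2 liters in the 4-liter jug; the function name and the ordering of successors are my own choices, not part of the original problem statement:

```python
from collections import deque

def water_jug_bfs(cap_x=4, cap_y=3, goal=2):
    """Breadth-first search over (x, y) states for the water jug problem."""
    start = (0, 0)
    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        if x == goal:               # goal: 2 liters in the 4-liter jug
            path, state = [], (x, y)
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]       # path from (0, 0) to the goal
        successors = [
            (cap_x, y), (x, cap_y),                     # fill either jug
            (0, y), (x, 0),                             # empty either jug
            # pour 4-liter jug into 3-liter jug, and vice versa
            (x - min(x, cap_y - y), y + min(x, cap_y - y)),
            (x + min(y, cap_x - x), y - min(y, cap_x - x)),
        ]
        for s in successors:
            if s not in parent:
                parent[s] = (x, y)
                queue.append(s)
    return None

print(water_jug_bfs())
```

Because BFS explores level by level, the returned path is a minimal solution (six pours).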
7. Give
state space representation for “Missionary Cannibal Problem”?
Ans. : "Three missionaries and three cannibals are present at one side of a river and need to cross the river. There is only one boat available. At any point of time, the cannibals must not outnumber the missionaries at either bank. It is also known that only two persons can occupy the boat at a time." The objective is to find the sequence of their transfer from one bank of the river to the other using the boat, satisfying these constraints.
State Space Representation of
Missionary Cannibal Problem Production Rules:
We can form various production rules.
Let Missionary be denoted by 'M' and
Cannibal, by 'C'. These rules are described below:
Rule 1: (0, M): One missionary sailing
the boat from bank-1 to bank-2
Rule 2: (M, 0): One missionary sailing
the boat from bank-2 to bank-1
Rule 3: (M, M): Two missionaries
sailing the boat from bank-1 to bank-2
Rule 4: (M, M): Two missionaries
sailing the boat from bank-2 to bank-1
Rule 5: (M, C): One missionary and one
Cannibal sailing the boat from bank-1 to bank-2
Rule 6: (C, M): One missionary and one
Cannibal sailing the boat from bank-2 to bank-1
Rule 7: (C, C): Two Cannibals sailing
the boat from bank-1 to bank-2
Rule 8: (C, C): Two Cannibals sailing
the boat from bank-2 to bank-1
Rule 9: (0, C): One Cannibal sailing
the boat from bank-1 to bank-2
Rule 10: (C, 0): One Cannibal sailing
the boat from bank-2 to bank-1
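To make the rules concrete, here is an illustrative breadth-first solver over states (missionaries on bank-1, cannibals on bank-1, boat position); the encoding and helper names are my own. The five moves below cover the ten rules, since each move can be applied in either direction depending on where the boat is:

```python
from collections import deque

# State: (missionaries on bank-1, cannibals on bank-1, boat at bank 1 or 2)
MOVES = [(1, 0), (2, 0), (0, 1), (0, 2), (1, 1)]   # boat loads, either direction

def safe(m, c):
    # Neither bank may have its missionaries outnumbered by cannibals.
    return (m == 0 or m >= c) and (3 - m == 0 or 3 - m >= 3 - c)

def solve():
    start, goal = (3, 3, 1), (0, 0, 2)
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            path = []
            while state:
                path.append(state)
                state = parent[state]
            return path[::-1]
        m, c, b = state
        sign = -1 if b == 1 else 1       # the boat carries people away from its bank
        for dm, dc in MOVES:
            nm, nc = m + sign * dm, c + sign * dc
            nxt = (nm, nc, 2 if b == 1 else 1)
            if 0 <= nm <= 3 and 0 <= nc <= 3 and safe(nm, nc) and nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None

solution = solve()
print(len(solution) - 1)   # number of boat crossings
```

BFS returns the classic minimal answer of eleven crossings.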
9. Write down the algorithm of BFS with its advantages?
Ans. : BFS (Breadth-first Search)
Breadth First searches are performed by
exploring all nodes at a given depth before proceeding to the next level. This
means that all immediate children of nodes are explored before any children's
children are considered.
Construct a tree with the initial state
as its root. Generate all its successors by applying all the rules that are
appropriate. Fig. 3.1 shows how the tree looks at this point. Now for each leaf
node, generate all its successors by applying appropriate rules. The tree at
this point is shown in Fig. 3.2. Continue this process until some rule produces
a goal state. This process is called Breadth First Search.
Algorithm of BFS:
1. Create a variable called NODE-LIST and set it to the initial state.
2. Until a goal state is found or NODE-LIST is empty:
(a) Remove the first element from
NODE-LIST and call it E. If NODE-LIST was empty, quit.
(b) For each way that each rule can
match the state described in E do:
(i) Apply the rule to generate a new
state.
(ii) If the new state is a goal state, quit and return this state.
(iii) Otherwise, add the new state to
the end of NODE-LIST.
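A direct translation of this NODE-LIST procedure into code might look like the following sketch; the toy successor function in the usage line is an invented example, not part of the algorithm:

```python
from collections import deque

def breadth_first_search(initial_state, successors, is_goal):
    """Generic BFS mirroring the NODE-LIST algorithm above."""
    node_list = deque([initial_state])      # step 1: NODE-LIST holds the initial state
    visited = {initial_state}
    while node_list:                        # step 2: until NODE-LIST is empty
        e = node_list.popleft()             # (a) remove the first element, E
        for new_state in successors(e):     # (b) apply each rule to E
            if is_goal(new_state):          # (ii) goal test
                return new_state
            if new_state not in visited:    # (iii) append to the end of NODE-LIST
                visited.add(new_state)
                node_list.append(new_state)
    return None                             # NODE-LIST empty: quit

# Usage: find the first number >= 10 reachable from 1 by doubling or adding 3.
result = breadth_first_search(1, lambda n: [n * 2, n + 3], lambda n: n >= 10)
print(result)   # 10
```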
Advantages:
1. BFS will not get trapped exploring a blind path, which can happen in depth first search.
2. If there is a solution then
Breadth-First Search is guaranteed to find it out.
3. If there are multiple solutions, then Breadth-First Search will find the minimal solution, i.e., the one that requires the minimum number of steps.
Disadvantages of Breadth First Search:
1. High storage requirement: exponential with tree depth. A BFS on a binary tree generally requires more memory than a DFS.
2. When the search space is large, the search performance will be poor compared to heuristic searches.
10. Write down the algorithm of DFS with its advantages?
Ans. : Depth first searches are performed by going downward into a tree as quickly as possible. A single branch of the tree is considered until it produces a solution or until a decision to terminate the path is made. It makes sense to terminate a path if it reaches a dead end, produces a previous state, or becomes longer than some limit; in such cases backtracking occurs. This strategy of exploring one branch at a time with backtracking is known as Depth First Search. Fig. 3.3 shows the Depth First Search tree for the Water Jug Problem.
Algorithm of Depth First Search:
1. If the initial state is a goal
state, quit and return success.
2. Otherwise, do the following until
success or failure is signaled.
(a) Generate a successor, E, of
the initial state. If there are no more successors,
signal failure.
(b) Call depth first search with E as
the initial state.
(c) If success is returned,
signal success. Otherwise continue in this loop.
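The recursion above can be sketched directly; the toy successor function in the usage line is an invented example (bounded at 50 so the depth-first descent cannot run forever):

```python
def depth_first_search(state, successors, is_goal, visited=None):
    """Recursive DFS mirroring the algorithm above."""
    if visited is None:
        visited = set()
    if is_goal(state):                      # step 1: the state is a goal state
        return state
    visited.add(state)
    for e in successors(state):             # step 2(a): generate a successor, E
        if e not in visited:
            found = depth_first_search(e, successors, is_goal, visited)  # 2(b)
            if found is not None:           # 2(c): success is returned
                return found
    return None                             # no more successors: signal failure

# Usage: same toy space as before -- double or add 3, searching for a value >= 10.
print(depth_first_search(1, lambda n: [n * 2, n + 3] if n < 50 else [], lambda n: n >= 10))
```

Note that DFS dives down the doubling branch and returns 16, a deeper answer than the minimal one BFS would find: a concrete instance of the sub-optimality disadvantage below.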
Advantages of Depth First Search:
1. Depth First Search requires less memory since only the nodes on the current path are stored.
2. Depth First Search may find a solution without examining much of the search space at all. This is particularly significant if many acceptable solutions exist; Depth First Search can stop when one of them is found.
3. To solve simple problems like the water jug problem, we can use control strategies that cause motion and are systematic, which will lead to the final state. But for solving complex problems, we need an efficient control structure.
Disadvantages of Depth First Search:
1. May find a sub-optimal solution (one
that is deeper or more costly than the best solution).
2. Incomplete without a depth bound. It
may not find a solution even if one exists.
11. Write down the algorithm of DLS with its advantages?
Ans. : Depth limited search is an uninformed search algorithm. The unbounded-tree problem that arises in the depth first search algorithm can be fixed by imposing a limit on the depth of the search domain.
The Depth Limited Search (DLS) method is almost the same as Depth First Search (DFS). But DLS can work on the infinite
state space problem because it bounds the depth of the search tree with a
predetermined limit L. Nodes at this depth limit are treated as if they had no
successors.
Failure conditions of DLS:
Depth limited search can be terminated
with two conditions of failure:
Standard failure value: It indicates
that the problem does not have any solution. Cutoff failure value: It defines
no solution for the problem within a given depth limit.
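Both failure values can be expressed in code; this is an illustrative sketch with invented names, where the toy successor function in the usage lines simply doubles a number:

```python
CUTOFF, FAILURE = "cutoff", "failure"

def depth_limited_search(state, successors, is_goal, limit):
    """DLS: DFS bounded by a depth limit; nodes at the limit get no successors."""
    if is_goal(state):
        return state
    if limit == 0:
        return CUTOFF                      # cutoff failure value: limit reached
    cutoff_occurred = False
    for child in successors(state):
        result = depth_limited_search(child, successors, is_goal, limit - 1)
        if result == CUTOFF:
            cutoff_occurred = True
        elif result != FAILURE:
            return result                  # a goal state was found
    # standard failure value when no branch was cut off: no solution exists
    return CUTOFF if cutoff_occurred else FAILURE

# Usage: searching for 16 by doubling; limit too small, then large enough.
print(depth_limited_search(1, lambda n: [n * 2], lambda n: n == 16, 3))   # cutoff
print(depth_limited_search(1, lambda n: [n * 2], lambda n: n == 16, 4))   # 16
```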
Advantages of Depth Limited Search:
1. Depth limited search is better than DFS and requires less time and memory space.
2. DLS assures that the solution will be found, if it exists within the depth limit, in finite time.
3. There are applications of DLS in graph theory, particularly similar to those of DFS.
4. To combat the disadvantages of DFS, we add a limit to the depth, and our search strategy performs recursively down the search tree.
Disadvantages of Depth Limited Search:
1. The depth limit is compulsory for this algorithm to execute.
2. The goal node may not exist within the depth limit set earlier, which will push the user to iterate further, adding execution time.
3. The goal node will not be found if
it does not exist in the desired limit.
13. Describe the applications of Data Warehouses?
Ans. : Listed below are the
applications of Data warehouses across numerous industry backgrounds.
1. Transportation Industry: In the
transportation industry, data warehouses record customer data enabling traders
to experiment with target marketing where the marketing campaigns are designed
by keeping customer requirements in mind.
2. Services Sector: Data Warehouses
find themselves to be of use in the service sector for maintenance of financial
records, revenue patterns, customer profiling, resource management, and human
resources.
3. Manufacturing and Distribution
Industry: A manufacturing organization has to take several make-or-buy
decisions which can influence the future of the sector, which is why they
utilize high-end OLAP tools as a part of data warehouses to predict market changes,
analyze current business trends, detect warning conditions, view marketing
developments, and ultimately take better decisions.
4. Healthcare: In the Healthcare
sector, all of their financial, clinical, and employee records are fed to
warehouses as it helps them to strategize and predict outcomes, track and
analyze their service feedback, generate patient reports, share data with
tie-in insurance companies, medical aid services, etc.
5. Government and Education: The government uses data warehouses to maintain and analyze tax records, health policy records and their respective providers; the entire criminal law database is also connected to the state's data warehouse. Criminal activity is predicted from patterns and trends resulting from the analysis of historical data associated with past criminals. Universities use warehouses for extracting information used in research grant proposals, understanding their student demographics, and human resource management. The entire financial department of most universities depends on data warehouses, including the Financial Aid department.
6. Retailing: Retailers are the mediators between wholesalers and end customers, and that is why it is necessary for them to maintain the records of both parties. To help them store data in an organized manner, data warehousing comes into the frame.
14. What is data cleaning? Describe its various methods.
Ans. : The first step in data
pre-processing is data cleaning. It is also known as scrubbing. Data cleaning
includes handling missing data and noisy data.
(a) Missing data: Missing data is the
case wherein some of the attributes or attribute data is missing or the data is
not normalized. This situation can be handled by either ignoring the values or
filling the missing value.
(b) Noisy data: This is data with errors, or data which has no meaning at all. This type of data can either lead to invalid results or create problems for the mining process itself. The problem of noisy data can be solved with binning methods, regression and clustering.
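As an illustrative sketch of both methods (the function names and data are invented), missing values can be filled with the mean, and noisy values smoothed by replacing each bin with its mean:

```python
def fill_missing(values):
    """Handle missing data: replace None entries with the mean of observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def bin_means(values, bin_size):
    """Handle noisy data: sort, split into equal-frequency bins,
    and replace every value in a bin with the bin's mean."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        chunk = ordered[i:i + bin_size]
        smoothed.extend([sum(chunk) / len(chunk)] * len(chunk))
    return smoothed

prices = [4, 8, None, 15, 21, 21, 24, 25, None]   # invented attribute values
filled = fill_missing(prices)
print(bin_means(filled, 3))
```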
15. Explain
major issues of data mining?
Ans. : Data mining issues should always be considered by all data miners and data mining algorithm designers before going for data mining. The following issues are faced while doing data mining.
1. Human Interaction: When a data
mining task is to be undertaken, the goal is not clear. Users as well as the
technical expert are unaware of the results. There is a need for a proper
interface between the domain expert and users. The queries are formed by the
experts based on the user's demand.
2. Overfitting: Overfitting is a statistical error. When a model is generated for a particular data set, it is expected that the same model will accommodate future data sets as well. But overfitting occurs when the generated model is well suited to the training data set but not to the test data set or future data sets.
3. Outliers: When a model is derived, there are some data values that do not fit the model. These values are significantly different from the normal values, or they do not fit in any cluster. Such values are called outliers. They can also be called exceptions in the derived model.
4. Interpretation of the results:
Interpretation of the results obtained by data mining is a very crucial task.
This interpretation is beyond only explanation of the results. This task
requires expert analysis and interpretation. Hence, interpretation of the
results is an issue in data mining.
5. Visualization of the results:
Visualization of the results is useful to understand and quickly view the
output of the different database algorithms.
6. Large data sets: Data mining models are generally designed and tested on small data sets. But when these models are applied to large data sets, i.e., data sets of much larger size, the models either fail or wobble. There are many such models that work very well for normal data sets but are inefficient in handling large data sets. The large data set issue can be handled with sampling and parallelization.
7. High Dimensionality: Dimensionality of the database refers to the different attributes that are present in the database. High dimensionality in a database means a larger number of attributes, which makes choosing the attributes for a particular task confusing. An increase in the number of attributes increases the complexity of the algorithm and reduces its effectiveness. The solution to high dimensionality is to reduce the number of attributes.
8. Multimedia Data: Many users demand
the mining tasks for graphical, video or audio data. The multimedia data can be
an issue in data mining as traditionally data mining tasks are designed for
numeric or alphanumeric data.
9. Missing Data: Sometimes the data is incomplete or missing. During the KDD process, this data may be filled with nearest estimates. These estimates may give false or invalid results, creating problems.
10. Noisy Data: The data which has no
meaning is called noisy data. These values need to be corrected or replaced
with meaningful data.
16. Explain various accuracy measures in data mining?
Ans. : The accuracy of a classifier is given as the number of correct predictions divided by the total number of instances, expressed as a percentage.
An information system consists of a number of different documents, and various operations are performed on these documents to retrieve useful information. The information is retrieved using queries, and the similarity between the query and the retrieved document is calculated. This similarity measure is a set-membership function describing the likelihood that the retrieved document is relevant to the user's query.
Precision and Recall
The effectiveness of the system in
processing a query is measured by precision and recall.
Precision is used to answer the question: "Are all the retrieved documents ones that I am interested in?" In short, precision is the fraction of relevant instances among the retrieved instances.
Recall answers "Have all relevant
documents been retrieved?" Recall is the fraction of relevant instances
that were retrieved. Both precision and recall are therefore based on
relevance.
F-Measure
F-measure or F-score is a measure of
accuracy of a model on a data set. A measure that combines precision and
recall. It is the harmonic mean of precision and recall, traditional F-measure
or balanced F-score.
The F-score is used for evaluating information retrieval systems such as search engines, and also in natural language processing. It is calculated using:
F = 2 × (Precision × Recall) / (Precision + Recall)
Confusion Matrix
A Confusion Matrix describes the
accuracy of the solution to a classification problem. A confusion matrix is a
table that is often used to describe the performance of a classification model.
Given m classes, a confusion matrix is an m x m matrix where each row represents the actual (true) labels and each column represents the predicted labels. It is also known as the error matrix. The matrix presents the prediction results in summarized form, giving the total numbers of correct and incorrect predictions. For the binary case (m = 2), the matrix looks like the table below:

                     Predicted Positive     Predicted Negative
Actual Positive      True Positive (TP)     False Negative (FN)
Actual Negative      False Positive (FP)    True Negative (TN)
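All of these measures can be computed directly from the counts in a binary confusion matrix; a small illustrative sketch with invented labels and data:

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN, TN for one class treated as 'positive'."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

actual    = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham",  "ham", "spam", "spam", "ham"]

tp, fp, fn, tn = confusion_counts(actual, predicted, positive="spam")
precision = tp / (tp + fp)                                # relevant among retrieved
recall    = tp / (tp + fn)                                # retrieved among relevant
f_score   = 2 * precision * recall / (precision + recall) # harmonic mean
accuracy  = (tp + tn) / len(actual)
print(precision, recall, f_score, accuracy)
```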
17. Describe techniques of data mining?
Ans. : There are many data mining
techniques organizations can use to turn raw data into actionable insights.
Some of them are given below:
1. Statistical techniques: Statistical
techniques are at the core of most analytics involved in the data mining process.
The different analytics models are based on statistical concepts, which output
numerical values that are applicable to specific business objectives. For
example, neural networks use complex statistics based on different weights and
measures to determine if a picture is a dog or a cat in image recognition
systems.
2. Classification:
This technique is used to obtain important and actual information about data and metadata. It is considered a complex method among the data mining techniques. Information is classified into different classes. For example, credit customers can be classified according to three risk categories: "low", "medium" or "high".
3. Clustering:
In this technique, the pieces of
information are grouped according to their similarities. This technique helps
to recognize the differences and similarities between the data. For example,
different groups of customers are clustered together to find similarities and
dissimilarities between the parts of information about them.
4. Regression:
This data mining tool is designed to identify and analyze the interactions between different variables. It is used to identify the likelihood of a particular variable given the values of the other variables; this gives the method its predictive power.
Regression analysis is also used to predict the future value of a specific entity (the relationship may be either linear or nonlinear). Regression techniques can be quite powerful, for example when combined with neural networks, a method that emulates the neural signals in the brain. Ultimately the goal of regression is to show the link between two pieces of information in one set.
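As an illustrative sketch of the simplest case (invented toy data), ordinary least squares fits a line y = a + b·x relating one variable to another:

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x on paired observations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Toy data: advertising spend vs. sales, perfectly linear for clarity.
spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]          # sales = 1 + 2 * spend
a, b = linear_fit(spend, sales)
print(a, b)                   # intercept 1.0, slope 2.0
print(a + b * 5)              # predicted sales at spend = 5 -> 11.0
```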
5. Association:
This data mining technique is used to find an association between two or more events or properties. It drills down to an underlying model in the database systems.
6. Outer detection (Outlier analysis):
This type of data mining technique relates to the observation of data items in the data set which do not match an expected pattern or behavior. This technique may be used in various domains like intrusion detection, fraud detection, etc. It is also known as Outlier Analysis or Outlier Mining.
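For association rules in particular, the standard scores are support and confidence; a minimal illustrative sketch over an invented market-basket data set:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent,
    the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))        # support of the pair {bread, milk}
print(confidence({"bread"}, {"milk"}))   # confidence of the rule bread -> milk
```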
18. Explain
RDD?
Ans. : Resilient Distributed Datasets (RDD): RDD is the fundamental data structure of Apache Spark. It is an immutable collection of objects that is computed on the different nodes of the cluster.
Decomposing the name RDD:
Resilient: i.e. fault-tolerant; with the help of the RDD lineage graph, Spark is able to recompute missing or damaged partitions after node failures.
Distributed: since the data resides on multiple nodes.
Dataset: It represents records of the
data you work with. The user can load the data set externally which can be
either JSON file, CSV file, text file or database via JDBC with no specific
data structure.
Formally, an RDD is a read-only, partitioned collection of records. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. Spark makes use of the concept of RDDs to achieve faster and more efficient MapReduce operations. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
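To give a feel for the RDD programming model without a Spark cluster, here is a toy in-memory stand-in of my own (an eager, single-process simplification: real RDDs are partitioned across nodes, evaluated lazily, and rebuilt from a lineage graph):

```python
from functools import reduce

class TinyRDD:
    """Toy stand-in for Spark's RDD: an immutable collection of records
    where transformations return new collections and actions return values."""
    def __init__(self, records):
        self._records = tuple(records)   # immutability: stored as a tuple
    def map(self, fn):
        return TinyRDD(fn(r) for r in self._records)       # transformation
    def filter(self, pred):
        return TinyRDD(r for r in self._records if pred(r))  # transformation
    def reduce(self, fn):
        return reduce(fn, self._records)                   # action
    def collect(self):
        return list(self._records)                         # action

rdd = TinyRDD([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 1).collect())  # [1, 9, 25]
print(rdd.reduce(lambda a, b: a + b))                                   # 15
```

In real PySpark the same pipeline would be written against a SparkContext, but the transformation/action split is the same idea.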
19. Describe
Architecture of Kafka?
Ans. : Apache Kafka is a framework
implementation of a software bus using stream processing. It is developed in
Scala and Java. Kafka aims to provide a unified, high throughput, low-latency
platform for handling real-time data feeds. It is fast, scalable and
distributed by design.
Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of
data and enables you to pass messages from one end-point to another. Kafka is
suitable for both offline and online message consumption. Kafka messages are
persisted on the disk and replicated within the cluster to prevent data loss.
Kafka is built on top of the ZooKeeper
synchronization service. It integrates very well with Apache Storm and Spark
for real-time streaming data analysis.
Architecture:
Kafka stores key-value messages that
come from arbitrarily many processes called producers. The data can be
partitioned into different "partitions" within different
"topics". Within a partition, messages are strictly ordered by their offsets
(the position of a message within a partition), and indexed and stored together
with a timestamp. Other processes called "consumers" can read messages from partitions. For stream processing, Kafka offers the Streams API, which allows writing Java applications that consume data from Kafka and write results back to Kafka.
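The topic/partition/offset model described above can be illustrated with a toy in-process stand-in of my own (real Kafka persists replicated, append-only partition logs on broker disks; this sketch models only the addressing scheme):

```python
from collections import defaultdict

class TinyLog:
    """Toy model of Kafka storage: per-(topic, partition) lists of
    (offset, key, value) records, read independently by offset."""
    def __init__(self):
        self.partitions = defaultdict(list)   # (topic, partition) -> records

    def produce(self, topic, key, value, num_partitions=2):
        part = hash(key) % num_partitions     # same key -> same partition
        log = self.partitions[(topic, part)]
        offset = len(log)                     # offsets strictly ordered per partition
        log.append((offset, key, value))
        return part, offset

    def consume(self, topic, part, from_offset=0):
        # Consumers track their own position and re-read from any offset.
        return self.partitions[(topic, part)][from_offset:]

bus = TinyLog()
for i in range(4):
    bus.produce("clicks", key="user-1", value=f"click-{i}")
part, _ = bus.produce("clicks", key="user-1", value="click-4")
print([v for _, _, v in bus.consume("clicks", part, from_offset=2)])
# ['click-2', 'click-3', 'click-4']
```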