601 Recent Trends in IT | 4 Marks Questions with Answer

RTIT

Answer the following questions 


1. Explain the advantages and disadvantages of data mining?

Ans. : Advantages of Data Mining

1. Data mining helps financial institutions and banks to identify probable defaulters, and hence helps them decide whether or not to issue a credit card, loan, etc. This is done based on past transactions, user behavior and data patterns.

2. Data mining based methods are cost-effective and efficient compared to other statistical data applications.

3. It has been used in many different areas or domains viz. bioinformatics, medicine, genetics, education, agriculture, law enforcement, e-marketing, electrical power engineering etc. For example, in genetics it helps in predicting risk of diseases based on DNA sequence of individuals.

Disadvantages of Data Mining

1. The information obtained based on data mining by companies can be misused against a group of people.

2. The data mining techniques are not 100% accurate and may cause serious consequences in certain conditions.

3. Different data mining tools work in different manners due to different algorithms employed in their design. Hence the selection of the right data mining tool is a tedious and cumbersome task as one needs to obtain knowledge of algorithms, features etc. of various available tools.

2.       Describe the advantages and disadvantages of data warehousing?

Ans. : Advantages of Data Warehousing

1. Data Warehouse allows business users to quickly access critical data from multiple sources, all in one place.

2. Data Warehouse stores a large amount of historical data, which helps users analyze different time periods and trends to make future predictions.

3. Data Warehouse helps to reduce the total turnaround time for analysis and reporting.

4. Data Warehouse provides consistent information on various cross-functional activities.

5. Data Warehouse helps to integrate many sources of data to reduce stress on the production system.

Disadvantages of Data Warehousing

1. Not an ideal option for unstructured data.

2. Difficult to make changes in data types and ranges, data source schema, indexes, and queries.

3. The data warehouse may seem easy, but it is too complex for the average user.

 

3.       What are the advantages & disadvantages of AI?

Ans. : Advantages of AI

1. Reduction in Human Error: Humans make mistakes from time to time, but computers do not, provided they are programmed properly. With AI, decisions are taken from previously gathered information by applying a certain set of algorithms.

2. Available 24/7: An average human will work for 6-8 hours a day excluding the breaks. But by using AI we can make machines work 24x7 without any breaks. 

3. Digital Assistance: Highly advanced organizations use digital assistants to interact with users, which saves the need for human resources. Digital assistants are also used on many websites to provide what the user wants; we can chat with them about what we are looking for. Some chatbots are designed in such a way that it becomes hard to determine whether we are chatting with a chatbot or a human being.

4. Faster Decisions: While making a decision humans will analyze many factors, both emotionally and practically, but AI-powered machines work on what is programmed and deliver results faster.

Disadvantages of AI

1. Making Humans Idle: AI is making humans idle, with its applications automating most of the work. Humans tend to get addicted to these inventions, which can cause problems for future generations.

2. Unemployment: As AI is replacing most of the repetitive tasks and other work with robots, human interference is becoming less, which will cause a major problem in the employment standards. Every organization is looking to replace the minimum qualified individuals with AI robots which can do similar work with more efficiency.

3. High Costs of Creation: As AI is updating every day, the hardware and software need to be updated with time to meet the latest requirements. Machines need repair and maintenance, which involve significant costs. Their creation requires huge costs, as they are very complex machines.

4. No Emotions: There is no doubt that machines are much better when it comes to working efficiently but they cannot replace the human connection that makes the team. Machines cannot develop a bond with humans, which is an essential attribute when it comes to Team Management.

4.       What is Spark, with its advantages and disadvantages?

Ans. : Advantages of Spark

1. When it comes to Big Data, processing speed always matters. Spark is widely popular with data scientists because of its speed: it can be up to 100x faster than Hadoop for large-scale data processing, since Spark uses an in-memory (RAM) computing system whereas Hadoop MapReduce reads from and writes to disk. Spark can handle multiple petabytes of clustered data across more than 8,000 nodes at a time.

2. Spark carries easy-to-use APIs for operating on large datasets. It offers over 80 high-level operators that make it easy to build parallel apps.

3. Spark not only supports 'MAP' and 'reduce'. It also supports Machine learning (ML), Graph algorithms, Streaming data, SQL queries, etc.

4. Spark can handle many analytics challenges because of its low-latency in-memory data processing capability. It has well-built libraries for graph analytics.

Disadvantages of Spark

1. In the case of Apache Spark, you need to optimize the code manually since it doesn't have any automatic code optimization process. This will turn into a disadvantage when all the other technologies and platforms are moving towards automation. 

2. Apache Spark doesn't come with its own file management system. It depends on some other platforms like Hadoop or other cloud-based platforms.

3. There are fewer algorithms available in Spark's machine learning library, MLlib. It lags behind in terms of the number of available algorithms.

 

5.       Write the applications of data mining?

Ans. : Data mining is used by many organizations to improve the customer base. They focus on customer behavioral patterns, market analysis, profit areas and product improvement. The essential areas where data mining is used are as follows:

Applications of Data Mining

(a) Education: Educational data mining deals with developing methods to discover knowledge from the education field. It is used to predict students' areas of interest, future learning capacity and other aspects. Educational institutions can apply different data mining techniques and take appropriate, accurate decisions based on the outcome of the mining process. Also, slow and fast learners can be analyzed and the teaching pattern determined accordingly.

(b) Health and Medicine: Data mining can effectively be used in health care systems. During the Covid-19 pandemic, the predictions of the Covid-19 waves and the volume of patients were made using data mining. In genetics also, data mining helps in determining the sequence of the genes and future trends.

(c) Market Analysis: Market analysis is based on a particular pattern of purchase followed by customers. These patterns help the shop owner to understand the buying pattern of customers and accordingly useful decisions can be implemented so as to increase the profit of the store. Also, the market analysis helps to find out the different methodologies to retain the existing customers and gain new ones.

(d) Fraud Detection: A fraud detection system helps in finding out the pattern of fraud, its potential attackers/criminal detection and possible solutions using different data mining algorithms. These data mining methods provide timely and efficient solutions for detection and prevention of the frauds. Intrusion and lie detection can also be addressed by these mechanisms.

6.       Give state space representation for “Water Jug Problem”?

Ans. : In this problem, we have two jugs: one holds a maximum of four liters of water and the other a maximum of three liters. There is a pump that can be used to fill the jugs with water. How can we get exactly two liters of water into the four-liter jug?

State Space Representation:

The state space is the set of ordered pairs giving the number of liters of water in the two jugs at any time, i.e., (x, y) where,

x= 0, 1, 2, 3 or 4.

y = 0, 1, 2 or 3.

x represents the number of liters of water in the 4-liter jug.

y represents the number of liters of water in the 3-liter jug.

The start state is (0, 0) and the goal state is (2, n), where n may be any value, since the 3-liter jug may hold anything from 0 to 3 liters of water or be empty.

Here x and y name the jugs, and the numerical values show the amount of water in the jugs while solving the water jug problem.

To solve the problem, start from (0, 0) and repeatedly apply the rules (fill a jug, empty a jug, or pour one jug into the other) until the 4-liter jug contains exactly 2 liters.
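The state space search described above can be sketched in Python. This is a minimal illustration (not from the original answer): a breadth-first search over the (x, y) states, where the successors of a state are generated by the fill, empty and pour rules.

```python
from collections import deque

def water_jug_bfs(goal=2, cap4=4, cap3=3):
    """BFS over states (x, y): x liters in the 4-liter jug, y in the 3-liter jug."""
    start = (0, 0)
    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        if x == goal:               # goal state (2, n)
            path, state = [], (x, y)
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        successors = [
            (cap4, y), (x, cap3),   # fill a jug
            (0, y), (x, 0),         # empty a jug
            (x - min(x, cap3 - y), y + min(x, cap3 - y)),  # pour 4L jug into 3L jug
            (x + min(y, cap4 - x), y - min(y, cap4 - x)),  # pour 3L jug into 4L jug
        ]
        for s in successors:
            if s not in parent:
                parent[s] = (x, y)
                queue.append(s)
    return None

path = water_jug_bfs()   # a shortest solution: 6 rule applications, 7 states
```

Because BFS explores level by level, the returned path is a minimal-length solution from (0, 0) to a state with 2 liters in the 4-liter jug.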

7.       Give state space representation for “Missionary Cannibal Problem”?

Ans. : "Three missionaries and three cannibals are present at one side of a river and need to cross the river. There is only one boat available. At any point of time, the cannibals should not outnumber the missionaries at either bank. It is also known that only two persons can occupy the boat at a time." The objective is to find the sequence of their transfers from one bank of the river to the other using the boat, satisfying these constraints.

State Space Representation of Missionary Cannibal Problem Production Rules:

We can form various production rules.

Let Missionary be denoted by 'M' and Cannibal, by 'C'. These rules are described below:

Rule 1: (0, M): One missionary sailing the boat from bank-1 to bank-2

Rule 2: (M, 0): One missionary sailing the boat from bank-2 to bank-1 

Rule 3: (M, M): Two missionaries sailing the boat from bank-1 to bank-2

Rule 4: (M, M): Two missionaries sailing the boat from bank-2 to bank-1 

Rule 5: (M, C): One missionary and one Cannibal sailing the boat from bank-1 to bank-2

Rule 6: (C, M): One missionary and one Cannibal sailing the boat from bank-2 to bank-1

Rule 7: (C, C): Two Cannibals sailing the boat from bank-1 to bank-2

Rule 8: (C, C): Two Cannibals sailing the boat from bank-2 to bank-1

Rule 9: (0, C): One Cannibal sailing the boat from bank-1 to bank-2 

Rule 10: (C, 0): One Cannibal sailing the boat from bank-2 to bank-1
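The production rules above can be applied mechanically by a search procedure. Below is a minimal Python sketch (an illustration, not part of the original answer) that encodes a state as (missionaries on bank-1, cannibals on bank-1, boat position) and applies the boatloads allowed by Rules 1-10 via breadth-first search.

```python
from collections import deque

# Boatloads allowed by Rules 1-10: (missionaries, cannibals) in the boat
MOVES = [(1, 0), (2, 0), (1, 1), (0, 1), (0, 2)]

def valid(m, c):
    """Missionaries never outnumbered on either bank (3 of each in total)."""
    if not (0 <= m <= 3 and 0 <= c <= 3):
        return False
    bank1_ok = m == 0 or m >= c
    bank2_ok = (3 - m) == 0 or (3 - m) >= (3 - c)
    return bank1_ok and bank2_ok

def solve():
    # State: (missionaries on bank-1, cannibals on bank-1, boat on bank-1?)
    start, goal = (3, 3, 1), (0, 0, 0)
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        m, c, boat = state
        for dm, dc in MOVES:
            # boat == 1: crossing bank-1 -> bank-2; boat == 0: bank-2 -> bank-1
            nm = m - dm if boat else m + dm
            nc = c - dc if boat else c + dc
            nxt = (nm, nc, 1 - boat)
            if valid(nm, nc) and nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None

path = solve()   # the classic minimal solution takes 11 crossings (12 states)
```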

9.       Write down the algorithm of BFS with its advantages?

Ans. : BFS (Breadth-first Search)

Breadth First searches are performed by exploring all nodes at a given depth before proceeding to the next level. This means that all immediate children of nodes are explored before any children's children are considered.

Construct a tree with the initial state as its root. Generate all its successors by applying all the rules that are appropriate. Fig. 3.1 shows how the tree looks at this point. Now for each leaf node, generate all its successors by applying appropriate rules. The tree at this point is shown in Fig. 3.2. Continue this process until some rule produces a goal state. This process is called Breadth First Search.

 

Algorithm for BFS:

1. Create a variable called NODE-LIST and set it to the initial state.

2. Until a goal state is found or NODE-LIST is empty:

(a) Remove the first element from NODE-LIST and call it E. If NODE-LIST was empty, quit.

(b) For each way that each rule can match the state described in E do:

(i) Apply the rule to generate a new state.

(ii) If the new state is a goal state, quit and return this state.

(iii) Otherwise, add the new state to the end of NODE-LIST.
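The NODE-LIST procedure above can be written as a short Python function. The successor and goal-test functions are supplied by the caller; the example search at the bottom (reaching 10 from 1 via +1 / x2 moves) is purely hypothetical.

```python
from collections import deque

def breadth_first_search(initial_state, successors, is_goal):
    """NODE-LIST algorithm from the text: expand states level by level.
    (Following the text, the initial state itself is not goal-tested.)"""
    node_list = deque([initial_state])        # step 1
    visited = {initial_state}                 # avoid re-enqueuing known states
    while node_list:                          # step 2
        e = node_list.popleft()               # step 2(a)
        for new_state in successors(e):       # step 2(b)(i)
            if is_goal(new_state):            # step 2(b)(ii)
                return new_state
            if new_state not in visited:      # step 2(b)(iii)
                visited.add(new_state)
                node_list.append(new_state)
    return None

# Hypothetical example: reach 10 from 1, where each state n has successors n+1 and n*2
found = breadth_first_search(1, lambda n: [n + 1, n * 2], lambda n: n == 10)
```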

Advantages:-

1. BFS will not get trapped exploring a blind path, which can happen in depth first search.

2. If there is a solution then Breadth-First Search is guaranteed to find it out.

3. If there are multiple solutions, then Breadth-First Search will find the minimal solution, i.e. the one that requires the minimum number of steps.

Disadvantage of Breadth First Search:

1. High storage requirement: memory use grows exponentially with tree depth. A BFS on a binary tree generally requires more memory than a DFS.

2. When the search space is large, the search performance will be poor compared to other heuristic searches.

10.     Write down the algorithm of DFS with its advantages?

Ans. : Depth first searches are performed by going downward into a tree as quickly as possible. A single branch of the tree is followed until it produces a solution or until a decision to terminate the path is made. It makes sense to terminate a path if it reaches a dead end, produces a previous state, or becomes longer than some limit; in such cases backtracking occurs. This search technique with backtracking is known as Depth First Search. Fig. 3.3 shows the Depth First Search tree for the Water Jug Problem.

                                         

Algorithm of Depth First Search:

1. If the initial state is a goal state, quit and return success.

2. Otherwise, do the following until success or failure is signaled.

(a) Generate a successor, E, of the initial state. If there are no more successors, signal failure.

(b) Call depth first search with E as the initial state.

 (c) If success is returned, signal success. Otherwise continue in this loop.
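A minimal recursive Python sketch of the steps above; the small graph used to exercise it is hypothetical, and a visited set is kept so that previous states are not re-explored.

```python
def depth_first_search(state, successors, is_goal, visited=None):
    """Recursive DFS following the three steps above."""
    if visited is None:
        visited = set()
    if is_goal(state):                 # step 1
        return [state]
    visited.add(state)
    for e in successors(state):        # step 2(a)
        if e in visited:
            continue                   # avoid revisiting a previous state
        path = depth_first_search(e, successors, is_goal, visited)  # step 2(b)
        if path is not None:           # step 2(c)
            return [state] + path
    return None                        # signal failure

# Hypothetical small graph; 'G' is the goal
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': [], 'D': ['G'], 'G': []}
path = depth_first_search('A', lambda s: graph[s], lambda s: s == 'G')
```

Note that only the nodes on the current path (plus the visited set) are stored, which is exactly the memory advantage listed below.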

Advantages of Depth First Search:

1. Depth First Search requires less memory, since only the nodes on the current path are stored.

2. Depth First Search may find a solution without examining much of the search space at all. This is particularly significant if many acceptable solutions exist, because Depth First Search can stop when the first of them is found.

3. For simple problems like the water jug problem, control strategies that cause motion and are systematic are sufficient to lead to the final state; for solving complex problems, a more efficient control structure is needed.

Disadvantages of Depth First Search:

1. May find a sub-optimal solution (one that is deeper or more costly than the best solution).

2. Incomplete without a depth bound. It may not find a solution even if one exists.

11.     Write down the algorithm of DLS with its advantages?

Ans. : Depth limited search is an uninformed search algorithm. The unbounded tree problem appears in the depth first search algorithm, and it can be fixed by imposing a boundary or limit on the depth of the search domain.

The Depth Limited Search (DLS) method is almost equal to Depth First Search (DFS). But DLS can work on the infinite state space problem because it bounds the depth of the search tree with a predetermined limit L. Nodes at this depth limit are treated as if they had no successors.

Failure conditions of DLS: 

Depth limited search can be terminated with two conditions of failure:

1. Standard failure value: it indicates that the problem does not have any solution at all.

2. Cutoff failure value: it indicates that there is no solution for the problem within the given depth limit.
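A small Python sketch (with a caller-supplied successor function and goal test, and a hypothetical example graph) that distinguishes the two failure values described above:

```python
CUTOFF, FAILURE = 'cutoff', 'failure'

def depth_limited_search(state, successors, is_goal, limit):
    """DFS bounded by depth `limit`; nodes at the limit are treated as leaves."""
    if is_goal(state):
        return [state]
    if limit == 0:
        return CUTOFF                  # hit the depth bound below this node
    cutoff_seen = False
    for child in successors(state):
        result = depth_limited_search(child, successors, is_goal, limit - 1)
        if result == CUTOFF:
            cutoff_seen = True
        elif result != FAILURE:
            return [state] + result    # a solution path was found
    # Cutoff failure: the bound was reached somewhere, so a deeper solution may exist.
    # Standard failure: the whole subtree was exhausted without hitting the bound.
    return CUTOFF if cutoff_seen else FAILURE

# Hypothetical example graph; 'D' sits at depth 2 below 'A'
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': [], 'D': []}
```

With limit 1 the search for 'D' returns the cutoff failure value; with limit 2 it finds the path; searching for a node that does not exist returns the standard failure value.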

Advantages of Depth Limited Search:

1. Depth limited search is better than DFS and requires less time and memory space.

2. DLS assures that the solution will be found in finite time, if it exists within the depth limit.

3. There are applications of DLS in graph theory, particularly similar to those of DFS.

4. To overcome the disadvantages of DFS, we add a limit on the depth, and our search strategy performs recursively down the search tree.

Disadvantages of Depth Limited Search:

1. The depth limit is compulsory for this algorithm to execute.

2. The goal node may not exist within the depth limit set earlier, which forces the user to iterate further, adding execution time.

3. The goal node will not be found if it does not exist in the desired limit.

13.     Describe the applications of data warehouses?

Ans. : Listed below are the applications of Data warehouses across numerous industry backgrounds.

1. Transportation Industry: In the transportation industry, data warehouses record customer data enabling traders to experiment with target marketing where the marketing campaigns are designed by keeping customer requirements in mind.

2. Services Sector: Data Warehouses find themselves to be of use in the service sector for maintenance of financial records, revenue patterns, customer profiling, resource management, and human resources.

3. Manufacturing and Distribution Industry: A manufacturing organization has to take several make-or-buy decisions which can influence the future of the sector, which is why they utilize high-end OLAP tools as a part of data warehouses to predict market changes, analyze current business trends, detect warning conditions, view marketing developments, and ultimately take better decisions.

4. Healthcare: In the Healthcare sector, all of their financial, clinical, and employee records are fed to warehouses as it helps them to strategize and predict outcomes, track and analyze their service feedback, generate patient reports, share data with tie-in insurance companies, medical aid services, etc.

5. Government and Education: The government uses data warehouses to maintain and analyze tax records and health policy records along with their respective providers, and the entire criminal law database is connected to the state's data warehouse. Criminal activity is predicted from patterns and trends that result from the analysis of historical data associated with past criminals. Universities use warehouses for extracting information used in research grant proposals, understanding their student demographics, and human resource management. The entire financial department of most universities depends on data warehouses, including the Financial Aid department.

6. Retailing: Retailers are the mediators between wholesalers and end customers, and that is why it is necessary for them to maintain the records of both parties. The application of data warehousing helps them store this data in an organized manner.

14.     What is data cleaning? Describe its various methods.

Ans. : The first step in data pre-processing is data cleaning. It is also known as scrubbing. Data cleaning includes handling missing data and noisy data.

(a) Missing data: Missing data is the case wherein some of the attributes or attribute data is missing or the data is not normalized. This situation can be handled by either ignoring the values or filling the missing value.

(b) Noisy data: This is data with errors, or data which has no meaning at all. Such data can either lead to invalid results or create problems for the mining process itself. The problem of noisy data can be solved with binning methods, regression and clustering.
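The two handling strategies above can be illustrated with a short Python sketch: filling missing values with the attribute mean, and smoothing noisy values by bin means (equal-frequency binning). The sample numbers are illustrative only.

```python
def fill_missing_with_mean(values):
    """Handle missing data: replace None entries with the mean of observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Handle noisy data: sort, split into equal-frequency bins,
    and replace every value with its bin's mean."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_ = ordered[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

filled = fill_missing_with_mean([1, None, 3])
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3)
```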

15.     Explain major issues of data mining?

Ans. : Data mining issues should always be considered by all data miners and data mining algorithms before undertaking data mining. The following issues are commonly faced while doing data mining.

1. Human Interaction: When a data mining task is to be undertaken, the goal is not clear. Users as well as the technical expert are unaware of the results. There is a need for a proper interface between the domain expert and users. The queries are formed by the experts based on the user's demand.

2. Overfitting: Overfitting is a statistical error. When a model is generated for a particular data set, it is supposed that the same model should accommodate future data sets as well. But overfitting occurs when the generated model is well suited to the training data set but not to the test data set or future data sets.

3. Outliers: When a model is derived, there are some values of data that do not fit the model. These values are significantly different from the normal values, or they do not fit in any cluster. Such values are called outliers. They can also be called exceptions in the derived model.

4. Interpretation of the results: Interpretation of the results obtained by data mining is a very crucial task. This interpretation is beyond only explanation of the results. This task requires expert analysis and interpretation. Hence, interpretation of the results is an issue in data mining. 

5. Visualization of the results: Visualization of the results is useful to understand and quickly view the output of the different database algorithms. 

6. Large data sets: Data mining models are generally designed and tested on small data sets. But when these models are applied to large data sets, they either fail or wobble. Many models work very well for normal data sets but are inefficient in handling large data sets. The large data set issue can be handled with sampling and parallelization.

7. High Dimensionality: Dimensionality of the database refers to the different attributes present in the database. High dimensionality means a large number of attributes, leading to confusion in choosing the attributes for a particular task. An increase in the number of attributes also increases the complexity and reduces the effectiveness of the algorithm. The solution to high dimensionality is to reduce the number of attributes.

8. Multimedia Data: Many users demand the mining tasks for graphical, video or audio data. The multimedia data can be an issue in data mining as traditionally data mining tasks are designed for numeric or alphanumeric data. 

9. Missing Data: Sometimes the data is incomplete or missing. During the KDD process, this data may be filled in with nearest estimates. These estimates may give false or invalid results, creating problems.

10. Noisy Data: The data which has no meaning is called noisy data. These values need to be corrected or replaced with meaningful data.

16.     Explain various accuracy measures in data mining?

Ans. : The accuracy of a classifier is given as the percentage of correct predictions out of the total number of instances.

The information system consists of a number of different documents, and various operations are performed on these documents to retrieve useful information. The information is retrieved using queries, and the similarity between the query and the retrieved document is calculated. This similarity measure is a set membership function describing the likelihood that the retrieved document is relevant to the user's query.

Precision and Recall

The effectiveness of the system in processing a query is measured by precision and recall.

Precision is used to answer the question: "Are all documents retrieved ones that I am interested in?" In short, precision is the fraction of relevant instances among the retrieved instances.

Recall answers "Have all relevant documents been retrieved?" Recall is the fraction of relevant instances that were retrieved. Both precision and recall are therefore based on relevance.

F-Measure

F-measure or F-score is a measure of the accuracy of a model on a data set that combines precision and recall. The traditional (or balanced) F-score is the harmonic mean of precision and recall.

The F-score is used for evaluating information retrieval systems such as search engines, and also in natural language processing. It is calculated using:

F = 2 × (Precision × Recall) / (Precision + Recall)

Confusion Matrix

A Confusion Matrix describes the accuracy of the solution to a classification problem. A confusion matrix is a table that is often used to describe the performance of a classification model.

Given m classes, a confusion matrix is an m x m matrix where each row represents the actual labels and each column represents the predicted labels. It is also known as the error matrix. The matrix presents the prediction results in summarized form, giving the total numbers of correct and incorrect predictions. For two classes it looks like the table below:

                     Predicted: Yes          Predicted: No
Actual: Yes          True Positive (TP)      False Negative (FN)
Actual: No           False Positive (FP)     True Negative (TN)
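The measures above can be computed directly from binary confusion-matrix counts. A minimal Python sketch, where the counts themselves are made up for illustration:

```python
def scores(tp, fp, fn, tn):
    """Precision, recall, F-measure and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)                 # relevant among retrieved
    recall = tp / (tp + fn)                    # retrieved among relevant
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy

# Hypothetical counts: 40 TP, 10 FP, 10 FN, 40 TN
p, r, f, a = scores(tp=40, fp=10, fn=10, tn=40)
```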

17.     Describe techniques of data mining?

Ans. : There are many data mining techniques organizations can use to turn raw data into actionable insights. Some of them are given below:

1. Statistical techniques: Statistical techniques are at the core of most analytics involved in the data mining process. The different analytics models are based on statistical concepts, which output numerical values that are applicable to specific business objectives. For example, neural networks use complex statistics based on different weights and measures to determine if a picture is a dog or a cat in image recognition systems.

2. Classification:

• This technique is used to obtain important and actual information about data and metadata. It is considered to be a complex data method among other data mining techniques. Information is classified into different classes.

For example, credit customers can be classified into three risk categories: "low", "medium", or "high".

3. Clustering:

In this technique, the pieces of information are grouped according to their similarities. This technique helps to recognize the differences and similarities between the data. For example, different groups of customers are clustered together to find similarities and dissimilarities between the parts of information about them.

4. Regression:

This data mining technique is designed to identify and analyze the relationships between different variables. It is used to estimate the likelihood of a particular variable, given the presence of other variables. This method is also known as predictive power.

Regression analysis is also used to predict the future value of a specific entity (the given feature could be either linear or nonlinear). Regression techniques are quite advantageous, due to the power of neural networks which is a unique method that emulates the neural signals in the brain. Ultimately the goal of regression is to show the links between two pieces of information in one set.

5. Association:

This data mining technique is used to find an association between two or more events or properties. It drills down to an underlying model in the database systems.

6. Outlier detection (Outlier analysis):

This type of data mining technique relates to the observation of data items in the data set which do not match an expected pattern or expected behavior. This technique may be used in various domains like intrusion detection, fraud detection, etc. It is also known as Outlier Analysis or Outlier Mining.
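As a concrete illustration of the regression technique described above, here is a minimal ordinary-least-squares fit of a straight line in plain Python. The data points are hypothetical and chosen to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Hypothetical points lying exactly on y = 2x + 1
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

The fitted slope and intercept can then be used to predict the future value of the variable for unseen x, which is the "predictive power" mentioned above.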

 

18.     Explain RDD?

Ans. : Resilient Distributed Datasets (RDD): RDD is the fundamental data structure of Apache Spark. It is an immutable collection of objects which is computed on the different nodes of the cluster. Decomposing the name RDD:

Resilient: i.e. fault-tolerant with the help of the RDD lineage graph, and so able to recompute missing or damaged partitions due to node failures.

Distributed: Since Data resides on multiple nodes. 

Dataset: It represents records of the data you work with. The user can load the data set externally which can be either JSON file, CSV file, text file or database via JDBC with no specific data structure.

Formally, an RDD is a read-only, partitioned collection of records. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. Spark makes use of the concept of RDDs to achieve faster and more efficient MapReduce operations. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
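The key RDD properties above (immutable, partitioned across nodes, transformed by map and reduced in parallel) can be mimicked in plain Python. This toy class is only a conceptual illustration of those properties, not the real Spark API:

```python
from functools import reduce

class ToyRDD:
    """Toy stand-in for an RDD: an immutable tuple of partitions."""
    def __init__(self, partitions):
        self.partitions = tuple(tuple(p) for p in partitions)  # read-only records

    def map(self, fn):
        # Transformation: returns a NEW ToyRDD; the original is never modified
        return ToyRDD([[fn(x) for x in part] for part in self.partitions])

    def reduce(self, fn):
        # Action: reduce each partition independently (as the cluster nodes
        # would), then combine the partial results
        partials = [reduce(fn, part) for part in self.partitions]
        return reduce(fn, partials)

rdd = ToyRDD([[1, 2], [3, 4], [5]])   # data spread over three "nodes"
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)   # 1+4+9+16+25 = 55
```

In real Spark the same chain would be written against a SparkContext (e.g. `sc.parallelize(...).map(...).reduce(...)`), with the partitions actually living on different cluster nodes.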

19.     Describe Architecture of Kafka?

Ans. : Apache Kafka is a framework implementation of a software bus using stream processing. It is developed in Scala and Java. Kafka aims to provide a unified, high throughput, low-latency platform for handling real-time data feeds. It is fast, scalable and distributed by design.

Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data, enabling you to pass messages from one end-point to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss.

Kafka is built on top of the ZooKeeper synchronization service. It integrates very well with Apache Storm and Spark for real-time streaming data analysis.

Architecture:

Kafka stores key-value messages that come from arbitrarily many processes called producers. The data can be partitioned into different "partitions" within different "topics". Within a partition, messages are strictly ordered by their offsets (the position of a message within a partition), and indexed and stored together with a timestamp. Other processes called "consumers" can read messages from partitions. For stream processing, Kafka offers the Streams API, which allows writing Java applications that consume data from Kafka and write results back to Kafka.
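The core ideas above (a topic split into partitions, each an append-only log with per-message offsets, key-based partitioning, and consumers reading from a stored offset) can be modeled in a few lines of Python. This is a toy model for illustration only, not the Kafka client API:

```python
class ToyKafkaTopic:
    """Toy model of a Kafka topic: numbered partitions, each an append-only
    log where a message's offset is simply its position in the log."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Messages with the same key always land in the same partition,
        # so ordering per key is preserved
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1          # (partition, offset)

    def consume(self, partition, offset):
        # A consumer reads sequentially from its stored offset onwards
        return self.partitions[partition][offset:]

topic = ToyKafkaTopic(num_partitions=3)
p1, o1 = topic.produce("user-1", "login")
p2, o2 = topic.produce("user-1", "click")   # same key -> same partition, next offset
```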

 
