Machine Learning PYQ 2017

Machine Learning End Term Examination 2017 (ETCS 402) 



Q1. a) What are the key tasks of machine learning?

The key tasks of machine learning can be summarized as follows:

Data Preprocessing: This involves cleaning and preparing the data for analysis. It includes tasks such as handling missing data, feature scaling, and data normalization.

Feature Selection and Engineering: Identifying and selecting the most relevant features from the dataset or creating new features that can improve the performance of the machine learning model.

Model Selection: Choosing the appropriate machine learning algorithm or model that best suits the problem at hand. This depends on the type of data, the nature of the problem (classification, regression, clustering, etc.), and the desired outcome.

Model Training: Training the selected model using the available dataset. This involves feeding the input data to the model and adjusting its internal parameters to optimize its performance.

Model Evaluation: Assessing the performance of the trained model using appropriate evaluation metrics. This helps determine how well the model generalizes to unseen data and if it meets the desired criteria.

Model Deployment and Monitoring: Implementing the trained model into a production environment and continuously monitoring its performance. This includes handling new data inputs, making predictions, and ensuring the model's accuracy and reliability over time.


b) Discuss Naive Bayes Theorem.

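Naive Bayes is a probabilistic classification method built on Bayes' theorem, which describes the probability of an event given prior knowledge or evidence. Strictly speaking, the theorem itself is Bayes' theorem; "Naive Bayes" refers to the classifier that applies it under a simplifying independence assumption.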

In the context of machine learning, Naive Bayes is a classification algorithm that assumes strong independence between features. It is called "naive" because it assumes that the presence or absence of a particular feature in a class is independent of the presence or absence of any other feature.

The Naive Bayes classifier calculates the probability of a class label given the features of an instance. Under the independence assumption, it is formulated as:

P(y | x₁, x₂, ..., xₙ) = (P(y) * P(x₁ | y) * P(x₂ | y) * ... * P(xₙ | y)) / P(x₁, x₂, ..., xₙ)

where:

P(y | x₁, x₂, ..., xₙ) is the posterior probability of class y given the features x₁, x₂, ..., xₙ.

P(y) is the prior probability of class y.

P(x₁ | y), P(x₂ | y), ..., P(xₙ | y) are the conditional probabilities of the features given class y.

P(x₁, x₂, ..., xₙ) is the evidence or the probability of observing the features.

Naive Bayes algorithm assumes that the features are conditionally independent given the class. This assumption simplifies the calculations and makes the algorithm computationally efficient. However, it may not hold true in some real-world scenarios where features are correlated.
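
As a rough illustration of the formula above, the following sketch estimates the prior and the conditional probabilities by counting and then normalizes by the evidence. The weather-style dataset, the feature names and the function names are all made up for this example, and no smoothing is applied, so a feature value never seen with a class gives that class a probability of zero.

```python
from collections import Counter

# Hypothetical toy data: (outlook, windy) -> play
training_data = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "yes"), "no"),
    (("rainy", "no"), "yes"),
    (("sunny", "no"), "yes"),
    (("rainy", "no"), "no"),
    (("sunny", "yes"), "yes"),
    (("rainy", "yes"), "no"),
]

def train(data):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(label for _, label in data)
    feature_counts = Counter()
    for features, label in data:
        for index, value in enumerate(features):
            feature_counts[(index, value, label)] += 1
    return class_counts, feature_counts

def posterior(features, class_counts, feature_counts):
    """Return P(y | x1..xn) for every class y under the naive assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(y)
        for index, value in enumerate(features):
            # conditional P(x_i | y); no smoothing, so unseen values give 0
            score *= feature_counts[(index, value, label)] / count
        scores[label] = score
    evidence = sum(scores.values())  # P(x1..xn), the normalizing constant
    return {label: score / evidence for label, score in scores.items()}

class_counts, feature_counts = train(training_data)
result = posterior(("sunny", "no"), class_counts, feature_counts)
print(max(result, key=result.get))  # class with the highest posterior
```

For the query (sunny, not windy) the unnormalized scores are 0.5 * 0.75 * 0.75 for "yes" and 0.5 * 0.25 * 0.25 for "no", so the posterior for "yes" works out to 0.9.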


c) What is the difference between supervised learning and unsupervised learning?

The main difference between supervised learning and unsupervised learning lies in the presence or absence of labeled training data.

Supervised learning is a type of machine learning where the training data includes both input features and corresponding target labels. The goal is to learn a mapping function that can predict the correct label for new, unseen instances. It involves training a model using labeled examples and then using that model to make predictions on new data. Supervised learning can be further categorized into classification tasks (where the output is a category or class label) and regression tasks (where the output is a continuous value).

On the other hand, unsupervised learning deals with unlabeled data, where the training data consists only of input features without any corresponding labels or targets. The objective of unsupervised learning is to discover patterns, structures, or relationships in the data. It involves algorithms that automatically identify clusters or groups of similar instances, detect anomalies, or perform dimensionality reduction. Unsupervised learning is particularly useful in exploratory data analysis and finding hidden patterns in large datasets.

Supervised Learning             | Unsupervised Learning
--------------------------------|------------------------------------------------------------
Input data is labelled          | Input data is unlabelled
Uses training dataset           | Uses just input dataset
Used for prediction             | Used for analysis
Classification and regression   | Clustering, density estimation and dimensionality reduction


d) Explain in brief logistic regression.

  • Logistic regression is a binary classification algorithm used for categorical target variables.
  • It models the relationship between input features and the probability of the target class using a logistic function.
  • The logistic function maps real-valued numbers to probabilities between 0 and 1.
  • The model learns coefficients for each feature to calculate the linear combination and transforms it using the logistic function.
  • The decision boundary is determined based on a chosen threshold probability.
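  • Coefficients are adjusted during training, typically by maximum likelihood estimation (equivalently, minimizing log loss), so that predicted probabilities match the actual class labels.
  • Logistic regression is simple, interpretable, and fast. It models a linear relationship between the features and the log-odds of the target; non-linear relationships can be captured only through engineered features such as polynomial or interaction terms.
  • Because it assumes a linear decision boundary in the feature space, it may not perform well on complex or highly non-linear problems.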
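
The points above can be sketched for a single input feature as follows. This is only a minimal illustration trained with plain batch gradient descent on log loss; the data (hours studied versus pass/fail), the learning rate and the number of epochs are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

def predict_label(x, w, b, threshold=0.5):
    """Apply the threshold to the predicted probability."""
    return 1 if sigmoid(w * x + b) >= threshold else 0

# Hypothetical data: hours studied -> passed (1) or failed (0)
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
passed = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(hours, passed)
print(predict_label(1.0, w, b), predict_label(6.0, w, b))
```

Changing `threshold` moves the decision boundary without retraining, which is how the trade-off between the two kinds of misclassification is usually tuned.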


e) What is the Independent Component Analysis? Discuss. 

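Independent Component Analysis (ICA) is a statistical technique used for separating mixed signals into their original source components. It aims to find a linear transformation of the observed signals that makes the resulting components as statistically independent as possible, which is a stronger requirement than merely decorrelating them. A classic illustration is the cocktail party problem, where several microphones record overlapping voices and ICA recovers the individual speakers.

ICA assumes mutual statistical independence among the sources and that at most one of them is Gaussian. The technique is used in signal processing, image analysis, speech recognition, and blind source separation.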



Q2. a) What is machine learning? Discuss the issues in machine learning and the steps required for selecting the right machine learning algorithm. (6.5)

Machine learning is a branch of artificial intelligence that focuses on developing algorithms and models that allow computers to learn from data and make predictions or decisions without being explicitly programmed.

Issues in machine learning:

Data quality: The quality and availability of data can greatly impact the performance of machine learning algorithms. Insufficient or noisy data can lead to inaccurate or biased results.

Overfitting and underfitting: Overfitting occurs when a model fits the training data too closely, including its noise, and fails to generalize well to unseen data. Underfitting, on the other hand, occurs when a model is too simple and fails to capture the underlying patterns in the data.

Feature selection: Choosing the right set of features that are relevant and informative for the learning task is crucial. Including irrelevant or redundant features can degrade the performance of the model.

Scalability: Machine learning algorithms should be scalable to handle large datasets efficiently. Some algorithms may struggle with high-dimensional or massive datasets.

Interpretability and explainability: In some applications, it is important to understand and interpret the decisions made by machine learning models. Some complex models may lack interpretability.

Steps for selecting the right machine learning algorithm:

Define the problem: Clearly understand the problem at hand and define the specific task you want the machine learning algorithm to solve, such as classification, regression, or clustering.

Gather and preprocess the data: Collect relevant data for the problem and preprocess it by cleaning, transforming, and normalizing the data as required.

Identify the type of problem: Determine whether the problem is supervised, unsupervised, or semi-supervised learning. This will help narrow down the choice of algorithms.

Explore different algorithms: Research and explore different machine learning algorithms that are suitable for the problem type. Consider factors such as the algorithm's assumptions, strengths, weaknesses, and applicability to the dataset.

Evaluate and compare algorithms: Evaluate the performance of different algorithms using appropriate evaluation metrics and cross-validation techniques. Compare their results to select the algorithm that performs the best.

Tune and optimize: Fine-tune the selected algorithm by adjusting its parameters and hyperparameters to further improve its performance.

Validate and test: Validate the chosen model using an independent test dataset to ensure its generalization and robustness.


b) What is learning? Discuss. (6)

Learning is the process of acquiring knowledge or skills through study, experience, or training. In the context of machine learning, learning refers to the ability of a computer system to automatically improve its performance on a specific task by analyzing data and patterns.

Machine learning algorithms learn from data by extracting meaningful patterns, relationships, or representations that can be used to make predictions, classify objects, or discover insights.

Learning can be broadly classified into supervised learning, unsupervised learning, and reinforcement learning:

Supervised learning: In supervised learning, the algorithm learns from labeled training data where each example is associated with a known target value. The algorithm learns to map input data to output labels by minimizing the error or discrepancy between predicted and actual labels.

Unsupervised learning: In unsupervised learning, the algorithm learns from unlabeled data without any specific target values. The algorithm aims to discover hidden patterns or structures in the data, such as clustering similar instances or identifying anomalies.

Reinforcement learning: In reinforcement learning, an agent learns to interact with an environment and take actions to maximize a reward signal. The agent receives feedback or rewards based on its actions and learns to make decisions that lead to higher cumulative rewards over time.

Learning in machine learning involves algorithms that iteratively adjust their internal parameters or model representations to improve their performance on the given task. It often involves optimization techniques, statistical inference, and mathematical principles to optimize models and make accurate predictions or decisions.


Q3. a) Explain generative learning algorithms in detail. (8.5)

Generative learning algorithms are a class of machine learning algorithms that aim to model the underlying probability distribution of the input data and generate new samples from that distribution. These algorithms learn to capture the joint probability distribution of the input features and their corresponding output labels or classes. They can be used for both supervised and unsupervised learning tasks.

One popular generative algorithm is the Naive Bayes classifier. It assumes that the input features are conditionally independent given the class label and estimates the class-conditional probabilities using the training data. To make predictions, it applies Bayes' theorem to calculate the posterior probability of each class given the input features and selects the class with the highest probability.

Another example is the Gaussian Mixture Model (GMM), which represents the data distribution as a mixture of multiple Gaussian distributions. GMM estimates the parameters of the Gaussian components using the Expectation-Maximization algorithm and can be used for tasks such as clustering or density estimation.

Generative learning algorithms have several advantages:

  • They can generate new samples from the learned distribution, allowing data augmentation and synthesis.
  • They can handle missing data or incomplete input patterns by modeling the joint distribution.
  • They can provide insights into the underlying data distribution and dependencies between features.



b) Describe the limitations of the perceptron model. (4)

The perceptron is a linear binary classifier that learns a decision boundary separating two classes. While the perceptron model has its strengths, it also has some limitations:

Limited to linearly separable data: The perceptron model can only learn linear decision boundaries. It fails to handle data that is not linearly separable, resulting in poor performance or convergence issues.

Single-layer architecture: The perceptron consists of a single layer of neurons, making it unable to learn complex representations or capture nonlinear relationships in the data.

Lack of probabilistic outputs: The perceptron provides binary outputs, classifying instances as either positive or negative without providing probabilistic confidence scores. This limits its usefulness in tasks that require probability estimates or uncertainty quantification.

Sensitivity to input scaling: The perceptron model is sensitive to the scale of input features. If the features are not properly scaled or normalized, features with large magnitudes dominate the weight updates, which can slow convergence considerably.

Inability to handle multiclass problems: The perceptron was originally designed for binary classification tasks and requires additional techniques to handle multiclass problems. One approach is to use one-vs-rest or one-vs-one strategies.

Convergence and stability issues: The perceptron learning algorithm may not converge if the data is not linearly separable. It can also be sensitive to the initial weights and order of training examples, affecting its stability and generalization performance.

Despite these limitations, the perceptron model played a crucial role in the development of neural networks and paved the way for more powerful and complex models, such as multilayer perceptrons and deep learning architectures.
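
The linear-separability limitation can be seen in a small experiment. The sketch below uses the classic perceptron update rule; the function name, the epoch limit and the choice of the AND and XOR truth tables as datasets are assumptions made for this example. AND is linearly separable, so training reaches an epoch with no mistakes, while XOR is not, so training never settles.

```python
def train_perceptron(samples, epochs=50, learning_rate=1.0):
    """Return (weights, bias, converged) after at most `epochs` passes."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), target in samples:
            activation = weights[0] * x1 + weights[1] * x2 + bias
            predicted = 1 if activation > 0 else 0
            update = learning_rate * (target - predicted)
            if update != 0:
                # Move the decision boundary toward the misclassified point.
                weights[0] += update * x1
                weights[1] += update * x2
                bias += update
                mistakes += 1
        if mistakes == 0:
            return weights, bias, True  # a full pass without errors
    return weights, bias, False

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_converged = train_perceptron(and_gate)
_, _, xor_converged = train_perceptron(xor_gate)
print(and_converged, xor_converged)  # prints: True False
```

Adding a hidden layer, as in a multilayer perceptron, is what allows XOR-like patterns to be learned.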


Q4. a) Explain the decision tree algorithm with example. (6.5)

The decision tree algorithm is a popular supervised learning algorithm that can be used for both classification and regression tasks. It creates a tree-like model of decisions and their possible consequences based on the input features.

The decision tree algorithm starts with the entire dataset and selects the best feature to split the data based on a certain criterion (e.g., information gain or Gini impurity). The data is then partitioned into subsets based on the selected feature's values. This process is recursively applied to each subset, creating a tree structure until a stopping criterion is met, such as reaching a maximum tree depth or having pure leaf nodes.

Let's take an example of a decision tree for classifying fruits based on their features:



Dataset: We have a dataset of fruits with features like color, shape, and texture, and the corresponding labels of whether they are "apple" or "orange".

Splitting Criteria: The decision tree algorithm selects the best feature to split the data. For example, it might choose the "color" feature and split the data into subsets based on colors like "red", "green", and "yellow".

Recursive Splitting: The algorithm applies the splitting process recursively to each subset. Let's consider the subset of fruits with "red" color. It might further split the subset based on the "shape" feature into subsets like "round" and "irregular".

Leaf Nodes: The process continues until a stopping criterion is met. At the leaf nodes, we have pure subsets where all fruits belong to the same class (e.g., "apple" or "orange").

Classification: To classify a new fruit, we traverse the decision tree based on its features and reach a leaf node that corresponds to the predicted class.
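
To make the splitting criterion concrete, the sketch below computes information gain (the reduction in entropy) for each feature of a tiny fruit dataset and picks the best one for the root split. The rows are invented for illustration, and only the first split is shown rather than the full recursive tree construction.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, feature):
    """Entropy before the split minus the weighted entropy after it."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row["label"])
    remainder = sum(len(group) / len(rows) * entropy(group)
                    for group in groups.values())
    return entropy([row["label"] for row in rows]) - remainder

# Hypothetical fruit data
fruits = [
    {"color": "red", "shape": "round", "label": "apple"},
    {"color": "red", "shape": "round", "label": "apple"},
    {"color": "green", "shape": "round", "label": "apple"},
    {"color": "orange", "shape": "round", "label": "orange"},
    {"color": "orange", "shape": "irregular", "label": "orange"},
    {"color": "green", "shape": "irregular", "label": "apple"},
]

gains = {feature: information_gain(fruits, feature) for feature in ("color", "shape")}
best_feature = max(gains, key=gains.get)
print(best_feature)  # prints: color
```

Here every color value leads to a pure subset, so splitting on color removes all of the uncertainty, while shape removes very little.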


b) Discuss bagging and boosting. Write the AdaBoost algorithm. (6)

Bagging and boosting are ensemble learning techniques that combine multiple weak learners (base models) to create a stronger predictive model.

Bagging (Bootstrap Aggregating): Bagging aims to reduce the variance of a predictive model by training multiple base models on different bootstrap samples of the training data and averaging their predictions. Each base model is trained independently, and the final prediction is obtained by aggregating the predictions of all models (e.g., majority voting for classification or averaging for regression).

Boosting: Boosting primarily reduces bias, and often variance as well, by iteratively improving on the base models. Each base model is trained sequentially, and subsequent models are trained to correct the mistakes made by previous models. The final prediction is obtained by combining the predictions of all models using a weighted sum.


AdaBoost (Adaptive Boosting) is a popular boosting algorithm. Here's how it works:

  1. Initialize the sample weights: Assign equal weights (1/N for N training examples) to all training examples.

  2. Train a weak learner: Train a base model on the training data with the current sample weights.

  3. Update weights and calculate the model's weight: Compute the weighted error ε of the base model and set its weight to α = ½ ln((1 − ε) / ε), so that more accurate models get a larger say. Increase the weights of misclassified examples and decrease the weights of correctly classified ones, then renormalize, to focus on the difficult instances.

  4. Repeat steps 2 and 3: Iterate for a fixed number of rounds, each time training a new base model on the updated sample weights.

  5. Final prediction: Combine the predictions of all base models by weighted voting, using the model weights α, to obtain the final prediction.

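AdaBoost focuses more on the misclassified examples, allowing subsequent models to learn from the mistakes of previous models. This iterative process results in a strong ensemble model that fits the training data well and often generalizes well too, although the growing weight on hard examples makes it sensitive to noisy data and outliers.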
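
The AdaBoost procedure can be sketched with one-dimensional decision stumps as the weak learners. This is a minimal illustration, not a reference implementation: the dataset, the stump search over thresholds and the function names are assumptions, labels are encoded as +1 and -1, and the model weight uses the standard formula α = ½ ln((1 − ε)/ε).

```python
import math

def stump_predict(x, threshold, polarity):
    """A decision stump: `polarity` on one side of the threshold, the opposite on the other."""
    return polarity if x <= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    weights = [1.0 / len(xs)] * len(xs)  # step 1: equal sample weights
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)  # step 2
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # model weight
        ensemble.append((alpha, threshold, polarity))
        # step 3: raise weights of misclassified points, lower the rest
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can classify perfectly
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([adaboost_predict(ensemble, x) for x in xs] == ys)  # prints: True
```

The best single stump misclassifies three of the ten points, yet the weighted vote of three stumps classifies all of them correctly.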



Q5. a) Discuss K-means clustering with example. (6.5)

K-means clustering is a popular unsupervised machine learning algorithm used for grouping similar data points together. Here's a simple explanation of how it works:

Imagine you have a collection of various fruits and you want to group them based on their similarities. K-means clustering can help you with that. The algorithm works as follows:

  1. Initialization: You start by specifying the number of clusters you want to create, let's say K. Then, you randomly select K points from your dataset as initial cluster centroids.

  2. Assignment: Next, you assign each fruit to the nearest centroid based on their similarity. The similarity is usually measured using the distance between the data points.

  3. Update: After assigning all the fruits to their nearest centroids, you recalculate the centroids by taking the average of all the data points assigned to each cluster. This step updates the cluster centroids.

  4. Repeat: Steps 2 and 3 are repeated iteratively until the centroids no longer change significantly or a specified number of iterations is reached.

  5. Final Clusters: Once the algorithm converges, you will have K clusters, each containing similar fruits.

For example, using K-means clustering on a dataset of fruits, the algorithm may group apples together in one cluster, oranges in another, and bananas in a third cluster.
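
The loop of assignment and update steps can be sketched as below. To keep the result reproducible, this version takes the initial centroids as an argument instead of choosing them randomly; that choice, the squared Euclidean distance and the sample points are assumptions of the example.

```python
def kmeans(points, initial_centroids, max_iterations=100):
    """Return (centroids, clusters) after the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [sum((a - b) ** 2 for a, b in zip(point, centroid))
                         for centroid in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D measurements forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, initial_centroids=[points[0], points[3]])
print(clusters[0])
print(clusters[1])
```

With random initialization the algorithm can settle in a poor local optimum, which is why it is commonly run several times and the best result kept.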



b) Discuss sequential covering algorithm in detail. (6)

Sequential covering algorithm is a rule-based machine learning algorithm used for classification tasks. It operates by constructing a set of rules that collectively cover the training data. Here's a brief explanation of how it works:

  1. Initialization: You start with an empty rule set.

  2. Rule Generation: The algorithm iteratively generates rules to cover the training data. It selects the most promising attribute and creates a rule based on a specific condition.

  3. Rule Evaluation: Each rule is evaluated using a quality measure (e.g., accuracy, support) to assess its effectiveness in covering the training examples.

  4. Rule Pruning: Rules that do not meet certain criteria (e.g., low accuracy, low support) are pruned from the rule set to improve efficiency and reduce complexity.

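  5. Remove Covered Examples and Repeat: The training examples covered by the accepted rule are removed from the training set, and steps 2 to 4 are repeated on the remaining examples until no examples of the target class are left or a predefined stopping criterion is met. This removal step is what makes the covering "sequential".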

The final rule set obtained from the sequential covering algorithm represents a set of if-then rules that can be used for classification. Each rule covers a specific subset of the training data and provides insights into the relationship between the input features and the target variable.
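
A heavily simplified sketch of the covering loop is shown below: learn one rule, remove the examples it covers, and repeat. Each rule here is restricted to a single attribute-value test chosen by accuracy, whereas practical rule learners grow conjunctions of several tests; the dataset, the accuracy threshold and the function names are assumptions for illustration.

```python
def learn_one_rule(examples, target):
    """Pick the single (attribute, value) test with the best accuracy for `target`.

    Ties on accuracy are broken in favor of the rule covering more examples.
    """
    best_rule, best_accuracy, best_coverage = None, 0.0, 0
    candidates = sorted({(attr, value)
                         for features, _ in examples
                         for attr, value in features.items()})
    for attr, value in candidates:
        covered = [label for features, label in examples if features[attr] == value]
        accuracy = covered.count(target) / len(covered)
        if (accuracy, len(covered)) > (best_accuracy, best_coverage):
            best_rule, best_accuracy, best_coverage = (attr, value), accuracy, len(covered)
    return best_rule, best_accuracy

def sequential_covering(examples, target, min_accuracy=0.8):
    """Learn rules one at a time, removing covered examples after each."""
    rules = []
    remaining = list(examples)
    while any(label == target for _, label in remaining):
        rule, accuracy = learn_one_rule(remaining, target)
        if rule is None or accuracy < min_accuracy:
            break  # no acceptable rule left
        rules.append(rule)
        attr, value = rule
        remaining = [(f, label) for f, label in remaining if f[attr] != value]
    return rules

# Hypothetical weather data: should we play outside?
examples = [
    ({"outlook": "sunny", "windy": "no"}, "yes"),
    ({"outlook": "sunny", "windy": "yes"}, "yes"),
    ({"outlook": "rainy", "windy": "no"}, "yes"),
    ({"outlook": "rainy", "windy": "yes"}, "no"),
    ({"outlook": "overcast", "windy": "yes"}, "no"),
    ({"outlook": "overcast", "windy": "no"}, "yes"),
]
rules = sequential_covering(examples, target="yes")
print(rules)  # prints: [('windy', 'no'), ('outlook', 'sunny')]
```

The result reads as: IF windy = no THEN yes; ELSE IF outlook = sunny THEN yes; otherwise the default class applies.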



Q6. a) What is Principal Component Analysis? Discuss its steps in detail. (6)

Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform a dataset into a lower-dimensional space while preserving the most important information. Here are the steps involved in PCA:

Standardization: The dataset is standardized by subtracting the mean and dividing by the standard deviation of each feature. This ensures that all features have similar scales.

Covariance Matrix: The covariance matrix is computed from the standardized dataset. It measures the relationships between different features.

Eigendecomposition: The covariance matrix is then eigendecomposed to obtain its eigenvectors and eigenvalues. The eigenvectors represent the principal components, and the corresponding eigenvalues indicate the amount of variance explained by each component.

Selection of Principal Components: The eigenvectors are ranked based on their corresponding eigenvalues. The top-k eigenvectors are selected to form the principal components, where k is the desired number of dimensions in the reduced space.

Projection: The original dataset is projected onto the selected principal components to obtain the transformed dataset in the lower-dimensional space.
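
The five steps can be traced on a dataset with just two features, where the eigendecomposition of the 2x2 covariance matrix has a closed form. This restriction is an assumption made to keep the sketch dependency-free; with more features a linear algebra library would normally perform the eigendecomposition. The data values and function names are invented for the example.

```python
import math

def standardize(column):
    """Step 1: subtract the mean and divide by the sample standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / (len(column) - 1))
    return [(v - mean) / std for v in column]

def pca_two_features(xs, ys):
    """Reduce two features to one principal component."""
    zx, zy = standardize(xs), standardize(ys)
    n = len(zx)
    # Step 2: covariance matrix [[a, b], [b, c]] of the standardized data.
    a = sum(v * v for v in zx) / (n - 1)
    c = sum(v * v for v in zy) / (n - 1)
    b = sum(p * q for p, q in zip(zx, zy)) / (n - 1)
    # Step 3: closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    mid = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = (mid + radius, mid - radius)
    # Step 4: eigenvector of the largest eigenvalue is the first component.
    # (If b == 0 the standardized features are uncorrelated and any direction works.)
    vx, vy = b, eigenvalues[0] - a
    norm = math.hypot(vx, vy)
    component = (vx / norm, vy / norm) if norm > 0 else (1.0, 0.0)
    # Step 5: project every standardized point onto the component.
    projected = [p * component[0] + q * component[1] for p, q in zip(zx, zy)]
    explained = eigenvalues[0] / sum(eigenvalues)  # fraction of variance kept
    return component, projected, explained

# Hypothetical measurements of two correlated features
xs = [2.0, 4.0, 6.0, 8.0, 10.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
component, projected, explained = pca_two_features(xs, ys)
print(round(explained, 3))  # prints: 0.9
```

The two features have a correlation of 0.8, so the eigenvalues are 1.8 and 0.2 and a single component retains 90% of the variance.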


b) What is spectral clustering? Discuss any one spectral clustering algorithm. (6.5)

Spectral clustering is a machine learning technique that groups data points using graph theory. The data points become the nodes of a similarity graph, and clusters are found by partitioning that graph using the eigenvectors (the "spectrum") of a matrix derived from it, so that strongly connected points end up in the same cluster.

Imagine you have a group of people and you want to divide them into smaller groups based on their friendships. Spectral clustering works by building a friendship network where each person is represented as a node, and the strength of their friendship determines the connection between nodes.

One example of a spectral clustering algorithm is the Normalized Cut algorithm:

Construct Similarity Graph: A similarity graph is constructed using the data points as nodes. The edges between nodes represent the similarity between data points. Common methods for defining the similarity include the Gaussian kernel or k-nearest neighbors.

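Laplacian Matrix: The Laplacian matrix is computed from the similarity graph as L = D − W, where W is the matrix of edge weights (similarities) and D is the diagonal degree matrix holding each node's total edge weight. Normalized Cut works with a normalized form of this matrix, such as D^(−1/2) L D^(−1/2).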

Eigenvalue Decomposition: The Laplacian matrix is eigendecomposed to obtain its eigenvectors and eigenvalues. The eigenvectors represent the spectral embedding of the data points.

Dimensionality Reduction: The eigenvectors corresponding to the smallest eigenvalues are selected to form a lower-dimensional representation of the data.

Clustering: Traditional clustering algorithms, such as k-means or hierarchical clustering, are applied to the reduced-dimensional representation to assign data points to clusters.

Spectral clustering offers advantages in handling nonlinear data and discovering complex cluster structures that cannot be captured by traditional clustering algorithms based on distance measures alone. It can be particularly useful in image segmentation, social network analysis, and other domains where the relationships between data points play a crucial role in clustering.


Q7. a) Discuss Bellman equations in reinforcement learning.

Bellman equations are a fundamental concept in reinforcement learning. They express the value of being in a certain state by considering the expected rewards from that state and the future states that can be reached from it. The equations help in determining the optimal policy for an agent to maximize its long-term rewards. By recursively updating the values of states based on the immediate rewards and the values of the next states, the Bellman equations provide a way to estimate the optimal value function and policy in reinforcement learning.

Think of Bellman equations as a mathematical framework that helps an intelligent system learn how to make the best decisions in different situations.

It involves estimating the value of each possible action at a particular state by considering the immediate reward and the expected future rewards that can be achieved.

By repeatedly applying the Bellman equations, the intelligent system can gradually improve its decision-making abilities and optimize its actions to maximize long-term rewards.
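
Repeated application of the Bellman equation can be sketched with value iteration on a tiny, deterministic environment, where the optimality equation reduces to V(s) = max over actions of [reward + γ · V(next state)]. The four-state chain, the rewards, the discount factor γ = 0.9 and the function names are all assumptions for illustration; real problems usually have stochastic transitions, which adds an expectation over next states.

```python
# Hypothetical deterministic MDP: state -> {action: (next_state, reward)}.
# State 3 is terminal and has no actions.
mdp = {
    0: {"left": (0, 0.0), "right": (1, 0.0)},
    1: {"left": (0, 0.0), "right": (2, 0.0)},
    2: {"left": (1, 0.0), "right": (3, 1.0)},
    3: {},
}

def value_iteration(mdp, gamma=0.9, tolerance=1e-9):
    """Apply the Bellman optimality update until the values stop changing."""
    values = {state: 0.0 for state in mdp}
    while True:
        largest_change = 0.0
        for state, actions in mdp.items():
            if not actions:
                continue  # terminal state keeps value 0
            best = max(reward + gamma * values[next_state]
                       for next_state, reward in actions.values())
            largest_change = max(largest_change, abs(best - values[state]))
            values[state] = best
        if largest_change < tolerance:
            return values

def greedy_policy(mdp, values, gamma=0.9):
    """Choose, in each state, the action with the best one-step lookahead."""
    return {state: max(actions,
                       key=lambda a: actions[a][1] + gamma * values[actions[a][0]])
            for state, actions in mdp.items() if actions}

values = value_iteration(mdp)
policy = greedy_policy(mdp, values)
print(policy)  # prints: {0: 'right', 1: 'right', 2: 'right'}
```

The values come out as roughly 0.81, 0.9 and 1.0 for states 0, 1 and 2: each step further from the reward is discounted once more by γ.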


b) What are the key terminologies of support vector machine? 

Support Vector Machine (SVM) is a popular machine learning algorithm used for classification tasks. It involves several key terminologies, including:

Support Vectors: These are the data points that lie closest to the decision boundary of the classifier.

Hyperplane: It is the decision boundary that separates the different classes in the data.

Kernel Function: This function transforms the input data into a higher-dimensional space, allowing the SVM to find a nonlinear decision boundary.

Margin: It is the distance between the decision boundary and the closest data points (the support vectors). SVM aims to maximize the margin to improve generalization.

Soft Margin: It allows for some misclassification errors by allowing data points to be inside the margin or on the wrong side of the decision boundary.

C Parameter: It controls the trade-off between maximizing the margin and minimizing the misclassification errors.


c) What are classification errors?

Classification errors refer to the mistakes made by a classification model in assigning labels to the data points. In binary classification, there are two types of classification errors:

False Positive: It occurs when a data point is incorrectly classified as belonging to a positive class when it actually belongs to the negative class.

False Negative: It occurs when a data point is incorrectly classified as belonging to the negative class when it actually belongs to the positive class.

Classification errors can have different implications depending on the problem. In some cases, false positives may be more acceptable, while in others, false negatives may have more serious consequences. The goal is to minimize both types of errors to achieve accurate classification results.
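
The two error types can be counted directly from the actual and predicted labels. The label lists below are made up for the example, with 1 as the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            counts["tp"] += 1
        elif guess == positive:
            counts["fp"] += 1  # predicted positive, actually negative
        elif truth == positive:
            counts["fn"] += 1  # predicted negative, actually positive
        else:
            counts["tn"] += 1
    return counts

# Hypothetical labels
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 1]

counts = confusion_counts(actual, predicted)
error_rate = (counts["fp"] + counts["fn"]) / len(actual)
print(counts)      # prints: {'tp': 3, 'fp': 2, 'fn': 1, 'tn': 2}
print(error_rate)  # prints: 0.375
```

Reporting false positives and false negatives separately, rather than a single error rate, shows which kind of mistake the model tends to make.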



Q8. Write short notes on any two:(6.25x2=12.5)

a) Applications of Machine Learning Algorithms

Image and Speech Recognition: Machine learning algorithms are widely used in image and speech recognition applications. They can learn patterns and features from large datasets to accurately identify and classify objects or recognize speech.

Natural Language Processing: Machine learning is crucial for natural language processing tasks such as language translation, sentiment analysis, and chatbots. It enables computers to understand and generate human language.

Recommender Systems: Many online platforms utilize machine learning algorithms to provide personalized recommendations to users. These algorithms analyze user behavior and preferences to suggest relevant products, movies, music, or articles.

Fraud Detection: Machine learning algorithms are employed in fraud detection systems to identify suspicious patterns and behaviors. They can detect fraudulent transactions, credit card frauds, or account breaches.

Autonomous Vehicles: Self-driving cars heavily rely on machine learning algorithms to perceive the environment, make decisions, and navigate safely. These algorithms process sensor data to understand the surroundings and make real-time driving decisions.


b) Temporal difference learning 

Temporal Difference (TD) learning is a reinforcement learning technique that combines ideas from dynamic programming and Monte Carlo methods. It allows an agent to learn from experience in a dynamic environment, without needing a complete model of that environment, by updating value estimates based on the difference between expected and observed outcomes (observed rewards plus expected future rewards).

TD learning enables the learning system to make adjustments in real-time as it interacts with the environment, helping it to improve its decision-making abilities through trial and error.

TD Learning Algorithm: The agent interacts with the environment, observes rewards, and updates its value estimates using a learning rate and the temporal difference error.

Advantage of TD Learning: It enables learning from incomplete episodes, unlike Monte Carlo methods that require complete episodes. TD learning is more efficient for continuous tasks or situations where episodes are not clearly defined.

TD(0) and TD(λ): TD(0) is a one-step TD learning algorithm that updates value estimates based on the immediate reward and the value estimate of the next state. TD(λ) is a multi-step TD learning algorithm that considers multiple future steps and assigns credit to preceding states.

Eligibility Traces: TD(λ) utilizes eligibility traces to assign credit to multiple states preceding the current state. It allows the agent to learn from delayed rewards and make long-term predictions.
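
The TD(0) update can be written in a few lines: the temporal difference error is the reward plus the discounted value of the next state, minus the current estimate, and the estimate moves a fraction (the learning rate) of the way toward it. The two-step episode, the state names and the parameter values below are assumptions for illustration.

```python
def td0_update(values, state, reward, next_state, alpha=0.5, gamma=1.0):
    """Apply one TD(0) update in place and return the TD error."""
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error

# Hypothetical episode: A -> B (reward 0), then B -> end (reward 1).
episode = [("A", 0.0, "B"), ("B", 1.0, "end")]
values = {"A": 0.0, "B": 0.0, "end": 0.0}

for _ in range(2):  # replay the same episode twice
    for state, reward, next_state in episode:
        td0_update(values, state, reward, next_state)

print(values)  # prints: {'A': 0.25, 'B': 0.75, 'end': 0.0}
```

After the first pass only B has learned anything; on the second pass the value of B propagates back to A, one step at a time, which is the behavior that eligibility traces in TD(λ) are designed to speed up.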


c) Hidden Markov Model

Hidden Markov Model (HMM) is a statistical model used to understand and predict patterns in sequential data that is generated by hidden (unobserved) states. It is widely applied in speech recognition, handwriting recognition, natural language processing, and bioinformatics.

Imagine a system where you observe some outcomes but don't know the underlying process that generated them. HMM helps to uncover the hidden states that influenced these outcomes.

It assumes that the observed data depends on a series of hidden states, and the transitions between these states are governed by probabilities.

By analyzing the observed data, HMM helps in inferring the most likely sequence of hidden states and understanding the underlying patterns in the data.

Components of HMM: HMM consists of a set of hidden states, observable symbols emitted by each state, transition probabilities between states, and emission probabilities for each symbol in each state.

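Forward-Backward Algorithm: The forward pass computes the joint probability of the observations up to a given time and the state at that time; summed over the final states, it gives the probability of the whole observed sequence given the model. The backward pass computes the probability of the remaining observations given the state at that time. Combining the two gives the probability of being in a particular state at a given time, given the entire observed sequence.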

Viterbi Algorithm: The Viterbi algorithm is used to find the most likely sequence of hidden states given the observed sequence. It utilizes dynamic programming to efficiently compute the maximum probability path.

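Learning in HMM: The Baum-Welch algorithm, an expectation-maximization procedure that uses the forward-backward algorithm in its expectation step, is used to train HMMs. It iteratively adjusts the model parameters to increase the likelihood of the observed data, converging to a local maximum.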
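
The Viterbi algorithm described above can be sketched as follows. The two-state health model, its probabilities and the observation names are illustrative assumptions; the point is the dynamic programming table and the backtracking step.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely hidden state sequence, its probability)."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for observation in observations[1:]:
        column = {}
        for s in states:
            prob, previous = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][observation], previous)
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    probability = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, probability

# Hypothetical model: hidden health state, observed symptom
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

path, probability = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path)  # prints: ['Healthy', 'Healthy', 'Fever']
```

For long sequences the products become very small, so implementations typically sum log-probabilities instead of multiplying probabilities.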


d) Value function approximation

Value function approximation is a technique used in reinforcement learning to estimate the value of different states or state-action pairs. It is employed when the state or action space is so large that storing an explicit value for every state or action becomes impractical.

Think of it as a way to represent the expected future rewards for each state or action in a complex environment.

Instead of explicitly calculating the value for every state or action, value function approximation uses mathematical models like neural networks to make predictions based on observed data.

This approximation helps in efficiently dealing with large state spaces and enables the learning system to make informed decisions by estimating the value of unexplored states or actions.

Function Approximators: Value function approximation utilizes function approximators such as linear models, neural networks, or decision trees to estimate the value function. These approximators take input features and learn to predict the value of states or state-action pairs.

Generalization: Value function approximation allows generalization across similar states. Instead of maintaining a separate value for each state, the approximator can estimate values for unseen states based on their similarities to observed states.

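Optimization: The function approximators are trained using optimization algorithms such as gradient descent or stochastic gradient descent. The goal is to minimize the error between predicted values and target values, which are observed returns in Monte Carlo methods or bootstrapped estimates (reward plus discounted next-state value) in TD methods.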

Trade-off: Value function approximation involves a trade-off between approximation accuracy and computational efficiency. Choosing an appropriate function approximator and balancing model complexity is crucial to achieve good performance.
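
As a minimal sketch of the idea, the code below uses a linear function approximator, where the value of a state is the dot product of a weight vector and the state's feature vector, and updates the weights with a semi-gradient TD(0) step. The feature vectors, step size and function names are assumptions chosen for illustration.

```python
def state_value(weights, features):
    """Linear approximation: value = weights . features."""
    return sum(w * f for w, f in zip(weights, features))

def semi_gradient_td0(weights, features, reward, next_features, alpha=0.5, gamma=0.9):
    """Return new weights after one update toward reward + gamma * V(next state).

    For a linear approximator the gradient of the value with respect to the
    weights is simply the feature vector.
    """
    td_error = (reward + gamma * state_value(weights, next_features)
                - state_value(weights, features))
    return [w + alpha * td_error * f for w, f in zip(weights, features)]

# Hypothetical transition from a state with features [1.0, 0.5] into a
# terminal state (all-zero features) with reward 1.
weights = [0.0, 0.0]
weights = semi_gradient_td0(weights, [1.0, 0.5], 1.0, [0.0, 0.0])
print(weights)  # prints: [0.5, 0.25]

# A state that was never visited still gets an estimate from its features.
print(state_value(weights, [1.0, 0.4]))
```

The last line shows the generalization property: two weights summarize the values of all states, so similar feature vectors receive similar value estimates.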
