Machine Learning Fundamentals
Machine learning is a powerful field that is transforming many industries, including genomics. In this course, Graduate Certificate in Machine Learning for Genomic Data, you will encounter a wide range of key terms that are essential for understanding the fundamental concepts of machine learning. The glossary below defines each of these terms to give you a solid foundation in the subject.
1. **Machine Learning**: Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. It focuses on the development of algorithms that can learn from and make predictions or decisions based on data.
2. **Genomic Data**: Genomic data refers to the vast amount of information stored in an individual's genome, including DNA sequences, gene expressions, and genetic variations. Analyzing genomic data is crucial for understanding genetic diseases, personalized medicine, and evolutionary biology.
3. **Supervised Learning**: Supervised learning is a type of machine learning where the model is trained on labeled data, meaning the input data is paired with the correct output. The goal is to learn a mapping function from input to output to make predictions on unseen data.
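To make the idea of learning a mapping from labeled pairs concrete, here is a minimal sketch of supervised learning using a 1-nearest-neighbour classifier; the toy coordinates and class names are made up for illustration.

```python
# Minimal supervised-learning sketch: fit on labelled (input, output)
# pairs, then predict the label of an unseen input.

def predict_1nn(X_train, y_train, x):
    """Return the label of the training point closest to x."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(X_train)), key=lambda i: sq_dist(X_train[i], x))
    return y_train[best]

# Labelled training data: two small clusters with known classes.
X_train = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
y_train = ["low", "low", "high", "high"]

# Predict on inputs the model has never seen.
pred_a = predict_1nn(X_train, y_train, (0.2, 0.1))  # near the "low" cluster
pred_b = predict_1nn(X_train, y_train, (4.8, 5.1))  # near the "high" cluster
```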
4. **Unsupervised Learning**: Unsupervised learning is a type of machine learning where the model is trained on unlabeled data. The goal is to discover hidden patterns or structures in the data without explicit guidance, such as clustering similar data points together.
5. **Reinforcement Learning**: Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions, with the goal of maximizing cumulative rewards over time.
6. **Feature Engineering**: Feature engineering is the process of selecting, transforming, and creating new features from raw data to improve the performance of machine learning models. It involves domain knowledge, creativity, and data manipulation techniques.
7. **Feature Selection**: Feature selection is the process of choosing the most relevant features from the original set of features to improve model performance and reduce overfitting. It helps in simplifying the model and increasing its interpretability.
8. **Overfitting**: Overfitting occurs when a machine learning model performs well on the training data but fails to generalize to unseen data. It happens when the model is too complex and captures noise in the training data instead of underlying patterns.
9. **Underfitting**: Underfitting occurs when a machine learning model is too simple to capture the underlying patterns in the data. It performs poorly on both the training and test data, indicating that it lacks the capacity to learn from the data effectively.
10. **Bias-Variance Tradeoff**: The bias-variance tradeoff is a fundamental concept in machine learning that deals with the balance between bias (error due to simplifying assumptions) and variance (error due to sensitivity to variations in the training data). Finding the right balance is essential for building predictive models.
11. **Cross-Validation**: Cross-validation is a technique used to evaluate the performance of machine learning models by splitting the data into multiple subsets, training the model on some subsets, and testing it on others. It helps in estimating the model's generalization error.
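The splitting step of k-fold cross-validation can be sketched as below; each sample appears in exactly one test fold and in the training set of every other fold. The fold-assignment scheme is one common choice, not the only one.

```python
# Sketch of k-fold cross-validation index generation: split n samples
# into k folds, holding each fold out once as the test set.

def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n))
    # Distribute any remainder across the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
# Train a model on each train split and average the test-split scores
# to estimate generalization error.
```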
12. **Hyperparameters**: Hyperparameters are parameters that are set before the learning process begins and control the learning process of a machine learning algorithm. Examples include the learning rate, number of hidden layers in a neural network, and regularization strength.
13. **Grid Search**: Grid search is a method used to tune hyperparameters by searching through a predefined grid of parameter values and selecting the best combination based on model performance. It helps in finding the optimal hyperparameters for a given algorithm.
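A grid search is just an exhaustive loop over every combination in the grid. In the sketch below the quadratic `score` function is a hypothetical stand-in for cross-validated model performance; in practice you would train and evaluate a model for each combination.

```python
import itertools

# Sketch of grid search over two hyperparameters.
grid = {"lr": [0.01, 0.1, 1.0], "reg": [0.0, 0.5, 1.0]}

def score(lr, reg):
    # Hypothetical objective that peaks at lr=0.1, reg=0.5; a real
    # implementation would return a cross-validation score here.
    return -((lr - 0.1) ** 2) - (reg - 0.5) ** 2

# Evaluate every combination and keep the best-scoring one.
best = max(
    itertools.product(grid["lr"], grid["reg"]),
    key=lambda combo: score(*combo),
)
```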
14. **Random Forest**: Random forest is an ensemble learning technique that builds multiple decision trees during training and combines their predictions to improve accuracy and generalization. It is robust to overfitting and performs well on a wide range of datasets.
15. **Support Vector Machine (SVM)**: Support Vector Machine is a supervised learning algorithm that classifies data by finding the hyperplane that best separates different classes in a high-dimensional space. It is effective for both linear and non-linear classification tasks.
16. **Neural Network**: A neural network is a machine learning model loosely inspired by the structure of biological neural networks in the brain. It consists of interconnected layers of nodes (neurons) that process input data and learn to make predictions through training.
17. **Deep Learning**: Deep learning is a subfield of machine learning that focuses on neural networks with multiple hidden layers (deep neural networks). It has revolutionized various domains, including computer vision, natural language processing, and speech recognition.
18. **Convolutional Neural Network (CNN)**: Convolutional Neural Network is a type of deep neural network designed for processing structured grid-like data, such as images. It uses convolutional layers to extract features and pooling layers to reduce spatial dimensions.
19. **Recurrent Neural Network (RNN)**: Recurrent Neural Network is a type of neural network designed for sequential data, such as time series or natural language. It has feedback connections that allow it to capture temporal dependencies and make predictions based on previous inputs.
20. **Autoencoder**: Autoencoder is a type of neural network that learns to reconstruct input data by compressing it into a lower-dimensional representation (encoder) and then decoding it back to the original input (decoder). It is used for dimensionality reduction and unsupervised learning tasks.
21. **Natural Language Processing (NLP)**: Natural Language Processing is a field of artificial intelligence that focuses on enabling computers to understand, process, and generate human language. It is used in various applications, including sentiment analysis, machine translation, and chatbots.
22. **Transfer Learning**: Transfer learning is a machine learning technique where a model trained on one task is adapted to perform a different but related task. It leverages knowledge learned from one domain to improve performance in another domain with limited labeled data.
23. **Generative Adversarial Network (GAN)**: Generative Adversarial Network is a type of neural network architecture that consists of two networks, a generator and a discriminator, which are trained adversarially. The generator learns to generate realistic data, while the discriminator learns to distinguish between real and fake data.
24. **Cluster Analysis**: Cluster analysis is a method used to group similar data points together based on their characteristics or features. It is commonly used in unsupervised learning to discover hidden patterns in the data and segment it into meaningful clusters.
25. **Dimensionality Reduction**: Dimensionality reduction is a technique used to reduce the number of features in a dataset while preserving the most important information. It helps in visualizing high-dimensional data, speeding up learning algorithms, and improving model performance.
26. **Principal Component Analysis (PCA)**: Principal Component Analysis is a popular dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space by finding orthogonal components that capture the most variance in the data. It is used for data visualization and noise reduction.
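One standard way to compute PCA is via the singular value decomposition of the centred data, as sketched below; the synthetic dataset is made up so that most of the variance lies along one direction.

```python
import numpy as np

# PCA sketch via SVD: centre the data, take the right singular vectors
# as principal components, and project onto the first one.

rng = np.random.default_rng(0)
# 100 points stretched far more along one axis than the other, so the
# first principal component should capture most of the variance.
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt                      # rows are principal directions
explained_var = S**2 / (len(X) - 1)  # variance captured per component

X_1d = X_centered @ components[0]    # 2-D data reduced to 1-D
```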
27. **K-Means Clustering**: K-Means Clustering is a popular clustering algorithm that partitions data into k clusters by iteratively assigning data points to the nearest centroid and updating the centroids based on the mean of the assigned points. It is simple, efficient, and widely used in practice.
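The two alternating steps of k-means (assign points to the nearest centroid, then move each centroid to the mean of its points) can be sketched directly; the two well-separated blobs below are toy data for illustration.

```python
import numpy as np

# Minimal k-means sketch: alternate assignment and update steps.

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids at k distinct random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs of 20 points each.
X = np.vstack([np.zeros((20, 2)), np.full((20, 2), 10.0)])
labels, centroids = kmeans(X, k=2)
```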
28. **Ensemble Learning**: Ensemble learning is a machine learning technique that combines multiple models (learners) to improve predictive performance. It leverages the wisdom of crowds by aggregating the predictions of individual models through voting or averaging.
29. **Bagging**: Bagging (Bootstrap Aggregating) is an ensemble learning technique that trains multiple models on random subsets of the training data with replacement and combines their predictions through averaging or voting. It helps in reducing variance and improving model robustness.
30. **Boosting**: Boosting is an ensemble learning technique that trains multiple weak learners sequentially, where each learner corrects the errors of its predecessor. It focuses on difficult-to-classify instances and combines the predictions to create a strong learner with high accuracy.
31. **Hyperparameter Tuning**: Hyperparameter tuning is the process of optimizing hyperparameters to improve the performance of a machine learning model. It involves searching through a hyperparameter space, evaluating different combinations, and selecting the best set of hyperparameters.
32. **Precision and Recall**: Precision and recall are evaluation metrics used to assess the performance of classification models. Precision measures the proportion of true positive predictions among all positive predictions, while recall measures the proportion of true positive predictions among all actual positive instances.
33. **F1 Score**: F1 Score is a metric that combines precision and recall into a single value to provide a balanced evaluation of a classification model. It is the harmonic mean of precision and recall, giving equal weight to both metrics.
34. **Confusion Matrix**: Confusion Matrix is a table that visualizes the performance of a classification model by comparing actual class labels with predicted class labels. For binary classification it consists of four cells: true positives, false positives, true negatives, and false negatives, which are used to calculate various evaluation metrics.
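Precision, recall, and F1 can all be computed directly from the four confusion-matrix counts; the counts below are made-up numbers for illustration.

```python
# Evaluation metrics from confusion-matrix counts (hypothetical values).
tp, fp, fn, tn = 40, 10, 20, 30

precision = tp / (tp + fp)   # fraction of positive predictions that are correct
recall = tp / (tp + fn)      # fraction of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```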
35. **ROC Curve**: ROC (Receiver Operating Characteristic) Curve is a graphical representation of the tradeoff between true positive rate (sensitivity) and false positive rate (1-specificity) at different classification thresholds. It helps in visualizing the performance of binary classification models.
36. **Area Under the Curve (AUC)**: Area Under the Curve is a metric that quantifies the overall performance of a classification model based on the ROC curve. It represents the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example.
37. **Mean Squared Error (MSE)**: Mean Squared Error is a common loss function used in regression tasks to measure the average squared difference between predicted and actual values. It penalizes large errors more than small errors, making it sensitive to outliers.
38. **Root Mean Squared Error (RMSE)**: Root Mean Squared Error is the square root of the mean squared error, providing a more interpretable measure of error in regression tasks. It is commonly used to assess the accuracy of predictive models.
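MSE and RMSE follow directly from their definitions; the target and predicted values below are hypothetical.

```python
import math

# MSE: mean of squared errors; RMSE: its square root, in the same
# units as the target variable.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
rmse = math.sqrt(mse)
```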
39. **Bias**: Bias is the error introduced by a model when it makes overly simplistic assumptions about the underlying patterns in the data. High bias can lead to underfitting, where the model fails to capture the true relationship between input and output.
40. **Variance**: Variance is the error introduced by a model due to its sensitivity to fluctuations in the training data. High variance can lead to overfitting, where the model performs well on training data but fails to generalize to unseen data.
41. **Regularization**: Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function that discourages large weights or complex models. It helps in simplifying the model and improving its generalization to unseen data.
42. **Dropout**: Dropout is a regularization technique used in neural networks to prevent overfitting by randomly deactivating a fraction of neurons during training. It helps in reducing co-adaptation of neurons and improving the robustness of the model.
43. **Batch Normalization**: Batch Normalization is a technique used to improve the training of deep neural networks by normalizing the input to each layer. It helps in stabilizing training, reducing internal covariate shift, and accelerating convergence.
44. **Gradient Descent**: Gradient Descent is an optimization algorithm used to minimize the loss function and update the parameters of a machine learning model iteratively. It calculates the gradient of the loss function with respect to the parameters and moves in the direction of steepest descent.
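The update rule of gradient descent can be shown on a simple one-dimensional loss whose gradient is known in closed form; the loss and learning rate here are chosen purely for illustration.

```python
# Gradient descent on L(w) = (w - 3)^2, whose gradient is
# dL/dw = 2 * (w - 3). The minimum is at w = 3.

w = 0.0            # initial parameter value
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w -= learning_rate * grad   # step in the direction of steepest descent
# After enough iterations, w has converged close to the minimum at 3.
```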
45. **Stochastic Gradient Descent (SGD)**: Stochastic Gradient Descent is a variant of gradient descent that updates the model parameters using a single randomly chosen example, or a small random subset (mini-batch), of the training data at each iteration. It helps in speeding up training and handling large datasets.
46. **Learning Rate**: Learning Rate is a hyperparameter that controls the step size of parameter updates during optimization. It determines how quickly the model converges to the optimal solution and plays a crucial role in the training process.
47. **Activation Function**: Activation Function is a non-linear function applied to the output of a neuron in a neural network to introduce non-linearity and enable the network to learn complex patterns. Common activation functions include ReLU, Sigmoid, and Tanh.
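The three activation functions named above can be written in a few lines each:

```python
import math

# Common activation functions applied element-wise in neural networks.

def relu(x):
    # Rectified Linear Unit: zero for negative inputs, identity otherwise.
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real input into (-1, 1).
    return math.tanh(x)
```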
48. **Loss Function**: Loss Function is a function that quantifies the error between predicted and actual values in a machine learning model. It guides the optimization process by providing a measure of how well the model is performing on the training data.
49. **Cross-Entropy Loss**: Cross-Entropy Loss is a popular loss function used in classification tasks to measure the difference between predicted and actual class probabilities. It is particularly effective for multi-class classification and is commonly used in neural networks.
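For a one-hot true label, cross-entropy reduces to the negative log-probability the model assigns to the correct class, so confident wrong predictions are penalized heavily. A minimal sketch, with made-up probability vectors:

```python
import math

# Cross-entropy between a one-hot label and predicted probabilities:
# -sum_i y_i * log(p_i). A small eps guards against log(0).

def cross_entropy(y_true, y_prob, eps=1e-12):
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_prob))

# High probability on the correct class -> small loss.
confident_right = cross_entropy([0, 1, 0], [0.05, 0.90, 0.05])
# High probability on the wrong class -> large loss.
confident_wrong = cross_entropy([0, 1, 0], [0.90, 0.05, 0.05])
```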
50. **Backpropagation**: Backpropagation is an algorithm used to train neural networks by computing the gradients of the loss function with respect to the model parameters. It propagates the error back through the network, updating the weights to minimize the loss.
51. **Batch Size**: Batch Size is the number of data points used in each iteration of training a machine learning model. It affects the speed of convergence, memory requirements, and the quality of parameter updates during optimization.
52. **Epoch**: Epoch is a single pass through the entire training dataset during the training of a machine learning model. Training is typically done over multiple epochs to allow the model to learn from the data and improve its performance iteratively.
53. **Early Stopping**: Early Stopping is a regularization technique used to prevent overfitting by monitoring the model's performance on a validation set and stopping training when the performance starts to degrade. It helps in finding the optimal number of training epochs.
54. **Data Augmentation**: Data Augmentation is a technique used to increase the size of the training dataset by applying transformations such as rotation, flipping, and scaling to the existing data. It helps in improving model generalization and robustness.
55. **One-Hot Encoding**: One-Hot Encoding is a technique used to represent categorical variables as binary vectors where each category is encoded as a binary value. It is commonly used in machine learning models that require numerical input data.
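One-hot encoding maps each category to its own position in a binary vector, as in this small sketch (categories are sorted alphabetically here, which is one arbitrary but reproducible choice):

```python
# One-hot encoding of a categorical variable: each category becomes a
# binary vector with a single 1 in its own position.

def one_hot(values):
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

encoded = one_hot(["A", "C", "B", "A"])
```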
56. **Feature Scaling**: Feature Scaling is a preprocessing technique used to standardize or normalize the range of independent variables in a dataset. It helps in improving the convergence of optimization algorithms and preventing features with large scales from dominating the model.
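Standardization, one common form of feature scaling, subtracts the mean and divides by the standard deviation so the feature ends up with mean 0 and unit variance:

```python
import statistics

# Standardization: (x - mean) / std for each value of a feature.

def standardize(xs):
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)   # population standard deviation
    return [(x - mu) / sigma for x in xs]

scaled = standardize([10.0, 20.0, 30.0, 40.0])
```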
57. **Imbalanced Data**: Imbalanced Data refers to a situation where one class of data samples dominates the dataset, leading to biased model performance. It is common in binary classification tasks where the positive class is rare compared to the negative class.
58. **SMOTE (Synthetic Minority Over-sampling Technique)**: SMOTE is an oversampling technique used to address class imbalance by generating synthetic samples for the minority class based on the feature space similarity of existing samples. It helps in improving the performance of models on imbalanced data.
59. **Policy Gradient**: Policy Gradient is a reinforcement learning technique used to train agents to learn a policy that maps observations to actions. It optimizes the policy directly by maximizing the expected cumulative reward through gradient ascent.
60. **Q-Learning**: Q-Learning is a model-free reinforcement learning algorithm that learns an action-value function (Q-function) to estimate the expected cumulative reward of taking a specific action in a given state. It uses the Bellman equation to update Q-values iteratively.
61. **Deep Q-Network (DQN)**: Deep Q-Network is a deep reinforcement learning algorithm that combines Q-Learning with deep neural networks to approximate the Q-function. It uses experience replay and target networks to stabilize training and improve sample efficiency.
62. **Exploration-Exploitation Tradeoff**: Exploration-Exploitation Tradeoff is a fundamental concept in reinforcement learning that deals with the balance between exploring new actions and exploiting known actions. It aims to maximize cumulative reward by finding the optimal strategy.
63. **Markov Decision Process (MDP)**: Markov Decision Process is a mathematical framework used to model sequential decision-making problems in reinforcement learning. It consists of states, actions, transition probabilities, rewards, and a discount factor to optimize the agent's policy.
64. **Monte Carlo Method**: Monte Carlo Method is a simulation technique used in reinforcement learning to estimate the value of states or actions by averaging the returns of multiple episodes. It samples trajectories from the environment to update value estimates.
65. **Temporal Difference Learning**: Temporal Difference Learning is a reinforcement learning algorithm that updates value estimates based on the difference between predicted and actual rewards at each time step. It combines ideas from dynamic programming and Monte Carlo methods.
66. **Deep Reinforcement Learning**: Deep Reinforcement Learning is a subfield of reinforcement learning that combines deep learning with reinforcement learning to solve complex decision-making tasks. It uses deep neural networks to approximate value functions or policies.
67. **Policy Gradient Methods**: Policy Gradient Methods are a family of reinforcement learning algorithms that directly optimize the policy by maximizing the expected cumulative reward. They use gradient ascent to update the policy parameters based on rewards received.
68. **Actor-Critic Method**: Actor-Critic Method is a reinforcement learning algorithm that combines the benefits of both policy gradient and value-based methods. It consists of two components: an actor that learns the policy and a critic that estimates the value function to guide the actor.
69. **Proximal Policy Optimization (PPO)**: Proximal Policy Optimization is a policy gradient method that addresses the problem of instability in reinforcement learning by constraining the policy updates to prevent large changes. It uses clipped surrogate objectives to update the policy.
70. **Deep Deterministic Policy Gradient (DDPG)**: Deep Deterministic Policy Gradient is an actor-critic algorithm that extends policy gradient methods to continuous action spaces. It uses a deterministic actor and target networks to learn a policy and value function for continuous control.
71. **Multi-Armed Bandit**: Multi-Armed Bandit is a classic reinforcement learning problem where an agent must decide which arm (action) to pull to maximize cumulative reward over time. It balances exploration and exploitation to find the arm with the highest expected reward.
72. **Contextual Bandit**: Contextual Bandit is an extension of the multi-armed bandit problem where the rewards for each arm depend on a context or state. The agent must learn a policy that maps contexts to actions to maximize cumulative reward.
73. **Bayesian Optimization**: Bayesian Optimization is a sequential model-based optimization technique used to find the optimal set of hyperparameters for machine learning models. It uses a probabilistic model of the objective function to guide the search efficiently.
74. **Gaussian Process**: Gaussian Process is a flexible probabilistic model used in Bayesian Optimization to model the objective function and its uncertainty. It captures a distribution over functions and provides posterior estimates of the objective, with uncertainty, that guide the search for optimal hyperparameters.
75. **Thompson Sampling**: Thompson Sampling is a Bayesian bandit algorithm that balances exploration and exploitation by sampling arm selections from the posterior distribution over rewards. It selects arms based on the probability of being the best arm.
76. **Meta-Learning**: Meta-Learning, often described as "learning to learn," is a subfield of machine learning in which models are trained across many tasks so that they can adapt quickly to new, related tasks with little data.
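To ground the reinforcement-learning ideas above, here is a tabular Q-learning sketch on a tiny made-up chain environment: states 0 to 3, actions left and right, with a reward of 1 for reaching the goal state. The environment, parameters, and tie-breaking rule are all illustrative assumptions.

```python
import random

# Tabular Q-learning with epsilon-greedy exploration on a 4-state chain.
N_STATES, GOAL = 4, 3
ACTIONS = (0, 1)                    # 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(N_STATES)]

random.seed(0)
for _ in range(500):                # run many episodes
    s = 0
    while s != GOAL:
        # Epsilon-greedy: explore occasionally, otherwise act greedily
        # (ties broken toward action 1).
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning update toward the Bellman target
        # r + gamma * max_a' Q(s', a').
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
# The learned Q-values should prefer moving right in every state.
```

Reading off `max(Q[s])` per state recovers the greedy policy, and the right-action values decay by the discount factor with distance from the goal.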
Key takeaways
- In this course, Graduate Certificate in Machine Learning for Genomic Data, you will encounter a wide range of key terms and vocabulary that are essential for understanding the fundamental concepts of machine learning.
- **Machine Learning**: Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed.
- **Genomic Data**: Genomic data refers to the vast amount of information stored in an individual's genome, including DNA sequences, gene expressions, and genetic variations.
- **Supervised Learning**: Supervised learning is a type of machine learning where the model is trained on labeled data, meaning the input data is paired with the correct output.
- **Unsupervised Learning**: Unsupervised learning is a type of machine learning where the model is trained on unlabeled data, with the goal of discovering hidden patterns or structures, such as clustering similar data points together.
- **Reinforcement Learning**: Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment.
- **Feature Engineering**: Feature engineering is the process of selecting, transforming, and creating new features from raw data to improve the performance of machine learning models.