Data Preprocessing and Feature Engineering
Expert-defined terms from the Professional Certificate in AI for Chemical Process Engineering course at London School of Planning and Management.
Data Preprocessing
Data preprocessing is the initial step in the data analysis pipeline where raw data is cleaned and transformed into a form suitable for modeling.
This process involves handling missing values, removing outliers, scaling features, encoding categorical variables, and splitting the data into training and testing sets. Data preprocessing is crucial for ensuring the quality and reliability of the data before feeding it into machine learning models.
Feature Engineering
Feature engineering is the process of selecting, transforming, and creating new features from raw data to improve model performance.
This step involves extracting relevant information from the dataset, combining existing features to create new ones, and selecting the most important features for the model. Feature engineering plays a critical role in enhancing the predictive power of machine learning models and can significantly impact their performance.
Missing Values
Missing values refer to the absence of data in a dataset, which can occur due to collection errors, sensor failures, or incomplete records.
Handling missing values is a crucial step in data preprocessing as they can negatively impact the performance of machine learning models. Common techniques for dealing with missing values include imputation, deletion, or using predictive models to estimate missing values.
Outliers
Outliers are data points that deviate significantly from the rest of the data in a dataset.
These anomalies can distort the analysis and modeling process, leading to inaccurate results. Outliers can be detected using statistical methods such as z-scores, box plots, or clustering algorithms. Handling outliers is essential in data preprocessing to ensure the robustness and reliability of machine learning models.
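As a concrete illustration, the sketch below applies the 1.5 × IQR fences that underlie box plots; the sensor readings and the fence multiplier are illustrative assumptions, not course data.

```python
import numpy as np

# Hypothetical reactor temperature readings with one suspect value.
readings = np.array([78.1, 79.4, 77.9, 78.6, 120.5, 78.8, 79.0])

q1, q3 = np.percentile(readings, [25, 75])
iqr = q3 - q1                                  # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # box-plot fences
outliers = readings[(readings < lower) | (readings > upper)]
print(outliers)  # [120.5]
```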
Scaling
Scaling is the process of standardizing the range of features in a dataset to ensure that each feature contributes on a comparable scale.
Common scaling techniques include min-max scaling, z-score normalization, and robust scaling. Scaling is essential in feature engineering to prevent certain features from dominating the model and to improve the convergence speed and accuracy of machine learning algorithms.
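A minimal sketch of the first two techniques, assuming scikit-learn is available; the two-feature array is a made-up example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical columns: temperature and pressure on very different scales.
X = np.array([[80.0, 101.3], [95.0, 150.0], [60.0, 90.0]])

X_minmax = MinMaxScaler().fit_transform(X)    # each column rescaled to [0, 1]
X_zscore = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
```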
Encoding
Encoding is the process of converting categorical variables into numerical format so that machine learning algorithms can process them.
Common encoding techniques include one-hot encoding, label encoding, and target encoding. Encoding is a crucial step in data preprocessing to handle categorical variables and ensure that the model can interpret and learn from these features effectively.
Training Set
The training set is a subset of the dataset that is used to train a machine learning model. It contains input features and corresponding target labels from which the model learns to make predictions, and it is used to optimize the model's parameters before the model is applied to unseen data.
Testing Set
The testing set is a subset of the dataset that is used to evaluate the performance of a trained machine learning model. It contains input features and target labels that are withheld during training and used to assess the model's predictive accuracy on unseen data. The testing set helps measure the generalization ability of the model and reveal any overfitting or underfitting issues.
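A minimal splitting sketch with scikit-learn; the synthetic feature matrix, the 80/20 split, and the fixed seed are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 4 features and a binary target.
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# Hold out 20% of the samples as the testing set, shuffled reproducibly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```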
Imputation
Imputation is the process of filling in missing values in a dataset using statistical estimates or predictive models.
Common imputation methods include mean, median, mode imputation, k-nearest neighbors imputation, and regression imputation. Imputation is a critical step in data preprocessing to ensure that the dataset is complete and suitable for machine learning analysis.
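For instance, median imputation takes one line with scikit-learn; the column values below are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A single feature column with one missing entry marked as NaN.
X = np.array([[1.0], [np.nan], [3.0], [4.0]])

# Replace NaN with the median of the observed values (here, 3.0).
X_filled = SimpleImputer(strategy='median').fit_transform(X)
```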
One-Hot Encoding
One-hot encoding is a technique used to convert categorical variables into binary format by creating dummy variables for each category. Each category is represented by a binary vector where one element is set to 1 and the rest are set to 0. One-hot encoding is commonly used in feature engineering to handle categorical variables and prevent the model from assuming ordinal relationships between categories.
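A short sketch using pandas; the catalyst column is a hypothetical example.

```python
import pandas as pd

# Hypothetical categorical column of catalyst types.
df = pd.DataFrame({'catalyst': ['zeolite', 'platinum', 'zeolite', 'nickel']})

# One 0/1 column per category; exactly one element is 1 in each row.
dummies = pd.get_dummies(df['catalyst'], prefix='catalyst')
print(dummies)
```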
Label Encoding
Label encoding is a technique used to convert categorical variables into numerical values by assigning a unique integer to each category.
Label encoding is suitable for ordinal variables where the order of categories is meaningful. However, label encoding may introduce unintended ordinal relationships between categories, which can impact the performance of machine learning models.
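A minimal sketch with scikit-learn's LabelEncoder; the severity levels are made up. Note that LabelEncoder assigns integers in alphabetical order, so an explicit mapping (or scikit-learn's OrdinalEncoder with a specified category order) is safer when the order matters.

```python
from sklearn.preprocessing import LabelEncoder

levels = ['low', 'medium', 'high', 'medium']
encoded = LabelEncoder().fit_transform(levels)
print(encoded)  # [1 2 0 2]: integers follow alphabetical, not logical, order
```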
Feature Selection
Feature selection is the process of choosing the most relevant features from a dataset for model training.
Feature selection techniques include filter methods, wrapper methods, and embedded methods. Feature selection helps reduce overfitting, improve model interpretability, and enhance computational efficiency by focusing on the most informative features.
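As an example of a filter method, the sketch below keeps the three features with the highest ANOVA F-scores; the synthetic dataset is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, only 3 of which carry signal.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Filter method: score each feature independently, keep the top 3.
X_selected = SelectKBest(score_func=f_classif, k=3).fit_transform(X, y)
print(X_selected.shape)  # (200, 3)
```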
Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of features in a dataset while preserving as much of its meaningful structure as possible.
Common dimensionality reduction techniques include principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and linear discriminant analysis (LDA). Dimensionality reduction helps alleviate the curse of dimensionality, improve model performance, and enhance visualization of high-dimensional data.
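A minimal PCA sketch on the bundled iris dataset, projecting four features down to two principal components.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 4 features per sample

# Project onto the 2 directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```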
Regularization
Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty term to the loss function.
Common regularization methods include L1 regularization (Lasso), L2 regularization (Ridge), and elastic net regularization. Regularization helps control the complexity of the model, promote feature sparsity, and improve its generalization ability on unseen data.
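The sketch below fits ridge and lasso regressions on synthetic data; the alpha values (penalty strengths) are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data where only the first feature matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty can zero out coefficients
print(lasso.coef_)  # coefficients on the irrelevant features tend toward 0
```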
Cross-Validation
Cross-validation is a statistical technique used to assess the performance of machine learning models by partitioning the dataset into multiple subsets. Common cross-validation methods include k-fold cross-validation, leave-one-out cross-validation, and stratified cross-validation. Cross-validation helps estimate the model's predictive performance, identify overfitting issues, and optimize hyperparameters without biasing the results.
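A 5-fold example with scikit-learn; the iris dataset and logistic regression model are stand-ins for any estimator.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, score on the held-out fold, rotate 5 times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average accuracy across folds
```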
Hyperparameters
Hyperparameters are parameters that are set before training a machine learning model and that control the learning process itself.
Examples of hyperparameters include the learning rate, regularization strength, number of hidden layers, and activation functions. Hyperparameters are tuned through grid search, random search, or Bayesian optimization to optimize the model's performance on the validation set.
Overfitting
Overfitting occurs when a machine learning model performs well on the training set but poorly on unseen data.
Overfitting is caused by the model capturing noise and irrelevant patterns in the training data, leading to poor generalization. Common techniques to prevent overfitting include regularization, cross-validation, early stopping, and reducing model complexity.
Underfitting
Underfitting occurs when a machine learning model is too simple to capture the underlying patterns in the data.
Underfitting can be mitigated by increasing model complexity, adding more features, or reducing regularization. Balancing model complexity is essential to prevent underfitting and achieve optimal performance.
Curse of Dimensionality
The curse of dimensionality refers to the challenges and limitations that arise when analyzing data in high-dimensional spaces.
High-dimensional data requires exponentially more samples to cover the feature space adequately, leading to sparsity and increased computational complexity. Techniques such as dimensionality reduction, feature selection, and regularization are used to mitigate the curse of dimensionality and improve the performance of machine learning models.
Feature Importance
Feature importance is a measure of the contribution of each feature to the predictions of a machine learning model.
Common methods for assessing feature importance include mean decrease impurity, permutation importance, and SHAP values. Understanding feature importance helps identify the most informative features, interpret model predictions, and improve the overall performance of the model.
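The sketch below shows both an impurity-based measure and permutation importance on the iris dataset; in practice permutation importance should be computed on held-out data, which is omitted here for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Mean decrease in impurity, reported directly by tree ensembles.
print(model.feature_importances_)

# Permutation importance: score drop when one feature is shuffled.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```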
Feature Extraction
Feature extraction is the process of transforming raw data into a more compact representation that captures its essential information.
Common feature extraction techniques include principal component analysis (PCA), linear discriminant analysis (LDA), and autoencoders. Feature extraction helps reduce noise, improve computational efficiency, and enhance the interpretability of machine learning models.
Feature Transformation
Feature transformation is the process of converting the original features in a dataset into a new form that is better suited to the learning algorithm.
Feature transformation helps improve the distribution of features, reduce skewness, and make the data more suitable for machine learning algorithms. Feature transformation plays a crucial role in data preprocessing and feature engineering to enhance the model's performance.
End-to-End Machine Learning Pipeline
An end-to-end machine learning pipeline is a sequence of steps that encompass data collection, preprocessing, feature engineering, model training, evaluation, and deployment. Building an end-to-end machine learning pipeline involves integrating various components to automate the process from data ingestion to model deployment. End-to-end pipelines streamline the machine learning workflow, improve reproducibility, and accelerate the development of machine learning applications.
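The training-and-inference portion of such a pipeline can be expressed directly in scikit-learn; the column names below are hypothetical placeholders.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical columns; adapt to the actual dataset.
numeric = ['temperature', 'pressure']
categorical = ['catalyst']

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
])

# Chaining preprocessing and model guarantees the same steps run
# at training time and at prediction time.
pipeline = Pipeline([('prep', preprocess),
                     ('model', LogisticRegression(max_iter=1000))])
# pipeline.fit(X_train, y_train); pipeline.predict(X_new)
```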
Transfer Learning
Transfer learning is a machine learning technique that leverages knowledge from a model trained on one task to improve performance on a related task.
Transfer learning involves fine-tuning the parameters of a pre-trained model on a new dataset or extracting features from intermediate layers. Transfer learning helps improve model performance, reduce training time, and address data scarcity in specific domains.
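A minimal fine-tuning sketch, assuming TensorFlow/Keras and the ImageNet-pretrained MobileNetV2 that ships with Keras; the input size and binary head are assumptions.

```python
import tensorflow as tf

# Load a pretrained feature extractor without its classification head.
base = tf.keras.applications.MobileNetV2(include_top=False,
                                         weights='imagenet',
                                         input_shape=(160, 160, 3))
base.trainable = False  # freeze pretrained weights; train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # new binary classifier
])
```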
Batch Normalization
Batch normalization is a technique used to normalize the activations of each layer in a neural network across a mini-batch. Batch normalization helps stabilize the training process, reduce internal covariate shift, and improve the convergence speed of deep neural networks. It is typically inserted between a layer's linear transformation and its activation function, although applying it after the activation is also seen in practice.
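In Keras terms, that placement looks like the sketch below; the layer sizes are arbitrary.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, use_bias=False),  # bias is redundant before BN
    tf.keras.layers.BatchNormalization(),       # normalize over the mini-batch
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(1),
])
```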
Data Augmentation
Data augmentation is a technique used to artificially increase the size of a training dataset by applying label-preserving transformations to existing samples.
Data augmentation helps improve model generalization, prevent overfitting, and enhance the robustness of deep learning models to variations in the input data. Data augmentation is widely used in computer vision and natural language processing tasks.
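For images, recent Keras versions ship augmentation layers that run on the fly during training; the specific transforms and their ranges below are assumptions.

```python
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),  # mirror images left-right
    tf.keras.layers.RandomRotation(0.05),      # rotate up to +/-5% of a turn
    tf.keras.layers.RandomZoom(0.1),           # zoom in or out by up to 10%
])
# Typically placed at the front of a model so it is active only in training.
```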
Recurrent Neural Networks (RNN)
Recurrent Neural Networks (RNN) are a class of neural networks designed to handle sequential data by maintaining an internal state across time steps.
RNNs are well-suited for tasks such as time series forecasting, speech recognition, and natural language processing. However, RNNs suffer from vanishing and exploding gradient problems, which limit their ability to capture long-range dependencies.
Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) is a variant of recurrent neural networks that addresses the issue of capturing long-range dependencies in sequential data. LSTMs use a gating mechanism to control the flow of information through the network, allowing them to retain information over long periods. LSTMs are widely used in applications that require modeling temporal dependencies, such as speech recognition, machine translation, and sentiment analysis.
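A minimal Keras sketch of an LSTM forecaster; the window of 30 time steps with 4 measurements each is an assumed shape.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(30, 4)),  # 30 time steps, 4 features per step
    tf.keras.layers.LSTM(64),       # gated recurrent layer with 64 units
    tf.keras.layers.Dense(1),       # e.g. a one-step-ahead forecast
])
model.compile(optimizer='adam', loss='mse')
```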
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN) are a class of neural networks designed to process grid-structured data such as images.
CNNs use convolutional layers to extract hierarchical features from the input data and pooling layers to reduce spatial dimensions. CNNs are widely used in computer vision tasks such as object detection, image classification, and image segmentation due to their ability to capture spatial patterns effectively.
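A compact Keras classifier illustrating the convolution-pooling pattern; the 28 x 28 grayscale input and 10 classes mirror an MNIST-style task and are assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),  # learn local filters
    tf.keras.layers.MaxPooling2D(),                    # halve spatial size
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),   # class probabilities
])
```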
Autoencoders
Autoencoders are neural networks designed to learn efficient representations of data in an unsupervised manner.
Autoencoders consist of an encoder that compresses the input into a latent space representation and a decoder that reconstructs the input from the latent space. Autoencoders are used for dimensionality reduction, data denoising, and feature learning in unsupervised learning tasks.
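A minimal dense autoencoder sketch; compressing 20 input features through a 3-dimensional bottleneck is an arbitrary illustrative choice.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation='relu'),  # encoder
    tf.keras.layers.Dense(3, activation='relu'),   # latent bottleneck
    tf.keras.layers.Dense(16, activation='relu'),  # decoder
    tf.keras.layers.Dense(20),                     # reconstruction
])
model.compile(optimizer='adam', loss='mse')  # trained to reproduce its input
```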
Generative Adversarial Networks (GAN)
Generative Adversarial Networks (GAN) are a class of neural networks that consist of two competing components: a generator and a discriminator.
The generator generates synthetic data samples, while the discriminator distinguishes between real and generated samples. GANs are used for generating realistic images, text, and audio data, as well as for data augmentation and unsupervised representation learning.
Hyperparameter Tuning
Hyperparameter tuning is the process of optimizing the hyperparameters of a machine learning model to maximize its predictive performance.
Common methods for hyperparameter tuning include grid search, random search, Bayesian optimization, and genetic algorithms. Hyperparameter tuning helps find the optimal set of hyperparameters that maximize the model's predictive accuracy and generalization ability.
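A grid-search example with scikit-learn; the SVM estimator and the candidate values for C and gamma are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Evaluate every combination in the grid with 5-fold cross-validation.
grid = GridSearchCV(SVC(),
                    param_grid={'C': [0.1, 1, 10], 'gamma': ['scale', 0.01]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```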
Gradient Descent
Gradient descent is an optimization algorithm used to update the parameters of a model during training. Gradient descent iteratively minimizes the loss function by computing the gradient of the loss with respect to the parameters and updating the parameters in the opposite direction of the gradient. Gradient descent variants include stochastic gradient descent, mini-batch gradient descent, and adaptive gradient descent algorithms.
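The update rule can be written out in a few lines; the synthetic least-squares problem, learning rate, and iteration count below are assumptions.

```python
import numpy as np

# Synthetic linear regression: y = 2*x1 - 1*x2 + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)

w = np.zeros(2)
lr = 0.1                                   # learning rate (step size)
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of MSE w.r.t. w
    w -= lr * grad                         # step opposite to the gradient
print(w)  # converges near [2.0, -1.0]
```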
Backpropagation
Backpropagation is a key algorithm in training neural networks that computes the gradient of the loss function with respect to each network weight.
Backpropagation propagates the error backward through the network layers to update the weights using the chain rule of calculus. Backpropagation enables neural networks to learn complex patterns and relationships in the data by adjusting the parameters to minimize the prediction error.
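Modern frameworks implement this chain-rule bookkeeping as automatic differentiation; the tiny TensorFlow sketch below differentiates a scalar function as an illustration, not a full network.

```python
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:  # records operations for differentiation
    y = x ** 2
print(tape.gradient(y, x))       # dy/dx = 2x = 6.0
```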
Activation Function
An activation function is a mathematical function applied to the output of a neuron to introduce non-linearity into the network. Common activation functions include sigmoid, tanh, ReLU, Leaky ReLU, and softmax. Activation functions control the output range of neurons, help mitigate vanishing gradients, and improve the convergence speed of deep neural networks.
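Three of these written out directly in NumPy as a reference sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squashes values into (0, 1)

def relu(x):
    return np.maximum(0.0, x)        # zero for negatives, identity otherwise

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract max for numerical stability
    return e / e.sum()               # outputs sum to 1 (class probabilities)
```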
Loss Function
A loss function is a mathematical function that measures the discrepancy between a model's predictions and the true target values.
Common loss functions include mean squared error, cross-entropy, hinge loss, and KL divergence. Loss functions are used to quantify the model's performance during training and guide the optimization process to minimize prediction errors.
Optimization Algorithm
An optimization algorithm is a method used to minimize the loss function and update the parameters of a model during training.
Common optimization algorithms include gradient descent, stochastic gradient descent, Adam, RMSprop, and Adagrad. Optimization algorithms play a crucial role in training neural networks by efficiently updating the parameters to converge to the optimal solution.
Learning Rate
The learning rate is a hyperparameter that controls the step size of parameter updates during training.
A high learning rate can lead to overshooting and divergence, while a low learning rate can result in slow convergence. The learning rate is tuned to balance between fast convergence and stable training of the model to optimize its performance.
Epoch
An epoch is a single iteration over the entire training dataset in the training process.
Training a model involves multiple epochs, where the model learns from the training data and updates its parameters to minimize the loss function. The number of epochs is a hyperparameter that determines how many times the model sees the entire training dataset during training.
Batch Size
The batch size is the number of training samples processed in each iteration of the optimization algorithm.
Training data is divided into batches to improve computational efficiency, reduce memory requirements, and introduce stochasticity in the optimization process. The batch size is a hyperparameter that influences the model's convergence speed and generalization ability.
Dropout
Dropout is a regularization technique used to prevent overfitting in neural networks by randomly deactivating a fraction of neurons during training.
Dropout introduces noise into the network and forces neurons to learn more robust features, improving the model's generalization ability. Dropout is applied to hidden layers in neural networks to reduce co-adaptation of neurons and enhance model performance.
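In Keras, dropout is a single layer; the 50% rate below is a common but arbitrary choice.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),  # zeroes 50% of units, during training only
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
```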
Early Stopping
Early stopping is a regularization technique used to prevent overfitting in machine learning models by halting training when performance on a validation set stops improving.
Early stopping helps prevent the model from memorizing noise in the training data and improves its generalization ability on unseen data. Early stopping is a simple yet effective technique to optimize model performance.
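With Keras this is a callback; the patience of 5 epochs is an assumed setting.

```python
import tensorflow as tf

# Stop once validation loss fails to improve for 5 consecutive epochs,
# then roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True)

# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])
```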
Model Evaluation Metrics
Model evaluation metrics are measures used to assess the performance of machine learning models.
Common evaluation metrics include accuracy, precision, recall, F1 score, ROC AUC, mean squared error, and log loss. Model evaluation metrics help quantify the model's predictive accuracy, class balance, and generalization ability, enabling comparison between different models and hyperparameter settings.
Confusion Matrix
A confusion matrix is a table that visualizes the performance of a classification model by comparing predicted and actual class labels.
For a binary classifier, the confusion matrix contains four elements: true positives, false positives, true negatives, and false negatives, which are used to calculate evaluation metrics such as accuracy, precision, recall, and F1 score. The confusion matrix provides insight into the model's predictive performance across different classes.
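A small example with scikit-learn; the label vectors are illustrative.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
# [[3 1]   row 0: TN=3, FP=1
#  [1 3]]  row 1: FN=1, TP=3
```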
Precision
Precision is an evaluation metric that measures the proportion of correctly predicted positive instances among all instances predicted as positive.
Precision focuses on the accuracy of positive predictions and is calculated as the ratio of true positives to the sum of true positives and false positives. Precision is used in scenarios where minimizing false positives is critical, such as fraud detection or medical diagnosis.
Recall
Recall, also known as sensitivity or true positive rate, is an evaluation metric that measures the proportion of actual positive instances that the model correctly identifies.
Recall focuses on the completeness of positive predictions and is calculated as the ratio of true positives to the sum of true positives and false negatives. Recall is used in scenarios where detecting all positive instances is crucial, such as disease screening or anomaly detection.
F1 Score
The F1 score is the harmonic mean of precision and recall, balancing the trade-off between the two.
The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0 indicates poor model performance. The F1 score is commonly used to evaluate the overall performance of a classification model by considering both false positives and false negatives in the predictions.
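Continuing the confusion-matrix example above (TP = 3, FP = 1, FN = 1):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.75
```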
ROC Curve
The receiver operating characteristic (ROC) curve is a graphical representation of a classifier's true positive rate against its false positive rate across different decision thresholds.
The ROC curve helps visualize the trade-off between true positive and false positive rates and assess the model's performance at various decision boundaries. The area under the ROC curve (AUC) quantifies the model's discriminative power, with higher AUC values indicating better performance.
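A minimal sketch with scikit-learn; the scores are illustrative predicted probabilities.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points along the curve
print(roc_auc_score(y_true, y_score))              # 0.75 for these values
```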
Mean Squared Error (MSE)
Mean squared error (MSE) is a common loss function used to measure the average squared difference between predicted and actual values.
MSE penalizes large prediction errors more severely than small errors and is sensitive to outliers in the data. Minimizing the MSE during training helps optimize the model's parameters to reduce prediction errors and improve the model's predictive accuracy.
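Written out directly (values are illustrative); note how the single 1.5-unit error contributes most of the total because errors are squared.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (0.25 + 0.0 + 2.25 + 1.0) / 4 = 0.875
```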
Log Loss
Log loss, also known as cross-entropy loss, is a loss function used to measure the difference between the predicted probabilities and the actual class labels in a classification model. Log loss heavily penalizes confident but incorrect predictions, encouraging the model to produce well-calibrated probability estimates.
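A small sketch with scikit-learn; the final sample, a true positive predicted at only 0.1, dominates the loss.

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.1, 0.8, 0.1]    # predicted probability of class 1
print(log_loss(y_true, y_prob))  # roughly 0.68, mostly from the last sample
```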