Neural Networks

Neural Networks are the backbone of many machine learning and deep learning models, designed to recognize patterns and make predictions. Here's a breakdown of key types and concepts:

CNN (Convolutional Neural Networks):
- Primarily used for image processing.
- A layered approach that uses convolutional layers, pooling layers, and fully connected layers.
- Feature detection: Filters are applied to input images to extract important features.
- Commonly used in:
  - Image classification
  - Object detection
  - Image segmentation (e.g., autonomous driving)
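
The feature-detection step above can be sketched in a few lines of plain Python. This is a minimal illustration, not how a deep learning library implements it: the function name `convolve2d`, the 4x4 image, and the hand-picked edge filter are all assumptions made for the example (a real CNN learns its filter values during training).

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Multiply the kernel element-wise with the patch under it, then sum.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map

# Hypothetical 4x4 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked vertical-edge filter, used here only for illustration.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, kernel)
```

The feature map is nonzero only in the middle column, where the dark-to-bright edge sits, which is the sense in which a filter "detects" a feature.
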
RNN (Recurrent Neural Networks):
- Specializes in sequential data (e.g., time-series data or text).
- RNNs maintain a memory of past inputs, which helps them understand context over time.
- Common use cases include:
  - Stock predictions
  - Machine translation (e.g., translating text from one language to another)
  - Speech recognition and audio processing
- Variants that mitigate the fading of information from earlier time steps:
  - LSTM (Long Short-Term Memory): A more advanced version of RNNs, capable of capturing long-range dependencies.
  - GRU (Gated Recurrent Unit): A simpler version of LSTM, but still capable of handling long-range dependencies while being computationally more efficient.
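
The "memory of past inputs" can be illustrated with a single-unit recurrent cell. This is a simplified sketch: the function name `rnn_forward` and the weight values are assumptions, and real RNNs use weight matrices and learn them during training.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a single-unit RNN over a sequence and return every hidden state."""
    h = h0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Only the first input is nonzero; later states still carry a trace of it.
states = rnn_forward([1.0, 0.0, 0.0], w_x=0.5, w_h=0.8, b=0.0)
```

The hidden state shrinks at each step after the first input, which is the fading that LSTM and GRU cells are designed to reduce.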

Activation Functions

Activation functions define the output of a neuron given its input signals:

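Linear: Output is proportional to the input. Stacked linear layers collapse into a single linear layer, and because the derivative is constant, backpropagation can't learn anything nonlinear
Binary Step: Either on or off
Sigmoid / Logistic / TanH: Squashes input into a bounded range (0 to 1 for sigmoid, -1 to 1 for TanH) but subject to vanishing gradient problem
Rectified Linear Unit (ReLU): Cheap to compute and doesn't saturate for positive inputs, but outputs zero (with zero gradient) for negative or zero inputs, so neurons can stop learning (the "dying ReLU" problem)
Leaky ReLU: Improves on ReLU by introducing a negative slope below zero
Parametric ReLU: Improves on ReLU by introducing a negative slope below zero that is learned through backpropagation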
Swish: Developed by Google, good for very deep networks
Maxout: Outputs the max of inputs
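
Several of the functions listed above are short enough to write out directly. The sketch below uses only the standard library and scalar inputs; the default `slope` of 0.01 for Leaky ReLU is an assumed, commonly used value, not something this document specifies.

```python
import math

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # `slope` is fixed here; Parametric ReLU would learn it during training.
    return x if x > 0 else slope * x

def swish(x):
    # Swish multiplies the input by its own sigmoid.
    return x * sigmoid(x)

def maxout(values):
    # Outputs the largest of its inputs.
    return max(values)
```

For example, `relu(-2.0)` returns `0.0` while `leaky_relu(-2.0)` returns a small negative value, so the leaky variant keeps a nonzero gradient below zero.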

Softmax

Function used on the final output layer of a classification network to convert raw outputs (logits) into probabilities that sum to 1.
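
A minimal sketch of softmax, assuming a plain list of raw output scores; the example scores are made up. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
```

The outputs are all positive, sum to 1, and preserve the ranking of the input scores.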

Learning Rate

Neural networks are trained with gradient descent: we start from randomly initialized weights and repeatedly adjust them in the direction that reduces a cost function, over many epochs. The learning rate is the size of each adjustment step. Too high a learning rate means overshooting the minimum, whereas too low a learning rate means training takes too long. The batch size is how many training samples are used for each weight update (an epoch is one full pass over the training data). Smaller batch sizes add noise to the updates, which can help training escape local minima. Larger batch sizes can get stuck and converge on the wrong solution, seemingly at random.
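
The effect of the learning rate can be seen on a toy problem. The sketch below minimizes f(w) = w^2 (gradient 2w) rather than a real network's cost function; the function name, starting point, and step count are assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(w) = w**2, whose gradient is 2*w, from a fixed starting point."""
    w = start
    for _ in range(steps):
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * 2 * w
    return w

w_good = gradient_descent(0.1)    # shrinks toward the minimum at 0
w_slow = gradient_descent(0.001)  # barely moves in 20 steps
w_bad = gradient_descent(1.1)     # overshoots further on every step
```

With the moderate rate the weight ends close to 0, with the tiny rate it is still near its starting value of 5, and with the large rate it grows in magnitude instead of converging.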

Training & Tuning

Proper training and tuning are essential to building effective models that generalize well to unseen data. The following techniques help optimize model performance:

Hyperparameter Tuning Techniques

Grid Search: Exhaustively searches through a defined parameter grid. It can be very time-consuming but ensures all combinations are explored.
Random Search: Randomly samples parameter values from predefined ranges. It's more efficient than grid search but doesn't guarantee the optimal combination.
Bayesian Optimization: Treats hyperparameter tuning as a regression problem, creating a model to predict the best parameters based on past evaluations.
Hyperband: Dynamically allocates resources to the most promising hyperparameter configurations based on early performance, making it efficient for iterative training.
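
Grid search and random search differ only in how candidate settings are chosen, which a short sketch can show. Everything here is illustrative: `validation_loss` is a hypothetical stand-in for training a model and measuring its validation loss, and the parameter ranges are made up.

```python
import itertools
import random

def validation_loss(learning_rate, batch_size):
    # Hypothetical stand-in for "train a model and return its validation loss".
    return (learning_rate - 0.01) ** 2 + (batch_size - 64) ** 2 * 1e-6

grid = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}

# Grid search: evaluate every combination in the grid.
grid_trials = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]
best_grid = min(grid_trials, key=lambda p: validation_loss(**p))

# Random search: evaluate a fixed budget of randomly drawn combinations.
rng = random.Random(0)
random_trials = [
    {"learning_rate": rng.uniform(0.001, 0.1), "batch_size": rng.choice([32, 64, 128])}
    for _ in range(5)
]
best_random = min(random_trials, key=lambda p: validation_loss(**p))
```

Grid search costs 9 evaluations here and the cost multiplies with every added hyperparameter, while random search keeps a fixed budget of 5 but may miss the best combination.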

Hyperparameter Optimization

Learning rate: Determines how much to adjust the model's weights after each update during training.
- Too high → Model might overshoot optimal weights and diverge.
- Too low → Training becomes slow and may get stuck in suboptimal solutions.
- Common optimization: Start with a small learning rate and increase it gradually (warmup), or use adaptive learning rates (e.g., Adam optimizer).
Batch size: The number of training samples processed before the model is updated.
- Small batch size: Helps avoid local minima, but training can be slower.
- Large batch size: Training is faster, but can lead to poor generalization and might get stuck in suboptimal solutions.
Regularization:
- L1 Regularization: Adds a penalty proportional to the absolute value of each weight, which drives the weights of irrelevant features to zero. It's also used for feature selection.
- L2 Regularization: Penalizes large weights (in proportion to their square) but allows all features to contribute to the model. It helps avoid overfitting.
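
To make the batch size definition concrete, the sketch below splits a dataset into mini-batches; in training, each batch would trigger one weight update. The helper name `iterate_minibatches` and the ten-sample dataset are assumptions for illustration.

```python
import math

def iterate_minibatches(samples, batch_size):
    # Yield consecutive slices of at most `batch_size` samples.
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))  # stand-in for 10 training samples
batch_size = 4
batches = list(iterate_minibatches(samples, batch_size))
# One weight update per batch, so this many updates per epoch.
updates_per_epoch = math.ceil(len(samples) / batch_size)
```

With 10 samples and a batch size of 4 there are three updates per epoch, the last one on a smaller batch of two samples.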

Overfitting Solutions

Overfitting occurs when the model learns to memorize the training data instead of generalizing to unseen data. Here are some ways to address it:

Early stopping: Interrupts training when the validation loss begins to increase, preventing overfitting.
Dropout: Randomly drops neurons during training to prevent the model from relying too heavily on any one neuron.
Data augmentation: Generates new training data from the existing data, improving the model’s robustness. This can include transformations like rotations, flipping, and scaling for image data.
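
Early stopping is usually implemented with a "patience" counter, as in the sketch below. The function name, the patience value of 2, and the list of validation losses are assumptions for illustration; a real training loop would compute the loss after each epoch instead of reading it from a list.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # The loss kept improving (or patience never ran out): train to the end.
    return len(val_losses) - 1

# Hypothetical validation losses: improving, then rising as the model overfits.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_epoch = early_stopping_epoch(losses)
```

Here the best loss occurs at epoch 3 and training stops at epoch 5, after two consecutive epochs without improvement.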

Gradient Issues

Vanishing Gradient Problem: In deep networks, gradients can become extremely small as they are backpropagated, leading to difficulty in learning. When the slope of the loss function approaches zero, weight updates become negligible and training gets stuck.
Solutions:
- LSTM and GRU networks can mitigate this problem.
- ResNets (Residual Networks) use skip connections to bypass layers and prevent gradients from vanishing.
- ReLU activation functions mitigate this by not saturating for positive values.
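
A rough back-of-the-envelope sketch shows why gradients vanish with saturating activations. It is a deliberate simplification: it ignores the weights and assumes every layer contributes the sigmoid's best-case derivative, so it only illustrates the multiplicative shrinkage.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# The sigmoid's derivative peaks at 0.25 (at x = 0), so this is the best case.
per_layer = sigmoid_derivative(0.0)
# Backpropagation multiplies one such factor per layer.
gradient_after_10_layers = per_layer ** 10
```

Even in this best case the gradient signal shrinks below one millionth of its original size after ten layers.
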
Gradient Checking: A technique used to debug the backpropagation implementation by numerically approximating the derivatives (e.g., with finite differences) and comparing the result to the analytically computed gradient.
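
Gradient checking can be sketched on a one-variable function. The loss function below is an arbitrary example, and the central-difference formula with `eps=1e-5` is one common choice for the numerical approximation; both are assumptions for illustration.

```python
def loss(w):
    return w ** 3 + 2 * w

def analytic_gradient(w):
    # Hand-derived derivative of loss(); this is the code being checked.
    return 3 * w ** 2 + 2

def numerical_gradient(f, w, eps=1e-5):
    # Central difference approximation of the derivative.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 1.5
difference = abs(analytic_gradient(w) - numerical_gradient(loss, w))
```

A tiny `difference` suggests the analytic gradient is implemented correctly; a large one points to a bug in the derivative code.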

L1 and L2 Regularization

A technique to prevent overfitting by adding a regularization term to the loss function as the model is trained.

L1: sum of the absolute values of the weights
- Performs feature selection: entire features go to 0
- Computationally inefficient
- Sparse output
- Choose if you don't want to use all features.
L2: sum of the squares of the weights
- All features remain considered, just weighted
- Computationally efficient
- Dense output
- Choose if you want to use all features.
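
The two penalty terms can be computed directly. In this sketch the weights and the strength `lam` are made-up values, and the function names are assumptions; in training, the penalty would be added to the model's loss.

```python
def l1_penalty(weights, lam):
    # L1: sum of the absolute values of the weights.
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # L2: sum of the squares of the weights.
    return lam * sum(w ** 2 for w in weights)

weights = [0.5, -2.0, 0.0, 1.5]
l1 = l1_penalty(weights, lam=0.1)  # 0.1 * (0.5 + 2.0 + 0.0 + 1.5)
l2 = l2_penalty(weights, lam=0.1)  # 0.1 * (0.25 + 4.0 + 0.0 + 2.25)
```

Note how the largest weight (-2.0) dominates the L2 penalty far more than the L1 penalty, because L2 squares each weight.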

Evaluation Metrics

Evaluating models with the right metrics ensures that they are performing as expected. Common metrics include:

Confusion Matrix: Visualizes performance with counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
Recall: The fraction of relevant instances that were retrieved by the model. Critical for tasks where missing a relevant case is costly (e.g., fraud detection). Also known as sensitivity, true positive rate, or completeness.
- Formula: Recall = (TP) / (TP + FN)
Precision: The fraction of retrieved instances that are actually relevant, i.e., the percentage of returned results that are relevant. Important when false positives are costly (e.g., drug testing).
- Formula: Precision = (TP) / (TP + FP)
Specificity: The true negative rate = (TN) / (TN + FP)
F1 Score: The harmonic mean of Precision and Recall. It’s useful when the classes are imbalanced and you need a balance between precision and recall.
- Formula: F1 = 2 * ( (Precision * Recall) / (Precision + Recall) )
RMSE (Root Mean Squared Error): The square root of the average squared difference between predicted and actual values; lower is better. It’s commonly used for regression tasks.
ROC Curve: Plots Recall (true positive rate) vs. False Positive Rate (FPR) and helps visualize model performance across different thresholds.
AUC (Area Under Curve): The area under the ROC curve, summarizing the overall performance of the classifier. A value of 1.0 indicates perfect classification, while 0.5 is no better than random guessing.
PR Curve (Precision-Recall Curve): Plots Precision vs. Recall, useful for information retrieval or imbalanced datasets. The higher the area under the curve the better.
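
One way to build intuition for AUC: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way; the labels and scores are made up, and libraries typically use a faster, equivalent method.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # ties count as half
    return correct / (len(positives) * len(negatives))

# Hypothetical true labels and model scores for six examples.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

Eight of the nine positive/negative pairs are ranked correctly here, so the AUC is 8/9.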

Visual Breakdown (spam filter example):

🔹 True Positives (TP) → Correctly predicted spam
🔹 False Positives (FP) → Legit email wrongly marked as spam (Type I error)
🔹 False Negatives (FN) → Spam wrongly marked as legit (Type II error)
🔹 True Negatives (TN) → Correctly predicted legit email

Now, let’s map each metric:

📌 Precision = TP / (TP + FP) → "Of the emails we predicted as spam, how many were really spam?"
📌 Recall (Sensitivity) = TP / (TP + FN) → "Of all actual spam emails, how many did we catch?"
📌 Specificity = TN / (TN + FP) → "Of all legit emails, how many did we NOT wrongly mark as spam?"
📌 Accuracy = (TP + TN) / (TP + FP + FN + TN) → "Overall, how many were correctly classified?"
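
The formulas above can be checked with a small worked example. The counts below are hypothetical numbers for a spam filter evaluated on 100 emails.

```python
# Hypothetical confusion-matrix counts for 100 emails.
tp, fp, fn, tn = 30, 10, 5, 55

precision = tp / (tp + fp)                   # 30 / 40
recall = tp / (tp + fn)                      # 30 / 35
specificity = tn / (tn + fp)                 # 55 / 65
accuracy = (tp + tn) / (tp + fp + fn + tn)   # 85 / 100
f1 = 2 * (precision * recall) / (precision + recall)
```

Precision comes out at 0.75 and recall at roughly 0.86, so this filter misses few spam emails but one in four of its spam flags is wrong.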

Ensemble Methods

Ensemble methods combine multiple models to improve performance by reducing bias or variance:

Bagging (Bootstrap Aggregating): Generates N new training sets by random sampling with replacement, trains a model on each, and averages (or votes on) their predictions. Reduces variance, helps avoid overfitting, and is easy to parallelize because the models are trained independently.
- Example: Random Forests.
Boosting: Models are trained sequentially, and each new model corrects the errors of the previous one. Observation weights start out equal and are adjusted after each round so that later models focus on the examples earlier ones got wrong. Reduces bias and generally leads to better accuracy.
- Example: AdaBoost, Gradient Boosting, and XGBoost.
XGBoost: A highly efficient implementation of gradient boosting, known for speed and performance. It has regularization built in, which helps prevent overfitting.
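
The bagging procedure can be sketched with the standard library. This is an illustration only: each "model" here simply predicts the mean of its bootstrap sample, standing in for a real learner such as a decision tree, and the data and function names are assumptions.

```python
import random

def bootstrap_sample(data, rng):
    # Sample len(data) items with replacement: some repeat, others are left out.
    return [rng.choice(data) for _ in data]

def bagged_prediction(data, n_models, rng):
    # Each "model" is just the mean of its bootstrap sample (a stand-in learner).
    predictions = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        predictions.append(sum(sample) / len(sample))
    # Bagging averages the individual predictions.
    return sum(predictions) / len(predictions)

rng = random.Random(42)
targets = [3.0, 5.0, 4.0, 6.0, 2.0, 5.0, 4.0, 3.0]
prediction = bagged_prediction(targets, n_models=25, rng=rng)
```

Because every model is built from its own independent sample, the loop body could run in parallel, which is the parallelization advantage noted above.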

Automatic Model Tuning & Regularization

Hyperparameter tuning automates the process of optimizing model settings:

Optimize one or two hyperparameters at a time: This simplifies the search space and makes the process more manageable.
Log scale for hyperparameters: Helps explore and visualize hyperparameter effects evenly when values vary by orders of magnitude (e.g., learning rate).
Early Stopping: Prevents wasting computational resources when training cycles aren't yielding improvement.
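
Sampling on a log scale can be sketched as follows. The helper name `sample_log_uniform` and the range 1e-5 to 1e-1 are assumptions for illustration; the idea is to draw the exponent uniformly so that each order of magnitude is equally likely.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    # Draw uniformly in log10 space, then map back to the original scale.
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent

rng = random.Random(0)
learning_rates = [sample_log_uniform(1e-5, 1e-1, rng) for _ in range(5)]
```

A plain uniform draw over the same range would land above 0.01 about 90% of the time, leaving the smaller learning rates almost unexplored.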

Machine Learning with Amazon Web Services Transformers