SageMaker Common Algorithms

Linear Learner

  • Type: Classification or regression
  • Data Input: RecordIO-wrapped protobuf (Float32 data only) or CSV (first column assumed to be the label); File and Pipe mode are both supported.
  • Data Output: Model artifacts (serialized model files such as .tar.gz)
  • Training:
    • For long training times, use Pipe Mode to avoid storing large datasets in memory.
    • Training data should be normalized so all features are on the same scale (the algorithm can normalize it for you).
    • Suitable for both single and multi-machine configurations with CPU/GPU.
    • Multi-GPU does not provide significant benefits for this algorithm.
  • Hyperparameters:
    • balance_multiclass_weights –Ensures equal importance for all classes in loss functions, useful for imbalanced datasets.
    • learning_rate –Controls step size for weight updates during training.
    • mini_batch_size –Number of samples processed per training step.
    • L1 –Lasso regularization; encourages sparsity in model weights.
    • Wd (Weight Decay, L2 Regularization) –Helps prevent overfitting by penalizing large weight values.
    • target_precision –Used when binary_classifier_model_selection_criteria = recall_at_target_precision; keeps precision fixed while maximizing recall.
    • target_recall –Used when binary_classifier_model_selection_criteria = precision_at_target_recall; keeps recall fixed while maximizing precision.
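
A minimal sketch (not from these notes) of launching a Linear Learner training job with the SageMaker Python SDK. The IAM role ARN, bucket, and hyperparameter values are placeholders chosen for illustration.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"   # placeholder role
bucket = "my-bucket"                                     # placeholder bucket

# Built-in Linear Learner container for the current region
container = image_uris.retrieve("linear-learner", session.boto_region_name)

ll = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",          # single machine, CPU is fine
    input_mode="Pipe",                     # stream data instead of loading it all
    output_path=f"s3://{bucket}/linear-learner/output",
    sagemaker_session=session,
)
ll.set_hyperparameters(
    predictor_type="binary_classifier",
    mini_batch_size=200,
    learning_rate=0.01,
    l1=0.0,                                # Lasso term
    wd=0.01,                               # weight decay (L2)
    binary_classifier_model_selection_criteria="recall_at_target_precision",
    target_precision=0.9,                  # hold precision, maximize recall
)

# CSV input: first column is the label
ll.fit({"train": TrainingInput(f"s3://{bucket}/linear-learner/train/",
                               content_type="text/csv")})
```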

XGBoost

  • Type: Gradient boosting. New trees are added to correct the errors of the previous trees, using gradient descent to minimize loss as each tree is added.
  • Use Case: Classification, regression, ranking
  • Data Input: CSV, LibSVM, or Parquet
  • Data Output: Model artifacts (serialized model files such as .json or .bin)
  • Key Parameters:
    • eval_metric: Evaluation metric to optimize, e.g., AUC (useful for imbalanced classification), error, or RMSE.
    • scale_pos_weight: Adjusts the balance of positive and negative weights to prevent bias towards the majority class.
    • max_depth: Controls tree depth; higher values can lead to overfitting.
    • eta: Controls step size shrinkage to prevent overfitting.
    • subsample: Defines the fraction of samples used for each tree.
  • Training:
    • Memory-bound rather than compute-bound, so M5 is a good general-purpose choice; newer versions can also use P2/P3 instances for GPU-accelerated training.
    • Supports single-instance GPU or distributed GPU training.
    • Tuning parameters like subsample and max_depth can significantly impact model performance and generalization.
  • Hyperparameters:
    • subsample –Controls the fraction of samples used for each tree to prevent overfitting.
    • eta –Step size shrinkage; reduces overfitting by shrinking the impact of each tree.
    • gamma –Minimum loss reduction required to create a partition; larger values make the model more conservative.
    • alpha –L1 regularization term; larger values increase regularization and make the model more conservative.
    • lambda –L2 regularization term; larger values increase regularization and make the model more conservative.
    • eval_metric –Metric used to evaluate model performance, such as AUC, error, RMSE, etc.
    • scale_pos_weight –Adjusts the balance of positive and negative weights, useful for imbalanced classes. Often set to the ratio of negative to positive cases.
    • max_depth –Maximum depth of the decision tree; deeper trees can lead to overfitting.
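
A hedged sketch of a SageMaker XGBoost training job using the generic Estimator and the hyperparameters listed above. The container version string, role ARN, bucket, and values are illustrative only.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"    # placeholder
bucket = "my-bucket"                                      # placeholder

# Managed XGBoost container; pick a version available in your region
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

xgb = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",      # memory-bound: general-purpose instance
    output_path=f"s3://{bucket}/xgboost/output",
    sagemaker_session=session,
)
xgb.set_hyperparameters(
    objective="binary:logistic",
    num_round=200,
    max_depth=5,                # deeper trees overfit more easily
    eta=0.2,                    # step-size shrinkage
    gamma=1.0,                  # min loss reduction required to split
    subsample=0.8,              # fraction of rows sampled per tree
    alpha=0.1,                  # L1 regularization
    eval_metric="auc",
    scale_pos_weight=10,        # roughly negatives / positives for imbalance
)

xgb.fit({
    "train": TrainingInput(f"s3://{bucket}/xgboost/train/", content_type="text/csv"),
    "validation": TrainingInput(f"s3://{bucket}/xgboost/validation/", content_type="text/csv"),
})
```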

LightGBM

  • Type: Gradient Boosting Decision Tree (GBDT)
  • Use Case: Classification, regression, ranking, ensemble prediction
  • Data Input: CSV, LibSVM, or Parquet
  • Data Output: Model artifacts (serialized model files such as .bin)
  • Key Parameters:
    • learning_rate: Controls how much each tree influences the model.
    • num_leaves: Number of leaves in a tree; higher values increase model complexity.
    • feature_fraction: Fraction of features used for each tree.
    • bagging_fraction & bagging_freq: For random sampling of data to avoid overfitting.
    • max_depth: Prevents overfitting by limiting tree depth.
    • min_data_in_leaf: Minimum number of data points in each leaf node.
  • Training:
    • CPU-only training; optimized for large datasets.
    • Memory-bound rather than compute-bound, so general-purpose instances with plenty of memory (e.g., M5) are a better fit than compute-optimized ones.
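
In SageMaker, LightGBM is launched as a built-in/JumpStart tabular algorithm; the open-source lightgbm package is used below only to show how the listed parameters fit together on toy, randomly generated data.

```python
import lightgbm as lgb
import numpy as np

# Toy data: 500 rows, 10 features, binary labels (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

params = {
    "objective": "binary",
    "learning_rate": 0.05,       # how much each tree influences the model
    "num_leaves": 31,            # higher -> more complex trees
    "feature_fraction": 0.8,     # fraction of features sampled per tree
    "bagging_fraction": 0.8,     # fraction of rows sampled...
    "bagging_freq": 5,           # ...re-sampled every 5 iterations
    "max_depth": -1,             # -1 = unlimited; set a value to curb overfitting
    "min_data_in_leaf": 20,      # minimum data points per leaf
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```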

Seq2Seq

  • Type: Sequence-to-sequence model
  • Use Case: Machine translation, text summarization, speech-to-text
  • Data Input: Text files (tokenized sequences of words or symbols), RecordIO ProtoBuf (tokens must be integers)
  • Data Output: Text files (predicted sequence)
  • Training:
    • Input and output are sequences of tokens (e.g., text or speech).
    • Trains only on GPU instances; single machine only, but can use multiple GPUs on that machine.
    • Tuning: batch size, optimizer type, learning rate; can optimize on BLEU score (compares the translation against reference translations) or perplexity (a cross-entropy measure used for language modeling).
  • Hyperparameters:
    • batch_size –Number of training samples per batch; influences training time and memory usage.
    • optimizer_type –The type of optimizer used, e.g., Adam, SGD, RMSProp.
    • learning_rate –Controls the step size during training; affects convergence speed and stability.
    • num_layers_encoder –Number of layers in the encoder; higher values can capture more complex patterns.
    • num_layers_decoder –Number of layers in the decoder; higher values increase model complexity.
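
A hedged sketch of a Seq2Seq training job with the hyperparameters above. It assumes the corpus has already been tokenized into integer sequences and uploaded as the channels the algorithm expects (train / validation / vocab); role, bucket, and values are placeholders.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"     # placeholder
bucket = "my-bucket"                                       # placeholder

container = image_uris.retrieve("seq2seq", session.boto_region_name)

seq2seq = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,                     # single machine only
    instance_type="ml.p3.2xlarge",        # GPU instance required
    output_path=f"s3://{bucket}/seq2seq/output",
    sagemaker_session=session,
)
seq2seq.set_hyperparameters(
    batch_size=64,
    optimizer_type="adam",
    learning_rate=0.0003,
    num_layers_encoder=1,
    num_layers_decoder=1,
)

seq2seq.fit({
    "train": f"s3://{bucket}/seq2seq/train/",
    "validation": f"s3://{bucket}/seq2seq/validation/",
    "vocab": f"s3://{bucket}/seq2seq/vocab/",
})
```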

DeepAR

  • Type: Time-series forecasting
  • Use Case: Forecasting 1-dimensional time series data, such as stock prices or demand forecasting.
  • Data Input: JSON Lines (optionally gzipped) or Parquet time-series data
  • Data Output: Forecasted data (often in the form of CSV or JSON files)
  • Training:
    • Can train across many related time series at once, not just a single series.
    • Always provide the entire time series during training, testing, and inference for accurate forecasting.
    • Supports both CPU and GPU with single or multiple machines.
  • Hyperparameters:
    • context_length –The number of time points the model looks at before making a prediction. It can be smaller than seasonalities, as the model will still account for seasonality trends (e.g., a yearly lag).
    • epochs –The number of complete passes through the entire dataset during training. More epochs can lead to better model performance, but may also risk overfitting.
    • mini_batch_size –The size of the mini-batches used in each step of training. Smaller batch sizes can lead to more frequent updates but require more iterations.
    • learning_rate –Controls the step size during training, impacting the speed and stability of model convergence.
    • num_cells –The number of cells (units) in the recurrent layers (e.g., LSTM, GRU). More cells allow the model to capture more complex patterns but increase computation.
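
A hedged sketch of a DeepAR training job; frequency, horizon, and other values are illustrative, and role/bucket are placeholders. Each line of the JSON Lines training file describes one series.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"     # placeholder
bucket = "my-bucket"                                       # placeholder

container = image_uris.retrieve("forecasting-deepar", session.boto_region_name)

deepar = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",        # CPU or GPU both work
    output_path=f"s3://{bucket}/deepar/output",
    sagemaker_session=session,
)
deepar.set_hyperparameters(
    time_freq="D",              # daily data
    prediction_length=30,       # forecast horizon
    context_length=30,          # lookback window (can be shorter than a season)
    epochs=100,
    mini_batch_size=64,
    learning_rate=0.001,
    num_cells=40,               # units per recurrent layer
)

# Example training record (one JSON object per line):
# {"start": "2024-01-01 00:00:00", "target": [10.1, 12.3, ...]}
deepar.fit({"train": f"s3://{bucket}/deepar/train/"})
```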

BlazingText

  • Type: Word2Vec (word embeddings) and supervised text classification
  • Use Case: Word similarity / information retrieval via embeddings; sentence-level classification (not entire documents).
  • Data Input: Text file (one sentence per line)
  • Data Output: Word embeddings (e.g., CSV or TSV files)
  • Key Concepts:
    • Uses Word2Vec to create word embeddings and find similarity between words.
    • Can be trained with CBOW (Continuous Bag of Words) or Skip-gram methods.
    • Requires a text file as input data.
  • Hyperparameters
    • Word2Vec:
      • mode –Defines the architecture used:
        • cbow (Continuous Bag of Words): uses the surrounding context words to predict the target word.
        • skipgram: uses the target word to predict its surrounding context words.
        • batch_skipgram: skipgram with mini-batching, which allows distributed training across multiple CPU instances.
      • learning_rate –Controls the rate of learning during training. Too high can cause instability, too low can slow down convergence.
      • window_size –The number of context words around a target word to consider for training.
      • vector_dim –The size of the vector that represents each word in the word embedding.
      • negative_samples –The number of negative samples used to train the model. Helps in creating better word embeddings by contrasting positive and negative samples.
    • Text Classification:
      • epochs –Number of times the entire training dataset is passed through the model. More epochs may lead to better learning but risk overfitting.
      • learning_rate –Determines the step size in optimization algorithms, affecting convergence speed and stability.
      • word_ngrams –N-gram level feature extraction for text; captures word sequences (e.g., bigrams or trigrams) for more context in classification.
      • vector_dim –The dimensionality of the word vectors used in the model, affecting the representation capacity and computation requirements.
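
A hedged sketch of BlazingText in Word2Vec (skipgram) mode with the hyperparameters above; the training channel is a plain text file with one pre-processed sentence per line, and role/bucket/values are placeholders.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"     # placeholder
bucket = "my-bucket"                                       # placeholder

container = image_uris.retrieve("blazingtext", session.boto_region_name)

bt = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",      # skipgram/cbow can use a single GPU
    output_path=f"s3://{bucket}/blazingtext/output",
    sagemaker_session=session,
)
bt.set_hyperparameters(
    mode="skipgram",            # or "cbow" / "batch_skipgram"
    learning_rate=0.05,
    window_size=5,              # context words around the target
    vector_dim=100,             # embedding size
    negative_samples=5,
    epochs=10,
)

bt.fit({"train": TrainingInput(f"s3://{bucket}/blazingtext/train/",
                               content_type="text/plain")})
```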

Object2Vec

  • Type: Embedding layer for objects
  • Use Case: Genre prediction, creating embeddings for higher-dimensional objects.
  • Data Input: Text file or CSV (tokenized into integers)
  • Data Output: Low-dimensional embeddings (e.g., CSV)
  • Key Concepts:
    • Generates low-dimensional embeddings for higher-dimensional data.
    • Requires tokenized data, which is transformed into integers before training.
  • Hyperparameters: Typical deep-learning ones.

Object Detection

  • Type: CNN-based detection
  • Use Case: Identifying and localizing objects in images.
  • Data Input: Image files (e.g., JPEG, PNG)
  • Data Output: Predicted bounding boxes with labels (CSV, JSON)
  • Training:
    • Two variants: an MXNet-based Single Shot multibox Detector (SSD) and a TensorFlow-based model.
  • Hyperparameters:
    • mini_batch_size
    • learning_rate
    • optimizer

Image Classification

  • Type: Object classification without localization
  • Use Case: Classifying images into predefined categories (doesn’t identify object locations).
  • Data Input: Image files (e.g., JPEG, PNG)
  • Data Output: Predicted class labels (CSV, JSON)
  • Hyperparameters: Typical deep-learning ones.

Semantic Segmentation

  • Type: Pixel-level object classification
  • Use Case: Classifying each pixel in an image to assign a category (e.g., identifying the region of interest for autonomous driving).
  • Data Input: Image files (e.g., JPEG, PNG)
  • Data Output: Segmented image masks (PNG or JSON)
  • Hyperparameters:
    • Epochs, learning rate, batch size, optimizer, etc.
    • algorithm –Choice of FCN, PSP, or DeepLabV3.
    • backbone –ResNet-50 or ResNet-101, optionally pre-trained on ImageNet.

Random Cut Forest

  • Type: Anomaly detection
  • Use Case: Identifying anomalies in data. Builds a forest of trees, each trained on a random partition of the training data, and scores a point by the expected change in tree complexity when that point is added.
  • Data Input: RecordIO-protobuf or CSV
  • Data Output: Anomaly scores (CSV)
  • Training:
    • Unsupervised algorithm that builds a forest of trees for anomaly detection.
    • Key Parameters: Number of trees, number of samples per tree.
    • Does not take advantage of GPUs
  • Hyperparameters:
    • num_trees
    • num_samples_per_tree
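
A hedged sketch of a Random Cut Forest training job. feature_dim, the role, bucket, and values are illustrative; num_samples_per_tree is usually chosen so that 1/num_samples_per_tree roughly matches the expected fraction of anomalies.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"     # placeholder
bucket = "my-bucket"                                       # placeholder

container = image_uris.retrieve("randomcutforest", session.boto_region_name)

rcf = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",       # CPU; GPUs give no benefit here
    output_path=f"s3://{bucket}/rcf/output",
    sagemaker_session=session,
)
rcf.set_hyperparameters(
    feature_dim=1,                # dimensionality of each record
    num_trees=100,                # more trees smooth out score noise
    num_samples_per_tree=256,
)

rcf.fit({"train": TrainingInput(f"s3://{bucket}/rcf/train/",
                                content_type="text/csv")})
```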

Neural Topic Model

  • Type: Topic modeling
  • Use Case: Organizing documents into topics. Unsupervised. Four data channels: train (required); validation, test, and auxiliary (optional).
  • Data Input: Tokenized integer sequences (CSV or JSON format)
  • Data Output: Topics and topic distributions (CSV)
  • Hyperparameters:
    • mini_batch_size
    • num_topics

LDA (Latent Dirichlet Allocation)

  • Type: Topic modeling (not deep learning)
  • Use Case: Unsupervised organization of documents into topics based on shared words. Also works on non-text data, e.g., harmonic analysis in music.
  • Data Input: Text file or CSV (tokenized words)
  • Data Output: Topic distributions per document (CSV)
  • Training:
    • Single-instance CPU model.
  • Hyperparameters:
    • alpha0 –Initial concentration parameter; small values give sparse topic mixtures, larger values (>1.0) give more uniform mixtures.
    • num_topics

KNN (K-Nearest Neighbors)

  • Type: Instance-based learning
  • Use Case: Classification or regression based on proximity to k closest data points in the feature space.
  • Data Input: CSV, Parquet
  • Data Output: Predicted class labels or regression values (CSV)
  • Hyperparameters:
    • k –Number of nearest neighbors to consider (the key hyperparameter).
    • sample_size –Number of data points sampled from the training set.

K-Means

  • Type: Clustering (Unsupervised)
  • Use Case: Dividing data into k groups based on feature similarity.
  • Data Input: CSV, RecordIO-Protobuf. Train channel, optional test channel.
  • Data Output: Cluster assignments (CSV)
  • Training:
    • CPU recommended due to its simplicity.
  • Hyperparameters:
    • k:
      • Choosing the optimal number of clusters is tricky.
      • Plot the within-cluster sum of squares (WCSS) as a function of k and use the elbow method: pick the point where adding more clusters stops reducing WCSS significantly (see the sketch after this list).
      • Optimize for tightness of clusters (minimizing variance within clusters).
    • mini_batch_size:
      • Size of the data batches processed in each iteration.
      • Smaller batches give more frequent updates and can speed up convergence on large datasets.
    • extra_center_factor:
      • The algorithm starts with K = k * extra_center_factor centers during training and reduces them back to k final clusters, which improves accuracy.
    • init_method:
      • Method used for initializing cluster centroids (random or k-means++).
      • Affects the quality of the final clustering and convergence speed.
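
A small scikit-learn sketch (not the SageMaker estimator) of the elbow method described above: compute WCSS for a range of k on toy data and look for the bend where WCSS stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Toy data with three well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

ks = range(1, 10)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method: pick k where the curve bends")
plt.show()
```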

PCA (Principal Component Analysis)

  • Type: Dimensionality reduction
  • Use Case: Reduces the number of features while retaining as much variance as possible.
  • Data Input: CSV, Parquet
  • Data Output: Reduced feature data (CSV)
  • Training:
    • Unsupervised learning method.
    • Reduces the number of features to a smaller number of components.
    • A covariance matrix is created, then singular value decomposition (SVD) is applied, either regular or randomized (see the sketch below).
  • Hyperparameters:
    • algorithm_mode –regular (exact; fine for sparse data and a moderate number of observations/features) or randomized (an approximation that scales to large numbers of observations and features).
    • subtract_mean –Unbias the data by subtracting the mean before projection.
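
A numpy sketch of the idea in the training note above: center the data (subtract_mean), build the covariance matrix, and take an exact SVD, then project onto the top components. Toy data and k are illustrative; the randomized mode approximates this same decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy features

X_centered = X - X.mean(axis=0)            # "subtract_mean": unbias the data
cov = np.cov(X_centered, rowvar=False)     # covariance matrix
U, S, Vt = np.linalg.svd(cov)              # exact ("regular") SVD

k = 2
components = Vt[:k]                        # top-k principal directions
X_reduced = X_centered @ components.T      # project onto k components

explained = S[:k].sum() / S.sum()
print(f"variance retained by {k} components: {explained:.1%}")
```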

Factorization Machines

  • Type: Supervised learning for sparse data
  • Use Case: Recommender systems, click prediction.
  • Data Input: RecordIO-protobuf with Float32 (CSV is impractical for sparse data)
  • Data Output: Predicted ratings or classifications (CSV)
  • Training:
    • Deals with sparse data represented in large matrices.
    • Suitable for classification or regression tasks.
    • Limited to pair-wise interactions
    • CPU recommended, GPU usually used for dense data
  • Hyperparameters:
    • Initialization methods for the bias, linear terms, and factors (uniform, normal, or constant); see the sketch below for how these terms combine into a prediction.
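
A tiny numpy illustration of the pairwise-interaction scoring a factorization machine performs. The weights are random placeholders and the sparse one-hot vector stands in for a user/item pair; this is a sketch of the scoring formula, not the SageMaker implementation.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Factorization machine prediction for one input vector x.

    w0: global bias, w: linear weights (d,), V: factor matrix (d, k).
    Pairwise term uses the standard identity:
    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
    """
    linear = w0 + w @ x
    Vx = V.T @ x                                   # (k,)
    pairwise = 0.5 * (Vx**2 - (V**2).T @ (x**2)).sum()
    return linear + pairwise

rng = np.random.default_rng(0)
d, k = 6, 3                                        # 6 features, 3 latent factors
x = np.zeros(d)
x[1] = 1.0                                         # one-hot "user"
x[4] = 1.0                                         # one-hot "item"
print(fm_score(x, w0=0.1, w=rng.normal(size=d), V=rng.normal(size=(d, k))))
```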

IP Insights

  • Type: Anomaly detection for IP addresses
  • Use Case: Identifying anomalous behavior based on IP addresses.
  • Data Input: CSV (IP logs)
  • Data Output: Anomaly scores (CSV)
  • Training:
    • Unsupervised learning algorithm for detecting irregularities or threats in network data.
    • GPU recommended
  • Hyperparameters:
    • num_entity_vectors –Hash size; set to twice the number of unique entity identifiers.
    • vector_dim –Size of the embedding vectors.
    • epochs
    • learning_rate
    • batch_size
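
A hedged sketch of an IP Insights training job with the hyperparameters above. The training CSV has two columns per row (entity identifier, IP address) with no header; role, bucket, and values are placeholders.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"     # placeholder
bucket = "my-bucket"                                       # placeholder

container = image_uris.retrieve("ipinsights", session.boto_region_name)

ipi = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",      # GPU recommended
    output_path=f"s3://{bucket}/ipinsights/output",
    sagemaker_session=session,
)
ipi.set_hyperparameters(
    num_entity_vectors=20000,   # roughly 2x the number of unique entities
    vector_dim=128,
    epochs=10,
    learning_rate=0.001,
    batch_size=1000,
)

# Training CSV rows look like: "user_name,ip_address"
ipi.fit({"train": TrainingInput(f"s3://{bucket}/ipinsights/train/",
                                content_type="text/csv")})
```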