SageMaker Common Algorithms

Linear Learner

  • Type: Classification or regression
  • Data Input: RecordIO-wrapped protobuf (Float32 data only) or CSV (first column assumed to be the label); File and Pipe mode are both supported.
  • Data Output: Model artifacts (serialized model files such as .tar.gz)
  • Training:
    • For long training times, use Pipe Mode to avoid storing large datasets in memory.
    • Training data should be normalized so all features are on the same scale (the algorithm can normalize it for you).
    • Suitable for both single and multi-machine configurations with CPU/GPU.
    • Multi-GPU does not provide significant benefits for this algorithm.
  • Hyperparameters:
    • balance_multiclass_weights –Ensures equal importance for all classes in loss functions, useful for imbalanced datasets.
    • learning_rate –Controls step size for weight updates during training.
    • mini_batch_size –Number of samples processed per training step.
    • L1 –Lasso regularization; encourages sparsity in model weights.
    • Wd (Weight Decay, L2 Regularization) –Helps prevent overfitting by penalizing large weight values.
    • target_precision –Used when binary_classifier_model_selection_criteria = recall_at_target_precision; keeps precision fixed while maximizing recall.
    • target_recall –Used when binary_classifier_model_selection_criteria = precision_at_target_recall; keeps recall fixed while maximizing precision.
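
A minimal sketch (not from these notes) of launching a Linear Learner training job with the SageMaker Python SDK. The IAM role ARN, bucket, and hyperparameter values are placeholders chosen for illustration.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"   # placeholder role
bucket = "my-bucket"                                     # placeholder bucket

# Built-in Linear Learner container for the current region
container = image_uris.retrieve("linear-learner", session.boto_region_name)

ll = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",          # single machine, CPU is fine
    input_mode="Pipe",                     # stream data instead of loading it all
    output_path=f"s3://{bucket}/linear-learner/output",
    sagemaker_session=session,
)
ll.set_hyperparameters(
    predictor_type="binary_classifier",
    mini_batch_size=200,
    learning_rate=0.01,
    l1=0.0,                                # Lasso term
    wd=0.01,                               # weight decay (L2)
    binary_classifier_model_selection_criteria="recall_at_target_precision",
    target_precision=0.9,                  # hold precision, maximize recall
)

# CSV input: first column is the label
ll.fit({"train": TrainingInput(f"s3://{bucket}/linear-learner/train/",
                               content_type="text/csv")})
```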

XGBoost

  • Type: Gradient boosting. New trees are added to correct the errors of the previous trees, using gradient descent to minimize loss as each tree is added.
  • Use Case: Classification, regression, ranking
  • Data Input: CSV, LibSVM, or Parquet
  • Data Output: Model artifacts (serialized model files such as .json or .bin)
  • Key Parameters:
    • eval_metric: Evaluation metric to optimize, e.g., AUC (useful for imbalanced classification), error, or RMSE.
    • scale_pos_weight: Adjusts the balance of positive and negative weights to prevent bias towards the majority class.
    • max_depth: Controls tree depth; higher values can lead to overfitting.
    • eta: Controls step size shrinkage to prevent overfitting.
    • subsample: Defines the fraction of samples used for each tree.
  • Training:
    • Memory-bound rather than compute-bound, so M5 is a good general-purpose choice; newer versions can also use P2/P3 instances for GPU-accelerated training.
    • Supports single-instance GPU or distributed GPU training.
    • Tuning parameters like subsample and max_depth can significantly impact model performance and generalization.
  • Hyperparameters:
    • subsample –Controls the fraction of samples used for each tree to prevent overfitting.
    • eta –Step size shrinkage; reduces overfitting by shrinking the impact of each tree.
    • gamma –Minimum loss reduction required to create a partition; larger values make the model more conservative.
    • alpha –L1 regularization term; larger values increase regularization and make the model more conservative.
    • lambda –L2 regularization term; larger values increase regularization and make the model more conservative.
    • eval_metric –Metric used to evaluate model performance, such as AUC, error, RMSE, etc.
    • scale_pos_weight –Adjusts the balance of positive and negative weights, useful for imbalanced classes. Often set to the ratio of negative to positive cases.
    • max_depth –Maximum depth of the decision tree; deeper trees can lead to overfitting.
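
A hedged sketch of a SageMaker XGBoost training job using the generic Estimator and the hyperparameters listed above. The container version string, role ARN, bucket, and values are illustrative only.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"    # placeholder
bucket = "my-bucket"                                      # placeholder

# Managed XGBoost container; pick a version available in your region
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

xgb = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",      # memory-bound: general-purpose instance
    output_path=f"s3://{bucket}/xgboost/output",
    sagemaker_session=session,
)
xgb.set_hyperparameters(
    objective="binary:logistic",
    num_round=200,
    max_depth=5,                # deeper trees overfit more easily
    eta=0.2,                    # step-size shrinkage
    gamma=1.0,                  # min loss reduction required to split
    subsample=0.8,              # fraction of rows sampled per tree
    alpha=0.1,                  # L1 regularization
    eval_metric="auc",
    scale_pos_weight=10,        # roughly negatives / positives for imbalance
)

xgb.fit({
    "train": TrainingInput(f"s3://{bucket}/xgboost/train/", content_type="text/csv"),
    "validation": TrainingInput(f"s3://{bucket}/xgboost/validation/", content_type="text/csv"),
})
```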

LightGBM

  • Type: Gradient Boosting Decision Tree (GBDT)
  • Use Case: Classification, regression, ranking, ensemble prediction
  • Data Input: CSV, LibSVM, or Parquet
  • Data Output: Model artifacts (serialized model files such as .bin)
  • Key Parameters:
    • learning_rate: Controls how much each tree influences the model.
    • num_leaves: Number of leaves in a tree; higher values increase model complexity.
    • feature_fraction: Fraction of features used for each tree.
    • bagging_fraction & bagging_freq: For random sampling of data to avoid overfitting.
    • max_depth: Prevents overfitting by limiting tree depth.
    • min_data_in_leaf: Minimum number of data points in each leaf node.
  • Training:
    • CPU-only training; optimized for large datasets.
    • Memory-bound rather than compute-bound, so general-purpose instances with plenty of memory (e.g., M5) are a better fit than compute-optimized ones.
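
In SageMaker, LightGBM is launched as a built-in/JumpStart tabular algorithm; the open-source lightgbm package is used below only to show how the listed parameters fit together on toy, randomly generated data.

```python
import lightgbm as lgb
import numpy as np

# Toy data: 500 rows, 10 features, binary labels (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

params = {
    "objective": "binary",
    "learning_rate": 0.05,       # how much each tree influences the model
    "num_leaves": 31,            # higher -> more complex trees
    "feature_fraction": 0.8,     # fraction of features sampled per tree
    "bagging_fraction": 0.8,     # fraction of rows sampled...
    "bagging_freq": 5,           # ...re-sampled every 5 iterations
    "max_depth": -1,             # -1 = unlimited; set a value to curb overfitting
    "min_data_in_leaf": 20,      # minimum data points per leaf
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```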

Seq2Seq

  • Type: Sequence-to-sequence model
  • Use Case: Machine translation, text summarization, speech-to-text
  • Data Input: Text files (tokenized sequences of words or symbols), RecordIO ProtoBuf (tokens must be integers)
  • Data Output: Text files (predicted sequence)
  • Training:
    • Input and output are sequences of tokens (e.g., text or speech).
    • Trains only on GPU instances; single machine only, but can use multiple GPUs on that machine.
    • Tuning: batch size, optimizer type, learning rate; can optimize on BLEU score (compares the translation against reference translations) or perplexity (a cross-entropy measure used for language modeling).
  • Hyperparameters:
    • batch_size –Number of training samples per batch; influences training time and memory usage.
    • optimizer_type –The type of optimizer used, e.g., Adam, SGD, RMSProp.
    • learning_rate –Controls the step size during training; affects convergence speed and stability.
    • num_layers_encoder –Number of layers in the encoder; higher values can capture more complex patterns.
    • num_layers_decoder –Number of layers in the decoder; higher values increase model complexity.
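
A hedged sketch of a Seq2Seq training job with the hyperparameters above. It assumes the corpus has already been tokenized into integer sequences and uploaded as the channels the algorithm expects (train / validation / vocab); role, bucket, and values are placeholders.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"     # placeholder
bucket = "my-bucket"                                       # placeholder

container = image_uris.retrieve("seq2seq", session.boto_region_name)

seq2seq = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,                     # single machine only
    instance_type="ml.p3.2xlarge",        # GPU instance required
    output_path=f"s3://{bucket}/seq2seq/output",
    sagemaker_session=session,
)
seq2seq.set_hyperparameters(
    batch_size=64,
    optimizer_type="adam",
    learning_rate=0.0003,
    num_layers_encoder=1,
    num_layers_decoder=1,
)

seq2seq.fit({
    "train": f"s3://{bucket}/seq2seq/train/",
    "validation": f"s3://{bucket}/seq2seq/validation/",
    "vocab": f"s3://{bucket}/seq2seq/vocab/",
})
```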

DeepAR

  • Type: Time-series forecasting
  • Use Case: Forecasting 1-dimensional time series data, such as stock prices or demand forecasting.
  • Data Input: JSON Lines (optionally gzipped) or Parquet time-series data
  • Data Output: Forecasted data (often in the form of CSV or JSON files)
  • Training:
    • Can train across many related time series at once, not just a single series.
    • Always provide the entire time series during training, testing, and inference for accurate forecasting.
    • Supports both CPU and GPU with single or multiple machines.
  • Hyperparameters:
    • context_length –The number of time points the model looks at before making a prediction. It can be smaller than seasonalities, as the model will still account for seasonality trends (e.g., a yearly lag).
    • epochs –The number of complete passes through the entire dataset during training. More epochs can lead to better model performance, but may also risk overfitting.
    • mini_batch_size –The size of the mini-batches used in each step of training. Smaller batch sizes can lead to more frequent updates but require more iterations.
    • learning_rate –Controls the step size during training, impacting the speed and stability of model convergence.
    • num_cells –The number of cells (units) in the recurrent layers (e.g., LSTM, GRU). More cells allow the model to capture more complex patterns but increase computation.
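
A hedged sketch of a DeepAR training job; frequency, horizon, and other values are illustrative, and role/bucket are placeholders. Each line of the JSON Lines training file describes one series.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"     # placeholder
bucket = "my-bucket"                                       # placeholder

container = image_uris.retrieve("forecasting-deepar", session.boto_region_name)

deepar = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",        # CPU or GPU both work
    output_path=f"s3://{bucket}/deepar/output",
    sagemaker_session=session,
)
deepar.set_hyperparameters(
    time_freq="D",              # daily data
    prediction_length=30,       # forecast horizon
    context_length=30,          # lookback window (can be shorter than a season)
    epochs=100,
    mini_batch_size=64,
    learning_rate=0.001,
    num_cells=40,               # units per recurrent layer
)

# Example training record (one JSON object per line):
# {"start": "2024-01-01 00:00:00", "target": [10.1, 12.3, ...]}
deepar.fit({"train": f"s3://{bucket}/deepar/train/"})
```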

BlazingText

  • Type: Word2Vec (word embeddings) and supervised text classification
  • Use Case: Word similarity / information retrieval via embeddings; sentence-level classification (not entire documents).
  • Data Input: Text file (one sentence per line)
  • Data Output: Word embeddings (e.g., CSV or TSV files)
  • Key Concepts:
    • Uses Word2Vec to create word embeddings and find similarity between words.
    • Can be trained with CBOW (Continuous Bag of Words) or Skip-gram methods.
    • Requires a text file as input data.
  • Hyperparameters
    • Word2Vec:
      • mode –Defines the architecture used:
        • cbow (Continuous Bag of Words): uses the surrounding context words to predict the target word.
        • skipgram: uses the target word to predict its surrounding context words.
        • batch_skipgram: skipgram with mini-batching, which allows distributed training across multiple CPU instances.
      • learning_rate –Controls the rate of learning during training. Too high can cause instability, too low can slow down convergence.
      • window_size –The number of context words around a target word to consider for training.
      • vector_dim –The size of the vector that represents each word in the word embedding.
      • negative_samples –The number of negative samples used to train the model. Helps in creating better word embeddings by contrasting positive and negative samples.
    • Text Classification:
      • epochs –Number of times the entire training dataset is passed through the model. More epochs may lead to better learning but risk overfitting.
      • learning_rate –Determines the step size in optimization algorithms, affecting convergence speed and stability.
      • word_ngrams –N-gram level feature extraction for text; captures word sequences (e.g., bigrams or trigrams) for more context in classification.
      • vector_dim –The dimensionality of the word vectors used in the model, affecting the representation capacity and computation requirements.
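
A hedged sketch of BlazingText in Word2Vec (skipgram) mode with the hyperparameters above; the training channel is a plain text file with one pre-processed sentence per line, and role/bucket/values are placeholders.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"     # placeholder
bucket = "my-bucket"                                       # placeholder

container = image_uris.retrieve("blazingtext", session.boto_region_name)

bt = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",      # skipgram/cbow can use a single GPU
    output_path=f"s3://{bucket}/blazingtext/output",
    sagemaker_session=session,
)
bt.set_hyperparameters(
    mode="skipgram",            # or "cbow" / "batch_skipgram"
    learning_rate=0.05,
    window_size=5,              # context words around the target
    vector_dim=100,             # embedding size
    negative_samples=5,
    epochs=10,
)

bt.fit({"train": TrainingInput(f"s3://{bucket}/blazingtext/train/",
                               content_type="text/plain")})
```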

Object2Vec

  • Type: Embedding layer for objects
  • Use Case: Genre prediction, creating embeddings for higher-dimensional objects.
  • Data Input: Text file or CSV (tokenized into integers)
  • Data Output: Low-dimensional embeddings (e.g., CSV)
  • Key Concepts:
    • Generates low-dimensional embeddings for higher-dimensional data.
    • Requires tokenized data, which is transformed into integers before training.
  • Hyperparameters: Typical deep-learning ones.

Object Detection

  • Type: CNN-based detection
  • Use Case: Identifying and localizing objects in images.
  • Data Input: Image files (e.g., JPEG, PNG)
  • Data Output: Predicted bounding boxes with labels (CSV, JSON)
  • Training:
    • Two variants: an MXNet-based Single Shot multibox Detector (SSD) and a TensorFlow-based model.
  • Hyperparameters:
    • mini_batch_size
    • learning_rate
    • optimizer

Image Classification

  • Type: Object classification without localization
  • Use Case: Classifying images into predefined categories (doesn’t identify object locations).
  • Data Input: Image files (e.g., JPEG, PNG)
  • Data Output: Predicted class labels (CSV, JSON)
  • Hyperparameters: Typical deep-learning ones.

Semantic Segmentation

  • Type: Pixel-level object classification
  • Use Case: Classifying each pixel in an image to assign a category (e.g., identifying the region of interest for autonomous driving).
  • Data Input: Image files (e.g., JPEG, PNG)
  • Data Output: Segmented image masks (PNG or JSON)
  • Hyperparameters:
    • Epochs, learning rate, batch size, optimizer, etc.
    • algorithm –Choice of FCN, PSP, or DeepLabV3.
    • backbone –ResNet-50 or ResNet-101, optionally pre-trained on ImageNet.

Random Cut Forest

  • Type: Anomaly detection
  • Use Case: Identifying anomalies in data. Builds a forest of trees, each trained on a random partition of the training data, and scores a point by the expected change in tree complexity when that point is added.
  • Data Input: RecordIO-protobuf or CSV
  • Data Output: Anomaly scores (CSV)
  • Training:
    • Unsupervised algorithm that builds a forest of trees for anomaly detection.
    • Key Parameters: Number of trees, number of samples per tree.
    • Does not take advantage of GPUs
  • Hyperparameters:
    • num_trees
    • num_samples_per_tree
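
A hedged sketch of a Random Cut Forest training job. feature_dim, the role, bucket, and values are illustrative; num_samples_per_tree is usually chosen so that 1/num_samples_per_tree roughly matches the expected fraction of anomalies.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"     # placeholder
bucket = "my-bucket"                                       # placeholder

container = image_uris.retrieve("randomcutforest", session.boto_region_name)

rcf = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",       # CPU; GPUs give no benefit here
    output_path=f"s3://{bucket}/rcf/output",
    sagemaker_session=session,
)
rcf.set_hyperparameters(
    feature_dim=1,                # dimensionality of each record
    num_trees=100,                # more trees smooth out score noise
    num_samples_per_tree=256,
)

rcf.fit({"train": TrainingInput(f"s3://{bucket}/rcf/train/",
                                content_type="text/csv")})
```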

Neural Topic Model

  • Type: Topic modeling
  • Use Case: Organizing documents into topics. Unsupervised. Four data channels: train (required); validation, test, and auxiliary (optional).
  • Data Input: Tokenized integer sequences (CSV or JSON format)
  • Data Output: Topics and topic distributions (CSV)
  • Hyperparameters:
    • mini_batch_size
    • num_topics

LDA (Latent Dirichlet Allocation)

  • Type: Topic modeling (not deep learning)
  • Use Case: Unsupervised organization of documents into topics based on shared words. Also works on non-text data, e.g., harmonic analysis in music.
  • Data Input: Text file or CSV (tokenized words)
  • Data Output: Topic distributions per document (CSV)
  • Training:
    • Single-instance CPU model.
  • Hyperparameters:
    • alpha0 –Initial concentration parameter; small values give sparse topic mixtures, larger values (>1.0) give more uniform mixtures.
    • num_topics

KNN (K-Nearest Neighbors)

  • Type: Instance-based learning
  • Use Case: Classification or regression based on proximity to k closest data points in the feature space.
  • Data Input: CSV, Parquet
  • Data Output: Predicted class labels or regression values (CSV)
  • Hyperparameters:
    • k –Number of nearest neighbors to consider (the key hyperparameter).
    • sample_size –Number of data points sampled from the training set.

K-Means

  • Type: Clustering (Unsupervised)
  • Use Case: Dividing data into k groups based on feature similarity.
  • Data Input: CSV, RecordIO-Protobuf. Train channel, optional test channel.
  • Data Output: Cluster assignments (CSV)
  • Training:
    • CPU recommended due to its simplicity.
  • Hyperparameters:
    • k:
      • Choosing the optimal number of clusters is tricky.
      • Plot the within-cluster sum of squares (WCSS) as a function of k and use the elbow method: pick the point where adding more clusters stops reducing WCSS significantly (see the sketch after this list).
      • Optimize for tightness of clusters (minimizing variance within clusters).
    • mini_batch_size:
      • Size of the data batches processed in each iteration.
      • Smaller batches give more frequent updates and can speed up convergence on large datasets.
    • extra_center_factor:
      • The algorithm starts with K = k * extra_center_factor centers during training and reduces them back to k final clusters, which improves accuracy.
    • init_method:
      • Method used for initializing cluster centroids (random or k-means++).
      • Affects the quality of the final clustering and convergence speed.
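
A small scikit-learn sketch (not the SageMaker estimator) of the elbow method described above: compute WCSS for a range of k on toy data and look for the bend where WCSS stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Toy data with three well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

ks = range(1, 10)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method: pick k where the curve bends")
plt.show()
```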

PCA (Principal Component Analysis)

  • Type: Dimensionality reduction
  • Use Case: Reduces the number of features while retaining as much variance as possible.
  • Data Input: CSV, Parquet
  • Data Output: Reduced feature data (CSV)
  • Training:
    • Unsupervised learning method.
    • Reduces the number of features to a smaller number of components.
    • A covariance matrix is created, then singular value decomposition (SVD) is applied, either regular or randomized (see the sketch below).
  • Hyperparameters:
    • algorithm_mode –regular (exact; fine for sparse data and a moderate number of observations/features) or randomized (an approximation that scales to large numbers of observations and features).
    • subtract_mean –Unbias the data by subtracting the mean before projection.
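
A numpy sketch of the idea in the training note above: center the data (subtract_mean), build the covariance matrix, and take an exact SVD, then project onto the top components. Toy data and k are illustrative; the randomized mode approximates this same decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy features

X_centered = X - X.mean(axis=0)            # "subtract_mean": unbias the data
cov = np.cov(X_centered, rowvar=False)     # covariance matrix
U, S, Vt = np.linalg.svd(cov)              # exact ("regular") SVD

k = 2
components = Vt[:k]                        # top-k principal directions
X_reduced = X_centered @ components.T      # project onto k components

explained = S[:k].sum() / S.sum()
print(f"variance retained by {k} components: {explained:.1%}")
```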

Factorization Machines

  • Type: Supervised learning for sparse data
  • Use Case: Recommender systems, click prediction.
  • Data Input: RecordIO-protobuf with Float32 (CSV is impractical for sparse data)
  • Data Output: Predicted ratings or classifications (CSV)
  • Training:
    • Deals with sparse data represented in large matrices.
    • Suitable for classification or regression tasks.
    • Limited to pair-wise interactions
    • CPU recommended, GPU usually used for dense data
  • Hyperparameters:
    • Initialization methods for the bias, linear terms, and factors (uniform, normal, or constant); see the sketch below for how these terms combine into a prediction.
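
A tiny numpy illustration of the pairwise-interaction scoring a factorization machine performs. The weights are random placeholders and the sparse one-hot vector stands in for a user/item pair; this is a sketch of the scoring formula, not the SageMaker implementation.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Factorization machine prediction for one input vector x.

    w0: global bias, w: linear weights (d,), V: factor matrix (d, k).
    Pairwise term uses the standard identity:
    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
    """
    linear = w0 + w @ x
    Vx = V.T @ x                                   # (k,)
    pairwise = 0.5 * (Vx**2 - (V**2).T @ (x**2)).sum()
    return linear + pairwise

rng = np.random.default_rng(0)
d, k = 6, 3                                        # 6 features, 3 latent factors
x = np.zeros(d)
x[1] = 1.0                                         # one-hot "user"
x[4] = 1.0                                         # one-hot "item"
print(fm_score(x, w0=0.1, w=rng.normal(size=d), V=rng.normal(size=(d, k))))
```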

IP Insights

  • Type: Anomaly detection for IP addresses
  • Use Case: Identifying anomalous behavior based on IP addresses.
  • Data Input: CSV (IP logs)
  • Data Output: Anomaly scores (CSV)
  • Training:
    • Unsupervised learning algorithm for detecting irregularities or threats in network data.
    • GPU recommended
  • Hyperparameters:
    • num_entity_vectors –Hash size; set to twice the number of unique entity identifiers.
    • vector_dim –Size of the embedding vectors.
    • epochs
    • learning_rate
    • batch_size
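
A hedged sketch of an IP Insights training job with the hyperparameters above. The training CSV has two columns per row (entity identifier, IP address) with no header; role, bucket, and values are placeholders.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerRole"     # placeholder
bucket = "my-bucket"                                       # placeholder

container = image_uris.retrieve("ipinsights", session.boto_region_name)

ipi = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",      # GPU recommended
    output_path=f"s3://{bucket}/ipinsights/output",
    sagemaker_session=session,
)
ipi.set_hyperparameters(
    num_entity_vectors=20000,   # roughly 2x the number of unique entities
    vector_dim=128,
    epochs=10,
    learning_rate=0.001,
    batch_size=1000,
)

# Training CSV rows look like: "user_name,ip_address"
ipi.fit({"train": TrainingInput(f"s3://{bucket}/ipinsights/train/",
                                content_type="text/csv")})
```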