Algorithms
Linear Learner
- Type: Classification or regression
- Data Input: RecordIO-wrapped protobuf (Float32 data only) or CSV (first column assumed to be the label). File or Pipe mode both supported.
- Data Output: Model artifacts (serialized model files packaged as .tar.gz)
- Training:
- For long training times, use Pipe Mode to avoid storing large datasets in memory.
- Training data should be normalized so all features carry equal weight; Linear Learner can normalize the data for you.
- Suitable for both single and multi-machine configurations with CPU/GPU.
- Multi-GPU does not provide significant benefits for this algorithm.
- Hyperparameters:
- balance_multiclass_weights –Ensures equal importance for all classes in loss functions, useful for imbalanced datasets.
- learning_rate –Controls step size for weight updates during training.
- mini_batch_size –Number of samples processed per training step.
- L1 –Lasso regularization; encourages sparsity in model weights.
- Wd (Weight Decay, L2 Regularization) –Helps prevent overfitting by penalizing large weight values.
- target_precision –Used when binary_classifier_model_selection_criteria = recall_at_target_precision; holds precision at the target while maximizing recall.
- target_recall –Used when binary_classifier_model_selection_criteria = precision_at_target_recall; holds recall at the target while maximizing precision.
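A minimal sketch of launching a Linear Learner training job with the SageMaker Python SDK, using several of the hyperparameters above. The role ARN, bucket, and prefix are placeholders and the values are illustrative only:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder execution role

# Resolve the built-in Linear Learner container for the current region
container = image_uris.retrieve("linear-learner", session.boto_region_name)

linear = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    output_path="s3://my-bucket/linear-learner/output",  # placeholder
    sagemaker_session=session,
)

# Hyperparameters discussed above; values are illustrative only
linear.set_hyperparameters(
    predictor_type="binary_classifier",
    feature_dim=50,
    mini_batch_size=1000,
    learning_rate=0.01,
    l1=0.0,
    wd=0.01,  # L2 / weight decay
    binary_classifier_model_selection_criteria="recall_at_target_precision",
    target_precision=0.9,
)

# Pipe mode streams data from S3 instead of loading it all into memory
linear.fit({"train": TrainingInput(
    "s3://my-bucket/linear-learner/train",  # placeholder
    content_type="application/x-recordio-protobuf",
    input_mode="Pipe",
)})
```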
XGBoost
- Type: Gradient boosting. Trees are added sequentially, with each new tree fit to correct the errors of the previous ones; gradient descent minimizes the loss as trees are added.
- Use Case: Classification, regression, ranking
- Data Input: CSV, LibSVM, or Parquet
- Data Output: Model artifacts (serialized model files such as .json or .bin)
- Key Parameters:
- eval_metric: Evaluation metric for validation data; set it to auc, for example, to optimize on AUC.
- scale_pos_weight: Adjusts the balance of positive and negative weights to prevent bias towards the majority class.
- max_depth: Controls tree depth; higher values can lead to overfitting.
- eta: Controls step size shrinkage to prevent overfitting.
- subsample: Defines the fraction of samples used for each tree.
- Training:
- Memory-bound rather than compute-bound, so M5 is a good instance choice; P2 or P3 instances can be used for GPU-accelerated training.
- Supports single-instance GPU or distributed GPU training.
- Tuning parameters like subsample and max_depth can significantly impact model performance and generalization.
- Hyperparameters:
- subsample –Controls the fraction of samples used for each tree to prevent overfitting.
- eta –Step size shrinkage; reduces overfitting by shrinking the impact of each tree.
- gamma –Minimum loss reduction required to create a partition; larger values make the model more conservative.
- alpha –L1 regularization term; larger values increase regularization and make the model more conservative.
- lambda –L2 regularization term; larger values increase regularization and make the model more conservative.
- eval_metric –Metric used to evaluate model performance, such as AUC, error, RMSE, etc.
- scale_pos_weight –Adjusts the balance of positive and negative weights, useful for imbalanced classes. Often set to the ratio of negative to positive cases.
- max_depth –Maximum depth of the decision tree; deeper trees can lead to overfitting.
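To make these hyperparameter names concrete, here is a small sketch using the open-source xgboost library (the SageMaker built-in accepts the same names); the imbalanced synthetic data exists only to show scale_pos_weight in action:

```python
import numpy as np
import xgboost as xgb

# Tiny synthetic, imbalanced binary-classification set (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.1).astype(int)  # ~10% positives

dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",        # optimize/evaluate on AUC
    "eta": 0.1,                  # step-size shrinkage
    "max_depth": 4,              # deeper trees overfit more easily
    "subsample": 0.8,            # row sampling per tree
    "gamma": 1.0,                # min loss reduction to split
    "alpha": 0.0,                # L1 regularization
    "lambda": 1.0,               # L2 regularization
    "scale_pos_weight": (y == 0).sum() / max((y == 1).sum(), 1),  # neg/pos ratio
}

booster = xgb.train(params, dtrain, num_boost_round=100,
                    evals=[(dtrain, "train")], verbose_eval=20)
```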
LightGBM
- Type: Gradient Boosting Decision Tree (GBDT)
- Use Case: Classification, regression, ranking, ensemble prediction
- Data Input: CSV, LibSVM, or Parquet
- Data Output: Model artifacts (serialized model files such as .bin)
- Key Parameters:
- Learning rate: Controls how much each tree influences the model.
- num_leaves: Number of leaves in a tree; higher values increase model complexity.
- feature_fraction: Fraction of features used for each tree.
- bagging_fraction & bagging_freq: For random sampling of data to avoid overfitting.
- max_depth: Prevents overfitting by limiting tree depth.
- min_data_in_leaf: Minimum number of data points in each leaf node.
- Training:
- CPU-only model; optimized for large datasets.
- Memory-bound rather than compute-bound, so choose a CPU instance with plenty of memory (general-purpose or memory-optimized).
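The same idea sketched with the open-source lightgbm package on synthetic data; the parameter names match the bullets above and the values are arbitrary:

```python
import numpy as np
import lightgbm as lgb

# Synthetic regression data (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=5000)

train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "regression",
    "learning_rate": 0.05,    # contribution of each tree
    "num_leaves": 31,         # more leaves = more model complexity
    "max_depth": -1,          # -1 means no depth limit
    "feature_fraction": 0.8,  # fraction of features used per tree
    "bagging_fraction": 0.8,  # fraction of rows sampled for bagging
    "bagging_freq": 5,        # re-sample rows every 5 iterations
    "min_data_in_leaf": 20,   # guards against tiny, overfit leaves
}

model = lgb.train(params, train_set, num_boost_round=200)
preds = model.predict(X[:5])
```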
Seq2Seq
- Type: Sequence-to-sequence model
- Use Case: Machine translation, text summarization, speech-to-text
- Data Input: Text files (tokenized sequences of words or symbols), RecordIO ProtoBuf (tokens must be integers)
- Data Output: Text files (predicted sequence)
- Training:
- Input and output are sequences of tokens (e.g., text or speech).
- Uses single or multi-GPU on one machine.
- Tuning: batch size, optimizer type, and learning rate; quality is evaluated with BLEU score (translation quality) and perplexity (language modeling).
- Hyperparameters:
- batch_size –Number of training samples per batch; influences training time and memory usage.
- optimizer_type –The type of optimizer used, e.g., Adam, SGD, RMSProp.
- learning_rate –Controls the step size during training; affects convergence speed and stability.
- num_layers_encoder –Number of layers in the encoder; higher values can capture more complex patterns.
- num_layers_decoder –Number of layers in the decoder; higher values increase model complexity.
DeepAR
- Type: Time-series forecasting
- Use Case: Forecasting 1-dimensional time series data, such as stock prices or demand forecasting.
- Data Input: JSON Lines (optionally gzipped) or Parquet (time-series data)
- Data Output: Forecasted data (often in the form of CSV or JSON files)
- Training:
- Can train on many related time series at once and learn across them, not just one series at a time.
- Always include the entire time series for training, testing, and inference to get accurate forecasts.
- Supports both CPU and GPU with single or multiple machines.
- Hyperparameters:
- context_length –The number of time points the model looks at before making a prediction. It can be smaller than seasonalities, as the model will still account for seasonality trends (e.g., a yearly lag).
- epochs –The number of complete passes through the entire dataset during training. More epochs can lead to better model performance, but may also risk overfitting.
- mini_batch_size –The size of the mini-batches used in each step of training. Smaller batch sizes can lead to more frequent updates but require more iterations.
- learning_rate –Controls the step size during training, impacting the speed and stability of model convergence.
- num_cells –The number of cells (units) in the recurrent layers (e.g., LSTM, GRU). More cells allow the model to capture more complex patterns but increase computation.
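A sketch of what DeepAR training data and hyperparameters look like. The JSON Lines layout follows the documented format ("start", "target", optional "cat"/"dynamic_feat"); the series values and settings below are made up:

```python
import json

# Each line of the training file is one time series.
# "start" is the timestamp of the first value; "target" is the series itself.
series = [
    {"start": "2024-01-01 00:00:00", "target": [5.0, 5.5, 6.1, 5.8, 6.4]},
    {"start": "2024-01-01 00:00:00", "target": [1.2, 1.1, 1.4, 1.3], "cat": [0]},
]

with open("train.json", "w") as f:
    for ts in series:
        f.write(json.dumps(ts) + "\n")

# Illustrative hyperparameters for the training job (names match the
# DeepAR hyperparameter list; the values are arbitrary):
hyperparameters = {
    "time_freq": "H",           # hourly data
    "context_length": 24,       # look-back window
    "prediction_length": 24,    # forecast horizon
    "epochs": 100,
    "mini_batch_size": 64,
    "learning_rate": 0.001,
    "num_cells": 40,            # units per recurrent layer
}
```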
BlazingText
- Type: Text classification (supervised) and Word2Vec word embeddings (unsupervised)
- Use Case: Information retrieval, sentence-level classification (not entire documents).
- Data Input: Text file with one space-tokenized sentence per line; for text classification, each line begins with __label__<label>.
- Data Output: Word embeddings (e.g., CSV or TSV files)
- Key Concepts:
- Uses Word2Vec to create word embeddings and find similarity between words.
- Can be trained with CBOW (Continuous Bag of Words) or Skip-gram methods.
- Requires a text file as input data.
- Hyperparameters
- Word2Vec:
- Mode –Defines the architecture used:
- skipgram: Predicts the surrounding context words from a target word.
- batch_skipgram: Skip-gram with mini-batching; enables distributed training across multiple CPU instances.
- cbow (Continuous Bag of Words): Uses context words to predict the target word.
- learning_rate –Controls the rate of learning during training. Too high can cause instability, too low can slow down convergence.
- window_size –The number of context words around a target word to consider for training.
- vector_dim –The size of the vector that represents each word in the word embedding.
- negative_samples –The number of negative samples used to train the model. Helps in creating better word embeddings by contrasting positive and negative samples.
- Text Classification:
- epochs –Number of times the entire training dataset is passed through the model. More epochs may lead to better learning but risk overfitting.
- learning_rate –Determines the step size in optimization algorithms, affecting convergence speed and stability.
- word_ngrams –N-gram level feature extraction for text; captures word sequences (e.g., bigrams or trigrams) for more context in classification.
- vector_dim –The dimensionality of the word vectors used in the model, affecting the representation capacity and computation requirements.
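A short sketch of preparing BlazingText input files for both modes, plus illustrative Word2Vec hyperparameters; the file names and sentences are made up:

```python
# Word2Vec (unsupervised) mode: one space-tokenized sentence per line.
word2vec_lines = [
    "the quick brown fox jumps over the lazy dog",
    "a fast auburn fox leaps over a sleepy hound",
]
with open("word2vec_corpus.txt", "w") as f:
    f.write("\n".join(word2vec_lines) + "\n")

# Supervised text-classification mode: each line is prefixed with
# __label__<class>, followed by the tokenized sentence.
classification_lines = [
    "__label__positive this movie was a delight from start to finish",
    "__label__negative the plot dragged and the acting felt flat",
]
with open("classification_train.txt", "w") as f:
    f.write("\n".join(classification_lines) + "\n")

# Illustrative Word2Vec hyperparameters (names as in the docs, values arbitrary)
hyperparameters = {
    "mode": "skipgram",      # or "cbow" / "batch_skipgram"
    "learning_rate": 0.05,
    "window_size": 5,
    "vector_dim": 100,
    "negative_samples": 5,
    "epochs": 10,
}
```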
Object2Vec
- Type: Embedding layer for objects
- Use Case: Genre prediction, creating embeddings for higher-dimensional objects.
- Data Input: Text file or CSV (tokenized into integers)
- Data Output: Low-dimensional embeddings (e.g., CSV)
- Key Concepts:
- Generates low-dimensional embeddings for higher-dimensional data.
- Requires tokenized data, which is transformed into integers before training.
- Hyperparameters: Typical deep-learning ones.
Object Detection
- Type: CNN-based detection
- Use Case: Identifying and localizing objects in images.
- Data Input: Image files (e.g., JPEG, PNG)
- Data Output: Predicted bounding boxes with labels (CSV, JSON)
- Training:
- Two variants: MXNet (a CNN based on the Single Shot multibox Detector, SSD) and TensorFlow (selectable pre-trained detection models).
- Hyperparameters:
- mini_batch_size
- learning_rate
- optimizer
Image Classification
- Type: Object classification without localization
- Use Case: Classifying images into predefined categories (doesn’t identify object locations).
- Data Input: Image files (e.g., JPEG, PNG)
- Data Output: Predicted class labels (CSV, JSON)
- Hyperparameters: Typical deep-learning ones.
Semantic Segmentation
- Type: Pixel-level object classification
- Use Case: Classifying each pixel in an image to assign a category (e.g., identifying the region of interest for autonomous driving).
- Data Input: Image files (e.g., JPEG, PNG)
- Data Output: Segmented image masks (PNG or JSON)
- Hyperparameters:
- Epochs, learning rate, batch size, optimizer, etc.
- Algorithm (FCN, PSP, or DeepLabV3)
- Backbone (ResNet-50 or ResNet-101)
Random Cut Forest
- Type: Anomaly detection
- Use Case: Identifying anomalies in data. Builds a forest of trees where each tree is built from a partition (random sample) of the training data; the anomaly score reflects the expected change in a tree's complexity when the point is added to it.
- Data Input: RecordIO-protobuf or CSV
- Data Output: Anomaly scores (CSV)
- Training:
- Unsupervised algorithm that builds a forest of trees for anomaly detection.
- Key Parameters: Number of trees, number of samples per tree.
- Does not take advantage of GPUs
- Hyperparameters:
- num_trees
- num_samples_per_tree
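A sketch using the first-party RandomCutForest estimator class in the SageMaker Python SDK (assumed available as sagemaker.RandomCutForest); the role ARN and the synthetic data are placeholders:

```python
import numpy as np
import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder

# Synthetic 1-D signal with a few injected spikes (illustration only)
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 1)).astype("float32")
data[[100, 500, 900]] += 8.0  # obvious anomalies

rcf = RandomCutForest(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",   # CPU instance: RCF gains nothing from GPUs
    num_trees=100,                  # number of trees in the forest
    num_samples_per_tree=256,       # random sample that builds each tree
    sagemaker_session=session,
)

# record_set() converts the numpy array to RecordIO-protobuf and uploads it to S3
rcf.fit(rcf.record_set(data))
```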
Neural Topic Model
- Type: Topic modeling
- Use Case: Organizing documents into topics (unsupervised). Four data channels: train (required), plus optional validation, test, and auxiliary channels.
- Data Input: Tokenized integer sequences (CSV or JSON format)
- Data Output: Topics and topic distributions (CSV)
- Hyperparameters:
- mini_batch_size
- num_topics
LDA (Latent Dirichlet Allocation)
- Type: Topic modeling (not deep learning)
- Use Case: Unsupervised organization of documents into topics based on shared words; also applicable beyond text, e.g., harmonic analysis in music.
- Data Input: Text file or CSV (tokenized words)
- Data Output: Topic distributions per document (CSV)
- Training:
- Single-instance CPU model.
- Hyperparameters:
- alpha0 –Initial guess for the concentration parameter; smaller values yield sparse topic mixtures, while values above 1.0 produce more uniform results.
- num_topics
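For intuition about these knobs without launching a training job, here is a scikit-learn sketch (this is scikit-learn's LDA, not the SageMaker container); its n_components plays the role of num_topics and doc_topic_prior plays the role of alpha0:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "guitar chords scales harmony melody",
    "melody harmony rhythm tempo guitar",
    "stocks bonds market portfolio risk",
    "portfolio risk return market equity",
]

# Bag-of-words counts: LDA works on token counts, not raw text
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,       # analogous to num_topics
    doc_topic_prior=0.1,  # analogous to alpha0: small -> sparse topic mixtures
    random_state=0,
)
doc_topics = lda.fit_transform(counts)  # per-document topic distributions
print(doc_topics.round(2))
```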
KNN (K-Nearest Neighbors)
- Type: Instance-based learning
- Use Case: Classification or regression based on proximity to k closest data points in the feature space.
- Data Input: RecordIO-protobuf or CSV (first column is the label)
- Data Output: Predicted class labels or regression values (CSV)
- Hyperparameters:
- k –The number of neighbors to consider; the most important hyperparameter.
- sample_size –Number of data points sampled from the training set.
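A quick scikit-learn sketch showing how the choice of k changes results; SageMaker's k hyperparameter controls the same trade-off (sample_size additionally subsamples the training data). Synthetic data only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, just to show the effect of k
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)  # n_neighbors is "k"
    knn.fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))        # accuracy for each k
```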
K-Means
- Type: Clustering (Unsupervised)
- Use Case: Dividing data into k groups based on feature similarity.
- Data Input: CSV, RecordIO-Protobuf. Train channel, optional test channel.
- Data Output: Cluster assignments (CSV)
- Training:
- CPU recommended due to its simplicity.
- Hyperparameters:
- k:
- Choosing the optimal number of clusters is tricky.
- Plot within-cluster sum of squares (WCSS) as a function of K to identify the best K value.
- Use the elbow method to determine the point where adding more clusters doesn’t significantly improve the model.
- Optimize for tightness of clusters (minimizing variance within clusters).
- mini_batch_size:
- Size of data batches processed in each iteration.
- Can help speed up convergence and reduce computation time by using smaller data chunks.
- extra_center_factor:
- Starts training with K = k × extra_center_factor cluster centers, which are reduced back down to k as training proceeds.
- A higher value can improve the quality of the final clusters at extra cost.
- init_method:
- Method used for initializing cluster centroids (e.g., random initialization, k-means++).
- Affects the quality of the final clustering and convergence speed.
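A small scikit-learn sketch of the elbow method described above: compute WCSS (inertia) for increasing k and look for the point where it stops dropping sharply. The blobs are synthetic, for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs centered at (0,0), (5,5), (10,10)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2)) for c in (0, 5, 10)])

# Print WCSS vs. k and look for the "elbow" (here it should be at k=3)
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # inertia_ is the within-cluster sum of squares
```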
PCA (Principal Component Analysis)
- Type: Dimensionality reduction
- Use Case: Reduces the number of features while retaining as much variance as possible.
- Data Input: RecordIO-protobuf or CSV
- Data Output: Reduced feature data (CSV)
- Training:
- Unsupervised learning method.
- Reduces the number of features to a smaller number of components.
- A covariance matrix is created, then singular value decomposition (SVD) is applied; this can run in regular or randomized mode.
- Hyperparameters:
- algorithm_mode –regular (good for sparse data and a moderate number of observations/features) or randomized (an approximation that scales to large numbers of observations and features)
- subtract_mean –Unbias (center) the data before computing components
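A plain NumPy sketch of what PCA does under the hood: subtract the mean (the subtract_mean idea), run SVD, and project onto the top components. Synthetic data, regular (non-randomized) SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # 200 observations, 10 features

# subtract_mean: center ("unbias") the data first
X_centered = X - X.mean(axis=0)

# SVD of the centered data gives the principal components
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

num_components = 3
components = Vt[:num_components]      # top principal directions
X_reduced = X_centered @ components.T # projected, lower-dimensional data

# Variance explained by each retained component
explained = (S[:num_components] ** 2) / (len(X) - 1)
print(X_reduced.shape, explained)
```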
Factorization Machines
- Type: Supervised learning for sparse data
- Use Case: Recommender systems, click prediction.
- Data Input: RecordIO-protobuf with Float32 (CSV is impractical for sparse data)
- Data Output: Predicted ratings or classifications (CSV)
- Training:
- Deals with sparse data represented in large matrices.
- Suitable for classification or regression tasks.
- Limited to pair-wise interactions
- CPU recommended; GPUs only help with dense data, which factorization machines are not typically used for
- Hyperparameters:
- Initialization methods for the bias, linear, and factor terms (uniform, normal, or constant)
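A NumPy sketch of the factorization machine scoring function, to make the pair-wise interactions point concrete; this is just the math, not the SageMaker implementation, and all values are random placeholders:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization Machine score for one sample x.

    w0: global bias, w: linear weights (n_features,),
    V:  factor matrix (n_features, k) whose rows embed each feature.
    Interactions are limited to pairs: <V[i], V[j]> * x[i] * x[j].
    """
    linear = w0 + w @ x
    # Efficient pairwise-interaction term, O(n_features * k)
    interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + interactions

rng = np.random.default_rng(0)
n_features, k = 6, 3
x = rng.integers(0, 2, size=n_features).astype(float)  # sparse, one-hot-like input
w0, w = 0.1, rng.normal(size=n_features)
V = rng.normal(scale=0.1, size=(n_features, k))
print(fm_predict(x, w0, w, V))
```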
IP Insights
- Type: Anomaly detection for IP addresses
- Use Case: Identifying anomalous behavior based on IP addresses.
- Data Input: CSV with two columns: an entity identifier (e.g., user name or account ID) and the IPv4 address it used
- Data Output: Anomaly scores (CSV)
- Training:
- Unsupervised learning algorithm for detecting irregularities or threats in network data.
- GPU recommended
- Hyperparameters:
- num_entity_vectors –Hash size for entity embeddings; set it to twice the number of unique entity identifiers to limit hash collisions
- vector_dim –Size of the embedding vectors; too large a value can cause overfitting
- epochs
- learning_rate
- batch_size
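A sketch of the two-column CSV input plus illustrative hyperparameter values; the entity names, IP addresses, and values are made up:

```python
import csv

# IP Insights training data: two columns, no header —
# an entity identifier (e.g., user name) and the IPv4 address it used.
events = [
    ("alice", "192.0.2.10"),
    ("alice", "192.0.2.11"),
    ("bob",   "198.51.100.7"),
]
with open("ipinsights_train.csv", "w", newline="") as f:
    csv.writer(f).writerows(events)

# Illustrative hyperparameters; num_entity_vectors is typically set to
# roughly twice the number of unique entities (here, 2 entities -> 4).
hyperparameters = {
    "num_entity_vectors": 4,
    "vector_dim": 128,
    "epochs": 5,
    "learning_rate": 0.01,
    "mini_batch_size": 1000,  # the batch-size knob
}
```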