SageMaker
Amazon SageMaker is a comprehensive suite of machine learning (ML) services designed to help build, train, and deploy models in the cloud. These services include various tools for natural language processing (NLP), speech recognition, image processing, and more. Below is an expanded overview of key SageMaker services:
- AutoPilot (AutoML): Automates model selection, preprocessing, tuning
- Clarify: Detects bias & feature attribution
- Debugger: Monitors training in real time and automates detection of training issues (e.g., vanishing gradients, overfitting)
- Model Registry: CI/CD integration for models
- Model Monitor: Detects and alerts on quality deviations, data drift, anomalies, and outliers in deployed models
- TensorBoard Integration: Model training visualization
- Training Compiler: Compiles models into hardware-optimized instructions to speed up training; built into AWS Deep Learning Containers (DLCs)
- Warm Pools: Retains provisioned infra, reduces startup latency
- Checkpointing: Saves training snapshots
- Feature Store: A centralized repository for storing, sharing, and serving ML features for training and inference
- Canvas: No-code machine learning for analysts
- Data Wrangler: Imports, transforms, and prepares data, generating the corresponding code and streamlining the process for users who need to integrate external data sources into their workflow.
Bias Metrics
1. Class Imbalance (CI)
- Occurs when one demographic group has significantly fewer training samples than another.
- Can lead to models that favor the majority class, reducing fairness and performance on underrepresented groups.
2. Difference in Proportions of Labels (DPL)
- Measures imbalance in positive outcomes between different groups (facets).
- High DPL indicates that some demographic groups receive favorable predictions more often than others, leading to potential bias.
3. Kullback-Leibler (KL) Divergence & Jensen-Shannon (JS) Divergence
- Measures how much the outcome distributions of different facets diverge from each other.
- KL Divergence: Asymmetric, quantifies how much one distribution differs from another.
- JS Divergence: A symmetric and smoothed version of KL Divergence.
- Used to compare probability distributions of predicted outcomes across different groups.
4. Lp-norm (LP)
- Measures the p-norm difference between outcome distributions of different demographic facets.
- Can be used to quantify overall disparity in model predictions.
5. Total Variation Distance (TVD)
- Half the L1-norm distance between outcome distributions from different groups.
- High TVD suggests significant disparities in how predictions vary across demographic facets.
6. Kolmogorov-Smirnov (KS) Statistic
- Measures the maximum divergence between cumulative distribution functions (CDFs) of outcomes for different facets.
- A large KS value means one group systematically receives more favorable or unfavorable outcomes than another.
7. Conditional Demographic Disparity (CDD)
- Evaluates demographic disparities across groups as a whole and within subgroups.
- Provides a more granular understanding of bias at different levels of analysis.
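Several of the metrics above (KL, JS, TVD, DPL, KS) can be computed directly. Below is a minimal numpy/scipy sketch using hypothetical facet distributions; the numbers are illustrative only, not from SageMaker Clarify itself.
```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy, ks_2samp

# Outcome distributions for two facets (hypothetical numbers)
p = np.array([0.7, 0.3])  # facet A: P(positive), P(negative)
q = np.array([0.5, 0.5])  # facet B

kl = entropy(p, q)                 # KL divergence (asymmetric)
js = jensenshannon(p, q) ** 2      # JS divergence (scipy returns its square root)
tvd = 0.5 * np.abs(p - q).sum()    # total variation distance (half the L1-norm)
dpl = p[0] - q[0]                  # difference in proportions of positive labels

# KS statistic: max gap between the empirical CDFs of raw outcomes per facet
a = np.random.binomial(1, 0.7, 1000)
b = np.random.binomial(1, 0.5, 1000)
ks = ks_2samp(a, b).statistic

print(f"KL={kl:.3f} JS={js:.3f} TVD={tvd:.3f} DPL={dpl:.3f} KS={ks:.3f}")
```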
Partial Dependence Plots (PDPs)
Show the dependence of the target response on a set of input features, illustrating how feature values influence a prediction.
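A minimal sketch of a PDP with scikit-learn; the synthetic dataset and gradient-boosted model are placeholders.
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = GradientBoostingClassifier().fit(X, y)
# Plot how the predicted response changes as features 0 and 1 vary
PartialDependenceDisplay.from_estimator(clf, X, features=[0, 1])
```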
Shapley Values
A game-theoretic method that attributes a model's prediction to the contribution of each input feature.
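A minimal sketch of per-feature Shapley attributions using the open-source shap library (which also underlies Clarify's feature attribution); model and data are placeholders.
```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # efficient Shapley values for tree models
shap_values = explainer.shap_values(X)  # one attribution per feature per sample
shap.summary_plot(shap_values, X)       # global view of feature contributions
```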
Managed AI Services
Comprehend (NLP)
Amazon Comprehend is a natural language processing (NLP) service that helps you analyze text and extract useful information. It can be used to classify text, detect entities, and understand sentiment.
- Custom Classification: Train custom models to classify text into categories based on your specific needs, such as classifying customer support emails into different topics (e.g., billing, technical issues).
- Example Use Case: Classifying emails to automate routing to the right department in an organization.
- Store trained data in S3 and feed into Comprehend: You can store labeled text data in Amazon S3 and use it to train or fine-tune your custom classification model in Comprehend.
- Example Use Case: Store datasets of emails and use them to train a model to predict which department the email should be directed to.
- Named Entity Recognition (NER): Extract entities such as names, locations, dates, and more from text. This helps in structuring unstructured text.
- Example Use Case: Automatically identify and extract company names and locations from news articles.
- Custom Entity Recognition: Extend NER to recognize domain-specific entities that are important for your business or application. You can train the model with your own set of entities.
- Example Use Case: Identifying specific product names in customer reviews.
- Share custom models within the same region via ARN and IAM policy: Custom models can be shared across AWS services and accounts within the same region, ensuring controlled access via IAM policies and resource management.
- Example Use Case: Sharing your NLP models with different teams within the same organization.
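A minimal boto3 sketch of entity and sentiment detection; the sample text is a placeholder and AWS credentials/region are assumed to be configured.
```python
import boto3

comprehend = boto3.client("comprehend")
text = "AnyCompany's new office in Seattle opens in March."

entities = comprehend.detect_entities(Text=text, LanguageCode="en")
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")

for e in entities["Entities"]:
    print(e["Type"], e["Text"], e["Score"])
print(sentiment["Sentiment"])
```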
Translate
Amazon Translate is a neural machine translation (NMT) service that allows you to translate text between various languages.
- Neural Network-based Translation: Translate supports multiple languages using neural networks for more accurate and natural translations. This ensures high-quality results for languages like English, Spanish, French, and many others.
- Example Use Case: Translating customer support tickets from various languages to ensure faster responses.
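A minimal boto3 sketch of a translation call; the input text is a placeholder.
```python
import boto3

translate = boto3.client("translate")
result = translate.translate_text(
    Text="Where is my order?",
    SourceLanguageCode="auto",  # let Translate detect the source language
    TargetLanguageCode="es",
)
print(result["TranslatedText"])
```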
Transcribe (ASR - Automatic Speech Recognition)
Amazon Transcribe provides automatic speech recognition (ASR) to convert speech into text. It can handle audio files, video files, and streaming audio sources.
- PII Removal: Transcribe can automatically remove or mask Personally Identifiable Information (PII), ensuring privacy and compliance with regulations like GDPR.
- Example Use Case: Redacting sensitive information (like phone numbers) from call center recordings.
- Automatic Language Identification: Automatically detects the language of the spoken content, so you don’t need to specify the language before transcription.
- Example Use Case: Transcribing multi-lingual podcasts or conference calls without needing manual language detection.
- Custom Vocabulary & Custom Language Models: Customize the ASR service with your own vocabulary or domain-specific terms (e.g., technical jargon or brand names); using both together yields the highest accuracy.
- Example Use Case: Training the model to recognize specialized medical terms during transcription of healthcare provider meetings.
- Toxicity Detection: Analyzes both speech-based cues (tone and pitch) and text-based cues to flag toxic content and emotional signals like sarcasm, anger, or enthusiasm.
- Example Use Case: Analyzing customer service interactions for sentiment and emotional tone to assess service quality.
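A minimal boto3 sketch of a transcription job with automatic language identification and PII redaction; bucket and job names are placeholders.
```python
import boto3

transcribe = boto3.client("transcribe")
transcribe.start_transcription_job(
    TranscriptionJobName="call-center-demo",
    Media={"MediaFileUri": "s3://<bucket>/calls/call1.wav"},
    IdentifyLanguage=True,          # auto-detect the spoken language
    ContentRedaction={              # mask PII in the transcript
        "RedactionType": "PII",
        "RedactionOutput": "redacted",
    },
    OutputBucketName="<bucket>",
)
```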
Polly
Amazon Polly converts text into lifelike speech using advanced deep learning technologies.
- Text-to-Speech Synthesis: Polly can generate high-quality speech in various languages and voices, including both male and female voices with customizable accents.
- Example Use Case: Creating voice responses for interactive voice response (IVR) systems or virtual assistants.
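A minimal boto3 sketch of speech synthesis; the text, voice, and output filename are placeholders.
```python
import boto3

polly = boto3.client("polly")
resp = polly.synthesize_speech(
    Text="Thank you for calling. How can I help?",
    OutputFormat="mp3",
    VoiceId="Joanna",
)
with open("greeting.mp3", "wb") as f:
    f.write(resp["AudioStream"].read())  # AudioStream is a streaming body
```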
Rekognition
Amazon Rekognition is an image and video analysis service that can identify objects, people, text, and scenes in images and videos.
- OCR (Optical Character Recognition): Detects and extracts text from images, such as scanned documents or photographs containing written content.
- Example Use Case: Scanning invoices or receipts to extract key information like amounts and dates.
- Object, People, and Text Detection: Rekognition can detect objects (e.g., cars, chairs), recognize people, and read text within images.
- Example Use Case: Identifying people and objects in surveillance video for security applications.
- Custom Labels for Training Image Data: You can create custom labels for training Rekognition to recognize specific objects relevant to your business (e.g., product types, brand logos).
- Example Use Case: Identifying and categorizing products in retail inventory management.
- A2I (Augmented AI) for Human Review: For high-stakes or uncertain predictions, Rekognition integrates with A2I to route predictions to human reviewers for validation before making final decisions.
- Example Use Case: Reviewing automated object detection results to ensure accuracy in highly regulated industries like healthcare.
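A minimal boto3 sketch of label and text detection on an S3-hosted image; bucket and key are placeholders.
```python
import boto3

rekognition = boto3.client("rekognition")
image = {"S3Object": {"Bucket": "<bucket>", "Name": "photos/store.jpg"}}

labels = rekognition.detect_labels(Image=image, MaxLabels=10, MinConfidence=80)
text = rekognition.detect_text(Image=image)

for label in labels["Labels"]:
    print(label["Name"], label["Confidence"])
for t in text["TextDetections"]:
    print(t["DetectedText"])
```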
Forecast
Amazon Forecast provides time-series data forecasting using machine learning models to predict future trends.
- Time-Series Data Forecasting: Automatically analyzes historical data (e.g., sales data, web traffic) and forecasts future values, helping businesses make data-driven decisions.
- Example Use Case: Predicting product demand, inventory needs, or financial metrics for budget planning.
Lex
Amazon Lex is a service for building conversational interfaces (chatbots) using voice and text.
- Chatbot Development with Lambda Integration: You can develop chatbots using Lex, and integrate them with AWS Lambda to run code in response to user input, making it dynamic and interactive.
- Example Use Case: Building a virtual assistant for customer service that can handle queries and make appointments.
- Slots as Input Parameters: Lex allows you to define slots for collecting required information from users (e.g., dates, locations) to ensure the bot gathers enough details for task completion.
- Example Use Case: A travel booking bot that asks for flight dates and locations before making bookings.
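A hedged sketch of a Lambda fulfillment handler for such a travel bot, using the classic Lex V1-style event/response shape; slot names are hypothetical.
```python
def lambda_handler(event, context):
    # Lex passes collected slot values in the event (V1-style format)
    slots = event["currentIntent"]["slots"]  # e.g., {"Origin": ..., "TravelDate": ...}
    msg = f"Booking a flight from {slots.get('Origin')} on {slots.get('TravelDate')}."
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": msg},
        }
    }
```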
Personalize
Amazon Personalize provides real-time personalized recommendations using machine learning.
- Real-Time Personalized Recommendations: Personalize uses built-in algorithms, or recipes, to recommend products, content, or services tailored to individual user preferences.
- Example Use Case: Recommending movies on a streaming platform based on past viewing history.
- Uses Built-in Algorithms ("Recipes"): Pre-built machine learning models allow developers to quickly integrate personalized recommendations without needing deep ML expertise.
- Example Use Case: Building a product recommendation engine for an e-commerce site that adapts to user behavior.
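A minimal boto3 sketch of fetching real-time recommendations from a deployed campaign; the campaign ARN and user ID are placeholders.
```python
import boto3

personalize = boto3.client("personalize-runtime")
resp = personalize.get_recommendations(
    campaignArn="arn:aws:personalize:<region>:<account>:campaign/<name>",
    userId="user-123",
    numResults=5,
)
for item in resp["itemList"]:
    print(item["itemId"])
```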
Textract
Amazon Textract extracts text and data from documents.
- Document Text Extraction (OCR): Textract can automatically extract text from scanned documents (PDFs, images) and analyze the layout to extract tables and forms.
- Example Use Case: Extracting data from invoices and tax forms for automated processing in accounting systems.
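A minimal boto3 sketch of form and table analysis on a single document image; bucket and key are placeholders.
```python
import boto3

textract = boto3.client("textract")
resp = textract.analyze_document(
    Document={"S3Object": {"Bucket": "<bucket>", "Name": "invoices/inv-001.png"}},
    FeatureTypes=["TABLES", "FORMS"],  # extract structure, not just raw text
)
for block in resp["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```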
Kendra
Amazon Kendra is an AI-powered enterprise search service that allows organizations to search across large sets of unstructured data. Capable of incremental learning from past searches.
- AI-Powered Document Search: Kendra enables users to perform intelligent searches within large datasets by understanding the context and meaning behind the query.
- Example Use Case: Enabling employees to search for internal documents like policies, manuals, or FAQs with natural language queries.
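A minimal boto3 sketch of a natural language query against an existing index; the index ID and query are placeholders.
```python
import boto3

kendra = boto3.client("kendra")
resp = kendra.query(IndexId="<index-id>", QueryText="What is our parental leave policy?")
for item in resp["ResultItems"]:
    print(item["Type"], item.get("DocumentTitle", {}).get("Text"))
```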
Amazon Augmented AI (A2I)
A2I (Augmented AI) inserts human review into automated machine learning workflows.
- Human Oversight in ML Predictions: For tasks that require human judgment or where models are not certain, A2I lets human reviewers approve or reject predictions made by automated systems, ensuring more reliable and accurate results.
- Example Use Case: Using human review for approving fraud detection in financial transactions before final confirmation.
Lookout (Anomaly Detection)
- Overview:
- Amazon Lookout is a suite of anomaly detection services (Lookout for Metrics, Lookout for Equipment, Lookout for Vision) covering different data types and applications.
- Purpose: Detects unexpected patterns or outliers in datasets without needing to define specific thresholds or rules.
- Key Terminology (Lookout for Metrics):
- Measures: The numerical values being monitored (e.g., sales revenue, transaction volume).
- Dimensions: The categorical attributes that segment those measures (e.g., product category, geographic location).
- Lookout's machine learning models analyze measures across dimensions to automatically identify anomalies.
Fraud Detector
- Overview:
- Amazon Fraud Detector is a managed service that uses machine learning to detect fraudulent activity across industries (e.g., financial services, e-commerce).
- It analyzes transactional data with ML models to spot fraudulent patterns in real time.
- Feature Importance Insights:
- The service provides insights into feature importance, helping users understand which data points (e.g., user behavior, transaction history) are most relevant for detecting fraud.
- These insights can help improve model accuracy and give users a better understanding of potential fraud risks.
Amazon Q (AI Assistant Trained on Your Internal Data)
- Overview:
- Amazon Q is a generative AI assistant that connects to your organization's internal data to assist with a variety of business tasks and processes.
- It's designed to understand and leverage company-specific data to help teams make informed decisions quickly and efficiently.
- The assistant uses natural language processing (NLP) and other AI technologies to provide insights and support for internal business operations.
- Supports data connectors using fully managed Retrieval-Augmented Generation (RAG)
Distributed Training
Distributed training enables training large-scale models and datasets by splitting workloads across multiple GPUs or even multiple machines. This approach helps overcome the limitations of single-machine training and accelerates model development.
Data Parallelism
- Definition: In data parallelism, the dataset is split into smaller chunks and distributed across multiple GPUs. Each GPU processes a portion of the data and computes gradients independently, after which the results are aggregated to update the model parameters.
- Use Case: Data parallelism is ideal when the model fits into the memory of a single GPU, but the dataset is too large to fit into one device. It helps achieve faster training by processing larger datasets in parallel.
- Example: Training a deep learning model on a massive image dataset, where each GPU processes a different batch of images simultaneously.
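A minimal data-parallel training sketch with PyTorch DistributedDataParallel; the tiny model and random dataset are placeholders, and a launch via `torchrun --nproc_per_node=<num_gpus> train.py` on a single node is assumed.
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())  # single-node assumption

model = torch.nn.Linear(10, 1).cuda()  # placeholder model
ddp_model = DDP(model)                 # gradients are all-reduced automatically
dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
sampler = DistributedSampler(dataset)  # each rank sees a distinct data shard
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle shards each epoch
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(ddp_model(x.cuda()), y.cuda())
        loss.backward()       # DDP averages gradients across ranks here
        opt.step()
```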
Model Parallelism
- Definition: In model parallelism, the model is split across multiple GPUs, with each GPU handling different parts (layers or components) of the model. The forward and backward passes are coordinated between GPUs to compute gradients and update model parameters.
- Use Case: Model parallelism is useful when the model is too large to fit into the memory of a single GPU, so it is divided across multiple GPUs. It is particularly useful for training models like transformers that require large amounts of memory.
- Example: Training a very deep neural network where the weights of each layer exceed the memory capacity of a single GPU.
AllReduce
- Definition: AllReduce is a collective communication operation used in distributed training to aggregate the gradients computed by each GPU and ensure they are synchronized across all devices. This operation is critical for ensuring that each GPU has the same model weights after each training step.
- Use Case: This operation is typically used in data parallelism to share and combine gradient information across GPUs, allowing them to update the model parameters in sync.
- Example: When using multiple GPUs to train a model, each GPU computes gradients on its batch of data. AllReduce ensures that the gradients from all GPUs are averaged and applied uniformly to the model weights.
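A short sketch of that averaging step done manually with PyTorch's AllReduce primitive; it assumes a process group is already initialized (e.g., via torchrun) and the tensor stands in for a real gradient.
```python
import torch
import torch.distributed as dist

grad = torch.randn(4).cuda()                 # this rank's locally computed gradient
dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # sum gradients across all ranks
grad /= dist.get_world_size()                # turn the sum into an average
# every rank now holds the identical averaged gradient
```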
AllGather
- Definition: AllGather is a collective communication operation in which each GPU gathers the tensors (e.g., sharded parameters or gradients) held by every other GPU, so every device ends up with the complete set. SageMaker's distributed training library can offload part of this communication to the CPU to reduce GPU-to-GPU communication overhead.
- Use Case: AllGather helps reduce network congestion and balances the load between CPUs and GPUs during training. It is often used in conjunction with model or data parallelism to improve communication efficiency.
- Example: Each GPU performs computations for different layers of a neural network. AllGather helps combine these computations and share necessary information with other GPUs, ensuring the model is synchronized.
DeepSpeed / Horovod
- DeepSpeed: A deep learning optimization library by Microsoft that focuses on improving the efficiency of large-scale model training. It implements techniques such as ZeRO (Zero Redundancy Optimizer) for memory optimization and mixed precision training for faster computation.
- Horovod: An open-source framework for distributed deep learning that uses AllReduce to synchronize gradients across multiple GPUs. Horovod allows easy scaling of training across multiple nodes and is commonly used with TensorFlow, Keras, and PyTorch.
- Use Case: These frameworks optimize the distributed training process, enabling faster convergence and better memory utilization. They are often used in large-scale machine learning models with a focus on improving both speed and scalability.
- Example: Using DeepSpeed or Horovod to train a large GPT model across several nodes in a cluster.
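A minimal Horovod-with-PyTorch sketch; the model and learning rate are placeholders, and a launch such as `horovodrun -np 4 python train.py` is assumed.
```python
import horovod.torch as hvd
import torch

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Wrap the optimizer so gradient AllReduce happens during step()
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
# Ensure all ranks start from identical weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```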
Sharded Data Parallelism
- Definition: Sharded Data Parallelism combines both data parallelism and model parallelism. It splits the model and the data across multiple GPUs to optimize training. Each GPU holds a part of the model (model parallelism) and works with a subset of the data (data parallelism), and the gradients are synchronized after every update.
- Use Case: This technique is used when models and datasets are both large and cannot fit into a single machine. It helps in efficiently scaling large models while managing the memory constraints.
- Example: Training a large-scale model like BERT or GPT, where the model is divided across GPUs and the data is also partitioned.
EFA (Elastic Fabric Adapter)
- Definition: Elastic Fabric Adapter (EFA) is a high-performance networking interface designed for distributed training on AWS. EFA provides low latency and high throughput communication between EC2 instances, helping scale distributed deep learning workloads.
- Use Case: EFA is beneficial when you need fast, reliable inter-node communication for large-scale distributed training. It enhances the performance of applications that require high network throughput.
- Example: Training a model across several EC2 instances in a cluster, where EFA minimizes communication overhead and accelerates the convergence of the training process.
MiCS (Minimize Communication Scale)
- Definition: MiCS is a technique to minimize the scale of communication during distributed training by reducing the amount of data exchanged between GPUs or nodes. It enables the training of models with trillions of parameters by optimizing how data is communicated across the system.
- Use Case: MiCS allows scaling of machine learning models to the trillion-parameter level by reducing the burden of network communication. This is particularly important for next-generation large-scale models that need to operate at extreme scales.
- Example: Training models at the trillion-parameter scale, such as next-generation large language models (LLMs), without overwhelming the communication infrastructure.
SageMaker Data Input Methods
For many SageMaker built-in algorithms, the ideal input format is protobuf RecordIO.
S3 File Mode
- Definition: S3 File Mode involves copying training data from Amazon S3 into a container before training begins. The data is stored locally within the container, enabling quick access during training.
- Use Case: This method is typically used when the dataset fits comfortably in the local storage of the container, and you prefer to have all the data preloaded for training. It reduces latency as the data is already available locally.
- Example: Copying a dataset of images or text from S3 into the training container and using it for model training without additional streaming or fetching during training.
S3 Fast File Mode
- Definition: S3 Fast File Mode provides an optimized approach for reading data directly from Amazon S3 while still leveraging the benefits of local file system access. It allows fast access to files by reducing latency when reading from S3.
- Use Case: Best suited for scenarios where data resides in S3 and needs to be read frequently during training, but it isn't necessary to copy the entire dataset to the local container. It helps in optimizing read access to large datasets, especially when combined with large-scale training.
- Example: Training a model on a large dataset that doesn't fit into the container's local storage, where real-time access from S3 is required without waiting for the entire dataset to be copied.
Pipe Mode
- Definition: Pipe Mode is a specialized method for streaming data directly from Amazon S3 into the training process in real-time. Instead of copying the entire dataset into a container or fetching large chunks, it streams data as needed during the training process.
- Use Case: Ideal for training with data that is too large to fit into memory or when continuous data streaming is needed. Pipe mode enables real-time ingestion of data without the overhead of data transfer and storage, ensuring efficient handling of large or constantly changing datasets.
- Example: Streaming a massive training dataset (e.g., large, unstructured data like videos or sensor readings) directly from S3 into the training process without staging it locally.
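A minimal SageMaker Python SDK sketch of selecting between these modes; the image URI, role, and bucket are placeholders.
```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
# input_mode can be "File" (copy into the container), "FastFile"
# (stream from S3 with file-system semantics), or "Pipe" (sequential stream)
train_input = TrainingInput(s3_data="s3://<bucket>/train/", input_mode="FastFile")
estimator.fit({"train": train_input})
```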
S3 Express One Zone
- Definition: S3 Express One Zone is a performance-optimized storage solution that stores data in a single Availability Zone (AZ) for faster data access, without the higher cost of multi-AZ replication. It's designed for workloads that can tolerate lower redundancy but require high throughput.
- Use Case: Suitable for workloads requiring high throughput and low-latency data access but with a single-zone fault tolerance. This is often used when high performance is a priority, and the dataset is large but can be replicated manually or in a non-critical environment.
- Example: Storing and accessing large training datasets that don't require the global availability of multi-AZ, but need high-performance data access within one region.
FSx For Lustre
- Definition: Amazon FSx for Lustre is a high-performance distributed storage service, compatible with Lustre, that is designed for workloads with high throughput and low-latency requirements. It is often used in conjunction with data stored in Amazon S3 and integrates well with large-scale training environments.
- Use Case: This method is commonly used for workloads requiring distributed storage and parallel processing, such as scientific computing, high-performance computing (HPC), and large-scale machine learning training.
- Example: A machine learning model that processes high-resolution images or large video files, where the data needs to be processed in parallel across multiple compute instances.
Amazon EFS
- Definition: Amazon Elastic File System (EFS) is a scalable, network-attached file storage service that provides shared access to data from multiple instances. It can be accessed from multiple EC2 instances and supports scalable storage for workloads that require shared access.
- Use Case: Suitable for workloads that require a file system interface, such as web servers or applications that need shared access to data stored on a file system. EFS is ideal for use cases requiring high availability and scalability.
- Example: Training a deep learning model across multiple EC2 instances where each instance needs access to the same dataset concurrently, such as large-scale natural language processing (NLP) models.
Debugger
Saves the internal state of models (tensors, weights, gradients) at periodic intervals during training. The SMDebug client library provides hooks for capturing and accessing this training data.
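A hedged sketch of reading saved tensors back with the smdebug library; the S3 path and tensor name are hypothetical stand-ins for a real Debugger output location.
```python
from smdebug.trials import create_trial

trial = create_trial("s3://<bucket>/debugger-output/")  # placeholder path
print(trial.tensor_names())                  # all tensors captured by the hook
t = trial.tensor("gradients/dense_1_grad")   # hypothetical tensor name
for step in t.steps():
    print(step, t.value(step).mean())        # inspect the tensor at each saved step
```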
SageMaker Training Container Directory Structure
/opt/ml
├── input
│ ├── config
│ │ ├── hyperparameters.json
│ │ └── resourceConfig.json
│ └── data
│ └── channel_name
│ └── input data
├── model
│
├── code
│ └── script files
│
└── output
└── failure