Infrastructure
- Elastic Block Store (EBS):
- Provides persistent block storage for EC2 instances.
- Acts like virtual drives that can be attached to EC2 instances.
- Has provisioned capacity
- Cannot decrease capacity
- Default behaviour: The EBS volume is deleted upon instance termination unless configured otherwise.
- Availability: An EBS volume is bound to a single Availability Zone (AZ) and attaches to one instance at a time.
- Elastic Load Balancer (ELB):
- Distributes incoming traffic across multiple targets (EC2 instances) to ensure scalability and fault tolerance.
- Supports multiple load balancing types: Application Load Balancer (ALB), Network Load Balancer (NLB), and Classic Load Balancer (CLB).
- Auto Scaling Group (ASG):
- Automatically adjusts the number of EC2 instances based on defined scaling policies (e.g., traffic, CPU utilization).
- Ensures that the application has the right compute capacity.
- Elastic Volumes:
- Allows dynamic resizing of EBS volumes (e.g., increasing size, adjusting IOPS).
- Can change volume types (e.g., SSD to HDD) or adjust performance without downtime.
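A minimal boto3 sketch of an Elastic Volumes resize, matching the "cannot decrease capacity" note above; the volume ID and target size are placeholders.

```python
import time

import boto3

ec2 = boto3.client("ec2")

VOLUME_ID = "vol-0123456789abcdef0"  # placeholder volume ID
NEW_SIZE_GIB = 200                   # must be >= the current size

# Elastic Volumes: grow the volume in place (size can never be decreased).
ec2.modify_volume(VolumeId=VOLUME_ID, Size=NEW_SIZE_GIB)

# Poll until the modification leaves the "modifying" state; the file system
# still has to be extended from inside the instance afterwards.
while True:
    mods = ec2.describe_volumes_modifications(VolumeIds=[VOLUME_ID])
    if mods["VolumesModifications"][0]["ModificationState"] in ("optimizing", "completed"):
        break
    time.sleep(10)
```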
1. Amazon EFS (Elastic File System)
- Type: Network file system (NFS-based)
- Performance: Scalable, shared storage with multiple EC2 instances
- Durability: Highly durable, replicated across multiple Availability Zones (AZs)
- Persistence: Persistent storage, remains after instance termination
- Use Cases:
- Shared storage for multiple EC2 instances
- Web server clusters
- Big data and analytics workloads
- Container storage for EKS/ECS
2. Amazon EBS (Elastic Block Store)
- Type: Block storage (attached to a single EC2 instance at a time)
- Performance: High IOPS (SSD-backed) or throughput-optimized (HDD-backed)
- Durability: Persistent, replicated within a single Availability Zone (AZ)
- Persistence: Remains even after the instance is terminated (if not deleted)
- Use Cases:
- Databases (MySQL, PostgreSQL)
- Application servers
- Boot volumes for EC2 instances
- Any workload requiring low-latency, high-performance block storage
3. Instance Store (Ephemeral Storage)
- Type: Directly attached SSD/HDD storage (physical disks on the instance)
- Performance: Extremely low latency, high throughput
- Durability: Non-persistent (data is lost when the instance is stopped/terminated)
- Persistence: Data is wiped when the instance stops
- Use Cases:
- Temporary storage (caching, buffers, scratch space)
- High-speed ephemeral workloads (e.g., big data processing)
- Applications that can regenerate data if lost
Comparison Table
Feature | EFS (Elastic File System) | EBS (Elastic Block Store) | Instance Store |
---|---|---|---|
Type | Network file system | Block storage | Directly attached storage |
Performance | Scalable, shared | High-performance SSD/HDD | Extremely fast |
Persistence | Persistent | Persistent | Non-persistent |
Availability | Multi-AZ | Single-AZ | Single instance |
Use Case | Shared file storage | Databases, app storage | Caching, temp data |
Latency | Higher (network-based) | Low | Lowest (local disk) |
Cost | Pay per GB stored (no pre-provisioning) | Pay for provisioned GB (and IOPS) | Included in the instance price |
Data Stores
- Data Lakes - Store raw data in its native format, including structured, semi-structured, and unstructured data
- Data Warehouse - Store cleaned, structured data
- Data Lakehouse - hybrid, best of both worlds approach
- Data Mesh - Domain based data management
Data Storage Classes
1. Amazon S3 Standard - General Purpose
- Use Case in ML & SageMaker:
- Best suited for frequently accessed datasets such as training datasets, feature stores, and real-time model inference files.
- Ideal for datasets used in active ML model training and preprocessing.
- Key Features:
- Low latency and high throughput for rapid data access.
- Redundant storage across multiple Availability Zones.
- No retrieval fees or minimum storage duration.
2. Amazon S3 Standard-Infrequent Access (IA)
- Use Case in ML & SageMaker:
- Suitable for datasets used less frequently but still needed for training/testing.
- Ideal for historical training data that is occasionally used for model retraining.
- Key Features:
- Lower cost than S3 Standard but charges for data retrieval.
- Designed for long-term storage with occasional access.
- Redundant across multiple Availability Zones.
3. Amazon S3 One Zone-Infrequent Access (IA)
- Use Case in ML & SageMaker:
- Best for storing intermediate or temporary ML datasets that are not mission-critical.
- Useful for cost-effective storage of processed features or past model versions.
- Key Features:
- 20% cheaper than S3 Standard-IA but stored in a single Availability Zone (AZ).
- No multi-AZ redundancy; data is lost if the AZ is destroyed.
- Suitable for backup copies of processed ML datasets that can be regenerated if lost.
4. Amazon S3 Glacier Instant Retrieval
- Use Case in ML & SageMaker:
- Best for ML model checkpoints, older training datasets, or archived features that require fast retrieval.
- Ideal for storing older trained models that may be reloaded for comparison or rollback.
- Key Features:
- Ultra-low-cost storage with millisecond retrieval.
- Lower availability compared to Standard or IA.
- Suitable for datasets that are rarely used but need quick access.
5. Amazon S3 Glacier Flexible Retrieval
- Use Case in ML & SageMaker:
- Suitable for long-term storage of ML datasets and trained models that are rarely accessed.
- Good for compliance and audit purposes, storing past model predictions or datasets for regulatory needs.
- Key Features:
- Flexible retrieval times: can take minutes to hours.
- Lower cost than Glacier Instant Retrieval.
- Suitable for ML archives that don’t require instant access.
6. Amazon S3 Glacier Deep Archive
- Use Case in ML & SageMaker:
- Lowest-cost option for long-term storage of ML training datasets that are unlikely to be used but need retention.
- Good for historical model versions and old experimental data.
- Key Features:
- Lowest storage cost, but retrieval can take up to 12 hours.
- Designed for data retention policies where ML models or datasets need to be stored for years.
7. Amazon S3 Intelligent Tiering
- Use Case in ML & SageMaker:
- Best for unpredictable ML workloads where access patterns change over time.
- Ideal for datasets used in active ML training but may become inactive after deployment.
- Key Features:
- Automatically moves data between Standard, IA, and Glacier tiers based on access patterns.
- Reduces storage costs without impacting performance.
- Suitable for dynamic ML datasets that may go from active to infrequent use.
Choosing The Right S3 Storage Class
- Frequent Access tier (automatic): default tier
- Infrequent Access tier (automatic): objects not accessed for 30 days
- Archive Instant Access tier (automatic): objects not accessed for 90 days
- Archive Access tier (optional): configurable from 90 days up to 730 days
- Deep Archive Access tier (optional): configurable from 180 days up to 730 days
Use Case | Recommended S3 Storage Class |
---|---|
Active ML training data, real-time feature stores | S3 Standard |
Historical datasets for occasional retraining | S3 Standard-IA |
Intermediate feature engineering results | S3 One Zone-IA |
Archived model checkpoints, past ML versions (instant access needed) | S3 Glacier Instant Retrieval |
Long-term ML archives with occasional retrieval | S3 Glacier Flexible Retrieval |
Rarely used ML datasets for compliance (long-term) | S3 Glacier Deep Archive |
Unpredictable ML workloads with fluctuating access patterns | S3 Intelligent Tiering |
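To make the table concrete, here is a hedged boto3 sketch that uploads a dataset straight into Standard-IA and adds a lifecycle rule that ages checkpoints through Glacier classes; the bucket name, keys, and day thresholds are made-up examples.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-ml-datasets"  # placeholder bucket name

# Upload historical training data directly into an infrequent-access class.
with open("train-2023.parquet", "rb") as f:
    s3.put_object(
        Bucket=BUCKET,
        Key="historical/train-2023.parquet",
        Body=f,
        StorageClass="STANDARD_IA",
    )

# Lifecycle rule: age old model checkpoints through cheaper classes over time.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-checkpoints",
            "Status": "Enabled",
            "Filter": {"Prefix": "checkpoints/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }]
    },
)
```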
File Systems
- Elastic File System (EFS):
- Network file system (NFS) providing scalable, shared file storage for EC2 instances.
- Supports multiple attachments across all Availability Zones within a region for simultaneous access by many instances.
- FSx (Amazon FSx):
- Supports both scratch and persistent file system options
- High-performance file systems for specialized use cases:
- FSx for Windows: Supports SMB protocol and Windows NTFS file system.
- FSx for Lustre: Designed for High-Performance Computing (HPC), suitable for large-scale parallel processing, scientific computing, and machine learning. Seamless integration with S3. Can be used on premises.
- FSx for NetApp ONTAP: Offers data management features like point-in-time cloning and data deduplication.
- FSx for OpenZFS: Supports point-in-time cloning and provides ZFS features like snapshotting and compression.
- S3 (Amazon Simple Storage Service):
- Upload and File Size Limits:
- Max upload size: 5 GB for a single object in one PUT request.
- Max file size: 5 TB for individual objects in S3 storage.
- Buckets must have a globally unique name
- Buckets are defined at the region level
- Keys are composed of prefix + object name
- Cross-Region Replication (CRR) and Same-Region Replication (SRR) require versioning turned on
- Can replicate existing objects using S3 Batch Replication
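Since CRR/SRR require versioning, here is a hedged boto3 sketch that enables versioning on both buckets and then attaches a replication rule; the bucket names and the IAM role ARN are placeholders.

```python
import boto3

s3 = boto3.client("s3")
SRC, DST = "example-src-bucket", "example-dst-bucket"  # placeholder names

# CRR/SRR both require versioning on the source and destination buckets.
for bucket in (SRC, DST):
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate everything under the "raw/" prefix to the destination bucket.
s3.put_bucket_replication(
    Bucket=SRC,
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [{
            "ID": "replicate-raw-data",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": "raw/"},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": f"arn:aws:s3:::{DST}"},
        }],
    },
)
```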
Data Processing
Formats
- Avro - Binary format that stores both data and schema for serialization
- Parquet - Columnar storage format optimized for analytics
Three V's of Data
- Volume - How much data
- Velocity - Speed new data is generated
- Variety - Refers to different structures, types and sources of data
EMR (Elastic MapReduce)
- Overview:
- Amazon EMR is a cloud-native big data platform that simplifies running big data frameworks like Apache Hadoop, Apache Spark, and Presto on AWS. It offers both managed and serverless cluster options, leveraging EC2 nodes for scalable computing. While Hadoop and its HDFS (Hadoop Distributed File System) provide reliable distributed storage for large datasets, Spark enhances performance with in-memory processing, making it faster than MapReduce for many tasks. YARN (Yet Another Resource Negotiator) manages resources across the cluster, scheduling jobs and allocating resources efficiently. Spark can be used alongside MapReduce in EMR to reduce latency and improve performance. The combination of these technologies allows for flexible, scalable, and fault-tolerant data processing, making EMR a powerful tool for handling massive datasets.
- EMR Cluster: An EMR cluster is a group of EC2 instances that work together to process large amounts of data in parallel. It consists of a master node, core nodes (which host HDFS data and run tasks), and task nodes (which only run tasks and are a good fit for Spot Instances).
- Managed Capacity vs. Serverless:
- Managed Capacity refers to predefined cluster configurations where you specify the number and type of EC2 instances for your cluster.
- Serverless EMR allows users to run workloads without managing the underlying infrastructure. AWS handles scaling, provisioning, and management of resources automatically.
- Use of Hadoop:
- Hadoop is a framework that enables the distributed storage and processing of large data sets. EMR supports Hadoop Distributed File System (HDFS) and MapReduce as part of its core components, enabling scalable storage and computation.
- Spark Overhead:
- Spark adds a 10% overhead compared to traditional Hadoop MapReduce jobs, but it is typically faster and more flexible due to its in-memory computing capabilities, making it ideal for iterative algorithms and real-time analytics.
- AWS Integrations:
- Amazon EC2 for the instances that comprise the nodes in the cluster
- Amazon VPC to configure the virtual network in which you launch your clusters
- Amazon S3 to store input and output data
- Amazon CloudWatch to monitor cluster performance and configure alarms
- AWS IAM to configure permissions
- AWS CloudTrail to audit requests made to the service
- AWS Data Pipeline to schedule and start your clusters
- Hardware: An m4.large master node is typically sufficient; use m4.xlarge (or larger) for heavier workloads, and Spot Instances for task nodes. A minimal cluster-launch sketch follows.
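The sketch below launches a small cluster along those lines with boto3; the release label, log bucket, and IAM roles are placeholder defaults.

```python
import boto3

emr = boto3.client("emr")

# Hypothetical cluster: one master, two core nodes, two Spot task nodes.
response = emr.run_job_flow(
    Name="ml-feature-pipeline",
    ReleaseLabel="emr-7.1.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    LogUri="s3://example-emr-logs/",
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m4.large", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m4.xlarge", "InstanceCount": 2},
            {"Name": "tasks", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m4.xlarge", "InstanceCount": 2},
        ],
    },
)
print(response["JobFlowId"])
```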
Hardware
- Trainium & Inferentia:
- Purpose: Specialized hardware designed to accelerate AI/ML workloads on AWS.
- Trainium:
- AWS-designed chip optimized for training machine learning models.
- Focuses on high throughput and low-cost, energy-efficient training.
- Inferentia:
- AWS-designed chip optimized for inference tasks (running machine learning models after they are trained).
- Provides high throughput, low latency, and cost-effective inference for large-scale machine learning applications.
- Environmental Footprint: Both Trainium and Inferentia are built with low environmental impact in mind, ensuring efficient resource use during training and inference.
Balancing
- Unbalanced Data: When dealing with unbalanced datasets, one common technique to address this is oversampling the minority class. This increases the representation of the underrepresented class by duplicating or synthesizing new examples. Another approach is undersampling the majority class, which reduces the number of instances in the dominant class to balance the dataset. Sometimes it is important to remove outliers.
- SMOTE (Synthetic Minority Over-sampling Technique): This technique uses KNN (K-Nearest Neighbors) to generate synthetic data points for the minority class. SMOTE works by selecting a data point from the minority class and generating new points by interpolating between that point and its nearest neighbors. This helps to overcome class imbalance and improves the model's performance on the minority class.
- Variance and Standard Deviation:
- Variance is the average of the squared differences between each data point and the mean. It provides a measure of the data's spread.
- Standard Deviation is the square root of variance and is more interpretable since it’s in the same unit as the original data.
- Quantile Binning: This method divides the data into bins such that each bin contains an equal number of data points. It’s useful when you want to maintain the distribution of data across bins.
- One-Hot Encoding: One-hot encoding is used to convert categorical features into binary format. It creates a binary column for each category (or class) and assigns a "1" to the column corresponding to the category present in the instance, while all other columns are "0". This is especially useful for categorical variables with a small number of possible values.
- Training dataset shuffling could also help
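A minimal SMOTE sketch using imbalanced-learn on synthetic data; the 99/1 class split and parameters are illustrative, not from the notes.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic, heavily imbalanced dataset (~99% majority, ~1% minority).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.99, 0.01], random_state=42
)
print("before:", Counter(y))

# SMOTE interpolates between each minority point and its k nearest neighbors
# to synthesize new minority examples.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes are now balanced
```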
AWS Glue
AWS Glue is a managed ETL (Extract, Transform, Load) service that extracts structure, metadata, and schema from unstructured data. It automatically discovers the structure of your data sources (e.g., CSV files, databases) and generates the appropriate schema for further processing. This simplifies the process of handling and transforming large datasets.
- Runs Spark, Scala, or Python-based ETL jobs
- Includes a Data Catalog for metadata management
- Serverless
- Glue Studio lets you visualize workflow
- Glue DataBrew is a visual data preparation and transformation tool that can handle PII
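For orientation, a skeleton Glue Spark ETL job along these lines (it runs inside Glue, not locally); the database, table, and S3 output path are placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a Glue crawler registered in the Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="events"
)

# Drop rows with null user IDs, then write the result to S3 as Parquet.
cleaned = dyf.filter(lambda row: row["user_id"] is not None)
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-curated-bucket/events/"},
    format="parquet",
)
job.commit()
```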
AWS Batch
AWS Batch is a fully managed service that enables you to run batch processing jobs at any scale. It provides an efficient, serverless solution for running batch jobs without having to manage infrastructure or scale compute resources.
Key Features:
- Runs Batch Jobs as Docker Images:
- AWS Batch allows you to run your batch jobs in Docker containers, providing flexibility to package the applications and their dependencies into a single unit. This simplifies the execution of diverse workloads in a consistent and isolated environment.
- Auto-Provisions Resources:
- AWS Batch automatically provisions the required compute resources, such as EC2 instances or Fargate compute, based on the resource requirements of the batch jobs. This eliminates the need for manual resource allocation and scaling, allowing jobs to run efficiently without over-provisioning or under-utilization.
- Fully Serverless:
- AWS Batch operates in a serverless manner, meaning you don’t need to manage or configure the underlying infrastructure. It automatically scales resources based on job demands and frees you from the complexity of managing servers or clusters.
- Suitable for Cleanup Tasks:
- AWS Batch is often used for periodic or resource-intensive tasks such as data processing, ETL jobs, log analysis, or cleanup tasks. These tasks can be scheduled and executed without manual intervention, making it ideal for workloads that are not time-sensitive but require significant compute power.
Use Cases:
- Data Processing: Running jobs that process large datasets, such as scientific simulations, data transformations, or image/video processing.
- Cleanup Tasks: Running periodic jobs like log file cleanup, deleting outdated files, or backing up data.
- ETL Workflows: Running Extract, Transform, Load (ETL) processes to move and manipulate large datasets.
Benefits:
- Cost-Effective: By automatically provisioning and deprovisioning resources, AWS Batch ensures you only pay for the compute resources you actually use.
- Scalable: It can handle jobs ranging from a few to thousands of parallel tasks, scaling up or down as needed.
- Easy to Use: With built-in job scheduling, resource management, and execution, AWS Batch reduces the need for manual intervention and infrastructure management.
Athena
- Overview: Amazon Athena is a serverless query service that allows users to run SQL queries directly on data stored in Amazon S3 without the need to load it into a database. It’s designed to make it easy to analyze large datasets quickly and at scale. Uses Presto under the hood.
- Serverless.
- Data Formats: Athena works best with columnar data formats like Parquet or ORC, which are optimized for read-heavy analytics and offer better performance compared to row-based formats (e.g., CSV or JSON). These formats enable Athena to only read the necessary columns, reducing I/O and improving query performance.
- Query Access: Query access is controlled through workgroups, which define the permissions and resources available to different sets of users or queries. Athena integrates with CloudWatch for logging query execution details, and IAM for managing access permissions.
- Optimization: For Apache Iceberg tables, Athena can compact many small files into larger ones with the OPTIMIZE ... REWRITE DATA USING BIN_PACK statement, which improves query performance when working with large datasets.
- Best Practices: Athena performs best when working with a small number of files and partitions in S3. If there are too many small files or excessive partitioning, query performance can degrade. To maximize performance, it’s recommended to organize the data into fewer, larger files and optimize partitioning strategies.
- Can support ACID transactions with Apache Iceberg.
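A hedged boto3 sketch of running a query against S3 data with Athena and paging the results; the database, table, and result bucket are placeholders.

```python
import time

import boto3

athena = boto3.client("athena")

query = "SELECT label, COUNT(*) FROM ml_db.training_events GROUP BY label"
qid = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "ml_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    WorkGroup="primary",
)["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```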
Streaming
Kinesis Data Stream
- Real-time data streaming service.
- Retention up to a year.
- Data can't be removed until it expires
- Limits: Hard 1 MB per record limit.
- Supports KMS at rest encryption
- Provisioned mode allows choosing a number of shards whereas On-Demand mode auto scales
- Structure:
- A data stream consists of shards (unit of throughput).
- Shards determine the stream's capacity.
- Data Firehose:
- Used to send data to destinations like S3, Redshift, Elasticsearch, etc.
- Can invoke Lambda for real-time transformations.
- Near real-time processing with optional buffering.
- Fully managed service
- Can support conversions to Avro, Parquet, etc.
- Client API (see the producer/consumer sketch at the end of this section):
- PutRecords to write data.
- GetRecords to read data (empty payloads are expected due to buffering).
- Flink on Kinesis:
- Used for real-time streaming ETL and metric generation.
- SQL Anomaly Detection:
- Uses Random Cut Forest (RCF) algorithm for anomaly detection in data streams.
- Data Analytics:
- Allows for post-processing of data
- Can be used for streaming ETL, metric generation or responsive analytics
- Fully managed, serverless
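The producer/consumer sketch referenced above, using boto3 against a placeholder stream; it illustrates the PutRecord / GetRecords flow (empty batches are normal when polling).

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis")
STREAM = "example-clickstream"  # placeholder stream name

# Producer: each record must stay under the 1 MB limit; the partition key
# determines which shard receives the record.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"user_id": "u-123", "event": "click"}).encode(),
    PartitionKey="u-123",
)

# Consumer: read from the start of the first shard and keep following
# NextShardIterator; empty batches are expected.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

for _ in range(5):
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["Data"])
    iterator = batch["NextShardIterator"]
    time.sleep(1)
```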
Amazon MSK - Kafka
- Open-source distributed event streaming platform.
- Fully managed Kafka clusters on AWS.
- MSK Serverless option available
- Supports KMS at rest encryption
- Amazon MSK (Managed Streaming for Apache Kafka):
- Security: TLS encryption in-flight between brokers.
- Data Limits: 1 MB per message by default, configurable up to 10 MB.
- Kafka topics: Organized into partitions (only new partitions can be added).
- Encryption: Supports plaintext or TLS encryption for payloads.
- Network Security:
- Controlled via client security groups.
- Authentication & Authorization:
- Defines who can read/write to a topic.
- IAM does not manage Kafka ACLs (Access Control Lists).
- MSK Connect:
- Fully managed Kafka Connect for integrating with external data sources.
Feature Engineering
Feature engineering in machine learning often involves reducing the complexity of data through techniques like Principal Component Analysis (PCA) and K-means clustering. PCA reduces the number of features (dimensions) in a dataset by projecting the data onto a smaller set of orthogonal dimensions that capture the most significant variance. K-means clustering groups data points into clusters, reducing dimensionality by summarizing the data with the cluster centers. The curse of dimensionality refers to the fact that too many features lead to sparse data.
For natural language processing (NLP), TF-IDF (Term Frequency-Inverse Document Frequency) is a key method for evaluating the importance of a word in a document relative to a corpus. A simple measure of relevancy is TF / DF; in practice the logarithm of the IDF is used because term frequencies are distributed roughly exponentially, and n-grams are typically computed over actual words. TF-IDF is computed as TF × IDF (see the sketch after this list), where:
- Term Frequency (TF): The frequency of a term in a specific document.
- Document Frequency (DF): The fraction of documents in the corpus that contain the term.
- Inverse Document Frequency (IDF): A measure of how rare or common a term is across the entire corpus, calculated as the logarithm of the inverse of the fraction of documents containing the term.
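The TF-IDF sketch referenced above, using scikit-learn's TfidfVectorizer on three made-up documents with unigrams and bigrams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# ngram_range=(1, 2) builds unigrams and bigrams over actual words, as noted above.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(docs)

# Terms that appear in most documents (e.g. "the") get a low IDF and thus a low weight.
for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{term:15s} idf={idf:.3f}")
```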
Imputing
When dealing with missing data, both mean imputation and dropping missing values are generally not ideal approaches. Mean imputation can introduce bias, as it oversimplifies the data and may distort relationships between features. Dropping missing values can lead to the loss of valuable data, especially in small datasets, and may reduce the model's ability to generalize.
Better alternatives for imputing missing data include:
- K-Nearest Neighbors (KNN): This method assumes that missing values can be predicted by looking at similar data points. It works well with numerical data, where missing values are imputed based on the values of the nearest neighbors in feature space.
- Deep Learning: Deep learning models, particularly autoencoders, can learn to fill in missing values by understanding the underlying patterns in the data. These models are useful for complex datasets with non-linear relationships.
- Regression Imputation: This technique models the missing feature as a function of other features using regression models. It provides more accurate imputations compared to simple statistical methods like mean imputation.
- MICE (Multiple Imputation by Chained Equations): MICE creates multiple imputations for missing data points, accounting for uncertainty and variability. It iteratively models each feature with missing values as a function of other features and improves over time.
- Getting More Data: If feasible, gathering more data can help fill gaps in missing values and ensure the model has enough information for accurate training.
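A small sketch of the KNN and MICE-style options above, using scikit-learn's KNNImputer and IterativeImputer on a toy matrix (the values are made up).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [np.nan, 10.0, 12.0],
])

# KNN: fill each gap from the nearest rows in feature space.
print(KNNImputer(n_neighbors=2).fit_transform(X))

# MICE-style: model each feature with missing values as a function of the others,
# iterating until the imputations stabilize.
print(IterativeImputer(max_iter=10, random_state=0).fit_transform(X))
```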
Infrastructure as Code (IaC)
AWS CloudFormation
- Declarative infrastructure deployment
- Configuration as code (JSON or YAML)
AWS Infrastructure Composer
- Visual canvas for CloudFormation
- Simplifies infrastructure as code
AWS Cloud Development Kit (CDK)
- Define cloud infrastructure using familiar programming languages
- Converts code into CloudFormation templates
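A minimal CDK app in Python along these lines; running `cdk synth` against it emits the equivalent CloudFormation template. The stack and bucket names are placeholders.

```python
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct


class MlDataStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Logical bucket for training data; CDK generates the physical name.
        s3.Bucket(self, "TrainingDataBucket", versioned=True)


app = cdk.App()
MlDataStack(app, "MlDataStack")
app.synth()
```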
CI/CD & Deployment
Deployment
- Deployment Safeguards:
- Blue/Green: All-at-once, Canary, Linear.
- Shadow tests: Compare performance before deploying.
- SageMaker Neo:
- Train once, run anywhere.
- Compiler and runtime.
- Deploys to AWS IoT Greengrass for edge devices.
- Managed Spot Training: Waits for Spot instances to become available.
AWS CodeDeploy
- Automates application deployment
- Supports on-premise servers and EC2 instances
- Requires installing the CodeDeploy agent
AWS CodeBuild
- Continuous integration & build service
AWS CodePipeline
- Automates build and deployment workflows
- Pushes changes from commit to deployment
Event-Driven Architecture
AWS EventBridge
- Schedules CRON jobs
- Reacts to event patterns (e.g., IAM user sign-ins)
AWS Step Functions
- Visual workflow orchestration tool
- Provides advanced error handling and retry mechanisms
- Tracks workflow execution history
- Uses state machines with the following states:
- Task: Executes a function
- Choice: Implements conditional logic
- Wait: Adds delay
- Parallel: Runs branches concurrently
- Map: Iterates over datasets
- Pass, Succeed, Fail: Control flow
Apache Airflow
- Batch-oriented workflow tool
- Defined using Python
- Uses Directed Acyclic Graphs (DAGs)
- Managed by Amazon MWAA (Managed Workflows for Apache Airflow)
Data Lake & Governance
AWS Lake Formation
- Simplifies secure data lake setup
- Built on AWS Glue
- Manages data transformation and access control
- Supports:
- Cross-account permissions
- IAM-based security
- Column, row, and cell-level security
- Governed Tables with ACID transactions
Security & Identity
Encryption
SSE-KMS (Server-Side Encryption with AWS Key Management Service)
- Overview: SSE-KMS is a server-side encryption method that uses AWS KMS (Key Management Service) to manage encryption keys. With SSE-KMS, data is encrypted at rest before it is stored in services like Amazon S3, EBS, and Redshift.
- Throttling: SSE-KMS introduces throttling as KMS has a default limit on the number of requests per second (TPS). This can be an issue in high-performance environments, and customers may need to request a limit increase if their workload requires more requests.
- Key Management: KMS manages the encryption keys, and customers can create, rotate, and control access to these keys.
- Use Cases: SSE-KMS is often used when customers need fine-grained control over key management and require auditability and compliance.
SSE-C (Server-Side Encryption with Customer-Provided Keys)
- Overview: SSE-C allows customers to use their own encryption keys to encrypt objects in Amazon S3. The encryption key is provided by the customer via the request headers.
- Encryption Key Handling: In SSE-C, the customer must manage the encryption keys. Each time the customer uploads an object to S3, the encryption key is passed along in the request header. Similarly, when downloading the object, the customer must pass the encryption key in the request header to decrypt the data.
- Security Considerations: SSE-C allows for greater control over encryption keys, but it also requires careful management of these keys. If the key is lost, access to the encrypted data will be lost as well.
- Use Cases: SSE-C is often used when customers want to retain full control over their encryption keys and are comfortable managing them securely.
CSE (Client-Side Encryption)
- Overview: CSE is a method where the client fully manages the encryption process. The client encrypts the data before sending it to an AWS service such as S3. After the data is uploaded, it is stored in its encrypted form.
- Key Management: The client manages the encryption keys and ensures the data is encrypted before it reaches AWS. The key is not passed to AWS; it remains within the control of the client.
- Benefits: CSE gives clients complete control over the encryption and decryption process, including the ability to choose custom algorithms and manage the lifecycle of keys.
- Use Cases: This encryption method is preferred when clients need complete control over the encryption process and key management, especially in highly regulated environments.
- HTTPS must be used
SSE-S3 (Server-Side Encryption with Amazon S3-Managed Keys)
- Overview: Encryption keys are owned and managed by Amazon S3 (AES-256). Enabled by default for new objects; no extra headers are required. A combined put_object sketch covering SSE-KMS, SSE-C, and SSE-S3 follows.
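A hedged boto3 sketch of the server-side options above: one upload with SSE-KMS, one with SSE-C (the key supplied on every request over HTTPS), and one relying on the SSE-S3 default. The bucket name, object keys, and KMS key ARN are placeholders.

```python
import os

import boto3

s3 = boto3.client("s3")
BUCKET = "example-secure-bucket"  # placeholder bucket name

# SSE-KMS: S3 encrypts the object with a customer-managed KMS key (placeholder ARN).
s3.put_object(
    Bucket=BUCKET,
    Key="kms/model.tar.gz",
    Body=b"model bytes",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555",
)

# SSE-C: the caller supplies the key on both upload and download, over HTTPS.
customer_key = os.urandom(32)
s3.put_object(
    Bucket=BUCKET,
    Key="ssec/features.parquet",
    Body=b"feature bytes",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,
)
obj = s3.get_object(
    Bucket=BUCKET,
    Key="ssec/features.parquet",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,
)

# SSE-S3 (default): omit the headers and S3 encrypts with keys it manages.
s3.put_object(Bucket=BUCKET, Key="default/raw.csv", Body=b"a,b,c")
```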
Encryption in Transit: SSL/TLS
- Overview: SSL/TLS (Secure Sockets Layer / Transport Layer Security) is used to encrypt data in transit between the client and AWS services, ensuring data integrity and confidentiality while being transmitted over the network.
- Purpose: SSL/TLS protects data during transit to prevent interception, eavesdropping, and tampering by unauthorized parties.
- Use Cases: It is used for encrypting data sent between web servers and browsers, APIs, and other communication channels. For example, HTTPS (which uses TLS) is used for secure communication over the web.
AWS Key Management Service (KMS)
- Symmetric Keys: Single key for encryption & decryption
- Asymmetric Keys: Public-private key pair
- Customer-managed keys ($1/month)
- Supports automatic key rotation (region-specific)
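A short boto3 sketch of a symmetric KMS key in use; the key alias is a placeholder.

```python
import boto3

kms = boto3.client("kms")

KEY_ID = "alias/ml-artifacts"  # placeholder alias for a symmetric customer-managed key

# Symmetric keys: the same KMS key encrypts and decrypts; KMS never exposes
# the key material, only the ciphertext.
ciphertext = kms.encrypt(KeyId=KEY_ID, Plaintext=b"db-password")["CiphertextBlob"]

# Decrypt: KMS infers the symmetric key from metadata embedded in the blob.
plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
assert plaintext == b"db-password"
```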
Bucket Policy and Access Control
- Bucket Policy Evaluation: A bucket policy in Amazon S3 defines access permissions for objects within a bucket. It is evaluated when a request is made to access an S3 object, and before any encryption headers (e.g., SSE-S3, SSE-KMS, SSE-C) are processed. This means that the bucket policy determines whether the requester is authorized to access the object before it can even consider the encryption settings for the object.
- Consideration: Since the bucket policy is evaluated before the headers, a user may be denied access to an object if they do not have the correct permissions, regardless of the encryption method being used. For example, even if a user has the correct decryption key for SSE-KMS, they will still be denied access if the bucket policy does not allow them to access the object.
IAM (Identity and Access Management)
- Global AWS service
- Root user created by default
- Key elements:
- Policies: Define permissions
- Roles: Assignable identities for EC2, Lambda, etc.
- Conditions: Restrict access based on specific rules
IAM Policy Structure
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ExampleStatement",
"Effect": "Allow",
"Principal": "*",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::example-bucket/*"
}
]
}
Amazon ECS – IAM Roles for ECS
- EC2 Instance Profile (EC2 Launch Type only):
- Used by the ECS agent.
- Makes API calls to the ECS service.
- Sends container logs to CloudWatch Logs.
- Pulls Docker images from Amazon ECR.
- References sensitive data from Secrets Manager or SSM Parameter Store.
- ECS Task Role:
- Allows each task to have a specific IAM role.
- Enables different roles for different ECS services.
- Task Role is defined in the task definition.
AWS Macie
- Machine learning-based detection of PII in S3 data
AWS WAF (Web Application Firewall)
- Protects against Layer 7 (HTTP) attacks
- Deploys on:
- Load balancers
- API Gateway
- CloudFront
- AppSync GraphQL API
- Cognito User Pools
AWS Shield
- DDoS protection service
Networking
NAT
- NAT Gateway: AWS-managed
- NAT Instance: Self-managed
Network ACL (NACL)
- Stateless security rules at the subnet level
Virtual Private Cloud (VPC)
- Private cloud environment within AWS
- Components:
- Subnet: Network partition tied to an AZ
- Internet Gateway: Provides internet access
- NAT Gateway/Instance: Internet access for private subnets
- Security Groups: Stateful, EC2/ENI-level rules
- VPC Peering: Connects VPCs (non-transitive)
- VPC Endpoints: Private AWS service access within a VPC
- VPC: Virtual Private Cloud
- NACL: Stateless; subnet-level rules for inbound and outbound traffic
- Security Groups: Stateful; operate at the EC2 instance or ENI level
- VPC Peering: Connects two VPCs with non-overlapping IP ranges; non-transitive
- VPC Endpoints: Provide private access to AWS services from within a VPC
- VPC Flow Logs: Capture network traffic logs
- Site-to-Site VPN: VPN over the public internet between an on-premises data center and AWS
- Direct Connect: Direct private connection to AWS
- PrivateLink: Privately exposes a service in one VPC to consumers in other VPCs or accounts
SageMaker + VPC:
- Training jobs run in a Virtual Private Cloud (VPC).
- You can use a private VPC for additional security.
- S3 VPC endpoints must be set up.
- Custom endpoint policies and S3 bucket policies enhance security.
- Notebooks are Internet-enabled by default, which can be a security risk.
- If disabled, your VPC needs an interface endpoint (PrivateLink) or NAT Gateway with outbound connections for training and hosting to work.
- Training and Inference Containers are Internet-enabled by default.
- Network isolation is an option but prevents S3 access.
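A hedged boto3 sketch of a training job pinned to a private VPC, illustrating VpcConfig and the EnableNetworkIsolation flag; every ARN, image URI, subnet, and security group below is a placeholder.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="xgb-in-private-vpc",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest",
        "TrainingInputMode": "File",
    },
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-ml-datasets/train/",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://example-ml-artifacts/output/"},
    ResourceConfig={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 50},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
    # Run inside the private VPC; S3 traffic should go through the VPC's S3 endpoint.
    VpcConfig={
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"],
    },
    # Setting this to True would block all outbound calls (including S3) from
    # the training container, as noted above.
    EnableNetworkIsolation=False,
)
```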
AWS PrivateLink
- Privately exposes a service in one VPC to thousands of other VPCs and accounts (including AWS Marketplace/SaaS offerings) without traversing the public internet
Monitoring & Auditing
AWS CloudWatch
- Performance metrics
- Event alerting
- Log aggregation
AWS X-Ray
- Tracing and debugging for distributed applications
- Supports:
- Java, Python, Node.js, .NET
- IAM & KMS integration
AWS CloudTrail
- Logs API activity for compliance & governance
- CloudTrail Insights: Detects unusual activity
- Stores logs for 90 days by default
AWS Config
- Tracks AWS resource configurations
- Evaluates compliance against rules
- Does not enforce compliance
CloudWatch vs CloudTrail vs Config
- CloudWatch: Performance monitoring (metrics, CPU, network, etc.) & dashboards; events & alerting; log aggregation & analysis
- CloudTrail: Records API calls made within your account by everyone; can define trails for specific resources; global service
- Config: Records configuration changes; evaluates resources against compliance rules; provides a timeline of changes and compliance
Cost Management
AWS Budget
- Sends alarms for:
- Cost & usage
- Reservation & Savings Plans
AWS Cost Explorer
- Granular cost and usage insights
- Forecasts up to 12 months
AWS Trusted Advisor
- Provides insights on:
- Cost optimization
- Performance
- Security
- Fault tolerance
- Service limits
- Operational excellence
Overview
Amazon Bedrock
- Model evaluation tools
Amazon SageMaker
- End-to-end ML platform
- Clarify: Bias detection & model explainability
- Model Monitor: Alerts for inaccurate responses
- Augmented AI: Human-in-the-loop ML workflows
ML Lifecycle
1. Data Collection (Label, Ingest, Aggregate)
2. Preprocessing (Clean, Partition, Scale, Balance, Augment)
3. Feature Engineering (Selection, Transformation, Extraction)
4. Training & Tuning
- Hyperparameter tuning
- Validation metrics
- Model artifacts
5. Deployment & Inference
- Real-time monitoring
- CloudWatch & Model Monitor integration
6. Optimization
- Reduce training costs
- Optimize inference latency
SageMaker Components
- Experiments: AutoML & hyperparameter tuning
- Pipelines: ML workflows
- Lineage Tracking: Model versioning
- Feature Store: Centralized feature management
- Jumpstart: Prebuilt models
Data Preprocessing Tools
- Glue DataBrew: Data encryption & obfuscation
- Athena/Redshift: Query processing
- Wrangler: Interactive data analysis
- EMR: Big data processing
Compute Services
- ECS: Container platform.
- EKS: Kubernetes.
- Fargate: Serverless.
- ECR: Container registry.
- EKS Storage:
- Attach data volume to an EKS cluster.
- Requires StorageClass manifest.
- Uses a CSI-compliant driver.