AWS Big Data Blog

Amazon Redshift announces history mode for zero-ETL integrations to simplify historical data tracking and analysis

This post will explore brief history of zero-ETL, its importance for customers, and introduce an exciting new feature: history mode for Amazon Aurora PostgreSQL-Compatible Edition, Amazon Aurora MySQL-Compatible Edition, Amazon Relational Database Service (Amazon RDS) for MySQL, and Amazon DynamoDB zero-ETL integration with Amazon Redshift.

Streamline AWS WAF log analysis with Apache Iceberg and Amazon Data Firehose

In this post, we demonstrate how to build a scalable AWS WAF log analysis solution using Firehose and Apache Iceberg. Firehose simplifies the entire process—from log ingestion to storage—by allowing you to configure a delivery stream that delivers AWS WAF logs directly to Apache Iceberg tables in Amazon S3. The solution requires no infrastructure setup and you pay only for the data you process.

Migrate from Standard brokers to Express brokers in Amazon MSK using Amazon MSK Replicator

Creating a new cluster with Express brokers is straightforward, as described in Amazon MSK Express brokers. However, if you have an existing MSK cluster, you need to migrate to a new Express based cluster. In this post, we discuss how you should plan and perform the migration to Express brokers for your existing MSK workloads on Standard brokers. Express brokers offer a different user experience and a different shared responsibility boundary, so using them on an existing cluster is not possible. However, you can use Amazon MSK Replicator to copy all data and metadata from your existing MSK cluster to a new cluster comprising of Express brokers.

Foundational blocks of Amazon SageMaker Unified Studio: An admin’s guide to implement unified access to all your data, analytics, and AI

In this post, we discuss the foundational building blocks of SageMaker Unified Studio and how, by abstracting complex technical implementations behind user-friendly interfaces, organizations can maintain standardized governance while enabling efficient resource management across business units. This approach provides consistency in infrastructure deployment while providing the flexibility needed for diverse business requirements.

Amazon Redshift Serverless adds higher base capacity of up to 1024 RPUs

In this post, we explore the new higher base capacity of 1024 RPUs in Redshift Serverless, which doubles the previous maximum of 512 RPUs. This enhancement empowers you to get high performance for your workload containing highly complex queries and write-intensive workloads, with concurrent data ingestion and transformation tasks that require high throughput and low latency with Redshift Serverless.

Use DeepSeek with Amazon OpenSearch Service vector database and Amazon SageMaker

OpenSearch Service provides rich capabilities for RAG use cases, as well as vector embedding-powered semantic search. You can use the flexible connector framework and search flow pipelines in OpenSearch to connect to models hosted by DeepSeek, Cohere, and OpenAI, as well as models hosted on Amazon Bedrock and SageMaker. In this post, we build a connection to DeepSeek’s text generation model, supporting a RAG workflow to generate text responses to user queries.

Handle errors in Apache Flink applications on AWS

This post discusses strategies for handling errors in Apache Flink applications. However, the general principles discussed here apply to stream processing applications at large.

How Open Universities Australia modernized their data platform and significantly reduced their ETL costs with AWS Cloud Development Kit and AWS Step Functions

At Open Universities Australia (OUA), we empower students to explore a vast array of degrees from renowned Australian universities, all delivered through online learning. In this post, we show you how we used AWS services to replace our existing third-party ETL tool, improving the team’s productivity and producing a significant reduction in our ETL operational costs.

Hybrid big data analytics with Amazon EMR on AWS Outposts

In this post, we dive into the transformative features of EMR on Outposts, showcasing its flexibility as a native hybrid data analytics service that allows seamless data access and processing both on premises and in the cloud.

How MuleSoft achieved cloud excellence through an event-driven Amazon Redshift lakehouse architecture

In our previous thought leadership blog post Why a Cloud Operating Model we defined a COE Framework and showed why MuleSoft implemented it and the benefits they received from it. In this post, we’ll dive into the technical implementation describing how MuleSoft used Amazon EventBridge, Amazon Redshift, Amazon Redshift Spectrum, Amazon S3, & AWS Glue to implement it.