Last Week in AWS: S3 and the Evolution of Storage with Andy Warfield

When S3 Tables came out in December 2024, I was a little confused about what exactly the value proposition was over simply hosting your data lake-style files in S3 yourself. Andy Warfield answers that question here:

What compaction is doing is, like I said, you’ve got one giant Parquet file, possibly as an initial table. Over time, you’re adding additional Parquet files. Each one of those adds a bunch of metadata files, fragmenting your data like crazy. The simple task of compaction is to take all of those changes, throw away the stuff that was deleted, keep the stuff that’s alive, and fold it into a single or a small number of very large files. This allows you to get back to doing large reads of just the columns of the database that you care about, maximizing the utilization of your request path. That gets you huge performance and the most usable bytes read per bytes used.

The challenge is that the way the customer workload updates the data in the table completely changes the complexity of compaction from workload to workload. A read-only database, like a table that has never changed, obviously doesn’t need compaction—it just sits as it is. The one exception is that you might decide to restructure that table in the background over time if you notice that queries are accessing it in a way that the table is not well laid out for.
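To make the compaction idea concrete, here is a minimal sketch of what that fold-up looks like outside of S3 Tables, assuming pyarrow and a hypothetical local table/ directory full of small Parquet fragments. S3 Tables performs this maintenance for you; this only illustrates the mechanics.

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Treat every small Parquet fragment under table/ as one logical dataset.
fragments = ds.dataset("table/", format="parquet")

# Fold the surviving rows into a single large file so future scans can do
# big sequential reads of only the columns they care about.
pq.write_table(fragments.to_table(), "compacted/part-0.parquet")
```

Real table formats like Iceberg also have to reconcile delete files and rewrite manifest metadata as part of this step, which is where the per-workload complexity Warfield describes comes from.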
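The background restructuring he mentions at the end amounts to rewriting the table clustered around whatever predicate queries actually use. A hedged sketch of that idea, again with pyarrow and an assumed event_date column that the hypothetical workload filters on:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Re-sort and partition the table by the hot filter column so that reads
# touching a single date only scan the files that contain it.
table = ds.dataset("table/", format="parquet").to_table()
pq.write_to_dataset(
    table.sort_by("event_date"),
    root_path="table_by_date/",
    partition_cols=["event_date"],
)
```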