How not to use Apache Iceberg!

Ajantha
5 min read · Jan 23, 2024


Although Apache Iceberg is a powerful tool, using it incorrectly can incur significant costs and still fall short of the desired performance. Based on my observations from working with Apache Iceberg for over two years, these are common mistakes I see users make.

1. Using Hadoop catalog in production!

The Hadoop catalog in Iceberg is a great way to get started. You don’t need approval for additional software or libraries for the catalog. However, it also comes with limitations. Certain features, such as rename and drop, may not be safe depending on the storage layer, and you may also require a lock manager for atomic behaviour and safe concurrent operations.

Many users complete proof of concept (POC) with the Hadoop catalog and proceed to production without anticipating potential issues. Hence, avoid using the Hadoop catalog in production. Iceberg provides various catalog types, including Hive (HMS), Glue, Nessie, REST, JDBC, etc. Choose the one that best suits your use case.
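
As a rough sketch (not a recommendation of any specific catalog), switching Spark from the Hadoop catalog to, say, a REST catalog can look like the snippet below. The catalog name prod_catalog, the URI, and the namespace are placeholders for illustration, and the matching Iceberg runtime jar is assumed to already be on the classpath (see the last section).

from pyspark.sql import SparkSession

# Minimal sketch: register an Iceberg REST catalog named "prod_catalog" instead of a Hadoop catalog.
# The catalog name and URI are placeholders; the Iceberg Spark runtime jar must already be on the classpath.
spark = (
    SparkSession.builder
    .appName("iceberg-rest-catalog-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.prod_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.prod_catalog.type", "rest")
    .config("spark.sql.catalog.prod_catalog.uri", "http://localhost:8181")
    .getOrCreate()
)

# Placeholder namespace just to show the catalog in use.
spark.sql("CREATE NAMESPACE IF NOT EXISTS prod_catalog.db")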

Note: In upcoming blogs, I will delve into the details of the pros and cons of each catalog and provide guidance on selecting the most suitable one based on your use cases.

2. Not using metadata tables.

Metadata tables are a super powerful tool for analyzing the current state of the table. Once you master the schema of each metadata table, you can use JOIN queries between the metadata tables to compute specific information.
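
As a small, hedged example (assuming a Spark session with an Iceberg catalog named my_catalog and a table db.tbl, both placeholder names), you can look for small files per partition in the files metadata table, and join snapshots with manifests to see how many data files each snapshot added:

from pyspark.sql import SparkSession

# Assumes an existing Iceberg-enabled Spark session and a placeholder table my_catalog.db.tbl.
spark = SparkSession.builder.getOrCreate()

# Files and bytes per partition, useful for spotting partitions full of small files.
spark.sql("""
    SELECT partition,
           COUNT(*)                AS file_count,
           SUM(file_size_in_bytes) AS total_bytes,
           AVG(file_size_in_bytes) AS avg_file_bytes
    FROM my_catalog.db.tbl.files
    GROUP BY partition
    ORDER BY file_count DESC
""").show(truncate=False)

# Join the snapshots and manifests metadata tables to see how many data files each snapshot added.
spark.sql("""
    SELECT s.snapshot_id,
           s.committed_at,
           s.operation,
           SUM(m.added_data_files_count) AS added_data_files
    FROM my_catalog.db.tbl.snapshots s
    JOIN my_catalog.db.tbl.manifests m
      ON m.added_snapshot_id = s.snapshot_id
    GROUP BY s.snapshot_id, s.committed_at, s.operation
    ORDER BY s.committed_at
""").show(truncate=False)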

I highly recommend watching this talk on Iceberg metadata tables:
https://www.youtube.com/watch?v=s5eKriX6_EU&t=1641s&ab_channel=TheASF

3. Not tuning the table properties for high concurrent operations.

Iceberg employs optimistic concurrency control. When conflicts arise during a commit, Iceberg retries the commit by rebasing only the metadata. Even though this process is handled automatically, the default configuration may not allow enough retries to succeed under high concurrency, so tune the properties below based on your requirements.

Property                   Default         Description
commit.retry.num-retries   4               Number of times to retry a commit before failing
commit.retry.min-wait-ms   100             Minimum time in milliseconds to wait before retrying a commit
commit.retry.max-wait-ms   60000 (1 min)   Maximum time in milliseconds to wait before retrying a commit
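
For example, on a table with many concurrent writers, these properties could be raised roughly as follows (a sketch; the table name and the exact values are placeholders, not recommendations):

from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled Spark session and a placeholder table my_catalog.db.tbl.
spark = SparkSession.builder.getOrCreate()

spark.sql("""
    ALTER TABLE my_catalog.db.tbl SET TBLPROPERTIES (
        'commit.retry.num-retries' = '10',    -- retry more times before failing the commit
        'commit.retry.min-wait-ms' = '200',   -- wait a little longer before the first retry
        'commit.retry.max-wait-ms' = '120000' -- cap the backoff at 2 minutes
    )
""")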

4. Not preventing the small files problem in the first place.

If a streaming service commits to an Iceberg table every second, it can generate millions of snapshots and Iceberg metadata files over time, leading to degraded query performance and wasted storage space.

We can utilize compaction to optimize the table. However, as prevention is better than cure, it’s advisable to address the issue proactively if possible. Some of my suggestions for preventing the small file problem include:

  • Partition the table correctly.
    Sometimes we partition on columns that leave very few rows per partition. In such cases, even after compaction, which happens within each partition, the output files can still be small.
    Also, when using partition transforms such as day(), there is no need for multiple partition levels like year/month/day, because day() already represents the number of days since 1970-01-01.
  • Implement a logic to batch process the data by tuning the frequency of data ingestion.
  • Adjust the target file size of the underlying file format (a sketch covering partitioning and target file size follows this list).
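
Here is a rough sketch of a table definition with a single days() transform (no redundant year/month levels) and a tuned target file size. The table name, columns, and the 256 MB value are placeholders for illustration.

from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled Spark session; table, columns, and sizes below are placeholders.
spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS my_catalog.db.events (
        id       BIGINT,
        payload  STRING,
        event_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))  -- the day transform alone is enough; no year/month/day layering
    TBLPROPERTIES (
        'write.target-file-size-bytes' = '268435456'  -- target roughly 256 MB data files
    )
""")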

5. Not utilizing the full power of compaction.

Rewriting data files is a powerful way to optimize the table for queries by compacting them. Users should be aware of all the options available during the rewrite; some of them are discussed below, and a sketch of a rewrite call follows the list.

  • If most of the queries are on hot data (certain partitions), we can compact only those partitions using filters.
  • When compacting the whole table, the default file-group concurrency (max-concurrent-file-group-rewrites, default 5) can make compaction slow, so tune this option.
  • Partial progress (partial-progress.enabled) is another option that commits compacted file groups incrementally through multiple commits instead of one large commit at the end.
  • Configuring Zorder or sort order is a great way to optimize the table for queries.
  • By default, delete files are not considered during compaction. Lowering delete-file-threshold to 1 (or another suitable value) makes data files with that many deletes eligible, so the rewrite applies the delete filter and writes clean new files.
  • rewrite_data_files will also write a new manifest file, so there is no need to run rewrite_manifests blindly before or after that.
  • The rewrite-all option is useful when you need to force every file to be rewritten regardless of its size; one use case is rewriting all files according to a new schema.
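
Here is a hedged sketch of how several of these options can be combined in a single rewrite_data_files call. The catalog, table, columns, predicate, and option values are placeholders; pick only the options your workload actually needs.

from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled Spark session; all names, the predicate, and option values are placeholders.
spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table      => 'db.tbl',
        strategy   => 'sort',
        sort_order => 'zorder(col1, col2)',  -- or a regular sort order such as 'col1 ASC'
        where      => "event_ts >= TIMESTAMP '2024-01-01 00:00:00'",  -- compact only the hot partitions
        options    => map(
            'max-concurrent-file-group-rewrites', '20',  -- raise the default file-group concurrency of 5
            'partial-progress.enabled', 'true',          -- commit compacted file groups incrementally
            'delete-file-threshold', '1',                -- rewrite data files that have any associated deletes
            'rewrite-all', 'false'                       -- set to true to force rewriting every file
        )
    )
""")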

More details: https://iceberg.apache.org/docs/latest/spark-procedures/#options

6. Not using the GC properly.

expire_snapshots and remove_orphan_files are used to garbage collect Iceberg data and metadata files. However, the order in which you run them matters for cleaning up files quickly; a sketch of both calls follows the list below.

  • If you are compacting data files, run expire_snapshots after the compaction, not before, so it can also clean up the data files that the compaction replaced. Running it before compaction only results in cleaning up expired metadata files.
  • Similarly, there is no need to run remove_orphan_files periodically if you know that few or no orphan files are present in the table location.
  • Utilize the snapshot tagging feature to retain only required snapshots.
  • Be aware of the default 5-day retention period. Without tuning the `older_than` property, users should not expect their fresh snapshots to be garbage collected.
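
As a sketch of one sensible ordering after a compaction run (the catalog/table names, timestamp, and retention values are placeholders):

from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled Spark session; names, the timestamp, and retention values are placeholders.
spark = SparkSession.builder.getOrCreate()

# 1. Compact first (previous section), then expire snapshots so the data files
#    replaced by compaction become eligible for cleanup once their snapshots expire.
spark.sql("""
    CALL my_catalog.system.expire_snapshots(
        table       => 'db.tbl',
        older_than  => TIMESTAMP '2024-01-16 00:00:00',  -- defaults to 5 days back when omitted
        retain_last => 10                                -- always keep at least the last 10 snapshots
    )
""")

# 2. Run remove_orphan_files only occasionally (for example after failed writes),
#    since it has to scan the whole table location.
spark.sql("""
    CALL my_catalog.system.remove_orphan_files(table => 'db.tbl')
""")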

7. Not tuning the metrics properties for tables with wide columns

By default, Iceberg collects column stats (value_counts, null_value_counts, nan_value_counts, lower_bound, upper_bound) for the first 100 columns. In most workloads, filter queries touch only a few columns, so make sure to write column stats only for the columns that filters are actually applied on, using properties like the ones below.

write.metadata.metrics.column.col1 = "full"
write.metadata.metrics.column.col5 = "truncate(2)"
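
For example (a sketch; the table and column names are placeholders), you can switch off the default stats inference and list only the filter columns explicitly:

from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled Spark session; table and column names are placeholders.
spark = SparkSession.builder.getOrCreate()

spark.sql("""
    ALTER TABLE my_catalog.db.tbl SET TBLPROPERTIES (
        'write.metadata.metrics.default'     = 'none',        -- stop collecting stats for every column
        'write.metadata.metrics.column.col1' = 'full',        -- full stats for a frequently filtered column
        'write.metadata.metrics.column.col5' = 'truncate(2)'  -- truncated bounds for another filter column
    )
""")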

8. Errors during manual installation.

The Iceberg documentation clearly states that copying only the runtime jar to the query engine's classpath is enough. However, many users add individual Iceberg jars alongside the runtime jar and end up with classloader issues.
Also, when using Spark, make sure the Scala version of the Iceberg runtime jar matches Spark's Scala version.
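
As a sketch of pulling in only the runtime jar from PySpark (the artifact coordinates below are just an example; pick the Spark, Scala, and Iceberg versions that match your environment):

from pyspark.sql import SparkSession

# Minimal sketch: fetch only the single Iceberg Spark runtime jar.
# The artifact name encodes the Spark version (3.5) and the Scala version (_2.12);
# the Scala suffix must match the Scala build of your Spark distribution.
# The Iceberg version 1.4.3 is just an example.
spark = (
    SparkSession.builder
    .appName("iceberg-runtime-jar-sketch")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3")
    .getOrCreate()
)

print(spark.version)  # the runtime jar above must target this Spark/Scala combination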

Do you use Apache Iceberg? Feel free to share your experiences, especially any misuses of Apache Iceberg you have seen. I will incorporate them into this post to help new users.
