@jjasont
Last active December 22, 2023 16:22
Spark Write Out Config

The fastest Spark write-out to S3 is currently achieved with the S3A filesystem and the magic committer.

Ensure the s3a:// scheme is used when reading from and writing to S3 for better performance.

spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
spark.hadoop.fs.s3a.committer.magic.enabled: "true"
spark.hadoop.fs.s3a.committer.name: "magic"
spark.sql.sources.commitProtocolClass: "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
spark.sql.parquet.output.committer.class: "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"
# other alternative
spark.sql.parquet.output.committer.class: "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter"

# for the staging committers, replace an overlapping partition on conflict
# (this option is ignored by the magic committer)
spark.hadoop.fs.s3a.committer.staging.conflict-mode: "replace"
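The settings above can be applied when building the SparkSession. A minimal PySpark sketch, assuming pyspark and the spark-hadoop-cloud module are on the classpath (the app name is an example):

```python
# S3A magic-committer settings from the config block above,
# expressed as key/value pairs for SparkSession.builder.config().
magic_committer_conf = {
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def build_session(conf):
    """Build a SparkSession with the magic-committer configs applied."""
    # Deferred import so the dict above can be inspected without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("s3a-magic-write")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Passing the configs at session-build time (rather than per-query) matters here, because the hadoop-level `spark.hadoop.*` settings are read when the filesystem is initialized.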

Reference: link

Dynamic Partition Overwrite doesn't work with PathOutputCommitter

To use dynamic partition overwrite, add or change the following Spark settings, which fall back to the default commit protocol:

spark.sql.sources.partitionOverwriteMode: "dynamic"
spark.sql.sources.commitProtocolClass: "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol"
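A minimal sketch of a dynamic-partition-overwrite write with these settings, assuming an already-built SparkSession; the `dt` partition column and output path are hypothetical examples:

```python
# Settings from the block above: dynamic partition overwrite requires
# falling back to the default SQLHadoopMapReduceCommitProtocol.
dynamic_overwrite_conf = {
    "spark.sql.sources.partitionOverwriteMode": "dynamic",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources."
        "SQLHadoopMapReduceCommitProtocol",
}


def overwrite_partitions(spark, df, path):
    """Overwrite only the partitions present in `df`; other partitions
    under `path` are left untouched."""
    for key, value in dynamic_overwrite_conf.items():
        spark.conf.set(key, value)
    df.write.mode("overwrite").partitionBy("dt").parquet(path)
```

In "dynamic" mode, `mode("overwrite")` replaces only the partitions that appear in the incoming DataFrame instead of truncating the whole output path, which is what "static" (the default) would do.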