Use the s3a prefix when reading from and writing to S3 for better performance
- What's the difference between s3, s3n, and s3a?
- Spark - Committing work into cloud storage safely and fast
- Hadoop - The Magic Committer
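As a minimal sketch of the s3a recommendation above, the helper below rewrites legacy `s3://` and `s3n://` URIs to `s3a://` before they are passed to Spark. The function name `to_s3a` and the example bucket are hypothetical. It assumes a plain Apache Hadoop/Spark setup; on platforms that bind `s3://` to their own connector, this rewrite may not be what you want.

```python
# Legacy Hadoop S3 schemes that this sketch rewrites to s3a.
LEGACY_S3_SCHEMES = {"s3", "s3n"}


def to_s3a(uri: str) -> str:
    """Return uri with a leading s3:// or s3n:// replaced by s3a://.

    Any other URI (including s3a:// and local paths) is returned unchanged.
    """
    scheme, sep, rest = uri.partition("://")
    if sep and scheme.lower() in LEGACY_S3_SCHEMES:
        return "s3a://" + rest
    return uri


if __name__ == "__main__":
    # "my-bucket" is a placeholder bucket name.
    print(to_s3a("s3://my-bucket/data/events"))
```

Normalizing paths in one place keeps individual jobs from mixing schemes by accident.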
spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
spark.hadoop.fs.s3a.committer.magic.enabled: "true"
spark.hadoop.fs.s3a.committer.name: "magic"
spark.sql.sources.commitProtocolClass: "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
spark.sql.parquet.output.committer.class: "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"
# alternative to the line above; set only one of the two
spark.sql.parquet.output.committer.class: "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter"
# replaces an existing partition when the output conflicts with it (a staging committer option)
spark.hadoop.fs.s3a.committer.staging.conflict-mode: "replace"
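One way to apply the settings above is to keep them in a single mapping and render them either as `spark-submit` arguments or onto a session builder. The sketch below copies the keys and values from the list above. The names `S3A_MAGIC_COMMITTER_CONF`, `as_submit_args` and `apply_to_builder` are hypothetical. It also assumes the `spark-hadoop-cloud` module is on the classpath, since the `org.apache.spark.internal.io.cloud` classes live there.

```python
# Settings copied from the list above (first parquet committer variant).
S3A_MAGIC_COMMITTER_CONF = {
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    "spark.hadoop.fs.s3a.committer.staging.conflict-mode": "replace",
}


def as_submit_args(conf):
    """Render a settings mapping as a flat list of spark-submit arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def apply_to_builder(builder, conf):
    """Apply each setting to a SparkSession builder and return the builder.

    builder is expected to be pyspark.sql.SparkSession.builder; pyspark is
    not imported here so the sketch stays importable without it.
    """
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


if __name__ == "__main__":
    print(" ".join(as_submit_args(S3A_MAGIC_COMMITTER_CONF)))
```

With PySpark installed, `apply_to_builder(SparkSession.builder, S3A_MAGIC_COMMITTER_CONF).getOrCreate()` would create a session with these settings.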
Reference: link
- Cloudera - Why can't I partitionOverwrite in "dynamic" mode
- AWS - What commit protocol class should I use for "dynamic" partitionOverwrite?
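Add or change the following Spark settings. Note that spark.sql.sources.commitProtocolClass replaces the PathOutputCommitProtocol value used with the magic committer above.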
spark.sql.sources.partitionOverwriteMode: "dynamic"
spark.sql.sources.commitProtocolClass: "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol"
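The two settings above override any earlier value for the same keys. A rough sketch of layering them over an existing configuration and then doing a partitioned overwrite could look like this. The names `with_dynamic_overwrite` and `overwrite_partitions` are hypothetical, and `df` is assumed to be a `pyspark.sql.DataFrame` created in a session that already carries these settings.

```python
# Settings copied from the two lines above.
DYNAMIC_OVERWRITE_CONF = {
    "spark.sql.sources.partitionOverwriteMode": "dynamic",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
}


def with_dynamic_overwrite(base_conf):
    """Return a copy of base_conf with the dynamic-overwrite settings on top.

    Keys present in both mappings take the dynamic-overwrite value.
    """
    merged = dict(base_conf)
    merged.update(DYNAMIC_OVERWRITE_CONF)
    return merged


def overwrite_partitions(df, path, partition_cols):
    """Write df as parquet, overwriting only the partitions it contains.

    Relies on partitionOverwriteMode=dynamic being set on the session;
    in static mode the same call replaces all existing data under path.
    """
    df.write.mode("overwrite").partitionBy(*partition_cols).parquet(path)
```

For example, `overwrite_partitions(df, "s3a://my-bucket/table", ["dt"])` would rewrite only the `dt` partitions present in `df`. The bucket and column names here are placeholders.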