@jjasont
Last active December 22, 2023 16:22
Spark Write Out Config

The fastest Spark write-out to S3 is currently achieved with the S3A filesystem and the magic committer.

Ensure the s3a:// scheme is used when reading from and writing to S3 for better performance.

spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
spark.hadoop.fs.s3a.committer.magic.enabled: "true"
spark.hadoop.fs.s3a.committer.name: "magic"
spark.sql.sources.commitProtocolClass: "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
spark.sql.parquet.output.committer.class: "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"
# other alternative
spark.sql.parquet.output.committer.class: "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter"

# for the staging committers, replace an overlapping partition on conflict
# (this option is ignored by the magic committer)
spark.hadoop.fs.s3a.committer.staging.conflict-mode: "replace"
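The settings above can be applied when building the SparkSession. A minimal PySpark sketch, assuming pyspark and the spark-hadoop-cloud module are on the classpath (the app name is an example):

```python
# S3A magic-committer settings from the config block above,
# expressed as key/value pairs for SparkSession.builder.config().
magic_committer_conf = {
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def build_session(conf):
    """Build a SparkSession with the magic-committer configs applied."""
    # Deferred import so the dict above can be inspected without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("s3a-magic-write")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Passing the configs at session-build time (rather than per-query) matters here, because the hadoop-level `spark.hadoop.*` settings are read when the filesystem is initialized.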

Reference: link

Dynamic Partition Overwrite doesn't work with PathOutputCommitter

To use dynamic partition overwrite, add or change the following Spark settings, which fall back to the default commit protocol:

spark.sql.sources.partitionOverwriteMode: "dynamic"
spark.sql.sources.commitProtocolClass: "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol"
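A minimal sketch of a dynamic-partition-overwrite write with these settings, assuming an already-built SparkSession; the `dt` partition column and output path are hypothetical examples:

```python
# Settings from the block above: dynamic partition overwrite requires
# falling back to the default SQLHadoopMapReduceCommitProtocol.
dynamic_overwrite_conf = {
    "spark.sql.sources.partitionOverwriteMode": "dynamic",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources."
        "SQLHadoopMapReduceCommitProtocol",
}


def overwrite_partitions(spark, df, path):
    """Overwrite only the partitions present in `df`; other partitions
    under `path` are left untouched."""
    for key, value in dynamic_overwrite_conf.items():
        spark.conf.set(key, value)
    df.write.mode("overwrite").partitionBy("dt").parquet(path)
```

In "dynamic" mode, `mode("overwrite")` replaces only the partitions that appear in the incoming DataFrame instead of truncating the whole output path, which is what "static" (the default) would do.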