1 shard corresponds to 1 Spark partition.
Reading from ES: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/arch.html#arch-reading . Beware of increasing the number of shards on ES for performance reasons:
A common concern (read optimization) for improving performance is to increase the number of shards and thus increase the number of tasks on the Hadoop side. Unless such gains are demonstrated through benchmarks, we recommend against such a measure since in most cases, an Elasticsearch shard can easily handle data streaming to a Hadoop or Spark task.
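As a sketch of the read side: with the Spark SQL integration, elasticsearch-hadoop creates one input partition per primary shard by default. Index name, host, and shard count below are hypothetical placeholders, assuming Spark and the elasticsearch-hadoop connector are on the classpath.

```scala
// Sketch: read an ES index into a DataFrame via elasticsearch-hadoop.
// "logs-2024" and "es-host" are hypothetical placeholders.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-host:9200")
  .load("logs-2024")

// If the index has e.g. 5 primary shards, expect 5 input partitions:
println(df.rdd.getNumPartitions)
```

This is why adding shards purely to get more Spark tasks is discouraged above: the partition count follows the shard count, but each shard can usually keep a single task fed.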
Writing to ES: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/arch.html#arch-writing . Write performance can be increased by having more partitions:
elasticsearch-hadoop detects the number of (primary) shards where the write will occur and distributes the writes between these. The more splits/partitions available, the more mappers/reducers can write data in parallel to Elasticsearch.
Note that nothing is said about ES automatically handling this level of write parallelism. Production testing will determine what an ES cluster can actually handle in terms of write load.
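A sketch of increasing write parallelism by repartitioning before the save (index name and partition count are illustrative, and assume `elasticsearch-spark` is on the classpath):

```scala
import org.elasticsearch.spark.sql._

// Sketch: more partitions => more tasks writing to ES concurrently.
// 24 and "logs-2024" are illustrative; benchmark before raising
// parallelism, since the ES cluster must absorb the extra write load.
df.repartition(24)
  .saveToEs("logs-2024")
```

The inverse also holds: `coalesce` to fewer partitions is a quick way to throttle writers if the cluster is struggling.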
Settings here: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html#configuration-serialization
In particular, es.batch.write.retry.count controls how many times a failed bulk write is retried (default 3); once the retries are exhausted, the whole Hadoop/Spark job fails. es.batch.size.entries and es.batch.size.bytes are 2 settings that can be increased to make writes more performant.
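A sketch of passing these settings at write time (the values shown are examples to tune, not recommendations; defaults per the docs are 1000 entries, 1mb, and 3 retries):

```scala
import org.elasticsearch.spark.sql._

// Sketch: per-write elasticsearch-hadoop settings as a config map.
val esWriteCfg = Map(
  "es.batch.size.entries"      -> "5000", // docs per bulk request (default 1000)
  "es.batch.size.bytes"        -> "4mb",  // bulk request size (default 1mb)
  "es.batch.write.retry.count" -> "3"     // retries before the job fails (default 3)
)

df.saveToEs("logs-2024", esWriteCfg)     // "logs-2024" is a placeholder index
```

Larger batches mean fewer, heavier bulk requests; whether that helps depends on document size and cluster capacity, so it should be validated by the production testing mentioned above.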
Disable speculative execution, especially if we run into duplicate data:
speculative execution is an optimization, enabled by default, that allows Hadoop to create duplicates tasks of those which it considers hanged or slowed down. When doing data crunching or reading resources, having duplicate tasks is harmless and means at most a waste of computation resources; however when writing data to an external store, this can cause data corruption through duplicates or unnecessary updates.
For Spark, disable it via spark.speculation=false (this is already the default).
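Even though false is the default, it can be set explicitly to guard against a cluster-wide config that enables it. A minimal sketch (app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: pin speculation off so duplicate tasks can never
// double-write documents to ES.
val spark = SparkSession.builder()
  .appName("es-writer")
  .config("spark.speculation", "false")
  .getOrCreate()
```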