= Protocol Buffers as a format for FHIR data

A colleague of mine was attempting to improve the throughput of an application that was being fed FHIR data, and noticed some problems.

> The system that was sending us data was sending bundles that were pretty-printed JSON. These pretty-printed bundles were close to 2 MB in size. Removing pretty printing reduced the bundles to about 1 MB. Compressing these bundles reduced the size to 70 kB. That was a sign that we don't have a lot of entropy in the data. Of course, JSON will have field names repeating over and over, so this is expected. We discovered that our system performance was in some ways limited by the amount of IO involved in transmitting these bundles.

So I decided to investigate using a binary format. There are quite a few popular ones out there: Protobuf, Thrift, Avro, and MsgPack. I started with Avro, but the existence of cycles in the structures made it unsuitable for FHIR data. Next up, I tried Protobuf. This worked out pretty well, and I think it is good enough to enable a discussion.

== Benchmarks

For the impatient, here are the benchmark results. I used two bundles: one about 2 kB in size as uncompressed pretty-printed JSON, the other about 20 kB. The source files should permit modification so that one can test with their own data.

=== File Sizes

The first table shows file sizes for different scenarios.

Input:: Input file (generated)
P Json:: Pretty-printed JSON
P XML:: Pretty-printed XML
U Json:: Non-pretty (ugly) printed JSON
U XML:: Non-pretty (ugly) printed XML
Proto:: Protobuf binary format

Each column shows, in parentheses, the size of the GZip-compressed data. This should reflect the IO for a typical web server with GZip encoding supported.

=== Parse/Serialize Performance

The second table shows the performance of parsing and serializing each format for each bundle type.
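The low-entropy observation quoted above is easy to sanity-check before looking at the numbers. This is a small Python sketch with a made-up FHIR-like bundle, not the actual benchmark bundles:

```python
import gzip
import json

# Toy FHIR-like bundle: the same field names repeat for every entry,
# so the serialized JSON carries far less entropy than its size suggests.
# (Illustrative data only -- not the benchmark bundles.)
bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Observation", "id": str(i), "status": "final"}}
        for i in range(50)
    ],
}

pretty = json.dumps(bundle, indent=2).encode()
ugly = json.dumps(bundle, separators=(",", ":")).encode()

print("pretty:", len(pretty), "bytes ->", len(gzip.compress(pretty)), "gzipped")
print("ugly:  ", len(ugly), "bytes ->", len(gzip.compress(ugly)), "gzipped")
```

The pretty and ugly variants compress to nearly the same size, because the whitespace and the repeated field names are exactly what GZip eliminates.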
[source]
----
Environment
===========
* Groovy: 2.4.12
* JVM: Java HotSpot(TM) 64-Bit Server VM (25.141-b15, Oracle Corporation)
* JRE: 1.8.0_141
* Total Memory: 440.5 MB
* Maximum Memory: 3641 MB
* OS: Mac OS X (10.13.1, x86_64)

Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On

Filename             |     Input( Zipped) |    P Json( Zipped) |     P XML( Zipped) |    U Json( Zipped) |     U XML( Zipped) |     Proto( Zipped)
-------------------- + ------------------ + ------------------ + ------------------ + ------------------ + ------------------ + ------------------
bundle-2k.json       |        2028(  585) |        2028(  585) |        2622(  652) |        1027(  510) |        1717(  595) |         455(  382)
bundle-20k.json      |       20327( 1228) |       20327( 1228) |       27782( 1374) |       10869( 1055) |       17655( 1215) |        4331(  729)

                               user    system       cpu      real
Print Pretty JSON - 2k        75787       520     76307     76722
Print Ugly JSON - 2k          66706       316     67022     67444
Print Pretty XML - 2k         78654       624     79278     80399
Print Ugly XML - 2k           68887       504     69391     70010
Serialize Protobuf - 2k         985         2       987       989
Humanize Protobuf - 2k        70677       905     71582     73008
Parse Pretty JSON - 2k        80398       212     80610     80869
Parse Ugly JSON - 2k          76539       451     76990     77377
Parse Pretty XML - 2k        123847       597    124444    125119
Parse Ugly XML - 2k          123952      1715    125667    127153
Parse Protobuf - 2k            4441       117      4558      4709
Print Pretty JSON - 20k      764057      2842    766899    769137
Print Ugly JSON - 20k        680447      1574    682021    682836
Print Pretty XML - 20k       805454      3954    809408    814134
Print Ugly XML - 20k         653903      1352    655255    656354
Serialize Protobuf - 20k       9888         7      9895      9900
Humanize Protobuf - 20k      822226      1850    824076    826057
Parse Pretty JSON - 20k      763755      3275    767030    770772
Parse Ugly JSON - 20k        775614      4480    780094    782850
Parse Pretty XML - 20k      1025954      1102   1027056   1028101
Parse Ugly XML - 20k         968764      2623    971387    972552
Parse Protobuf - 20k          43236       717     43953     45119
----

For the JSON and XML formats, I used HAPI-FHIR. For Protobuf, I created a _quick-and-dirty_ library: https://github.com/rahulsom/fhir-protobuf[fhir-protobuf].
It uses the file `fhir.schema.json` from FHIR's download section. Naturally, it doesn't do as much work as HAPI-FHIR does, so this should not be read as an apples-to-apples performance comparison. However, given that serialization is a couple of orders of magnitude faster and parsing is about twice as fast, it might be of interest to some FHIR users. My main interest, though, is the size of the data.

An obvious downside is the difficulty of reading binary files without custom tooling. With most of these libraries (Protobuf, Avro, Thrift, MsgPack), that custom tooling is about a dozen lines of code in your favorite scripting language, so it should not be a major deterrent. Some of them even offer two-way conversion between JSON and binary.

The other major downside is the absence of validation. Then again, I think neither JSON nor XML offers validation on its own; in the case of HAPI-FHIR, validation has been built on top of the parser. Something similar should be possible in this case as well.
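To make the tooling claim concrete, here is a schema-free dump of Protobuf's wire format in plain Python. It is stdlib-only and a bit longer than a dozen lines, but in the same spirit; the sample bytes at the end are hand-encoded for illustration (the classic varint-150 example), not FHIR data:

```python
import struct

def read_varint(buf, i):
    """Decode a base-128 varint starting at buf[i]; return (value, next index)."""
    result = shift = 0
    while True:
        b = buf[i]; i += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i
        shift += 7

def dump(buf):
    """List (field number, wire type, raw value) for every field in a buffer.

    No schema needed: the wire format alone tells us how to skip each value.
    """
    fields, i = [], 0
    while i < len(buf):
        tag, i = read_varint(buf, i)
        number, wire = tag >> 3, tag & 7
        if wire == 0:                          # varint
            value, i = read_varint(buf, i)
        elif wire == 1:                        # fixed 64-bit
            value, i = struct.unpack_from("<q", buf, i)[0], i + 8
        elif wire == 2:                        # length-delimited: string/bytes/message
            length, i = read_varint(buf, i)
            value, i = bytes(buf[i:i + length]), i + length
        elif wire == 5:                        # fixed 32-bit
            value, i = struct.unpack_from("<i", buf, i)[0], i + 4
        else:
            raise ValueError("unsupported wire type %d" % wire)
        fields.append((number, wire, value))
    return fields

# Hand-encoded sample message: field 1 = varint 150, field 2 = string "Patient"
print(dump(bytes([0x08, 0x96, 0x01, 0x12, 0x07]) + b"Patient"))
```

Without the schema you only see field numbers rather than names, but that is usually enough to eyeball a payload; recursing into wire type 2 values with the same function reveals nested messages.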