= Protocol Buffers as a format for FHIR data

A colleague of mine was attempting to improve the throughput of an application that was being fed FHIR data, and noticed some problems.

> The system that was sending us data was sending bundles that were pretty-printed JSON. These pretty-printed bundles were close to 2 MB in size. Removing pretty printing reduced the bundles to about 1 MB. Compressing these bundles reduced the size to 70 kB. That was a sign that we don't have a lot of entropy in the data. Of course, JSON will have field names repeating over and over, so this is expected. We discovered that our system performance was in some ways limited by the amount of IO involved in transmitting these bundles.

So I decided to investigate using a binary format. There are quite a few popular ones out there: Protobuf, Thrift, Avro, and MsgPack. I started with Avro, but the existence of cycles in the structures made it unsuitable for FHIR data. Next up, I tried Protobuf. This worked out pretty well, and I think it is good enough to enable a discussion.

== Benchmarks

For the impatient, here are the benchmark results. I used two bundles: one about 2 kB in size as uncompressed pretty-printed JSON, the other about 20 kB. The source files should permit modification so that one can test with their own data.

=== File Sizes

The first table shows file sizes for different scenarios.

Input:: Input file (generated)
P Json:: Pretty-printed JSON
P XML:: Pretty-printed XML
U Json:: Non-pretty (ugly) printed JSON
U XML:: Non-pretty (ugly) printed XML
Proto:: Protobuf binary format

Each column shows, in parentheses, the size of the GZip-compressed data. This should reflect the IO for a typical web server with GZip encoding supported.

=== Parse/Serialize Performance

The second table shows the performance of parsing and serializing each format for each bundle type.
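The low-entropy observation quoted above is easy to sanity-check before looking at the numbers. This is a small Python sketch with a made-up FHIR-like bundle, not the actual benchmark bundles:

```python
import gzip
import json

# Toy FHIR-like bundle: the same field names repeat for every entry,
# so the serialized JSON carries far less entropy than its size suggests.
# (Illustrative data only -- not the benchmark bundles.)
bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Observation", "id": str(i), "status": "final"}}
        for i in range(50)
    ],
}

pretty = json.dumps(bundle, indent=2).encode()
ugly = json.dumps(bundle, separators=(",", ":")).encode()

print("pretty:", len(pretty), "bytes ->", len(gzip.compress(pretty)), "gzipped")
print("ugly:  ", len(ugly), "bytes ->", len(gzip.compress(ugly)), "gzipped")
```

The pretty and ugly variants compress to nearly the same size, because the whitespace and the repeated field names are exactly what GZip eliminates.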
[source]
----
Environment
===========
* Groovy: 2.4.12
* JVM: Java HotSpot(TM) 64-Bit Server VM (25.141-b15, Oracle Corporation)
* JRE: 1.8.0_141
* Total Memory: 440.5 MB
* Maximum Memory: 3641 MB
* OS: Mac OS X (10.13.1, x86_64)

Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On

Filename             |     Input( Zipped) |    P Json( Zipped) |     P XML( Zipped) |    U Json( Zipped) |     U XML( Zipped) |     Proto( Zipped)
-------------------- + ------------------ + ------------------ + ------------------ + ------------------ + ------------------ + ------------------
bundle-2k.json       |        2028(  585) |        2028(  585) |        2622(  652) |        1027(  510) |        1717(  595) |         455(  382)
bundle-20k.json      |       20327( 1228) |       20327( 1228) |       27782( 1374) |       10869( 1055) |       17655( 1215) |        4331(  729)

                               user    system       cpu      real
Print Pretty JSON - 2k        75787       520     76307     76722
Print Ugly JSON - 2k          66706       316     67022     67444
Print Pretty XML - 2k         78654       624     79278     80399
Print Ugly XML - 2k           68887       504     69391     70010
Serialize Protobuf - 2k         985         2       987       989
Humanize Protobuf - 2k        70677       905     71582     73008
Parse Pretty JSON - 2k        80398       212     80610     80869
Parse Ugly JSON - 2k          76539       451     76990     77377
Parse Pretty XML - 2k        123847       597    124444    125119
Parse Ugly XML - 2k          123952      1715    125667    127153
Parse Protobuf - 2k            4441       117      4558      4709
Print Pretty JSON - 20k      764057      2842    766899    769137
Print Ugly JSON - 20k        680447      1574    682021    682836
Print Pretty XML - 20k       805454      3954    809408    814134
Print Ugly XML - 20k         653903      1352    655255    656354
Serialize Protobuf - 20k       9888         7      9895      9900
Humanize Protobuf - 20k      822226      1850    824076    826057
Parse Pretty JSON - 20k      763755      3275    767030    770772
Parse Ugly JSON - 20k        775614      4480    780094    782850
Parse Pretty XML - 20k      1025954      1102   1027056   1028101
Parse Ugly XML - 20k         968764      2623    971387    972552
Parse Protobuf - 20k          43236       717     43953     45119
----

For the JSON and XML formats, I used HAPI-FHIR. For Protobuf, I created a _quick-and-dirty_ library: https://github.com/rahulsom/fhir-protobuf[fhir-protobuf].
It uses the file `fhir.schema.json` from FHIR's download section. Naturally, it doesn't do as much work as HAPI-FHIR does, so this should not be read as an apples-to-apples performance comparison. However, given that serialization is a couple of orders of magnitude faster and parsing is about twice as fast, it might be of interest to some FHIR users. My main interest, though, is the size of the data.

An obvious downside is the difficulty of reading binary files without custom tooling. With most of these libraries (Protobuf, Avro, Thrift, MsgPack), that custom tooling is about a dozen lines of code in your favorite scripting language, so it should not be a major deterrent. Some of them even offer two-way conversion between JSON and binary.

The other major downside is the absence of validation. Then again, I think neither JSON nor XML offers validation on its own; in the case of HAPI-FHIR, validation has been built on top of the parser. Something similar should be possible in this case as well.
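To make the tooling claim concrete, here is a schema-free dump of Protobuf's wire format in plain Python. It is stdlib-only and a bit longer than a dozen lines, but in the same spirit; the sample bytes at the end are hand-encoded for illustration (the classic varint-150 example), not FHIR data:

```python
import struct

def read_varint(buf, i):
    """Decode a base-128 varint starting at buf[i]; return (value, next index)."""
    result = shift = 0
    while True:
        b = buf[i]; i += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i
        shift += 7

def dump(buf):
    """List (field number, wire type, raw value) for every field in a buffer.

    No schema needed: the wire format alone tells us how to skip each value.
    """
    fields, i = [], 0
    while i < len(buf):
        tag, i = read_varint(buf, i)
        number, wire = tag >> 3, tag & 7
        if wire == 0:                          # varint
            value, i = read_varint(buf, i)
        elif wire == 1:                        # fixed 64-bit
            value, i = struct.unpack_from("<q", buf, i)[0], i + 8
        elif wire == 2:                        # length-delimited: string/bytes/message
            length, i = read_varint(buf, i)
            value, i = bytes(buf[i:i + length]), i + length
        elif wire == 5:                        # fixed 32-bit
            value, i = struct.unpack_from("<i", buf, i)[0], i + 4
        else:
            raise ValueError("unsupported wire type %d" % wire)
        fields.append((number, wire, value))
    return fields

# Hand-encoded sample message: field 1 = varint 150, field 2 = string "Patient"
print(dump(bytes([0x08, 0x96, 0x01, 0x12, 0x07]) + b"Patient"))
```

Without the schema you only see field numbers rather than names, but that is usually enough to eyeball a payload; recursing into wire type 2 values with the same function reveals nested messages.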