@fetep
Created August 3, 2012 02:31
Revisions

  1. fetep revised this gist Aug 3, 2012. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -52,7 +52,7 @@ Each message should start with the protocol version, followed by a `;`, followed

    Version "1" update payload is the same format as the UDP protocol uses. We don't take multiple updates separated by `\n` (no need to with zeromq).

    I'm not entirely convinced about the protocol number, but it feels like it leaves the most flexibility for future improvements.
    I'm not entirely convinced about the protocol number, but it feels like it leaves the most flexibility for future improvements. I think the protocol versioning should only be on the zeromq input (to start); adding it to the UDP input would potentially not be backwards compatible with the current UDP clients.

    ### Sample zeromq messages (v1)

  2. fetep revised this gist Aug 3, 2012. 1 changed file with 11 additions and 5 deletions.
    16 changes: 11 additions & 5 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -48,14 +48,20 @@ Proposal two: add ZeroMQ input

    The zeromq input will listen on a `zmq.PULL` socket. This can be *in addition to* the UDP server, not a replacement. They both speak approximately the same protocol.

    Each message can contain as many metric updates as needed, separated by `\n`. The text protocol is very similar to the UDP protocol - the only difference is we add a protocol version up front followed by a `;` followed by the update payload. Protocol 1's update payload is exactly the same as the UDP protocol listed above (including the extended timer update format support). I'm not entirely convinced about the protocol number, but it feels like it leaves the most flexibility for future improvements.
    Each message should start with the protocol version, followed by a `;`, followed by the update payload.

    ### Sample zeromq message
    Version "1" update payload is the same format as the UDP protocol uses. We don't take multiple updates separated by `\n` (no need to with zeromq).

    I'm not entirely convinced about the protocol number, but it feels like it leaves the most flexibility for future improvements.

    ### Sample zeromq messages (v1)

    ```
    1;app.foo.hits|5|c
    ```

    ```
    1;app.foo.hits|5|c\n
    app.foo.response_time|14.5,10.4,8.9,11.3,12.3|t\n
    app.foo.error|2|c
    1;app.foo.response_time|14.5,10.4,8.9,11.3,12.3|t
    ```

    ### Implementation
  3. fetep revised this gist Aug 3, 2012. 1 changed file with 6 additions and 2 deletions.
    8 changes: 6 additions & 2 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -1,6 +1,6 @@
    make statsd better for high throughput
    =======================================
    In the interest of shipping high volumes of updates in the shortest amount of time and work, I'm proposing two things:
    In the interest of shipping high volumes of updates in the shortest amount of time and work, I'm proposing two statsd changes:

    1. enhance the timer update format
    2. add a zeromq input (in addition to UDP)
    @@ -56,4 +56,8 @@ Each message can contain as many metric updates as needed, separated by `\n`. Th
    1;app.foo.hits|5|c\n
    app.foo.response_time|14.5,10.4,8.9,11.3,12.3|t\n
    app.foo.error|2|c
    ```
    ```

    ### Implementation

    I'm planning on implementing these in my ruby-statsd repo for testing.
  4. fetep revised this gist Aug 3, 2012. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -2,8 +2,8 @@ make statsd better for high throughput
    =======================================
    In the interest of shipping high volumes of updates in the shortest amount of time and work, I'm proposing two things:

    *# enhance the timer update format
    *# add a zeromq input (in addition to UDP)
    1. enhance the timer update format
    2. add a zeromq input (in addition to UDP)

    Original UDP protocol
    ---------------------
  5. fetep revised this gist Aug 3, 2012. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -2,8 +2,8 @@ make statsd better for high throughput
    =======================================
    In the interest of shipping high volumes of updates in the shortest amount of time and work, I'm proposing two things:

    # enhance the timer update format
    # add a zeromq input (in addition to UDP)
    *# enhance the timer update format
    *# add a zeromq input (in addition to UDP)

    Original UDP protocol
    ---------------------
  6. fetep revised this gist Aug 3, 2012. 1 changed file with 4 additions and 1 deletion.
    5 changes: 4 additions & 1 deletion gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -1,6 +1,9 @@
    make statsd better for high throughput
    =======================================
    In the interest of shipping high volumes of updates in the shortest amount of time and work, I'm proposing we add a zeromq input to statsd (in addition to the existing UDP input, and without breaking the UDP input). I'm also proposing a small change to the timer update format to make it easier for a client to batch statsd messages.
    In the interest of shipping high volumes of updates in the shortest amount of time and work, I'm proposing two things:

    # enhance the timer update format
    # add a zeromq input (in addition to UDP)

    Original UDP protocol
    ---------------------
  7. fetep revised this gist Aug 3, 2012. 1 changed file with 6 additions and 6 deletions.
    12 changes: 6 additions & 6 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -30,8 +30,8 @@ There's probably a better name than "timer" for this, because this is also how g
    * Some network stack tweaking is needed to up your UDP throughput rate if you're going to be doing a lot of updates.
    * Batching timer updates is hard. Batching counter updates is easy: a client can cache that it did 300 counter +1 updates for metric foo and every few seconds, flush the total. Timer updates can't really be aggregated at the statsd client level -- you want to pass the raw values so statsd's aggregations (min/max/upper90/mean) make more sense, and currently that means repeating the metric name once per timer update (and this takes up valuable space in the packet, and will basically cause you to send a lot more updates). Once you hit a certain point (hundreds of timers being updated thousands of times a second) you'll end up flushing UDP packets like mad with tons of repeated content in them.

    Extended timer update format
    ----------------------------
    Proposal 1: Extended timer update format
    -----------------------------------------

    To specifically address clients that might want to batch their statsd updates, allow specifying more than one value for a timer.

    @@ -40,12 +40,12 @@ To specifically address clients that might want to batch their statsd updates, a
    ** The key difference here is we allow multiple values to be passed in. They'll all be counted as samples for <metric name>.
    ** We accept `t` or `ms` for the timer type (`ms` for legacy reasons; it implies a lot to someone casually reading the code and isn't very descriptive, IMO -- `t` for timer makes more sense here).

    ZeroMQ protocol (version 1)
    ---------------------------
    Proposal two: add ZeroMQ input
    ------------------------------

    The zeromq server will listen on a `zmq.PULL` socket.
    The zeromq input will listen on a `zmq.PULL` socket. This can be *in addition to* the UDP server, not a replacement. They both speak approximately the same protocol.

    Each message can contain as many metric updates as needed, separated by `\n`. The text protocol is very similar to the UDP protocol - the only difference is we add a protocol version up front followed by a `;`. Protocol 1 is exactly the same as the UDP protocol listed above (with the extended timer update format support).
    Each message can contain as many metric updates as needed, separated by `\n`. The text protocol is very similar to the UDP protocol - the only difference is we add a protocol version up front followed by a `;` followed by the update payload. Protocol 1's update payload is exactly the same as the UDP protocol listed above (including the extended timer update format support). I'm not entirely convinced about the protocol number, but it feels like it leaves the most flexibility for future improvements.

    ### Sample zeromq message

  8. fetep revised this gist Aug 3, 2012. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -5,7 +5,7 @@ In the interest of shipping high volumes of updates in the shortest amount of ti
    Original UDP protocol
    ---------------------

    A series of stats updates separated by \n. Each update can be a timer or counter (originally; there are a few more extensions now, see https://github.com/b/statsd_spec).
    A series of stats updates separated by `\n`. Each update can be a timer or counter (originally; there are a few more extensions now, see https://github.com/b/statsd_spec).

    ### Counters

    @@ -45,7 +45,7 @@ ZeroMQ protocol (version 1)

    The zeromq server will listen on a `zmq.PULL` socket.

    Each message can contain as many metric updates as needed, separated by \n. The text protocol is very similar to the UDP protocol - the only difference is we add a protocol version up front followed by a ";". Protocol 1 is exactly the same as the UDP protocol listed above (with the extended timer update format support).
    Each message can contain as many metric updates as needed, separated by `\n`. The text protocol is very similar to the UDP protocol - the only difference is we add a protocol version up front followed by a `;`. Protocol 1 is exactly the same as the UDP protocol listed above (with the extended timer update format support).

    ### Sample zeromq message

  9. fetep revised this gist Aug 3, 2012. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -11,7 +11,8 @@ A series of stats updates separated by \n. Each update can be a timer or counte

    * `<metric name>:<value>|c`
    * `<metric name>:<value>|c|@<sample rate>`
    ** If you are only sampling 1/N events, include "N" here as the sample rate and we'll count it as N*`<value>` updates. It's probably easier to have your client do a little bit of batching and just flush one update every couple seconds per counter instead of this sampling logic.

    If you are only sampling 1/N events, include N here as the sample rate and we'll count it as a counter update of N*`<value>`. It's probably easier to have your client do the math itself or a little bit of batching and just flush one update every couple seconds per counter instead of this sampling madness.

    Counters will be turned into rates (every `$FLUSH_INTERVAL`, we take the total value for the counter and divide by `$FLUSH_INTERVAL` to calculate the rate).

  10. fetep revised this gist Aug 3, 2012. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -46,7 +46,7 @@ The zeromq server will listen on a `zmq.PULL` socket.

    Each message can contain as many metric updates as needed, separated by \n. The text protocol is very similar to the UDP protocol - the only difference is we add a protocol version up front followed by a ";". Protocol 1 is exactly the same as the UDP protocol listed above (with the extended timer update format support).

    === Sample zeromq message
    ### Sample zeromq message

    ```
    1;app.foo.hits|5|c\n
  11. fetep revised this gist Aug 3, 2012. 1 changed file with 12 additions and 16 deletions.
    28 changes: 12 additions & 16 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -1,39 +1,36 @@
    In the interest of shipping high volumes of updates in the shortest amount of time and work, I'm proposing we add a zeromq input to statsd (in addition to the existing UDP input, and without breaking the UDP input).

    A few of the statsd forks/implementations have added their own data types, and that's being collected/centralized here: https://github.com/b/statsd_spec
    make statsd better for high throughput
    =======================================
    In the interest of shipping high volumes of updates in the shortest amount of time and work, I'm proposing we add a zeromq input to statsd (in addition to the existing UDP input, and without breaking the UDP input). I'm also proposing a small change to the timer update format to make it easier for a client to batch statsd messages.

    Original UDP protocol
    ======================
    ---------------------

    A series of stats updates separated by \n. Each update can be a timer or counter (originally; there are a few more extensions now).
    A series of stats updates separated by \n. Each update can be a timer or counter (originally; there are a few more extensions now, see https://github.com/b/statsd_spec).

    Counters
    --------
    ### Counters

    * `<metric name>:<value>|c`
    * `<metric name>:<value>|c|@<sample rate>`
    ** If you are only sampling 1/N events, include "N" here as the sample rate and we'll count it as N*`<value>` updates. It's probably easier to have your client do a little bit of batching and just flush one update every couple seconds per counter instead of this sampling logic.

    Counters will be turned into rates (every `$FLUSH_INTERVAL`, we take the total value for the counter and divide by `$FLUSH_INTERVAL` to calculate the rate).

    Timers
    ------
    ### Timers

    * <metric name>:<value>|ms
    * `<metric name>:<value>|ms`

    Every `$FLUSH_INTERVAL`, all the values received since the last flush will be aggregated in a few ways: min, mean, upper_90 ("90" being configurable), max.

    There's probably a better name than "timer" for this, because this is also how gauges should be recorded. For example, if you wanted to periodically record through statsd how big a certain internal queue is, just send a timer update once a second with the queue size and you'll be able to graph your mean/upper_90/etc queue size.

    Limitations
    -----------
    ### Limitations

    * A single UDP packet can only be so big, so you have to be really careful with how many metric updates you pack in a single packet if you don't want to silently cause confusion (have your packet get split into two at a bad point). Lots of clients just send one update per packet.
    * Some network stack tweaking is needed to up your UDP throughput rate if you're going to be doing a lot of updates.
    * Batching timer updates is hard. Batching counter updates is easy: a client can cache that it did 300 counter +1 updates for metric foo and every few seconds, flush the total. Timer updates can't really be aggregated at the statsd client level -- you want to pass the raw values so statsd's aggregations (min/max/upper90/mean) make more sense, and currently that means repeating the metric name once per timer update (and this takes up valuable space in the packet, and will basically cause you to send a lot more updates). Once you hit a certain point (hundreds of timers being updated thousands of times a second) you'll end up flushing UDP packets like mad with tons of repeated content in them.

    Extended timer update format
    ============================
    ----------------------------

    To specifically address clients that might want to batch their statsd updates, allow specifying more than one value for a timer.

    @@ -43,14 +40,13 @@ To specifically address clients that might want to batch their statsd updates, a
    ** We accept `t` or `ms` for the timer type (`ms` for legacy reasons; it implies a lot to someone casually reading the code and isn't very descriptive, IMO -- `t` for timer makes more sense here).

    ZeroMQ protocol (version 1)
    ===========================
    ---------------------------

    The zeromq server will listen on a `zmq.PULL` socket.

    Each message can contain as many metric updates as needed, separated by \n. The text protocol is very similar to the UDP protocol - the only difference is we add a protocol version up front followed by a ";". Protocol 1 is exactly the same as the UDP protocol listed above (with the extended timer update format support).

    Sample zeromq message
    ---------------------
    === Sample zeromq message

    ```
    1;app.foo.hits|5|c\n
  12. fetep revised this gist Aug 3, 2012. 1 changed file with 27 additions and 20 deletions.
    47 changes: 27 additions & 20 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -1,49 +1,56 @@
    In the interest of shipping high volumes of updates in the shortest amount of time and work, I'm proposing we add a zeromq input (in addition to the existing UDP input, and without breaking the UDP input).
    In the interest of shipping high volumes of updates in the shortest amount of time and work, I'm proposing we add a zeromq input to statsd (in addition to the existing UDP input, and without breaking the UDP input).

    h2. Original UDP protocol
    A few of the statsd forks/implementations have added their own data types, and that's being collected/centralized here: https://github.com/b/statsd_spec

    Original UDP protocol
    ======================

    A series of stats updates separated by \n. Each update can be a timer or counter (originally; there are a few more extensions now).

    h3. Counters
    Counters
    --------

    * <metric name>:<value>|c
    * <metric name>:<value>|c|@<sample rate>
    ** If you are only sampling 1/N events, include "N" here as the sample rate and we'll count it as N*<value> updates. It's probably easier to have your client do a little bit of batching and just flush one update every couple seconds per counter instead of this sampling logic.
    * `<metric name>:<value>|c`
    * `<metric name>:<value>|c|@<sample rate>`
    ** If you are only sampling 1/N events, include "N" here as the sample rate and we'll count it as N*`<value>` updates. It's probably easier to have your client do a little bit of batching and just flush one update every couple seconds per counter instead of this sampling logic.

    Counters will be turned into rates (every $FLUSH_INTERVAL, we take the total value for the counter and divide by $FLUSH_INTERVAL to calculate the rate).
    Counters will be turned into rates (every `$FLUSH_INTERVAL`, we take the total value for the counter and divide by `$FLUSH_INTERVAL` to calculate the rate).

    h3. Timers
    Timers
    ------

    * <metric name>:<value>|ms

    Timer values will be aggregated into a few values: min, mean, upper_90 ("90" being configurable), max.
    Every `$FLUSH_INTERVAL`, all the values received since the last flush will be aggregated in a few ways: min, mean, upper_90 ("90" being configurable), max.

    There's probably a better name than "timer" for this, because this is also how gauges should be recorded. For example, if you wanted to periodically record to statsd how big a certain internal queue is, just send a timer update once a second, and you'll be able to graph your mean/upper_90/etc queue size.

    A few of the statsd forks/implementations have added their own data types, and that's being collected/centralized here: https://github.com/b/statsd_spec
    There's probably a better name than "timer" for this, because this is also how gauges should be recorded. For example, if you wanted to periodically record through statsd how big a certain internal queue is, just send a timer update once a second with the queue size and you'll be able to graph your mean/upper_90/etc queue size.

    Limitations:
    Limitations
    -----------

    * A single UDP packet can only be so big, so you have to be really careful with how many metric updates you pack in a single packet if you don't want to silently cause confusion (have your packet get split into two at a bad point). Lots of clients just send one update per packet.
    * Some network stack tweaking is needed to up your UDP throughput rate if you're going to be doing a lot of updates.
    * Batching timer updates is hard. Batching counter updates is easy: a client can cache that it did 300 counter +1 updates for metric foo and every few seconds, flush the total. Timer updates can't really be aggregated at the statsd client level -- you want to pass the raw values so statsd's aggregations (min/max/upper90/mean) make more sense, and currently that means repeating the metric name once per timer update (and this takes up valuable space in the packet, and will basically cause you to send a lot more updates). Once you hit a certain point (hundreds of timers being updated thousands of times a second) you'll end up flushing UDP packets like mad with tons of repeated content in them.

    h2. Extend timer update format
    Extended timer update format
    ============================

    To specifically address clients that might want to batch their statsd updates, allow specifying more than one value for a timer.

    * <metric name>|<value>[,<value2>,...,<valueN>]|t
    * <metric name>|<value>[,<value2>,...,<valueN>]|ms
    * `<metric name>|<value>[,<value2>,...,<valueN>]|t`
    * `<metric name>|<value>[,<value2>,...,<valueN>]|ms`
    ** The key difference here is we allow multiple values to be passed in. They'll all be counted as samples for <metric name>.
    ** We accept "t" or "ms" for the timer type ("ms" for legacy reasons; it implies a lot to someone casually reading the code and isn't very descriptive, IMO -- "t" for timer makes more sense here).
    ** We accept `t` or `ms` for the timer type (`ms` for legacy reasons; it implies a lot to someone casually reading the code and isn't very descriptive, IMO -- `t` for timer makes more sense here).

    h2. ZeroMQ protocol (version 1)
    ZeroMQ protocol (version 1)
    ===========================

    The zeromq server will listen on a zmq.PULL socket.
    The zeromq server will listen on a `zmq.PULL` socket.

    Each message can contain as many metric updates as needed, separated by \n. The text protocol is very similar to the UDP protocol - the only difference is we add a protocol version up front followed by a ";". Protocol 1 is exactly the same as the UDP protocol listed above (with the extended timer update format support).

    Sample zeromq message:
    Sample zeromq message
    ---------------------

    ```
    1;app.foo.hits|5|c\n
  13. fetep created this gist Aug 3, 2012.
    52 changes: 52 additions & 0 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,52 @@
    In the interest of shipping high volumes of updates in the shortest amount of time and work, I'm proposing we add a zeromq input (in addition to the existing UDP input, and without breaking the UDP input).

    h2. Original UDP protocol

    A series of stats updates separated by \n. Each update can be a timer or counter (originally; there are a few more extensions now).

    h3. Counters

    * <metric name>:<value>|c
    * <metric name>:<value>|c|@<sample rate>
    ** If you are only sampling 1/N events, include "N" here as the sample rate and we'll count it as N*<value> updates. It's probably easier to have your client do a little bit of batching and just flush one update every couple seconds per counter instead of this sampling logic.

    Counters will be turned into rates (every $FLUSH_INTERVAL, we take the total value for the counter and divide by $FLUSH_INTERVAL to calculate the rate).

    h3. Timers

    * <metric name>:<value>|ms

    Timer values will be aggregated into a few values: min, mean, upper_90 ("90" being configurable), max.

    There's probably a better name than "timer" for this, because this is also how gauges should be recorded. For example, if you wanted to periodically record to statsd how big a certain internal queue is, just send a timer update once a second, and you'll be able to graph your mean/upper_90/etc queue size.

    A few of the statsd forks/implementations have added their own data types, and that's being collected/centralized here: https://github.com/b/statsd_spec

    Limitations:

    * A single UDP packet can only be so big, so you have to be really careful with how many metric updates you pack in a single packet if you don't want to silently cause confusion (have your packet get split into two at a bad point). Lots of clients just send one update per packet.
    * Some network stack tweaking is needed to up your UDP throughput rate if you're going to be doing a lot of updates.
    * Batching timer updates is hard. Batching counter updates is easy: a client can cache that it did 300 counter +1 updates for metric foo and every few seconds, flush the total. Timer updates can't really be aggregated at the statsd client level -- you want to pass the raw values so statsd's aggregations (min/max/upper90/mean) make more sense, and currently that means repeating the metric name once per timer update (and this takes up valuable space in the packet, and will basically cause you to send a lot more updates). Once you hit a certain point (hundreds of timers being updated thousands of times a second) you'll end up flushing UDP packets like mad with tons of repeated content in them.

    h2. Extend timer update format

    To specifically address clients that might want to batch their statsd updates, allow specifying more than one value for a timer.

    * <metric name>|<value>[,<value2>,...,<valueN>]|t
    * <metric name>|<value>[,<value2>,...,<valueN>]|ms
    ** The key difference here is we allow multiple values to be passed in. They'll all be counted as samples for <metric name>.
    ** We accept "t" or "ms" for the timer type ("ms" for legacy reasons; it implies a lot to someone casually reading the code and isn't very descriptive, IMO -- "t" for timer makes more sense here).

    h2. ZeroMQ protocol (version 1)

    The zeromq server will listen on a zmq.PULL socket.

    Each message can contain as many metric updates as needed, separated by \n. The text protocol is very similar to the UDP protocol - the only difference is we add a protocol version up front followed by a ";". Protocol 1 is exactly the same as the UDP protocol listed above (with the extended timer update format support).

    Sample zeromq message:

    ```
    1;app.foo.hits|5|c\n
    app.foo.response_time|14.5,10.4,8.9,11.3,12.3|t\n
    app.foo.error|2|c
    ```
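
As a rough illustration of the v1 message format from the most recent revision (a protocol version, then `;`, then a single update payload), here is a minimal Python sketch of a parser. It is not taken from the gist or from any statsd implementation. The names `Update` and `parse_v1_message` are hypothetical, and the sketch assumes the `name|values|type` layout used in the sample messages. It does not handle the optional `@<sample rate>` suffix, and it leaves out the zeromq socket handling entirely.

```python
from dataclasses import dataclass
from typing import List

SUPPORTED_VERSION = "1"
# "t" and "ms" are both accepted for timers, as the proposal describes.
KNOWN_KINDS = {"c", "t", "ms"}


@dataclass
class Update:
    """Hypothetical container for one parsed metric update."""
    name: str
    values: List[float]
    kind: str


def parse_v1_message(message: str) -> Update:
    """Parse one message of the form '<version>;<name>|<v1>[,<v2>...]|<type>'."""
    version, sep, payload = message.partition(";")
    if not sep:
        raise ValueError("missing ';' after protocol version")
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported protocol version: {version!r}")

    fields = payload.split("|")
    if len(fields) != 3:
        raise ValueError(f"expected name|values|type, got: {payload!r}")
    name, raw_values, kind = fields

    if not name:
        raise ValueError("empty metric name")
    if kind not in KNOWN_KINDS:
        raise ValueError(f"unknown metric type: {kind!r}")

    # float() raises ValueError on malformed numbers, which we let propagate.
    values = [float(v) for v in raw_values.split(",")]

    # Assumption of this sketch: only timers may carry several values.
    if kind == "c" and len(values) != 1:
        raise ValueError("counters take exactly one value")

    return Update(name=name, values=values, kind=kind)
```

For example, `parse_v1_message("1;app.foo.response_time|14.5,10.4,8.9,11.3,12.3|t")` yields a single `Update` carrying five timer samples. Rejecting unknown versions up front keeps room for a later protocol version to change the payload layout.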