@bhenerey
Created May 23, 2012 19:40
Revisions

  1. bhenerey revised this gist May 23, 2012. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ideal ops.md
    @@ -73,8 +73,8 @@ In a perfect world, where things are done well, not just quickly, I would expect
    **Infrastructure as Code**

    * Infrastructure DB with API (Chef server)

    * All infra changes tracked, done via configuration management

    **Security**


  2. bhenerey renamed this gist May 23, 2012. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  3. bhenerey created this gist May 23, 2012.
    134 changes: 134 additions & 0 deletions ideal ops
    @@ -0,0 +1,134 @@
    In a perfect world, where things are done well, not just quickly, I would expect to find the following when joining the company:

    **Documentation**

    * Accurate / up-to-date systems architecture diagram
    * Accurate / up-to-date network diagram

    * Out-of-hours support plan

    * Incident management plan

    * Change management plan
    * Application documentation

    **Metric collection:**

    * comprehensive system metrics (e.g., CPU, load, memory, disk, network)

    * application metrics instrumented in code (e.g., queue length, time to post new job) [statsd]

    * business metrics instrumented in code as well (e.g., registrations) [statsd]
    * include network devices (e.g., firewalls, load balancers, switches, VPNs, VPC)

    * include storage (e.g., NetApp)

    * include database

    * include cron jobs

    * include CD pipeline systems/applications (e.g., jenkins, chef, build / test farm)
    * majority of monitoring from internal systems

    * also monitor from external systems (e.g., Nimsoft/Watchmouse)

    * retrieve external monitoring data into internal collection for correlation
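
The application and business metrics above are tagged [statsd]. As a minimal sketch of what instrumenting them in code could look like, the snippet below builds lines in the StatsD plain-text format (`name:value|type`) and sends them over UDP. The metric names, the `format_metric` and `send_metric` helpers, and the host are illustrative assumptions, not part of the original list; 8125 is the conventional StatsD port.

```python
import socket


def format_metric(name: str, value: float, kind: str) -> str:
    """Build one StatsD line: <name>:<value>|<type>."""
    types = {"counter": "c", "gauge": "g", "timer": "ms"}
    # Render whole numbers without a trailing ".0".
    rendered = str(int(value)) if float(value).is_integer() else str(value)
    return f"{name}:{rendered}|{types[kind]}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric line over UDP (fire-and-forget, no reply expected)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names mirroring the examples in the list above.
queue_length = format_metric("jobs.queue_length", 42, "gauge")
registration = format_metric("registrations", 1, "counter")
post_time = format_metric("jobs.time_to_post", 12.5, "timer")
```

In practice an existing StatsD client library would usually replace these helpers; the point is only that one line of code per event is enough to feed both application and business metrics into the same collection system.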

    **Alert system:**

    * alert off data collected (passive)

    * alert on checks (active)

    * call-out on important alerts

    * email, IRC/chat, SMS, mobile escalation

    * call-out rotation, escalation plans
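
To make the passive alerting and call-out rotation items concrete, here is a rough sketch under stated assumptions: a rule fires when the last few collected samples all exceed a threshold, and the call-out order is derived from a weekly rotation. The `Rule` fields, the sustained-samples condition, and the weekly rotation scheme are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str
    threshold: float
    sustained_points: int  # consecutive samples that must breach (at least 1)


def breached(rule, samples):
    """Passive alert: True when the last N collected samples all exceed the threshold."""
    recent = samples[-rule.sustained_points:]
    return len(recent) == rule.sustained_points and all(
        s > rule.threshold for s in recent
    )


def escalation_chain(rotation, week, levels=2):
    """Who to call out, in order: this week's primary, then the next people in the rotation."""
    start = week % len(rotation)
    count = min(levels, len(rotation))
    return [rotation[(start + i) % len(rotation)] for i in range(count)]
```

Requiring several consecutive breaches is one simple way to keep a single noisy sample from paging someone; the right number depends on the metric and its collection interval.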

    **Dashboards:**

    * Real-time dashboards of all services
    * Real-time dashboard of what is being viewed on the site, where traffic is coming from

    * Dashboards to include event / deploy lines

    * Anyone can create/share dashboards

    * No passwords to access dashboards

    * Key dashboards visible in the office on screen
    * Dashboard of environments - what's deployed
    * Cost dashboard (IaaS, SaaS)

    **Correlation / Investigation**


    * Graphing system which allows ad-hoc metric correlation (e.g., Graphite)

    * Centralized logging with search (e.g., Logstash, Graylog)
    * Record of everything that has changed, by whom, when, and what the change was
    * Access to all relevant systems
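
The change record described above (what changed, by whom, when) is most useful when it can be queried during an investigation. The following is a minimal sketch, assuming an in-memory list of records; the `Change` fields and the `changes_before` helper are hypothetical names, and a real system would more likely query a database or the configuration management history.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    when: datetime
    who: str
    what: str    # system or component touched
    detail: str  # what the change was


def changes_before(log, moment, window=timedelta(hours=1)):
    """Return changes made in the window leading up to `moment`, newest first."""
    hits = [c for c in log if moment - window <= c.when <= moment]
    return sorted(hits, key=lambda c: c.when, reverse=True)
```

Calling `changes_before(log, incident_start)` then answers the first question of most investigations: what changed shortly before things went wrong.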

    **Infrastructure as Code**

    * Infrastructure DB with API (Chef server)

    * All infra changes tracked, done via configuration management
    **Security**


    * Automated view of what needs to be patched/updated
    * Regular vulnerability scans with recorded history
    * SSH keys as the only authentication method
    * segregated environments (dev, test, prod)
    * data anonymisation for performance testing

    **Performance testing**

    * Prod-like environment to test in
    * Good performance test, with assumptions and approximations documented
    * Record of all previous test results
    * Automated running of tests
    * Automated comparison of test results with previous tests
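
Automated comparison against previous results can be as simple as flagging metrics that got worse by more than a tolerance. The sketch below assumes each run is a dict of metric name to value where higher means worse (for example latency in milliseconds); the `compare_runs` name, the metric names, and the 10% default tolerance are illustrative assumptions.

```python
def compare_runs(baseline, current, tolerance=0.10):
    """Return metrics whose value grew by more than `tolerance` relative to baseline.

    Assumes higher values are worse (e.g. latency). Metrics missing from
    `current` or with a non-positive baseline are skipped.
    """
    regressions = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is None or old <= 0:
            continue
        change = (new - old) / old
        if change > tolerance:
            regressions[name] = round(change, 3)
    return regressions
```

For example, a baseline of `{"p95_ms": 200}` against a current run of `{"p95_ms": 250}` reports a 25% regression on `p95_ms`.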

    **Communications**

    * whole company using the same instant messaging / chat system
    * task/kanban system for giving work to systems engineers / infrastructure developers
    * ops twitter
    * ops status (e.g., etsystatus.com; stashboard; amazon status)

    **Deployment**

    * single-click deploy
    * rollback-able
    * performed by developer
    * dashboard/KPI used to validate release
    * zero-downtime
    * dark-launches
    * feature flags can be turned on/off via web UI
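
Feature flags and dark launches can share one mechanism: code ships disabled and is then enabled for a percentage of users. Below is a minimal in-memory sketch under that assumption; the `FeatureFlags` class, the percentage rollout, and the hashing scheme are hypothetical, and the web UI mentioned above would simply call `set`.

```python
import hashlib


class FeatureFlags:
    def __init__(self):
        self._rollout = {}  # flag name -> percent of users enabled (0-100)

    def set(self, name, percent):
        """Turn a flag on (100), off (0), or partially on for a dark launch."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[name] = percent

    def is_enabled(self, name, user_id):
        """Unknown flags are off; each user lands in a stable bucket from 0 to 99."""
        percent = self._rollout.get(name, 0)
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < percent
```

Hashing the flag name together with the user ID keeps a given user's experience stable between requests while avoiding the same users always being first into every rollout.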

    **Standards**


    * Published standards of web systems requirements

    **Process**


    * Light-weight post-mortem process, blame-free
    * Daily operations review
    * Monthly/quarterly architecture summit
    * Daily stand-ups
    * Iteration planning/review
    * Regular capacity planning / cost optimization

    **Meta-metrics**

    * MTTD (mean time to detect)
    * MTTR (mean time to recover)
    * Availability
    * Service degradation (Slow versus broken; features disabled to protect site)
    * CD Pipeline Availability
    * Release tracking (type, success/failure, success rate, length of incident)
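
As a rough sketch of how the first three meta-metrics could be computed from incident records, the snippet below assumes each incident has a start, detection, and resolution timestamp. The `Incident` fields are hypothetical, and definitions vary between teams: here MTTR is measured from incident start to resolution, and availability treats the whole incident as downtime.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime


def mean_delta(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time from incident start to detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents):
    """Mean time from incident start to resolution (one of several common definitions)."""
    return mean_delta([i.resolved - i.started for i in incidents])


def availability(incidents, period):
    """Fraction of `period` not spent in an incident, assuming incidents do not overlap."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return 1 - downtime / period
```

Tracking these over time, rather than as one-off numbers, is what shows whether monitoring and incident handling are actually improving.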