bhenerey · May 23, 2012 19:40
diff --git a/ideal ops.md b/ideal ops.md
@@ -73,8 +73,8 @@ In a perfect world, where things are done well, not just quickly, I would expect
 **Infrastructure as Code**
 
    * Infrastructure DB with API (Chef server)
-
    * All infra changes tracked, done via configuration management
+
 **Security**
 
 

diff --git a/ideal ops → ideal ops.md b/ideal ops → ideal ops.md
diff --git a/ideal ops b/ideal ops
@@ -0,0 +1,134 @@
+In a perfect world, where things are done well, not just quickly, I would expect to find the following when joining the company:
+
+**Documentation**
+
+   * Accurate / up-to-date systems architecture diagram
+   * Accurate / up-to-date network diagram
+
+   * Out-of-hours support plan
+
+   * Incident management plan
+
+   * Change management plan 
+   * Application documentation
+
+**Metric collection:**
+
+   * comprehensive system metrics (e.g., CPU, load, memory, disk, network)
+
+   * application metrics instrumented in code (e.g., queue length, time to post new job) [statsd]
+
+   * business metrics instrumented in code as well (e.g., registrations) [statsd]
+   * include network devices (e.g., firewalls, load balancers, switches, VPNs, VPC)
+
+   * include storage (e.g., NetApp)
+
+   * include database
+
+   * include cron jobs
+
+   * include CD pipeline systems/applications (e.g., Jenkins, Chef, build / test farm)
+   * majority of monitoring from internal systems
+
+   * also monitor from external systems (e.g., Nimsoft/Watchmouse)
+
+   * retrieve external monitoring data into internal collection for correlation
+
+**Alert system:**
+
+   * alert off data collected (passive)
+
+   * alert on checks (active)
+
+   * call-out on important alerts
+
+   * email, IRC/chat, SMS, mobile escalation
+
+   * call-out rotation, escalation plans
+
+**Dashboards:**
+
+   * Real-time dashboards of all services
+   * Real-time dashboard of what is being viewed on the site, where traffic is coming from
+
+   * Dashboards to include event / deploy lines
+
+   * Anyone can create/share dashboards
+
+   * No passwords to access dashboards
+
+   * Key dashboards visible in the office on screen
+   * Dashboard of environments - what's deployed
+   * Cost dashboard (IaaS, SaaS)
+
+**Correlation / Investigation**
+
+
+   * Graphing system which allows ad-hoc metric correlation (e.g., Graphite)
+
+   * Centralized logging with search (e.g., Logstash, Graylog)
+   * Record of everything that has changed, by whom, when, and what the change was
+   * Access to all relevant systems 
+
+**Infrastructure as Code**
+
+   * Infrastructure DB with API (Chef server)
+
+   * All infra changes tracked, done via configuration management
+**Security**
+
+
+   * Automated view of what needs to be patched/updated
+   * Regular vulnerability scans with recorded history
+   * SSH keys as the only authentication method
+   * segregated environments (dev, test, prod)
+   * data anonymisation for performance testing
+
+**Performance testing**
+
+   * Prod-like environment to test in
+   * Good performance test, with assumptions and approximations documented
+   * Record of all previous test results
+   * Automated running of test
+   * Automated comparison of test results with previous tests
+
+**Communications**
+
+   * whole company using the same instant messaging / chat system
+   * task/kanban system for giving work to systems engineers / infrastructure developers
+   * ops Twitter account
+   * ops status page (e.g., etsystatus.com, Stashboard, Amazon status)
+
+**Deployment**
+
+   * single-click deploy
+   * rollback-able
+   * performed by developer
+   * dashboard/KPI used to validate release
+   * zero-downtime
+   * dark launches
+   * feature flags can be turned on/off via web UI
+
+**Standards**
+
+
+   * Published standards of web systems requirements
+
+**Process**
+
+
+   * Light-weight post-mortem process, blame-free
+   * Daily operations review
+   * Monthly/quarterly architecture summit
+   * Daily stand-ups
+   * Iteration planning/review
+   * Regular capacity planning / cost optimization
+
+**Meta-metrics**
+
+   * MTTD (mean time to detect)
+   * MTTR (mean time to recover)
+   * Availability
+   * Service degradation (Slow versus broken; features disabled to protect site)
+   * CD Pipeline Availability
+   * Release tracking (type, success/failure, success rate, length of incident)
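
The first three meta-metrics above can be derived from a plain list of incident records. The following is a minimal sketch, not part of the original checklist: the `Incident` record, its field names, and the sample data are hypothetical, MTTR is assumed to be measured from the start of the fault (some teams measure it from detection instead), and availability assumes incidents do not overlap.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; adapt to whatever the incident tracker stores.
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def mttd(incidents):
    """Mean time to detect, in seconds (fault start to detection)."""
    return mean((i.detected - i.started).total_seconds() for i in incidents)


def mttr(incidents):
    """Mean time to recover, in seconds (fault start to resolution)."""
    return mean((i.resolved - i.started).total_seconds() for i in incidents)


def availability(incidents, period: timedelta):
    """Fraction of the period with no open incident (assumes no overlaps)."""
    downtime = sum((i.resolved - i.started).total_seconds() for i in incidents)
    return 1 - downtime / period.total_seconds()


# Made-up sample data for a 30-day reporting period.
incidents = [
    Incident(datetime(2012, 5, 1, 10, 0),
             datetime(2012, 5, 1, 10, 5),
             datetime(2012, 5, 1, 10, 35)),
    Incident(datetime(2012, 5, 10, 2, 0),
             datetime(2012, 5, 10, 2, 15),
             datetime(2012, 5, 10, 3, 25)),
]

print(f"MTTD: {mttd(incidents) / 60:.0f} min")
print(f"MTTR: {mttr(incidents) / 60:.0f} min")
print(f"Availability: {availability(incidents, timedelta(days=30)):.4%}")
```

Tracking these numbers per reporting period, rather than as a single running total, makes it easier to see whether detection and recovery are improving over time.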