# From SocialCalc to EtherCalc
Previously, in [The Architecture of Open Source Applications](http://www.aosabook.org/en/socialcalc.html), I described SocialCalc, an in-browser spreadsheet system that replaced the server-centric WikiCalc architecture. SocialCalc performs all of its computations in the browser; it uses the server only for loading and saving spreadsheets.
For the Socialtext team, performance was the primary goal behind SocialCalc's design in 2006. The key observation was this: Client-side computation in JavaScript, while an order of magnitude slower than server-side computation in Perl, was still much faster than the network latency incurred during AJAX roundtrips:
*(Figure: client-side computation vs. AJAX roundtrip latency)*
Toward the end of the AOSA chapter, we introduced simultaneous collaboration on spreadsheets, using a simple, chatroom-like architecture:
*(Figure: chatroom-like collaboration architecture)*
However, as we began to test it for production deployment, the performance and scalability characteristics fell short of practical use, motivating a series of system-wide rewrites in order to reach acceptable performance.
This chapter will discuss the revamped architecture we built for [EtherCalc](http://ethercalc.net/), a successor to SocialCalc optimized for simultaneous editing. We'll see how we arrived at that architecture, how we used profiling tools, and how we built new tools to overcome performance problems.
# Design Constraints
The Socialtext platform has both behind-the-firewall and in-the-cloud deployment options, imposing unique constraints on EtherCalc's resource and performance requirements. At the time of this writing, Socialtext requires 2 CPU cores and 4GB RAM for vSphere-based intranet hosting; a typical dedicated EC2 instance provides about twice that capacity, with 4 cores and 7.5GB of RAM.
Deploying on intranets means that we can't simply throw hardware at the problem in the same way multi-tenant, hosted-only systems did (e.g., DocVerse, which later became part of Google Docs); we can assume only a modest amount of server capacity.
Compared to intranet deployments, cloud-hosted instances offer better capacity and on-demand extension, but network connections from browsers are usually slower and fraught with frequent disconnections and reconnections.
To recap, these resource constraints shaped the direction of EtherCalc's architecture:
* **Memory**: An event-based server allows us to scale to thousands of concurrent connections with a small amount of RAM.
* **CPU**: Based on SocialCalc's original design, we offload most computations and all content rendering to client-side JavaScript.
* **Network**: By sending spreadsheet operations, instead of spreadsheet content, we reduce bandwidth use and allow recovery over unreliable network connections.
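The network constraint above can be made concrete with a small sketch. The command shape below mirrors SocialCalc's `set` syntax, but the `applyOp` helper, the `ops` log, and the cell map are illustrative, not EtherCalc's actual code:

```javascript
// Illustrative sketch: clients send compact command strings, so each edit
// costs tens of bytes, while resending the whole sheet would cost far more.
const ops = [];    // per-session operation log
const sheet = {};  // hypothetical cell map: ref -> value

function applyOp(op) {
  // "set <ref> text t <value>" follows the shape of a SocialCalc set command
  const m = /^set (\S+) text t (.*)$/.exec(op);
  if (m) sheet[m[1]] = m[2];
  ops.push(op);    // the log lets peers and latecomers catch up later
}

applyOp('set A1 text t Hello');
applyOp('set B2 text t World');
```

A full-sheet sync would retransmit every cell on each change; the operation log transmits only deltas, and a reconnecting client can request just the operations it missed.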
# Initial Prototype
We started with a WebSocket server implemented in Perl 5, backed by [Feersum](https://metacpan.org/release/Feersum), a [libev](http://software.schmorp.de/pkg/libev.html)-based non-blocking web server developed at Socialtext. Feersum is very fast, capable of handling over 10k requests per second on a single CPU.
On top of Feersum, we use the [PocketIO](https://metacpan.org/release/PocketIO) middleware to leverage the popular Socket.io JavaScript client, which provides backward compatibility for legacy browsers without WebSocket support.
The initial prototype closely resembles a chat server. Each collaborative session is a chatroom; clients send their locally executed commands and cursor movements to the server, which relays them to all other clients in the same room.
A typical flow of operation looks like this:
*(Figure: typical flow of operation)*
Each command is logged on the server with a timestamp. If a client drops and reconnects, it can resume by asking for a log of all commands entered since it disconnected, then replaying those commands locally to reach the same state as its peers.
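The relay-and-backlog design can be sketched as follows. The `Room` class and its method names are hypothetical stand-ins, not the prototype's Perl API, and a logical clock stands in for the wall-clock timestamp:

```javascript
// Minimal sketch of a chatroom-style relay with a replayable command log.
class Room {
  constructor() {
    this.log = [];            // every command ever executed in this room
    this.clients = new Set();
    this.clock = 0;           // logical timestamp; real code used wall time
  }
  join(client, lastSeen = 0) {
    this.clients.add(client);
    // reconnecting clients replay everything logged after their last-seen time
    return this.log.filter(entry => entry.time > lastSeen);
  }
  execute(sender, cmd) {
    const entry = { cmd, time: ++this.clock };
    this.log.push(entry);
    for (const c of this.clients)   // relay to all *other* clients in the room
      if (c !== sender) c.receive(entry);
  }
}
```

A client that was last in sync at timestamp `t` simply calls `join(self, t)` and replays the returned commands locally.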
As we described in the AOSA chapter, this simple design minimizes server-side CPU and RAM requirements and demonstrates reasonable resiliency against network failures.
# First Bottleneck
When the prototype was put to field testing in June 2011, we quickly discovered a performance problem with long-running sessions.
Spreadsheets are long-lived documents, and a collaborative session can accumulate thousands of modifications over weeks of editing.
Under the naive backlog model, when a client joins such an edit session, it must replay thousands of commands, incurring a significant startup delay before it can make any modifications.
To mitigate this issue, we implemented a snapshot mechanism. After every 100 commands sent to a room, the server polls the state from each active client and saves the latest snapshot it receives alongside the backlog. A freshly joined client receives the snapshot along with any commands entered after the snapshot was taken, so it needs to replay at most 99 commands.
*(Figure: snapshot mechanism)*
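The snapshot workaround can be sketched like this. `SNAPSHOT_INTERVAL`, `Session`, and `pollClient` are illustrative names, not EtherCalc's implementation:

```javascript
// Sketch of the snapshot-every-100-commands workaround.
const SNAPSHOT_INTERVAL = 100;

class Session {
  constructor() {
    this.log = [];
    this.snapshot = null;
    this.snapshotAt = 0;  // log position when the snapshot was taken
  }
  execute(cmd, pollClient) {
    this.log.push(cmd);
    if (this.log.length % SNAPSHOT_INTERVAL === 0) {
      // ask an active client for its full state; remember the log offset
      this.snapshot = pollClient();
      this.snapshotAt = this.log.length;
    }
  }
  joinState() {
    // a newcomer gets the snapshot plus at most SNAPSHOT_INTERVAL - 1
    // trailing commands to replay
    return { snapshot: this.snapshot, tail: this.log.slice(this.snapshotAt) };
  }
}
```

Note that `pollClient` relies on whatever state a client reports, which is exactly the trust problem discussed next.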
This workaround solved the CPU issue for new clients, but created a network performance problem of its own, as it taxes each client's upload bandwidth. Over a slow connection, this delays the reception of subsequent commands from the client.
Moreover, the server has no way to validate the consistency of snapshots submitted by clients. Therefore, an erroneous -- or malicious -- snapshot can corrupt the state for all newcomers, placing them out of sync with existing peers.
An astute reader may note that both problems are caused by the server's inability to execute spreadsheet commands. If the server could update its own state as it received each command, it would not need to maintain a command backlog at all.
The in-browser SocialCalc engine is written in JavaScript. We considered translating that logic into Perl, but that carries the steep cost of maintaining two codebases going forward. We also experimented with embedded JS engines ([V8](https://metacpan.org/release/JavaScript-V8), [SpiderMonkey](https://metacpan.org/release/JavaScript-SpiderMonkey)), but they imposed their own performance penalties when running inside Feersum's event loop.
Finally, by August 2011, we resolved to rewrite the server in Node.js.
# Porting to Node.js
The initial rewrite went smoothly because both Feersum and Node.js are based on the same libev event model, and PocketIO's API closely matches Socket.io's.
It took us only an afternoon to code up a functionally equivalent server in just 80 lines of code, thanks to the concise API offered by [ZappaJS](http://zappajs.org/).
Initial [micro-benchmarking](http://c9s.github.com/BenchmarkTest/) showed that porting to Node.js cost us about half of the maximum throughput: on a typical Core i5 CPU in 2011, the original Feersum+Tatsumaki stack handled 5k requests per second, while Node.js+Express maxed out at 2.8k requests.
This performance degradation was acceptable for our first JavaScript port, as it would not significantly increase latency for users, and we expected it to improve over time.
Subsequently, we continued the work to reduce client-side CPU use and minimize bandwidth use by tracking each session's ongoing state with server-side SocialCalc spreadsheets:
*(Figure: tracking session state with server-side SocialCalc spreadsheets)*
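In outline, server-side state tracking replaces the backlog entirely: the server applies each command to its own copy of the spreadsheet and hands newcomers an authoritative snapshot. Here `applyToSheet` is a stand-in for the real SocialCalc engine, and `makeSession` is an illustrative name:

```javascript
// Sketch: the server maintains authoritative state, so no command log
// and no client-supplied snapshots are needed.
function makeSession(applyToSheet) {
  const sheet = {};  // server-side spreadsheet state
  return {
    execute(cmd) {
      applyToSheet(sheet, cmd);  // update our own copy as commands arrive
    },
    joinState() {
      // newcomers get a consistent, server-computed snapshot
      return JSON.parse(JSON.stringify(sheet));
    },
  };
}
```

Because the snapshot is computed by the server itself, a malicious or buggy client can no longer corrupt the state handed to newcomers.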
# Server-side SocialCalc
The key enabling technology for our work is [jsdom](https://github.com/tmpvar/jsdom), a full implementation of the W3C document object model, which enables Node.js to load client-side JavaScript libraries within a simulated browser environment.
Using jsdom, it's trivial to create any number of server-side SocialCalc spreadsheets, each running in its own sandbox and taking about 30 KB of RAM:
    require! <[ vm jsdom ]>
    create-spreadsheet = ->
      document = jsdom.jsdom \