Who Needs Process?

Software development methodology is organizational Valtrex. Sure, it treats a symptom, but the only cure for the underlying disease is to never have contracted it in the first place. This is not to say that process and methodology are bad. They are means to an end. But the ability of your team to execute on a goal is inversely proportional to the amount of process you have in place. It's not a direct correlation, though. The underlying cause is that the variance of developer skill on your team is too high, which means your team can't execute well, and you need process to wrangle the laggards.

Software development processes exist to manage the bell curve of ability in developers. It's simple mathematics. The more people you collect on a team, the more likely it is that the team's average skill is the average skill of software developers as a whole. This is the Strong Law of Large Numbers, and it is non-negotiable. There's not a meeting you can hold to make it go away. Most organizations simply accept their fate, and design policies and procedures to keep the back half of the bell curve from causing damage. Unfortunately, policies and procedures irritate top performers, and are just more grievances on their list that will some day metastasize into a resignation letter. (Side note: if someone keeps such a list, there's a high probability they are a top performer.)
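If you want to watch the averaging happen, here is a minimal sketch. It assumes, purely for illustration, that developer skill is a single number drawn from a normal distribution (mean 100, standard deviation 15, both made up) and that teams are hired at random from that pool. The function name is hypothetical.

```python
import random
import statistics


def team_average_spread(team_size, trials=2000, seed=0):
    """Standard deviation of the team-average skill over many random teams."""
    rng = random.Random(seed)
    averages = [
        # One team: draw team_size skills and take their mean.
        statistics.fmean(rng.gauss(100, 15) for _ in range(team_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(averages)


# The spread of the team average shrinks roughly like 15 / sqrt(team_size),
# so bigger teams land closer and closer to the industry average.
for size in (3, 10, 50):
    print(size, round(team_average_spread(size), 2))
```

Under these assumptions a three-person team can sit well above (or below) the population mean, while a fifty-person team almost never does.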

So, how do you produce good software?

Rule 1: Resist Process from the Start

Anti-process needs to be in your blood from the beginning. Growing the team causes problems, but adding people who like to make process is catastrophic. Busybodies are toxic. When we were making Milo, the development process was, for a long time, loosely coordinated chaos. Of course this caused problems, and there are a few bodies in the codebase to prove it. When there was some big fuckup, a bunch of us would sit in a room (mistake 1), try to "trace the root" of the problem, which is code for "place blame" (mistake 2), and then figure out what process we need to put in place to make sure it never happens again (mistake 3).

Every time we tried to force process on the team, it failed, because process was not part of the corporate culture. Here's a concrete example:

Around the time we were raising our Series A financing, I was doing some maintenance work on our one and only database machine, Zeus. Because I had overconfigured Nagios, it was going apeshit as I was working, so I disabled all alerts for that machine. Sure enough, I forgot to re-enable alerts when I was done, and sure enough, that night, Zeus's RAID controller decided he'd had enough of our bullshit and up and died.

The site was down for 6 hours and nobody noticed because we were all asleep. Do all the root cause analysis you want on that one, I fucked up, and everyone knew it. We tried policy: don't ever disable Nagios alerts, just tell people when you're working on something so they know to ignore the alerts. But nobody followed the policy because homie don't play dat.

It turned out that nobody ever made that mistake ever again. Maybe it was just by chance, but maybe it's because of the Old Testament type stomping that the next person to do that would get.

Learn from your mistakes, but don't flagellate yourself with them.

Rule 2: Grow Headcount as a Last Resort

Headcount is not a virtue. Functionality is a virtue. Ability to execute is a virtue. But having a lot of people working on the project? No. It's a liability.

The Strong Law of Large Numbers burns you when the team grows, so be reluctant to grow the team. When you think there is too much work for your current team to handle, do what any good programmer does: profile. Quantify the amount of time your team spends on every task, then figure out what you can optimize.

When you do hire, do it carefully. Really carefully. ... More careful than that. At Milo, we do "trial periods", where we invite a candidate to work with us for a few days so we can judge their work. Here's a really simple trial period task: make this thing better. Use your judgment and your programming skills, just make it better. This will keep a common vision in your team.

Rule 3: Use The GitHub Workflow (And Other Good Stuff)

For brevity:

[Diagram: the GitHub workflow]

versus:

[Diagram: a far more complicated branch model]

Seriously, what the fuck is going on in that branch model? Policy and procedure, that's what. We used the latter branch model at Milo, as GitHub was not yet ready enough to support their awesome workflow. That complicated but popular workflow is vicious: there is just too much state to track in your head as a developer. It's not something you need to spend your time on.

It's not specific to the GitHub workflow: your tooling should have the same attitude toward policy that your team does. GitHub's (the software, not the company) opinion is that process is unnecessary, and having the tools to support process will only beget process.

If you leave enough handguns hanging around, eventually someone's going to get shot.

If a tool is forcing process on you, or even encourages you to follow some process, ditch the tool and find another. Ticket tracker has a lot of fields on the ticket form? Dump it. Code review tool has a lot of shit going on? Find a new one. You get the idea. Don't let your tools infect you.

Rule 4: Just Let Go

This is the hardest one, and, if you're serious about doing away with process, is the one from which all others follow. Just stop having process. Cancel your weekly planning meetings. Get rid of your prioritized list of tasks. Stop having your daily stand-ups. No more status report e-mails.

Just stop doing that stuff, and get rid of anyone who can't cope.

Nothing terrible will happen.

My current project, which will be launching in the first half of 2012, is as close to anarchy as a project can get. And it's working. I trust my teammates to get their work done and do it well. Sure enough, they do.

Dirty Words

Anyone who ever told you that swear words have no place in technical discussion is right. They're right, and sadly, they're part of the problem because they miss the point. The sterile word placement that's supposed to support an argument makes any true motivation indistinguishable from all the hired bullshit.

Objective technical discussion is a God damned lie, and it's the most rotten kind of lie because it's a way to stick your nose in the air, disguised as altruism. Every time you post benchmarks, you're not moving any discussion forward. When you compare web frameworks on one useless dimension or another, you don't bring any value to the world. What you've done is relieve me of a task that's kind of a pain in the ass, but by no means insurmountable. Ass scratching is not value. You want to be heralded as a great visionary for your work, and you think that getting on the front page of Hacker News or Reddit means that people respect your opinion.

No. Your link is just space between the ads, and fuck-all if mine aren't too.

The skill it takes to write objectively about technology can be automated, and publishing it yourself is disingenuous because it lacks passion.

However, when someone starts swearing in technical discussion, showing emotion, that's a strong indicator that I'm about to receive wisdom. Wisdom is earned the hard way, and it is permanent, not like some statistically shaky performance benchmark that we'll all forget about next week.

Anyone who has ever told you that swear words are a cheap way to get an audience is right, too. I've been on both the amateur and professional side of technical journalism, and I'll tell you this: every way to get an audience is cheap. Let's take Paul Graham, for example. Any emotion you detect in his essays is purely by accident, but he conveys a message, and has a following. He would not have that following if he were some guy off the street, so rattling off any damn thing and putting a Paul Graham byline on it is a cheap way to get an audience.

But admit it, somewhere in the back of your gut is a rebellious nerve that wonders what happens when Paul gets pissed off.

People like me, Zed Shaw, and Zach Holman will give you a brutally honest answer if you ask for it. People like Paul won't. You will get a response, but it's in newspaper words. The same newspaper words, that, by the way, with their self-imposed emotional blockade, allow the nicest haircut to slither into the White House every couple of years.

That's not to say that Paul is so proper in private. I don't know first hand, but I have met other leaders in technology who are more than willing to give you an earful over cocktails. In public, however, they cultivate a persona.

Just like I do.

Straight Talk on Event Loops

Two days ago, I pointed out how Node.js, an event-driven web framework, will eat it hard if it's given any nontrivial amount of CPU work to do in its request handler. After I published that, it seemed that the point of the article went sailing right past the Node.js camp, who proceeded to see how fast they can make a Fibonacci number generator.

The Fibonacci function was arbitrary. It was inefficient on purpose. I needed a function that would use CPU time, and chose that because it's familiar and easy to implement. So, now I offer a more formal analysis of what CPU usage does to the throughput of an event-driven system as compared to a threaded system.

Since it's now clear that reading comprehension and critical thinking are not strong suits of the Node.js programmer, I would suggest that all Noders reading this article read it aloud, slowly and loudly, like an American tourist trying to find a train station in Tokyo. Furthermore, to assist the Node camp, I will highlight the important parts in large lettering, like this:

When the weather is threatening rain, bring an umbrella with you.

A Math Model of Throughput

Assume we've got a request handler that processes an HTTP request and sends back the result. Let's see how threads and event loops differ on processing these requests. Note that we're measuring throughput here, not latency. That's an article for another day.

This is an analysis of Queries Per Second (QPS) only.

Let's start with some definitions:

  • Let C be the amount of CPU time used by the handler, in milliseconds.
  • Let I be the amount of I/O time used by the handler, in milliseconds.
  • Let W be the wall clock time it takes for a handler to execute. By definition, W = I + C
  • Let N be the number of threads running in the threaded system.
  • Let E be the throughput of the event driven system.
  • Let T be the throughput of the threaded system.

Given that the times are measured in milliseconds, we can define $E = \frac{1000}{C}$ and $T = \frac{1000N}{W}$.

Since the wall time W is expressible in terms of CPU time C and I/O time I, and considering that both C and I are positive and nonzero, it is helpful to define $I = kC$, with the factor k expressing the relationship between C and I.

It follows then that $W = (1 + k)C$ and $T = \frac{1000N}{(1 + k)C}$.
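The definitions above can be sketched in a few lines of Python. This is only a sketch: it assumes the throughputs are E = 1000/C and T = 1000N/W (which matches the arithmetic in the practical example later on), and the function names are mine, not part of any library.

```python
def event_throughput(c_ms):
    """E: a single event loop completes one request per c_ms of CPU time."""
    return 1000 / c_ms


def threaded_throughput(c_ms, i_ms, n_threads):
    """T: each of n_threads completes one request per W = I + C ms of wall time."""
    return 1000 * n_threads / (i_ms + c_ms)


def io_ratio(c_ms, i_ms):
    """k, the factor such that I = k * C."""
    return i_ms / c_ms


# A CPU-heavy handler: 20 ms of CPU, 5 ms of I/O, one thread.
print(event_throughput(20))           # 50.0
print(threaded_throughput(20, 5, 1))  # 40.0
print(io_ratio(20, 5))                # 0.25
```

Note that the event loop's throughput ignores I entirely. That is the whole premise of the model: the loop never waits on I/O, so only CPU time limits it.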

THEOREM 1. When a handler takes more CPU time than I/O time, an event-driven system has greater throughput than a threaded system if and only if the threaded system has exactly one thread.

PROOF (partial). (note: for brevity, I will only prove one direction. The other direction is an exercise left for the reader.)

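Suppose

$$\frac{1000}{C} > \frac{1000N}{(1 + k)C}$$

Simplifying the inequality,

$$1 > \frac{N}{1 + k}$$

Given that $0 < k < 1$, we can bound the inner term

$$1 < 1 + k < 2$$

Further simplifying

$$N < 1 + k < 2$$

Since N is integral and nonzero, it follows that $N = 1$.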

If you do more CPU than I/O, use threads.
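For the skeptics, Theorem 1 is easy to brute-force. The sketch below assumes E = 1000/C and T = 1000N/(I + C), uses exact fractions to dodge floating-point ties, and checks a small grid of integer millisecond values that I picked arbitrarily. The helper name is hypothetical.

```python
from fractions import Fraction


def event_wins(c, i, n):
    """True when E = 1000/C strictly exceeds T = 1000*N/(I + C)."""
    return Fraction(1000, c) > Fraction(1000 * n, i + c)


# CPU-heavy handlers only: every integer pair with 0 < I < C <= 40 ms,
# against 1 to 8 threads. A violation is any case where "events win"
# disagrees with "there is exactly one thread".
violations = [
    (c, i, n)
    for c in range(2, 41)
    for i in range(1, c)
    for n in range(1, 9)
    if event_wins(c, i, n) != (n == 1)
]
print(violations)  # []
```

A grid search is not a proof, but it does cover the direction left as an exercise.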

THEOREM 2. When the handler takes more I/O time than CPU time, an event-driven system has greater throughput than a threaded system if and only if $N < \frac{W}{C}$.

PROOF (partial). (note: again for brevity, I will prove one direction.)

Given our previous construction,

$$1 > \frac{N}{1 + k}$$

and the alternate expression

$$1 + k = \frac{W}{C}$$

it follows that $N < \frac{W}{C}$.

If you do more I/O than CPU, use more threads.
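How many is "more"? Reading the model as E = 1000/C and T = 1000N/W, threads pull ahead as soon as N exceeds W/C. Here is a small sketch of that break-even point. It assumes whole-millisecond inputs so the integer division stays exact, and the function name is my own.

```python
def min_threads_to_beat_events(c_ms, i_ms):
    """Smallest integer N with T > E, i.e. the smallest N above (I + C) / C.

    Expects positive integer millisecond values.
    """
    return (c_ms + i_ms) // c_ms + 1


# 10 ms of CPU and 50 ms of I/O: W / C = 6, so 6 threads only tie
# the event loop and the 7th thread puts the threaded system ahead.
print(min_threads_to_beat_events(10, 50))  # 7
```

This ignores CPU saturation entirely. It only tells you where the two throughput formulas cross.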

A Practical Example

Let's suppose you have a request handler that does 10 milliseconds of CPU work and 50 milliseconds of database I/O. Would you choose threads or events?

In this case, the theoretical maximum throughput of the event driven system is 1000/10 = 100 QPS, whereas a threaded system with 50 threads has a theoretical maximum throughput of 50,000/60 = 833.33 QPS. Of course, in the threaded case, you need to worry about being bound by the CPU, but given the number of cores on modern hardware, threads seem like a winner here.
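To put a number on the CPU worry, here is the same arithmetic as a short script, plus one extra line. The core estimate is my own back-of-the-envelope addition: it assumes each request burns exactly C milliseconds of CPU and nothing else competes for the machine.

```python
C_MS, I_MS, THREADS = 10, 50, 50

# One event loop: a request finishes every C ms of CPU.
event_qps = 1000 / C_MS

# 50 threads: each finishes a request every W = I + C ms.
threaded_qps = 1000 * THREADS / (C_MS + I_MS)

# CPU-seconds demanded per wall-clock second at that rate,
# which is the number of cores you need to keep busy.
cores_needed = threaded_qps * C_MS / 1000

print(f"{event_qps:.2f} {threaded_qps:.2f} {cores_needed:.2f}")
# 100.00 833.33 8.33
```

So the 833 QPS figure only holds if roughly eight or nine cores are available. With fewer, the threaded system becomes CPU bound before it gets there.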

Multiple Event Workers

The Noders got really into this one: forking "workers" from your event loop to do the heavy CPU work, and having them call back to the event loop when they're done. One parent process coordinates work among many children? Where have I heard that before?

Anyhow, let's extend the model to that case. Just for funsies.

Since your asynchronous processes do not block on I/O, at full utilization, they will theoretically take 100% of the CPU. Therefore, the number of worker processes to spawn must be equal to the number of CPUs in the system to avoid oversubscribing the machine. Let's introduce a new variable, M, to represent the number of CPUs in the computer.

The throughput formula for the event driven system therefore becomes

$$E = \frac{1000M}{C}$$

Now, with threads, we also need to avoid oversubscribing the CPU. Considering that during a single handler execution, only C milliseconds of CPU are used, it follows that the number of threads that will achieve theoretical maximum utilization is $N = \frac{MW}{C}$.

Our formula for the threaded system's throughput is therefore

$$T = \frac{1000N}{W} = \frac{1000}{W} \cdot \frac{MW}{C}$$

...but look at this:

$$T = \frac{1000M}{C} = E$$

At full utilization, threads and events have the same theoretical throughput.

This makes intuitive sense, as if the CPUs are working as hard as they can, all else equal, they should yield the same performance regardless of the framework used.
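The equality is easy to check numerically. This sketch assumes the extended model reads E = 1000M/C for M event workers and T = 1000N/W with N = MW/C threads, which is my reading of the definitions above. The function name is hypothetical, and exact fractions keep the comparison honest.

```python
from fractions import Fraction


def full_utilization_throughputs(c_ms, i_ms, cores):
    """Return (E, T) when both systems keep all cores busy."""
    w = c_ms + i_ms
    events = Fraction(1000 * cores, c_ms)   # M workers, each doing 1000 / C
    n_threads = Fraction(cores * w, c_ms)   # N = M * W / C
    threads = 1000 * n_threads / w          # T = 1000 * N / W
    return events, threads


# 10 ms CPU, 50 ms I/O, 8 cores: 48 threads match 8 event workers.
events, threads = full_utilization_throughputs(10, 50, 8)
print(events, threads)  # 800 800
```

The W cancels no matter what values you feed in, so the I/O time drops out of the comparison once the CPUs are saturated.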

Hold up, this does not prove that Node is good.

Of course, in a practical setting, threads have a greater memory overhead, and programming an event loop with multiple workers just seems silly, as if you're doing that much CPU work in an event looped system, you've already fucked up somewhere, so why add to it?

Node.js Is Still Cancer

So, let's review.

Suppose you're a less-than-expert programmer, which Node seems to attract in droves for some reason. You are using Node for the supposed "scalability" of it, but as we have just seen, threaded programming, which is easier to understand than callback driven programming, meets or exceeds the asynchronous model in the vast majority of cases. Chances are, you're not going to be forking worker processes to do CPU jobs, what with the less-than-expert and all.

Therefore, the reason you're using Node is not a lack of technical ability, it's because all the cool kids are doing it.

Node.js is a danger to novice programmers.

Next, suppose you're an expert programmer, and you've got some CPU bound work that you fork off to child processes to keep your event loop trucking. OK man, how complicated do you want to make this thing? At full capacity, you're at par with threads, provided it's not memory bound. At this point, you are less focused on solving the problem at hand than you are on coming up with something you can blog about and get on programming Reddit.

If you're forking workers in Node, you've got bigger problems.

Plus, it's fucking JavaScript ... on the server.