April 2008 Archives

Eclipse Crashes in Ubuntu Hardy Heron

I just upgraded my workstation to Hardy and got a SIGSEGV when starting Eclipse.  It appears to be a bug in the Sun JVM that ships with Hardy, and it only happens on AMD64.

If you're set on using this runtime, the fix is to disable the JIT compiler by launching Eclipse with -Xint, but that comes with a severe performance penalty.

The fix I used was to simply downgrade the Hardy JRE (6-06-0ubuntu1) to the Gutsy version (6-03-0ubuntu2).  You'll have to edit /etc/apt/sources.list to add a Gutsy repository.

I'm pretty sure this is the Sun bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6614100

And this is the Ubuntu bug: https://bugs.launchpad.net/ubuntu/+source/eclipse/+bug/174759

I'm Going To Scale My Foot Up Your Ass

Engineers love to talk about scalability.  It makes us feel like the bad ass, dick-swingin' motherfuckers that we wish we could be.

After we talk about scalability with our co-workers (Yeah, Rails doesn't scale!), we flex our true engineering prowess by writing a post about it on our blog.  Once that post hits Reddit, son, everyone will know how hardcore you really are.  Respect.

People Who Talk Big About Scalability Don't Need To Worry About It

Fact:  every chest-thumping blog post I have seen written about scalability is either about architecture, Memcached, or both.  Some asshole who writes shitty code starts pontificating about "scalable architecture" with data storage, web frontends, whatever-the-fuck.  Dude, your app isn't having scalability problems because of the architecture.  It's having scalability problems because you coded a ton of N^2 loops into it and you're too self-important to get peer reviews on your commits.
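For anyone who hasn't met one in the wild, here is a rough sketch of what an N^2 loop looks like next to its linear replacement. The class and method names are made up for illustration; the point is only that the first version compares every pair of items while the second makes one pass with a hash set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical example: checking a list for duplicate entries.
class DuplicateCheck {
    // Quadratic: compares every pair, so 10x the items means ~100x the work.
    static boolean hasDuplicateQuadratic(List<String> items) {
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                if (items.get(i).equals(items.get(j))) {
                    return true;
                }
            }
        }
        return false;
    }

    // Linear on average: one pass, remembering what has been seen so far.
    static boolean hasDuplicateLinear(List<String> items) {
        Set<String> seen = new HashSet<>();
        for (String item : items) {
            if (!seen.add(item)) { // add() returns false if already present
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("ann", "bob", "cat", "bob");
        System.out.println(hasDuplicateQuadratic(users));
        System.out.println(hasDuplicateLinear(users));
    }
}
```

Both methods give the same answer; only the second one keeps giving it in reasonable time once the list gets big.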

And let's not forget the tools who discover Memcached for the first time, install it on a web server, and notice how fast their app runs now.  Yeah, welcome to the modern age.  Hope you know what a cache expiry policy is.

If You Haven't Discussed Capacity Planning, You Can't Discuss Scalability

You don't need to worry about scalability on your Rails-over-MySQL application because nobody is going to use it.  Really.  Believe me.  You're going to get, at most, 1,000 people on your app, and maybe 1% of them will be 7-day active.  Scalability is not your problem; getting people to give a shit is.

Unless you know what you need to scale to, you can't even begin to talk about scalability.  How many users do you want your system to handle? A thousand?  Hundred thousand? Ten million?  Here's a hint: the system you design to handle a quarter million users is going to be different from the system you design to handle ten million users.

Of course you'll point to the engineer's wet dream: linear scalability.  Lulz but when we get more users we just add more machines you are so stupid ted. uncov sucks.

Yeah, great, well it doesn't exist.  Oh no, go ahead and try out Amazon SimpleDB and think to yourself that it will scale linearly.  Then, when you get enough users that the latency becomes a problem, blame it on "those shitty Amazon datacenters".

Choosing Technology Don't Mean Shit If You Don't Know How To Use It

The most common butthurt about scalability is this:  choose a technology.  If you like the technology, claim "technology X scales better!" If you don't like it, claim "technology X doesn't scale!"

Saying "Rails doesn't scale" is like saying "my car doesn't go infinitely fast".  Alternatively, saying "We'll have no problems scaling because we're using Django" is like saying "I will win every race because my car is the most powerful".  Maybe so, but you suck at driving, and you're up against professionals.

If you're having scalability problems and blaming it on a single technology, chances are, you're doing it wrong.

tl;dr

Shut up about scalability, no one is using your app anyway.
Not if you can avoid it, at least.

Hadoop provides you with the Writable interface if you want to write your object to a SequenceFile.  It's up to you to implement the write() and readFields() methods for your object.  It's easy if your object is simple: just write each of your instance variables to a DataOutput and read them back in the same order from a DataInput.
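Here is a minimal sketch of that simple case. So that it compiles without Hadoop on the classpath, it declares a local stand-in interface with the same two methods as Hadoop's Writable; in a real job you would implement org.apache.hadoop.io.Writable instead. The PageView class and its fields are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in mirroring the two methods of org.apache.hadoop.io.Writable.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical record type; the field names are assumptions.
class PageView implements Writable {
    long userId;
    int durationSeconds;
    String url = "";

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(userId);
        out.writeInt(durationSeconds);
        out.writeUTF(url);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Read back in exactly the order write() used.
        userId = in.readLong();
        durationSeconds = in.readInt();
        url = in.readUTF();
    }
}

class WritableRoundTrip {
    // Runs write() against an in-memory buffer and returns the bytes.
    static byte[] toBytes(Writable w) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuilds a PageView from bytes produced by toBytes().
    static PageView fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            PageView view = new PageView();
            view.readFields(in);
            return view;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PageView original = new PageView();
        original.userId = 42L;
        original.durationSeconds = 17;
        original.url = "/index.html";

        byte[] bytes = toBytes(original);
        PageView copy = fromBytes(bytes);
        System.out.println(bytes.length + " bytes, url=" + copy.url);
    }
}
```

The only rule that really matters is the ordering: readFields() has no field names to go by, so it has to consume the stream in the same sequence write() produced it.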

Don't Write Your Object As A Serialized Byte Array

I got lazy when I was implementing the Writable interface with one of our classes because it had a ton of instance variables.  I figured I'd just serialize it to a byte array, then write the array length and the whole array to the DataOutput.  And on the read, well, just unserialize the object from the byte array.   This was my write():

@Override
public void write(DataOutput out) throws IOException {
	// Serialize the contained object into an in-memory buffer.
	ByteArrayOutputStream byteOutStream = new ByteArrayOutputStream();
	ObjectOutputStream objectOut = new ObjectOutputStream(byteOutStream);

	objectOut.writeObject(getContainedObject());
	objectOut.close();

	byte[] serializedObject = byteOutStream.toByteArray();

	// Length prefix first, so the reader knows how many bytes to pull back.
	out.writeInt(serializedObject.length);
	out.write(serializedObject);
}
Naw, dude. Bad idea.

I knew that I'd be paying some overhead in both space and time for this little scheme, but I didn't know how much.  It was just a little bit per object, but when we started seeing MapReductions take way too much time in I/O, it was time to revisit this.

What This Cost In Space And Time

First, the Java serialization space overhead.  On a toy example of this object, serialization to a byte array used 953 bytes.  Properly writing out the instance variables consumed 296 bytes.  In production, doing it the right way shrank a 1,600-record SequenceFile from 1.4GB to 825MB.
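If you want to see the space overhead for yourself, something like the following sketch should do. The Reading class is a made-up toy, not the object measured above, so the byte counts will not match the numbers in this post; it only shows where the gap comes from (Java serialization writes a stream header and a class descriptor alongside the data).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical toy record; fields are assumptions for illustration.
class Reading implements Serializable {
    private static final long serialVersionUID = 1L;
    long sensorId = 7L;
    double value = 3.5;
    String label = "temp";
}

class SizeComparison {
    // Bytes produced when the whole object goes through Java serialization.
    static int javaSerializedSize(Reading r) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    // Bytes produced when each instance variable is written directly.
    static int fieldByFieldSize(Reading r) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeLong(r.sensorId);  // 8 bytes
            out.writeDouble(r.value);   // 8 bytes
            out.writeUTF(r.label);      // 2-byte length + encoded characters
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        Reading r = new Reading();
        System.out.println("Java serialization: " + javaSerializedSize(r) + " bytes");
        System.out.println("Field by field:     " + fieldByFieldSize(r) + " bytes");
    }
}
```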

Time savings were great, too.  In the same toy example, it took my JVM 7.2 milliseconds to serialize the object and 1.7 milliseconds to unserialize.  Doing it with stream I/O took only 76,000 nanoseconds (0.076 ms) to serialize and 58,000 nanoseconds (0.058 ms) to unserialize.
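A crude way to take this kind of measurement is sketched below. It is not the harness used for the numbers above: the Sample class, the iteration count, and the warm-up scheme are all assumptions, and a loop around System.nanoTime() is only a rough guide, since JIT compilation and garbage collection will move the numbers around from run to run.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical toy object to time.
class Sample implements Serializable {
    private static final long serialVersionUID = 1L;
    int id = 1;
    String name = "sample";
}

class TimingSketch {
    // Average wall-clock nanoseconds per call, after an equal-length warm-up.
    static long averageNanos(Runnable task, int iterations) {
        for (int i = 0; i < iterations; i++) {
            task.run(); // warm-up, not timed
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / iterations;
    }

    // The lazy way: push the whole object through Java serialization.
    static void javaSerialize(Sample s) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The right way: write each instance variable to the stream.
    static void writeFields(Sample s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeInt(s.id);
            out.writeUTF(s.name);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Sample s = new Sample();
        int iterations = 10_000;
        System.out.println("Java serialization: "
                + averageNanos(() -> javaSerialize(s), iterations) + " ns/object");
        System.out.println("Field by field:     "
                + averageNanos(() -> writeFields(s), iterations) + " ns/object");
    }
}
```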

I love order-of-magnitude improvements.

Lesson learned: get off your lazy ass and do it right.