July 2008 Archives

Transactional Memory

The next article in my column at El Reg is up.  It's about transactional memory in Sun's upcoming Rock CPU.

You can read my article and follow transactional memory on pressflip.

Cue the buttsore Brits in the comments whining about my language.

Looking For Bloggers

Do you blog regularly about politics, current events, or science?  If you do, I'd like to talk with you.

E-mail me: ted@pressflip.com

Porter Stemming Makes Me Rage

I have no formal training in natural language processing.  As such, I figure out a lot of this shit on my own.

One of the simplest concepts in NLP/text mining is stemming.  If you're not in the know, to stem a word is to remove all the unnecessary shit after its root.

For example, "computer", "computing" and "compute" all stem to "comput".  Same root, virtually the same meaning.

Something like this is clearly useful in a search engine like Pressflip, because if somebody searches for "iphone" (and a lot of you people are), the engine should pull up documents that contain the plural (iphones) of the word.

The canonical algorithm for doing this sort of thing is called the Porter Stemming Algorithm, which considers each word on its own.  Porter works great 99% of the time, but when it fails, it fucks you hard.

Why You Keep Tryin To Say That Word?

A good example of this comes from the pressflip query logs.  A user searched for "marketing".  Perfectly reasonable.  Porter stemmed that to "market", which returned a bunch of search results about the Dow Jones and Nasdaq.  Ouch. Right in the butt.

What went wrong?  In smart-talk, the bare infinitive that corresponds to the gerund has a different meaning than the gerund.  Again, I know dick-shit about NLP, so maybe you guys have a serious-business name for this sort of thing.

So yeah, gerunds make Porter suck sometimes.

There are some other failure cases I've discovered.  Proper nouns will give it to you Clydesdale-style, too.  More specifically, proper nouns that don't stem to themselves.  Example: "Mariners" and "Marin" both share the same stem.  So potentially, someone searching for the baseball team from Seattle will come up with news about the hoity-toity town across the Golden Gate Bridge from San Francisco.

What's the answer to this?  If you're a company with millions in VC lottery winnings, you can pay Basistech $100,000 for a 3-year license of their context sensitive stemmer.  If you're me, though, you make exclusion lists.  Big ones.
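
An exclusion list is about as low-tech as it sounds.  Here's a minimal sketch in Python.  Note that `toy_stem` is a crude suffix stripper standing in for a real Porter implementation, and `EXCLUSIONS` and `stem` are hypothetical names for illustration, not Pressflip's actual code:

```python
# Crude stand-in for a real Porter stemmer: strip the first matching suffix.
# This is NOT the Porter algorithm, just enough to show the failure mode.
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")

def toy_stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters of root so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Words that must never be stemmed, because the stem means something else.
EXCLUSIONS = {"marketing", "mariners"}

def stem(word: str) -> str:
    word = word.lower()
    if word in EXCLUSIONS:
        return word          # leave the word exactly as it is
    return toy_stem(word)

for w in ("computing", "marketing", "markets", "Mariners", "Marin"):
    print(w, "->", stem(w))
```

The same list has to be applied when indexing documents and when stemming queries, or the two sides won't agree on what a word stems to.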

That being said, after a large re-processing this weekend, Pressflip search quality is going to improve.

The First 24 Hours

ted_dziuba_matt_kent_kyle_shank.jpg
Pressflip launched last Thursday, and there has been a flurry of traffic so far.  Matt, Kyle, and I celebrated a bit the night of the launch (pictured above).  Within the first day we had more than one thousand registered users, and we are still growing.

There was some good fail to fix, a few bugs here and there, but nothing catastrophic. 

I found out today that TechCrunch's Michael Arrington doesn't like it, but that comes as no surprise - there are no shiny things on the site.

Anyway, there's a bit of a back story to this that not a lot of people know.  When I wrote Uncov, I pointed out the widespread incompetence that had metastasized on the internet.  I got pretty popular, to the point where Uncov could have been a self sustaining business.  Now, it's a well known fact that I am the greatest living technical writer, and carrying this burden is no easy task.  After I stopped writing Uncov, Arrington tried to hire me to write for TechCrunch.  I refused, for obvious reasons.

I refused to kneel before Arrington, and this is my consequence.  Not that it matters, though.  Our usage statistics and server logs suggest that people like Pressflip.  I've seen some very interesting queries go through the system, and people are coming back for news on a wide variety of topics.

That reminds me, I need to write a log processor.  Does anyone know of anything better than AWStats?  Don't say Google Analytics.  There's no JavaScript on our site and I'm kind of proud of that.  E-mail me: tjdziuba@gmail.com.

Persai Is Now Pressflip, And It's Ready

It's taken more than a year, two rounds of angel financing, and a whole lot of butt pain, but the startup I've been working on is finally ready to go.

This is a maliciously unofficial announcement.

What Does It Do?

You would like to know that, wouldn't you?  Well let me tell you, it's the greatest web service product that has ever been created.  This product is so awesome, that simply reading this announcement will make you several times cooler.

Pressflip is a persistent search service, but it's not like the others.  It doesn't think it knows better than you.  Previously, search & learning systems have been black boxes.  You go to the site, click a few stories from unrelated topics, and it shows you content that it thinks you'll like, most of the time based on the click patterns of other users.  The theory is that if another user reads the same programming articles I do, then he likely has the same political opinions, taste in humor, and favorite sports teams as me.  Right.  Makes perfect sense.

For what it's worth, Digg's up-and-coming recommendation system solves this problem by correlating users within a defined set of topics, so people are catching on that there's a shit-ton of variance in this kind of social-click correlation.

Anyway, Pressflip is way different and way more specific.  We don't have a defined set of topics; you tell us the topics.  We don't correlate you with other users; our computer programs read what you click on, figure out what it's about, and then show you more stuff like it.  If you don't like something, you press "flip", and we'll show you less like it.

How Does It Work?


That's a good question.  After dealing with angel investors, VCs, users, and anybody who isn't an engineer, the answer in my mind is, nobody gives a shit.  Really, nobody cares about your algorithm or how revolutionary you think it is.  All people care about is a system that shows them things they want to read.

With that in mind, we set off to answer one question: what are you interested in?  Corollary to this, we have declared the you-can't-end-a-sentence-in-a-preposition rule to be obsolete, because it's stupid.

You can type something in as a response to that, say "new york yankees", "iphone 3g", or "john mccain", and we'll start by showing you some news about the topic.  You can save this search as an interest and continue to get news about it.  As you click and flip stories within the interest, the machine will learn the nuanced detail about what you like.  Maybe you only like iPhone 3G news that talks about how the AT&T data plan is giving people the shaft.  We can learn that.

tl;dr

Go to pressflip.com and start using my shit.  Then blog about how much you think it sucks and how Uncov was stupid.

I lurk #uncov on Freenode as tjd.

Announcing My New Column At The Register

Starting today, I'll be writing a bi-weekly (that's every other week, not twice a week, you dunce) column at The Register.  Why?  Because I'm just that awesome.

Here are the answers to a few of the questions I've got so far:

What are you going to write about?

Not startups, thank Christ.  It's going to be more developery.

Does this make you a sellout?
Yes, insofar as I (and other sellouts before me) have bills to pay in the part of the country that has the highest cost of living.  If I really wanted to sell out, I'd have taken the job that Arrington offered me at TechCrunch.

Why not write for TechCrunch or Valleywag?
Because I like the audience better at El Reg.  The people who read The Register are the boots-on-the-ground in technology.  These are the DBAs, the J2EE developers, and the sys admins who are there every day, supporting the infrastructure that runs the world outside of San Francisco. These are the people I identify with the most. Delusion of grandeur?  Maybe, but no more deluded than thinking Valleywag matters to anybody outside the 510 and 650 area codes.

Can I pitch you my startup like I did every fucking month at Uncov?

Sure, but my rule is that you need to buy me a drink before you start speaking.  I'm not going to write anything about what you're doing, but if you want the honor of my attention it will cost you at least one Maker's Mark.

Read the first one here.  It's about protocol buffers.

Build Google Protocol Buffers Without Maven

Google released protocol buffers as open source, which, with a proper transport, will give both XML-RPC and Thrift a run for their money.

Anyway, it's kind of a pain in the balls to build the Java version.  If you import the Java source into Eclipse, it's got all sorts of build errors, all stemming from a missing file: DescriptorProtos.java.

If you've got Maven installed, it will make DescriptorProtos.java for you (this file is generated via protoc).  But Maven is stupid, because it didn't work immediately after apt-get install and I couldn't figure out how to fix it within 30 seconds.  I have no patience for this kind of bullshit.

So, to build DescriptorProtos.java without Maven, you make it by hand:

protoc --proto_path=/path/to/protobufsrc/src \
--java_out=/home/ted/some_directory \
/path/to/protobufsrc/src/google/protobuf/descriptor.proto
(You already compiled protoc, didn't you?)

Drop the output file into Eclipse and protocol buffers will build.  There are still a bunch of compilation warnings, but only chumps listen to those.

Corporate Competence

I really love it when people just do their jobs.  I feel gifted whenever I call a company and get a customer support representative who knows what they are doing and actually cares about me.

It's rare, but it happens.

Worst ISP Ever


For a while, I had Comcast's cable internet service.  It was clear after two years of putting up with their horseshit that they don't care about customers at all.

Oh, wait, they set up a Twitter account.

Fantastic, but my BitTorrent shit still didn't work on their network.  Their installation staff is rude and has questionable hygiene, and their customer support representatives are downright lazy.

Switch to AT&T Now

When I moved, my first order of business was to call Comcast and tell them it's over.  They said my service wouldn't end until I brought back my cable modem, and of course, the place I needed to bring it back to was only open during working hours.

I took off work early to get this little brick of dissatisfaction back to its rightful owner, because fuck them.

At the same time, I was waiting for AT&T to show up and install U-Verse internet service.  They did, and shit was impressive.

  • They told me the tech would be at my house any time from noon to 2pm on a Sunday.  The tech showed up at noon on the dot.
  • It took him about an hour to set up the service.  When he left, he gave me a card with his direct cell phone number.  If I had any problem in the next ten days, I could call him directly and he would come fix it.
  • An hour after he left, the service went out.  I called him, and he was back at my house within 30 minutes.  It turns out there was something wrong with the line from the street to my house, and he had to get another tech out to fix it.  That guy showed up, fixed the problem, and was on his way.  The two of them were at my place until 8pm on a Sunday, until the job was done right.
I've been using the service for almost a week now and it's great.

  • No BitTorrent fuckery.  All my torrents work great, and I can seed.
  • 10 megabits downstream, 1.5 megabits upstream.
Great job, AT&T, you actually care about the people paying your salaries.


Having Fun At Mahalo's Expense

Holy crap this is awesome.  Mahalo, the "human powered search engine", now lets the general public edit parts of its pages.

For the uninformed, Mahalo is a for-profit installation of MediaWiki founded by notable blowhard Jason Calacanis and backed by Sequoia Capital.  They're aiming to be a hand-vetted search engine to compete with Google.  Right.

Anyway, now you can edit their pages anonymously.  Since they're a capitalist version of Wikipedia, they need to pay people to review edits, so they aren't too quick on the uptake.  Here's how to do it:

Step 1: Search for something at mahalo.com

Step 2: Click the 'edit' link

Step 3: Add your verbiage.  Remember, this is MediaWiki syntax, so you can link to other Mahalo pages.

Step 4: Hit 'save' and collect your winnings

A search engine that anyone can edit.  What could possibly go wrong?

Practical Unique Identifiers

There have been a handful of places within the Persai pipeline where I have needed unique identifiers of varying length.  64 bits here, 32 bits there.  I'm not the only one to ever have to solve this problem, but I could never find a concise toolbox of information on it.

Automatic Increment or Not


MySQL has the AUTO_INCREMENT modifier for integral record keys.  That's great, if you're using MySQL.  In general, prefer a record identifier that isn't automatically incremented, unless you have a specific reason to use one.  Here's why:

  • You may actually have to think about thread synchronization at some point when creating records.
  • If these identifiers become publicly visible, they can leak information about how many records are in your database.
  • If your records correspond to other pieces of data (say URLs), then you can't get the identifier value of a given datum without a table lookup.  And even then, you'll need another index on that field.
There are a few cases where automatic increment identifiers are good, though:

  • You are using a MySQL database and are setting up a simple structure of tables. (i.e. MySQL handles synchronization for you and it's actually harder to not use automatic increment)
  • The creation order of records is really important to you, but not important enough to store a timestamp field.

Making an Identifier Out Of Arbitrary Data

Easy, right?  Just hash whatever data you've got.  The output is not reversible and is spread uniformly over the identifier space.  However, many times the output of a standard hashing algorithm is too big.  SHA-1, for example, is 160 bits wide.  Way too long for most purposes.

In this case, I truncate the output.  Yes, this is mathematically valid, because any good hashing algorithm's output will be uniform over the range of the function.  And by uniform, I mean really uniform.  For example, if you take the first 64 bits of a 160-bit SHA-1 hash and call that your unique identifier, the identifiers are going to be spread uniformly over the space of all 64-bit numbers, so no value is more likely to collide than any other.  If they weren't (i.e. if the first 64 bits of a SHA-1 hash were distributed, say, normally), then the hash function would be cryptographically insecure.
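
Here's a minimal sketch of the truncation in Python, using the standard library's hashlib.  The function name `id64` and the big-endian byte order are assumptions for illustration, not necessarily what Persai does:

```python
import hashlib

def id64(data: bytes) -> int:
    """Return the first 64 bits of the SHA-1 digest of data as an unsigned int."""
    digest = hashlib.sha1(data).digest()       # 20 bytes, i.e. 160 bits
    return int.from_bytes(digest[:8], "big")   # keep only the first 8 bytes

# Example: derive an identifier straight from a URL, no table lookup needed.
url_id = id64(b"http://www.website.com/document")
print(f"{url_id:016x}")
```

Because the identifier is computed from the data itself, anybody holding the datum can recompute it.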

Don't try to swing your dick around and come up with your own hash function.  You'll screw it up.  I know I have.

GUIDs

I got an e-mail from a reader about using GUIDs for unique identifiers.  This fits with the hashing scheme, but for the most part, I think GUIDs are far too large, especially if you are storing a lot of records.  GUIDs are 128 bits wide, so if you have a hundred million records, that's about 1.5GB worth of identifiers.  Use a 64-bit identifier, and your space is halved, without a significant increase in collision probability.
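
To put rough numbers on that, here's a back-of-the-envelope sketch in Python.  It assumes identifiers are uniformly random and uses the standard birthday-bound approximation; the function names are made up for illustration:

```python
import math

def id_storage_bytes(records: int, bits: int) -> int:
    """Bytes needed to store one identifier of the given width per record."""
    return records * bits // 8

def collision_probability(records: int, bits: int) -> float:
    """Birthday approximation: 1 - exp(-n(n-1) / (2 * 2^bits))."""
    exponent = -records * (records - 1) / (2 * 2 ** bits)
    return -math.expm1(exponent)   # expm1 keeps precision for tiny values

n = 100_000_000
for bits in (128, 64):
    size = id_storage_bytes(n, bits)
    p = collision_probability(n, bits)
    print(f"{bits}-bit: {size:,} bytes, P(any collision) ~ {p:.2e}")
```

For a hundred million records, that works out to 1.6 billion bytes (about 1.5 GiB) of 128-bit identifiers versus 800 million bytes of 64-bit ones.  Under these assumptions the chance of any collision at 64 bits is roughly 0.03%, which is small but not zero, so whether it counts as insignificant depends on what a collision costs you.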

Making An Identifier Easier On The Eyes

If you need to put a unique identifier in a URL, it can't look too nerdy.  For example, this URL looks like shit:

http://www.website.com/document?id=1b25a53bf21d0206
Too many numbers.  So, to make it look better, Base-64 encode it.  It will lengthen the code a little, but it's much easier to look at:

http://www.website.com/document?id=ZnJvc3RlZCBidXR0cw==
Eh, well it looks better to me.  Personal taste, I guess.

You'll need to make sure that your Base-64 alphabet doesn't include the + and / characters: they aren't URL safe.
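
A minimal sketch of one way to do that in Python: the standard library's `base64.urlsafe_b64encode` swaps `+` and `/` for `-` and `_`.  The helper names `encode_id` and `decode_id`, and the decision to strip the `=` padding, are assumptions for illustration:

```python
import base64

def encode_id(identifier: int) -> str:
    """Encode a 64-bit identifier as URL-safe Base-64 without '=' padding."""
    raw = identifier.to_bytes(8, "big")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def decode_id(text: str) -> int:
    """Reverse encode_id, restoring the padding the decoder expects."""
    padded = text + "=" * (-len(text) % 4)
    return int.from_bytes(base64.urlsafe_b64decode(padded), "big")

token = encode_id(0x1b25a53bf21d0206)
print("http://www.website.com/document?id=" + token)
assert decode_id(token) == 0x1b25a53bf21d0206
```

Encoding the raw 8 bytes rather than the hex text gives 11 characters, so in this sketch the identifier actually comes out shorter than its 16-digit hex form.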

Sort Orderings

Don't worry about sort ordering unless you have to worry about sort ordering.  Duh.  The vast majority of Persai's data is stored simply as files, and for most purposes we don't have to care about the processing order.  We're fortunate in that regard (well maybe not fortunate, I mean that's like saying you're fortunate that you're not fat because you exercise and eat sensibly).

Anyway, there are a couple of places in Persai where sort order matters.  The ordering of recommendations, for example.  There, though, we're just ordering by time, and we need to display the exact time, not just the relative times of the recommendations, so we store a date field and order data by it in the store.

This drives one of my earlier points home: if you need ordering by time, don't count on an automatic increment unique identifier to do it.  It's much more robust to store a timestamp.

In fact this point goes deeper.  Very rarely do you actually need records sorted by record identifier.  What you need is the records sorted by some other value that happens to be reflected in the record identifier by virtue of automatic increment and the insertion order.  It's always more robust to store the actual value you need to sort by.
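
As a rough illustration in Python, with a hypothetical `Recommendation` record (not Persai's real schema): the identifier can be any opaque value, and the ordering comes from the stored timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Recommendation:
    doc_id: int        # opaque identifier, e.g. a truncated hash; carries no order
    created: datetime  # the value we actually want to sort and display

def newest_first(recs):
    """Order recommendations by their stored timestamp, newest first."""
    return sorted(recs, key=lambda r: r.created, reverse=True)

recs = [
    Recommendation(0x9F3A, datetime(2008, 7, 1, 9, 0, tzinfo=timezone.utc)),
    Recommendation(0x0B12, datetime(2008, 7, 3, 18, 30, tzinfo=timezone.utc)),
    Recommendation(0x77C4, datetime(2008, 7, 2, 12, 15, tzinfo=timezone.utc)),
]
for rec in newest_first(recs):
    print(f"{rec.doc_id:04x}", rec.created.isoformat())
```

Sorting on `doc_id` here would give a meaningless order, which is the point: the identifier doesn't have to know anything about time.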

I'm Not Going To Tell You How To Write Code

Because I don't really care.  This is how I do it, though.

About this Archive

This page is an archive of entries from July 2008 listed from newest to oldest.
