Sunday, September 7, 2008


It was looking like a boring day. I got in late, almost 11 (B-A-D, I usually work 10:30-18:30), and by 11:30 had nailed all the daily maintenance stuff and was looking at a series of deadlocked and waiting-for-other-guys jobs. And a few epic jobs that couldn't really be furthered today. Then the ops manager walked up to me...

"What's faster than grep?"

If you aren't a geek, grep is a really versatile and REALLY FAST thing you search text files with. If somebody important wants something faster than grep, they are either in an unholy hurry, or have a truckload of data.

Or worst case, like this case, both.

Forty-four gig of data all told, and a 180k list of some 13,000 records to pull out. Variable-length text data, a worst-case scenario. And we need the information extracted by tomorrow morning. Oh, and you need to match two different fields.

Great.

The good news? The first field has a lot of duplicates. It's 13,000 records, but only 220 unique in the first field. I flag that as "interesting".

The problem with a looped grep is, you're spinning the Big Wheel fast and the little wheel slow. You're cycling through 44 gig of data 13,000 times. Nothing's going to be in RAM (this is 2006, and I still don't have a box with 64GB of RAM, darnit... I stuck one in at my last job with 32 though). A repeated grep command was literally taking days (it was tried before they handed me the problem). So I come at it from the other direction.

I've written something faster than grep before. But I cheated then, and indexed the data using something called CDB, which is the fastest database system I've ever seen (look it up, it is truly annihilative). But today I had no time to index anything..... one pass was going to have to be enough. How'd I do it?

I wrote it in C. No kidding, that volume of data, I need the fastest _performing_ tool I can find, or it's gonna take weeks. Compile with -O3! PHP would take way less time to code, saving an hour or two even, but take days longer to run.

First thing I did was load the 13,000 sets of two fields into an array. This meant that the data doing most of the work was stuck in RAM, and we only had to cycle through the 44GB of junk once. This is a good thing.
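The loading step might look roughly like the sketch below. It is not the original program: the comma-separated "first,second" layout, the 64-byte field limit, and the names `struct wanted`, `parse_pair` and `load_wanted` are all assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FIELD_MAX 64            /* assumed maximum field length, including NUL */

/* One wanted record: the two fields that must both match. */
struct wanted {
    char first[FIELD_MAX];
    char second[FIELD_MAX];
};

/* Parse a "first,second" line into w. Returns 1 on success, 0 if malformed. */
int parse_pair(const char *line, struct wanted *w)
{
    /* %63[^,\n] reads up to 63 characters, stopping at a comma or newline. */
    return sscanf(line, "%63[^,\n],%63[^,\n]", w->first, w->second) == 2;
}

/* Read every pair from fp into a growing heap array.
   Returns the number of pairs loaded, or -1 if memory runs out.
   The caller frees *out. */
long load_wanted(FILE *fp, struct wanted **out)
{
    struct wanted *list = NULL;
    long count = 0, capacity = 0;
    char line[256];

    while (fgets(line, sizeof line, fp) != NULL) {
        struct wanted w;
        if (!parse_pair(line, &w))
            continue;                           /* skip malformed lines */
        if (count == capacity) {
            long new_cap = capacity ? capacity * 2 : 1024;
            struct wanted *tmp = realloc(list, (size_t)new_cap * sizeof *tmp);
            if (tmp == NULL) {
                free(list);
                *out = NULL;
                return -1;
            }
            list = tmp;
            capacity = new_cap;
        }
        list[count++] = w;
    }
    *out = list;
    return count;
}
```

At 13,000 pairs of 64-byte fields this is under 2MB, so the whole wanted list sits comfortably in memory while the big file streams past once.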

Lots of bugs ensued. The usual C stuff... you hack something together and it segfaults the first 100 times you run it, while you madly run around putting print statements everywhere trying to figure out where the fire is. Meanwhile, the clock is ticking and my boss's boss is sitting next to me or pacing around his office looking worried. Tick tock.

I get the thing working, and move it over to the HP-UX box that has all the data on it. I didn't even have vim on that box, so I had coded it up first on my Linux desktop. I start it up there on a small sample of the data that had been working on my desktop.

It crashes. And crashes. And crashes. I run into bug after bug in the awful HP libraries. sscanf doesn't work anything like it does under GNU libc. I'm about ready to move the data off onto a heavy Linux server when my Boss (not my Boss's Boss) mentions that he's installed the GNU compiler and libraries on the HP-UX machine. Sweet! Compiles first go, runs first go now.

Except it's too slow.

"What cpu's in this thing man?"

"Uhhh I think it's 4x 360MHz"

"$%#%!"

I start transferring the data onto the aforementioned heavy Linux server. It has about 10 times the processor power. I have a brief discussion with IP Engineering about LETTING ME THROUGH THE FIREWALL NOW PLZ. Then it starts..... 5MB a sec anyone? This is the kind of thing, I mention, it would be USEFUL to UPGRADE TO GIGABIT ETHERNET for. Tick tock.

Hours later.... The hours are good in a way, they give me time to do a bit of other work and think about how to make my hacked-together program go faster.

The first thing is, I realise, smacking myself in the head, that I'm matching stuff and then continuing to compare the rest of the 13,000 wanted records to the line after it's been matched! I fix this, and then I remember the "interesting" thing from earlier.
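The early-exit fix can be sketched as follows. The `struct wanted` layout and the `find_match` name are hypothetical; the point is only that the loop returns as soon as both fields match instead of walking the rest of the list.

```c
#include <string.h>

#define FIELD_MAX 64            /* assumed maximum field length, including NUL */

/* One wanted record: the two fields that must both match. */
struct wanted {
    char first[FIELD_MAX];
    char second[FIELD_MAX];
};

/* Return the index of the first wanted pair whose two fields both
   match, or -1 if none does. */
long find_match(const struct wanted *list, long n,
                const char *first, const char *second)
{
    for (long i = 0; i < n; i++) {
        if (strcmp(list[i].first, first) == 0 &&
            strcmp(list[i].second, second) == 0)
            return i;           /* matched: skip the remaining pairs */
    }
    return -1;                  /* compared against all n pairs, no match */
}
```

A line that matches early in the list now costs only a handful of comparisons; a line that matches nothing still costs all 13,000, which is what the screening step described next addresses.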

220 unique records in the first field only. I write a loop (I hate C, this would be another one-liner in PHP :() to find these and put them into an array. I use this array to "screen" each record before scanning it. If the first field isn't in the 220, the line doesn't get compared to the 13,000 - we skip it and move on.
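A minimal sketch of that screen, under the same assumptions as before (fixed 64-byte fields; `build_screen` and `passes_screen` are illustrative names, not the original code):

```c
#include <string.h>

#define FIELD_MAX 64            /* assumed maximum field length, including NUL */

/* One wanted record: the two fields that must both match. */
struct wanted {
    char first[FIELD_MAX];
    char second[FIELD_MAX];
};

/* Copy each distinct first field from list into uniq.
   uniq must have room for n entries. Returns the number of unique values. */
long build_screen(const struct wanted *list, long n, char uniq[][FIELD_MAX])
{
    long count = 0;
    for (long i = 0; i < n; i++) {
        long j;
        for (j = 0; j < count; j++)
            if (strcmp(uniq[j], list[i].first) == 0)
                break;                          /* already recorded */
        if (j == count)
            strcpy(uniq[count++], list[i].first);
    }
    return count;
}

/* Return 1 if first appears in the screen, 0 if the line can be skipped. */
int passes_screen(char uniq[][FIELD_MAX], long count, const char *first)
{
    for (long j = 0; j < count; j++)
        if (strcmp(uniq[j], first) == 0)
            return 1;
    return 0;
}
```

A line that fails the screen costs at most 220 comparisons instead of 13,000. A line that passes pays for the 220 on top of the full comparison, which is the extra cost that shows up on match-heavy data.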

This actually slows down my sample data noticeably. But I know that my sample data has an unusually high number of matches - I figure my program is just doing an extra 220 comparisons on each line for that data, whereas the bulk of the REAL data will be skipped over quickly by this code. How do I test that theory? I'm able to run the program on the incomplete file as it downloads. Don't try this on Windows :) I'm getting what looks like good results. It's fast, and it's Getting the Data.

I'm running out of hours. The first half of the data (glad it's in two separate files now!) is finished downloading and I get to work on it for real. Looks good... it's matching stuff, and the counter I've got in the program that prints a line when it gets to a million is clipping past fairly quickly. The dead spots in the data without any records in the 220 list fly past, and I estimate it's going to take just 90 minutes to get through the first file! 285 million records in 90 minutes is just fine by me.
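A progress counter of that kind takes only a few lines. This is a sketch under assumptions: the `count_with_progress` name, the 4096-byte line buffer, and the message text are made up for illustration.

```c
#include <stdio.h>

/* Read records (one per line) from in, writing a progress line to log
   each time another `every` records have gone past.
   Returns the total number of records read. */
long count_with_progress(FILE *in, FILE *log, long every)
{
    char line[4096];            /* assumes no record is longer than this */
    long records = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        /* ... screening and matching of `line` would happen here ... */
        records++;
        if (records % every == 0)
            fprintf(log, "%ld records processed\n", records);
    }
    return records;
}
```

Called as `count_with_progress(stdin, stderr, 1000000L)`, it reports once per million records. At 285 million records in 90 minutes, that is about 52,800 records a second, so a new line would appear roughly every 19 seconds.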

WE ARE GOING TO WIN.

The rest is easy. A few hours later, the second file finishes downloading. I start it up and walk home. It's done by the time I get there and VPN in. Sweet. Home by 9pm, and the work done!

Today reminded me why I love my job.

James Hicks owns and operates http://isnerd.net

He has ten years' experience in the Information Technology / Information Services industry, including eight as a Linux Systems Administrator. He has worked as a senior Unix Administrator for Primus Telecom Australia (a large Australian telco/ISP) and is currently Production Support Manager at AusRegistry - the infrastructure company that maintains the com.au, net.au, org.au (and other) domain spaces.

He became a RedHat Certified Engineer in 2004, and currently lives in Melbourne, Australia.