Hey all you Rails fans, just sharing a quick Rails database performance tip about something that's tripped me up more than a few times.
I am no professional developer, more of an 'enthusiast', but hopefully this tip can help you when updating large sets of records in Rails. Let's embark on a journey through Rails land, exploring two distinct paths of updating records: the loop-through trail and the speedy highway of update_all.
Setting The Table
I recently had to classify 8M+ NY unclaimed property records as either a "person" or a "business" based on a name field. Here are some example holder names from unclaimed property records in Staten Island, NY:
MILLER KEITH
BORRIELLO SALVATORE
NEW VISON HOUSING DEVELPMENT FUND CORP
TEAMSTERS LOCAL 210
JACOBSON FURNITURE INC
ARCURI STEVEN
To do that, I created a method on my model to loop through an array of business terms like this:
```ruby
BUSINESS_TERMS = ["accounting", "agency", "asbestos", "assoc", "associated", ...]
```
Then, for each term, I look through all 8M+ records and mark the record as a business if its name includes the term. I am sure there's a better way to do this, but hey, it works.
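Here's a rough skeleton of that method. The model and column names (Property, name, is_business) are just my setup, so treat this as a sketch, not gospel:

```ruby
class Property < ApplicationRecord
  # Abbreviated here; the real list has many more terms.
  BUSINESS_TERMS = ["accounting", "agency", "asbestos", "assoc", "associated"].freeze

  # Flags every record whose name contains one of the business terms.
  def self.classify_businesses!
    BUSINESS_TERMS.each do |term|
      # For each term: find the matching records and set is_business to true.
      # HOW to do that update is the real question, as we'll see next.
    end
  end
end
```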
So what's the best way to handle each term? Should I loop through all 8M+ records one by one, or is there a better way? Let's imagine our options as a picnic vs. a buffet (strange, I know, but I like analogies).
The Loop-Through Picnic
Imagine setting up a picnic, one sandwich at a time. In this case, one sandwich equals one unclaimed property record. For each sandwich, you walk to your car, fetch ingredients, make the sandwich, then walk back to your car to store it. Sounds like a workout, right? This is the loop-through method in action: reliable, but oh-so time-consuming. In code-land, each trip to your “car” (or database) takes time and energy.
How much? Not a ton for each record, but when you do it 8 million times, it adds up fast. Plus, pulling that many records into memory takes a while. According to some simple logging, it takes about 10ms for the read and another 10ms for the update. That's 20ms per record, 8 million times over: 160,000 seconds, or roughly 1.8 DAYS, eek. And that's just one stroll through the records for a single term.
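In Rails terms, the picnic looks something like this (same hypothetical Property setup as above; find_each at least batches the reads so you don't load all 8M records into memory at once):

```ruby
# The picnic: one UPDATE statement per matching record.
Property::BUSINESS_TERMS.each do |term|
  # ILIKE is Postgres's case-insensitive LIKE.
  Property.where("name ILIKE ?", "%#{term}%").find_each do |property|
    property.update(is_business: true) # a separate round trip for every record
  end
end
```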
Now, let's consider a better option...
The update_all Buffet
Now, imagine a buffet. A grand table where you spread out all ingredients, assemble all sandwiches simultaneously, and bam – lunch for everyone! This, dear reader, is the magic of update_all. Instead of multiple trips, you're making everything in one go. It's the culinary dream for the impatient – and, well, let's face it, I'd be first in line.
In technical terms, update_all batches the task into one SQL operation, making it incredibly efficient. Just like prepping that buffet in record time, your database appreciates the brevity.
In my case, a single update_all across all 8 million records takes about 33 seconds on a Basic Postgres database on Heroku with Basic dynos. I'll take 33 seconds over 1.8 days anytime!
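For reference, a sketch of the buffet version. Each term becomes a single UPDATE ... WHERE statement, and the database does all the work server-side:

```ruby
# The buffet: one SQL UPDATE per term, no records loaded into Ruby at all.
Property::BUSINESS_TERMS.each do |term|
  Property.where("name ILIKE ?", "%#{term}%").update_all(is_business: true)
  # Generates roughly:
  #   UPDATE properties SET is_business = TRUE WHERE name ILIKE '%term%'
end
```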
But Wait... There’s Always a ‘But’
However, a word of caution for all the over-enthusiastic buffet lovers out there (myself included). While update_all is quick and easy, it skips some crucial steps: validations and callbacks on the models you're updating never run. So, while you may churn out sandwiches (or updated records) at an impressive rate, you might just forget the mayo or pickles. If you're like me and tend to overlook the finer details in your zest for efficiency, take note.
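To make that concrete, suppose the model had a validation and a callback like the ones below (hypothetical on my part). update_all writes straight to the database, so neither ever runs, and it won't even touch updated_at:

```ruby
class Property < ApplicationRecord
  validates :name, presence: true    # never checked by update_all
  after_save :log_classification     # never fired by update_all

  private

  def log_classification
    Rails.logger.info("Property #{id} was reclassified")
  end
end

# One raw UPDATE: no validations, no after_save, updated_at left unchanged.
Property.where("name ILIKE ?", "%inc%").update_all(is_business: true)
```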
Take-Away (See What I Did There?)
Choosing between the methodical loop-through approach and the express lane of update_all is like deciding between a leisurely picnic and a rapid-fire buffet. Both have their merits and pitfalls, and both have their place, but if you don't need validations and callbacks, update_all may just be your huckleberry (see Tombstone for that reference).