David Eisinger


Elsewhere > Use .pluck If You Only Need a Subset of Model Attributes

Posted 2014-08-20 on viget.com

Despite some exciting advances in the field, like Node, Redis, and Go, a well-structured relational database fronted by a Rails or Sinatra (or Django, etc.) app is still one of the most effective toolsets for building things for the web. In the coming weeks, I’ll be publishing a series of posts about how to be sure that you’re taking advantage of all your RDBMS has to offer.

IF YOU ONLY REQUIRE a few attributes from a table, rather than instantiating a collection of models and then running a .map over them to get the data you need, it’s much more efficient to use .pluck to pull back only the attributes you need as an array. The benefits are twofold: better SQL performance and less time and memory spent in Rubyland.

To illustrate, let’s use an app I’ve been working on that takes Harvest data and generates reports. As a baseline, here is the execution time and memory usage of rails runner with a blank instruction:

$ time rails runner ""
real 0m2.053s
user 0m1.666s
sys 0m0.379s

$ memory_profiler.sh rails runner ""
Peak: 109240

In other words, it takes about two seconds and 100MB to boot up the app. We calculate memory usage with a modified version of this Unix script.

Now, consider a TimeEntry model in our time tracking application (of which there are 314,420 in my local database). Let’s say we need a list of the dates of every single time entry in the system. A naïve approach would look something like this:

dates = TimeEntry.all.map { |entry| entry.logged_on }

It works, but seems a little slow:

$ time rails runner "TimeEntry.all.map { |entry| entry.logged_on }"
real 0m14.461s
user 0m12.824s
sys 0m0.994s

Almost 14.5 seconds. Not exactly webscale. And how about RAM usage?

$ memory_profiler.sh rails runner "TimeEntry.all.map { |entry| entry.logged_on }"
Peak: 1252180

About 1.25 gigabytes of RAM. Now, what if we use .pluck instead?

dates = TimeEntry.pluck(:logged_on)

In terms of time, we see major improvements:

$ time rails runner "TimeEntry.pluck(:logged_on)"
real 0m4.123s
user 0m3.418s
sys 0m0.529s

So from roughly 15 seconds to about four. Similarly, for memory usage:

$ memory_profiler.sh bundle exec rails runner "TimeEntry.pluck(:logged_on)"
Peak: 384636

From 1.25GB to less than 400MB. When we subtract the overhead we calculated earlier, we’re going from 15 seconds of execution time to two, and 1.15GB of RAM to 300MB.

Using SQL Fragments

As you might imagine, there’s a lot of duplication among the dates on which time entries are logged. What if we only want unique values? We’d update our naïve approach to look like this:

dates = TimeEntry.all.map { |entry| entry.logged_on }.uniq

When we profile this code, we see that it performs slightly worse than the non-unique version:

$ time rails runner "TimeEntry.all.map { |entry| entry.logged_on }.uniq"
real 0m15.337s
user 0m13.621s
sys 0m1.021s

$ memory_profiler.sh rails runner "TimeEntry.all.map { |entry| entry.logged_on }.uniq"
Peak: 1278784

Instead, let’s take advantage of .pluck’s ability to take a SQL fragment rather than a symbolized column name:

dates = TimeEntry.pluck("DISTINCT logged_on")

Profiling this code yields surprising results:

$ time rails runner "TimeEntry.pluck('DISTINCT logged_on')"
real 0m2.133s
user 0m1.678s
sys 0m0.369s

$ memory_profiler.sh rails runner "TimeEntry.pluck('DISTNCT logged_on')"
Peak: 107984

Both running time and memory usage are virtually identical to executing the runner with a blank command, or, in other words, the result is calculated at an incredibly low cost.

Using .pluck Across Tables

Requirements have changed, and now, instead of an array of timestamps, we need an array of two-element arrays consisting of the timestamp and the employee’s last name, stored in the “employees” table. Our naïve approach then becomes:

dates = TimeEntry.all.map { |entry| [entry.logged_on, entry.employee.last_name] }

Go grab a cup of coffee, because this is going to take awhile.

$ time rails runner "TimeEntry.all.map { |entry| [entry.logged_on, entry.employee.last_name] }"
real 7m29.245s
user 6m52.136s
sys 0m15.601s

memory_profiler.sh rails runner "TimeEntry.all.map { |entry| [entry.logged_on, entry.employee.last_name] }"
Peak: 3052592

Yes, you’re reading that correctly: 7.5 minutes and 3 gigs of RAM. We can improve performance somewhat by taking advantage of ActiveRecord’s eager loading capabilities.

dates = TimeEntry.includes(:employee).map { |entry| [entry.logged_on, entry.employee.last_name] }

Benchmarking this code, we see significant performance gains, since we’re going from over 300,000 SQL queries to two.

$ time rails runner "TimeEntry.includes(:employee).map { |entry| [entry.logged_on, entry.employee.last_name] }"
real 0m21.270s
user 0m19.396s
sys 0m1.174s

$ memory_profiler.sh rails runner "TimeEntry.includes(:employee).map { |entry| [entry.logged_on, entry.employee.last_name] }"
Peak: 1606204

Faster (from 7.5 minutes to 21 seconds), but certainly not fast enough. Finally, with .pluck:

dates = TimeEntry.includes(:employee).pluck(:logged_on, :last_name)

Benchmarks:

$ time rails runner "TimeEntry.includes(:employee).pluck(:logged_on, :last_name)"
real 0m4.180s
user 0m3.414s
sys 0m0.543s

$ memory_profiler.sh rails runner "TimeEntry.includes(:employee).pluck(:logged_on, :last_name)"
Peak: 407912

A hair over 4 seconds execution time and 400MB RAM – hardly any more expensive than without employee names.

Conclusion