Elsewhere > Use .pluck If You Only Need a Subset of Model Attributes
Posted 2014-08-20 on viget.com
Despite some exciting advances in the field, like Node, Redis, and Go, a well-structured relational database fronted by a Rails or Sinatra (or Django, etc.) app is still one of the most effective toolsets for building things for the web. In the coming weeks, I’ll be publishing a series of posts about how to be sure that you’re taking advantage of all your RDBMS has to offer.
IF YOU ONLY REQUIRE a few attributes from a table, rather than
instantiating a collection of models and then running a .map
over them
to get the data you need, it’s much more efficient to use .pluck
to
pull back only the attributes you need as an array. The benefits are
twofold: better SQL performance and less time and memory spent in
Rubyland.
To illustrate, let’s use an app I’ve been working on that takes
Harvest data and generates reports. As a
baseline, here is the execution time and memory usage of rails runner
with a blank instruction:
$ time rails runner ""
real 0m2.053s
user 0m1.666s
sys 0m0.379s
$ memory_profiler.sh rails runner ""
Peak: 109240
In other words, it takes about two seconds and 100MB to boot up the app. We calculate memory usage with a modified version of this Unix script.
Now, consider a TimeEntry model in our time tracking application (of which there are 314,420 in my local database). Let’s say we need a list of the dates of every single time entry in the system. A naïve approach would look something like this:
dates = TimeEntry.all.map { |entry| entry.logged_on }
It works, but seems a little slow:
$ time rails runner "TimeEntry.all.map { |entry| entry.logged_on }"
real 0m14.461s
user 0m12.824s
sys 0m0.994s
Almost 14.5 seconds. Not exactly webscale. And how about RAM usage?
$ memory_profiler.sh rails runner "TimeEntry.all.map { |entry| entry.logged_on }"
Peak: 1252180
About 1.25 gigabytes of RAM. Now, what if we use .pluck
instead?
dates = TimeEntry.pluck(:logged_on)
In terms of time, we see major improvements:
$ time rails runner "TimeEntry.pluck(:logged_on)"
real 0m4.123s
user 0m3.418s
sys 0m0.529s
So from roughly 15 seconds to about four. Similarly, for memory usage:
$ memory_profiler.sh bundle exec rails runner "TimeEntry.pluck(:logged_on)"
Peak: 384636
From 1.25GB to less than 400MB. When we subtract the overhead we calculated earlier, we’re going from 15 seconds of execution time to two, and 1.15GB of RAM to 300MB.
Using SQL Fragments
As you might imagine, there’s a lot of duplication among the dates on which time entries are logged. What if we only want unique values? We’d update our naïve approach to look like this:
dates = TimeEntry.all.map { |entry| entry.logged_on }.uniq
When we profile this code, we see that it performs slightly worse than the non-unique version:
$ time rails runner "TimeEntry.all.map { |entry| entry.logged_on }.uniq"
real 0m15.337s
user 0m13.621s
sys 0m1.021s
$ memory_profiler.sh rails runner "TimeEntry.all.map { |entry| entry.logged_on }.uniq"
Peak: 1278784
Instead, let’s take advantage of .pluck
’s ability to take a SQL
fragment rather than a symbolized column name:
dates = TimeEntry.pluck("DISTINCT logged_on")
Profiling this code yields surprising results:
$ time rails runner "TimeEntry.pluck('DISTINCT logged_on')"
real 0m2.133s
user 0m1.678s
sys 0m0.369s
$ memory_profiler.sh rails runner "TimeEntry.pluck('DISTNCT logged_on')"
Peak: 107984
Both running time and memory usage are virtually identical to executing the runner with a blank command, or, in other words, the result is calculated at an incredibly low cost.
Using .pluck
Across Tables
Requirements have changed, and now, instead of an array of timestamps, we need an array of two-element arrays consisting of the timestamp and the employee’s last name, stored in the “employees” table. Our naïve approach then becomes:
dates = TimeEntry.all.map { |entry| [entry.logged_on, entry.employee.last_name] }
Go grab a cup of coffee, because this is going to take awhile.
$ time rails runner "TimeEntry.all.map { |entry| [entry.logged_on, entry.employee.last_name] }"
real 7m29.245s
user 6m52.136s
sys 0m15.601s
memory_profiler.sh rails runner "TimeEntry.all.map { |entry| [entry.logged_on, entry.employee.last_name] }"
Peak: 3052592
Yes, you’re reading that correctly: 7.5 minutes and 3 gigs of RAM. We can improve performance somewhat by taking advantage of ActiveRecord’s eager loading capabilities.
dates = TimeEntry.includes(:employee).map { |entry| [entry.logged_on, entry.employee.last_name] }
Benchmarking this code, we see significant performance gains, since we’re going from over 300,000 SQL queries to two.
$ time rails runner "TimeEntry.includes(:employee).map { |entry| [entry.logged_on, entry.employee.last_name] }"
real 0m21.270s
user 0m19.396s
sys 0m1.174s
$ memory_profiler.sh rails runner "TimeEntry.includes(:employee).map { |entry| [entry.logged_on, entry.employee.last_name] }"
Peak: 1606204
Faster (from 7.5 minutes to 21 seconds), but certainly not fast enough.
Finally, with .pluck
:
dates = TimeEntry.includes(:employee).pluck(:logged_on, :last_name)
Benchmarks:
$ time rails runner "TimeEntry.includes(:employee).pluck(:logged_on, :last_name)"
real 0m4.180s
user 0m3.414s
sys 0m0.543s
$ memory_profiler.sh rails runner "TimeEntry.includes(:employee).pluck(:logged_on, :last_name)"
Peak: 407912
A hair over 4 seconds execution time and 400MB RAM – hardly any more expensive than without employee names.
Conclusion
Prefer
.pluck
to instantiating a collection of ActiveRecord objects and then using.map
to build an array of attributes..pluck
can do more than simply pull back attributes on a single table: it can run SQL functions, pull attributes from joined tables, and tack on to any scope.Whenever possible, let the database do the heavy lifting.