Chunky Iterator: So You Don’t Have to Load All Your AR Objects at Once

Posted: May 12th, 2008 | Author: Daniel Higginbotham | Filed under: Rails

The following code lets you iterate over large collections of Active Record objects without having to load them all at once, thus reducing memory usage. It’s allowed me to run cron jobs which iterate over thousands of records without getting the cron’d process killed for using too much of a system’s resources.

class ChunkyIterator
  include Enumerable
  def initialize(model_class, chunk_size, options = {})
    @model_class = model_class
    @chunk_size = chunk_size
    @options = options
  end

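  def each
    # Fetch the first chunk: rows with id > 0, at most @chunk_size of them.
    rows = @model_class.find(:all, merged_options(0))

    until rows.empty?
      rows.each { |record| yield record }
      # Fetch the next chunk, starting just past the last id seen.
      rows = @model_class.find(:all, merged_options(rows.last.id))
    end
  end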

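  def merged_options(id)
    # Order by id so that rows.last.id is the highest id in the chunk.
    # Note that this replaces any :order passed in the options.
    @options.merge(
      :conditions => merge_conditions("#{@model_class.table_name}.id > #{id}"),
      :order => "#{@model_class.table_name}.id",
      :limit => @chunk_size
    )
  end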

  # Combines the caller's :conditions (nil, String, or Array form) with the
  # id condition, parenthesizing both sides so an OR cannot leak out.
  def merge_conditions(added_condition)
    existing_condition = @options[:conditions]
    case existing_condition
    when nil then added_condition
    when String then "(#{existing_condition}) AND (#{added_condition})"
    when Array
      ["(#{existing_condition[0]}) AND (#{added_condition})"] +
        existing_condition[1..-1]
    end
  end
end

# Example
Bacon.find_all_in_chunks(500, :conditions => "fresh = TRUE").each do |bacon|
  bacon.feed_to_cat
end
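
The example calls find_all_in_chunks, which the listing does not define. One way it might be wired up is sketched below. The ChunkyFinder module name is hypothetical, and the Struct is only a stand-in so the sketch runs without Rails; in a real app the ChunkyIterator class from this post would be used, and the module would presumably be mixed into ActiveRecord::Base from an initializer.

```ruby
# Stand-in so this sketch runs on its own; in a real app, ChunkyIterator is
# the class from the post and this branch is skipped.
unless defined?(ChunkyIterator)
  ChunkyIterator = Struct.new(:model_class, :chunk_size, :options)
end

# Hypothetical mixin: adds find_all_in_chunks as a class-level method.
module ChunkyFinder
  def find_all_in_chunks(chunk_size, options = {})
    ChunkyIterator.new(self, chunk_size, options)
  end
end

# In a Rails initializer this would presumably be:
#   ActiveRecord::Base.extend(ChunkyFinder)
# Here a plain class stands in for a model.
class Bacon
  extend ChunkyFinder
end

ITERATOR = Bacon.find_all_in_chunks(500, :conditions => "fresh = TRUE")
```

Because the method only builds the iterator, no query runs until each (or another Enumerable method) is called on the result.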

Update: altered code to use ID rather than offset, like Jamis Buck does.
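
To see why walking by ID is safer than walking by offset, here is a small sketch that needs no database. FakeTable, Row, and the two visit methods are made up for illustration and only imitate the two query styles in memory. Each walk deletes every row it visits, which is the situation where offsets go wrong.

```ruby
# Hypothetical in-memory stand-in for a table; not part of Active Record.
Row = Struct.new(:id)

class FakeTable
  def initialize(ids)
    @rows = ids.map { |id| Row.new(id) }
  end

  # Imitates "ORDER BY id LIMIT limit OFFSET offset".
  def by_offset(offset, limit)
    @rows.sort_by(&:id)[offset, limit] || []
  end

  # Imitates "WHERE id > last_id ORDER BY id LIMIT limit".
  def after_id(last_id, limit)
    @rows.select { |r| r.id > last_id }.sort_by(&:id).first(limit)
  end

  def delete(id)
    @rows.reject! { |r| r.id == id }
  end
end

# Walk by offset, deleting each row as it is visited.
def visit_by_offset(table, chunk_size)
  seen = []
  offset = 0
  loop do
    chunk = table.by_offset(offset, chunk_size)
    break if chunk.empty?
    chunk.each { |row| seen << row.id; table.delete(row.id) }
    offset += chunk_size
  end
  seen
end

# Walk by "id greater than the last one seen", deleting each row as visited.
def visit_by_id(table, chunk_size)
  seen = []
  last_id = 0
  loop do
    chunk = table.after_id(last_id, chunk_size)
    break if chunk.empty?
    chunk.each { |row| seen << row.id; table.delete(row.id) }
    last_id = chunk.last.id
  end
  seen
end

OFFSET_SEEN = visit_by_offset(FakeTable.new((1..6).to_a), 2)
ID_SEEN = visit_by_id(FakeTable.new((1..6).to_a), 2)

p OFFSET_SEEN  # => [1, 2, 5, 6]  (rows 3 and 4 are skipped)
p ID_SEEN      # => [1, 2, 3, 4, 5, 6]
```

The loop-with-break shape also keeps the fetch in one place instead of repeating it before and inside the loop.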
Update 2: Fixed merge_conditions per Frank’s observation
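
For reference, this is roughly what the merge produces for each supported form of :conditions. The sketch below is a standalone adaptation that takes the existing condition as an argument rather than reading @options, and "bacons" is an assumed table name.

```ruby
# Standalone adaptation of merge_conditions for illustration; the version in
# the class reads the existing condition from @options[:conditions].
def merge_conditions(existing_condition, added_condition)
  case existing_condition
  when nil then added_condition
  when String then "(#{existing_condition}) AND (#{added_condition})"
  when Array
    ["(#{existing_condition[0]}) AND (#{added_condition})"] +
      existing_condition[1..-1]
  end
end

added = "bacons.id > 500" # "bacons" is an assumed table name

NO_CONDITION = merge_conditions(nil, added)
STRING_CONDITION = merge_conditions("fresh = TRUE OR amount > 5", added)
ARRAY_CONDITION = merge_conditions(["fresh = ? AND amount > ?", true, 5], added)

puts NO_CONDITION      # bacons.id > 500
puts STRING_CONDITION  # (fresh = TRUE OR amount > 5) AND (bacons.id > 500)
p ARRAY_CONDITION      # ["(fresh = ? AND amount > ?) AND (bacons.id > 500)", true, 5]
```

Hash-style conditions fall through the case and return nil, so this sketch, like the original, only covers nil, String, and Array.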


14 Comments on “Chunky Iterator: So You Don’t Have to Load All Your AR Objects at Once”

  1. Karl said at 6:09 pm on May 12th, 2008:

    I’ve been thinking about something like this for a couple of months. Nice implementation. I know it’s pretty small, but have you any thoughts or plans to spin it into a plugin?

    It would be even better if you could wrap this around the regular find command so that it could be used with any AR.find set.

  2. Daniel Higginbotham said at 6:57 pm on May 12th, 2008:

    I hadn’t thought of adding it as a plugin, though I guess it would be just as easy to do that as it would not to :) Right now, on designerpages.com, this code is in an initializer.

    I’m leery of changing the find method itself. What other AR.find sets were you thinking about?

  3. Will said at 8:29 pm on May 12th, 2008:

    From Jamis Buck, about a year ago, a similar solution:
    http://weblog.jamisbuck.org/2007/4/6/faking-cursors-in-activerecord

  4. Brandon Dimcheff said at 8:42 pm on May 12th, 2008:

    The problem with solutions with LIMIT and OFFSET is that as your OFFSET gets big, query time increases. Your database basically needs to seek through all the rows before the offset until it gets to the beginning of the range you want.

  5. Mark Wilden said at 8:44 pm on May 12th, 2008:

    Looks good. However, I’d DRY it up by using a do-while-do loop, so as to avoid duplicating the find call. Unfortunately, the only common language that supports that control structure (as far as I know) is Smalltalk, so you’d have to simulate it with code like:

    do
    find
    if not found
    break
    yield
    increment
    loop

    ///ark

  6. Peter Jones said at 9:29 pm on May 12th, 2008:

    Looks a lot like my all_records plugin. Seems like this is a pretty universal need, and maybe should be in Rails?

  7. Evgeniy said at 9:34 pm on May 12th, 2008:

    solution can miss some records if inserts/deletes happening at the same time (even within block to process) - imagine you iterate to update or delete, on the next iteration - offset will jump over the last end

    better to stick to Jamis Buck solution
    http://weblog.jamisbuck.org/2007/4/6/faking-cursors-in-activerecord

  8. Grant Hutchins said at 9:44 pm on May 12th, 2008:

    Recent versions of the will_paginate plugin include a method called paginated_each to do something similar.

    http://groups.google.com/group/will_paginate/browse_thread/thread/59856834eb52b4d0

  9. Daniel Higginbotham said at 8:42 am on May 13th, 2008:

    Thanks for the great comments. I didn’t realize that with limit/offset, query time can increase as offset gets larger. Thanks as well for the link to Jamis Buck’s solution - it should be easy to combine what I have here with his solution so that the “cursored” find can take the other find options.

    That’s great that will_paginate has a method that will do this - before I wrote this post I was actually thinking that it would make sense to extend will_paginate to do this. Looking through the code, it looks like the ActiveRecord#paginate method uses offset, so I guess paginated_each would also take increasing amounts of time.

  10. links for 2008-05-13, or so says Harry Love said at 11:31 am on May 13th, 2008:

    […] Flying Machine Studios » Blog Archive » Chunky Iterator: So You Don’t Have to Load All Your AR O… Iterate over a collection of AR objects in groups of n without having to load them all at once (tags: rubyonrails activerecord iterators) Tags: Found Objects […]

  11. Karl said at 7:03 pm on May 13th, 2008:

    Looks like I had to do this again today, iterate/update over a large set. So I threw together a rake task to do such, but took a different route. Should be db independent, and doesn’t depend on offsets.

    I’m not saying it’s better, just a different approach. Probably slower if a large percentage of the records need updates.

    http://blog.spoolz.com/2008/05/13/butt-biter-fix-state-abbreviations-large-data-set-updates/

  12. This Week in Ruby (May 20, 2008) | Zen and the Art of Programming said at 5:20 pm on May 20th, 2008:

    […] remarkable articles from this week were: Chunky Iterator: So You Don’t Have to Load All Your AR Objects at Once, Do we really need Controller and View tests?, Guide to Unobtrusive JavaScript and SWFUpload, […]

  13. Frank Kuepper said at 12:26 am on May 21st, 2008:

    Shouldn’t the existing_condition in merge_conditions be enclosed in braces? If existing_condition for example was
    "fresh = TRUE OR amount > 5"
    the correct merge should be
    "(fresh = TRUE OR amount > 5) AND #{@model_class.table_name}.id > #{id}"
    whereas your code produces
    "fresh = TRUE OR amount > 5 AND #{@model_class.table_name}.id > #{id}"

  14. Daniel Higginbotham said at 9:15 am on May 23rd, 2008:

    Good catch Frank, thanks!

