Chunky Iterator: So You Don’t Have to Load All Your AR Objects at Once
Posted: May 12th, 2008 | Author: Daniel Higginbotham | Filed under: Rails
The following code lets you iterate over large collections of Active Record objects without having to load them all at once, thus reducing memory usage. It’s allowed me to run cron jobs which iterate over thousands of records without getting the cron’d process killed for using too much of a system’s resources.
class ChunkyIterator
  include Enumerable

  def initialize(model_class, chunk_size, options = {})
    @model_class = model_class
    @chunk_size = chunk_size
    @options = options
  end

  # Yields every matching record, loading at most @chunk_size rows per query.
  def each
    rows = @model_class.find(:all, merged_options(0))
    until rows.empty?
      rows.each { |record| yield record }
      # Resume after the highest id seen so far.
      rows = @model_class.find(:all, merged_options(rows.last.id))
    end
  end

  def merged_options(id)
    @options.merge(
      :conditions => merge_conditions("#{@model_class.table_name}.id > #{id}"),
      # Order by id so rows.last.id is the largest id in the chunk.
      # Note that this overrides any caller-supplied :order.
      :order => "#{@model_class.table_name}.id",
      :limit => @chunk_size
    )
  end

  # Combines the caller's :conditions (nil, String, or Array) with the id condition.
  def merge_conditions(added_condition)
    existing_condition = @options[:conditions]
    case existing_condition
    when nil then added_condition
    when String then "(#{existing_condition}) AND (#{added_condition})"
    when Array
      ["(#{existing_condition[0]}) AND (#{added_condition})"] +
        existing_condition[1..-1]
    end
  end
end
# Example
Bacon.find_all_in_chunks(500, :conditions => "fresh = TRUE").each do |bacon|
  bacon.feed_to_cat
end
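The example calls find_all_in_chunks, which the class above does not define. Here is a minimal sketch of how it might be wired up. The module name ChunkyFinder and the choice to extend ActiveRecord::Base are assumptions for illustration, not necessarily how it is done on the original site.

```ruby
# Hypothetical glue code: the module name ChunkyFinder is an assumption,
# not something from the original post.
module ChunkyFinder
  # Returns an Enumerable wrapper; no query runs until it is iterated.
  def find_all_in_chunks(chunk_size, options = {})
    ChunkyIterator.new(self, chunk_size, options)
  end
end

# In an initializer, after ChunkyIterator has been loaded:
# ActiveRecord::Base.extend(ChunkyFinder)
```

Because ChunkyIterator includes Enumerable, the returned object also responds to map, select, and friends, each of which still loads only one chunk at a time.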
Update: altered code to use ID rather than offset, like Jamis Buck does.
Update 2: Fixed merge_conditions per Frank’s observation
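The parentheses matter because AND binds more tightly than OR. Ruby's && and || follow the same precedence, so the effect can be sketched with plain booleans. The row values below are made up for illustration.

```ruby
# Hypothetical row: it matches "fresh = TRUE" but its id is already
# behind the cursor (last_id).
fresh   = true
amount  = 1
id      = 3
last_id = 10

# Unparenthesized: parsed as fresh OR (amount > 5 AND id > last_id).
without_parens = fresh || amount > 5 && id > last_id

# Parenthesized, the way merge_conditions now builds the condition.
with_parens = (fresh || amount > 5) && id > last_id

puts without_parens  # true: the row still matches despite its old id
puts with_parens     # false: the row is correctly excluded
```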

I’ve been thinking about something like this for a couple of months. Nice implementation. I know it’s pretty small, but have you any thoughts or plans to spin it into a plugin?
It would be even better if you could wrap this around the regular find command so that it could be used with any AR.find set.
I hadn’t thought of adding it as a plugin, though I guess it would be just as easy to do that as it would not to :) Right now, on designerpages.com, this code is in an initializer.
I’m leery of changing the find method itself. What other AR.find sets were you thinking about?
From Jamis Buck, about a year ago, a similar solution:
http://weblog.jamisbuck.org/2007/4/6/faking-cursors-in-activerecord
The problem with solutions with LIMIT and OFFSET is that as your OFFSET gets big, query time increases. Your database basically needs to seek through all the rows before the offset until it gets to the beginning of the range you want.
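To make the difference concrete, here is a rough sketch of the two query shapes. The helper names and the bacons table are hypothetical. With an index on id, the second form lets the database start reading at the right row instead of walking past everything before the offset.

```ruby
# Page N via OFFSET: the database has to step over page * chunk_size rows first.
def offset_page_sql(table, chunk_size, page)
  "SELECT * FROM #{table} ORDER BY id LIMIT #{chunk_size.to_i} OFFSET #{page.to_i * chunk_size.to_i}"
end

# Next page via the last id seen: the id condition can be answered from the index.
def keyset_page_sql(table, chunk_size, last_id)
  "SELECT * FROM #{table} WHERE id > #{last_id.to_i} ORDER BY id LIMIT #{chunk_size.to_i}"
end

puts offset_page_sql("bacons", 500, 2)
# SELECT * FROM bacons ORDER BY id LIMIT 500 OFFSET 1000
puts keyset_page_sql("bacons", 500, 1000)
# SELECT * FROM bacons WHERE id > 1000 ORDER BY id LIMIT 500
```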
Looks good. However, I’d DRY it up by using a do-while-do loop, so as to avoid duplicating the find call. Unfortunately, the only common language that supports that control structure (as far as I know) is Smalltalk, so you’d have to simulate it with code like:
do
find
if not found
break
yield
increment
loop
///ark
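In Ruby, that pseudocode maps fairly directly onto loop with an early break. The sketch below uses a hypothetical fetch_chunk lambda and an in-memory array in place of the real find call, so it runs without a database.

```ruby
# fetch_chunk is a hypothetical callable: given the last id seen, it returns
# the next chunk of rows (objects responding to #id), ordered by id.
def each_in_chunks(fetch_chunk)
  last_id = 0
  loop do
    rows = fetch_chunk.call(last_id)  # find
    break if rows.empty?              # if not found, break
    rows.each { |row| yield row }     # yield
    last_id = rows.last.id            # increment
  end
end

# Stand-in for a table with ids 1 through 7, fetched three rows at a time.
Row = Struct.new(:id)
table = (1..7).map { |i| Row.new(i) }
fetch = lambda { |last_id| table.select { |r| r.id > last_id }.first(3) }

seen = []
each_in_chunks(fetch) { |row| seen << row.id }
p seen  # [1, 2, 3, 4, 5, 6, 7]
```

The find call now appears only once, at the cost of an explicit break in the middle of the loop.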
Looks a lot like my all_records plugin. Seems like this is a pretty universal need, and maybe should be in Rails?
This solution can miss some records if inserts or deletes happen at the same time (even within the block that processes the records). Imagine you iterate in order to update or delete: on the next iteration, the offset will jump past the point where the last chunk ended.
Better to stick to Jamis Buck’s solution:
http://weblog.jamisbuck.org/2007/4/6/faking-cursors-in-activerecord
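A small in-memory simulation of that failure mode, assuming each processed row is deleted as it is visited. The method names and the Record struct are made up for this sketch, and an array stands in for the table.

```ruby
Record = Struct.new(:id)

# Pages with LIMIT/OFFSET semantics while deleting each processed row.
def delete_all_with_offset(table, chunk_size)
  processed = []
  offset = 0
  loop do
    rows = table.sort_by(&:id).drop(offset).first(chunk_size)
    break if rows.empty?
    rows.each { |r| processed << r.id; table.delete(r) }
    offset += chunk_size
  end
  processed
end

# Pages by "id > last id seen" while deleting each processed row.
def delete_all_with_last_id(table, chunk_size)
  processed = []
  last_id = 0
  loop do
    rows = table.select { |r| r.id > last_id }.sort_by(&:id).first(chunk_size)
    break if rows.empty?
    rows.each { |r| processed << r.id; table.delete(r) }
    last_id = rows.last.id
  end
  processed
end

make_table = lambda { (1..6).map { |i| Record.new(i) } }
p delete_all_with_offset(make_table.call, 2)   # [1, 2, 5, 6]
p delete_all_with_last_id(make_table.call, 2)  # [1, 2, 3, 4, 5, 6]
```

With offsets, deleting rows 1 and 2 shifts rows 3 and 4 into the positions the next query skips, so they are never visited. The id-based version is unaffected because it only asks for rows past the last id it saw.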
Recent versions of the will_paginate plugin include a method called paginated_each to do something similar.
http://groups.google.com/group/will_paginate/browse_thread/thread/59856834eb52b4d0
Thanks for the great comments. I didn’t realize that with limit/offset, query time can increase as offset gets larger. Thanks as well for the link to Jamis Buck’s solution - it should be easy to combine what I have here with his solution so that the “cursored” find can take the other find options.
That’s great that will_paginate has a method that will do this - before I wrote this post I was actually thinking that it would make sense to extend will_paginate to do this. Looking through the code, it looks like the ActiveRecord#paginate method uses offset, so I guess paginated_each would also take increasing amounts of time.
Looks like I had to do this again today, iterate/update over a large set. So I threw together a rake task to do such, but took a different route. Should be db independent, and doesn’t depend on offsets.
I’m not saying it’s better, just a different approach. Probably slower if a large percentage of the records need updates.
http://blog.spoolz.com/2008/05/13/butt-biter-fix-state-abbreviations-large-data-set-updates/
Shouldn’t the existing_condition in merge_conditions be enclosed in parentheses? If existing_condition were, for example,
"fresh = TRUE OR amount > 5"
the correct merge should be
"(fresh = TRUE OR amount > 5) AND #{@model_class.table_name}.id > #{id}"
whereas your code produces
"fresh = TRUE OR amount > 5 AND #{@model_class.table_name}.id > #{id}"
Good catch Frank, thanks!