Uniq for Array or Hash with a Deeply Nested Structure

Most people have had some experience with ruby’s built in #uniq method for Arrays. Internally, this method finds the unique items in the array by creating a hash internally, and this internal comparison is done with the #eql? method. If an item in the array is a Hash, then #eql? simply uses the object_id, generated by the #hash method, to determine whether it is equal to another object in the array. There are many solutions online each with s light variations and goals. I found myself in need of a uniq method for an array containing items in an arbitrarily deep nested structure (ie many sub-hashes and arrays).

I found myself working with some data in a hash generated from scraping a website. The website had some duplicate content and I didn’t feel like it was necessary to clutter my scraping code with tests for duplicate content while it was scraping. After scraping the data and turning it into a hash, I wished to remove the duplicates with some post-processing. I felt like this should be pretty simple, except that the elements I wanted to test for uniqueness contained a deeply nested structure of hashes and arrays. Here is an example an element in an array I wanted to uniqify:

{:name => 'bob', :posts => [{:title => 'First Post!' :comments => [{:commenter => 'Alice', :text => 'Hi bob!'}]}]}

The #uniq method does not work as expected when there is a Hash element in the array, it only uses the object_id to determine uniqueness. There are some solutions online for dealing with this.

If you know that each element inside the array is a hash with no sub-hashes you can use the solution presented in the answer of this stackoverflow.com question. The first solution uses #inject to build a new array with only unique items. Whether an item is included in the array already or not is determined with the Hash’s #include method.

a.inject([]) { |result,h| result << h unless result.include?(h); result }

The second solution presented in the same post:

a.map {|h| h.to_a[0]}.uniq.map {|k,v| {k => v}}

This flattens all of your hashes into an array. It then calls uniq on the resulting array after all the hashes are flattened. This means that for an item to be unique it must have the same key AND value. After finding the unique elements, it then maps the items back into a hash.

In this post, Mike Burns solve this problem by redefining #hash and #eql?. If you are doing this kind of comparison often, this would probably be a better approach. He does note though, that by doing this the #dup method no longer works as expected.

So it was obvious through searching that I was going to have to figure this out on my own. I was a little worried that I would have to implement an ugly recursive solution that searches the entire tree to find all duplicates. While looking into this solution, I felt that it would be very nice if #uniq would accept a block that would allow me to return true if the given parameter is a duplicate, thus removing it from the returned array. This is not implemented in my current ruby version (1.8.6), but there is an example of how to do this on this blog post. Hopefully, more (all) of these methods in the ruby core will accept blocks someday.

I began to write a method that would search my tree-like structure of hashes and arrays and I felt like it was a very ugly solution. I wanted something simple that would just take a line or two and get the job done. So I showed an example of the problem to my non-computer scientist wife and she said, “why don’t you just compare how the two lists look when printed out?” I was a little shocked that I hadn’t thought of this. It isn’t the most efficient solution and it relies on the #inspect method, which is probably pretty slow because it has to determine the structure of all the hashes in order to print them in a readable format. The old solution I was cooking up would also have to traverse the tree, similar to what I expected #inspect was doing. In addition to the #inspect call, there would be a string comparison for each element. I was willing to accept both of these for a short elegant solution to my problem.

The issue that came to my mind first was whether hashes get printed out the same way every time. In Ruby 1.9, it seems that Hashes are going to be stored in the same order that they were entered. In previous versions, I’m assuming that they are stored in the order of their generated hash. This could cause a problem if the data someone was entering could be in different orders. For my purposes though, the data would be entered in the same order if it was a duplicate.

So, after all of that, to uniqify an array containing a deeply nested structure, I present the following solution:

class Array
  def nested_uniq
    self.delete_if {|item| (self - [item]).any? {|other_item| item.inspect == other_item.inspect}}
  end
  def nested_uniq!
    self.replace self.nested_uniq
  end
end

I’ll walk through it because it may seem a bit hard to understand at first. I go through every element in the array with delete_if to determine if the element should be deleted from the array. Then, for all elements except the one in question, I check to see if #any? of them are equivalent to the element in question. This equivalence is determined by doing a string comparison of what the #inspect method returns for that element.

Here is an IRB test

>>  test = [{:hello => '5', :test => [{:hello => '10', :hell => '5'}, {:yeah => '2'}]}, {:to => '1', :yeah => '8'}, {:hello => '5', :test => [{:hello => '10', :hell => '5'}, {:yeah => '2'}]}]
=> [{:test=>[{:hello=>"10", :hell=>"5"}, {:yeah=>"2"}], :hello=>"5"}, {:to=>"1", :yeah=>"8"}, {:test=>[{:hello=>"10", :hell=>"5"}, {:yeah=>"2"}], :hello=>"5"}]
>> test.delete_if {|item| (test - [item]).any? {|other_item| item.inspect == other_item.inspect}}
=> [{:to=>"1", :yeah=>"8"}, {:test=>[{:hello=>"10", :hell=>"5"}, {:yeah=>"2"}], :hello=>"5"}]

Hope this is helpful to someone! I am very interested to hear suggestions and comments on alternative ways to uniqify an array with elements that have a nested structure.

-Chase

blog comments powered by Disqus
© Chasing • Powered by Wordpress • Using the Swiss Cool theme.