Diving into Hash#merge, Hash#merge! and Hash#update in Ruby
Recently I had the opportunity to use and explore Hash#merge and Hash#merge! (aka Hash#update).
These methods are surprisingly powerful, and if you work with hashes that need to be merged with other hashes, their flexibility is worth learning about.
Possibilities
The documentation offers more thorough explanations, but both methods, as their names imply, allow you to merge one or more hashes into an initial one. Hash#merge! modifies the original hash in place, whereas Hash#merge returns a modified copy of the original hash.
Both methods otherwise behave similarly: they can receive a list of hashes and a block.
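A minimal sketch of the difference between the two:

```ruby
original = { a: 1 }

copy = original.merge(b: 2) # returns a new hash
copy     # => { a: 1, b: 2 }
original # => { a: 1 } (untouched)

original.merge!(b: 2) # modifies and returns the receiver itself
original # => { a: 1, b: 2 }
```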
- If no hash is given: Hash#merge returns a copy of itself and Hash#merge! returns self.
- If hashes are given, they are merged into the original (or copied) hash from left to right, with the newest key-value pair replacing the oldest one, unless:
- a block is given, in which case, whenever a key is present in more than one hash (including self), the block is executed and its return value becomes that key's value.
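For example (note that passing multiple hashes to merge in a single call requires Ruby 2.6+):

```ruby
defaults = { host: "localhost", port: 80 }

# Without a block, the rightmost hash wins for duplicated keys:
defaults.merge({ port: 8080 }, { port: 3000 })
# => { host: "localhost", port: 3000 }

# With a block, the block's return value is used for duplicated keys:
defaults.merge({ port: 8080 }, { port: 3000 }) { |_key, old, new| [old, new].min }
# => { host: "localhost", port: 80 }
```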
Caveats (discovered painfully)
- Merge does not take the default values of hashes into account
- The time complexity of these methods (as discussed below) can grow quickly if you are not careful
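The first caveat in action: merge copies the receiver, so the receiver's own default value survives, but the default of the merged-in hash is silently dropped:

```ruby
counts = Hash.new(0)        # default value of 0 for missing keys
merged = counts.merge(a: 1)
merged.default              # => 0 (the receiver's default survives the dup)

{ a: 1 }.merge(Hash.new(0)).default # => nil (the argument's default is ignored)
```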
Dissecting the methods
Hash#merge
The Hash class, like most of the Ruby core, is implemented in C; here is the current Hash#merge implementation in C:
static VALUE
rb_hash_merge(int argc, VALUE *argv, VALUE self)
{
    return rb_hash_update(argc, argv, rb_hash_dup(self));
}
As you can see, this will create a copy of the hash the method is called on:
rb_hash_dup(self)
and then call rb_hash_update, keeping the original arguments but changing self to be the copy.
If we were to pseudo-Ruby it (yes, I made this up, but hey, I am writing, so I get to do that):
def merge(*other_hashes, &block)
  copy = self.dup
  copy.merge!(*other_hashes, &block)
end
As you can see, Hash#merge calls Hash#merge! so let us take a deep dive into:
Hash#merge!
The current implementation of Hash#merge! in C is:
static VALUE
rb_hash_update(int argc, VALUE *argv, VALUE self)
{
    int i;
    bool block_given = rb_block_given_p();

    rb_hash_modify(self);
    for (i = 0; i < argc; i++) {
        VALUE hash = to_hash(argv[i]);
        if (block_given) {
            rb_hash_foreach(hash, rb_hash_update_block_i, self);
        }
        else {
            rb_hash_foreach(hash, rb_hash_update_i, self);
        }
    }
    return self;
}
This code looks complicated, so let’s dive in and dissect what is happening:
int i;                                 /* declaring an integer for our later loop */
bool block_given = rb_block_given_p(); /* has a block been passed or not? */
These two variables above are pretty straightforward. From here on, the fun stuff starts:
rb_hash_modify(self)
will check whether the hash has previously been frozen.
If that check passes, we get to the meat of the method:
for (i = 0; i < argc; i++) {        /* loops through the argument hashes */
    VALUE hash = to_hash(argv[i]);  /* ensures we are dealing with a hash */
    if (block_given) {
        /*
         * loops through the new hash and
         * adds new keys at the end of the original hash;
         * if a key is duplicated, it executes the block
         */
        rb_hash_foreach(hash, rb_hash_update_block_i, self);
    }
    else {
        /*
         * loops through the new hash and
         * adds new keys at the end of the original hash;
         * if a key is duplicated, it replaces it
         */
        rb_hash_foreach(hash, rb_hash_update_i, self);
    }
}
Before we start discussing how cool this is, let’s pseudo-Ruby this method.
def merge!(*other_hashes, &block)
  raise FrozenError if frozen?

  other_hashes.each do |new_hash|
    new_hash.each_key do |key|
      if block_given? && key?(key)
        self[key] = yield(key, self[key], new_hash[key])
      else
        self[key] = new_hash[key]
      end
    end
  end

  self
end
def update(*other_hashes, &block)
  merge!(*other_hashes, &block)
end
This implementation is ingenious; it avoids iterating through both hashes, since it only needs to walk the incoming hashes' keys: a duplicated key is replaced, and a key that does not exist yet is simply added at the end of the hash.
By doing that, Ruby avoids checks at the block level and only calls the block when it is needed. As discussed below, it also keeps the time complexity down, since you are not iterating unnecessarily over multiple hashes, only over the hashes you need to.
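You can see this at work by counting block invocations: the block runs only for duplicated keys, not once per key:

```ruby
calls = 0
result = { a: 1, b: 2 }.merge(b: 20, c: 30) { |_key, old, _new| calls += 1; old }
calls  # => 1 (only :b exists in both hashes)
result # => { a: 1, b: 2, c: 30 }
```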
Time complexity
This Stack Overflow answer does an excellent job explaining the time complexity of Hash#merge for merging two hashes, but let's try to work out the time complexity of merging multiple hashes.
We are only looking at big O notation here (an upper bound on the theoretical time this function can take).
For my mental clarity, I will do it based on our pseudo-Ruby:
def merge(*other_hashes, &block)
  copy = self.dup                    # => constant
  copy.merge!(*other_hashes, &block) # => we shall see below!
end

def merge!(*other_hashes, &block)
  raise FrozenError if frozen?       # => constant

  other_hashes.each do |new_hash|    # => number of hashes = m
    new_hash.each_key do |key|       # => number of keys in the longest hash = n
      if block_given? && key?(key)                       # => constant
        self[key] = yield(key, self[key], new_hash[key]) # => we will assume a constant operation here*
      else
        self[key] = new_hash[key]    # => constant
      end
    end
  end

  self
end
If we write a formula for the longest path, it gives us:
constant + (constant + (m * (constant + (n * constant))))
We can ignore the constants, which leaves us with O(m*n).
I keep m and n separate since the number of hashes should, in most cases, be small compared to the number of keys; but for the purists out there, our time complexity is O(n²).
We can learn from this analysis that if we want to optimize for time complexity (i.e., the lowest number of iterations), we should make the largest hash the receiver (the hash the method is called on), so we only iterate over the smaller hashes.
However, we would need to explore the consequences of adding a block to change the default behavior (i.e., the new values overwriting the old ones).
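As a sketch of that trade-off: making the large hash the receiver means iterating over fewer keys, and a block that keeps the receiver's value restores the original precedence, so both forms produce the same result:

```ruby
small = { a: 1, b: 2 }
big   = { b: 3, c: 4, d: 5 }

# Baseline: iterates over big's keys; argument values win for duplicates.
baseline = small.merge(big)

# Optimized: iterates over small's keys instead; the block keeps the
# receiver's (big's) value for duplicates, preserving the same precedence.
optimized = big.merge(small) { |_key, old, _new| old }

baseline == optimized # => true (Hash equality ignores insertion order)
```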
Conclusion
The methods Hash#merge and Hash#merge! (aka Hash#update) are surprisingly powerful, and if you work with hashes that need to be merged with other hashes, they are a great tool to have in your tool belt.
Be careful not to overdo it, though: the time complexity can grow pretty rapidly, but as shown above, you can optimize it.
Please do not hesitate to correct me or ask any questions you might have, and thank you for your attention :)