Understanding and tweaking pg_search
A couple of days ago, I had to create a suggestion system using, among other things, a full-text search.
I quickly came across pg_search. And let me tell you, people: this gem is crazy good. It will allow you out of the box to do a full-text search against several columns or even associations. (You can also run the search against all your models).
I won’t tell you all the functionalities, but I can tell you there are enough of them for most usages. It reproduces quite closely what PostgreSQL allows in terms of full-text search
It does need some manipulation to function optimally, and if you want to use complex queries, the functionalities might not be enough.
Part 1 — How does it all work?
PostgreSQL full text search
If you want the full details about how it works, please do read the documentation
For a quicker and less accurate version, keep reading :)
PostgreSQL allows you to search a combination of words and some logical operators against a ts_vector column.
That ts_vector column is a column of key-value pairs that maps the position of your column’s words.
Example:
We have a Book model with a description column.
lotr = Book.new(name: ‘Lord of The Rings’, description: ‘Wizards, and elfs, and rings, and evil’)
the ts_vector version of lotr.description would be :
and: 2, 4, 6elf: 3evil: 7ring: 5wizard: 1
When we search full text, we can know if words are adjacent, if several words appear, etc., it also is alphabetically sorted, which makes finding hits quicker.
As we will see further on, there are ways to have the ts_vector columns created and updated on your DB instead of recreating them for each search.
pg_search
pg_search maps most of the capabilities of searching with PostgreSQL relatively easily. the documentation is quite straightforward, so I will not spend too much time there. We will have a peak ofthe insides further on.
In it’s most basic form it resembles this:
class Thing
pg_search_scope :search_by_title, against: :title
end
Part 2 — The case
(You will find all the code referred to in this article: here)
Let’s assume a standard rails app with three models (Book, Chapter, Genre).
class Book < ActiveRecord::Base has_many: :chapters has_one: :genre […]endclass Chapter < ActiveRecord::Base […]endclass Genre < ActiveRecord::Base […]end
We want to build a system that will allow users to search movies according to:
- a selection of sufficient keywords or sentences (If we find at least one of these, we suggest the film to the user)
- a selection of insufficient keywords or sentences (we must find a combination of them to recommend it)
Finally, we want to give the user a small explanation of why it ‘popped.’
The fun part begins:
Let’s assume that our BooksController has a search method where we will receive a hash of the different search options as such:
{ sufficient_keywords: ‘One, Harry Potter, Fantasy’, insufficient_keywords: ‘Big Familly, excursion, excavation’, insufficient_keyword_count: 3}class BooksController < ApplicationController def search [In here, we want to process the options and return the results with ‘highlights’] end […]end
To do this search, we must do several things:
- Create a string with the query for the ‘sufficient_keywords’ and ‘insufficient_keywords’
- Search for results that enter in the sufficient_keywords, discard them, and look for insufficient_keyword_hits
- Return the results of both searches with highlights
- Ideally, we would like to avoid taking capitalization and accents into account.
- Make our search fast.
These five steps sound simple, but sadly they are anything but as we will see.
But let’s start at the beginning:
Create a string with the query for the ‘sufficient_keywords’ and ‘insufficient_keywords’
Let’s extract that in a PORO (Plain Old Ruby Object)
# frozen_string_literal: trueclass QueryConstructor attr_reader :sufficient_query, :insufficient_query def initialize(options) @sufficient_keywords = options[:sufficient_keywords]&.split(',') @insufficient_keywords = options[:insufficient_keywords]&.split(',') @combination_size = options[:insufficient_keyword_count] end def sufficient_query return if @sufficient_keywords.blank?
clean_keywords = @sufficient_keywords.uniq.reject(&:empty?) clean_keywords.map! do |word| word.strip.gsub(' ', '<->') end clean_keywords.join(' ') end def insufficient_query return if @insufficient_keywords.blank?
clean_keywords = @insufficient_keywords.uniq.reject(&:empty?) clean_keywords.map! do |word| word.strip.gsub(' ', '<->') end
combinations = clean_keywords.combination(@combination_size.to_i).to_a combinations.map! do |combination| combination.join('&') end combinations.join(' ') endend
Ok, this looks complicated; let’s break it down.
def sufficient_query return if @sufficient_keywords.blank?
clean_keywords = @sufficient_keywords.uniq.reject(&:empty?) clean_keywords.map! do |word| word.strip.gsub(' ', '<->') end clean_keywords.join(' ')end
here is the most straightforward of the two query constructions.
We look if there are sentences, and we join them using ‘<->’; this will allow us to later look for adjacency.
We then join the different keywords with space into a single string. With the keywords of our example we would have as a result: ‘One Harry<->Potter Fantasy’
def insufficient_query return if @insufficient_keywords.blank?
clean_keywords = @insufficient_keywords.uniq.reject(&:empty?) clean_keywords.map! do |word| word.strip.gsub(‘ ‘, ‘<->’) end combinations = clean_keywords.combination(@combination_size.to_i).to_a combinations.map! do |combination| combination.join(‘&’) end combinations.join(‘ ‘)end
Here our logic is ‘similar,’ but we create all possible combinations of the given size. Thanks to this, we will have at least one combination in our results.
Here we are expecting ‘Big<->Familly&excursion Big<->Familly&excavation excursion&excavation’. If we look in our spec file, we will see that it is working as expected:
Search for results that enter in the sufficient_keywords, discard them, and look for insufficient_keyword_hits
To look for our words, we will create a PORO (Plain Old Ruby Object) called Book::Finder.
class Book::Finder attr_reader :sufficients, :insufficients def initialize(query_constructor) @query_constructor = query_constructor @sufficients = create_sufficients @insufficients = create_insufficients end def create_sufficients Book.book_search(@query_constructor.sufficient_query) end def create_insufficients Book.where.not(id: @sufficients.pluck(:id)).book_search(@query_constructor.insufficient_query) endend
Here we find books that satisfy the ‘sufficient’ condition on initialization. We then look up the rest of the books and search among them to meet the ‘insufficient’ requirement.
To understand Book.book_search
, we need to look at our Book
model.
In it, we have added:
pg_search_scope :book_search,against: [:title, :synopsis],associated_against: { chapters: :title, genre: :name },using: { tsearch: { any_word: true } }
If we look line by line :
- We created a new search scope called book search.
- We want to look into the Book columns: title and synopsis.
- We also look in the association chapter’s title column and the name column of the association genre.
- Finally, we want to use tsearch and join our conditions with an OR instead of an AND.
We still have a small problem at this step; the query does not understand adjacency or sentences. We construct our string with them, but a search such as this one (with my data, results might differ on your end):
Will return these results:
When we are expecting it to return empty
To solve this, we need to look into the pg_search gem and modify it to suit our needs.
Small tangent: modifying pg_search
To modify pg_search, we will fork to our repo and use it as a source moving forward. To do that we add gem 'pg_search', git: 'https://github.com/pablocm90/pg_search'
to our gemfile.
In there (and I might explain it in another article if there is interest), I tweaked a bit the ‘lib/pg_search/features/search.rb’ file by:
adding the following methods
def construct_ts_query(term) if term.include?(‘&’) ts_query_and(term) elsif term.include?(‘|’) ts_query_or(term) elsif term.include?(‘<->’) ts_query_consecutive(term) else tsquery_for_term(term) endend
def ts_query_consecutive(term) “(#{term.split(‘<->’).map { |term| construct_ts_query(term) }.join(‘ <-> ‘)})”end
def ts_query_and(term) “(#{term.split(‘&’).map { |term| construct_ts_query(term) }.join(‘ && ‘)})”end
def ts_query_or(term) “(#{term.split(‘|’).map { |term| construct_ts_query(term) }.join(‘ || ‘)})”end
and changing this method
def tsquery return “‘’” if query.blank? query_terms = query.split.compact tsquery_terms = query_terms.map { |term| construct_ts_query(term) } tsquery_terms.join(options[:any_word] ? ‘ || ‘ : ‘ && ‘)end
the only change in this method is the fact that I map the terms by calling construct_ts_query(term)
instead of tsquery_for_term(term)
Now that we have made the change to our gemfile, we can look for adjacency without trouble: the same query as before returns empty as it should.
Return the results of both searches with highlights
Here, one part is easy and the other not so much.
Returning the results of both sufficient and insufficient keywords is reasonably trivial with the code we have already done.
app/views/books/search.html.erb<%= form_with url: search_books_path, method: :get, local: true do |form| %> <div class=”field_wrapper”> <div class=”field”> <%= form.label :sufficient_keywords, ‘sufficient keywords (separate with a ,)’ %> <%= form.text_area :sufficient_keywords %> </div> <div class=”field”> <%= form.label :insufficient_keywords, ‘insufficient keywords (separate with a ,)’ %> <%= form.text_area :insufficient_keywords %> </div> <div class=”field”> <%= form.label :insufficient_keyword_count, ‘insufficient keyword count’ %> <%= form.number_field :insufficient_keyword_count %> </div> </div> <%= form.submit “Search” %><% end %>
<h2> Sufficient Keyword hits </h2><%= render partial: ‘book’, collection: @book_finder.sufficients %><br /><h2> Insufficient Keyword hits </h2><%= render partial: ‘book’, collection: @book_finder.insufficients %><br /><h3> All books </h3><%= render partial: ‘book’, collection: @books %>
coupled with :
app/views/books/_book.html.erb<hr><p> <strong><%= book.title %></strong> <em> <%= book.genre.name %></em></p><p> <%= book.sinopsys %></p><p><strong><em>chapters</em></strong></p><% book.chapters.each do |chapter| %> <p> <%= chapter.title %> </p><% end %><hr>
does the job reasonably well. It is not pretty, but it shall suffice for this exercise.
The problem comes when we try to use pg_highlight to our advantage.
Naively, by adding this:
highlight: { StartSel: ‘<b>’, StopSel: ‘</b>’, MaxWords: 10, MinWords: 4, ShortWord: 1, HighlightAll: true, MaxFragments: 6, FragmentDelimiter: ‘…’}
to our book_search
, we should have what we want. We now need to add with_pg_search_highlight
to our search methods…
And we run into an exception since pg_search does not support highlights in association searches. We can find a workaround.
If we create a new column to store all the data and have a before_save
callback to save the different data in it, we will have access to highlights just as we want it.
Let’s start by
rails g migration AddSearchColumnToBook search_column:text && rails db:migrate
Now we can create our callback and our method.
app/models/book.rbbefore_save :collate_search_informationdef collate_search_information search_column = “#{title} // #{sinopsys} // #{genre.name} // #{chapters.pluck(:title).join(‘ // ‘)}”end
After saving all the books, all books whould have a search_column, we can now change our `book_search` scope to be like this:
pg_search_scope :book_search,against: :search_column, using: { tsearch: { any_word: true, highlight: { StartSel: ‘<b>’, StopSel: ‘</b>’, MaxWords: 10, MinWords: 4, ShortWord: 1, HighlightAll: true, MaxFragments: 6, FragmentDelimiter: ‘…’ } }}
And our results will appear correctly. By using with_pg_search_highight and pg_search_highlight, we will be able to show the words that match our search.
app/views/books/_book.html.erb<hr><p> <strong><%= book.title %></strong> <em> <%= book.genre.name %></em></p><p> <%= book.sinopsys %></p><p><strong><em>chapters</em></strong></p><% book.chapters.each do |chapter| %> <p> <%= chapter.title %> </p><% end %><% if with_highlights %> <p><strong><em>Highlights</em></strong></p> <p> <%= book.pg_search_highlight.html_safe %> </p> <hr><%end %>class Book::Finder attr_reader :sufficients, :insufficients def initialize(query_constructor) @query_constructor = query_constructor @sufficients = create_sufficients @insufficients = create_insufficients end def create_sufficients Book.book_search(@query_constructor.sufficient_query).with_pg_search_highlight end def create_insufficients Book.where.not(id: @sufficients.pluck(:id)).book_search(@query_constructor.insufficient_query).with_pg_search_highlight endend
Ideally, we would like to avoid taking capitalization and accents into account.
Capitalization is taken care of automatically by pg_search, and we have an option to remove accents
In the seed, you will see that we create a book with weird accents.
Book.create(genre: Genre.first, title: ‘I àm weirdly accéntèd’, sinopsys: ‘I like piñatas’)
If we try to add the option ignoring: :accents
to our search, we will raise an ‘ActiveRecord::StatementInvalid’ since we have not installed the extension.
To install it, we will need to access our PostgreSQL console:
psql NAME_OF_YOUR_DB
, if you are using my repo: psql pg_searchin_article_development
and run this command in it:
CREATE EXTENSION unaccent;
[UNACCENT]
We can now search for a ‘sufficient’ keyword ‘accented,’ and we should have our book as a result.
We have accomplished our primary job. But searching uses a lot of memory and is time consuming. To speed up the process, we can create a ts_vector column instead of building it every time.
With it, we can avoid the to_tsvector
part of this query:
To do that, we will:
- create a configuration of unnaccent to use as our dictionary
- create a trigger to auto-update the ts_vector column
- search against the ts_vector column instead of the text column
To create a configuration of unaccent to use as our dictionary, we will need to open our PostgreSQL console again and:
CREATE TEXT SEARCH CONFIGURATION unaccented_english ( COPY = english );ALTER TEXT SEARCH CONFIGURATION unaccented_englishALTER MAPPING FOR hword, hword_part, wordWITH unaccent, english_stem;
Let’s create the migration to add the vector_search column:
rails g migration AddVectorSearchToBook vector_search:tsvector && rails db:migrate
And now we are ready to create our trigger. We open our PostgreSQL console and:
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATEON books FOR EACH ROWEXECUTE PROCEDURE tsvector_update_trigger(‘vector_search’, ‘public.unaccented_english’, ‘search_column’);
in order to check if our trigger has been created, we can run:
select event_object_schema as table_schema,event_object_table as table_name,trigger_schema,trigger_name,string_agg(event_manipulation, ‘,’) as event,action_timing as activation,action_condition as condition,action_statement as definitionfrom information_schema.triggersgroup by 1,2,3,4,6,7,8order by table_schema,table_name;
we should see our trigger in the console.
I recommend running rails db:seed
agin. That way, we don’t have to save all our books.
And the last step is now upon us! Let’s change the lookup column. Our scope should now look like this:
pg_search_scope :book_search, against: :search_column, ignoring: :accents, using: { tsearch: { any_word: true, dictionary: ‘unaccented_english’, tsvector_column: ‘vector_search’, highlight: { StartSel: ‘<b>’, StopSel: ‘</b>’, MaxWords: 10, MinWords: 4, ShortWord: 1, HighlightAll: true, MaxFragments: 6, FragmentDelimiter: ‘…’ } } }
And we can look up books by chapter, title, genre, or synopsis quickly. As a bonus, since we are using ‘english_stem’ as the base of our text search configuration, variations of the same word will also be considered. For instance: ‘accent’ and ‘accented’ will be the same.
I hope you enjoyed this long post. Please let me know if you have questions, feedback, criticism :D
Till next time,
Pablo