Understanding and tweaking pg_search

Pablo Curell Mompo
10 min readJan 31, 2021

A couple of days ago, I had to create a suggestion system using, among other things, a full-text search.

I quickly came across pg_search. And let me tell you, people: this gem is crazy good. It will allow you out of the box to do a full-text search against several columns or even associations. (You can also run the search against all your models).

I won’t tell you all the functionalities, but I can tell you there are enough of them for most usages. It reproduces quite closely what PostgreSQL allows in terms of full-text search

It does need some manipulation to function optimally, and if you want to use complex queries, the functionalities might not be enough.

Part 1 — How does it all work?

PostgreSQL full text search

If you want the full details about how it works, please do read the documentation

For a quicker and less accurate version, keep reading :)

PostgreSQL allows you to search a combination of words and some logical operators against a ts_vector column.

That ts_vector column is a column of key-value pairs that maps the position of your column’s words.

Example:

We have a Book model with a description column.

lotr = Book.new(name: ‘Lord of The Rings’, description: ‘Wizards, and elfs, and rings, and evil’)

the ts_vector version of lotr.description would be :

and: 2, 4, 6elf: 3evil: 7ring: 5wizard: 1

When we search full text, we can know if words are adjacent, if several words appear, etc., it also is alphabetically sorted, which makes finding hits quicker.

As we will see further on, there are ways to have the ts_vector columns created and updated on your DB instead of recreating them for each search.

pg_search

pg_search maps most of the capabilities of searching with PostgreSQL relatively easily. the documentation is quite straightforward, so I will not spend too much time there. We will have a peak ofthe insides further on.

In it’s most basic form it resembles this:

class Thing
pg_search_scope :search_by_title, against: :title
end

Part 2 — The case

(You will find all the code referred to in this article: here)

Let’s assume a standard rails app with three models (Book, Chapter, Genre).

UML
class Book < ActiveRecord::Base  has_many: :chapters  has_one: :genre  […]endclass Chapter < ActiveRecord::Base  […]endclass Genre < ActiveRecord::Base  […]end

We want to build a system that will allow users to search movies according to:

  • a selection of sufficient keywords or sentences (If we find at least one of these, we suggest the film to the user)
  • a selection of insufficient keywords or sentences (we must find a combination of them to recommend it)

Finally, we want to give the user a small explanation of why it ‘popped.’

The fun part begins:

Let’s assume that our BooksController has a search method where we will receive a hash of the different search options as such:

{  sufficient_keywords: ‘One, Harry Potter, Fantasy’,  insufficient_keywords: ‘Big Familly, excursion, excavation’,  insufficient_keyword_count: 3}class BooksController < ApplicationController  def search    [In here, we want to process the options and return the results with ‘highlights’]  end  […]end

To do this search, we must do several things:

  1. Create a string with the query for the ‘sufficient_keywords’ and ‘insufficient_keywords’
  2. Search for results that enter in the sufficient_keywords, discard them, and look for insufficient_keyword_hits
  3. Return the results of both searches with highlights
  4. Ideally, we would like to avoid taking capitalization and accents into account.
  5. Make our search fast.

These five steps sound simple, but sadly they are anything but as we will see.

But let’s start at the beginning:

Create a string with the query for the ‘sufficient_keywords’ and ‘insufficient_keywords’

Let’s extract that in a PORO (Plain Old Ruby Object)

# frozen_string_literal: trueclass QueryConstructor  attr_reader :sufficient_query, :insufficient_query  def initialize(options)    @sufficient_keywords = options[:sufficient_keywords]&.split(',')    @insufficient_keywords = options[:insufficient_keywords]&.split(',')    @combination_size = options[:insufficient_keyword_count]  end  def sufficient_query    return if @sufficient_keywords.blank?
clean_keywords = @sufficient_keywords.uniq.reject(&:empty?) clean_keywords.map! do |word| word.strip.gsub(' ', '<->') end clean_keywords.join(' ') end def insufficient_query return if @insufficient_keywords.blank?
clean_keywords = @insufficient_keywords.uniq.reject(&:empty?) clean_keywords.map! do |word| word.strip.gsub(' ', '<->') end
combinations = clean_keywords.combination(@combination_size.to_i).to_a
combinations.map! do |combination| combination.join('&') end combinations.join(' ') endend

Ok, this looks complicated; let’s break it down.

def sufficient_query  return if @sufficient_keywords.blank?
clean_keywords = @sufficient_keywords.uniq.reject(&:empty?) clean_keywords.map! do |word| word.strip.gsub(' ', '<->') end clean_keywords.join(' ')end

here is the most straightforward of the two query constructions.

We look if there are sentences, and we join them using ‘<->’; this will allow us to later look for adjacency.

We then join the different keywords with space into a single string. With the keywords of our example we would have as a result: ‘One Harry<->Potter Fantasy’

def insufficient_query  return if @insufficient_keywords.blank?
clean_keywords = @insufficient_keywords.uniq.reject(&:empty?) clean_keywords.map! do |word| word.strip.gsub(‘ ‘, ‘<->’) end combinations = clean_keywords.combination(@combination_size.to_i).to_a combinations.map! do |combination| combination.join(‘&’) end combinations.join(‘ ‘)end

Here our logic is ‘similar,’ but we create all possible combinations of the given size. Thanks to this, we will have at least one combination in our results.

Here we are expecting ‘Big<->Familly&excursion Big<->Familly&excavation excursion&excavation’. If we look in our spec file, we will see that it is working as expected:

spec-result

Search for results that enter in the sufficient_keywords, discard them, and look for insufficient_keyword_hits

To look for our words, we will create a PORO (Plain Old Ruby Object) called Book::Finder.

class Book::Finder  attr_reader :sufficients, :insufficients  def initialize(query_constructor)    @query_constructor = query_constructor    @sufficients = create_sufficients    @insufficients = create_insufficients  end  def create_sufficients    Book.book_search(@query_constructor.sufficient_query)  end  def create_insufficients    Book.where.not(id: @sufficients.pluck(:id)).book_search(@query_constructor.insufficient_query)  endend

Here we find books that satisfy the ‘sufficient’ condition on initialization. We then look up the rest of the books and search among them to meet the ‘insufficient’ requirement.

To understand Book.book_search, we need to look at our Book model.

In it, we have added:

pg_search_scope :book_search,against: [:title, :synopsis],associated_against: { chapters: :title, genre: :name },using: { tsearch: { any_word: true } }

If we look line by line :

  • We created a new search scope called book search.
  • We want to look into the Book columns: title and synopsis.
  • We also look in the association chapter’s title column and the name column of the association genre.
  • Finally, we want to use tsearch and join our conditions with an OR instead of an AND.

We still have a small problem at this step; the query does not understand adjacency or sentences. We construct our string with them, but a search such as this one (with my data, results might differ on your end):

sentence search

Will return these results:

search results

When we are expecting it to return empty

To solve this, we need to look into the pg_search gem and modify it to suit our needs.

Small tangent: modifying pg_search

To modify pg_search, we will fork to our repo and use it as a source moving forward. To do that we add gem 'pg_search', git: 'https://github.com/pablocm90/pg_search' to our gemfile.

In there (and I might explain it in another article if there is interest), I tweaked a bit the ‘lib/pg_search/features/search.rb’ file by:

adding the following methods

def construct_ts_query(term)  if term.include?(‘&’)    ts_query_and(term)  elsif term.include?(‘|’)    ts_query_or(term)  elsif term.include?(‘<->’)    ts_query_consecutive(term)  else    tsquery_for_term(term)  endend
def ts_query_consecutive(term) “(#{term.split(‘<->’).map { |term| construct_ts_query(term) }.join(‘ <-> ‘)})”end
def ts_query_and(term) “(#{term.split(‘&’).map { |term| construct_ts_query(term) }.join(‘ && ‘)})”end
def ts_query_or(term) “(#{term.split(‘|’).map { |term| construct_ts_query(term) }.join(‘ || ‘)})”end

and changing this method

def tsquery  return “‘’” if query.blank?  query_terms = query.split.compact  tsquery_terms = query_terms.map { |term| construct_ts_query(term) }  tsquery_terms.join(options[:any_word] ? ‘ || ‘ : ‘ && ‘)end

the only change in this method is the fact that I map the terms by calling construct_ts_query(term) instead of tsquery_for_term(term)

Now that we have made the change to our gemfile, we can look for adjacency without trouble: the same query as before returns empty as it should.

Return the results of both searches with highlights

Here, one part is easy and the other not so much.

Returning the results of both sufficient and insufficient keywords is reasonably trivial with the code we have already done.

app/views/books/search.html.erb<%= form_with url: search_books_path, method: :get, local: true do |form| %>  <div class=”field_wrapper”>    <div class=”field”>      <%= form.label :sufficient_keywords, ‘sufficient keywords (separate with a ,)’ %>      <%= form.text_area :sufficient_keywords %>    </div>    <div class=”field”>      <%= form.label :insufficient_keywords, ‘insufficient keywords (separate with a ,)’ %>      <%= form.text_area :insufficient_keywords %>    </div>    <div class=”field”>      <%= form.label :insufficient_keyword_count, ‘insufficient keyword count’ %>      <%= form.number_field :insufficient_keyword_count %>    </div>  </div>  <%= form.submit “Search” %><% end %>
<h2> Sufficient Keyword hits </h2><%= render partial: ‘book’, collection: @book_finder.sufficients %><br /><h2> Insufficient Keyword hits </h2><%= render partial: ‘book’, collection: @book_finder.insufficients %><br /><h3> All books </h3><%= render partial: ‘book’, collection: @books %>

coupled with :

app/views/books/_book.html.erb<hr><p>  <strong><%= book.title %></strong>  <em> <%= book.genre.name %></em></p><p>  <%= book.sinopsys %></p><p><strong><em>chapters</em></strong></p><% book.chapters.each do |chapter| %>  <p> <%= chapter.title %> </p><% end %><hr>

does the job reasonably well. It is not pretty, but it shall suffice for this exercise.

The problem comes when we try to use pg_highlight to our advantage.

Naively, by adding this:

highlight: {  StartSel: ‘<b>’,  StopSel: ‘</b>’,  MaxWords: 10,  MinWords: 4,  ShortWord: 1,  HighlightAll: true,  MaxFragments: 6,  FragmentDelimiter: ‘&hellip;’}

to our book_search, we should have what we want. We now need to add with_pg_search_highlight to our search methods…

And we run into an exception since pg_search does not support highlights in association searches. We can find a workaround.

If we create a new column to store all the data and have a before_save callback to save the different data in it, we will have access to highlights just as we want it.

Let’s start by

rails g migration AddSearchColumnToBook search_column:text && rails db:migrate

Now we can create our callback and our method.

app/models/book.rbbefore_save :collate_search_informationdef collate_search_information  search_column = “#{title} // #{sinopsys} // #{genre.name} // #{chapters.pluck(:title).join(‘ // ‘)}”end

After saving all the books, all books whould have a search_column, we can now change our `book_search` scope to be like this:

pg_search_scope :book_search,against: :search_column,  using: {    tsearch: {      any_word: true,      highlight: {      StartSel: ‘<b>’,      StopSel: ‘</b>’,      MaxWords: 10,      MinWords: 4,      ShortWord: 1,      HighlightAll: true,      MaxFragments: 6,      FragmentDelimiter: ‘&hellip;’    }  }}

And our results will appear correctly. By using with_pg_search_highight and pg_search_highlight, we will be able to show the words that match our search.

app/views/books/_book.html.erb<hr><p>  <strong><%= book.title %></strong>  <em> <%= book.genre.name %></em></p><p>  <%= book.sinopsys %></p><p><strong><em>chapters</em></strong></p><% book.chapters.each do |chapter| %>  <p> <%= chapter.title %> </p><% end %><% if with_highlights %>  <p><strong><em>Highlights</em></strong></p>  <p> <%= book.pg_search_highlight.html_safe %> </p>  <hr><%end %>class Book::Finder  attr_reader :sufficients, :insufficients  def initialize(query_constructor)    @query_constructor = query_constructor    @sufficients = create_sufficients    @insufficients = create_insufficients  end  def create_sufficients    Book.book_search(@query_constructor.sufficient_query).with_pg_search_highlight  end  def create_insufficients    Book.where.not(id: @sufficients.pluck(:id)).book_search(@query_constructor.insufficient_query).with_pg_search_highlight  endend

Ideally, we would like to avoid taking capitalization and accents into account.

Capitalization is taken care of automatically by pg_search, and we have an option to remove accents

In the seed, you will see that we create a book with weird accents.

Book.create(genre: Genre.first, title: ‘I àm weirdly accéntèd’, sinopsys: ‘I like piñatas’)

If we try to add the option ignoring: :accents to our search, we will raise an ‘ActiveRecord::StatementInvalid’ since we have not installed the extension.

To install it, we will need to access our PostgreSQL console:

psql NAME_OF_YOUR_DB, if you are using my repo: psql pg_searchin_article_development

and run this command in it:

CREATE EXTENSION unaccent;

[UNACCENT]

unaccent-console

We can now search for a ‘sufficient’ keyword ‘accented,’ and we should have our book as a result.

result using unaccented

We have accomplished our primary job. But searching uses a lot of memory and is time consuming. To speed up the process, we can create a ts_vector column instead of building it every time.

With it, we can avoid the to_tsvector part of this query:

query with to_tsvector

To do that, we will:

  1. create a configuration of unnaccent to use as our dictionary
  2. create a trigger to auto-update the ts_vector column
  3. search against the ts_vector column instead of the text column

To create a configuration of unaccent to use as our dictionary, we will need to open our PostgreSQL console again and:

CREATE TEXT SEARCH CONFIGURATION unaccented_english ( COPY = english );ALTER TEXT SEARCH CONFIGURATION unaccented_englishALTER MAPPING FOR hword, hword_part, wordWITH unaccent, english_stem;

Let’s create the migration to add the vector_search column:

rails g migration AddVectorSearchToBook vector_search:tsvector && rails db:migrate

And now we are ready to create our trigger. We open our PostgreSQL console and:

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATEON books FOR EACH ROWEXECUTE PROCEDURE tsvector_update_trigger(‘vector_search’, ‘public.unaccented_english’, ‘search_column’);
create trigger

in order to check if our trigger has been created, we can run:

select event_object_schema as table_schema,event_object_table as table_name,trigger_schema,trigger_name,string_agg(event_manipulation, ‘,’) as event,action_timing as activation,action_condition as condition,action_statement as definitionfrom information_schema.triggersgroup by 1,2,3,4,6,7,8order by table_schema,table_name;

we should see our trigger in the console.

The trigger will appear here

I recommend running rails db:seedagin. That way, we don’t have to save all our books.

And the last step is now upon us! Let’s change the lookup column. Our scope should now look like this:

pg_search_scope :book_search,  against: :search_column,  ignoring: :accents,    using: {      tsearch: {        any_word: true,        dictionary: ‘unaccented_english’,        tsvector_column: ‘vector_search’,        highlight: {          StartSel: ‘<b>’,          StopSel: ‘</b>’,          MaxWords: 10,          MinWords: 4,          ShortWord: 1,          HighlightAll: true,          MaxFragments: 6,          FragmentDelimiter: ‘&hellip;’        }      }    }

And we can look up books by chapter, title, genre, or synopsis quickly. As a bonus, since we are using ‘english_stem’ as the base of our text search configuration, variations of the same word will also be considered. For instance: ‘accent’ and ‘accented’ will be the same.

I hope you enjoyed this long post. Please let me know if you have questions, feedback, criticism :D

Till next time,

Pablo

--

--

Pablo Curell Mompo

Full Stack developer @ Seraphin, learned how to code @ Le Wagon. I love coding and plan to keep learning as much as I can about it :)