Search the Docs
Elasticsearch at Read the Docs
Santos Gallegos
-
@stsewd
Navigate with [ space ]
$ whoami
Santos Gallegos
Read the Docs
developer (web)
@stsewd
About Read the Docs
A platform that allows you to build and host your documentation.
Created by Eric Holscher and others
More than a hundred thousand open source projects.
Flask, tox, pip, requests, even Sphinx itself!
If you are a Python dev,
chances are you have already read docs hosted by RTD.
Easy to recognize by the version selector at the bottom right of pages.
Versioning, translations, sub-projects, server side search, and more!
Server Side Search
Better search experience
Sphinx, MkDocs (partially)
Search across sub-projects
Ranking and ignoring pages
Analytics
API
"exact match"
prefix*
fuzzy~2
https://docs.readthedocs.io/page/server-side-search.html
Better search experience inside and outside the docs pages
MkDocs integration is still experimental, let us know if you want to test it in your project!
RTD allows you to group several projects as sub-projects of another.
Ranking allows you to give priority to some pages over others, or exclude them from results. This is like
tweaking your own search engine!
You can use server-side search outside RTD via our API.
Simple query string syntax from ES
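For example, from outside the docs pages you can hit the API directly. This is only a minimal sketch: it assumes the /api/v2/search/ endpoint with project, version, and q parameters (see the docs linked above), and "docs" is just a hypothetical project slug.

# Minimal sketch: calling Read the Docs server-side search from outside the
# docs pages. Endpoint and parameter names follow the public API docs;
# "docs" is a hypothetical project slug.
import requests

response = requests.get(
    "https://readthedocs.org/api/v2/search/",
    params={
        "project": "docs",
        "version": "latest",
        "q": '"exact match" prefix* fuzzy~2',  # simple query string syntax
    },
)
for hit in response.json().get("results", []):
    print(hit)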
A brief history about search at RTD
It was added around 2013, and improved from there.
RTD was a small team (2 at the time), and we still are (5 now!).
It was working, so why touch it?
Safwan participated as a Google Summer of Code student.
Migrated ES from 1.3 (EOL 2016) to 6.x.
Added simple query string syntax support,
improved the ordering of results,
and added some management commands for re-indexing with zero downtime.
I joined the team that year!
Vaibhav, another Google Summer of Code student.
Search as you type extension.
Results linking to specific sections.
Search analytics.
Helped a little with the migration to 7.x.
Stable API.
Improved the HTML parsing.
Support for MkDocs.
Ranking and ignoring pages.
Search on multiple projects.
Indexing Process
Parse configuration file
↓
Sphinx build → JSON metadata & HTML
↓
Track files with their ignore & ranking options
↓
JSON metadata → Searchable text
Parse the configuration file
that has all the information to build your docs
(and also search options!).
Sphinx build,
which generates all html pages from sources.
We install an extension that hooks into the process and dumps some meta-information
about each page (title, main HTML content).
We also inject our embedded script to patch the search engine from Sphinx.
https://github.com/readthedocs/readthedocs-sphinx-ext
We go through each HTML file, save it in our DB, and set the given ranking and whether it is ignored.
In a later step we go over each HTML file again,
retrieve the JSON metadata file,
and extract its HTML.
We parse the HTML to extract the searchable text and index it into ES.
For MkDocs the process is almost the same,
but instead of using an extension to get the main content,
we parse the search index or each HTML file directly.
We'll see more about this later.
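As a rough sketch of that last step on the Sphinx side (not our actual indexing code), assuming the Sphinx JSON output (.fjson files with "title" and "body"), the selectolax parser, a local ES instance, and an illustrative index name and fields:

# Rough sketch only: walk the Sphinx JSON output (one .fjson file per page
# with "title" and "body" HTML), extract the text and index it into ES.
# Index name and fields are illustrative, not the real mapping.
import json
from pathlib import Path

from elasticsearch import Elasticsearch
from selectolax.parser import HTMLParser

es = Elasticsearch("http://localhost:9200")  # local instance for illustration

for fjson in Path("_build/json").glob("**/*.fjson"):
    page = json.loads(fjson.read_text())
    text = HTMLParser(page["body"]).text(separator=" ", strip=True)
    es.index(index="docs", document={"title": page["title"], "content": text})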
Configuration File
.readthedocs.yaml
version: 2

sphinx:
  configuration: docs/conf.py

python:
  version: 3.8
  install:
    - requirements: docs/requirements.txt

search:
  ranking:
    api/v1/*: -5
    api/v2/*: 5
  ignore:
    - search.html
    - 404.html
https://docs.readthedocs.io/page/config-file/v2.html
This is what a config file looks like for a Sphinx project; we'll be focusing on the search part.
Indexing the content
Get metadata from each file or the file itself
Extract searchable content based on heuristics
Remove navigation nodes and line numbers from code-blocks
Divide content into sections and domains (Sphinx)
readthedocs/search/parsers.py
readthedocs/search/documents.py
https://docs.readthedocs.io/page/development/search-integration.html
For Sphinx we use the JSON metadata file.
For MkDocs we are testing two ways: using a search index file and indexing from HTML files.
We extract the searchable content based on some heuristics, like ARIA roles (from accessibility) and HTML
tags (semantic web!).
We identify the main node of the content and remove irrelevant content like navigation items.
We remove line numbers from code blocks; again, this is done based on heuristics, like the common markup
used to render code blocks with line numbers.
Domains are specific to Sphinx (Sphinx domains),
like documentation extracted from source code.
Identifying the main content node
We identify the main node based on the main role.
We don't want to include things like navbar, footer, etc.
Removing navigation nodes
We remove navigation nodes that aren't part of the content.
This is based on heuristics, as we said,
like the navigation role or nav tags.
Removing navigation nodes
We remove adornments like permalinks, that is, the symbols that appear next to titles.
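Putting those heuristics together, a simplified sketch could look like the following. This is not the real parsers.py code; it assumes selectolax, and the selectors are just illustrative examples.

# Simplified illustration of the parsing heuristics, not the shipped code.
from selectolax.parser import HTMLParser

def extract_searchable_text(html):
    tree = HTMLParser(html)
    # Identify the main content node via the ARIA "main" role, falling back
    # to a <main> tag or the whole body.
    main = tree.css_first('[role="main"]') or tree.css_first("main") or tree.body
    # Remove navigation nodes and permalink adornments that aren't content.
    for node in main.css('[role="navigation"], nav, a.headerlink'):
        node.decompose()
    # Remove line-number columns commonly used next to code blocks.
    for node in main.css("td.linenos, span.linenos"):
        node.decompose()
    return main.text(separator=" ", strip=True)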
Identifying sections
Identifying nested sections
Ignoring pages
Ignored page → not indexed page
This is simple,
if the page is ignored we don't index it.
Sorry we are ignoring you, pages :(
Ranking pages
Ranking can be done at index or search time
Ranking at search time with
function score query
Painless script
Map each ranking [-10, 10] to a score
Play nice with custom boosting
readthedocs/search/faceted_search.py
Ranking can be done at index time (boosting, specific fields) or at search time;
they have their pros and cons.
Index-time ranking gives faster queries, but needs a re-index when the ranking value changes.
For our case, search time was more appropriate.
At search time this is done with a function score query + script score,
written in the Painless scripting language.
We map each ranking to an ES score,
keeping in mind to still show relevant results
independently of the ranking.
Config file used at RTD
search:
  ignore:
    # Internal documentation
    - development/design/*
    - search.html
    - 404.html
  ranking:
    # Deprecated content
    api/v1.html: -1
    config-file/v1.html: -1
    # Useful content, but not something we want most users finding
    custom_installs/*: -6
    changelog.html: -6
This is what a real-life example of a search config file looks like.
We ignore things like internal documentation that isn't relevant to users.
We use ranking to down-rank deprecated content,
and to keep the most relevant results on top,
while still making those pages available.
Ignoring pages from results
To the left we are using the custom search options.
To the right is the same search without those custom options.
To the right we see the first result being a design document,
which contains information about the implementation of the feature
rather than the documentation itself.
To the left we don't see that result at all.
The first result is the documentation about pull request previews.
Pages with rank -6 are still there
To the left we are using the custom search options.
To the right is the same search without those custom options.
The changelog has a lower ranking, but is still shown in the results.
Pages with a lower ranking are still shown in relevant results.
Tie breaker with rank -1
To the left we are using the custom search options.
To the right is the same search without those custom options.
Both v1 and v2 offer the same option, but v2 is shown first in the results.
Painless script
// ranking: [0.01, ..., 0.8, 1, 1.3, ..., 2]
int rank = doc['rank'].size() == 0 ? 0 : (int) doc['rank'].value;
return params.ranking[rank + 10] * _score;
https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-painless.html
For ranking pages at search time we use a Painless script.
This is how a Painless script looks.
It's almost like Java,
so if you aren't familiar with Java or haven't coded in Java in a long time (like myself!),
it may not be that painless to code in...
We map each ranking to a value between 0.001 and 2; that's 21 values.
We return the final score by multiplying the original score with the ranking.
0.8 and 1.3 are magic numbers! They were calculated based on our boosting values (2.0, 1.5).
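To attach the script to the query, the idea is roughly a function score query with a script score. The sketch below is illustrative only: the inner query, field names, and params values are placeholders, not the real query built in faceted_search.py.

# Illustrative function score query with a script score; simplified, not the
# real query we build in faceted_search.py.
search_body = {
    "query": {
        "function_score": {
            "query": {"simple_query_string": {"query": "pull request previews"}},
            "script_score": {
                "script": {
                    # Same idea as the Painless script above.
                    "source": (
                        "int rank = doc['rank'].size() == 0 ? 0 : (int) doc['rank'].value; "
                        "return params.ranking[rank + 10] * _score;"
                    ),
                    # Placeholder mapping of ranks [-10, 10] to multipliers.
                    "params": {"ranking": [0.5] * 10 + [1] + [1.5] * 10},
                }
            },
        }
    }
}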
Overriding the default search
Hijack the search call via JavaScript
Keep a reference to the old search for backup
If you ignore a page, you may still see it in the results from the backup search
https://docs.readthedocs.io/page/development/search-integration.html#overriding-the-default-search
We inject a script in all docs pages;
part of what this script does is hijack the default search,
so we can use our server-side search via our API.
We still need to keep a reference to the original search
as a backup, in case our API is down or in case our search
doesn't return results.
var original_search = Search.query;

function search_override(query) {
  // Helpers that call our API and render each result.
  var results = fetch_results(query);
  if (results) {
    for (var i = 0; i < results.length; i += 1) {
      var result = process_result(results[i]);
      Search.output.append(result);
    }
  } else {
    // Fall back to the original Sphinx search.
    original_search(query);
  }
}

Search.query = search_override;

$(document).ready(function() {
  Search.init();
});
This is how the overriding is done.
We keep a reference to the original search.
We declare a function that uses our API to retrieve and inject the results.
The original search is called if the API doesn't return results or if an error happens.
Latency 🐌
ES cloud (AWS) → RTD app (Azure)
ES cloud (Azure) → RTD app (Azure)
ES cloud (AWS zone B) → RTD app (AWS zone A)
ES cloud (AWS zone A) → RTD app (AWS zone A)
Our API responses were very slow.
Elastic Cloud lets you choose where to host your ES instance.
.org was on Azure, and ES was hosted on AWS.
We have now migrated to AWS, so we migrated ES again, to AWS.
.com is on AWS, ES was on AWS... but in a different zone!
In conclusion, make sure your ES deployment is on the same cloud provider and zone to improve latency.
Partial terms results
Search for code and words
theme,
themes,
temes,
temes~1
python.*,
*.fail_on_warning
test,
testing
index,
reindex,
indexing,
reindexing
We host technical documentation, so users search for code and words!
Code is different from words: punctuation isn't meant to separate "words" in code.
What do I mean by partial terms? Let's look at the examples.
Search for theme, themes, then with a typo (we won't be using ~ much, since we don't know we have a typo!)
Explicit suffix search, but explicit prefix search isn't supported with the simple query string query
Implicit suffix search
Implicit prefix or suffix search, or both!
https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html
Simple query string is useful when we know what we are looking for (refining our search).
Fuzzy search is useful when we don't know what we are looking for
(or we don't know how to write it, especially if it isn't in our own language).
Wildcard queries support prefix and suffix searches!
We can combine several types of searches in one.
But how do we know when to use one or combine several?
Again, more heuristics to find out!
Simple search terms don't have the special syntax,
and have just one word.
Maybe we could add the string length to the heuristics?
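A toy version of that kind of heuristic (just an assumption about the idea, not our shipped code) could look like:

# Toy heuristic: only add wildcard/fuzzy variants for short, single-word
# queries without the simple query string operators. Purely illustrative;
# the length limit is speculative.
import re

OPERATORS = re.compile(r'["*~+|()-]')

def use_partial_match(query: str) -> bool:
    words = query.split()
    return len(words) == 1 and not OPERATORS.search(query) and len(query) <= 30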
Wildcard queries over Text fields can be slow...
Wildcard queries over Wildcard
fields are fast!
Wildcard or Text field?...
Both! Multi fields
(#7613)
https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html#wildcard-field-type
Bad news: the docs mention that wildcard queries over text fields can be slow,
but maybe they aren't that bad... Let's try it!
502!
Good news: we have wildcard fields!
Bad news!
They are indexed in a different way,
so we would need to choose one or the other!
We still want to have the other types of searches (breaking on words)!
(We haven't deployed this solution yet.)
On .com there wasn't any downside in performance.
There we have a smaller index.
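The multi-field idea is to index the same content once as a text field and once as a wildcard sub-field. A hedged mapping sketch (field names are illustrative, not our real index):

# Illustrative multi-field mapping: the same content is indexed as "text"
# for word-based matches and as a "wildcard" sub-field for fast
# prefix/suffix/infix queries. Not the real Read the Docs mapping.
mapping = {
    "properties": {
        "content": {
            "type": "text",
            "fields": {
                "wildcard": {"type": "wildcard"},
            },
        }
    }
}

# A query can then target "content" with simple_query_string and
# "content.wildcard" with a wildcard query, combining both in a bool query.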
Other things
Thank you very much Elastic!
If you (or anyone else) search the same term, you'll get a cached response.
The content is static, so caching is fine; we also purge the cache when a new build happens, so you
won't get outdated results.
What's next?
Better results for partial terms
#7613
Weight page views into results
#7297
Use SSS by default for MkDocs
Enable search indexing for pull request previews?
Ignore and ranking per sections?
Better display for results from code blocks?
#7112
Search for images and code snippets?
Search by files patterns and other facets?
Gray items are still pending decisions or are just ideas,
so they may not be implemented soon, or ever.
Search the Docs with Read the Docs and Elasticsearch