Search the Docs
Elasticsearch at Read the Docs
Santos Gallegos
-
@stsewd
Navigate with [ space ]
$ whoami
Santos Gallegos
Read the Docs
developer (web)
@stsewd
About Read the Docs
A platform that allows you to build and host your documentation.
Created by Eric Holscher and others
More than a hundred thousand open source projects.
Flask, tox, pip, requests, even Sphinx itself!
If you are a Python dev,
chances are you have already read docs hosted by RTD.
Easy to recognize by the version selector at the bottom right of pages.
Versioning, translations, sub-projects, server side search, and more!
Server Side Search
Better search experience
Sphinx, MkDocs (partially)
Search across sub-projects
Ranking and ignoring pages
Analytics
API
"exact match"
prefix*
fuzzy~2
https://docs.readthedocs.io/page/server-side-search.html
Better search experience inside and outside the docs pages
MkDocs integration is still experimental, let us know if you want to test it in your project!
RTD allows you to group several projects as sub-projects of another.
Ranking allows you to give priority to some pages over others, or exclude them from results. This is like
tweaking your own search engine!
You can use server-side search outside RTD via our API.
Simple query string syntax from ES
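For example, from outside the docs pages you can hit the API directly. This is only a minimal sketch: it assumes the /api/v2/search/ endpoint with project, version, and q parameters (see the docs linked above), and "docs" is just a hypothetical project slug.

# Minimal sketch: calling Read the Docs server-side search from outside the
# docs pages. Endpoint and parameter names follow the public API docs;
# "docs" is a hypothetical project slug.
import requests

response = requests.get(
    "https://readthedocs.org/api/v2/search/",
    params={
        "project": "docs",
        "version": "latest",
        "q": '"exact match" prefix* fuzzy~2',  # simple query string syntax
    },
)
for hit in response.json().get("results", []):
    print(hit)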
A brief history about search at RTD
It was added around 2013, and improved from there.
RTD was a small team (2 at the time), and we still are (5 now!).
It was working, so why touch it?
Safwan participated as a Google Summer of Code student.
Migrated ES from 1.3 (EOL 2016) to 6.x.
Added simple query string syntax support,
improved the ordering of results,
and added some management commands for re-indexing with zero downtime.
I joined the team that year!
Vaibhav, another Google Summer of Code student.
Search as you type extension.
Results linking to specific sections.
Search analytics.
Helped a little with the migration to 7.x.
Stable API.
Improved the HTML parsing.
Support for MkDocs.
Ranking and ignoring pages.
Search on multiple projects.
Indexing Process
Parse configuration file
↓
Sphinx build → JSON metadata & HTML
↓
Track files with their ignore & ranking options
↓
JSON metadata → Searchable text
Parse the configuration file
that has all the information to build your docs
(and also search options!).
Sphinx build,
which generates all html pages from sources.
We install an extension that hooks into the process and dumps some meta-information
about each page (title, main HTML content).
We also inject our embedded script to patch the search engine from Sphinx.
https://github.com/readthedocs/readthedocs-sphinx-ext
We go through each HTML file, save it in our DB, and set the given ranking and whether it is ignored.
In a later step we go over each HTML file again,
retrieve the JSON metadata file,
and extract its HTML.
We parse the HTML to extract the searchable text and index it into ES.
For MkDocs the process is almost the same,
but instead of using an extension to get the main content,
we parse the search index or each HTML file directly.
We'll see more about this later.
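As a rough sketch of that last step on the Sphinx side (not our actual indexing code), assuming the Sphinx JSON output (.fjson files with "title" and "body"), the selectolax parser, a local ES instance, and an illustrative index name and fields:

# Rough sketch only: walk the Sphinx JSON output (one .fjson file per page
# with "title" and "body" HTML), extract the text and index it into ES.
# Index name and fields are illustrative, not the real mapping.
import json
from pathlib import Path

from elasticsearch import Elasticsearch
from selectolax.parser import HTMLParser

es = Elasticsearch("http://localhost:9200")  # local instance for illustration

for fjson in Path("_build/json").glob("**/*.fjson"):
    page = json.loads(fjson.read_text())
    text = HTMLParser(page["body"]).text(separator=" ", strip=True)
    es.index(index="docs", document={"title": page["title"], "content": text})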
Configuration File
.readthedocs.yaml
version: 2

sphinx:
  configuration: docs/conf.py

python:
  version: 3.8
  install:
    - requirements: docs/requirements.txt

search:
  ranking:
    api/v1/*: -5
    api/v2/*: 5
  ignore:
    - search.html
    - 404.html
https://docs.readthedocs.io/page/config-file/v2.html
This is what a config file looks like for a Sphinx project; we'll be focusing on the search part.
Indexing the content
Get metadata from each file or the file itself
Extract searchable content based on heuristics
Remove navigation nodes and line numbers from code-blocks
Divide content into sections and domains (Sphinx)
readthedocs/search/parsers.py
readthedocs/search/documents.py
https://docs.readthedocs.io/page/development/search-integration.html
For Sphinx we use the JSON metadata file.
For MkDocs we are testing two ways: using a search index file and indexing from HTML files.
We extract the searchable content based on some heuristics, like ARIA roles (from accessibility) and HTML
tags (semantic web!).
We identify the main node of the content and remove irrelevant content like navigation items.
We remove line numbers from code blocks; again, this is done based on heuristics, like the common markup
used to render code blocks with line numbers.
Domains are specific to Sphinx (Sphinx domains),
like documentation extracted from source code.
Identifying the main content node
We identify the main node based on the main role.
We don't want to include things like navbar, footer, etc.
Removing navigation nodes
We remove navigation nodes that aren't part of the content.
This is based on heuristics, as we said,
like the navigation role or nav tags.
Removing navigation nodes
We remove adornments like permalinks, that is, the symbols that appear next to titles.
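Putting those heuristics together, a simplified sketch could look like the following. This is not the real parsers.py code; it assumes selectolax, and the selectors are just illustrative examples.

# Simplified illustration of the parsing heuristics, not the shipped code.
from selectolax.parser import HTMLParser

def extract_searchable_text(html):
    tree = HTMLParser(html)
    # Identify the main content node via the ARIA "main" role, falling back
    # to a <main> tag or the whole body.
    main = tree.css_first('[role="main"]') or tree.css_first("main") or tree.body
    # Remove navigation nodes and permalink adornments that aren't content.
    for node in main.css('[role="navigation"], nav, a.headerlink'):
        node.decompose()
    # Remove line-number columns commonly used next to code blocks.
    for node in main.css("td.linenos, span.linenos"):
        node.decompose()
    return main.text(separator=" ", strip=True)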
Identifying sections
Identifying nested sections
Ignoring pages
Ignored page → not indexed page
This is simple,
if the page is ignored we don't index it.
Sorry we are ignoring you, pages :(
Ranking pages
Ranking can be done at index or search time
Ranking at search time with
function score query
Painless script
Map each ranking [-10, 10] to a score
Play nice with custom boosting
readthedocs/search/faceted_search.py
Ranking can be done at index time (boosting, specific fields) or at search time;
they have their pros and cons.
Index-time ranking gives faster queries, but needs a re-index when the ranking value changes.
For our case, search time was more appropriate.
At search time this is done with a function score query + script score,
written in the Painless scripting language.
We map each ranking to an ES score,
keeping in mind to still show relevant results
independently of the ranking.
Config file used at RTD
search:
  ignore:
    # Internal documentation
    - development/design/*
    - search.html
    - 404.html
  ranking:
    # Deprecated content
    api/v1.html: -1
    config-file/v1.html: -1
    # Useful content, but not something we want most users finding
    custom_installs/*: -6
    changelog.html: -6
This is what a real-life example of a search config file looks like.
We ignore things like internal documentation that isn't relevant to users.
We use ranking to down-rank deprecated content,
and to keep the most relevant results on top,
while still making those pages available.
Ignoring pages from results
To the left we are using the custom search options.
To the right is the same search without those custom options.
To the right we see the first result being a design document,
which contains information about the implementation of the feature
rather than the documentation itself.
To the left we don't see that result at all.
The first result is the documentation about pull request previews.
Pages with rank -6 are still there
To the left we are using the custom search options.
To the right is the same search without those custom options.
The changelog has a lower ranking, but is still shown in the results.
Pages with a lower ranking are still shown in relevant results.
Tie breaker with rank -1
To the left we are using the custom search options.
To the right is the same search without those custom options.
Both v1 and v2 offer the same option, but v2 is shown first in the results.
Painless script
// ranking: [0.01, ..., 0.8, 1, 1.3, ..., 2]
int rank = doc['rank'].size() == 0 ? 0 : (int) doc['rank'].value;
return params.ranking[rank + 10] * _score;
https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-painless.html
For ranking pages at search time we use a Painless script.
This is how a Painless script looks.
It's almost like Java,
so if you aren't familiar with Java or haven't coded in Java in a long time (like myself!),
it may not be that painless to code in...
We map each ranking to a value between 0.001 and 2; that's 21 values.
We return the final score by multiplying the original score with the ranking.
0.8 and 1.3 are magic numbers! They were calculated based on our boosting values (2.0, 1.5).
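To attach the script to the query, the idea is roughly a function score query with a script score. The sketch below is illustrative only: the inner query, field names, and params values are placeholders, not the real query built in faceted_search.py.

# Illustrative function score query with a script score; simplified, not the
# real query we build in faceted_search.py.
search_body = {
    "query": {
        "function_score": {
            "query": {"simple_query_string": {"query": "pull request previews"}},
            "script_score": {
                "script": {
                    # Same idea as the Painless script above.
                    "source": (
                        "int rank = doc['rank'].size() == 0 ? 0 : (int) doc['rank'].value; "
                        "return params.ranking[rank + 10] * _score;"
                    ),
                    # Placeholder mapping of ranks [-10, 10] to multipliers.
                    "params": {"ranking": [0.5] * 10 + [1] + [1.5] * 10},
                }
            },
        }
    }
}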
Overriding the default search
Hijack the search call via JavaScript
Keep a reference to the old search for backup
If you ignore a page, you may still see it in the results from the backup search
https://docs.readthedocs.io/page/development/search-integration.html#overriding-the-default-search
We inject a script in all docs pages;
part of what this script does is hijack the default search,
so we can use our server-side search via our API.
We still need to keep a reference to the original search
as a backup, in case our API is down or in case our search
doesn't return results.
var original_search = Search.query;

function search_override(query) {
  // Helpers that call our API and render each result.
  var results = fetch_results(query);
  if (results) {
    for (var i = 0; i < results.length; i += 1) {
      var result = process_result(results[i]);
      Search.output.append(result);
    }
  } else {
    // Fall back to the original Sphinx search.
    original_search(query);
  }
}

Search.query = search_override;

$(document).ready(function() {
  Search.init();
});
This is how the overriding is done.
We keep a reference to the original search.
We declare a function that uses our API to retrieve and inject the results.
The original search is called if the API doesn't return results or if an error happens.
Latency 🐌
ES cloud (AWS) → RTD app (Azure)
ES cloud (Azure) → RTD app (Azure)
ES cloud (AWS zone B) → RTD app (AWS zone A)
ES cloud (AWS zone A) → RTD app (AWS zone A)
Our API responses were very slow.
Elastic Cloud lets you choose where to host your ES instance.
.org was on Azure, and ES was hosted on AWS.
We have now migrated to AWS, so we migrated ES again, to AWS.
.com is on AWS, ES was on AWS... but in a different zone!
In conclusion, make sure your ES deployment is on the same cloud provider and zone to improve latency.
Partial terms results
Search for code and words
theme,
themes,
temes,
temes~1
python.*,
*.fail_on_warning
test,
testing
index,
reindex,
indexing,
reindexing
We host technical documentation, so users search for code and words!
Code is different from words: punctuation isn't meant to separate "words" in code.
What do I mean by partial terms? Let's look at the examples.
Search for theme, themes, then with a typo (we won't be using ~ much, since we don't know we have a typo!)
Explicit suffix search, but explicit prefix search isn't supported with the simple query string query
Implicit suffix search
Implicit prefix or suffix search, or both!
https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html
Simple query string is useful when we know what we are looking for (refining our search).
Fuzzy search is useful when we don't know what we are looking for
(or we don't know how to write it, especially if it isn't in our own language).
Wildcard queries support prefix and suffix searches!
We can combine several types of searches in one.
But how do we know when to use one or combine several?
Again, more heuristics to find out!
Simple search terms don't have the special syntax,
and have just one word.
Maybe we could add the string length to the heuristics?
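A toy version of that kind of heuristic (just an assumption about the idea, not our shipped code) could look like:

# Toy heuristic: only add wildcard/fuzzy variants for short, single-word
# queries without the simple query string operators. Purely illustrative;
# the length limit is speculative.
import re

OPERATORS = re.compile(r'["*~+|()-]')

def use_partial_match(query: str) -> bool:
    words = query.split()
    return len(words) == 1 and not OPERATORS.search(query) and len(query) <= 30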
Wildcard queries over Text fields can be slow...
Wildcard queries over Wildcard
fields are fast!
Wildcard or Text field?...
Both! Multi fields
(#7613)
https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html#wildcard-field-type
Bad news: the docs mention that wildcard queries over text fields can be slow,
but maybe they aren't that bad... Let's try it!
502!
Good news: we have wildcard fields!
Bad news!
They are indexed in a different way,
so we would need to choose one or the other!
We still want to have the other types of searches (breaking on words)!
(We haven't deployed this solution yet.)
On .com there wasn't any downside in performance.
There we have a smaller index.
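The multi-field idea is to index the same content once as a text field and once as a wildcard sub-field. A hedged mapping sketch (field names are illustrative, not our real index):

# Illustrative multi-field mapping: the same content is indexed as "text"
# for word-based matches and as a "wildcard" sub-field for fast
# prefix/suffix/infix queries. Not the real Read the Docs mapping.
mapping = {
    "properties": {
        "content": {
            "type": "text",
            "fields": {
                "wildcard": {"type": "wildcard"},
            },
        }
    }
}

# A query can then target "content" with simple_query_string and
# "content.wildcard" with a wildcard query, combining both in a bool query.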
Other things
Thank you very much Elastic!
If you (or anyone else) search the same term, you'll get a cached response.
The content is static, so caching is fine; we also purge the cache when a new build happens, so you
won't get outdated results.
What's next?
Better results for partial terms
#7613
Weight page views into results
#7297
Use SSS by default for MkDocs
Enable search indexing for pull request previews?
Ignore and ranking per sections?
Better display for results from code blocks?
#7112
Search for images and code snippets?
Search by files patterns and other facets?
Gray items are still pending decisions or are just ideas,
so they may not be implemented soon, or ever.
Search the Docs with Read the Docs and Elasticsearch