Search is right at the centre of GOV.UK. It’s the main focus of the homepage and it appears in the corner of every single page. Many of our recent and upcoming apps, such as licence finder, also rely heavily on search. So, making sure we have the right tool for the job is vital. Recently we decided to begin switching away from Solr to elasticsearch for our search server. Rob Young, a developer at GDS, explains in some detail the basis for our decision - the usual disclaimers about this being quite technical apply.
A little background
Both Solr and elasticsearch are Lucene-based search servers that expose an HTTP interface. They both provide a lot of the same features and, in fact, both depend on a lot of the same code. So why on earth do we want to go through the effort of switching?
Controlling the index
One of the great features of elasticsearch is that it exposes all sorts of index management operations through the HTTP interface, such as creating, deleting or modifying the schema of an index. Solr does allow you to create a new index based on an existing one, but nothing more. This extra control is great for a few reasons:
- it is easy to experiment with
- it puts the control of the index firmly in the hands of the application, where it belongs
- it makes temporary indexes for integration testing or A/B testing possible
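As a rough illustration of what "index management over HTTP" looks like from the application side, here is a minimal Python sketch that builds the requests for creating and deleting a throwaway index. The index name `rummager_test` and the empty settings body are assumptions made for the example, and the host is simply a local node on the default port; nothing is sent unless you pass the request to `urlopen`.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local elasticsearch node


def create_index_request(name, settings=None):
    """Build (but do not send) a PUT request that creates an index."""
    body = json.dumps(settings or {}).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def delete_index_request(name):
    """Build (but do not send) a DELETE request that removes an index."""
    return urllib.request.Request(f"{ES_URL}/{name}", method="DELETE")


if __name__ == "__main__":
    # Hypothetical throwaway index for an integration test run.
    requests_to_make = (
        create_index_request("rummager_test"),
        delete_index_request("rummager_test"),
    )
    for req in requests_to_make:
        print(req.get_method(), req.full_url)
    # To actually run one against a live server: urllib.request.urlopen(req)
```

A test suite could create the index in its setup step and delete it in teardown, so each run starts from a clean slate.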
Elasticsearch also allows us to have rich, nested documents that better model our data. It may be difficult to visualise, so take a look at the two JSON documents below.
{
  "title": "Example title",
  "format": "Answer",
  "section": "Example",
  "additional_links__title": ["title one", "title two"],
  "additional_links__link": ["/one", "/two"]
}

{
  "title": "Example title",
  "format": "Answer",
  "section": "Example",
  "additional_links": [
    {"title": "title one", "link": "/one"},
    {"title": "title two", "link": "/two"}
  ]
}
https://gist.github.com/3092498
The first shows the structure of a Solr document where we want to nest additional links; the second shows how we could model that in elasticsearch.
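To make the relationship between the two shapes concrete, here is a small Python sketch that turns the nested form into the flattened one. The function name `flatten_for_solr` and the `__` separator handling are our own illustration, not code from the search application. It also assumes every nested item has the same keys; otherwise the parallel lists would no longer line up, which is exactly the fragility that nested documents avoid.

```python
def flatten_for_solr(doc, sep="__"):
    """Flatten lists of objects into parallel multi-valued fields."""
    flat = {}
    for key, value in doc.items():
        is_list_of_objects = (
            isinstance(value, list)
            and value
            and all(isinstance(item, dict) for item in value)
        )
        if is_list_of_objects:
            # One multi-valued field per sub-key, e.g. additional_links__title.
            for item in value:
                for sub_key, sub_value in item.items():
                    flat.setdefault(f"{key}{sep}{sub_key}", []).append(sub_value)
        else:
            flat[key] = value
    return flat


nested = {
    "title": "Example title",
    "format": "Answer",
    "section": "Example",
    "additional_links": [
        {"title": "title one", "link": "/one"},
        {"title": "title two", "link": "/two"},
    ],
}

if __name__ == "__main__":
    print(flatten_for_solr(nested))
```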
Finding results
Just about the most important feature of any search engine is the ability to query it. Both Solr and elasticsearch expose their query APIs over HTTP but they do so in quite different ways. Solr queries are made up of two- and three-letter URL parameters, while elasticsearch queries are clear, self-documenting JSON objects passed in the HTTP body.
Here is a curl command that can be run from the terminal to query a Solr index. It’s quite difficult to interpret: some of the fields can be worked out, but ultimately you would have to resort to the documentation to find out what they mean.
curl -XGET 'http://localhost:8983/solr/rummager/select?qt=dismax&q=bank+hoildays&fl=title%2Clink%2Cdescription%2Cformat&bq=format%3A%28transaction+OR+recommended-link%29%5E3.0&hl=true&hl.fl=description%2Cindexable_content&hl.simple.pre=HIGHLIGHT_START&hl.simple.post=HIGHLIGHT_END&start=0&rows=50&mm=75%25'
https://gist.github.com/3092083
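One way to make sense of the Solr URL is to decode it. The Python sketch below splits the query string from the curl command above into its parameters and prints each one next to a short gloss. The glosses in `MEANINGS` are our own summaries of the standard Solr and DisMax parameters, and the URL is copied verbatim from the command.

```python
from urllib.parse import parse_qsl, urlsplit

# Copied verbatim from the curl command above.
SOLR_URL = (
    "http://localhost:8983/solr/rummager/select?qt=dismax&q=bank+hoildays"
    "&fl=title%2Clink%2Cdescription%2Cformat"
    "&bq=format%3A%28transaction+OR+recommended-link%29%5E3.0"
    "&hl=true&hl.fl=description%2Cindexable_content"
    "&hl.simple.pre=HIGHLIGHT_START&hl.simple.post=HIGHLIGHT_END"
    "&start=0&rows=50&mm=75%25"
)

# Our own short gloss of each parameter, summarised from the Solr docs.
MEANINGS = {
    "qt": "request handler (query type)",
    "q": "the user's query",
    "fl": "fields to return",
    "bq": "boost query",
    "hl": "enable highlighting",
    "hl.fl": "fields to highlight",
    "hl.simple.pre": "text inserted before a highlight",
    "hl.simple.post": "text inserted after a highlight",
    "start": "offset of the first result",
    "rows": "number of results to return",
    "mm": "minimum should match",
}


def decode_solr_params(url):
    """Return the URL's query parameters as a dict of decoded strings."""
    return dict(parse_qsl(urlsplit(url).query))


if __name__ == "__main__":
    for name, value in decode_solr_params(SOLR_URL).items():
        print(f"{name:15} {value:45} # {MEANINGS.get(name, '?')}")
```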
Compare that to the (mostly) equivalent query using elasticsearch.
curl -XGET 'http://localhost:9200/rummager/_search' -H 'Content-type: application/json' -d '{
  "from": 0, "size": 50,
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "fields": ["title", "description", "indexable_content"],
          "query": "bank holidays",
          "minimum_should_match": "75%"
        }
      },
      "should": {
        "query_string": {
          "default_field": "format",
          "query": "transaction OR recommended-link",
          "boost": 3.0
        }
      }
    }
  },
  "highlight": {
    "pre_tags": ["HIGHLIGHT_START"],
    "post_tags": ["HIGHLIGHT_END"],
    "fields": {
      "description": { },
      "indexable_content": { }
    }
  }
}'
It’s much more verbose, but it’s also much more obvious what is happening.
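Because the query is just a JSON object, an application can build it as an ordinary data structure. The following Python sketch mirrors the elasticsearch query above. The function name `build_search_body` and its `page` and `per_page` arguments are hypothetical, added only to show how `from` and `size` could be derived for pagination.

```python
import json


def build_search_body(terms, page=1, per_page=50):
    """Build the JSON body for the search request as a plain dict."""
    return {
        "from": (page - 1) * per_page,  # offset of the first result
        "size": per_page,
        "query": {
            "bool": {
                # The user's terms must match in at least one of these fields.
                "must": {
                    "query_string": {
                        "fields": ["title", "description", "indexable_content"],
                        "query": terms,
                        "minimum_should_match": "75%",
                    }
                },
                # Documents in these formats get a relevance boost.
                "should": {
                    "query_string": {
                        "default_field": "format",
                        "query": "transaction OR recommended-link",
                        "boost": 3.0,
                    }
                },
            }
        },
        "highlight": {
            "pre_tags": ["HIGHLIGHT_START"],
            "post_tags": ["HIGHLIGHT_END"],
            "fields": {"description": {}, "indexable_content": {}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("bank holidays"), indent=2))
```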
Performance
Not every aspect of elasticsearch is an improvement on Solr, and performance is one example. Solr performs very well on small indexes that don’t change very often, which describes ours. We set up a few very simple performance tests. Our goal wasn’t to get an accurate picture of production performance but rather to get an idea of the difference in performance between the two. However, elasticsearch is more than fast enough, so performance is not a compelling reason to stick with Solr.
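A rough comparison of this kind could be sketched as follows. This is not the harness used for the tests described above; it is a minimal, assumed example that times any callable several times and reports the median, so the same query can be run against each server and the two figures compared.

```python
import statistics
import time


def time_calls(fn, repeats=20):
    """Return the median wall-clock seconds taken by fn() over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload; replace with a function that sends one search
    # request to Solr, and another that sends it to elasticsearch.
    median_seconds = time_calls(lambda: sum(range(10_000)))
    print(f"median: {median_seconds * 1000:.3f} ms")
```

Using the median rather than the mean keeps one slow outlier, such as a cold cache on the first request, from dominating the result.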
Stability
Another concern raised was index stability and the risk of corruption. This is a serious concern. It is also a difficult one to refute, as stability issues often only arise in specific circumstances, such as under consistently high load for an extended period of time. We spoke to 37Signals and Mozilla, amongst others, and did not find anything that worried us.
Combined with all the other reasons, on the 25th of June we gave elasticsearch the thumbs up and now we’re using it in production.
12 comments
Comment by Nick Holmes posted on
I've just started investigating and testing GOV UK - randomly at present I have to admit.
Can you shed some light on how results are ranked as this appears to me to be almost completely random.
For example, let's look at a browse for Businesses and self-employed > Farming
The top few results on the list appear randomly ordered, but maybe those are the most popular or the most ... ? - you tell me. Then the remaining results are alpha ordered. ??
At the bottom we have a box with detailed guidance links. I latch on to Grants and payments for farmers.
Now I paste that title into the search box expecting that page to come out top. No way! It's nowhere to be seen.??
I know you will hate me saying this, but if I Google Grants and payments for farmers the page is #1.
Comment by Coding in the open | Government Digital Service posted on
[...] is the right solution and start using the software straight away. A good example of that is the Elasticsearch search engine that Rob wrote about a few months back. From what we were reading it seemed like it’d be simpler for us to work with than solr, and [...]
Comment by Andre' (@AndreHazelwood) posted on
Your query is not the same. Not to invalidate the results. bank+hoildays != bank holidays
Comment by ElasticSearch | ReStreaming posted on
[...] From Solr to elasticsearch (digital.cabinetoffice.gov.uk) [...]
Comment by Graham Jenkins posted on
Everyone else puts search at the top, even when it's not their main focus, so why did you hide it 4 screens down? Just curious.
Comment by davie hay posted on
"center of GOV.UK." ? Shouldnt that be "centre of GOV.UK."
Comment by Louise Kidney posted on
Thanks for your feedback. This has now been changed.
Comment by Gerrit Berkouwer posted on
Interesting move! Were there any functional thoughts about the two solutions? Does ElasticSearch differ in the way you present searchresults? Or was it a pure technical decision to make the switch?
Do you use several indexes for several websites, or is there 1 index only for the Gov.uk domain
Comment by Rob Young posted on
If I understand your first question correctly it's about how search results are ordered. As both Solr and Elasticsearch use Lucene as their underlying library the default result order will be pretty similar. One benefit that Elasticsearch offers over Solr in this area is that it's much easier to experiment with different configurations that affect the order of the results.
We use a separate index for each application under http://www.gov.uk. I've been stung before by sticking everything in one index. If the content is considerably different between the websites this will affect the result ordering.
Comment by From Solr to elasticsearch [Clarity as a Value?] « Another Word For It posted on
[...] From Solr to elasticsearch by Rob Young. [...]
Comment by robyounggds posted on
Hi Luke, that's a very interesting question and one that we will need to answer but right now we haven't. Note this is just for search traffic and not total page requests. I've just done some very rough, back of an envelope calculations based on Directgov traffic and the proportion of GOV.UK traffic that goes through the search page. It looks like we should be expecting to peak at around 3.5k - 4k search requests per hour. However, the error margins are pretty huge and it assumes no extra load due to publicity around the launch.
Comment by Luke posted on
When gov.uk is launched how many queries do you expect to receive per hour/day?