https://gds.blog.gov.uk/2012/08/03/from-solr-to-elasticsearch/

From Solr to elasticsearch

Categories: GOV.UK, Technology

Search is right at the centre of GOV.UK. It’s the main focus of the homepage and it appears in the corner of every single page. Many of our recent and upcoming apps, such as licence finder, also rely heavily on search, so making sure we have the right tool for the job is vital. Recently we decided to begin switching away from Solr to elasticsearch for our search server. Rob Young, a developer at GDS, explains in some detail the basis for our decisions. The usual disclaimers about this being quite technical apply.

A little background

Both Solr and elasticsearch are Lucene-based search servers that expose an HTTP interface. They both provide a lot of the same features and, in fact, both depend on a lot of the same code. So why on earth do we want to go through the effort of switching?

Controlling the index

One of the great features of elasticsearch is that it exposes all sorts of index management operations through the HTTP interface, such as creating, deleting or modifying the schema of an index. Solr does allow you to create a new index based on an existing one, but nothing more. This extra control is great for a few reasons:

  • it is easy to experiment with
  • it puts the control of the index firmly in the hands of the application, where it belongs
  • it makes temporary indexes for integration testing or A/B testing possible
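
As a rough illustration of index management over HTTP, the sketch below builds the requests that would create and later delete a temporary index. It assumes a local server on elasticsearch's default port 9200; the index name `rummager_test`, the helper names and the settings shown are illustrative and are not taken from our code. The requests are only constructed here; sending them would need `urllib.request.urlopen` and a running server.

```python
import json
import urllib.request

# Assumption: a local elasticsearch node on its default HTTP port.
ES_URL = "http://localhost:9200"


def create_index_request(name, settings=None):
    """Build (but do not send) a PUT request that creates an index."""
    body = json.dumps({"settings": settings or {}}).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def delete_index_request(name):
    """Build (but do not send) a DELETE request that removes an index."""
    return urllib.request.Request(f"{ES_URL}/{name}", method="DELETE")


if __name__ == "__main__":
    # "rummager_test" is a hypothetical throwaway index for an integration test.
    requests = (
        create_index_request("rummager_test", {"number_of_shards": 1}),
        delete_index_request("rummager_test"),
    )
    for req in requests:
        print(req.get_method(), req.full_url)
```

A test suite could create such an index in its setup step and delete it in teardown, which keeps test data away from the real index.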

Elasticsearch also allows us to have rich, nested documents that better model our data. This may be difficult to visualise, so take a look at the two JSON documents below.

{
  "title": "Example title",
  "format": "Answer",
  "section": "Example",
  "additional_links__title": ["title one", "title two"],
  "additional_links__link": ["/one", "/two"]
}

{
  "title": "Example title",
  "format": "Answer",
  "section": "Example",
  "additional_links": [
    {"title": "title one", "link": "/one"},
    {"title": "title two", "link": "/two"}
  ]
}
https://gist.github.com/3092498

The first shows the structure of a Solr document where we want to nest additional links; the second shows how we could model that in elasticsearch.
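
To make the relationship between the two shapes concrete, here is a minimal sketch that converts the flattened form into the nested one. The function name `nest_links` is hypothetical, and it assumes the double-underscore naming used in the example document, with the parallel arrays all the same length.

```python
def nest_links(flat_doc, prefix="additional_links"):
    """Turn parallel `<prefix>__<field>` arrays into a list of nested objects."""
    marker = prefix + "__"
    # Fields that are not part of the flattened group are copied unchanged.
    nested = {k: v for k, v in flat_doc.items() if not k.startswith(marker)}
    # Map each sub-field name (e.g. "title") to its array of values.
    columns = {
        k[len(marker):]: v for k, v in flat_doc.items() if k.startswith(marker)
    }
    if columns:
        names = list(columns)
        # zip(*...) walks the arrays in step, giving one tuple per link.
        nested[prefix] = [dict(zip(names, row)) for row in zip(*columns.values())]
    return nested


if __name__ == "__main__":
    solr_style = {
        "title": "Example title",
        "format": "Answer",
        "section": "Example",
        "additional_links__title": ["title one", "title two"],
        "additional_links__link": ["/one", "/two"],
    }
    print(nest_links(solr_style))
```

The flattened form relies on array positions to tie a title to its link, whereas the nested form keeps each pair together in one object.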

Finding results

Just about the most important feature of any search engine is the ability to query it. Both Solr and elasticsearch expose their query APIs over HTTP but they do so in quite different ways. Solr queries are made up of two- and three-letter URL parameters, while elasticsearch queries are clear, self-documenting JSON objects passed in the HTTP body.

Here is a curl command that can be run from the terminal to query a Solr index. It’s quite difficult to interpret. Some of the fields can be worked out, but ultimately you would have to resort to the documentation to find out what they mean.

curl -XGET 'http://localhost:8983/solr/rummager/select?qt=dismax&q=bank+hoildays&fl=title%2Clink%2Cdescription%2Cformat&bq=format%3A%28transaction+OR+recommended-link%29%5E3.0&hl=true&hl.fl=description%2Cindexable_content&hl.simple.pre=HIGHLIGHT_START&hl.simple.post=HIGHLIGHT_END&start=0&rows=50&mm=75%25'
https://gist.github.com/3092083

Compare that to the (mostly) equivalent query using elasticsearch.

curl -XGET 'http://localhost:9200/rummager/_search' -H 'Content-type: application/json' -d '{
  "from": 0, "size": 50,
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "fields": ["title", "description", "indexable_content"],
          "query": "bank holidays",
          "minimum_should_match": "75%"
        }
      },
      "should": {
        "query_string": {
          "default_field": "format",
          "query": "transaction OR recommended-link",
          "boost": 3.0
        }
      }
    }
  },
  "highlight": {
    "pre_tags": ["HIGHLIGHT_START"],
    "post_tags": ["HIGHLIGHT_END"],
    "fields": {
      "description": { },
      "indexable_content": { }
    }
  }
}'

https://gist.github.com/3092079
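
Because the query is plain JSON, an application can assemble it as an ordinary data structure rather than by concatenating URL parameters. The following is a sketch only: the function name `build_search_body` and its arguments are hypothetical, and the body simply mirrors the query above.

```python
import json


def build_search_body(query, offset=0, size=50):
    """Return the request body for the search shown above as a Python dict."""
    return {
        "from": offset,
        "size": size,
        "query": {
            "bool": {
                # The user's terms must match in at least one of these fields.
                "must": {
                    "query_string": {
                        "fields": ["title", "description", "indexable_content"],
                        "query": query,
                        "minimum_should_match": "75%",
                    }
                },
                # Optional clause that boosts the listed formats.
                "should": {
                    "query_string": {
                        "default_field": "format",
                        "query": "transaction OR recommended-link",
                        "boost": 3.0,
                    }
                },
            }
        },
        "highlight": {
            "pre_tags": ["HIGHLIGHT_START"],
            "post_tags": ["HIGHLIGHT_END"],
            "fields": {"description": {}, "indexable_content": {}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("bank holidays"), indent=2))
```

The printed JSON could be passed to the `_search` endpoint exactly as the curl command does with `-d`.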

It’s much more verbose, but it’s also much more obvious what is happening.

Performance

Not every aspect of elasticsearch is an improvement on Solr, and performance is one area where it is not. Solr performs very well on small indexes that don’t change very often, which describes ours. We set up a few very simple performance tests. Our goal wasn’t to get an accurate picture of production performance but rather to get an idea of the difference between the two. However, elasticsearch is more than fast enough, so performance is not a compelling reason to stick with Solr.
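
The tests themselves are not described here. Purely as an illustration of what a very simple comparison might look like, the sketch below times repeated calls to a search function and reports the median. The helper `time_search` is hypothetical, and `search` stands in for whatever function sends a query to the server being measured.

```python
import statistics
import time


def time_search(search, queries, repeats=5):
    """Return the median wall-clock seconds per call of `search` over `queries`."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            started = time.perf_counter()
            search(query)
            samples.append(time.perf_counter() - started)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in search function; a real test would issue an HTTP request here.
    def fake_search(query):
        return query.upper()

    print(time_search(fake_search, ["bank holidays", "passport"]))
```

Running the same query list against each server gives a rough relative figure, which is all a test like this is meant to provide.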

Stability

Another concern that was raised was index stability and the risk of corruption. This is a serious concern. It is also a difficult one to refute, as stability issues often only arise in specific circumstances, under consistently high load for an extended period of time. We spoke to 37Signals and Mozilla, amongst others, and did not find anything that worried us.

Taking this together with all the other reasons, on the 25th of June we gave elasticsearch the thumbs up, and we’re now using it in production.

Sharing and comments

12 comments

  1. Comment by Nick Holmes posted on

    I've just started investigating and testing GOV UK - randomly at present I have to admit.

    Can you shed some light on how results are ranked as this appears to me to be almost completely random.

    For example, let's look at a browse for Businesses and self-employed > Farming

    The top few results on the list appear randomly ordered, but maybe those are the most popular or the most ... ? - you tell me. Then the remaining results are alpha ordered. ??

    At the bottom we have a box with detailed guidance links. I latch on to Grants and payments for farmers.

    Now I paste that title into the search box expecting that page to come out top. No way! It's nowhere to be seen.??

    I know you will hate me saying this, but if I Google Grants and payments for farmers the page is #1.

  2. Comment by Coding in the open | Government Digital Service posted on

    [...] is the right solution and start using the software straight away. A good example of that is the Elasticsearch search engine that Rob wrote about a few months back. From what we were reading it seemed like it’d be simpler for us to work with than solr, and [...]

  3. Comment by Andre' (@AndreHazelwood) posted on

    Your query is not the same. Not to invalidate the results. bank+hoildays != bank holidays

  4. Comment by ElasticSearch | ReStreaming posted on

    [...] From Solr to elasticsearch (digital.cabinetoffice.gov.uk) [...]

  5. Comment by Graham Jenkins posted on

    Everyone else puts search at the top, even when it's not their main focus, so why did you hide it 4 screens down? Just curious.

  6. Comment by davie hay posted on

    "center of GOV.UK." ? Shouldnt that be "centre of GOV.UK."

    • Replies to davie hay>

      Comment by Louise Kidney posted on

      Thanks for your feedback. This has now been changed.

  7. Comment by Gerrit Berkouwer posted on

    Interesting move! Were there any functional thoughts about the two solutions? Does ElasticSearch differ in the way you present search results? Or was it a pure technical decision to make the switch?
    Do you use several indexes for several websites, or is there 1 index only for the Gov.uk domain?

    • Replies to Gerrit Berkouwer>

      Comment by Rob Young posted on

      If I understand your first question correctly it's about how search results are ordered. As both Solr and Elasticsearch use Lucene as their underlying library the default result order will be pretty similar. One benefit that Elasticsearch offers over Solr in this area is that it's much easier to experiment with different configurations that affect the order of the results.
      We use a separate index for each application under http://www.gov.uk. I've been stung before by sticking everything in one index. If the content is considerably different between the websites this will affect the result ordering.

  8. Comment by From Solr to elasticsearch [Clarity as a Value?] « Another Word For It posted on

    [...] From Solr to elasticsearch by Rob Young. [...]

  9. Comment by robyounggds posted on

    Hi Luke, that's a very interesting question and one that we will need to answer but right now we haven't. Note this is just for search traffic and not total page requests. I've just done some very rough, back of an envelope calculations based on Directgov traffic and the proportion of GOV.UK traffic that goes through the search page. It looks like we should be expecting to peak at around 3.5k - 4k search requests per hour. However, the error margins are pretty huge and it assumes no extra load due to publicity around the launch.

  10. Comment by Luke posted on

    When gov.uk is launched how many queries do you expect to receive per hour/day?