https://gds.blog.gov.uk/2012/08/03/from-solr-to-elasticsearch/

From Solr to elasticsearch

Rob Young, 3 August 2012 - GOV.UK, Technology

Search is right at the centre of GOV.UK. It’s the main focus of the homepage and it appears in the corner of every single page. Many of our recent and upcoming apps, such as licence finder, also rely heavily on search. So making sure we have the right tool for the job is vital. Recently we decided to begin switching away from Solr to elasticsearch for our search server. Rob Young, a developer at GDS, explains in some detail the basis for our decisions. The usual disclaimers about this being quite technical apply.

A little background

Both Solr and elasticsearch are Lucene-based search servers that expose an HTTP interface. They both provide a lot of the same features and, in fact, both depend on a lot of the same code. So why on earth do we want to go through the effort of switching?

Controlling the index

One of the great features of elasticsearch is that it exposes all sorts of index management operations through the HTTP interface, such as creating, deleting or modifying the schema of an index. Solr does allow you to create a new index based on an existing one, but nothing more. This extra control is great for a few reasons:

it is easy to experiment with
it puts the control of the index firmly in the hands of the application, where it belongs
it makes temporary indexes for integration testing or A/B testing possible
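
As a rough illustration of that control, the Python sketch below builds (but does not send) the HTTP requests that would create and delete a temporary index. The host, port, index name `rummager_test` and the helper names are assumptions made for this example, not details of our setup, and the accepted settings vary between elasticsearch versions.

```python
import json
from urllib.request import Request

# Assumed address of a local elasticsearch node; adjust for your setup.
BASE_URL = "http://localhost:9200"


def create_index_request(name, settings=None):
    """Build a PUT request that asks elasticsearch to create an index."""
    body = json.dumps({"settings": settings or {}}).encode("utf-8")
    return Request(
        f"{BASE_URL}/{name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def delete_index_request(name):
    """Build a DELETE request that asks elasticsearch to drop an index."""
    return Request(f"{BASE_URL}/{name}", method="DELETE")


# Hypothetical throwaway index for an integration test run.
create = create_index_request("rummager_test", {"number_of_shards": 1})
delete = delete_index_request("rummager_test")

for request in (create, delete):
    print(request.get_method(), request.full_url)
```

Passing either request to `urllib.request.urlopen` would send it to a running node; the sketch stops short of that so it can be read and run without one.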

Elasticsearch also allows us to have rich, nested documents that better model our data. It may be difficult to visualise so take a look at the two JSON documents below.

{
  "title": "Example title",
  "format": "Answer",
  "section": "Example",
  "additional_links__title": ["title one", "title two"],
  "additional_links__link": ["/one", "/two"]
}

https://gist.github.com/3092478

{
  "title": "Example title",
  "format": "Answer",
  "section": "Example",
  "additional_links": [
    {"title": "title one", "link": "/one"},
    {"title": "title two", "link": "/two"}
  ]
}
https://gist.github.com/3092498

The first shows how a Solr document has to flatten the additional links into two parallel lists; the second shows how we could model them as nested objects in elasticsearch.
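
To make the difference concrete, here is a small Python sketch that converts the flat, Solr-style document into the nested form. The function name and the `__` separator convention are assumptions taken from the example documents above, not part of either search server.

```python
import json


def nest_parallel_fields(doc, prefix, separator="__"):
    """Fold parallel list fields such as prefix__title and prefix__link
    into a single list of objects stored under prefix."""
    marker = prefix + separator
    # Copy across every field that is not part of the flattened group.
    nested = {key: value for key, value in doc.items()
              if not key.startswith(marker)}
    # Collect the parallel lists, keyed by their sub-field name.
    columns = {key[len(marker):]: value for key, value in doc.items()
               if key.startswith(marker)}
    if columns:
        # zip(*...) walks the lists in step, producing one row per link.
        nested[prefix] = [dict(zip(columns, row))
                          for row in zip(*columns.values())]
    return nested


flat_doc = {
    "title": "Example title",
    "format": "Answer",
    "section": "Example",
    "additional_links__title": ["title one", "title two"],
    "additional_links__link": ["/one", "/two"],
}

nested_doc = nest_parallel_fields(flat_doc, "additional_links")
print(json.dumps(nested_doc, indent=2))
```

The flat form only works if both lists stay the same length and in the same order; the nested form keeps each link's title and URL together, so that constraint disappears.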

Finding results

Just about the most important feature of any search engine is the ability to query it. Both Solr and elasticsearch expose their query APIs over HTTP but they do so in quite different ways. Solr queries are made up of two- and three-letter URL parameters, while elasticsearch queries are clear, self-documenting JSON objects passed in the HTTP body.

Here is a curl command that can be run from the terminal to query a Solr index. It’s quite difficult to interpret. Some of the fields can be worked out, but ultimately you would have to resort to the documentation to find out what they mean.

curl -XGET 'http://localhost:8983/solr/rummager/select?qt=dismax&q=bank+hoildays&fl=title%2Clink%2Cdescription%2Cformat&bq=format%3A%28transaction+OR+recommended-link%29%5E3.0&hl=true&hl.fl=description%2Cindexable_content&hl.simple.pre=HIGHLIGHT_START&hl.simple.post=HIGHLIGHT_END&start=0&rows=50&mm=75%25'
https://gist.github.com/3092083
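
One way to make sense of the Solr command is to decode its query string. The Python sketch below does that with the standard library; the one-line descriptions are our own summary of the Solr parameters rather than official wording, and the URL is copied unchanged from the command above.

```python
from urllib.parse import parse_qs, urlsplit

SOLR_URL = (
    "http://localhost:8983/solr/rummager/select?qt=dismax&q=bank+hoildays"
    "&fl=title%2Clink%2Cdescription%2Cformat"
    "&bq=format%3A%28transaction+OR+recommended-link%29%5E3.0"
    "&hl=true&hl.fl=description%2Cindexable_content"
    "&hl.simple.pre=HIGHLIGHT_START&hl.simple.post=HIGHLIGHT_END"
    "&start=0&rows=50&mm=75%25"
)

# Informal summaries of each parameter (an assumption-level gloss,
# check the Solr documentation for the precise behaviour).
PARAMETER_NOTES = {
    "qt": "request handler (query type)",
    "q": "the user's query text",
    "fl": "fields to return",
    "bq": "boost query",
    "hl": "enable highlighting",
    "hl.fl": "fields to highlight",
    "hl.simple.pre": "text inserted before a highlighted term",
    "hl.simple.post": "text inserted after a highlighted term",
    "start": "offset of the first result",
    "rows": "number of results to return",
    "mm": "minimum should match",
}

# parse_qs returns a list per key; each key appears once here.
params = {key: values[0]
          for key, values in parse_qs(urlsplit(SOLR_URL).query).items()}

for name, value in params.items():
    print(f"{name:15} {value!r:45} {PARAMETER_NOTES.get(name, '?')}")
```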

Compare that to the (mostly) equivalent query using elasticsearch.

curl -XGET 'http://localhost:9200/rummager/_search' -H 'Content-type: application/json' -d '{
  "from": 0, "size": 50,
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "fields": ["title", "description", "indexable_content"],
          "query": "bank holidays",
          "minimum_should_match": "75%"
        }
      },
      "should": {
        "query_string": {
          "default_field": "format",
          "query": "transaction OR recommended-link",
          "boost": 3.0
        }
      }
    }
  },
  "highlight": {
    "pre_tags": ["HIGHLIGHT_START"],
    "post_tags": ["HIGHLIGHT_END"],
    "fields": {
      "description": { },
      "indexable_content": { }
    }
  }
}'

https://gist.github.com/3092079

It’s much more verbose, but it’s also much more obvious what is happening.
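
Because the elasticsearch query is plain JSON, an application can build it as an ordinary data structure. The Python sketch below reproduces the request body from the curl command above; the function name and its parameters are hypothetical and chosen only for this example.

```python
import json


def build_search_body(query, boosted_formats, boost=3.0, start=0, rows=50):
    """Return the JSON body for the search request as a Python dict."""
    return {
        "from": start,
        "size": rows,
        "query": {
            "bool": {
                # Results must match the user's query text.
                "must": {
                    "query_string": {
                        "fields": ["title", "description", "indexable_content"],
                        "query": query,
                        "minimum_should_match": "75%",
                    }
                },
                # Results in these formats are ranked higher.
                "should": {
                    "query_string": {
                        "default_field": "format",
                        "query": " OR ".join(boosted_formats),
                        "boost": boost,
                    }
                },
            }
        },
        "highlight": {
            "pre_tags": ["HIGHLIGHT_START"],
            "post_tags": ["HIGHLIGHT_END"],
            "fields": {"description": {}, "indexable_content": {}},
        },
    }


body = build_search_body("bank holidays", ["transaction", "recommended-link"])
print(json.dumps(body, indent=2))
```

Changing the boost or the page size becomes a change to a named argument rather than an edit to an encoded URL.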

Performance

Not every aspect of elasticsearch is an improvement on Solr, and performance is one example. Solr performs very well on small indexes that don’t change very often, which describes ours. We set up a few very simple performance tests. Our goal wasn’t to get an accurate picture of production performance but rather to get an idea of the difference between the two. However, elasticsearch is more than fast enough, so performance is not a compelling reason to stick with Solr.
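
This post does not describe how the tests were run. As a purely hypothetical sketch of a simple comparison, the Python below times repeated calls to a search function and reports the median; `fake_search` is a stand-in, and in practice it would be replaced by a function that sends the query to Solr or to elasticsearch.

```python
import statistics
import time


def time_calls(search, queries, repeats=5):
    """Call search(query) for every query, repeats times over,
    and return the duration of each call in seconds."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            started = time.perf_counter()
            search(query)
            samples.append(time.perf_counter() - started)
    return samples


def fake_search(query):
    # Stand-in for a real HTTP call to a search server (assumption).
    return [word.upper() for word in query.split()]


samples = time_calls(fake_search, ["bank holidays", "licence finder"])
print(f"{len(samples)} calls, median {statistics.median(samples) * 1000:.3f} ms")
```

Running the same queries through both servers and comparing the medians gives a rough sense of the difference, which is all a test like this can offer.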

Stability

Another concern raised was index stability and the risk of corruption. This is a serious concern. It is also a difficult one to refute, as stability issues often only arise in specific circumstances, such as under consistently high load for an extended period of time. We spoke to 37Signals and Mozilla, amongst others, and did not find anything that worried us.

Taking all of these reasons together, on the 25th of June we gave elasticsearch the thumbs up, and we’re now using it in production.

12 comments

Comment by Nick Holmes posted on 17 January 2013

I've just started investigating and testing GOV UK - randomly at present I have to admit.

Can you shed some light on how results are ranked as this appears to me to be almost completely random.

For example, let's look at a browse for Businesses and self-employed > Farming

The top few results on the list appear randomly ordered, but maybe those are the most popular or the most ... ? - you tell me. Then the remaining results are alpha ordered. ??

At the bottom we have a box with detailed guidance links. I latch on to Grants and payments for farmers.

Now I paste that title into the search box expecting that page to come out top. No way! It's nowhere to be seen.??

I know you will hate me saying this, but if I Google Grants and payments for farmers the page is #1.

Comment by Coding in the open | Government Digital Service posted on 12 October 2012

[...] is the right solution and start using the software straight away. A good example of that is the Elasticsearch search engine that Rob wrote about a few months back. From what we were reading it seemed like it’d be simpler for us to work with than solr, and [...]

Comment by Andre' (@AndreHazelwood) posted on 11 September 2012

Your query is not the same. Not to invalidate the results. bank+hoildays != bank holidays

Comment by ElasticSearch | ReStreaming posted on 24 August 2012

[...] From Solr to elasticsearch (digital.cabinetoffice.gov.uk) [...]

Comment by Graham Jenkins posted on 08 August 2012

Everyone else puts search at the top, even when it's not their main focus, so why did you hide it 4 screens down? Just curious.

Comment by davie hay posted on 07 August 2012

"center of GOV.UK." ? Shouldnt that be "centre of GOV.UK."

- Replies to davie hay>
  
  Comment by Louise Kidney posted on 07 August 2012
  
  Thanks for your feedback. This has now been changed.
  
Comment by Gerrit Berkouwer posted on 07 August 2012

Interesting move! Were there any functional thoughts about the two solutions? Does ElasticSearch differ in the way you present search results? Or was it a pure technical decision to make the switch?
Do you use several indexes for several websites, or is there 1 index only for the Gov.uk domain?

- Replies to Gerrit Berkouwer>
  
  Comment by Rob Young posted on 07 August 2012
  
  If I understand your first question correctly it's about how search results are ordered. As both Solr and Elasticsearch use Lucene as their underlying library the default result order will be pretty similar. One benefit that Elasticsearch offers over Solr in this area is that it's much easier to experiment with different configurations that affect the order of the results.
  We use a separate index for each application under http://www.gov.uk. I've been stung before by sticking everything in one index. If the content is considerably different between the websites this will affect the result ordering.
  
Comment by From Solr to elasticsearch [Clarity as a Value?] « Another Word For It posted on 06 August 2012

[...] From Solr to elasticsearch by Rob Young. [...]

Comment by robyounggds posted on 04 August 2012

Hi Luke, that's a very interesting question and one that we will need to answer but right now we haven't. Note this is just for search traffic and not total page requests. I've just done some very rough, back of an envelope calculations based on Directgov traffic and the proportion of GOV.UK traffic that goes through the search page. It looks like we should be expecting to peak at around 3.5k - 4k search requests per hour. However, the error margins are pretty huge and it assumes no extra load due to publicity around the launch.

Comment by Luke posted on 03 August 2012

When gov.uk is launched how many queries do you expect to receive per hour/day?
