Scaling Search: We Used Solr to Reduce Indexing Time for Millions of Records from 14 Hours to 12 Minutes

We took the indexing time of over a million records from 14 hours to just 12 minutes using a combination of Solr’s native bulk import functionality (called DataImportHandler) and Sunspot.

Solr is used by some of the biggest sites on the web, including Netflix, Zappos and even NASA, to improve search functionality. I discovered Solr in 2009, when I was trying to search through over a million orders for an ecommerce site I was building.

Working with Solr’s arcane syntax can be a challenge. Sunspot is a great Ruby gem for making interaction with Solr simple and easy. Like many great abstraction layers, Sunspot achieves its ease-of-use by limiting what you can do. One useful feature of Solr that Sunspot doesn’t support is bulk importing documents directly from a database. To re-index a collection of documents using Sunspot, the gem selects each document from the database individually, creates an ActiveRecord object, and then inserts the document into Solr. This can be a resource-intensive and lengthy process with large collections.

Fortunately, with a little work, you can get Solr’s native bulk import functionality, called DataImportHandler, working with Sunspot.  This technique took the indexing time of over a million records from 14 hours to just 12 minutes.  Here’s how:

First, it’s important to understand how Sunspot represents your data in Solr.

Sunspot takes each field that you’ve identified as “searchable” and creates a corresponding field in Solr made up of the property name in ActiveRecord joined with a suffix indicating the type of data the field contains.  So a “blog_id” field that’s an integer will appear as “blog_id_i” in Solr.  Here’s a list of the most common datatype-to-suffix mappings:

  • text: _text

  • integer: _i

  • time: _time

  • date: _date

  • boolean: _b

  • string: _s

You can find the full list in the schema.xml file in the Sunspot repository.  For multivalued fields, append an “m” to the suffixes above (so if a blog has multiple author_id values, the Solr field would be “author_id_im”).  And if the contents of the field are stored in Solr (as opposed to just being indexed), append an “s”.  If the contents of a multivalued field are stored in Solr, then append both letters, “ms”.

An exception to these mapping rules is the “id” column.  This maps directly into Solr as “id”.
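The naming rules above can be sketched in plain Ruby. This helper is purely illustrative — it is not part of Sunspot — but it captures the convention: property name plus type suffix, with “m” appended for multivalued fields and “s” for stored ones.

```ruby
# Hypothetical helper reproducing Sunspot's field-naming convention.
SUFFIXES = {
  text: "_text", integer: "_i", time: "_time",
  date: "_date", boolean: "_b", string: "_s"
}.freeze

def solr_field_name(name, type, multivalued: false, stored: false)
  return "id" if name.to_s == "id"  # "id" maps through unchanged
  suffix = SUFFIXES.fetch(type)
  suffix += "m" if multivalued      # multivalued fields get an "m"
  suffix += "s" if stored           # stored fields get an "s"
  "#{name}#{suffix}"
end

solr_field_name("blog_id", :integer)                    # => "blog_id_i"
solr_field_name("author_id", :integer, multivalued: true)  # => "author_id_im"
```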

Next, add a request handler definition to the end of your solrconfig.xml file that references a data import configuration file you’ll create.

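A typical declaration looks like this (the handler name “/dataimport” is the conventional choice; the DataImportHandler jar must also be on Solr’s classpath):

```xml
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```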

Here we call the configuration file data-config.xml.  It should reside in the same directory as solrconfig.xml.

Solr’s data import system is a powerful way to bulk load records from another source (be it a database, XML file, or a URL).  

Its full documentation can be found on the Solr wiki. Generally, you’ll want to either import an entire collection of documents, or import only those documents that have been updated since the last import.  Fortunately, DataImportHandler makes this easy by allowing you to construct a SQL select statement with flags in it to indicate which type of import you’d like.

For instance, for a blog database, the select statement might look like this:

SELECT id, title, content, author_id FROM blog_articles
WHERE '${dataimporter.request.clean}' != 'false'
OR updated_at > '${dataimporter.last_index_time}'

For full imports, dataimporter.request.clean can be set to true, which will load all the records.  To only import data changed since the last import, dataimporter.request.clean can be set to false, and dataimporter.last_index_time will be substituted with the time of the last import.  Fortunately, this time is stored automatically by DataImportHandler in a properties file in the same directory as solrconfig.xml and data-config.xml (therefore that directory should be writable by Solr).

Once you’ve figured out how your fields are represented in Solr and constructed your select statement, it’s just a matter of putting all the pieces together.  

Use this XML snippet as a template for your own data-config.xml:

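Here is a representative sketch for the blog example above — the JDBC URL, driver, and credentials are placeholder assumptions for a MySQL setup, and the field names follow the Sunspot suffix rules discussed earlier:

```xml
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/blog_production"
              user="solr" password="secret"/>
  <document>
    <entity name="blog_article" pk="id"
            query="SELECT id, title, content, author_id FROM blog_articles
                   WHERE '${dataimporter.request.clean}' != 'false'
                   OR updated_at > '${dataimporter.last_index_time}'">
      <field column="id" name="id"/>
      <field column="title" name="title_text"/>
      <field column="content" name="content_text"/>
      <field column="author_id" name="author_id_i"/>
    </entity>
  </document>
</dataConfig>
```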

Then browse to the Solr web interface, choose a core, and select “DataImport” in the left column.  

There you’ll see instructions for how to issue a “full-import” (the entire collection) or “delta-import” (only documents since the last import) command.  You may wish to uncheck the “clean” checkbox when running a “full-import” to prevent Solr from clearing the index before each import.  Clearing the index can take some time, especially on large datasets. (Note that this meaning of “clean” is not the same as that used in the “dataimporter.request.clean” argument mentioned above.)
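The same commands can also be issued over HTTP, which is handy for cron jobs or scripts. This sketch assumes Solr is running on localhost:8983 with a core named “production” — adjust both to your setup:

```shell
# Full import without clearing the index first
curl 'http://localhost:8983/solr/production/dataimport?command=full-import&clean=false'

# Import only documents changed since the last import
curl 'http://localhost:8983/solr/production/dataimport?command=delta-import'

# Check the progress of a running import
curl 'http://localhost:8983/solr/production/dataimport?command=status'
```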

Debugging is straightforward.

You can check your MySQL query log to see the queries being executed, and activate LogTransformer to capture any Solr errors. Activate LogTransformer by adding the following attributes to your entity tag:

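A sketch of the relevant attributes, continuing the blog example (the log message template here is a hypothetical example; the query is unchanged from before):

```xml
<entity name="blog_article"
        transformer="LogTransformer"
        logTemplate="imported article ${blog_article.id}"
        logLevel="debug"
        query="...">
```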

Unlike other transformers, LogTransformer does not apply to a particular field, so its attributes are set on the entity itself.

Valid logLevels are "trace", "debug", "info", "warn", and "error".

With these steps, you’ll be able to quickly import large collections into Solr directly from your database, and still take advantage of Sunspot’s easy and simple interface.