Update: I have pushed my Python code to GitHub (repo is here). My implementation is a tad more advanced than this tutorial. See the Readme file and code comments.
Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene (Solr website).
Goal
My goal is to demonstrate building an e-commerce gallery page with search, pagination, filtering and multi-select that mirrors the expectations of a typical user. See this article for a nice explanation of the multi-select filtering I am trying to implement.
Search should work for phrase queries like “mens shrt gap” or “gap 2154abc”, factoring in typos, various word forms (stemming) and phonetic spelling.
Solr Setup
Solr 7 is installed locally on my computer with an active connection to a database. Solr is using the deltaQuery feature (in db-data-config.xml) to detect changes in my database and import those records into Solr.
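For reference, here is a rough sketch of what a delta-import entity in db-data-config.xml can look like, assuming a product table with a last_modified column (the table, column, and field names are placeholders, not from my actual config):

<entity name="product"
        query="SELECT * FROM product"
        deltaQuery="SELECT id FROM product
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM product
                          WHERE id = '${dataimporter.delta.id}'">
  <field column="id" name="id"/>
  <field column="brand" name="brand"/>
</entity>

deltaQuery finds the IDs of rows changed since the last import, and deltaImportQuery fetches the full row for each changed ID.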
Web Development Setup
I have a basic Django/React app with Python 3. See this article for ideas on how to integrate Django with React. I recommend following these instructions to create your own Solr client.
I was considering using pySolr as a client, but it lacks good documentation and seems to have been neglected since 2015 (like most Solr libraries). Nevertheless, pySolr can work if you are ready to comb through the GitHub issues and codebase.
If you are using pySolr:
Paste

export DEBUG_PYSOLR='true'

into your terminal before running your server, and you will be able to view the URL generated by pySolr. Be aware that the URL printed to your terminal is not URL-encoded, so a query like Dolce & Gabbana will work on your website, but break when you paste the URL into a browser.
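A quick illustration of that encoding gap, using only the standard library:

import urllib.parse

# '&' must be escaped, otherwise the browser reads it as a
# query-string parameter separator and truncates the query
print(urllib.parse.quote_plus('Dolce & Gabbana'))
# -> Dolce+%26+Gabbana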
Facets & Facet Pivots
Facets are synonymous with product categories or specs. Solr has an option to return the available facets with their respective counts for a specific query. You can control the minimum number of products required in a facet by setting facet.mincount=<number>.

For example, if you are selling brand-name clothing, facets might refer to gender, style and material. If the search was for “mens casual gap”, the facets returned would look like this (notice the constraints on gender and style):
"facet_fields" : {
"gender" : [
"Men" , 25,
"Women" , 0
],
"style" : [
"casual", 10,
"dress", 0
],
"material" : [
"wool", 15,
"cotton", 10
],
}
Example Query
Let’s run through an example:
from urllib.request import urlopen
import urllib.parse
import simplejson
def gallery_items(current_query):
    solr_tuples = [
        # text in search box
        ('q', "mens shirt gap"),
        # how many products do I want to return
        ('rows', current_query['rows_per_page']),
        # offset for pagination
        ('start', current_query['start_row'] * current_query['rows_per_page']),
        # example of a default sort;
        # for a search phrase leave blank to allow
        # for relevancy score sorting
        ('sort', 'price asc, popularity desc'),
        # which fields do I want returned
        ('fl', 'product_title, price, code, image_file'),
        # enable facets and facet.pivots
        ('facet', 'on'),
        # allow for an unlimited number of facets in results
        ('facet.limit', '-1'),
        # a facet has to have at least one
        # product in it to be a valid facet
        ('facet.mincount', '1'),
        # regular facets (the parameter is facet.field, singular)
        ('facet.field', ['gender', 'style', 'material']),
        # nested facets
        ('facet.pivot', 'brand,collection'),
        # edismax is Solr's multifield phrase parser
        ('defType', 'edismax'),
        # fields to be queried
        # copyall: all facets of a product with basic stemming
        # copyallphonetic: phonetic spelling of facets
        ('qf', 'copyall copyallphonetic'),
        # give me results that match most fields
        # in qf [copyall, copyallphonetic]
        ('tie', '1.0'),
        # format response as JSON
        ('wt', 'json')
    ]
    solr_url = 'http://localhost:<port>/solr/<core>/select?'
    # encode for URL format; doseq=True expands list values
    # such as facet.field into repeated parameters
    encoded_solr_tuples = urllib.parse.urlencode(solr_tuples, doseq=True)
    complete_url = solr_url + encoded_solr_tuples
    connection = urlopen(complete_url)
    raw_response = simplejson.load(connection)
    return raw_response
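As a usage sketch (the current_query keys follow the example above; numFound is the total hit count Solr reports in every response):

# hypothetical page state coming from the front end
current_query = {
    'rows_per_page': 20,
    'start_row': 2,  # zero-based page index
}

response = gallery_items(current_query)
print(response['response']['numFound'])          # total matching products
print(response['facet_counts']['facet_fields'])  # facets for the sidebar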
Phrase search will be discussed in the next section — Schema Modeling.
- I would suggest using tuples for each key-value pair, as it will be easier to urlencode. It will also be easier to manipulate, particularly when you have a complicated fq with a ton of AND, OR logic (which will happen very soon if you are doing filtering).
- Each facet group will have its own fq field. This ensures that AND logic is applied across filter groups. Here is code for applying OR logic within a facet group:
def apply_facet_filters(self):
    if self.there_are_facets():
        for facet, facet_arr in self.facet_filter.items():
            if len(facet_arr) > 0:
                new_facet_arr = []
                for a_facet in facet_arr:
                    # quote each selected value, e.g. material: "wool"
                    new_facet_arr.append("{0}: \"{1}\"".format(facet, a_facet))
                # one fq per facet group, OR-ing the selected values
                self.solr_args.append(('fq', ' OR '.join(new_facet_arr)))
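For example, assuming the user has checked two materials and one gender (a hypothetical facet_filter), the method above would emit:

# self.facet_filter = {'material': ['wool', 'cotton'], 'gender': ['Men']}
# after apply_facet_filters(), self.solr_args contains:
#   ('fq', 'material: "wool" OR material: "cotton"')
#   ('fq', 'gender: "Men"')
# i.e. OR within a group, AND across groups (one fq each)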
- facet.pivot.mincount allows you to control the minimum number of products required for a facet.pivot, but beware: if you set it to 0, your server will likely crash.
- I’ve found that field values need to be formatted in quotes:

'fq': "brand: \"{0}\"".format(current_query['current_brand'])

- Facets are returned as flat arrays of alternating value/count pairs, like ["Men", 25, "Women", 0], not a dict(), which I find inconvenient. Here is one way to format them:
import more_itertools as mit

facets = {}
for k, v in raw_response['facet_counts']['facet_fields'].items():
    # chunk the flat [value, count, value, count, ...] list into pairs
    spec_list = [list(spec) for spec in mit.chunked(v, 2)]
    spec_dict = {}
    for spec in spec_list:
        spec_dict[spec[0]] = spec[1]
    facets[k] = spec_dict
raw_response['facet_counts']['facet_fields'] = facets
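If you would rather avoid the extra dependency, a slicing-based equivalent (same input, same output) could look like this:

facets = {
    field: dict(zip(values[0::2], values[1::2]))
    for field, values in raw_response['facet_counts']['facet_fields'].items()
}
raw_response['facet_counts']['facet_fields'] = facets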
- By default, if a user selects a facet in a facet group, Solr will return that facet group with only the selected facet, since the search has been narrowed down. But many times, a user would still like to view the unselected facets and associated counts, to enable multi-select. To allow this functionality, use tagging and excluding, as sketched below. See my repo for a possible implementation.
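A minimal sketch of the tag/exclude syntax (brandTag is an arbitrary label I made up, and the checked brand is hypothetical):

# tag the filter produced by the user's selection
solr_args.append(('fq', '{!tag=brandTag}brand:"gap"'))
# exclude that tag when computing the brand facet, so
# unselected brands and their counts are still returned
solr_args.append(('facet.field', '{!ex=brandTag}brand'))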
- To create price ranges as a filter with custom intervals, copy price to a new field with one of the Trie fieldTypes. The new field should have indexed and stored set to false, and docValues set to true. Then follow the instructions to add custom ranges (a sketch follows). See the next section on schema modeling, and see my repo for a possible implementation.
Schema Modeling
If you can get past the idea that fields exist simply to store properties of data, and embrace the idea that you can manipulate data so it can be found as users expect it, then you can begin to effectively program relevance rules into the search engine. (Relevant Search, Chapter 5)
We are ready to modify fields in our document schema to conform to the users’ perception of our products.
Take a look at the documentation about how to update the schema, particularly the sections on tokenizing and filtering. Learn about stemming filters. Ask yourself which tokenizers/filters are relevant for your situation, and whether each should be applied at query or index time.
I will be following a recommendation in the documentation to copy all fields a user might be interested in into a single copyall field. This solves the albino elephant issue, as well as signal discordance:
As we’ve stated, when users search, they typically don’t care how documents decompose into individual fields. Many search users expect to work with documents as a single unit: the more of their search terms that match, the more relevant the document ought to be. It may surprise you to know that search engine features that implemented this idea were late to the party. Instead, Lucene-based multifield search depended on field-centric techniques. Instead of the search terms, field-centric search makes field scores the center of the ranking function. In this section, we explore exactly why field-centric approaches can create relevance issues. You’ll see that having ranking functions centered on fields creates two problems:
The albino elephant problem — A failure to give a higher rank to documents that match more search terms.
Signal discordance — Relevance scoring based on unintuitive scoring of the constituent parts (title versus body) instead of scoring of the whole document or more intuitive larger parts, such as the entire article’s text or the people associated with this film. (Relevant Search, Chapter 6)
We will be using the Schema API through the Admin UI. You cannot edit the schema file manually (why). Here is the recipe for creating the copyall field:
- Step 1: Create a fieldType for the field. I am using the same fieldType for both index and query time. I have kept the stemming light to ensure that brand names stay intact.
<fieldType name="facets" class="solr.TextField">
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.ClassicFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
</fieldType>
- Step 2: Create a copyall field with facets as the fieldType. Set multiValued=true to allow multiple values in the field (as an array). Set omitNorms=true since users don’t care about the length of each field (docs), and we don’t want Solr to care either.
- Step 3: Create a copyField for every field in the data source that you want copied. Remember, there is no chaining of copyFields:
{
"add-copy-field":{
"source":"brand",
"dest":"copyall"
}
}
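If you prefer scripting these steps over clicking through the Admin UI, the same commands can be posted to the Schema API. A sketch, with <port> and <core> as placeholders and the indexed/stored flags being my assumptions:

import requests

schema_url = 'http://localhost:<port>/solr/<core>/schema'

# Step 2: the copyall field itself
requests.post(schema_url, json={
    'add-field': {
        'name': 'copyall',
        'type': 'facets',
        'multiValued': True,
        'omitNorms': True,
        'indexed': True,
        'stored': False,
    }
})

# Step 3: one add-copy-field command per source field
requests.post(schema_url, json={
    'add-copy-field': {'source': 'brand', 'dest': 'copyall'}
})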
- Step 4: Repeat steps 1–3 if you want to create a copyall for phonetic spelling. Use an appropriate fieldType. I am using the Beider-Morse Filter.
- Step 5: Add a tie breaker of 1 to get most-fields functionality. The docs provide a nice explanation.
Some ideas:
- Add index-time boosts for products that are more popular, so they rank higher in the search results. You could also do a query-time boost, something to the effect of bf='div(1,popularity)'.
- Use function queries to customize anything about the relevancy scoring of your search results.
- Consider the N-Gram filter for typo tolerance.
- Consider the Edge-N-Gram filter for autocomplete (a sketch follows the text_en example below).
- Consider using the text_en fieldType for regular English words (it is one of the many fieldTypes which come with Solr):
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
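And for the Edge-N-Gram idea above, a sketch of an autocomplete fieldType (the name and gram sizes are arbitrary choices of mine). Grams are produced at index time only, so the query side sees plain lowercase tokens:

<fieldType name="autocomplete" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>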
Debugging and Workflow
- Use the Analysis screen in the Admin UI to check how particular terms are analyzed at index or query time.
- Add a console.log in your code to print the URL for every query. Set debugQuery=true and read the parsedQuery and explain. All the math fun is lurking in the explain (see Relevant Search, Chapter 2). A minimal snippet follows this list.
- After re-configuring the schema, make sure to delete all docs in your index and do a fresh full-import from your database. This can be done in the Admin UI.
- If you need to debug the database import, use the Debug-Mode with verbose output.
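The debugging parameter can simply be appended to the query tuples from the example query above:

# append before urlencoding; the response will then include
# parsedQuery and a per-document explain section
solr_tuples.append(('debugQuery', 'true'))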
Further Reading
Relevant Search by Doug Turnbull and John Berryman
The examples in the book use Elasticsearch, but Appendix B provides mappings to Solr. If the book is too long, read chapters 5 & 6. These chapters tackle which strategies to use to match a multi-field (phrase) search with the most relevant results.