Developing with IBM Watson Retrieve and Rank: Part 1 Solr Configuration

Chris Ackerson
6 min read · Mar 17, 2016
IBM Watson Retrieve and Rank

What is Retrieve and Rank?

The Retrieve and Rank (R&R) service on the Watson Developer Cloud enhances standard search implementations with natural language and machine learning capabilities. R&R is really two services bolted together. Retrieve is Apache Solr in the cloud and offers the full Solr feature set, with two visible enhancements: 1) a custom query builder optimized for natural language search and 2) a set of algorithms that score semantic relationships between a given query and a Solr document, which are fed into the Rank portion of R&R. The ranker is a learning-to-rank algorithm designed to re-rank objects in a list based on their relevance to some objective.

As a side note, the ranker can be used independently of Retrieve and has interesting applications in recommender engines, which I’ll cover in a future post. In Part 1 of Developing with IBM Watson Retrieve and Rank, I’ll focus on configuring your Solr cluster in Retrieve. In Part 2, I’ll cover training and evaluation of the ranker, and in Part 3, I’ll cover implementing custom features for the ranker.

Background

My goal in this post is to show how to configure a Solr cluster with R&R and how to use the Watson Document Conversion service to format content for your Solr collection.

I found some content from the NIDDK that we had crawled for another set of experiments last year. It’s a small dataset of about 40 documents on topics related to Digestive and Kidney diseases. I’ll implement a solution to answer simple questions related to these topics. Across these three posts, we will follow a process to optimize the content, configure a Solr cluster, build a ground truth, train a ranker, and run a set of experiments to evaluate our configuration and custom feature.

All code and data assets can be found on GitHub.

Configuring Solr

The first step is to create a Retrieve and Rank service instance in Bluemix and capture your credentials (check out the tutorials in the WDC documentation for more detail). Then you can configure Solr:

Create a new Solr Cluster:

curl -H "Content-Type: application/json" -X POST -u "username":"password" -d "{\"cluster_size\":\"1\",\"cluster_name\":\"niddk_cluster\"}" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters"

Capture the Solr Cluster ID <sc3689b816_2b07_4548_96a9_a9e52a063bf1>
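If you are scripting the setup and have jq installed, you can create the cluster and capture the ID in one step, since the creation response includes a solr_cluster_id field:

```shell
# Create the cluster and capture its ID (requires jq; credentials are placeholders).
SOLR_CLUSTER_ID=$(curl -s -H "Content-Type: application/json" -X POST \
  -u "username":"password" \
  -d '{"cluster_size":"1","cluster_name":"niddk_cluster"}' \
  "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters" \
  | jq -r '.solr_cluster_id')
echo "$SOLR_CLUSTER_ID"
```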

Check the status of your cluster:

curl -u "username":"password" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/sc3689b816_2b07_4548_96a9_a9e52a063bf1"

Once your cluster status shows READY, you can upload your config.
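Provisioning can take a few minutes, so when automating this it helps to poll the status endpoint until the cluster comes up. A simple sketch (the status field in the response is solr_cluster_status, and the ready value is READY):

```shell
# Poll the cluster status until it reports READY (credentials and cluster ID are placeholders).
CLUSTER_URL="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/sc3689b816_2b07_4548_96a9_a9e52a063bf1"
until curl -s -u "username":"password" "$CLUSTER_URL" | grep -q '"solr_cluster_status" *: *"READY"'; do
  echo "cluster not ready yet, waiting..."
  sleep 30
done
echo "cluster is READY"
```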

Upload configuration to Solr

Figure 1 shows the documents that exist in the default Solr config. For now we’ll only worry about updating schema.xml, which is where we define the fields that represent a document.

Fig 1. Solr Config documents

Think of a Solr document as a row in a table. Each field represents a column in the table which is some metadata about the document. Take a look at Appendicitis.html. As you can see in Figure 2, each section is some information about Appendicitis like the cause. Each of these sections will become a separate document in our Solr collection.

Fig 2. Section of Appendicitis.html

Our schema fields need to capture the relevant metadata about this section:

  • id — unique identifier for the document
  • source — the original source document
  • doc_type — what type of section this is e.g. cause. This will become relevant for our custom feature
  • topic — The section header
  • text_description — The body content of the document

Below are our fields as defined in schema.xml. See the Solr documentation for details on schema.xml and solrconfig.xml. The most important thing to note here is the type: if the type is "watson_text_en", you are telling Retrieve to score that field against the incoming query. Fields of any other type are not scored.

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="source" type="text_en" indexed="true" stored="true" required="false" multiValued="true" />
<field name="doc_type" type="text_en" indexed="true" stored="true" required="false" multiValued="true" />
<field name="topic" type="watson_text_en" indexed="true" stored="true" required="false" multiValued="true" />
<field name="text_description" type="watson_text_en" indexed="true" stored="true" required="false" multiValued="true" />
<!-- make a copy field using normal OOB solr text_en -->
<field name="text" type="text_en" indexed="true" stored="false" required="false" multiValued="true" />
<!-- make a copy field using watson_text_en -->
<field name="watson_text" type="watson_text_en" indexed="true" stored="false" required="false" multiValued="true" />
<copyField source="source" dest="text"/>
<copyField source="doc_type" dest="text"/>
<copyField source="topic" dest="text"/>
<copyField source="text_description" dest="text"/>
<copyField source="source" dest="watson_text"/>
<copyField source="doc_type" dest="watson_text"/>
<copyField source="topic" dest="watson_text"/>
<copyField source="text_description" dest="watson_text"/>

Once our Solr config directory is complete, we can zip it up and upload it to our cluster.

curl -X POST -H "Content-Type: application/zip" -u "username":"password" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/sc3689b816_2b07_4548_96a9_a9e52a063bf1/config/niddk_config" --data-binary @/niddk_solr_config.zip

Building your Solr Collection

Machine learning-based applications are only as good as the data that powers them. There are lots of considerations around content curation, optimization, licensing, cleansing and administration which I won’t address in this blog. The following section details how to format content for Solr.

The Watson Document Conversion service allows you to convert and segment HTML, Word, and PDF documents. When you tell the Document Conversion service how to segment the HTML documents, each segment is called an answer unit. I started with an example from the WDC GitHub for sending Document Conversion output to R&R and amended it for this use case. doc_conversion.js pushes all files from a directory through Document Conversion, formats the output for Solr, and writes to a local JSON file.

The config associated with my Document Conversion call requests answer units, but you could also ask for normalized HTML or text. The selector tags are h1 and h2, which folds any deeper child tags like h3s into the answer unit of their parent. This was important for the particular format of these source documents. There are many advanced customization options that you can explore depending on your source documents and the requirements of your application.

config: {
  "conversion_target": "ANSWER_UNITS",
  "answer_units": {
    "selector_tags": ["h1", "h2"]
  }
}
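For reference, a minimal sketch of how this config might be passed to the service with the Node watson-developer-cloud SDK of that era follows; treat the version date, file name, and exact option names as assumptions to verify against the SDK documentation for the version you install.

```javascript
// Sketch only -- assumes the ~2016-era watson-developer-cloud Node SDK;
// credentials, version_date, and the input file are placeholders.
var fs = require('fs');
var watson = require('watson-developer-cloud');

var documentConversion = watson.document_conversion({
  username: 'username',
  password: 'password',
  version: 'v1',
  version_date: '2015-12-15'
});

documentConversion.convert({
  file: fs.createReadStream('Appendicitis.html'),
  conversion_target: 'ANSWER_UNITS',
  config: {
    answer_units: { selector_tags: ['h1', 'h2'] }
  }
}, function(err, result) {
  if (err) {
    console.error(err);
    return;
  }
  // result.answer_units holds the segmented sections of the document
  console.log(JSON.stringify(result, null, 2));
});
```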

Once the answer units are returned by Document Conversion, you can loop through them and write the extracted content to the relevant Solr fields.

auContents.forEach(function(auContent) {
  if (auContent.media_type === 'text/plain') {
    solrDoc = {
      id: au.id,          // au is the enclosing answer unit
      source: '',
      doc_type: '',       // set later in addDocumentFields
      topic: au.title,
      text_description: auContent.text
    };
  }
});

Finally, we need to set our doc_type field, so we add basic logic in the addDocumentFields function to identify the type. Again, doc_type will become relevant when we build our custom feature in Part 3.

if (solrDoc.topic.indexOf('causes') > -1 || solrDoc.topic.indexOf('Causes') > -1) {
  solrDoc.doc_type = 'cause';
}
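If your corpus has more section types, the single-keyword check generalizes naturally to a small keyword-to-type map. The keywords and types below are illustrative; derive yours from your own section headers.

```javascript
// Illustrative keyword-to-doc_type map -- extend to match your source documents.
var DOC_TYPE_KEYWORDS = {
  causes: 'cause',
  symptoms: 'symptom',
  treatment: 'treatment',
  diagnosed: 'diagnosis'
};

// Return the doc_type for a section topic, or '' when no keyword matches.
function classifyDocType(topic) {
  var lowered = topic.toLowerCase();
  for (var keyword in DOC_TYPE_KEYWORDS) {
    if (lowered.indexOf(keyword) > -1) {
      return DOC_TYPE_KEYWORDS[keyword];
    }
  }
  return '';
}

console.log(classifyDocType('What causes autoimmune hepatitis?'));       // cause
console.log(classifyDocType('What are the symptoms of appendicitis?'));  // symptom
```

Lower-casing the topic once also removes the need to check both "causes" and "Causes" separately.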

Call doc_conversion.js

node doc_conversion.js -i <dir>

The output of doc_conversion.js is an array of Solr documents that can be added to our Solr collection.

[{
  "id": "688378d8-f514-4a49-8c6a-80d4db265d41",
  "source": "Autoimmune Hepatitis",
  "doc_type": "definition",
  "topic": "What are autoimmune diseases?",
  "text_description": "Autoimmune diseases are disorders in which the body’s immune system attacks the body’s own cells and organs with proteins called autoantibodies; this process is called autoimmunity. Autoimmune hepatitis is a chronic disease of the liver. The body’s immune system normally makes large numbers of proteins called antibodies to help the body fight off infections. In some cases, however, the body makes autoantibodies. Certain environmental triggers can lead to autoimmunity. Environmental triggers are things originating outside the body, such as bacteria, viruses, toxins, and medications."
},
{
  "id": "9399aa31-6b61-4287-903a-4c35ad3ee5ab",
  "source": "Autoimmune Hepatitis",
  "doc_type": "cause",
  "topic": "What causes autoimmune hepatitis?",
  "text_description": "A combination of autoimmunity, environmental triggers, and a genetic predisposition can lead to autoimmune hepatitis."
},...]

First we need to create a collection:

curl -X POST -u "username":"password" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/sc3689b816_2b07_4548_96a9_a9e52a063bf1/solr/admin/collections" -d "action=CREATE&name=niddk_collection&collection.configName=niddk_config&wt=json"

Then we can add our documents and commit to our index:

curl -X POST -H "Content-Type: application/json" -u "username":"password" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/sc3689b816_2b07_4548_96a9_a9e52a063bf1/solr/niddk_collection/update?commit=true" --data-binary @solrdocs.json
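A quick way to confirm the documents landed in the index is to ask Solr for a total count: a match-all query with zero rows returns just the numFound field, which should equal the number of answer units in solrdocs.json.

```shell
# Count indexed documents; compare numFound against the size of solrdocs.json.
curl -u "username":"password" \
  "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/sc3689b816_2b07_4548_96a9_a9e52a063bf1/solr/niddk_collection/select?q=*:*&rows=0&wt=json"
```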

Validation and conclusion

Now we have a working Solr cluster on Bluemix. We can test it with the following command.

curl -u "username":"password" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/sc3689b816_2b07_4548_96a9_a9e52a063bf1/solr/niddk_collection/select?q=What%20are%20the%20symptoms%20of%20Appendicitis%3F&wt=json&fl=id,topic,text_description"

Figure 3 shows a partial response.

Fig 3. Response from Solr

You’ll notice that the top answer is incorrect! We haven’t done any optimization yet, so this shouldn’t be too concerning. In this post we walked through how to create and configure a Solr cluster with Retrieve and Rank, and showed how to leverage the Document Conversion service to format HTML, Word, and PDF documents for your Solr collection. In Part 2 we’ll show how to train the ranker and evaluate the performance of both Solr and R&R.


Chris Ackerson

I lead AI Product Development at AlphaSense. I'm interested in sharing what I've learned about productizing AI