Developing with IBM Watson Retrieve and Rank: Part 3 Custom Features

Chris Ackerson
10 min read · Apr 6, 2016
IBM Watson Natural Language Classifier

As discussed in Part 1 and Part 2, the IBM Watson Retrieve and Rank service enables developers to configure a solr search cluster on IBM Bluemix and train a machine-learning powered ranking model to improve the relevance of search results.

Machine Learning Features

R&R has a set of native feature scorers that score the lexical overlap between a given query/document pair. Those scores are generated through a custom solr plugin in Retrieve. The scores are then sent to the ranker, which outputs a ranked list of documents for the query based on its learning. Depending on the solr configuration, each feature will score various fields or combinations of fields within each document. Figure 1 shows an example of a call center solution where each document represents an incident and contains text fields for Short Description, Long Description, and Tech Notes. Each numeric score in the table represents the output of a given feature scorer, where the input is the query “What does the X221 error message mean?” and a document field.

Fig. 1 Retrieve and Rank example flow

Custom Features

For many R&R implementations, these native lexical features are sufficient to meet the success criteria of the application, e.g. migrate the search implementation to the cloud, enable natural language search, or drive x% higher relevance for some subset of query types. Once the R&R cluster is in production, we can imagine improving relevance over time as we collect and refine additional training data. That improvement would look something like Figure 2, where the x axis represents time and the y axis represents some average relevance metric.

Fig. 2 Relevance Improvement over Time

The dotted line represents the theoretical maximum performance of our R&R cluster and is limited by the predictiveness of lexical signal in queries and documents. One way to break through that plateau is to look for additional features that can provide relevance signal to the ranker. We call these custom features for R&R.

In the call center example discussed above, imagine we have access to historical document metrics like the number of views for each incident. An incident with 1000 views is likely to be more relevant for a given query than an incident with 10 views. We can create a custom feature that represents the number of views as a score to the ranker, so in this example we would go from 9 scores for each document to 10. Through training, the ranker would learn when and how this new feature is predictive of relevance. If the feature is strongly correlated with relevance (and not strongly correlated with our lexical features), we would expect to raise our dotted-line plateau!
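As an aside on how a raw metric like view count might be turned into a feature score, here is a small hypothetical sketch (not part of the R&R tooling) that compresses view counts onto a bounded scale with a log transform:

import math

def views_to_score(view_count, max_views=100000):
    """Map a raw view count onto a bounded 0..1 feature score.

    A log transform keeps heavily viewed incidents from dominating the
    feature while still separating 10 views from 1,000 views.
    """
    if view_count <= 0:
        return 0.0
    return min(1.0, math.log1p(view_count) / math.log1p(max_views))

print(views_to_score(10))    # ~0.21
print(views_to_score(1000))  # ~0.60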

There are many custom features we could create for R&R implementations, but they all fall into one of three categories:

  1. DocumentScorer

A document scorer is a class whose input to the score method is a field or fields for a single solr document. Consider a class called DocumentViews that creates a score based on the number of views in the View field of the solr document.

2. QueryScorer

A query scorer is a class whose input to the score method is a set of query params for a solr query. Consider a class called IsQueryOnTopicScorer that scores queries based on whether it thinks the underlying query text is on topic for the application domain.

3. QueryDocumentScorer

A query-document scorer is a class whose input to the score method is 1) a set of query params for a solr query, and 2) a field or fields for a solr document. Consider a class that scores the extent to which the “text” of a solr document answers definitional questions. More specifically, the scorer will 1) identify if a query is asking for a definition and 2) if so, identify whether the document contains a likely definition or not.
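To make those categories concrete, here is a minimal sketch of the three scorer shapes. The class and method names are illustrative only and are not the exact interfaces of the custom scorers project introduced later in this post:

# Illustrative shapes of the three scorer categories; the class and method
# names here are hypothetical.

class DocumentViewsScorer:
    """DocumentScorer: input is a single solr document."""

    def score(self, document):
        # e.g. read a numeric field such as the document's view count
        return float(document.get("views", 0))


class IsQueryOnTopicScorer:
    """QueryScorer: input is the solr query parameters."""

    def score(self, query_params):
        # e.g. decide whether the query text is on topic for the domain
        query_text = query_params.get("q", "").lower()
        return 1.0 if "appendicitis" in query_text else 0.0


class DefinitionAlignmentScorer:
    """QueryDocumentScorer: input is a (query, document) pair."""

    def score(self, query_params, document):
        # e.g. does a definition-style question line up with a
        # definition-style document?
        asks_definition = query_params.get("q", "").lower().startswith("what is")
        looks_like_definition = " is " in document.get("text", "")
        return 1.0 if asks_definition and looks_like_definition else 0.0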

The features that are native to Retrieve are query-document scorers as they compare lexical overlap between a query and document. We are going to build a custom query-document scorer in this blog.

Lexical Answer Type

In the Jeopardy! system, the first stage in the question answering pipeline was called question analysis. One of the key tasks in question analysis was to identify the lexical answer type (LAT), which refers to the term in the question that indicates what entity type should be returned as the answer. For example, the LAT in “What is the capital of California?” is capital, and the system should further generalize that to capital city so that it knows to return a city as its answer.

In factoid questions there is generally a clear LAT, and that LAT corresponds to a clear entity type. In passage question answering, the LAT is often more abstract. In our NIDDK use case, imagine a user asks “How do I know if I have Appendicitis?” The LAT is technically how, but that how probably refers to a set of symptoms that would lead to a diagnosis. So in order to identify the LAT, we need a tool that can understand the abstract intent behind an end-user question. Fortunately, we have just such a tool in the Watson Developer Cloud.

Natural Language Classifier

From WDC Documentation:

The Natural Language Classifier (NLC) service can help your application understand the language of short texts and make predictions about how to handle them. The service supports English, Arabic, French, Italian, Japanese, Portuguese, and Spanish. A classifier learns from your example data and then can return information for texts that it is not trained on. The service employs a new set of technologies known as “deep learning” to get the best performance possible. Deep learning is a relatively recent set of approaches with similarities to the way the human brain works. Deep learning algorithms offer state of the art approaches in image and speech recognition, and the Natural Language Classifier now applies the technologies to text classification.

In our case, we will train NLC to classify the LAT as represented by the intent of an end-user question for the most common question types we would anticipate in the application. The question types we’ll use are:

  • Definition — What is Appendicitis?
  • Cause — What causes Appendicitis?
  • Symptom — What are the symptoms of Appendicitis?
  • Complications — What are complications associated with Appendicitis?
  • Diagnosis — How is Appendicitis diagnosed?
  • Treatment — How is Appendicitis treated?
  • Prevention — How do I prevent Appendicitis?

The key to training a classifier is to provide examples of variations in how a question type could be asked. For example, “What are the symptoms of Appendicitis?” and “How do I know if I have Appendicitis?” will both map to our symptom class. Below is our classifier training data, which is available on GitHub as nlc_train.csv.

How does * develop?,condition_cause
What causes *?,condition_cause
what is the cause of *?,condition_cause
How does * develop,condition_cause
What causes *,condition_cause
How to prevent *?,condition_cause
what is the cause of *,condition_cause
*?,condition_definition
What is *?,condition_definition
explain *,condition_definition
more detail on *,condition_definition
whats *?,condition_definition
what are *,condition_definition
what is *?,condition_definition
what are the symptoms of *?,condition_symptom
What are the symptoms of a *,condition_symptom
what are the signs of *?,condition_symptom
signs that I have *?,condition_symptom
How do I know if I have *?,condition_symptom
symptoms for *,condition_symptom
are * a sign that I have *?,condition_symptom
what are the complications of *?,condition_complications
complications of *,condition_complications
what complications arise from *?,condition_complications
* complications,condition_complications
what issues come from *?,condition_complications
How is * diagnosed?,condition_diagnosis
* diagnosis,condition_diagnosis
How would I know if I have *?,condition_diagnosis
Tests for *?,condition_diagnosis
Diagnose *,condition_diagnosis
What are treatment options for *?,condition_treatment
* treatment,condition_treatment
is there a cure for *?,condition_treatment
cure *,condition_treatment
how to treat *,condition_treatment
treatment for *,condition_treatment
effective ways to heal *,condition_treatment
How can I avoid *?,condition_prevention
How to prevent *,condition_prevention
What are ways to prevent *?,condition_prevention
* prevention,condition_prevention
what are ways not to get *?,condition_prevention

The * is meant as a wildcard to replace the condition entity. The condition is not relevant for intent classification and should be dealt with separately. In our case we will rely on solr to handle entity matching in the search phase. You could just remove the entity completely instead of inserting a *.
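If you would rather derive this training file from real user questions than hand-write wildcards, a small preprocessing step can do the substitution. This is a hypothetical sketch that assumes you have a list of condition names to replace:

import re

# Hypothetical list of condition entities that appear in user questions
CONDITIONS = ["appendicitis", "proctitis", "diverticulitis"]

def generalize_question(question):
    """Replace any known condition mention with * so NLC learns the intent,
    not the specific condition."""
    pattern = re.compile("|".join(CONDITIONS), flags=re.IGNORECASE)
    return pattern.sub("*", question)

print(generalize_question("What are the symptoms of Appendicitis?"))
# -> What are the symptoms of *?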

Create a Natural Language Classifier service in Bluemix and issue the following command to train:

curl -u "username":"password" -F training_data=@nlc_train.csv -F training_metadata="{\"language\":\"en\",\"name\":\"niddk_classifier\"}" "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers"

Capture the classifier ID and issue the following command to validate that training is complete (with your classifier ID in place of 9a8879x44-nlc-969):

curl -u "username":"password" "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/9a8879x44-nlc-969"
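If you prefer to poll from code rather than re-run curl by hand, something like the following sketch (using the requests library, with placeholder credentials and the example classifier ID) waits until the reported status leaves Training:

import time
import requests

STATUS_URL = ("https://gateway.watsonplatform.net/natural-language-classifier"
              "/api/v1/classifiers/9a8879x44-nlc-969")

def wait_until_trained(username, password, poll_seconds=30):
    """Poll the classifier status endpoint until training finishes."""
    while True:
        status = requests.get(STATUS_URL, auth=(username, password)).json()["status"]
        print("Classifier status:", status)
        if status != "Training":
            return status  # "Available" on success, "Failed" otherwise
        time.sleep(poll_seconds)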

Once trained, use the following command to test your classifier. You can change the text to see how various questions are classified.

curl -G -u "username":"password" "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/9a8879x44-nlc-969/classify?text=What%20is%20Proctitis"
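The same test can also be issued from application code; here is a short sketch with the requests library (placeholder credentials and the example classifier ID), which URL-encodes the text parameter for you:

import requests

CLASSIFY_URL = ("https://gateway.watsonplatform.net/natural-language-classifier"
                "/api/v1/classifiers/9a8879x44-nlc-969/classify")

resp = requests.get(CLASSIFY_URL,
                    params={"text": "What is Proctitis"},  # URL-encoded for you
                    auth=("username", "password"))
print(resp.json()["top_class"])  # expect condition_definition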

LAT Custom Feature Logic

In our query-document scorer, we need to compare the LAT identified by NLC to the document returned by solr so that we can return a definition for a definition question, symptoms for a symptom question, and so on. We could use NLC to classify sections of documents as well, but in our case we can use the structure of the documents to simplify this process. Each document in our NIDDK dataset has section titles like “What is *”, “What causes *”, “What are the symptoms of *”, and so on. If you recall from Part 1, while segmenting and formatting our documents for solr we added basic logic that tagged each section based on these titles. As a result, we have a field in solr called doc_type that maps to each of our LAT types. So our feature scorer logic will look like the following:

  1. Classify the intent of a user question and capture the confidences for each of our 7 classes. For example, for the question “What are the symptoms of Appendicitis?”:

curl -G -u "username":"password" "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/9a8879x44-nlc-969/classify?text=What%20are%20the%20symptoms%20of%20Appendicitis"

{
  "classifier_id" : "9a8879x44-nlc-969",
  "url" : "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/9a8879x44-nlc-969",
  "text" : "What are the symptoms of Appendicitis",
  "top_class" : "condition_symptom",
  "classes" : [
    { "class_name" : "condition_symptom", "confidence" : 0.9526760239113281 },
    { "class_name" : "condition_complications", "confidence" : 0.01749617382185962 },
    { "class_name" : "condition_cause", "confidence" : 0.011224721995999037 },
    { "class_name" : "condition_diagnosis", "confidence" : 0.006870593080181918 },
    { "class_name" : "condition_prevention", "confidence" : 0.004564829920859066 },
    { "class_name" : "condition_definition", "confidence" : 0.0036075458415429787 },
    { "class_name" : "condition_treatment", "confidence" : 0.0035601114282294262 }
  ]
}

2. For each document returned as a potential answer by solr, assign the NLC confidence for the class that matches the doc_type of that document as the score for our custom feature. For example:

Doc 1:
{
  "id": "a9e69b96-099e-4a02-b1ae-96a0956c484b",
  "source": "Appendicitis",
  "doc_type": "symptom",
  "topic": "What are the symptoms of appendicitis?",
  "text_description": "The symptoms of appendicitis are typically easy for a health care provider to diagnose. The most common symptom of appendicitis is abdominal pain. Abdominal pain ..."
}

Doc 2:
{
  "id": "b5eae497-17df-4510-8c07-12d8e18bd6bc",
  "source": "Appendicitis",
  "doc_type": "definition",
  "topic": "What is appendicitis?",
  "text_description": "Appendicitis is inflammation of the appendix. Appendicitis is the leading cause of emergency abdominal operations. Spirt MJ. Complicated intra-abdominal infections: a focus on appendicitis and diverticulitis. Postgraduate Medicine. 2010;122(1):39–51."
}

Doc 1 LAT Feature Score = 0.9526760239113281
Doc 2 LAT Feature Score = 0.0036075458415429787
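Put together, the core of the feature logic is a lookup: take the NLC confidence for the class that corresponds to the document's doc_type. Here is a minimal sketch; the mapping between doc_type values and NLC class names is assumed from the examples above:

# Assumed mapping between solr doc_type values and NLC class names,
# based on the examples in this post.
DOC_TYPE_TO_CLASS = {
    "definition": "condition_definition",
    "cause": "condition_cause",
    "symptom": "condition_symptom",
    "complications": "condition_complications",
    "diagnosis": "condition_diagnosis",
    "treatment": "condition_treatment",
    "prevention": "condition_prevention",
}

def lat_feature_score(nlc_response, document):
    """Return the NLC confidence for the class matching the document's doc_type."""
    confidences = {c["class_name"]: c["confidence"] for c in nlc_response["classes"]}
    nlc_class = DOC_TYPE_TO_CLASS.get(document.get("doc_type", ""))
    return confidences.get(nlc_class, 0.0)

# With the example response above, the symptom document scores ~0.95
# and the definition document scores ~0.004.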

Custom Feature Proxy

To use custom features with R&R, we need to effectively split the Retrieve and Rank services. We will use a proxy to handle an incoming query, collect the documents along with their Retrieve scores, collect the NLC confidence scores, combine the scores, and send them to the ranker. Figure 3 shows how this works conceptually.

Fig. 3 R&R Custom Feature Proxy Architecture
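In code terms, the per-query flow of the proxy looks roughly like the sketch below. The client objects and method names are hypothetical stand-ins, not the actual proxy implementation described next:

def handle_query(query_params, retrieve_client, custom_scorers, ranker_client):
    """Conceptual per-query flow of the custom feature proxy.

    1. Query Retrieve to get candidate documents plus native feature scores.
    2. Append one score per custom scorer to each document's feature vector.
    3. Send the combined feature vectors to the ranker and return its ordering.
    """
    docs = retrieve_client.fcselect(query_params)      # documents + native scores
    for doc in docs:
        for scorer in custom_scorers:                  # e.g. our LAT scorer
            doc["feature_vector"].append(scorer.score(query_params, doc))
    return ranker_client.rank(docs)                    # re-ranked document list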

There are two GitHub projects that we will use to build this proxy:

  1. Proxy server
  2. Custom feature scorer builder

The custom scorers project lets you build custom logic within one of the three feature templates discussed above (document, query, query-document), or leverage a prebuilt feature, and it generates a Python wheel file that the proxy server loads. The proxy is a Python web server built with Flask.

Build Custom Feature

  1. Clone the custom scorers project
  2. Navigate to /rr_scorers/query_document

There is a prebuilt scorer called nlc_intent_scorer.py, which is what we will use. If you want to build your own custom scorer, you can start from the templates: document_scorer.py, query_scorer.py, and query_document_scorer.py.

3. Replace nlc_intent_scorer.py with the version in the NIDDK GitHub repository. The only change from the version in the custom scorers project is to replace “title” with “doc_type”, since doc_type is the name of the document field we care about.
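Conceptually the edit is tiny: wherever the scorer reads the section label off the document, it should read doc_type rather than title. The snippet below only illustrates that idea; the real code in nlc_intent_scorer.py is organized differently:

def section_label(document):
    """Field the scorer compares against the NLC class for this document.

    The stock scorer reads the "title" field; for the NIDDK collection the
    tagged section type lives in "doc_type". Illustrative sketch only.
    """
    return document.get("doc_type")  # was: document.get("title")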

4. Generate the wheel. In the root directory of the custom scorers project, run:

pip wheel .

The project will generate a file that looks like rr_scorers-1.0-py2-none-any.whl

Configure R&R Proxy

  1. Clone the custom scorer proxy project
  2. Create logs and answers directories (this will be updated in the project)
mkdir answers
mkdir logs

3. Install custom scorer proxy

./bin/install.sh

4. Edit the service.cfg file in the config directory with your R&R configuration information:

SOLR_CLUSTER_ID=
SOLR_COLLECTION_NAME=niddk_collection
RETRIEVE_AND_RANK_BASE_URL=https://gateway.watsonplatform.net/retrieve-and-rank/api
RETRIEVE_AND_RANK_PASSWORD=
RETRIEVE_AND_RANK_USERNAME=

5. Edit the sample_features.json file in the config directory with your NLC configuration information. This is the file that defines which scorers in the wheel file the proxy server will execute at runtime:

{
  "scorers": [
    {
      "init_args": {
        "name": "DocQueryIntentScorer",
        "short_name": "qd1",
        "description": "Score based on Document/Query alignment",
        "service_url": "https://gateway.watsonplatform.net/natural-language-classifier/api",
        "service_username": "username",
        "service_password": "password",
        "classifier_id": "9a8879x44-nlc-969"
      },
      "type": "query_document",
      "module": "nlc_intent_scorer",
      "class": "QuestionDocumentIntentAlignmentScorer"
    }
  ]
}

6. Run the Proxy Server

./run.sh ./config/service.cfg ./config/sample_features.json

Train a ranker through the proxy

Training the ranker follows the same basic process as in Part 2, except we point the training script at the proxy instead of directly at R&R. There is a file trainproxy.py on GitHub that has this set up. Note that the field names for the fl parameter are hardcoded on lines 83 and 95, so if you are adapting this project you'll need to change those. The feature depends on the doc_type field, so we need to ensure it is returned by Retrieve.

python trainproxy.py -u username:password -i gt_train.csv -c sc3689b816_2b07_4548_96a9_a9e52a063bf1 -x niddk_collection -r 30 -n lat_ranker -d -v

The script will generate a file called trainingdata.txt with a new column called qd1 that contains our LAT feature score. A new ranker will be created and trained. Capture the ranker ID.

Call the Proxy server at runtime

Your application will call the proxy directly, and the proxy will handle the call to R&R. That can be done at the following endpoint:

http://localhost:9216/fcselect?ranker_id=868fedx13-rank-429&q=How%20do%20I%20know%20if%20I%20have%20Chrons?&wt=json&fl=id,topic,text_description,doc_type
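From application code the call is an ordinary HTTP GET against the proxy. Here is a sketch with requests, assuming the proxy returns the usual Solr-style response body:

import requests

PROXY_URL = "http://localhost:9216/fcselect"

params = {
    "ranker_id": "868fedx13-rank-429",
    "q": "How do I know if I have Chrons?",
    "wt": "json",
    "fl": "id,topic,text_description,doc_type",
}

resp = requests.get(PROXY_URL, params=params)
resp.raise_for_status()

# Assuming a Solr-style body: response -> docs, ordered by the ranker
for doc in resp.json()["response"]["docs"]:
    print(doc.get("doc_type"), "-", doc.get("topic"))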

Running the same set of experiments as we did in Part 2 produces the following results.

Fig. 4 Relevance@N

Conclusion

Custom features are a powerful way to provide additional relevance signals to the ranker. There is a set of open-source tools that enables the creation of these features, and the process will become simpler in future versions of R&R.
