class: center, middle

![Solr Logo](solr_logo_rgb.png)

# Solr & Content Recommendation

Toby Cole, Technical Architect, Semantico

(press 'p' to show notes & demo guide)

???

To work along with the examples, [download Solr](http://lucene.apache.org/solr/downloads.html) and run `java -jar start.jar` in the `example` directory. Solr should be running on [port 8983](http://localhost:8983).

---

# What am I talking about?

* What are Lucene & Solr?
* Tokenization, terms and term-filters.
* Relevance - tf-idf algorithm
* Solr's API
* SolrJ - Java API
* More Like This

---

# What are Solr & Lucene?

Both open source Apache projects.

## Lucene

* Java full-text search library
* Started in '99
* Documents, fields, text.

## Solr

* REST API on top of Lucene
* Adds typing & schemas
* Nice client library, SolrJ

---

# Tokenization & terms

Given a sentence:

`An ASCII representation of Miley Cyrus' twerking face ;P`

tokenization is splitting into indexable chunks

`An` `ASCII` `representation` `of` `Miley` `Cyrus` `twerking` `face` `P`

term-filters can be used to modify these terms before they're indexed

lowercasing:

`an` `ascii` `representation` `of` `miley` `cyrus` `twerking` `face` `p`

stemming:

`an` `ascii` `represent` `of` `milei` `cyru` `twerk` `face` `p`

???

To try this for yourself have a look at the [*Analysis* section in Solr's admin](http://localhost:8983/solr/#/collection1/analysis?analysis.fieldvalue=An%20ASCII%20representation%20of%20Miley%20Cyrus%27%20twerking%20face%20%3BP&analysis.query=An%20ASCII%20representation%20of%20Miley%20Cyrus%27%20twerking%20face%20%3BP&analysis.fieldtype=text_en&verbose_output=0).
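
If you don't have Solr running, the tokenization and lowercasing steps can be roughly approximated in plain Java. This is only a sketch: the class and method names (`TokenizeSketch`, `tokenize`, `lowercase`) are made up for illustration, and the regex split is an assumption standing in for Solr's real tokenizers, which handle far more cases. Stemming is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only, not part of Solr or Lucene.
public class TokenizeSketch {

    // Split on runs of anything that is not a letter or digit,
    // which drops punctuation such as the apostrophe and ";".
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.split("[^\\p{L}\\p{N}]+")) {
            if (!chunk.isEmpty()) { // a leading separator yields one empty chunk
                tokens.add(chunk);
            }
        }
        return tokens;
    }

    // A term-filter: takes the token list and returns a modified copy.
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "An ASCII representation of Miley Cyrus' twerking face ;P";
        List<String> tokens = tokenize(sentence);
        System.out.println(tokens);
        System.out.println(lowercase(tokens));
    }
}
```

Running `main` prints the same two token lists as the slide. Chaining filters as plain functions over a token list mirrors how the analysis chain applies them one after another.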
---

# tf-idf

See [Lucene's TFIDF Javadoc](http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html) for a good description

## Term Frequency

Many occurrences in one document == **relevant to document**

## Inverse Document Frequency

* Rare terms == **relevant**
* Common terms == **not relevant**

## Field length

* Shorter == **better**

(This is a massive oversimplification, deal with it.)

---

# Solr's API - Input

Various formats available:

* XML
* JSON
* CSV
* Streaming binary content + Tika (Doc, PDFs etc)
* Database Import

---

# Solr's API - Input XML

Over HTTP - `POST /update`

```xml
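<add>
  <doc>
    <field name="employeeId">05991</field>
    <field name="office">Bridgewater</field>
    <field name="skills">Perl</field>
    <field name="skills">Java</field>
  </doc>
  [
  <doc>…</doc>
  <doc>…</doc>
  …
  ]
</add>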
```

more details - [Solr Update docs](http://wiki.apache.org/solr/UpdateXmlMessages)

???

To add the cheeses data to your Solr, [download the data](cheeses.csv) and run this `curl` command:

    curl http://localhost:8983/solr/update/csv --data-binary @cheeses.csv -H 'Content-type:text/plain; charset=utf-8'

If you're running Windows, you'll have to [install cygwin](http://www.cygwin.com/).

This example uses the [CSV input](http://wiki.apache.org/solr/UpdateCSV) format, which is handy for simple datasets.

---

# Solr's API - Output XML

Let's have a look

* [Hams](http://localhost:8081/solr/select?q=*:*)
* [Cheeses](http://localhost:8082/solr/select?q=*:*)

(or in JSON)

* [Hams](http://localhost:8081/solr/select?q=*:*&wt=json)
* [Cheeses](http://localhost:8082/solr/select?q=*:*&wt=json)

???

If you've loaded the data yourself, have a look at the [XML output](http://localhost:8983/solr/select?q=*:*) or the [JSON output](http://localhost:8983/solr/select?q=*:*&wt=json) locally.

Have a play with the 'q' parameter, it's the search query. The syntax is defined in [Lucene's javadoc](http://lucene.apache.org/core/4_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description) with modifications documented on the [Solr Wiki](http://wiki.apache.org/solr/SolrQuerySyntax).

---

# SolrJ

* Nice Java API for accessing Solr.
* Can use embedded Solr (no network overhead, but no distributed search).
* Annotated POJOs for indexing and retrieval

```java
import org.apache.solr.client.solrj.beans.Field;

public class Cheese {
    @Field public String id;
    @Field public String name;
    @Field public String description;
}
```

???

The SolrJ jar is included in the `dist` folder of the Solr download. Alternatively you could use the maven artifact:

    <dependency>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-solrj</artifactId>
      <version>4.4.0</version>
    </dependency>

---

# SolrJ indexing

```java
// SolrServer is in org.apache.solr.client.solrj, HttpSolrServer in org.apache.solr.client.solrj.impl
SolrServer solr = new HttpSolrServer("http://localhost:8081/solr");
List<Cheese> cheeses = new ArrayList<Cheese>();
// add cheeses to the list here
solr.addBeans(cheeses);
solr.commit(); // make the new documents visible to searches
```

---

# SolrJ querying

```java
SolrServer solr = new HttpSolrServer("http://localhost:8081/solr");
SolrQuery query = new SolrQuery("description:stringy");
QueryResponse rsp = solr.query(query);
// cheesy beans: map each result document back onto the annotated POJO
List<Cheese> beans = rsp.getBeans(Cheese.class);
```

---

# Solr MLT - 'More like this'

* Analyses a field within a document to find interesting terms
  * Remember *Inverse Document Frequency*
  * 'interesting' == rare throughout corpus of docs, common within doc
* Searches index for documents with those terms
* Optionally applies boosts (expensive)

???

Couldn't get the MLT example working with the small cheeses dataset :(

Check out the [MLT docs instead](http://wiki.apache.org/solr/MoreLikeThis)

---

class: center, middle, big

# Questions?

![milei cyru twerk face](cyru.gif)

---

class: center, middle

# Cheers!

[@tubfun](https://twitter.com/tubfun)

[@semantico](https://twitter.com/semantico)

Slides up on labs.semantico.com tomorrow

(unless you're reading them tomorrow, in which case they're up now)