Intro to Solr & Recommendation

class: center, middle

![Solr Logo](solr_logo_rgb.png)

# Solr & Content Recommendation

Toby Cole, Technical Architect, Semantico

(press 'p' to show notes & demo guide)

???

To work along with the examples, [download Solr](http://lucene.apache.org/solr/downloads.html) and run `java -jar start.jar` from the `example` directory.

Solr should then be running on [port 8983](http://localhost:8983).

---

# What am I talking about?

* What are Lucene & Solr?
* Tokenization, terms and term-filters.
* Relevance - tf-idf algorithm
* Solr's API
* SolrJ - Java API
* More Like This
---

# What are Solr & Lucene?

Both open source Apache projects.

## Lucene
* Java full-text search library
* Started in '99
* Documents, fields, text.

## Solr
* REST API on top of Lucene
* Adds typing & schemas
* Nice client library, SolrJ

---

# Tokenization & terms
Given a sentence:

`An ASCII representation of Miley Cyrus' twerking face ;P`

tokenization is splitting into indexable chunks

`An` `ASCII` `representation` `of` `Miley` `Cyrus` `twerking` `face` `P`

term-filters can be used to modify these terms before they're indexed

lowercasing:

`an` `ascii` `representation` `of` `miley` `cyrus` `twerking` `face` `p`

stemming:

`an` `ascii` `represent` `of` `milei` `cyru` `twerk` `face` `p`

???

To try this for yourself have a look at the [*Analysis* section in Solr's admin](http://localhost:8983/solr/#/collection1/analysis?analysis.fieldvalue=An%20ASCII%20representation%20of%20Miley%20Cyrus%27%20twerking%20face%20%3BP&analysis.query=An%20ASCII%20representation%20of%20Miley%20Cyrus%27%20twerking%20face%20%3BP&analysis.fieldtype=text_en&verbose_output=0).
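
If you don't have Solr to hand, the first two steps can be roughed out in plain Java. This is only a sketch: the class and method names are made up for illustration, the splitting rule (break on anything that isn't a letter or digit) is a simplification of what Solr's tokenizers really do, and stemming is left out because a real stemmer is too long for a note.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Solr or Lucene.
class AnalysisSketch {

    /** Splits text on anything that is not a letter or digit, dropping empty chunks. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String chunk : text.split("[^\\p{L}\\p{N}]+")) {
            if (!chunk.isEmpty()) {
                tokens.add(chunk);
            }
        }
        return tokens;
    }

    /** A term filter: returns a lowercased copy of every token. */
    static List<String> lowercase(List<String> tokens) {
        List<String> lowered = new ArrayList<String>();
        for (String token : tokens) {
            lowered.add(token.toLowerCase(Locale.ROOT));
        }
        return lowered;
    }

    public static void main(String[] args) {
        List<String> tokens =
            tokenize("An ASCII representation of Miley Cyrus' twerking face ;P");
        System.out.println(tokens);            // tokens as split
        System.out.println(lowercase(tokens)); // tokens after the lowercase filter
    }
}
```

Chaining filters like this mirrors how an analysis chain is put together: one tokenizer, then any number of term filters.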

---

# tf-idf

See [Lucene's TFIDF Javadoc](http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html) for a good description.

## Term Frequency

Many occurrences in one document == **relevant to document**

## Inverse Document Frequency

* Rare terms == **relevant**
* Common terms == **not relevant**

## Field length

* Shorter == **better**

(This is a massive oversimplification, deal with it.)
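
???

To put rough numbers on the three ideas above, here is a small Java sketch. The formulas are my reading of the default ones described in the Javadoc linked on the slide (square root of the term frequency, a log-based idf, one over the square root of the field length). The class and method names are invented, and it ignores query normalisation, coordination factors, boosts and the lossy way Lucene stores norms, so treat the scores as illustrative only.

```java
// Hypothetical sketch, not Lucene's actual classes.
class TfIdfSketch {

    /** Term frequency: more occurrences help, but with diminishing returns. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: terms found in few documents score higher. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Field length: a match in a shorter field counts for more. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Simplified score of one query term against one field of one document. */
    static double score(int freq, int docFreq, int numDocs, int numTerms) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // Same term frequency and field length, only the rarity of the term differs.
        System.out.println("rare term:   " + score(1, 1, numDocs, 10));
        System.out.println("common term: " + score(1, 900, numDocs, 10));
    }
}
```

The rare term should come out with the higher score, which is the *Rare terms == relevant* bullet in action.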

---
# Solr's API - Input
Various formats available:

* XML
* JSON
* CSV
* Streaming binary content + Tika (Doc, PDFs etc)
* Database Import

---
# Solr's API - Input XML

Over HTTP - 
`POST /update`

```xml
<add>
  <doc>
    <field name="employeeId">05991</field>
    <field name="office">Bridgewater</field>
    <field name="skills">Perl</field>
    <field name="skills">Java</field>
  </doc>
  [<doc> … </doc>
  …
  <doc> … </doc>]
</add>
```

more details - [Solr Update docs](http://wiki.apache.org/solr/UpdateXmlMessages)

???

To add the cheeses data to your Solr, [download the data](cheeses.csv) and run this `curl` command:

    curl http://localhost:8983/solr/update/csv --data-binary @cheeses.csv -H 'Content-type:text/plain; charset=utf-8'

If you're running Windows, you'll have to [install Cygwin](http://www.cygwin.com/).

This example uses the [CSV input](http://wiki.apache.org/solr/UpdateCSV) format, which is handy for simple datasets.
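
If you'd rather post the XML format from the slide without `curl`, something like the following plain-Java sketch should do it. The class and method names are made up, the URL assumes the local example server from these notes, and the field names are the ones from the slide, so they would need to exist in your schema before Solr accepts the document.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class UpdateXmlSketch {

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Builds an <add> message holding one document; each entry is {name, value}. */
    static String addMessage(String[][] fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (String[] field : fields) {
            xml.append("<field name=\"").append(escape(field[0])).append("\">")
               .append(escape(field[1])).append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    /** POSTs the message to an update URL and returns the HTTP status code. */
    static int post(String updateUrl, String xml) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(updateUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws IOException {
        // A multi-valued field is just the same field name repeated.
        String xml = addMessage(new String[][] {
            {"employeeId", "05991"}, {"office", "Bridgewater"},
            {"skills", "Perl"}, {"skills", "Java"}});
        System.out.println(xml);
        // Assumed URL; commit=true asks Solr to make the document searchable straight away.
        System.out.println(post("http://localhost:8983/solr/update?commit=true", xml));
    }
}
```

In practice SolrJ (later in the deck) does all of this for you; the point here is only that the update API is plain HTTP.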

---
# Solr's API - Output XML

Let's have a look

* [Hams](http://localhost:8081/solr/select?q=*:*)
* [Cheeses](http://localhost:8082/solr/select?q=*:*)

(or in JSON)

* [Hams](http://localhost:8081/solr/select?q=*:*&wt=json)
* [Cheeses](http://localhost:8082/solr/select?q=*:*&wt=json)

???

If you've loaded the data yourself, have a look at the [XML output](http://localhost:8983/solr/select?q=*:*) or the [JSON output](http://localhost:8983/solr/select?q=*:*&wt=json) locally.

Have a play with the `q` parameter; it's the search query. The syntax is defined in [Lucene's Javadoc](http://lucene.apache.org/core/4_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description), with modifications documented on the [Solr Wiki](http://wiki.apache.org/solr/SolrQuerySyntax).

---
# SolrJ

* Nice Java API for accessing Solr.
* Can use embedded Solr (no network overhead, but no distributed search).
* Annotated POJOs for indexing and retrieval

```java
import org.apache.solr.client.solrj.beans.Field;

public class Cheese {
    
    @Field
    public String id;
    @Field
    public String name;
    @Field
    public String description;

}
```

???

The SolrJ jar is included in the `dist` folder of the Solr download.

Alternatively you could use the Maven artifact:

    <dependency>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-solrj</artifactId>
      <version>4.4.0</version>
    </dependency>

---

# SolrJ indexing
```java
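  // connect to the Solr instance holding the cheeses
  SolrServer server = new HttpSolrServer("http://localhost:8081/solr");

  List<Cheese> cheeses = new ArrayList<Cheese>();
  // add cheeses to the list here
  server.addBeans(cheeses);
  // commit so the new documents become searchable
  server.commit();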
```

---

# SolrJ querying

```java
  SolrServer server = new HttpSolrServer("http://localhost:8081/solr");

  // find cheeses whose description matches "stringy"
  SolrQuery query = new SolrQuery("description:stringy");
  QueryResponse rsp = server.query(query);
  // cheesy beans: each result document mapped onto a Cheese
  List<Cheese> beans = rsp.getBeans(Cheese.class);
```
---
# Solr MLT - 'More like this'

* Analyses a field within a document to find interesting terms
  * Remember *Inverse Document Frequency*
  * 'interesting' == rare throughout corpus of docs, common within doc
* Searches index for documents with those terms
* Optionally applies boosts (expensive)

???

Couldn't get the MLT example working with the small cheeses dataset :(

Check out the [MLT docs instead](http://wiki.apache.org/solr/MoreLikeThis)
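
For anyone who wants to experiment anyway, this Java sketch builds an MLT request URL against the normal `/select` handler. The `mlt`, `mlt.fl`, `mlt.mintf` and `mlt.mindf` parameters come from the MLT docs; the class name, the base URL and the `id:cheddar` query are assumptions for illustration. As far as I know the defaults are `mlt.mintf=2` and `mlt.mindf=5`, which might be part of why a tiny dataset returns nothing, but that's a guess rather than something I've confirmed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration only.
class MltUrlSketch {

    /** URL-encodes a parameter value as UTF-8. */
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /**
     * Builds a select URL that finds the source document with docQuery
     * and asks Solr for similar documents based on the given field.
     */
    static String mltUrl(String solrBase, String docQuery, String field,
                         int minTermFreq, int minDocFreq) {
        return solrBase + "/select"
            + "?q=" + enc(docQuery)
            + "&mlt=true"
            + "&mlt.fl=" + enc(field)
            + "&mlt.mintf=" + minTermFreq   // min occurrences within the source doc
            + "&mlt.mindf=" + minDocFreq;   // min number of docs containing the term
    }

    public static void main(String[] args) {
        // Lowered thresholds, which may suit a small dataset better than the defaults.
        System.out.println(
            mltUrl("http://localhost:8983/solr", "id:cheddar", "description", 1, 1));
    }
}
```

Paste the printed URL into a browser to see the `moreLikeThis` section of the response, if Solr finds any interesting terms.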

---
class: center, middle, big

# Questions?

![milei cyru twerk face](cyru.gif)

---
class: center, middle

# Cheers!

[@tubfun](https://twitter.com/tubfun)

[@semantico](https://twitter.com/semantico)

Slides up on labs.semantico.com tomorrow
(unless you're reading them tomorrow, in which case they're up now)