User Based Personalization Engine with Solr


You have a varied collection of media you want to personalize. It could be links, websites, friends, animals, recipes, videos, etc. The content has meta-data attributes that aren’t very clean: brand names are sometimes abbreviated, categories are pluralized, casing varies, etc. Media also has destination attributes. Maybe it’s only applicable to Florida residents, or to people who live within a 30-mile radius of Tampa. Or, in a more complex example, to people within a predefined market zone (a custom geo polygon representing a sales territory or market). Media can also come in different formats: web, email, video, kiosk, etc.


You need to personalize this content given the following dimensions:

  • The meta-data about the media such as brand, flavor, category, feature, price, value, gtin, etc
  • The current market information such as clicks, impressions, purchases, views, avg spend, etc
  • Standard algorithms such as frequency, recency, and segmentation based on an individual user


  • As a personalization engine I want to be able to get all media applicable to a particular user, ranked by its score.
  • As a personalization engine I want to be able to return relevant media that takes into account endpoint information such as lat/long, banner, etc., so that I don’t recommend media that doesn’t fit the channel, store, or sales objective.

Custom Implementation

Say you built this by hand. What would you need at a minimum?

  • Code/Db/Schema to store documents representing media
  • Code for APIs around fetching/filtering/pagination of those documents
  • Code to import those documents
  • Scale-out of the scoring computation
  • Media filter semantics (for example, score documents with the word ‘RED BULL’ in it)
  • Auto-suggest/did-you-mean (useful to show related media or near duplicates)
  • Built in geo/location filter semantics (for example, score documents in a 5km radius of Tampa)
  • Flexible document model to allow scoring many different types of media with varying quality of meta-data
  • Ability to score based upon any number of analytic models both on media and user

All of these things can be done, and for many that might be a fun way to learn and grow. For some companies it might even be the best approach, all things considered. Today, however, we won’t talk about building a custom personalization engine. Instead, we’ll explore what it would look like if you leveraged Solr for this.

Personalization of Media Using Solr

Today we’ll take a stab at user-based personalization in Solr. Why? Because it solves most of the above, has built-in cloud scale, has already been used this way by other people, offers a mature API used by large companies, has baked-in administration functions, and more. So how do we get started? First, some references and blogs about what we are trying to do.


Solr Building Blocks

Media Ingestion

Storing media is already covered by Solr, courtesy of its built-in support for XML, CSV, JSON, JDBC, and a vast array of other formats. For ease, we can just post documents to Solr or pull them in through a JDBC endpoint. Connecting Solr with Hadoop is easy as well, courtesy of the Hive JDBC driver, so regardless of where the media lives, it can be pushed or pulled with ease.
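As a rough illustration of the "just post documents" path, here is a minimal Java sketch that builds a JSON update request for a single media document. It assumes Java 11+, a core named "media" (inferred from the /media/select path used in this post), a local Solr at http://localhost:8983/solr, and an update handler that accepts a JSON array of documents; the field values are made up. The request is only built, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class MediaIngest {
    // Builds (but does not send) a JSON update request for the "media" core.
    // The base URL and core name are assumptions for this sketch.
    static HttpRequest updateRequest(String solrBase, String jsonDocs) {
        return HttpRequest.newBuilder(URI.create(solrBase + "/media/update?commit=true"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonDocs))
                .build();
    }

    public static void main(String[] args) {
        // One hypothetical media document, sent as a JSON array of documents.
        String docs = "[{\"id\":\"1\",\"category\":\"AUTO\",\"advertiser_id\":\"UTC12341234\"}]";
        HttpRequest request = updateRequest("http://localhost:8983/solr", docs);
        System.out.println(request.method() + " " + request.uri());
        // To send it for real:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Committing on every request is convenient for a demo but is probably not what you want for bulk loads.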

Basic Network Filtering

To filter media by basic things like “only media for this advertiser” we can just use out-of-the-box Solr queries. So if our media has “advertiser_id” as an attribute we can simply do “/media/select?q=advertiser_id:UTC12341234”. Solr is great at this. Further, if we want to get only the media for a particular site or network, we can tag the media accordingly and then slice and dice it by those tags. Typically these “filters” are synonymous with “business rules”, so we can let external parties pass us this information and avoid having to worry about the details or create custom APIs.
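A small sketch of how such business rules might be turned into a request. It matches all documents and adds one fq (filter query) parameter per rule, which narrows the result set without influencing the score. The "network" field and its value are hypothetical; only "advertiser_id" comes from the example above. Values are URL-encoded but not escaped for Solr's query syntax, so treat this as a starting point only.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class MediaFilterQuery {
    // Builds a /media/select query string: match everything, then narrow
    // the results with one fq parameter per field:value business rule.
    static String build(Map<String, String> filters) {
        StringJoiner params = new StringJoiner("&");
        params.add("q=" + encode("*:*"));
        for (Map.Entry<String, String> f : filters.entrySet()) {
            params.add("fq=" + encode(f.getKey() + ":" + f.getValue()));
        }
        return "/media/select?" + params;
    }

    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> filters = new LinkedHashMap<>();
        filters.put("advertiser_id", "UTC12341234");
        filters.put("network", "mobile"); // hypothetical field and value
        System.out.println(build(filters));
    }
}
```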

Geo/Location Filtering

Solr has a wealth of geo/location filtering abilities, from bounding boxes, to radius, to custom polygon shapes. Media that has attributes like lat/long can be searched for, and if a user is in a particular area we can find relevant deals within N km of their current location. Really powerful stuff when combined with market zones!
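For the radius case, Solr's geofilt query parser takes a spatial field (sfield), a center point (pt, as lat,lon) and a distance in kilometers (d). The sketch below only formats that filter; the "location" field name is an assumption about the media schema, and the coordinates are roughly those of Tampa.

```java
import java.util.Locale;

class GeoFilter {
    // Formats a geofilt filter: documents whose sfield lies within
    // radiusKm kilometers of the point (lat, lon).
    static String within(String field, double lat, double lon, double radiusKm) {
        return String.format(Locale.ROOT, "{!geofilt sfield=%s pt=%.4f,%.4f d=%.1f}",
                field, lat, lon, radiusKm);
    }

    public static void main(String[] args) {
        // "location" is a hypothetical lat/long field in the media index.
        System.out.println("fq=" + within("location", 27.9506, -82.4572, 5));
    }
}
```

The resulting value would still need URL-encoding before it goes on the wire, and it can be combined with any other fq filters on the same request.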

Media Management

Since all media ends up in Solr, we can use native search functionality to manage and monitor it. Faceted search to power top(N) media, insights into overlapping media and duplicates, and fuzzy matching let us see all the media at a glance and browse or pivot it however a business user needs. Out-of-the-box UX experiences can be used, or downloaded, to drive this (hue/solr).

Generic Relevancy Algorithms

Solr comes with some fairly nice relevancy features (see the Solr Relevancy FAQ). Note that it already has built-in functions for scoring relevancy on basic audience information like clicks/popularity. So you could probably stop here if you just wanted to, let’s say, score media by overall clicks in the past hour. In fact, LinkedIn and others use this, and there is a nice PowerPoint deck on Implementing Click Through Relevance Ranking.

Domain Specific Relevancy

So we are 99% there, but let’s say we need to tailor the scores and have finer, more mathematical control over scoring. We can do this by implementing domain-specific language concepts in Solr. It already has the plug-and-play semantics for this, so we can mash up a user’s preferences, behavior data, and segmentation information with each piece of media in real time to compute a score, or many scores. This is opened up to us by implementing Solr Function Queries. Solr already ships with many of them out of the box; the only missing piece is getting your user information mixed in with your media.

And because Solr has built-in support for this, we can filter, sort, and return these mathematical gems to build up an expressive library of domain-specific functions.

Example: Recency Function

Let’s start with a basic example: we want to compute the recency in days since the last click on a particular category. We need to be able to tell our function “who”, so it can look up the user’s information (from a database, API, etc.), and we also need to tell it “what” we want to score on.


In a call such as click_recency(USERID,'category'), with the result returned to us as “myvalue” (we aren’t sorting or filtering yet), “click_recency” is our custom function, “USERID” is the user identifier used to look up whatever information you have about the consumer’s category clicks, and “category” is the field name in the Solr media index to weight against.
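The request itself is not shown here, so the following is only a guess at its shape. It relies on Solr's fl parameter accepting function results as aliased pseudo-fields (alias:function(...)); the "id" field, the match-all query, and the user id "1234" are assumptions.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class RecencyRequest {
    // The pseudo-field: "myvalue" is the alias under which each document
    // returns the result of click_recency(userId,'field').
    static String functionField(String userId, String field) {
        return "myvalue:click_recency(" + userId + ",'" + field + "')";
    }

    // A select URL returning id, the scored field, and the computed value.
    static String build(String userId, String field) {
        String fl = "id," + field + "," + functionField(userId, field);
        return "/media/select?q=" + encode("*:*") + "&fl=" + encode(fl);
    }

    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(functionField("1234", "category"));
        System.out.println(build("1234", "category"));
    }
}
```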

Assume we have a document index as follows:

media #   |  category   |  .... |  .... | ...
1         |  AUTO       |  .... |  .... | ...
2         |  PET        |  .... |  .... | ...
3         |  ANIME      |  .... |  .... | ...

Assume we have access to an API that will return information about a particular user (maybe from a database, nosql, or some paid-provider like a DMP).

    {
        "id": "1234",
        "clicks": {
           "category": {
              "CLOTHES": 10,
              "SOFTWARE": 9,
              "PET": 9
           }
        }
    }

In our simple example, the user model contains only aggregates, nothing about when the clicks were made. Depending on the richness of your user model, though, you could certainly create computations that take into account frequency, recency, etc.
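To make the "richer user model" idea concrete, here is a hypothetical sketch in which the model stores the time of the last click per category rather than a count. The exponential half-life decay is my own choice of formula, not something this post prescribes: a click right now scores 1.0 and the score halves every halfLifeDays.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

class ClickRecency {
    // Converts "last click" timestamps per category into scores in (0, 1].
    // Timestamps in the future are treated as "just now".
    static Map<String, Double> scores(Map<String, Instant> lastClick, Instant now, double halfLifeDays) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Instant> e : lastClick.entrySet()) {
            double days = Duration.between(e.getValue(), now).toHours() / 24.0;
            out.put(e.getKey(), Math.pow(0.5, Math.max(days, 0) / halfLifeDays));
        }
        return out;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2015-01-08T00:00:00Z");
        Map<String, Instant> lastClick = new HashMap<>();
        lastClick.put("PET", Instant.parse("2015-01-01T00:00:00Z"));     // 7 days ago
        lastClick.put("CLOTHES", Instant.parse("2015-01-08T00:00:00Z")); // just now
        System.out.println(scores(lastClick, now, 7.0));
    }
}
```

A map like this could stand in for the per-category lookup that the function builds for each user.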


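import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.DocTermsIndexDocValues;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.search.FunctionQParser;
import org.apache.solr.search.SyntaxError;
import org.apache.solr.search.ValueSourceParser;

public class MyUserFunction extends ValueSourceParser {
    private MyUserService users;
    // called when our function is initialized so we can
    // configure it by telling it where external sources are
    // or maybe how much/long to cache, or the uid/pwd to access, etc...
    public void init(NamedList configuration) {
       // get your user information api endpoint here (key as registered in solrconfig)
       String api_endpoint = configuration.get("userinfo.url").toString();
       users = new MyUserService(api_endpoint, ..., ...);
    }

    // called per request, e.g. for click_recency(USERID,'category')
    public ValueSource parse(FunctionQParser fp) throws SyntaxError {
       String user_id = fp.parseArg();
       String field_to_eval = fp.parseArg();
       User user = users.get(user_id);
       return new UserRecencyValueSource(user, field_to_eval);
    }
}

public class UserRecencyValueSource extends ValueSource {
   private final User user;
   private final String field_to_eval;

   public UserRecencyValueSource(User user, String field_to_eval) {
     this.user = user;
     this.field_to_eval = field_to_eval;
     // ...
   }

   public FunctionValues getValues(Map context, AtomicReaderContext reader) throws IOException {
      // build a lookup of the user's recency, keyed by category
      final Map<String, Double> clicks_by_category = // ...;
      return new DocTermsIndexDocValues(this, reader, field_to_eval) {
         protected String toTerm(String readableValue) { return readableValue; }
         public Object objectVal(int doc) {
             String field_value = strVal(doc);
             // has this person clicked on it, if not just return 0
             if (!clicks_by_category.containsKey(field_value)) return 0D;
             return clicks_by_category.get(field_value);
         }
         // numeric uses (sorting, log(), max()) likely also need doubleVal()/floatVal() overridden
      };
   }
   // ValueSource also requires equals(), hashCode() and description(); omitted here
}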


To enable our component we update the solrconfig and register our user function.

  <valueSourceParser name="click_recency" class="com.foo.bar.MyUserFunction">
    <str name="userinfo.url">http://..../users</str>
    <str name="userinfo.uid">test</str>
    <str name="userinfo.pwd">test</str>
  </valueSourceParser>

Now with this in place, we can sort, filter, etc. by our custom component, and because it’s implemented in a standard way we can also combine it with all the built-in Solr functions and any other functions you might have in your library. So…

&sort=log(max(click_recency(tsnyder,'category'), click_recency(tsnyder,'brand'))) desc

So in the above hypothetical we sort all documents, in descending order, by the log base 10 of the larger of two values: category click recency and brand click recency. Now imagine your library grew to contain frequency, spend, quantity, and more. Considering that Solr also has functions such as scale, cos, tan, etc., we now have a very flexible way of scoring documents, with a practically unlimited number of combinations.
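To see what that sort expression does to the sample index, here is a plain-Java simulation that runs outside Solr. The category click counts come from the user model shown earlier; the brand column, the brand names, and the brand click count are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class SortSimulation {
    static final class Media {
        final int id; final String category; final String brand;
        Media(int id, String category, String brand) {
            this.id = id; this.category = category; this.brand = brand;
        }
    }

    // Mirrors log(max(click_recency(u,'category'), click_recency(u,'brand'))).
    static double score(Media m, Map<String, Double> byCategory, Map<String, Double> byBrand) {
        double category = byCategory.getOrDefault(m.category, 0.0);
        double brand = byBrand.getOrDefault(m.brand, 0.0);
        return Math.log10(Math.max(category, brand));
    }

    // Returns media ids ordered by score, highest first.
    static List<Integer> rank(List<Media> docs, Map<String, Double> byCategory, Map<String, Double> byBrand) {
        List<Media> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble((Media m) -> score(m, byCategory, byBrand)).reversed());
        List<Integer> ids = new ArrayList<>();
        for (Media m : sorted) ids.add(m.id);
        return ids;
    }

    public static void main(String[] args) {
        List<Media> docs = List.of(
                new Media(1, "AUTO", "BRAND_A"),   // brands are hypothetical
                new Media(2, "PET", "BRAND_B"),
                new Media(3, "ANIME", "BRAND_C"));
        Map<String, Double> byCategory = Map.of("CLOTHES", 10.0, "SOFTWARE", 9.0, "PET", 9.0);
        Map<String, Double> byBrand = Map.of("BRAND_A", 3.0); // hypothetical
        System.out.println(rank(docs, byCategory, byBrand)); // prints [2, 1, 3]
    }
}
```

In this plain-Java version a document with no matching clicks scores log10(0), which is negative infinity, so it sorts last. If that is a concern in a real expression, adding 1 before taking the log may be worth considering.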

Final Thoughts

If you are still questioning how powerful this concept is, go check out Solr Image Search and its live demo of image search in Solr, which uses Solr function queries to find patterns within images and return similar images.

