Easy Method Scheduler with Spring 3.x

Scheduling by Annotation

The Java community is wide, and sometimes it’s easy to miss something simple. Always “google it first”: if you are doing something and it seems harder than it should be, you’re probably doing it wrong. So today, let’s go ahead and schedule a recurring process in a Jersey REST service.

Use Case: Updating cached values every (N) period

Say you have a website that needs to be responsive, but it needs to “call home” every now and then to make sure it’s running with the latest settings and data. In effect you want an expiring cache, where entries expire every (N) period (a TTL, or time to live). With Java and Spring this is super easy.

Code

We need a public method that does not return a value, and we need to define the period. Spring has many options (http://docs.spring.io/spring/docs/3.0.x/spring-framework-reference/html/scheduling.html) for defining the period, from cron expressions to fixed intervals. In this instance I am choosing fixedDelay, which measures the interval from the completion of the previous execution rather than from its start (that would be fixedRate), so if the refresh operation takes a while we won’t overlap or pile up executions.

private static final long TEN_MINUTES = 1000 * 60 * 10;

// runs ten minutes after the previous refresh completes
@Scheduled(fixedDelay = TEN_MINUTES)
public void refresh() {
   // ... do something cool to update your cache/data
}

Spring

Now we add our Spring configuration and, lo and behold, we’ll have automated polling! The only catch is that the class containing the @Scheduled method must itself be registered as a Spring bean (for example via component scanning). It really is that easy!

<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:task="http://www.springframework.org/schema/task"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.0.xsd
        http://www.springframework.org/schema/task http://www.springframework.org/schema/task/spring-task-3.0.xsd">

    <task:annotation-driven executor="myExecutor" scheduler="myScheduler"/>
    <task:executor id="myExecutor" pool-size="5"/>
    <task:scheduler id="myScheduler" pool-size="10"/>

</beans>
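
If you’re on Spring 3.1 or newer, an equivalent Java-config setup is also possible. Here’s a minimal sketch, assuming your @Scheduled bean lives in a package you point the component scan at (the class and package names below are made up):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.TaskScheduler;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.concurrent.ThreadPoolTaskScheduler;

// Hypothetical Java-config equivalent of the XML above (requires Spring 3.1+)
@Configuration
@EnableScheduling
@ComponentScan("com.example.cache") // package containing the @Scheduled bean; adjust to yours
public class SchedulingConfig {

    // mirrors the XML scheduler pool-size of 10
    @Bean
    public TaskScheduler taskScheduler() {
        ThreadPoolTaskScheduler scheduler = new ThreadPoolTaskScheduler();
        scheduler.setPoolSize(10);
        return scheduler;
    }
}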

User Based Personalization Engine with Solr

Problem

You have a varied collection of media you want to personalize. It could be links, websites, friends, animals, recipes, videos, etc. The content has meta-data attributes that aren’t very clean: sometimes brand names are abbreviated, categories are pluralized, casing differs, and so on. Media also has targeting attributes; maybe it’s only applicable to Florida residents, or to people who live within a 30-mile radius of Tampa, or, in a more complex example, within a predefined market zone (a custom geo polygon representing a sales territory or market). Media can also come in different formats: web, email, video, kiosk, etc.

Objective

You need to personalize this content given the following dimensions:

  • The meta-data about the media, such as brand, flavor, category, feature, price, value, GTIN, etc.
  • The current market information, such as clicks, impressions, purchases, views, average spend, etc.
  • Standard algorithms such as frequency, recency, and segmentation based on an individual user

Stories

  • As a personalization engine I want to be able to get all media applicable to a particular user, rank-sorted by its score.
  • As a personalization engine I want to be able to return relevant media that takes into account endpoint information such as lat/long, banner, etc. so that I don’t recommend media that doesn’t fit the channel, store, or sales objective.

Custom Implementation

Say you built this by hand; what would you need, at minimum?

  • Code/Db/Schema to store documents representing media
  • Code for APIs around fetching/filtering/pagination of those documents
  • Code to import those documents
  • Scale-out of the scoring computation
  • Media filter semantics (for example, score documents with the phrase ‘RED BULL’ in them)
  • Auto-suggest/did-you-mean (useful to show related media or near duplicates)
  • Built in geo/location filter semantics (for example, score documents in a 5km radius of Tampa)
  • Flexible document model to allow scoring many different types of media with varying quality of meta-data
  • Ability to score based upon any number of analytic models both on media and user

All these things can be done, and for many people that might be a fun way to learn and grow. For some companies it might even be the best approach, all things considered. However, today we won’t talk about building a custom personalization engine; today we’ll explore what it would look like if you leveraged Solr for this.

Personalization of Media Using Solr

Today we’ll take a stab at user-based personalization in Solr. Why? Because it solves most of the above, has built-in cloud scale, other people have done it, it has a mature API used by large companies, and it has baked-in administration functions, and more. So how do we get started?

Solr Building Blocks

Media Ingestion

So, to store media, we already have that covered in Solr courtesy of its built-in support for XML, CSV, JSON, JDBC (via the DataImportHandler), and a vast array of other formats. For ease, we can simply post documents to Solr, or pull them in from a JDBC source. Connecting Solr with Hadoop is an easy task as well courtesy of the Hive JDBC driver, so regardless of where the media lives, it can be pushed or pulled with ease.
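
For illustration, here’s a minimal SolrJ sketch for pushing a single media document into the index (this assumes SolrJ 4.x on the classpath and a core named media at the URL shown; the field names are hypothetical):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MediaLoader {
    public static void main(String[] args) throws Exception {
        // assumes a Solr core called "media" running locally
        SolrServer solr = new HttpSolrServer("http://localhost:8080/solr/media");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");                       // hypothetical field names
        doc.addField("category", "AUTO");
        doc.addField("advertiser_id", "UTC12341234");

        solr.add(doc);
        solr.commit();
    }
}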

Basic Network Filtering

To filter media by basic things like “only media for this advertiser” we can just use out-of-the-box Solr queries. So if our media has “advertiser_id” as an attribute we can simply do “/media/select?q=advertiser_id:UTC12341234”. Solr is great at this. Further, if we only want media for a particular site or network, we can decorate the media with those tags and slice and dice it accordingly. Typically these “filters” are synonymous with “business rules”, so we can also let external parties pass us this information and avoid having to be concerned with those details ourselves (or having to create custom APIs for them).
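
For instance, those business rules usually boil down to stacked filter queries (fq). A sketch, assuming hypothetical network and site fields on the media documents:

/media/select?q=*:*&fq=advertiser_id:UTC12341234&fq=network:ACME&fq=site:tampa-web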

Geo/Location Filtering

SOLR has a wealth of geo/location filtering abilities, from bounding boxes, to radius, to custom polygon shapes. Media that has attributes like lat/long can be searched for, and if a user is in a particular area we can find relevant deals within (N) km of their current location. Really powerful stuff when combined with market zones!
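
As a rough sketch, a radius filter looks like the following, assuming the media documents carry a spatial field named latlng and using Tampa’s coordinates (d is the distance in kilometers):

/media/select?q=*:*&fq={!geofilt sfield=latlng pt=27.95,-82.46 d=5}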

Media Management

Since all media ends up in Solr we can use native search functionality to manage and monitor the media. Faceted search powers top-(N) media views and insight into overlapping or duplicate media, and fuzzy matching lets us see all the media at a glance and browse/pivot it however a business user needs to. Out-of-the-box UIs can be used or downloaded to drive this (e.g. Hue’s Solr search app).
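
For example, a top-(N) view by brand is just a facet query (the brand field here is hypothetical):

/media/select?q=*:*&rows=0&facet=true&facet.field=brand&facet.limit=10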

Generic Relevancy Algorithms

Solr comes with some fairly nice relevancy tooling out of the box (see the Solr Relevancy FAQ). Note that it already has built-in functions for scoring relevancy on basic audience signals like clicks/popularity, so you could probably stop here if you just wanted to, say, score media by overall clicks in the past hour. In fact LinkedIn and others do this, and there is a nice slide deck on Implementing Click-Through Relevance Ranking.
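
As a sketch of what that looks like, if each media document carried an indexed clicks count you could fold popularity into the score with an edismax boost function (the clicks, title, and category fields are assumptions):

/media/select?defType=edismax&q=red+bull&qf=title+category&boost=log(sum(clicks,1))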

Domain Specific Relevancy

So we are 99% there, but let’s say we need to tailor the scores and have finer, more mathematical control over scoring. We can do this by implementing domain-specific concepts in Solr. It already has plug-and-play semantics for this, so that in real time we can mash a user’s preferences, behavior data, and segmentation information together with each piece of media to compute a score (or many scores). This is opened up to us by implementing Solr function queries. Solr already ships with many out of the box; the only piece that is missing is being able to mix your user information with your media.

And because Solr has built-in support for this, we can filter, sort, and return these mathematical gems to build up an expressive library of domain-specific functions.

Example: Recency Function

Let’s start with a basic example: we want to compute the recency in days since the last click on a particular category. We need to be able to tell our function “who” so it can look up the user’s information (from a database, API, etc.) and we also need to tell it “what” we want to score on.

http://localhost:8080/solr/media/select/?q=*:*&fl=*,myvalue:click_recency(USERID,'category')

In this instance “myvalue” is the value returned back to us (we aren’t sorting or filtering yet). “click_recency” is our custom function. “USERID” is the user identifier, which will be used to look up whatever information you have about the consumer’s category clicks, and finally “category” is the field name in the Solr media index to weight against.

Assume we have a document index as follows:

media #   |  category   |  .... |  .... | ...
1         |  AUTO       |  .... |  .... | ...
2         |  PET        |  .... |  .... | ...
3         |  ANIME      |  .... |  .... | ...

Assume we have access to an API that will return information about a particular user (maybe from a database, nosql, or some paid-provider like a DMP).

{
    "id": "1234",
    "clicks": {
       "category": {
          "CLOTHES": 10,
          "SOFTWARE": 9,
          "PET": 9
       }
    },
    ...,
    ...
}

In our simple example the user model doesn’t contain anything like when clicks were made; it holds just aggregates. But depending on the richness of your user model you could certainly create computations that take frequency, recency, and so on into account.

Parser

public class MyUserFunction extends ValueSourceParser { 
    private MyUserService users;
    // called when our function is initialized so we can
    // configure it by telling it where external sources are
    // or maybe how much/long to cache, or the uid/pwd to access, etc...
    public void init(NamedList configuration) {
       // get your user information api endpoint here
       String api_endpoint = configuration.get("userinfo.url").toString();
       users = new MyUserService(api_endpoint, ..., ...);
    }

    public ValueSource parse(FunctionQParser fp) throws SyntaxError {
       String user_id = fp.parseArg();
       String field_to_eval = fp.parseArg();
       User user = users.get(user_id);
       return new UserRecencyValueSource(user, field_to_eval);
    }
}

Extractor

public class UserRecencyValueSource extends ValueSource {
   private final User user;
   private final String field_to_eval;

   public UserRecencyValueSource(User user, String field_to_eval) {
     this.user = user;
     this.field_to_eval = field_to_eval;
   }

   public FunctionValues getValues(Map context, AtomicReaderContext reader) throws IOException {
      // build a lookup of the user's click aggregates keyed by category
      final Map<String, Double> clicks_by_category = // ...;
      return new DocTermsIndexDocValues(this, reader, field_to_eval) {
         public Object objectVal(int doc) {
             String field_value = strVal(doc);
             // has this person clicked on it? if not just return 0
             if (!clicks_by_category.containsKey(field_value)) return 0D;
             return clicks_by_category.get(field_value);
         }
      };
   }
}

Enable

To enable our component we update the solrconfig and register our user function.

  <valueSourceParser name="click_recency" class="com.foo.bar.MyUserFunction">
    <str name="userinfo.url">http://..../users</str>
    <str name="userinfo.uid">test</str>
    <str name="userinfo.pwd">test</str>     
  </valueSourceParser>

Now, with this in place, we can sort/filter/etc. by our custom function, and because it’s implemented in a standard way we can also combine it with all the other Solr functions (and any others you might have in your library). So…

&sort=log(max(click_recency(tsnyder,'category'), click_recency(tsnyder,'brand'))) desc

So in the above hypothetical we sort all documents, in descending order, by the log base 10 of the larger of the category click recency and the brand click recency. Now imagine your library grew to contain frequency, spend, quantity, and more. Considering Solr also has functions such as scale, cos, tan, etc., we can now score documents in a practically unlimited number of ways.
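
You can also filter on these functions directly; for example, a function range query keeps only media whose category click recency clears some threshold (the threshold value here is arbitrary):

&fq={!frange l=1.0}click_recency(tsnyder,'category')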

Final Thoughts

If you are still questioning how powerful this concept is, go check out Solr image search and its live demo, which uses Solr function queries to find patterns within images and return similar images.


AWS CLI and JQ For Automation of Environments

So I’ve had a problem recently; it might be familiar to others.

I constantly need to provision a brand new environment, and I always run into a snag. Basically, Vagrant keeps state in a folder under .vagrant, and for the Vagrant cloud plugins this holds the AWS instance id it THINKS it should provision to; if the folder doesn’t exist, a new instance is created.

The problem comes in when I shut down an instance in the EC2 console, the EC2 instance itself blows up, or I want to provision a new environment. Any of these actions causes the Jenkins-held state to go out of sync with the real world, because the instance ID in the .vagrant folder no longer matches reality, and when this happens we lose all ability to provision or re-provision.

I’ve been solving this by wiping out my workspace every time it happens in order to provision a brand new environment. That’s not so bad, except we have clusters of machines that require this wipe-out and re-do, we waste time pulling down git content that hasn’t actually changed, and, worst of all, it’s manual, which involves the whole tribal-knowledge thing… yuck.

Consider more complex environments where we end up running multiple Jenkins instances: one Jenkins might have the .vagrant folder and another doesn’t. Or maybe the worst happens and our Jenkins box gets knocked out. Without this state we would end up bringing up multiple instances (both expensive and a likely source of errors). So what to do?

With a little bash voodoo we can scan Amazon for instances already in the ‘pending’ or ‘running’ state using the AWS CLI, and then let Vagrant know what the state of the world really is.

We can “find” our instances by issuing the command below, using the excellent AWS CLI along with jq.

aws ec2 describe-tags --filters Name=resource-type,Values=instance | \
         jq '.Tags[] | {Key,Value,ResourceId}' | \
         jq '. | select(.Key=="Name")' | \
         jq '. | select(.Value=="YOURNAMEHERE").ResourceId'

This little gem will return either nothing or the instance ID of the machine you are looking for (our instances are all uniquely named). So now we can use this to conditionally run either a vagrant up YOURNAMEHERE or a vagrant provision YOURNAMEHERE depending on the result.

The trick to getting provisioning to work from scratch (let’s say you configure Jenkins to reset your git workspace on every build) is to create the correct file at .vagrant/machines/YOURNAMEHERE/aws/id when the above yields the id of an active instance.

NAME="YOUR_BASE_VAGRANT_NAME_HERE"

cd puppet
chmod +x *.sh
cd "boxes/$NAME"

for ID in $(seq 1 $NUMBER_OF_CAPACITY_NODES);
do
  NODE_ID=$(printf "%02d" $ID)
  VAGRANT_NAME="$NAME-$NODE_ID"
  EC2_TAG_NAME="$NAME-$NODE_ID-$ENV"
  echo "$EC2_TAG_NAME"

  # reset the cached vagrant-aws state for this node so we can rewrite it below
  rm -rf ".vagrant/machines/$VAGRANT_NAME/aws" || true
  mkdir -p ".vagrant/machines/$VAGRANT_NAME/aws"

  INSTANCE_ID=$(aws ec2 describe-instances --filters Name=instance-state-name,Values=running,pending | jq '.Reservations[]' | jq '.Instances[] | { InstanceId, Tags }' | jq 'select(has("Tags")) | select(.Tags[]["Key"] == "Name" and .Tags[]["Value"] == "'"$EC2_TAG_NAME"'") | .InstanceId')
  INSTANCE_ID=$(echo $INSTANCE_ID | sed "s/\"//g")
  echo $INSTANCE_ID

  if [ -n "$INSTANCE_ID" ]; then
    echo "=================== PROVISION ENVIRONMENT ======================="
    # instance already exists: point vagrant at it, then re-provision
    echo $INSTANCE_ID > ".vagrant/machines/$VAGRANT_NAME/aws/id"
    DEPLOYMENT_OUTPUT=`ENV=$ENV NODE=$NODE_ID vagrant provision $VAGRANT_NAME`
    # fail the build if vagrant reported "VM not created"
    test 0 -eq `echo $DEPLOYMENT_OUTPUT | grep "VM not created" | wc -l` -a 0 -eq $?
  else
    echo "=================== BRAND NEW ENVIRONMENT ======================="
    # no matching instance in EC2: bring up a brand new one
    DEPLOYMENT_OUTPUT=`ENV=$ENV NODE=$NODE_ID vagrant up $VAGRANT_NAME --provider=aws`
    test 0 -eq `echo $DEPLOYMENT_OUTPUT | grep "VM not created" | wc -l` -a 0 -eq $?
  fi

done

Now we can provision and re-provision and still use the standard Amazon control panel: we can blow away instances in the EC2 console, and on the next Jenkins push the script will detect that no instance exists and automatically provision a new environment.

It’s important to note that the above lets us spin up (N) Vagrant instances. Useful, for example, when node 09 was terminated: when we run this, the script will re-provision 01-08 and bring up a new 09.


Standard Maven : Multi-Module Maven Projects and Parent POMs

Abstract

So you have a platform you are trying to build. Most of it is written in Java, with maybe a little node.js, zeromq, or other such goodies, but you want to manage your Java resources with as much out-of-the-box engineering as possible to reduce the boilerplate and headaches that come with a compiled language. You decide to go with Maven because, well, it’s better than JAR hell, plenty of cool kids use it (GitHub, etc.), tooling support is more than decent, and you’ve got support from most if not all Apache projects for dependencies.

Requirements

  • The system should be able to build the entire project and all dependencies in one go.
  • The system should be able to load the entire solution into an IDE (Eclipse)
  • The system should follow DRY and KISS – things should be in one place, and things should only do one thing and do it well
  • The system should be able to create a pit of success (proper documentation, unit tests, code coverage, reports, site creation, etc)

Structure

The example solution structure below is the same as that used by Hibernate and some top-level Apache projects.

./your_solution
 ./pom.xml            (solution level pom, glue for which modules are part of solution)
 ./modules            (all modules for the solution)
   ./core             (common shared lib, usually domain objects, etc)
     ./src
     ./pom.xml        (standard maven module pom with ref to parent)
   ./api              (api for exposing jersey, jax-ws, jax-b services)
     ./src
     ./pom.xml        (standard maven module pom with ref to parent)
   ./parent           (parent pom container)
     ./pom.xml        (standard parent pom)

Features

Solution POM

The solution POM exists simply to set up the list of modules. Used primarily by build tools as well as IDEs, this POM contains a listing of all modules that make up the solution, combined with default properties and/or build profiles, a default group id, packaging, and so on.
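
A minimal solution-level POM might look like this (the group and artifact ids are placeholders):

<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>your-solution</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <packaging>pom</packaging>

    <modules>
        <module>modules/parent</module>
        <module>modules/core</module>
        <module>modules/api</module>
    </modules>
</project>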

Parent POM

The parent POM contains all common settings, build plugins, common libraries, frameworks, test tools, etc. Normally this includes standards like custom repositories and which Java version to target, plus common build dependencies such as JUnit and the logging frameworks.

Finally, the parent POM usually specifies the site generators, javadoc options, team members, and so on for site generation.
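
Sketched out, the parent POM centralizes versions and common test dependencies; something along these lines (the coordinates and versions are illustrative):

<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>parent</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <packaging>pom</packaging>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
    </properties>

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit</artifactId>
                <version>4.11</version>
                <scope>test</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>
</project>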

Module POM(s)

Each module POM is specific to one logical component in the overall app/solution architecture. They are normally separated out along classic N-tier lines: a domain object module, business rule module, data module, API module, webapp module, etc.
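
Each module POM then stays tiny, declaring little more than its parent and its own dependencies (the ids are again placeholders):

<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>com.example</groupId>
        <artifactId>parent</artifactId>
        <version>1.0.0-SNAPSHOT</version>
        <relativePath>../parent/pom.xml</relativePath>
    </parent>
    <artifactId>core</artifactId>
    <packaging>jar</packaging>
</project>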

Continuous Delivery

  • Enabled by applying top level code quality checks in the parent POM (checkstyle, PMD, etc)
  • Enabled by applying top level doc generation in the parent POM (JavaDocs, maven sites, etc)

Self Assembling Nodes for Elastic Compute Resource Resolution

So let’s set this up as a simple problem statement:

You have an elastic compute grid, which allows you to add and remove resources at will. As you add capacity in the form of MySQL instances, memcache, redis, and similar resources, you want your various tiers to be notified of these additions and subtractions, so your applications can essentially just “hang out” and wait for the resources they need to become available. You don’t want to manually configure IPs, ports, DNS, etc. You simply want to “hot-swap” these resources in and feel assured that (N) or more clients can find the available services out there in the wild.

SOA had the concept of a bus and discovery, which usually goes something like either (A) use a single registry to perform lookup/routing or (B) use UDP broadcasts to let your applications “self-discover” their environment.

I am now a huge fan of Puppet and node.js along with zeromq, and as such I believe the solution is made much simpler thanks to the pgm/epgm protocols.

Let’s take a look at the two node.js pieces involved…

Infrastructure subscriber

The subscriber is only a few lines of node.js: create a zeromq SUB socket on an epgm endpoint and subscribe to the tag you care about (it can be any string). With that in place you can rest assured that when infrastructure is added with that tag, you will pick it up and do something with it (i.e. configure it).

Infrastructure publisher

The publisher is the mirror image: each new node announces itself on the same epgm endpoint under the agreed tag. Given this simple and effective setup, as you add nodes to Amazon EC2, Rackspace, or wherever, provided you are within broadcast/multicast range, you should be able to pick up instances dynamically, perform an operation, and then react again when those instances disappear or go offline.

Use Case

As a user I have the option of hosting many Solr instances for a big-data solution. One of the problems with managing a large number of Solr instances is the complexity of the infrastructure. In order to create shards and distributed queries I need to keep my Solr configuration up to date with all of the Solr instances that are available, to ensure my queries cover the entire spectrum. As I add or remove nodes in my cloud I want this configuration to stay in sync. Normally this requires automation, but I don’t want a human typing in IP addresses or a complex management structure to handle it. What I want is the ability to drop in Solr instances, pluck them out, and have the system elastically adjust and compensate.

How the solution maps

With the proposed solution we now have a means by which, as Solr instances get added, we can invoke a Puppet run to update our Solr configuration (using Facter). The moment a change is detected we update Facter with the latest information needed and then execute Puppet, which does the required re-configuration/uninstallation/etc. tasks.

But what about zookeeper?

The latest Solr trunk (4.0) includes ZooKeeper for this kind of thing, but that introduces complexity: we need leader election and fail-over for the ZooKeeper ensemble, and it becomes a central point of contention. While this may be fine in some cases, in others, where the farm itself could be upwards of 100 servers (hundreds of billions of Solr docs), we want to handle outages gracefully.

Using this approach we can also have our instances geographically dispersed by creating a “forwarder” that can listen to a separate network and receive the same pub/sub updates.

Looking forward to open-sourcing this to github as time goes on.
