Question : If you happen to like a particular Movie A, what is the probability you will also like Movie B ?

I’ll try to describe a simple rudimentary system that will try to answer that question.

Let's make the following assumptions :

  • We have an online community of users that we can collect information on whether they liked a particular movie and store those answers in a database.
  • The end user can make 3 distinct choices for each movie: they can either like it, dislike it or not care for it.

Setting up the Environment

Our database schema is represented by the following three tables.

sql1.png

The Users table contain information about our users. The Ratings table contain the actual votes cast for a particular RatingDefition and the RatingDefinition table contains the Movie names we’ll use for this example.

You can use the following scripts to create the database objects and populate with relevant data. For this particular case we have a total of 50 movies with 10000 users and a total of 500000 like/dislike/neutral votes. For this reason the populate script may take a while to complete.

 

The basic algorithm of calculating the probability of liking a Movie B given that you like Movie A is implemented as a User Defined Function that takes 2 parameters. First the movie that we liked and the second is the movie that we want to test.

CREATE FUNCTION [dbo].[fnGetProbabilityOfLikingItem] 
(
    @ItemAlreadyLiked UNIQUEIDENTIFIER,
    @ItemToBeTested UNIQUEIDENTIFIER
)
RETURNS FLOAT
AS
BEGIN
RETURN ( 
  
        SELECT  (SUM(CASE WHEN r.Rating >0 THEN 1.0 ELSE 0.0 END)/SUM(CASE WHEN r.Rating !=0 THEN 1.0 ELSE 0.0 END)) * 100
        FROM    Ratings r
        WHERE   r.UserId IN
                    (
                    -- all the users that liked that item
                    SELECT UserId 
                    FROM dbo.Ratings 
                    WHERE RatingDefinitionId = @ItemAlreadyLiked AND rating > 0
                    )
                    AND r.RatingDefinitionId = @ItemToBeTested
        )
END

This UDF determines the universe of users that had a favorable review of Movie A and then gathers all the other movies that they’ have also liked and tabulates them based on the number of votes cast allowing us to do pair-wise comparisons like this :

DECLARE @ItemAlreadyLiked UNIQUEIDENTIFIER
DECLARE @ItemToBeTested UNIQUEIDENTIFIER
     
SELECT  @ItemAlreadyLiked = Id 
FROM    dbo.RatingDefinition
WHERE   Name = 'Titanic'
  
SELECT  @ItemToBeTested = Id
FROM    dbo.RatingDefinition
WHERE   Name = 'The Dark Knight'
  
SELECT  ProbabilityOfLiking = dbo.fnGetProbabilityOfLikingItem(@ItemAlreadyLiked, @ItemToBeTested)
  
RESULT : 
  
ProbabilityOfLiking
----------------------
50.9977
  
(1 row(s) affected)

Since the function is written as a scalar UDF we can also to things like :

query_thumb.png

It maybe useful to present a list to the user that they may also be interested in viewing the other movies listed. One thing to note here is the calculated probability numbers appear to be constrained within close proximity of each other. This is due to the fact the algorithm used to generate the votes relies on a random number generator which yields a uniform distribution of votes (there a roughly equivalent number of like, dislike and neutral votes) resulting in this tight spread.

 

In closing the method outlined here is a much simplified version of what actually might be used in a real world application where the business logic will typically require analysis across several dimensions of varying numbers and their effects on each other rather than the single dimension that we have chosen to take a look at here.

 

Here's a small class that can be used to generate Combinations/Permutations. 

public class Combinatorics
{
    public static IEnumerable<IEnumerable<T>> Permutations<T>(IList<T> items, int setSize)
    {
        if (setSize > 0)
            foreach (T item in items)
            {
                IList<T> remainingItems = items.Where(n => !item.Equals(n)).ToList();

                foreach (IEnumerable<T> remainingPermutations in Permutations(remainingItems, setSize - 1))
                    yield return Enumerable.Concat<T>(new T[] { item }, remainingPermutations);
            }
        else
            yield return new T[0];
    }

    public static IEnumerable<IEnumerable<T>> Combinations<T>(IList<T> items, int setSize)
    {
        int[] a = new int[setSize];
        List<T> returnList = new List<T>(setSize);

        for (int i = 0; i < a.Length; i++)
            a[i] = i;

        foreach (int item in a)
            returnList.Add(items[item]);

        yield return returnList.ToArray();

        long numLeft = GetTotalNumberOfCombinations(items.Count, setSize);

        while (numLeft > 1)
        {
            int i = setSize - 1;
            while (a[i] == items.Count - setSize + i)
                i--;

            a[i] = a[i] + 1;
            for (int j = i + 1; j < setSize; j++)
                a[j] = a[i] + j - i;

            numLeft--;

            returnList.Clear();

            foreach (int item in a)
                returnList.Add(items[item]);

            yield return returnList.ToArray();

        } 
    }

    private static long GetTotalNumberOfCombinations(long itemCount, long setSize)
    {
        //takes advantage of the fact that COMBIN(100,97) == COMBIN(100,3) 
        
        if (itemCount < setSize) return 0;
        if (itemCount == setSize) return 1;

        long delta, iMax;

        delta = setSize < itemCount - setSize ? itemCount - setSize : setSize;
        iMax = setSize < itemCount - setSize ? setSize : itemCount - setSize;
        
        long returnValue = delta + 1;

        for (long i = 2; i <= iMax; ++i)
            checked { returnValue = (returnValue * (delta + i)) / i; }

        return returnValue;
    }
}