Question : If you happen to like a particular Movie A, what is the probability you will also like Movie B ?

I’ll try to describe a simple rudimentary system that will try to answer that question.

Let's make the following assumptions :

  • We have an online community of users that we can collect information on whether they liked a particular movie and store those answers in a database.
  • The end user can make 3 distinct choices for each movie: they can either like it, dislike it or not care for it.

Setting up the Environment

Our database schema is represented by the following three tables.


The Users table contain information about our users. The Ratings table contain the actual votes cast for a particular RatingDefition and the RatingDefinition table contains the Movie names we’ll use for this example.

You can use the following scripts to create the database objects and populate with relevant data. For this particular case we have a total of 50 movies with 10000 users and a total of 500000 like/dislike/neutral votes. For this reason the populate script may take a while to complete.


The basic algorithm of calculating the probability of liking a Movie B given that you like Movie A is implemented as a User Defined Function that takes 2 parameters. First the movie that we liked and the second is the movie that we want to test.

CREATE FUNCTION [dbo].[fnGetProbabilityOfLikingItem] 
    @ItemAlreadyLiked UNIQUEIDENTIFIER,
        SELECT  (SUM(CASE WHEN r.Rating >0 THEN 1.0 ELSE 0.0 END)/SUM(CASE WHEN r.Rating !=0 THEN 1.0 ELSE 0.0 END)) * 100
        FROM    Ratings r
        WHERE   r.UserId IN
                    -- all the users that liked that item
                    SELECT UserId 
                    FROM dbo.Ratings 
                    WHERE RatingDefinitionId = @ItemAlreadyLiked AND rating > 0
                    AND r.RatingDefinitionId = @ItemToBeTested

This UDF determines the universe of users that had a favorable review of Movie A and then gathers all the other movies that they’ have also liked and tabulates them based on the number of votes cast allowing us to do pair-wise comparisons like this :

SELECT  @ItemAlreadyLiked = Id 
FROM    dbo.RatingDefinition
WHERE   Name = 'Titanic'
SELECT  @ItemToBeTested = Id
FROM    dbo.RatingDefinition
WHERE   Name = 'The Dark Knight'
SELECT  ProbabilityOfLiking = dbo.fnGetProbabilityOfLikingItem(@ItemAlreadyLiked, @ItemToBeTested)
(1 row(s) affected)

Since the function is written as a scalar UDF we can also to things like :


It maybe useful to present a list to the user that they may also be interested in viewing the other movies listed. One thing to note here is the calculated probability numbers appear to be constrained within close proximity of each other. This is due to the fact the algorithm used to generate the votes relies on a random number generator which yields a uniform distribution of votes (there a roughly equivalent number of like, dislike and neutral votes) resulting in this tight spread.


In closing the method outlined here is a much simplified version of what actually might be used in a real world application where the business logic will typically require analysis across several dimensions of varying numbers and their effects on each other rather than the single dimension that we have chosen to take a look at here.



Comments are closed