Question : If you happen to like a particular Movie A, what is the probability you will also like Movie B ?
I’ll try to describe a simple rudimentary system that will try to answer that question.
Let's make the following assumptions :
- We have an online community of users that we can collect information on whether they liked a particular movie and store those answers in a database.
- The end user can make 3 distinct choices for each movie: they can either like it, dislike it or not care for it.
Setting up the Environment
Our database schema is represented by the following three tables.
The Users table contain information about our users. The Ratings table contain the actual votes cast for a particular RatingDefition and the RatingDefinition table contains the Movie names we’ll use for this example.
You can use the following scripts to create the database objects and populate with relevant data. For this particular case we have a total of 50 movies with 10000 users and a total of 500000 like/dislike/neutral votes. For this reason the populate script may take a while to complete.
The basic algorithm of calculating the probability of liking a Movie B given that you like Movie A is implemented as a User Defined Function that takes 2 parameters. First the movie that we liked and the second is the movie that we want to test.
CREATE FUNCTION [dbo].[fnGetProbabilityOfLikingItem] ( @ItemAlreadyLiked UNIQUEIDENTIFIER, @ItemToBeTested UNIQUEIDENTIFIER ) RETURNS FLOAT AS BEGIN RETURN ( SELECT (SUM(CASE WHEN r.Rating >0 THEN 1.0 ELSE 0.0 END)/SUM(CASE WHEN r.Rating !=0 THEN 1.0 ELSE 0.0 END)) * 100 FROM Ratings r WHERE r.UserId IN ( -- all the users that liked that item SELECT UserId FROM dbo.Ratings WHERE RatingDefinitionId = @ItemAlreadyLiked AND rating > 0 ) AND r.RatingDefinitionId = @ItemToBeTested ) END
This UDF determines the universe of users that had a favorable review of Movie A and then gathers all the other movies that they’ have also liked and tabulates them based on the number of votes cast allowing us to do pair-wise comparisons like this :
DECLARE @ItemAlreadyLiked UNIQUEIDENTIFIER DECLARE @ItemToBeTested UNIQUEIDENTIFIER SELECT @ItemAlreadyLiked = Id FROM dbo.RatingDefinition WHERE Name = 'Titanic' SELECT @ItemToBeTested = Id FROM dbo.RatingDefinition WHERE Name = 'The Dark Knight' SELECT ProbabilityOfLiking = dbo.fnGetProbabilityOfLikingItem(@ItemAlreadyLiked, @ItemToBeTested) RESULT : ProbabilityOfLiking ---------------------- 50.9977 (1 row(s) affected)
Since the function is written as a scalar UDF we can also to things like :
It maybe useful to present a list to the user that they may also be interested in viewing the other movies listed. One thing to note here is the calculated probability numbers appear to be constrained within close proximity of each other. This is due to the fact the algorithm used to generate the votes relies on a random number generator which yields a uniform distribution of votes (there a roughly equivalent number of like, dislike and neutral votes) resulting in this tight spread.
In closing the method outlined here is a much simplified version of what actually might be used in a real world application where the business logic will typically require analysis across several dimensions of varying numbers and their effects on each other rather than the single dimension that we have chosen to take a look at here.