Building a Recommender System Using Machine Learning Techniques and Python

Suppose you are about to buy a mobile phone on Amazon; Amazon recommends a screen-protection glass and a case cover along with it, and even offers a discount if you buy all three things together. If you go to YouTube to watch your favorite song, YouTube recommends other similar songs which you may like. On Google, at the bottom of the search results page you get related searches. You get recommendations everywhere: on Netflix, on Facebook, on Twitter; all the social networking sites have built their own recommender systems. And not only on websites; in supermarkets, too, recommendations are given directly or indirectly. Things which are bought together are kept together, or there are deals if you buy them together. How do they know your buying patterns? How do they know your likes and dislikes? We are going to understand that secret in this article.

Recommender systems are an important and popular area of machine learning. They have helped businesses increase their sales and profits; we cannot imagine a marketplace today that does not have a recommender system in place. These companies analyse our transaction data and try to deduce our buying patterns. From the buying patterns they create association rules, and the association rules are then used to form recommendations and deals. Various data mining algorithms, such as k-means, PageRank, expectation-maximization (EM), and Apriori, are used to build recommender systems. In this article we are going to build our recommender system using the Apriori algorithm.

The Apriori algorithm is one of the most popular and classical algorithms in data mining. It uses association rule mining, a technique to identify underlying relations between different items in a dataset. Let’s first understand association rule mining using a supermarket example. In a supermarket customers can buy a variety of items, and usually there is a pattern in what they buy. For example, if someone wants to prepare a bread sandwich, then most probably she will buy bread, butter, jam and tomato sauce. The chances that tomato sauce is bought along with bread and butter are very high. From the supermarket transaction data we can find many transactions in which all of the above four things are bought together. So we can say that tomato sauce is strongly associated with bread, butter and jam, and the supermarket can create a new rule in its system to keep these four things together, because they are frequently bought together. Association rule mining is nothing but a technique to find such rules.

The Apriori algorithm was proposed by Agrawal and Srikant in 1994. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits or IP addresses). It uses a “bottom-up” approach, where frequent subsets are extended one item at a time (a step known as candidate generation) and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.

With the help of the following simple example we will explore the internal steps of the Apriori algorithm. Below is a sample supermarket dataset; we are going to run Apriori on it to mine association rules.
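For concreteness, here is a small hypothetical dataset, one transaction per line in CSV form, constructed so that it matches every count used in the walkthrough below (five transactions of three items each):

bread,butter,tomato sauce
bread,jam,tomato sauce
bread,butter,jam
bread,tomato sauce,milk
butter,jam,milk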

Apriori uses three important computational factors to mine association rules from the input dataset: support, confidence and lift. Let’s see how each of them is computed.

This is the first step of the Apriori computations. In this step the algorithm identifies all unique items in the input dataset, along with the total number of records. For the sample dataset above, the unique items are bread, butter, jam, milk and tomato sauce, and the total number of records = 5.
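A minimal sketch of this step in Python, with the sample transactions held as a list of lists:

transactions = [
    ['bread', 'butter', 'tomato sauce'],
    ['bread', 'jam', 'tomato sauce'],
    ['bread', 'butter', 'jam'],
    ['bread', 'tomato sauce', 'milk'],
    ['butter', 'jam', 'milk'],
]

# All unique items present in the input dataset
uniqueItems = sorted({item for transaction in transactions for item in transaction})
print(uniqueItems)        # ['bread', 'butter', 'jam', 'milk', 'tomato sauce']

# Total number of records in the input dataset
print(len(transactions))  # 5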

The next step is to calculate the support for each item in the list. Support refers to the default popularity of an item and is calculated as the number of transactions containing that item divided by the total number of transactions.

Suppose we want to find support for item “tomato sauce”. This can be calculated as:

Support(“tomato sauce”) = Transactions containing(“tomato sauce”) / Total Transactions;

In our case there are three transactions out of the total five which have the item “tomato sauce” in them. Hence,

Support(“tomato sauce”) = 3 / 5 = 0.6;

Likewise, Apriori calculates the support for all the items. For our sample dataset the support values work out as shown below.
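In code, the per-item support table can be reproduced like this (continuing with the transactions list above):

from collections import Counter

itemCounts = Counter(item for transaction in transactions for item in transaction)
for item, count in itemCounts.most_common():
    print(item, count / len(transactions))
# bread 0.8
# butter 0.6
# tomato sauce 0.6
# jam 0.6
# milk 0.4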

The next step is to calculate the support for groups of two items. A group of more than one item is called an itemset, for example the itemset (“bread, tomato sauce”). While creating an itemset the order of the items does not matter: the itemset (“bread, tomato sauce”) is the same as the itemset (“tomato sauce, bread”).

The formula to calculate the support for an itemset is the same as the formula for a single item. For example, the support for the itemset “bread, tomato sauce” is calculated as:

Support(“bread, tomato sauce”) = Transactions containing(“bread, tomato sauce”) / Total Transactions;

In our case there are three transactions out of the total five which have both the items bread and tomato sauce in them. Hence,

Support(“bread, tomato sauce”) = 3 / 5 = 0.6;

Likewise, Apriori calculates the support for all the itemsets of size 2, as shown below.
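A sketch for the size-2 itemsets, using itertools.combinations to enumerate the pairs in each transaction:

from itertools import combinations
from collections import Counter

pairCounts = Counter(pair
                     for transaction in transactions
                     for pair in combinations(sorted(set(transaction)), 2))
for pair, count in pairCounts.most_common():
    print(pair, count / len(transactions))
# ('bread', 'tomato sauce') 0.6
# ('bread', 'butter') 0.4
# ... and so on for the remaining pairs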

Similarly, Apriori calculates the support for all the possible itemsets. It continues to generate itemsets of size 1 to N-1, where N is the number of items in the largest transaction. In our example every transaction contains exactly three items, so N = 3 and the algorithm will generate itemsets of size 1 and 2 only; it stops at 2, since N-1 = 2.

The confidence of a rule (bread => tomato sauce) refers to the likelihood that tomato sauce is also bought when bread is bought; it is an indication of how often the rule has been found to be true. Confidence is calculated by finding the number of transactions where bread and tomato sauce are bought together, divided by the total number of transactions where bread is bought. Mathematically, it is represented as

Confidence(A=>B) = (Transactions containing both A and B) / (Transactions containing A);

The above equation can also be represented as:

Confidence(A=>B) = Support(A and B) / Support(A);

So, the support values calculated in the earlier step can be used here to calculate the confidence.

Coming back to our example: there are three transactions in which bread and tomato sauce are bought together, and there are four transactions in which bread is bought. Hence,

Confidence ( bread => tomato sauce ) = 3/4 = 0.75

Confidence can give some important insights, but it also has a major drawback. It only takes into account the popularity of the itemset A (in our case A is “bread”) and not the popularity of B (in our case B is “tomato sauce”). If B (tomato sauce) is popular on its own, then a transaction containing A (bread) is likely to contain B (tomato sauce) anyway, regardless of any real association, which inflates the confidence. This drawback is overcome by calculating lift.

Lift, or lift ratio, is the ratio of confidence to expected confidence, where the expected confidence is simply the frequency (support) of B. Lift tells us how much better a rule is at predicting the result than just assuming the result in the first place. Greater lift values indicate stronger associations.

While calculating lift(A=>B) we take into account the popularity of both items A and B. The lift ratio shows how likely item B is to be purchased when A is purchased. Mathematically, lift is expressed as

Lift(A=>B) = Confidence(A=>B) / Support(B);

Or,

Lift(A=>B) = Support(A and B) / (Support(A) * Support(B));

Now, let’s go back to our example and calculate lift for (bread => tomato sauce).

Lift(bread => tomato sauce) = Confidence(bread => tomato sauce) / Support(tomato sauce);

Lift(bread => tomato sauce) = 0.75 / 0.6 = 1.25

A lift value greater than 1 means that item B is likely to be bought if item A is bought, a value less than 1 means that item B is unlikely to be bought if item A is bought, and a value close to 1 means the two items are bought independently of each other.
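Putting the three measures together, a few small helper functions reproduce the numbers we just derived by hand (still using the sample transactions above):

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(set(itemset) <= set(t) for t in transactions) / len(transactions)

def confidence(A, B):
    return support(A + B) / support(A)

def lift(A, B):
    return confidence(A, B) / support(B)

print(support(['bread', 'tomato sauce']))       # 0.6
print(confidence(['bread'], ['tomato sauce']))  # 0.75
print(lift(['bread'], ['tomato sauce']))        # 1.25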

Now, let’s build a working Python program which will perform all the calculations described above. It should list all the unique items, create the unique itemsets, calculate support for the items and itemsets, and calculate confidence and lift for all the possible item/itemset combinations. The program should return recommendations for a given item or itemset.

The first thing required in any machine learning / data mining system is data. We are using supermarket data to build a recommender system for the supermarket. Our data is in CSV form; each line in the CSV represents the list of items purchased by one customer.

We are making use of the package apyori to calculate the association rules. If it is not installed on your system, install it using the command “pip install apyori”.

Now, let’s write the Python code. First we import the required packages and define a class Recommender with its constructor. While instantiating a Recommender object we need to provide it the data file.
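A minimal sketch of the class and its constructor; the class name and the AssociationRulesDictionary attribute come from the description in this article, while the body is one plausible implementation:

import csv
from apyori import apriori

class Recommender:
    def __init__(self, dataFile):
        # Each line of the CSV is one customer transaction (a list of items)
        with open(dataFile) as f:
            self.transactions = [row for row in csv.reader(f)]
        # Extracted rules: antecedent -> list of (consequent, lift)
        self.AssociationRulesDictionary = {}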

The next step is to compute the association rules; for that we need to compute support, confidence and lift, all explained above. The following function is defined to compute all these things.
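A sketch of computeRules(); the threshold values passed to apriori() are illustrative assumptions and should be tuned for your dataset:

# inside class Recommender:
    def computeRules(self):
        # apriori() computes support, confidence and lift for every
        # candidate rule that clears the thresholds below.
        self.rules = list(apriori(self.transactions,
                                  min_support=0.003,    # assumed threshold
                                  min_confidence=0.2,   # assumed threshold
                                  min_lift=2))          # assumed threshold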

In the code above you can see that all the important computations are done in the call to apriori(), which comes from the imported package apyori. We do not have to do anything; everything is taken care of by apriori()!! This is the reason I am in love with this language. It is so simple; it is so powerful!! The ecosystem is so rich that most things have already been implemented. This reminds me of a famous Python joke. One day little Jony was flying up in the sky!! Little Jack asked him from the ground, “Hey Jony, how come you are flying so high up in the sky? What did you do?” Little Jony replied, “Nothing… I just imported the antigravity module of Python!!” :D :D

apriori() returns the output in the form shown below. Our next task is to extract the information from this output and store it in a dictionary for easy access.

Apriori output:
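Each element of the output is a RelationRecord namedtuple shaped like the one below (the support and confidence figures here are illustrative; the lift value is the one we will meet again later):

RelationRecord(items=frozenset({'red wine', 'spaghetti'}),
               support=0.0059,   # illustrative value
               ordered_statistics=[OrderedStatistic(
                   items_base=frozenset({'red wine'}),
                   items_add=frozenset({'spaghetti'}),
                   confidence=0.3411,   # illustrative value
                   lift=2.095966120638976)])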

To extract the rules from the output of apriori() we have defined extractRules(). This function extracts the rules and stores them in AssociationRulesDictionary, in descending order of lift.
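A sketch of extractRules(); the exact shape of AssociationRulesDictionary is my choice here, mapping each antecedent itemset to its consequents sorted by descending lift:

# inside class Recommender:
    def extractRules(self):
        for record in self.rules:
            for stat in record.ordered_statistics:
                if not stat.items_base:   # skip rules with an empty "if" part
                    continue
                key = tuple(sorted(stat.items_base))
                self.AssociationRulesDictionary.setdefault(key, []).append(
                    (sorted(stat.items_add), stat.lift))
        # keep the strongest rules first
        for key in self.AssociationRulesDictionary:
            self.AssociationRulesDictionary[key].sort(key=lambda x: x[1],
                                                      reverse=True)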

The above two methods should not be called outside of the class definition. Being an old-fashioned C++ developer, I stick to this restriction, even though Python does not enforce it. I leave it to you; if you want to call these two functions outside of the class, you may.

We have also defined a method studyRules() which calls the two functions above, computeRules() and extractRules(). studyRules() plays the role of a template method here: in case you want to extend the Recommender class and use another algorithm instead of Apriori to compute the association rules, studyRules() will give you the polymorphic behavior. An object of the Recommender class should call studyRules() to compute and extract the association rules from the input data file.
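The template method itself is tiny:

# inside class Recommender:
    def studyRules(self):
        # Subclasses can override computeRules()/extractRules();
        # callers of studyRules() stay unchanged.
        self.computeRules()
        self.extractRules()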

Now we have all the computations done and all the association rules extracted into a dictionary. Let’s define some APIs to return recommendations and deals using the computed association rules.

recommend() returns recommendations for the selected item or itemset. For example, if you want to find out which item is most preferred along with ‘red wine’, you can get this value by calling recommend() as shown below.
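A sketch of recommend(): look up the rules for the given itemset and return the strongest few:

# inside class Recommender:
    def recommend(self, itemList, numberOfRecommendations=1):
        key = tuple(sorted(itemList))
        return self.AssociationRulesDictionary.get(key, [])[:numberOfRecommendations]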

objectName.recommend(['red wine'], 1)

In our case, for the data file we are using, it returns

[(['spaghetti'], 2.095966120638976)]

This tells us that ‘spaghetti’ is the most preferred item along with ‘red wine’.

Now, let’s formulate deals from these recommendations. The API showDeals() is defined to create deals using the recommendations returned by recommend(). In showDeals() we calculate a discount percentage using the lift value of the recommended itemset; recommend() returns somewhat abstract output, and showDeals() tries to give it some meaning.
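A sketch of showDeals(); deriving the discount percentage as lift multiplied by 10 is an assumption on my part, chosen because it reproduces the sample output below:

# inside class Recommender:
    def showDeals(self, itemList, numberOfDeals=1):
        for items, lift in self.recommend(itemList, numberOfDeals):
            discount = round(lift * 10, 2)   # assumed discount scheme
            print("If you buy {} along with {} then you will get {} % "
                  "discount on total cost!!".format(items, itemList, discount))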

Now, our entire recommender system is ready!! Let’s play with it to see how it behaves.
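For example, with a hypothetical data file name:

recommender = Recommender('storeData.csv')   # hypothetical file name
recommender.studyRules()
recommender.showDeals(['red wine'], 2)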

Hey Alexa! Show me the two topmost deals on ‘red wine’. And Alexa shows you the following deals.

1. If you buy ['spaghetti'] along with ['red wine'] then you will get 20.96 % discount on total cost!!

2. If you buy ['mineral water'] along with ['red wine'] then you will get 16.3 % discount on total cost!!

Isn’t it amazing? You can use the same recommender system to build a video or song recommender; you just need to provide the corresponding data file. For example, it would be something like this:
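michael = Recommender('songsDataFile.csv')   # hypothetical file name
michael.studyRules()
print(michael.recommend(['summer of 69', 'Billie Jean'], 5))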

Here, Michael would recommend the five songs most often listened to along with ‘summer of 69’ and ‘Billie Jean’!!

The complete program is simply the pieces sketched above assembled into one file. If you have any comments or concerns about the implementation, please feel free to reach me at srkhedkar@gmail.com
