Recommendation Systems


Pradeep Tomar and Gurjit Kaur

Wang et al. [11] analyze Quora's features and study the relations between the various entities in its question-and-answer system. Three graphs are analyzed: a user-topic graph, a social graph among users, and a related-questions graph. The user and social graphs capture activity, while the related-questions graph captures how questions relate to one another. Their study asks several questions, such as what role topics play, how super users drive attention towards otherwise unrelated questions, and how better questions are filtered; it finds that the views and answers of a question do play a role in its relevance. Quora is also compared with Stack Overflow: being an integrated social-networking site, Quora exhibits richer involvement graphs and patterns. A BFS-based crawler was used to extract more than 400,000 questions, 56,000 unique topics, and more than 200,000 users. In the user-topic graph, 95% of the extracted users share at least one topic, since a topic must be selected during the sign-up process. Around 27% of the users follow about 1,000 topics, and these users tend to be the intellectually engaged ones. The user-topic graph shows that the topics of a question and the views associated with it attract more users to the question. The second graph studied is the social graph between users; Quora's follower distribution fits better than Facebook's. Their study finds that people attract followers by contributing high-quality answers, clustering followees around them. Up-votes may carry a bias: a super user with more followers receives more up-votes even when the quality of the answer is not the best among the alternatives. The third graph used in the analysis is the related-questions graph, with questions as nodes and edges weighted by a similarity measure. By taking two snapshots two months apart, their study finds that question relations are stable over time, and that questions with more related questions attract more users.

Maity et al. [12] analyze and attempt to predict the popularity of the topics under which questions are categorized. A framework is designed to predict topic popularity, and Latent Dirichlet Allocation (LDA) topic modeling is used to categorize question text into topics (a minimal LDA sketch is given below). The dataset spans four years and consists of 822,040 questions across 80,253 topics, with 1,833,125 answers. Topics that are stable are found to show little variation in their questions over time. K-shell analysis is performed to determine inter-topic dynamics. Context, content, and user features are considered for learning topic popularity, and n-gram modeling is used to analyze the corpus.

Salton et al. [13] introduced TF-IDF for the first time. They model a text collection as vectors in which each term carries a frequency value, discuss term weighting in Boolean query systems, and discuss common class assignment for keyword-based clustering of documents to better represent knowledge. The observations in their study remain useful today, as TF-IDF can effectively describe the relevance of a term in a document.
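To make the topic-categorization step concrete, the following is a minimal sketch of LDA-based topic assignment for question text using the gensim library. The questions, the number of topics, and the training settings are illustrative toys, not the dataset or configuration of Maity et al. [12].

```python
# Minimal sketch: LDA topic modeling of question text with gensim.
# The questions and num_topics below are illustrative toys, not the
# paper's dataset or settings.
from gensim import corpora, models

questions = [
    "how do neural networks learn representations",
    "best way to learn guitar chords quickly",
    "why do neural networks need activation functions",
    "which guitar strings suit a beginner",
]
texts = [q.split() for q in questions]

dictionary = corpora.Dictionary(texts)            # term <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)

# Categorize each question under its most probable topic.
for q, bow in zip(questions, corpus):
    topic, prob = max(lda.get_document_topics(bow), key=lambda tp: tp[1])
    print(f"topic {topic} ({prob:.2f}): {q}")
```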
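The term-weighting idea behind TF-IDF can be shown directly. The sketch below uses the common textbook form weight(t, d) = tf(t, d) × log(N / df(t)); Salton's work discusses several weighting variants, so this is one representative formulation rather than the paper's exact scheme.

```python
# Minimal TF-IDF sketch: weight(t, d) = tf(t, d) * log(N / df(t)).
# Variants differ in normalization; this is the common textbook form.
import math
from collections import Counter

docs = [
    "information retrieval with vector models",
    "boolean query systems and term weighting",
    "term weighting for information retrieval",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter()
for toks in tokenized:
    df.update(set(toks))

def tfidf(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

for toks in tokenized:
    print(tfidf(toks))
```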
Thada et al. [14] compare the Jaccard, Dice, and cosine similarity coefficients to find the best fitness value using a genetic algorithm (the three coefficients are sketched below). Documents are retrieved from Google for each search query and matched for similarity; selection, mutation, and crossover operators are applied, with roulette-wheel selection after every generation. The ten queries used are “Anna hazare anti-corruption”, “Osama bin laden killed”, “Mouse Disney movie”, “Stock market mutual fund”, “Fiber optic technology information”, “Britney spear music mp3”, “Health medicine medical disease”, “Artificial intelligence neural network”, “Sql server music database” and “Khap panchayat honour killing”. The spelling and syntax of these queries are not grammatically correct, but they are reproduced exactly as given in the study. The crossover probability is 0.7, the mutation probability is 0.01, and the number of iterations is 150. The results indicate that cosine similarity performs better than the Dice and Jaccard similarities.

Das et al. [2] discuss the methods used in Google News recommendation: PLSI, co-visitation counts, and the MinHash algorithm. MinHash under the MapReduce model is discussed. With PLSI, user and item modeling is carried out through a joint distribution between users and items, fitted with a MapReduce expectation-maximization procedure. Co-visitation is studied as a graph pattern in which the nearest items are those co-visited by the same user (a co-visitation counting sketch follows below). The recommendation system is evaluated on live traffic.

Abel et al. [15] analyze content-based cross-domain recommendation systems built on a collective model of recommendation, characterizing tag-based profiles in social-networking systems. Their evaluation shows that tag-based user modeling is better than other methods and helps in solving the cold-start and sparsity problems. A tag-based personomy is defined for each user, comprising tags, resources, and tag assignments.

Kaminkas et al. [16] propose a location-based music recommendation method based on tags, with places of interest as the context; it is a collective-method, content-based recommendation system. Their study states that an emotion is attached to a user listening to a particular kind of music at a particular location, and the location is taken as the cross-domain source. Each tag has a usage probability for the category it falls in.

Li et al. [17] propose a collaborative filtering model for sparsity reduction; it is an adaptive-method model. User-item rating patterns are transferred into a sparse rating matrix in a target domain, using cluster-level rating patterns. Their study states that the users and the items may be non-identical or non-overlapping across domains. The transferred knowledge is termed a 'codebook': “Codebook is a (k x l) matrix which compresses the cluster-level user-item rating patterns of k user clusters and l item clusters in the original rating matrix” (a small construction sketch follows below).

Li et al. [18] propose a transfer rating-matrix generative model for cross-domain recommendation. The model is for collaborative-filtering-based recommendation and is collective-model based; a cluster-level rating matrix is used, and a user-item joint mixture model is the focus. Their study states that “the advantage of rating matrix generative model is that it can share the useful knowledge by pooling the data from multiple tasks”.
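The three coefficients compared by Thada et al. [14] are standard measures; a minimal sketch over term sets and term-frequency vectors:

```python
# Jaccard, Dice, and cosine similarity, as compared by Thada et al. [14].
import math
from collections import Counter

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

d1 = "fiber optic technology information".split()
d2 = "fiber optic network information".split()
print(jaccard(set(d1), set(d2)))          # 0.6
print(dice(set(d1), set(d2)))             # 0.75
print(cosine(Counter(d1), Counter(d2)))   # 0.75
```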
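Of the three Google News methods, co-visitation counting is the simplest to illustrate. The sketch below builds an item-item affinity table from toy click histories; the data and variable names are illustrative, not from Das et al. [2].

```python
# Minimal co-visitation sketch: two items are "near" when the same user
# visits both; counts over all users give an item-item affinity graph.
from collections import defaultdict
from itertools import combinations

# Toy click histories: user -> items visited (not data from the paper).
histories = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c"],
}

covisit = defaultdict(int)
for items in histories.values():
    for i, j in combinations(sorted(set(items)), 2):
        covisit[(i, j)] += 1

# Recommend neighbours of item "b" by co-visitation count.
neighbours = {(i, j): c for (i, j), c in covisit.items() if "b" in (i, j)}
print(sorted(neighbours.items(), key=lambda kv: -kv[1]))
```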
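Following the quoted definition, a codebook can be illustrated as cluster-level averaging over a source rating matrix. The sketch below uses fixed toy cluster assignments in place of the paper's co-clustering procedure, so it shows only the shape of the idea.

```python
# Codebook sketch per the quoted definition: a (k x l) matrix of
# cluster-level rating patterns. Real codebook transfer co-clusters the
# source matrix; here the cluster assignments are fixed toys.
import numpy as np

X = np.array([[5, 4, 1, 1],     # source-domain user-item ratings (toy)
              [4, 5, 2, 1],
              [1, 1, 5, 4],
              [2, 1, 4, 5]], dtype=float)

user_cluster = np.array([0, 0, 1, 1])   # k = 2 user clusters (toy)
item_cluster = np.array([0, 0, 1, 1])   # l = 2 item clusters (toy)
k, l = 2, 2

B = np.zeros((k, l))                    # the codebook
for u in range(k):
    for v in range(l):
        block = X[np.ix_(user_cluster == u, item_cluster == v)]
        B[u, v] = block.mean()          # average rating pattern per block

print(B)   # [[4.5, 1.25], [1.25, 4.5]]
```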
Pal et al. [19] carried out a temporal study of experts in question-and-answer communities. Unsupervised machine learning methods are used to study the behavioral patterns that distinguish one expert from another. A Stack Overflow dataset is used, with the point-wise ratio of the best-answer time series to the answer time series serving as the temporal signal. The results show that experts' probability of answering questions increases with time, in contrast to ordinary users. The temporal analysis also shows that as an expert gains reputation, ordinary users acknowledge her, and that leads to less participation by the ordinary users.

Abe et al. [20] study reinforcement learning under two conditions, considering a linear function that relates a feature vector to every action. Two cases are studied: one in which the unknown linear function is applied and a continuous-valued reward is given, and another in which the probability of obtaining the larger binary-valued reward is obtained.

Auer et al. [21] show that the optimal logarithmic regret is achievable uniformly over time. The UCB1 policy is discussed, and it is shown through several theorems that UCB1 achieves logarithmic regret uniformly (a minimal UCB1 sketch is given below).

The MinHash technique was first introduced by Broder et al. [22] in 1997. In their study, a min-hash function is described as a function generated from random permutations, and shingling is discussed for representing documents. Two document sets resemble each other if the size of their intersection divided by the size of their union is nearly equal to 1. Three important theorems are given regarding min-wise independent hash families. Theorem 1 states that "for a min-wise independent family F, |F| is at least as large as the LCM of 1, 2, ..., n, and hence |F| = e^(n−o(n))". Theorem 2 states that "there exists a min-wise independent family F of size less than 4^n". Theorem 3 states that "there is a family F of size at most n·2^(n−1), such that F with an associated probability distribution μ is min-wise independent". These three theorems establish remarkably useful properties of min-hash functions.

Andrei Z. Broder [23] argues that it is not necessary to compare whole documents for resemblance: random sampling can be used to compare the relative size of the intersection of documents. Documents are broken into "shingles", and compact "sketches" of the shingle sets represent each document. His study states that clustering m documents into closely resembling groups can be done in time proportional to m log m. The Jaccard similarity can be estimated from the min-hash values of random permutations of the sets being compared (a MinHash sketch follows below).

Andrei Z. Broder [24] discusses comparing two documents for resemblance and containment, meaning "roughly the same" and "roughly contained" respectively. The resemblance function returns a value in [0, 1], where 1 means total similarity and 0 means no similarity; the containment function, whose value is also in the range [0, 1], describes whether one document is contained in another. Fingerprint comparison can also be performed, with the documents broken into shingles for comparison.
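UCB1 itself is fully specified by Auer et al. [21]: after playing each arm once, always play the arm maximizing the empirical mean plus sqrt(2 ln n / n_i). A minimal sketch with toy Bernoulli arms:

```python
# UCB1 (Auer et al. [21]): play each arm once, then pick the arm
# maximizing  mean_i + sqrt(2 * ln(n) / n_i).
import math
import random

random.seed(0)
true_means = [0.3, 0.5, 0.7]          # toy Bernoulli arms (illustrative)
counts = [0] * len(true_means)        # n_i: number of plays of arm i
sums = [0.0] * len(true_means)        # cumulative reward of arm i

def pull(i):
    return 1.0 if random.random() < true_means[i] else 0.0

for i in range(len(true_means)):      # initialization: play each arm once
    sums[i] += pull(i); counts[i] += 1

for n in range(len(true_means) + 1, 10001):
    ucb = [sums[i] / counts[i] + math.sqrt(2 * math.log(n) / counts[i])
           for i in range(len(true_means))]
    i = max(range(len(true_means)), key=lambda j: ucb[j])
    sums[i] += pull(i); counts[i] += 1

print(counts)   # the best arm (mean 0.7) dominates the plays
```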
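The estimation property behind MinHash is easy to demonstrate: under a random permutation, the probability that two sets share the same minimum equals their Jaccard similarity, so the fraction of matching minima across k hash functions estimates it. In the sketch below, salted MD5 hashing stands in for random permutations, and the documents are toys.

```python
# MinHash sketch (Broder [22], [23]): matching minima across k salted
# hash functions estimate the Jaccard similarity of two shingle sets.
import hashlib

def minhash_signature(s, k=128):
    def h(x, salt):
        return hashlib.md5(f"{salt}:{x}".encode()).hexdigest()
    return [min(h(x, salt) for x in s) for salt in range(k)]

def estimate_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Shingle two small "documents" into 2-grams of words.
def shingles(text, w=2):
    toks = text.split()
    return {" ".join(toks[i:i + w]) for i in range(len(toks) - w + 1)}

A = shingles("the quick brown fox jumps over the lazy dog")
B = shingles("the quick brown fox leaps over the lazy dog")
exact = len(A & B) / len(A | B)
print(exact, estimate_jaccard(minhash_signature(A), minhash_signature(B)))
```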
Li et al. [25] discuss the offline evaluation of news-article recommendation using contextual bandit algorithms, which take the context of a news item into account. The approach is data-driven rather than simulator-driven, and a replay methodology is used (a replay-evaluation sketch is given below). Treating recommendation as an exploration-exploitation problem helps in surfacing good-quality news. A feature vector, known as a context vector, is associated with every arm; exploration is used to choose sub-optimal arms so as to gather more information about them.

Son et al. [1] discuss news-article recommendation using location. Explicit localized semantic analysis is used for the topic representation of documents. Geographical topic modeling associates topics with geographic locations, using geographic topics and regions as the two latent variables. A score function measures the appropriateness of a news article to a particular location, and the topic space is defined using Wikipedia concepts.

Chou et al. [26] propose an approach in which, during the exploitation phase, the news article with the highest score for each user visit is recommended, while during the exploration phase, articles with a high reward possibility are explored using uncertainty measures. The work was done for the ICML 2012 Exploration and Exploitation Challenge; National Taiwan University won the first phase of the challenge, and the paper summarizes their solution. In their study, a scoring model estimates the click-through rate of an article gathered over time and also provides the system's uncertainty estimate.
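The replay methodology admits a simple sketch: stream logged (context, displayed arm, reward) events, and whenever the candidate policy chooses the same arm the logging policy displayed, count that event towards the estimate. The log and the policy below are toys; unbiasedness requires the logged arms to have been chosen uniformly at random, as in Li et al. [25].

```python
# Replay evaluation sketch (Li et al. [25]): evaluate a policy offline on
# logged bandit data by keeping only events where the policy's choice
# matches the arm actually displayed. Valid when logging was uniform-random.
import random

random.seed(1)
ARMS = ["a", "b", "c"]

# Toy uniformly-logged events: (context, displayed_arm, click_reward).
log = [(random.random(), random.choice(ARMS), random.randint(0, 1))
       for _ in range(10000)]

def policy(context):
    # Toy deterministic policy standing in for a contextual bandit.
    return ARMS[int(context * len(ARMS))]

matched, reward = 0, 0
for context, displayed, r in log:
    if policy(context) == displayed:   # replay: keep matching events only
        matched += 1
        reward += r

print("estimated CTR:", reward / matched, "on", matched, "matched events")
```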
