Number of Businesses: 61184
Number of Users: 366715
Number of Reviews: 1569264
Tokenize Reviews → Remove stop words → POS Tagging → Lemmatize → Train LDA → Label latent topics
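The first two stages of this pipeline can be sketched with the standard library alone. This is a minimal illustration rather than the project's actual code: the `tokenize` and `remove_stop_words` helpers and the tiny `STOP_WORDS` set are assumptions made for the example, and a real run would more likely rely on a full stop-word list.

```python
import re

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"i", "this", "and", "here", "at", "a", "the", "their",
              "are", "you", "so", "is", "to", "of"}

def tokenize(text):
    """Lowercase the text and return its word tokens (letters, optional apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]

review = "I love this place and eat here at least once a week."
tokens = remove_stop_words(tokenize(review))
print(tokens)  # ['love', 'place', 'eat', 'least', 'once', 'week']
```

The remaining tokens are what the later stages (POS tagging, lemmatization, LDA) would consume.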
The process of classifying words into their parts of speech based on both their definitions and their context.
[('Few', 'JJ'), ('things', 'NNS'), ('harder', 'JJR'), ('put', 'VB'), ('up', 'RP'), ('than', 'IN'), ('good', 'JJ'), ('example', 'NN')]
"Process of grouping the different inflected forms of a word so that they can be analyzed as a single word"
wordnet_lemmatizer.lemmatize('aardwolves')  # u'aardwolf'
wordnet_lemmatizer.lemmatize('is', pos='v')  # u'be'
wordnet_lemmatizer.lemmatize('are', pos='v')  # u'be'
"...is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar."
"I love this place and eat here at least once a week. Their pad si yew, pad Thai, dumplings, Siam rolls, and their curries are their best dishes. Especially the curry. You can't go wrong at Thai moon, so I suggest you go and enjoy the food!"
Model prediction:
{'scores': u'0.060*thai + 0.036*food + 0.032*curry + 0.028*lunch + 0.027*spicy + 0.023*place + 0.021*tea + 0.019*dish + 0.017*chicken + 0.015*time'}
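The predicted topic above is a sum of `weight*word` terms packed into one string. A small sketch of splitting it into (word, weight) pairs for further processing follows; `parse_topic` is a hypothetical helper written for this example, not part of any library.

```python
def parse_topic(topic_str):
    """Split '0.060*thai + 0.036*food' into [('thai', 0.06), ('food', 0.036)]."""
    pairs = []
    for term in topic_str.split("+"):
        weight, word = term.strip().split("*", 1)
        pairs.append((word.strip(), float(weight)))
    return pairs

topic = ("0.060*thai + 0.036*food + 0.032*curry + 0.028*lunch + 0.027*spicy + "
         "0.023*place + 0.021*tea + 0.019*dish + 0.017*chicken + 0.015*time")
terms = parse_topic(topic)
top_words = [word for word, _ in terms[:3]]
print(top_words)  # ['thai', 'food', 'curry']
```

The highest-weighted words are the ones a person would read when assigning a label to the latent topic.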
Neo4j with "py2neo". Example:
from py2neo import Node, Relationship, Graph
graph = Graph()
# jsonData (a parsed user record) and topicKey are assumed to be defined earlier
userNode = Node('User', name='name', id=jsonData['user_id'])
topicNode = Node('topics', key=topicKey)
likes = Relationship(userNode, 'topiclike', topicNode)
graph.create(likes)
query = ('match (u1 {id:"UZyNf0h26jNVvpHT-GF2oA"})'
' -[rel:topiclike]-> (r) <-[re:topiclike]- (u2) '
'with count(r) as topiccounts, u2 as user2 '
'where topiccounts > 2 return user2, topiccounts'
' order by topiccounts desc limit 10')
result = graph.cypher.execute(query)
rank | userid | topiccounts |
---|---|---|
1 | kGgAARL2UmvCcTRfiscjug | 20 |
2 | 4ozupHULqGyO42s3zNUzOQ | 19 |
3 | DrWLhrK8WMZf7Jb-Oqc7ww | 16 |
4 | 90a6z--_CUrl84aCzZyPsg | 15 |
5 | 4G68oLRY3aHE5XUt_MUUcA | 12 |
6 | iwUN95LIaEr75TZE_JC6bg | 11 |
7 | pEVf8GRshP9HUkSpizc9LA | 10 |
8 | Ovpa3S8xD96dLE5eDxcxJg | 10 |
9 | WmAyExqSWoiYZ5XEqpk_Uw | 9 |
10 | lmiDCrmas8TxRsbIGZX9Pg | 9 |
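The Cypher query above ranks other users by how many liked topics they share with one given user. The same logic can be sketched in plain Python over an in-memory mapping. The `likes` data and the `similar_users` helper are made up for illustration and do not come from the dataset.

```python
def similar_users(likes, user_id, min_shared=2, limit=10):
    """Rank other users by the number of liked topics shared with user_id.

    Keeps only users sharing more than min_shared topics, mirroring
    'where topiccounts > 2 ... order by topiccounts desc limit 10'.
    """
    mine = likes[user_id]
    counts = []
    for other, topics in likes.items():
        if other == user_id:
            continue
        shared = len(mine & topics)
        if shared > min_shared:
            counts.append((other, shared))
    counts.sort(key=lambda pair: pair[1], reverse=True)
    return counts[:limit]

# Hypothetical toy data: user id -> set of liked topic keys
likes = {
    "u1": {1, 2, 3, 4, 5},
    "u2": {1, 2, 3, 4},
    "u3": {1, 2, 3},
    "u4": {1, 2},
    "u5": {7, 8, 9},
}
ranked = similar_users(likes, "u1")
print(ranked)  # [('u2', 4), ('u3', 3)]
```

Unlike this sketch, the graph query does not need to scan every user: it only follows `topiclike` relationships outward from the starting node.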