Dissecting Trump’s Most Rabid Online Following, by Trevor Martin | FiveThirtyEight
Comparing subreddits, with Latent Semantic Analysis in R
The article looks at various popular and notorious subreddits and finds those most similar to the main subreddit devoted to Donald Trump, as well as to the subreddits of the other main contenders in the 2016 presidential campaign, Hillary Clinton and Bernie Sanders.
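The article's method reduces a large user/subreddit co-occurrence matrix with LSA and then compares subreddits by vector similarity. As a minimal sketch of that final comparison step (the subreddit names are real, but the commenter-count vectors are invented toy data):

```python
import math

# Toy vectors: for each subreddit, how often a small hypothetical set of
# users commented there. The article builds these from a huge co-occurrence
# matrix reduced with LSA; here we only illustrate the similarity step.
subreddit_vectors = {
    "The_Donald": [5, 0, 3, 1],
    "hillaryclinton": [0, 4, 1, 2],
    "SandersForPresident": [0, 3, 2, 3],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Rank every other subreddit by similarity to the query subreddit.
query = subreddit_vectors["The_Donald"]
ranked = sorted(
    ((name, cosine(query, vec))
     for name, vec in subreddit_vectors.items() if name != "The_Donald"),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The same ranking logic applies whether the vectors come from raw counts or from an LSA-reduced matrix; LSA just makes the comparison less sparse and less noisy.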
New Gensim feature : Author-topic modeling. LDA with metadata. | RaRe Technologies
A new extension for #gensim that could be very useful for #SPIP-style corpora: once topics have been modeled with #LDA, they can be associated not only with articles but also with tags (keywords, authors), which makes it possible to find out which authors are close to one another, which themes are similar, and so on.
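Gensim's author-topic model learns author-topic distributions jointly during training; a rough stdlib-only approximation of the idea is to average each author's per-article topic weights (the articles, authors, and topic distributions below are all hypothetical):

```python
from collections import defaultdict

# Hypothetical per-article topic distributions (as an LDA model would
# produce) plus article -> authors metadata, as in a SPIP-style corpus.
doc_topics = {
    "article1": [0.7, 0.2, 0.1],
    "article2": [0.1, 0.8, 0.1],
    "article3": [0.6, 0.3, 0.1],
}
doc_authors = {
    "article1": ["alice"],
    "article2": ["bob"],
    "article3": ["alice", "bob"],
}

# Collect the topic rows belonging to each author.
author_rows = defaultdict(list)
for doc, authors in doc_authors.items():
    for author in authors:
        author_rows[author].append(doc_topics[doc])

# Average topic weights per author to get an author-topic profile.
author_profile = {
    author: [sum(col) / len(rows) for col in zip(*rows)]
    for author, rows in author_rows.items()
}
for author, profile in sorted(author_profile.items()):
    print(author, [round(w, 2) for w in profile])
```

Comparing these author profiles (e.g. by cosine similarity) is what lets you ask "which authors are close to one another" as the note describes; gensim's actual `AuthorTopicModel` does this more rigorously inside the inference itself.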
Douwe Osinga’s Blog : Building Spotify’s Song Radio in 100 lines of Python
the Python library Gensim contains a great implementation of Word2Vec. So if we feed it playlists containing song ids, rather than sentences containing words, it will after a while learn relationships between songs. Suggesting a playlist based on a song then again becomes a straightforward nearest-neighbor search.
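Once Word2Vec has been trained on playlists (treating each playlist as a "sentence" of song ids), the song-radio step is just a nearest-neighbor search over the learned vectors. A stdlib sketch of that search, over invented stand-in embeddings:

```python
import math

# Hypothetical 3-d song embeddings standing in for what Word2Vec would
# learn from playlists of song ids; real vectors would be much larger.
song_vectors = {
    "song_a": [0.9, 0.1, 0.0],
    "song_b": [0.8, 0.2, 0.1],
    "song_c": [0.0, 0.9, 0.4],
}

def most_similar(seed, vectors, topn=2):
    """Return the topn songs closest to `seed` by cosine similarity."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = (math.sqrt(sum(a * a for a in u))
                * math.sqrt(sum(b * b for b in v)))
        return dot / norm if norm else 0.0

    query = vectors[seed]
    scored = [(sid, cosine(query, vec))
              for sid, vec in vectors.items() if sid != seed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# "Song radio": seed with one song, play its nearest neighbors.
playlist = most_similar("song_a", song_vectors)
print(playlist)
```

With Gensim itself this whole function is replaced by `model.wv.most_similar("song_a")`, which is why the blog post fits in about 100 lines.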
Applying Data Science to the Supreme Court: Topic Modeling Over Time with #NMF (and a #D3.js bonus) — Emily Barry
LDA was the obvious choice to do first, as is evident when you google “#topic_modeling algorithm.” (...)
Then I read about Non-negative Matrix Factorization (NMF) and found that in use cases similar to mine its robustness far surpassed LDA's. NMF extracts latent features via matrix decomposition, and you can use TF-IDF weighting, which is a huge plus.
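The "huge plus" is that NMF can factorize a real-valued TF-IDF matrix, whereas standard LDA expects raw term counts. A minimal stdlib sketch of the TF-IDF weighting that would feed the NMF step (the toy documents are invented; the usual pipeline would be scikit-learn's `TfidfVectorizer` followed by `NMF`):

```python
import math
from collections import Counter

# Toy tokenized documents standing in for Supreme Court opinions.
docs = [
    "commerce clause interstate commerce".split(),
    "free speech first amendment".split(),
    "commerce regulation amendment".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Term frequency x inverse document frequency for one document."""
    tf = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }

weights = [tfidf(doc) for doc in docs]
for row in weights:
    print({term: round(w, 3) for term, w in sorted(row.items())})
```

Note how a term appearing in every document gets weight zero, and rare terms are boosted relative to common ones; it is exactly this down-weighting of ubiquitous words that tends to give NMF cleaner topics than count-based input.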
Big Social Data Analytics in Journalism and Mass Communication
the present study is the first attempt to validate the efficacy of the LDA model in the context of journalism and mass communication research. Considering its decent performance, future research should consider using this method to analyze mass communication text, especially to process large-scale social media data. For example, when communication scholars have a big dataset but are unsure of the topics or attributes inside it, our results suggest that LDA-based analysis will be more effective than using the most frequently used words to devise topic lists.
After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.