library(tidyverse)
# Load 50 sample tweets
example_tweets <- read_csv("tweet-examples.csv") %>% pull(tweet50)
# Simulate a corpus of 250,000 tweets
corpus <- rep(example_tweets, 5000)
# Create the vector of ten words to search for
keywords <- c("money", "beer", "friends", "fhdjasl", "dfjha", "dshjfsa", "fdjh", "vjhd", "cvmna", "dfhda")
# Score using grepl approach ----
# We only want to match on distinct words, so we use the word boundaries approach and paste all of the words together with |
score_grepl <- function(texts, keywords) {
  sapply(texts, function(x) grepl(sprintf("\\b(%s)\\b", paste0(keywords, collapse = "|")), x), USE.NAMES = FALSE)
}
system.time(score_grepl(corpus, keywords[1])) # About 2.85s
system.time(score_grepl(corpus, keywords)) # About 9.5s
system.time(score_grepl(corpus, rep(keywords, 10))) # About 120s
# Score with tokenizer approach using tokenizers package ----
library(tokenizers)
# Note that this is a bit more sophisticated than just using grepl, since we ask it to strip out punctuation and URLs.
score_tokenizers <- function(texts, keywords) {
  tkns <- tokenize_tweets(texts, strip_punct = TRUE, strip_url = TRUE)
  sapply(tkns, function(x) any(keywords %in% x), USE.NAMES = FALSE)
}
system.time(score_tokenizers(corpus, keywords[1])) # About 5s
system.time(score_tokenizers(corpus, keywords)) # About 5s
system.time(score_tokenizers(corpus, rep(keywords, 10))) # About 6s
In my text analysis work, I frequently score texts for the presence or absence of various "keywords". Because I work with some large corpora (collections of texts), for example the billions of tweets in my job market paper, this can be a time-consuming task. I have previously done most of this in Python, but right now I'm also interested in doing it quickly in R for ad hoc analyses.
In this post I test two different methods for detecting words: a simpler `grepl`-based approach that uses a regex search to identify any texts with at least one of the matching words, and a slightly more involved approach that first tokenizes the texts (i.e., breaks up sentences into their component word "tokens") and then searches through those tokens to see if any match the given keywords.
What did I learn here? If I'm only interested in detecting a small number of keywords, a simple `grepl`-based approach is fine. But if I want to search for more than 5-10 keywords, tokenizing first is better: its run time barely changes as keywords are added, while the `grepl` run time grows quickly. (I also test a tokenizing approach that uses the `quanteda` package, but do not see any notable differences in performance.) Note that this would also be true for creating multiple scores from the same set of texts.
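The point about multiple scores can be sketched as follows. This is a minimal illustration, not the code benchmarked in this post: it assumes the tokens can be computed once and reused, and it uses base R's `strsplit` on lowercased text as a crude stand-in for a proper tokenizer. The names `texts`, `keyword_sets`, and `score_tokens` are hypothetical.

```r
# Hypothetical toy texts; a real corpus would be much larger
texts <- c("Beer with friends tonight", "I need more money", "Nothing to see here")

# Tokenize once (stand-in tokenizer: lowercase, then split on non-alphanumeric characters)
tkns <- strsplit(tolower(texts), "[^a-z0-9]+")

# Hypothetical keyword sets, one per score
keyword_sets <- list(
  social  = c("beer", "friends"),
  finance = c("money")
)

# TRUE for each text that contains at least one keyword from the set
score_tokens <- function(tkns, keywords) {
  vapply(tkns, function(x) any(keywords %in% x), logical(1))
}

# Reuse the same tokens for every keyword set; one column per score
scores <- sapply(keyword_sets, score_tokens, tkns = tkns)
scores
```

Because the tokenizing step runs only once, each additional score costs only a set-membership check per text, which is plausibly why the tokenizing approach scales well here.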