tweetfinder: Find tweets embedded and mentioned in news articles online¶
Package on pypi: https://pypi.org/project/tweetfinder/
Code: https://github.com/dataculturegroup/Tweet-Finder
Documentation: https://tweet-finder.readthedocs.io
A small Python library for finding Tweets embedded in online news articles, and mentions of Tweets. We wrote this because we suspected that current research approaches were significantly under-counting the number of Tweets embedded in online news stories. Our initial evaluation confirms this.
Quickstart¶
Install with pip: pip install tweetfinder
from tweetfinder import Article
my_article = Article(url="http://my.news/article") # this will load and parse the article
# you can discover all the tweets that are embedded in the HTML
num_embedded = my_article.count_embedded_tweets()
tweets_embedded = my_article.list_embedded_tweets() # metadata about tweets that are embedded
# you can also discover any mentions of twitter (in English), like "tweeted that" or "in a retweet"
num_mentions = my_article.count_mentioned_tweets()
tweet_mentions = my_article.list_mentioned_tweets() # list of text snippets that mention a tweet
Motivation¶
Why are embedded tweets being undercounted? Two main reasons:
Not everyone embeds tweets following the blockquote guidelines from Twitter (https://help.twitter.com/en/using-twitter/how-to-embed-a-tweet)
Many news websites render their content via JavaScript rather than raw HTML, so unless you load the page in a browser and execute the JavaScript, you won't see the embedded tweets in the page source
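To make the two reasons above concrete, here is a minimal sketch of a counter that only recognizes Twitter's standard blockquote markup in raw HTML. It is not how tweetfinder works internally; the class name BlockquoteTweetFinder, the helper find_standard_embeds, and the sample HTML strings are all hypothetical and exist only for illustration.

```python
import re
from html.parser import HTMLParser

# Status URLs that Twitter's standard embed markup links to.
STATUS_URL = re.compile(r"twitter\.com/(\w+)/status/(\d+)")


class BlockquoteTweetFinder(HTMLParser):
    """Collect tweet ids linked inside <blockquote class="twitter-tweet">."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # > 0 while inside a twitter-tweet blockquote
        self.tweet_ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "blockquote":
            classes = (attrs.get("class") or "").split()
            if self._depth or "twitter-tweet" in classes:
                self._depth += 1
        elif tag == "a" and self._depth:
            match = STATUS_URL.search(attrs.get("href") or "")
            if match:
                self.tweet_ids.append(match.group(2))

    def handle_endtag(self, tag):
        if tag == "blockquote" and self._depth:
            self._depth -= 1


def find_standard_embeds(html):
    """Return tweet ids found via the standard blockquote markup only."""
    finder = BlockquoteTweetFinder()
    finder.feed(html)
    return finder.tweet_ids


# Hypothetical page fragments (not taken from any real article).
standard = (
    '<blockquote class="twitter-tweet"><p>Hello</p>'
    '<a href="https://twitter.com/example/status/1234567890">March 1</a>'
    "</blockquote>"
)
nonstandard = (
    '<div class="tweet-card">'
    '<a href="https://twitter.com/example/status/1234567890">a tweet</a></div>'
)
js_rendered = '<div id="story"></div><script src="/render.js"></script>'

print(find_standard_embeds(standard))     # found
print(find_standard_embeds(nonstandard))  # missed: no blockquote markup
print(find_standard_embeds(js_rendered))  # missed: content not in raw HTML
```

A counter like this finds the first fragment but misses the other two, which is the kind of gap described above.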
Some of our initial numbers behind this:
Out of 1000 stories that mentioned twitter, our library found 640 embedded tweets in raw HTML
Goose3, which is what current papers seem to use, found 518 in the same set of stories (i.e. it missed about 20%)
With support for processing JavaScript-based embeds added, we found 859 embedded tweets; that is about 35% more than traditional raw HTML-based counting approaches find
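The percentages can be checked with a few lines of arithmetic. This sketch assumes both percentages are measured against the 640 tweets our library found in raw HTML; the text does not state the baseline explicitly, so treat that as an assumption. The variable names are ours.

```python
raw_html_found = 640  # embedded tweets our library found in raw HTML
goose3_found = 518    # embedded tweets Goose3 found in the same stories
with_js_found = 859   # embedded tweets found once JavaScript embeds are processed

# Share of the raw-HTML embeds that Goose3 did not find.
goose3_missed = (raw_html_found - goose3_found) / raw_html_found

# Relative gain from processing JavaScript-based embeds (assumed baseline: 640).
js_gain = (with_js_found - raw_html_found) / raw_html_found

print(f"Goose3 missed {goose3_missed:.0%} of the raw-HTML embeds")  # 19%
print(f"JavaScript support found {js_gain:.0%} more embeds")        # 34%
```

Under that assumption the figures work out to roughly 19% and 34%, in line with the rounded "about 20%" and "35%" quoted above.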
These to-be-published results confirm our suspicion: most large quantitative news projects are under-counting embedded Tweets by around 35% or more. This library is our attempt to help fix that.
Why does that matter? Understanding how Twitter (and other platforms) is used in news media is critical for building a better map of how the media ecosystem functions. News shapes how we see the world; studying the architectures of information flows around us is critical for preventing the spread of hate speech and misinformation, and for supporting newsrooms and democracy.
API¶
When you create an Article, the HTML is downloaded (if needed) and parsed immediately to find any mentions of twitter and any embedded tweets. There are a number of methods to return the information found:
my_article.embeds_tweets()¶
Return True or False depending on whether there are any tweets embedded in the article.
my_article.count_embedded_tweets()¶
Return the number of tweets embedded in the article.
my_article.list_embedded_tweets()¶
Return a list of dicts with information about the tweets found. The properties in each dict depend on how we found the tweet. It could look like this:
[{
'tweet_id': '//twitter.com/sliccard',
'html_source': 'blockquote url fallback',
'username': '',
'full_url': 'https://twitter.com/sliccardo',
}]
Properties:
tweet_id: the unique id of the tweet, can be used in concert with Twitter’s API to pull more metadata (always included)
html_source: a string indicating which method the tweet was found with (always included)
full_url: the complete URL to the tweet on Twitter (sometimes included)
username: the twitter username of the author of the tweet, including the “@” (sometimes included)
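Because only tweet_id and html_source are always included, code that consumes this list should treat the other properties as optional. The sketch below shows one way to do that. The helper summarize_embeds and the sample data are hypothetical (including the second html_source value), and the sample list stands in for the result of my_article.list_embedded_tweets() so the snippet runs without a network call.

```python
def summarize_embeds(tweets):
    """Return one readable line per embedded-tweet dict."""
    lines = []
    for tweet in tweets:
        # Documented as always included.
        tweet_id = tweet["tweet_id"]
        source = tweet["html_source"]
        # Documented as only sometimes included; may also be an empty string.
        username = tweet.get("username") or "(unknown user)"
        url = tweet.get("full_url") or "(no url)"
        lines.append(f"{tweet_id} by {username} via {source}: {url}")
    return lines


# Hypothetical stand-in for my_article.list_embedded_tweets()
sample = [
    {
        "tweet_id": "123",
        "html_source": "blockquote url fallback",
        "username": "",
        "full_url": "https://twitter.com/example",
    },
    {"tweet_id": "456", "html_source": "some other method"},
]

for line in summarize_embeds(sample):
    print(line)
```

Using dict.get with a fallback avoids a KeyError when a property is missing and also covers the empty-string case shown in the sample output above.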
my_article.mentions_tweets()¶
Return True or False depending on whether there are any mentions of tweets in the article.
my_article.count_mentioned_tweets()¶
Return the number of mentions of tweets in the article.
my_article.list_mentioned_tweets()¶
Return a list of dicts with information about each mention of a tweet. It will look like this:
[{
'phrase': 'tweeted',
'context': 'in March last year. He decided to comfort himself by bingeing on a favourite TV show. “I randomly tweeted something about putting on the first episode of a TV series. I’m slightly afraid to say that it was',
'content_start_index': '670',
}]
Properties:
phrase: the phrase matched as a mention of twitter
context: a window of characters around the phrase to help you understand where it occurred
content_start_index: the index into my_article.get_content() you can use to find the match
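As a rough illustration of how content_start_index might be used, the sketch below slices the matched phrase back out of the article text. It assumes the index marks where the phrase begins in the text returned by my_article.get_content(), and it converts the value with int() because the sample output above shows it as a string. The helper locate_mention, its window parameter, and the sample text are hypothetical.

```python
def locate_mention(content, mention, window=20):
    """Slice the matched phrase and its surroundings out of the content."""
    # The sample output shows the index as a string, so convert defensively.
    start = int(mention["content_start_index"])
    end = start + len(mention["phrase"])
    return {
        "matched_text": content[start:end],
        "before": content[max(0, start - window):start],
        "after": content[end:end + window],
    }


# Hypothetical stand-ins for my_article.get_content() and one mention dict
content = "The senator tweeted that the bill would pass."
mention = {
    "phrase": "tweeted",
    "context": content,
    "content_start_index": str(content.index("tweeted")),
}

located = locate_mention(content, mention)
print(located["matched_text"])
```

If matched_text does not equal the phrase for your articles, the index assumption above does not hold and the context property is the safer thing to rely on.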
Development¶
If you want to work on this module, clone the repo and install dependencies: make requirements-dev
Distribution¶
Run make test to make sure all the tests pass
Update the version number in tweetfinder/__init__.py
Make a brief note in the version history section below about the changes
Run make sphinx-docs to update the documentation
Run make build-release to create an install package
Run make release-test to upload it to PyPI’s test platform
Run make release to upload it to PyPI
Version History¶
v1.0.1: fix packaging to include data files required
v1.0.0: added documentation and evaluation scripts
v0.2.1: fix case-related bug in finding mentions
v0.2.0: better documentation
v0.1.0: initial release for testing
Credits¶
This library is part of the Media Cloud project, and is supported by the Co-Lab for Data Impact and the Data Culture Group at Northeastern University.
Maintainers:¶
Rahul Bhargava
Dina Zemlyanker