Open source intelligence. Twitter scraping using Twint

Enrique Bonsón
David Perea
2
Agenda
 
Open Source Intelligence (OSINT)
Open Source Tools (GITHUB)
Twint
Case study
3
Open Source Intelligence (OSINT)
 
Open source intelligence (OSINT) is the discipline that pertains to intelligence
produced from publicly available information that is collected, exploited, and
disseminated in a timely manner to an appropriate audience for the purpose of
addressing a specific intelligence requirement.
OSINT is derived from the systematic collection, processing, and analysis of publicly
available, relevant information in response to intelligence requirements.
U.S. Army (2010). FM 2-0 intelligence
Publicly available information on the Internet can be collected and processed with
open source tools that gather data from hundreds of sites in minutes, which eases
the collection phase.
4
Open Source Tools (GITHUB)

GitHub Inc. is a web-based hosting service for version control using Git. It is mostly
used for computer code.
Git is a version-control system for tracking changes in computer files and coordinating
work on those files among multiple people.
GitHub makes the relationships between users, and between users and work artifacts,
transparent. This transparency enables developers to better use information such as
technical value and social connections when making work decisions.
Tsay, J., Dabbish, L., Herbsleb, J. (2014). Influence of social and technical factors for evaluating contribution in GitHub. Proceedings of the 36th International Conference on Software Engineering.
5
Open Source Tools (GITHUB)

Why GitHub?
96 million public repositories.
GitHub's users create and maintain influential technologies alongside the world's
largest open source community.
31 million developers.
Developers use GitHub for personal projects, from experimenting with new
programming languages to hosting their life's work.
2.1 million businesses and organizations.
Businesses of all sizes use GitHub to support their development process and to securely
build software.
GitHub (2018). Official website. https://github.com/
6
Open Source Tools (GITHUB)

The importance of GitHub repositories is based on the social relationship between
users and repositories. Two elements of a GitHub repository help identify useful
software: stars and forks.
Hu, Y., Zhang, J., Bai, X., Yu, S., Yang, Z. (2016). Influence analysis of Github repositories. SpringerPlus.
Creating a "fork" is producing a personal copy of someone else's project. Forks act as a sort of
bridge between the original repository and your personal copy. You can submit pull
requests to help make other people's projects better by offering your changes up to
the original project. Forking is at the core of social coding at GitHub.
GitHub (2017). Forking Projects · GitHub Guides
You can "star" repositories to keep track of projects you find interesting and discover similar
projects in your news feed.
GitHub (2018). About stars · GitHub Help
7
Twint - Twitter Intelligence Tool

Twint is an advanced Twitter scraping tool written in Python that allows for scraping
Tweets from Twitter profiles without using Twitter's API.
Twint utilizes Twitter's search operators to let you scrape Tweets from specific users,
scrape Tweets relating to certain topics, hashtags & trends, or sort out sensitive
information from Tweets like e-mail and phone numbers.
Twint also makes special queries to Twitter allowing you to also scrape a Twitter user's
followers, Tweets a user has liked, and who they follow.
Poldi, F., Zacharias, C. (2018). TWINT - Twitter Intelligence Tool. GitHub project.
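To illustrate what sorting out e-mail addresses and phone numbers from Tweet text can look like, here is a minimal Python sketch. It is not Twint's own implementation: the two regular expressions and the function name find_contacts are assumptions chosen for illustration, and real-world phone formats vary far more than this pattern covers.

```python
import re

# Deliberately simple patterns (assumptions for illustration, not Twint's logic).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# An optional "+", then at least nine digits that may be separated by
# spaces, dots, hyphens or parentheses.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_contacts(tweet):
    """Return the e-mail addresses and phone-like strings found in a tweet."""
    return {
        "emails": EMAIL_RE.findall(tweet),
        "phones": [match.strip() for match in PHONE_RE.findall(tweet)],
    }


if __name__ == "__main__":
    sample = "Contact us at info@example.org or call +34 955 010 010"
    print(find_contacts(sample))
```

A pattern this loose will also match other long digit sequences, so any real use would need validation of the candidates it returns.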
8
Twint - Twitter Intelligence Tool
 
9
Twint - Twitter Intelligence Tool
 
10
A case study. Twitter in Andalusian municipalities
 
This study provides a general overview of the way local governments use
Twitter as a communication tool to engage with their citizens.
A sample of the 29 most populated Andalusian local governments is examined.
The results show no significant relationship between the population
of a municipality and its Twitter activity, and a significant negative
relationship between activity, audience and engagement.
The findings of the study also show that particular media and content types
generate higher engagement.
11
A case study. Methodology (I)
 
Sampling
The tweets were scraped at the beginning of June 2018 using Twint. A total of 345,960
tweets were obtained and later analyzed.
They represent more than 85% of all tweets published by the studied
municipalities from the date they joined Twitter until 31 May 2018.
The remaining tweets, about 15%, are retweets made by the analyzed accounts. They
could not be collected because, for the time being, Twint is not able to scrape retweets.
12
A case study. Methodology (II)
 
Coding
The different content types that a tweet can contain and the operative categories for
the content of the tweets were identified and defined.
The media type was identified according to the possible media that can be added in
the tweets.
The identification of content type was based on the lists of local services prepared by
Torres and Pina (2001) and later adapted by several authors (Bonsón et al., 2015; Martí
et al., 2012). The list was then further refined according to the words that appear in the
tweets published by the local governments.
13
A case study. Methodology (III)
 
Dictionary of content types
The tweets are classified into seven content categories. Two of these are Sport and
Employment-Education.
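A dictionary-based classification of this kind can be sketched in Python as follows. This is only an illustration under stated assumptions: the keyword lists are hypothetical (the study's actual dictionaries are not reproduced in these slides), the function name classify is made up, and only the two categories named above are included, with unmatched tweets falling into "Other".

```python
import re

# Hypothetical keyword dictionaries; the study's real term lists are not given here.
DICTIONARIES = {
    "Sport": ["deporte", "maratón", "fútbol", "carrera"],
    "Employment-Education": ["empleo", "curso", "beca", "formación"],
}

# One case-insensitive, whole-word pattern per category (similar in spirit to grepl in R).
PATTERNS = {
    category: re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    for category, terms in DICTIONARIES.items()
}


def classify(tweet):
    """Return every category whose dictionary matches, or ["Other"] if none does."""
    matches = [category for category, pattern in PATTERNS.items() if pattern.search(tweet)]
    return matches or ["Other"]


if __name__ == "__main__":
    print(classify("Gran maratón este domingo"))
    print(classify("Nueva beca de formación para jóvenes"))
```

Matching whole words avoids false hits such as "curso" inside "concurso", at the cost of missing inflected forms unless they are listed explicitly.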
 
14
Flowchart of scraping and content analysis

Twitter accounts: @Ayto_Sevilla, @malaga, @ayuncordoba_es, @...
Twint scraping (Python): twint --userlist Ayto_Sevilla,malaga,ayuncordoba -o ayto.csv --csv
CSV
“tm” package (R): corpus, VectorSource, tm_map, content_transformer, stripWhitespace, removePunctuation, removeNumbers, stopwords, TermDocumentMatrix, findFreqTerms
Define dictionaries and refine categories
Dictionary creation, “dplyr” package (R): filter, grepl
Classified tweets
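The text-cleaning and frequent-term steps of the flowchart are done in R with the “tm” package. As a rough Python analogue, the sketch below shows the same idea using only the standard library. The stopword list is a tiny illustrative assumption, the function names preprocess and frequent_terms are made up, and the link-removal line is an extra step of the kind a custom transformation would handle, not something stated in the slides.

```python
import re
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"de", "la", "el", "en", "y", "a", "los", "las", "del", "para"}


def preprocess(text):
    """Lowercase a tweet and strip links, ASCII punctuation, numbers and stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # extra step: drop links first
    text = text.translate(str.maketrans("", "", string.punctuation))  # like removePunctuation
    text = re.sub(r"\d+", " ", text)  # like removeNumbers
    # split() also collapses repeated whitespace, like stripWhitespace
    return [token for token in text.split() if token not in STOPWORDS]


def frequent_terms(tweets, min_freq):
    """Return, in alphabetical order, the terms that occur at least min_freq times."""
    counts = Counter(token for tweet in tweets for token in preprocess(tweet))
    return sorted(term for term, n in counts.items() if n >= min_freq)


if __name__ == "__main__":
    tweets = [
        "Abierto el plazo de inscripción para la carrera popular 2018",
        "La carrera popular corta el tráfico en el centro",
        "Nuevo plazo para becas",
    ]
    print(frequent_terms(tweets, 2))
```

The frequent terms found this way are the raw material for defining and refining the category dictionaries by hand.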
15
A case study. Results (I)
 
16
A case study. Results (II)
 
17
A case study. Results (III)
 
18
A case study. Results (IV)
 
19
A case study. Results (V)
 
20
A case study. Conclusions (I)
 
96.55% of the largest Andalusian municipalities have an official Twitter account.
The accounts differ in terms of their activity and audience.
The most commonly tweeted content was cultural and marketing content
(26.37%).
Regarding media type, website links (35.68%) were the most frequently used.
Citizens tend to choose retweets more often than favorites or replies to interact with
the municipality.
21
A case study. Conclusions (II)
 
No significant relationship was found between municipality size and Twitter activity.
A significant negative relationship was found between Twitter activity (measured by
the number of published tweets), the audience (measured by the number of
followers), and citizen engagement.
Photos and videos generated the highest rates of favorites and retweets.
The media type that generated the highest response rate was plain text.
Sport content generated the most retweets and favorites.
The content that tended to receive the most replies was related to environmental
issues.
22
A case study. Future Research
 
Future studies could adopt our approach and apply it in a comparative context
with other national and international regions, which could improve the generalizability
and understanding of the results.
Conduct research exploring the reasons leading users to interact with municipalities’
Twitter pages.
Propose a Twitter commitment model for the municipalities to explain the relationship
between the antecedent factors, the commitment that occurs, and the attitudinal and
behavioral effects on the users.

This content covers open source intelligence (OSINT) and tools such as Twint that are used for Twitter scraping. It explains the discipline of OSINT and how publicly available information is collected, processed, and disseminated for intelligence purposes. It also discusses GitHub as a web-based hosting service for version control using Git, and its role in managing open source projects.

  • Open Source Intelligence
  • Twint
  • GitHub
  • OSINT
  • Twitter Scraping

Uploaded on Feb 23, 2025



Presentation Transcript



  8. Twint - Twitter Intelligence Tool

| Category | Description | Command |
|---|---|---|
| General | Scrape all the Tweets from a user's timeline. | twint -u [username] |
| General | Scrape all the Tweets from several users' timelines (max. 25 users). | twint --userlist [username],[username] |
| Word | Scrape all Tweets from the user's timeline containing a specific word. | twint -u [username] -s [word] |
| Word | Collect every Tweet containing a specific word from everyone's Tweets. | twint -s [word] |
| Word | Display Tweets by verified users that Tweeted about specific words. | twint -s [words] --verified |
| Date | Collect the user's Tweets that were tweeted before a given year. | twint -u [username] --year [year] |
| Date | Collect the user's Tweets that were tweeted since a given date. | twint -u [username] --since [y-m-d] |
| File | Scrape the user's Tweets and save them to a text file. | twint -u [username] -o [name file].txt |
| File | Scrape the user's Tweets and save them as a csv file. | twint -u [username] -o [name file].csv --csv |
| File | Scrape the user's Tweets and save them as a json file. | twint -u [username] -o [name file].json --json |
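When many accounts have to be scraped, the commands above are often assembled by a script rather than typed by hand. The following Python sketch only builds the argument list and does not run Twint. The helper name build_twint_command and its parameters are hypothetical, and it uses only flags that appear on this slide (-u, -s, --since, -o, --csv, --json).

```python
import shlex


def build_twint_command(username=None, search=None, since=None, output=None, fmt=None):
    """Assemble a twint command line as a list of arguments (nothing is executed)."""
    argv = ["twint"]
    if username:
        argv += ["-u", username]
    if search:
        argv += ["-s", search]
    if since:
        argv += ["--since", since]
    if output:
        argv += ["-o", output]
    if fmt in ("csv", "json"):
        argv.append("--" + fmt)
    return argv


if __name__ == "__main__":
    argv = build_twint_command(
        username="malaga", since="2018-01-01", output="malaga.csv", fmt="csv"
    )
    # shlex.join quotes arguments that contain spaces (Python 3.8+).
    print(shlex.join(argv))
```

Keeping the command as a list avoids shell-quoting mistakes if the list is later passed to subprocess.run.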

  9. Twint - Twitter Intelligence Tool

| Category | Description | Command |
|---|---|---|
| Audience | Scrape a Twitter user's followers. | twint -u [username] --followers |
| Audience | Scrape who a Twitter user follows. | twint -u [username] --following |
| Audience | Collect all the Tweets a user has favorited. | twint -u [username] --favorites |
| Audience | Collect full user information for the accounts a person follows. | twint -u [username] --following --user-full |
| Misc. | Show the user's Tweets that might have phone numbers or email addresses. | twint -u [username] --email --phone |
| Misc. | Scrape Tweets from within a radius of a number of km around a location and export them to a csv file. | twint -g="latitude,longitude,[number]km" -o [name file].csv --csv |
| Misc. | Use a slow, but effective method to gather Tweets from a user's profile (gathers ~3200 Tweets, including retweets). | twint -u [username] --profile-full |
| Misc. | Use a quick method to gather the last 900 Tweets (including retweets) from a user's profile. | twint -u [username] --retweets |
| Misc. | Resume a search starting from the specified Tweet ID. | twint -u [username] --resume [Tweet ID] |


  15. A case study. Results (I)

Table 3. Number of followers and tweets

| | Maximum | Average | Minimum | Std. Deviation |
|---|---|---|---|---|
| Audience | 149,040 | 16,746 | 1,500 | 36,430 |
| Activity | 109,737 | 14,531 | 1,457 | 20,362 |

Table 4. Percentage of content types

| Content type | Percentage |
|---|---|
| Cultural and Marketing | 26.37% |
| Other | 25.42% |
| Transport and Public works | 16.11% |
| Security and Health | 12.63% |
| Employment and Education | 11.32% |
| Sport | 4.14% |
| Environment | 4.01% |

Table 5. Percentage of media types

| Media type | Percentage |
|---|---|
| Web links | 35.68% |
| Text | 28.63% |
| Photo/Video | 16.04% |
| Web links - Photo/Video | 13.79% |
| Other | 5.86% |

  16. A case study. Results (II)

Table 7. Relationship between population and activity

| Dependent variable | Independent variable | Spearman's coefficient | Significance | Conclusion |
|---|---|---|---|---|
| Activity (number of tweets) | Population (number of inhabitants) | 0.192 | 0.327 | Not confirmed |

** Significant at p<0.01 (2-tailed)

Table 8. Relationship between activity, population, audience, and citizen engagement

| Dependent variable | Independent variable | Spearman's coefficient | Significance | Conclusion |
|---|---|---|---|---|
| Engagement | Activity (number of tweets) | 0.716 ** | 0.000 | Negative relationship |
| Engagement | Population (number of inhabitants) | 0.285 | 0.142 | Not confirmed |
| Engagement | Audience (number of followers) | 0.595 ** | 0.000 | Negative relationship |

** Significant at p<0.01 (2-tailed)
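Tables 7 and 8 report Spearman's rank correlation coefficients. For readers unfamiliar with the measure, the sketch below computes it in plain Python as the Pearson correlation of the ranks, with tied values sharing their average rank. The function names and the sample numbers are illustrative assumptions and are not data from the study.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's coefficient: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up numbers: more tweets paired with lower engagement.
    activity = [1500, 4000, 9000, 20000, 60000]
    engagement = [9.1, 7.4, 5.0, 3.2, 1.1]
    print(spearman(activity, engagement))
```

Because it works on ranks, the coefficient is insensitive to the strongly skewed follower and tweet counts typical of a sample where a few large municipalities dominate.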

