text mining

Text as data & data in the text

Studying conflicts in post-Soviet spaces through structured analysis of textual contents available on-line

castarter - content analysis starter toolkit for R

castarter is a more modern, fully featured, and consistent iteration of castarter - Content Analysis Starter Toolkit for the R programming language (a previous iteration is still available as castarter.legacy). It facilitates text mining and web scraping by taking care of many of the most common file management issues, keeps track of download progress in a local database, facilitates extraction through dedicated convenience functions, and allows for basic exploration of textual corpora through a Shiny interface.


tifkremlinen is a package providing a single dataset, kremlin_en, which includes all contents published on the English-language version of kremlin.ru from 31 December 1999 to 31 December 2020. Yearly updates will likely be made available. Link to repo on GitHub. Link to official version of dataset with all details.

castarter.legacy - content analysis starter toolkit for R

castarter (now castarter.legacy) is designed to make it easy, even for relatively inexperienced users, to create a textual dataset from a website or a section of a website, keep it up to date, and explore it through word frequency graphs or a web interface that makes it possible to tag items. Documentation is available on castarter’s website.

Surfing the post-Soviet web with style. Text mining post-Soviet de facto states

Scholars working on the post-Soviet space frequently refer to web contents at different stages of their research process. However, they …