Russian state institutions full-text datasets

Abstract

A collection of corpora based on contents extracted from the websites of Russian state institutions

Giorgio Comai (2024): Russian state institutions full-text datasets – A collection of corpora based on contents extracted from the websites of Russian state institutions, v. 1.0, Discuss Data, https://doi.org/10.48320/0578D7FE-35F7-4E9E-A29D-926618A5C6BD


This is a collection of full-text datasets based on contents extracted from the websites of Russian state institutions.

None of the datasets include items published after 31 December 2023.

These datasets have been introduced in the following book chapter, which offers additional context:

Comai, Giorgio (2025, forthcoming), “Text-mining on-line sources from Russia openly”, in Autocracy, Influence, War: Russian Propaganda Today, edited by Paul Goode

The name of each corpus is composed of the bare domain name, a two-letter code for the main language of the contents, and the year of release of the dataset, separated by underscores, e.g. kremlin.ru_ru_2024 for the Russian-language version of Kremlin.ru.
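
The naming convention can be handled programmatically. The following is a minimal Python sketch, not part of the dataset release: the names CorpusName, parse_corpus_name and build_corpus_name are hypothetical helpers introduced here for illustration. It assumes that the domain part never contains an underscore, which holds for all corpora listed in this release.

```python
from typing import NamedTuple


class CorpusName(NamedTuple):
    """Components of a corpus name (hypothetical helper type)."""
    domain: str
    language: str
    year: int


def parse_corpus_name(name: str) -> CorpusName:
    """Split a name such as 'kremlin.ru_ru_2024' into its three parts."""
    # Split from the right: the last two underscore-separated fields are
    # the language code and the release year, the rest is the domain.
    parts = name.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError(f"Expected domain_language_year, got: {name!r}")
    domain, language, year = parts
    if len(language) != 2 or not year.isdigit():
        raise ValueError(f"Unexpected language or year in: {name!r}")
    return CorpusName(domain, language, int(year))


def build_corpus_name(domain: str, language: str, year: int) -> str:
    """Assemble a corpus name from its parts."""
    return f"{domain}_{language}_{year}"
```

For example, parse_corpus_name("archive.premier.gov.ru_ru_2024") returns the domain archive.premier.gov.ru, the language code ru, and the year 2024.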

This release includes the following websites:

  • Russia’s president, kremlin.ru, in English, filename: kremlin.ru_en_2024, from 1999-12-31 to 2023-12-31. Items included: 33 165
  • Russia’s president, kremlin.ru, in Russian, filename: kremlin.ru_ru_2024, from 1999-12-31 to 2023-12-31. Items included: 45 538
  • Russia’s MFA, mid.ru, in English, filename: mid.ru_en_2024, from 2003-01-04 to 2023-12-31. Items included: 25 943
  • Russia’s MFA, mid.ru, in Russian, filename: mid.ru_ru_2024, from 2003-01-02 to 2023-12-31. Items included: 56 203
  • Russia’s government, government.ru, in Russian, filename: government.ru_ru_2024, from 2006-06-22 to 2023-12-30. Items included: 17 135
  • Russia’s government (archived version), archive.government.ru, in Russian, filename: archive.government.ru_ru_2024, from 2008-05-07 to 2013-05-21. Items included: 7 103
  • Russia’s prime minister (archived version), archive.premier.gov.ru, in Russian, filename: archive.premier.gov.ru_ru_2024, from 2008-05-07 to 2012-05-07. Items included: 3 323
  • Russia’s Duma, duma.gov.ru, in Russian, filename: duma.gov.ru_ru_2024, from 2006-04-05 to 2023-12-30. Items included: 29 094
  • Russia’s Duma (transcripts), transcript.duma.gov.ru, in Russian, filename: transcript.duma.gov.ru_ru_2024, from 1994-01-11 to 2023-12-15. Items included: 6 032

File formats: compressed CSV files (.csv.gz); OpenDocument Spreadsheets (.ods)
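
The compressed CSV files can be read without unpacking them first. The sketch below uses only the Python standard library and rests on several assumptions not stated in this description: that the files are UTF-8 encoded, that the first row holds the column names, and that a file is named after its corpus with the .csv.gz extension. The functions column_names and iter_rows are hypothetical helpers; no column names are assumed, as they are read from the file itself.

```python
import csv
import gzip
from typing import Iterator

# Full-text fields can exceed the csv module's default field size limit,
# so the limit is raised here (an assumption about the size of the texts).
csv.field_size_limit(2**31 - 1)


def column_names(path: str) -> list[str]:
    """Return the header row of a gzip-compressed CSV file."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return next(csv.reader(handle))


def iter_rows(path: str) -> Iterator[dict[str, str]]:
    """Yield one dictionary per item, keyed by the column names."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


# Possible usage, with an assumed local file name:
# print(column_names("kremlin.ru_en_2024.csv.gz"))
# for row in iter_rows("kremlin.ru_en_2024.csv.gz"):
#     ...
```

Reading rows lazily keeps memory use low, which may matter for the larger corpora, as each contains tens of thousands of full-text items.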

A web version of the documentation accompanying this release is available online: https://tadadit.xyz/datasets/2024/russian_institutions_2024/

The datasets can be explored through a basic web interface: https://explore.tadadit.xyz/2024/ru_institutions_2024/

Related