class: center, middle, inverse, title-slide

# Wikidata for data journalism
## (with R and ‘tidywikidatar’)
Wikidata Data Reuse Days 2022
### Giorgio Comai
### (OBCT/CCI - EDJNet)
### 17 March 2022

---

# Contents

- who am I?
- why did I feel the need for a new package for interacting with Wikidata?
- `tidywikidatar` - a new(ish) R package - and how it addresses these issues
- how `tidywikidatar` works - actual use cases
- an interactive web interface for manually checking what automated approaches may not get right
- what next?

---

# Hi, I'm Giorgio Comai

- researcher and data analyst working for OBCT/CCI
- we coordinate the [European Data Journalism Network](https://www.europeandatajournalism.eu/)

<img src="img/EDJNet-logo.svg" style="display: block; margin: auto;" />

- one of our goals is to create tools for data journalists
- I am the author of `tidywikidatar`, an R package that facilitates exploring Wikidata through tidy data frames
- online mostly as *@giocomai*, e.g. on [GitHub](https://github.com/giocomai/), [Twitter](https://twitter.com/giocomai), [Wikidata](https://www.wikidata.org/wiki/User:Giocomai)
- I have a website, [giorgiocomai.eu](https://giorgiocomai.eu/)

---

# Why a new package for R?

- there are other Wikidata packages for R, in particular:
  - `WikidataQueryServiceR` - SPARQL queries!
  - `WikidataR` - all of Wikidata's complexity, which in R translates into nested lists whose structure differs depending on the content
- R has quickly grown to prominence in data journalism largely thanks to the `tidyverse`, a suite of packages based on a consistent grammar that has data frames at its core. There was no easy way to use Wikidata in a manner consistent with this logic

---

# What's the matter?

- R users mostly hate nested lists
- some probably also hate SPARQL, but many more simply don't know much about it
- if you don't know what to expect, it's a pain to process data
- existing tools are not a good fit for the iterative data analysis process that is at the core of data journalism
- re-running an analysis with minor changes is a very common part of the workflow... without built-in caching, this can be painfully slow
---

# `tidywikidatar`

<div style ="float: right;" ><img src="img/tidywikidatar_logo.png" style = "width:256px;"></img></div>

Check out the website with documentation and examples: https://edjnet.github.io/tidywikidatar/

[![CRAN status](https://www.r-pkg.org/badges/version/tidywikidatar)](https://cran.r-project.org/package=tidywikidatar) [![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/last-month/tidywikidatar?color=blue)](https://r-pkg.org/pkg/tidywikidatar) [![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/grand-total/tidywikidatar?color=blue)](https://r-pkg.org/pkg/tidywikidatar)

- everything in tabular format: one row, one piece of information
- easy local caching
- easy integration with `dplyr` piped routines
- get image credits from Wikimedia Commons
- include Wikipedia in the exploration, or use it as a starting point

---
class: middle, center
background-image: url(img/DRD_background_lower.png)
background-size: contain

# A couple of examples of practical use cases

---

### Olympics 2020 medalists by place of birth

https://github.com/EDJNet/olympics2020nuts

<iframe src="https://edjnet.github.io/olympics2020nuts/medalists_map.html" width="100%" height="400px" data-external="1"></iframe>

---

### Main air routes that could be travelled by train

Wikidata used for: defining city hubs for airports, getting coordinates of airports (to exclude those on islands), and providing unique identifiers for merging with the train dataset

[https://edjnet.github.io/european_routes/](https://edjnet.github.io/european_routes/)

<img src="img/air_train_routes_germany.png" width="720px" />

---

### Mapping diversity

https://medium.com/european-data-journalism-network/finding-gendered-street-names-a-step-by-step-walkthrough-with-r-7608c2d36a77

<iframe src="https://mappingdiversity.eu/italy/bologna/" width="100%" height="400px" data-external="1"></iframe>

---
class: middle, center
background-image: url(img/DRD_background_lower.png)
background-size: contain

# `tidywikidatar`
# The basics

---

# Enable local caching

```r
library(dplyr, warn.conflicts = FALSE)
library("tidywikidatar")

tw_enable_cache()
tw_set_cache_folder(path = fs::path(fs::path_home_r(), "R", "tw_data"))
tw_set_language(language = "en")
tw_create_cache_folder(ask = FALSE)
```
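---

# Caching in practice

A minimal sketch of what caching buys you, assuming the cache has been enabled as in the previous slide: the first request for an item is retrieved from Wikidata and written to the local cache; repeating the same request reads from the cache, so re-running a script with minor changes stays fast.

```r
# first request: retrieved from the Wikidata API and stored in the local cache
tw_get(id = "Q458") # European Union

# same request again: served from the local cache, near instant
tw_get(id = "Q458")
```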
---

# Or e.g. MySQL

```r
library(dplyr, warn.conflicts = FALSE)
library("tidywikidatar")

tw_enable_cache(SQLite = FALSE)
tw_set_cache_db(driver = "MySQL",
                host = "localhost",
                port = 3306,
                database = "tidywikidatar",
                user = "secret_username",
                pwd = "secret_password")
```

---

# Search

```r
tw_search("European Union")
```

```
## # A tibble: 10 × 3
##    id       label                                            description         
##    <chr>    <chr>                                            <chr>               
##  1 Q458     European Union                                   economic and politi…
##  2 Q319328  European Union                                   antifascist resista…
##  3 Q326124  directive of the European Union                  legislative act of …
##  4 Q185441  Member state of the European Union               state that is party…
##  5 Q208202  European Union law                               body of treaties an…
##  6 Q392918  European Union agency                            distinct body of th…
##  7 Q1126192 European Union Prize for Literature              award
##  8 Q308905  presidency of the Council of the European Union  rotating presidency
##  9 Q207203  Europol                                          police agency of th…
## 10 Q276255  European Union Aviation Safety Agency            European Union civi…
```

---

# Get an item

```r
tw_search("European Union") %>% 
  slice(1) %>% 
  tw_get()
```

```
## # A tibble: 557 × 4
##    id    property value                rank  
##    <chr> <chr>    <chr>                <chr> 
##  1 Q458  label_en European Union       <NA>  
##  2 Q458  alias_en EU                   <NA>  
##  3 Q458  alias_en E.U.                 <NA>  
##  4 Q458  alias_en eu                   <NA>  
##  5 Q458  alias_en 🇪🇺                   <NA>  
##  6 Q458  alias_en Europe               <NA>  
##  7 Q458  P1448    Европейски съюз      normal
##  8 Q458  P1448    Evropská unie        normal
##  9 Q458  P1448    Den Europæiske Union normal
## 10 Q458  P1448    Europäische Union    normal
## # … with 547 more rows
```

---

# Get a specific property

```r
tw_search("European Union") %>% 
  slice(1) %>% 
  tw_get_property(p = "P31")
```

```
## # A tibble: 6 × 4
##   id    property value    rank  
##   <chr> <chr>    <chr>    <chr> 
## 1 Q458  P31      Q4120211 normal
## 2 Q458  P31      Q3623811 normal
## 3 Q458  P31      Q1335818 normal
## 4 Q458  P31      Q7210356 normal
## 5 Q458  P31      Q1048835 normal
## 6 Q458  P31      Q170156  normal
```

---

# Get a specific property

```r
tw_search("European Union") %>% 
  slice(1) %>% 
  tw_get_property(p = "P31") %>% 
  tw_label()
```

```
## # A tibble: 6 × 4
##   id             property    value                        rank  
##   <chr>          <chr>       <chr>                        <chr> 
## 1 European Union instance of regional organization        normal
## 2 European Union instance of economic union               normal
## 3 European Union instance of supranational union          normal
## 4 European Union instance of political organization       normal
## 5 European Union instance of political territorial entity normal
## 6 European Union instance of confederation                normal
```

---

# What about qualifiers?

e.g. when did member states join the EU?

```r
tw_get_qualifiers(id = "Q458", # European Union
                  p = "P150") # contains administrative territorial entity
```

```
## # A tibble: 37 × 8
##    id    property qualifier_id qualifier_prope… qualifier_value qualifier_value…
##    <chr> <chr>    <chr>        <chr>            <chr>           <chr>           
##  1 Q458  P150     Q31          P580             +1958-01-01T00… time            
##  2 Q458  P150     Q183         P580             +1958-01-01T00… time            
##  3 Q458  P150     Q142         P580             +1958-01-01T00… time            
##  4 Q458  P150     Q142         P1012            Q3769           wikibase-entity…
##  5 Q458  P150     Q142         P1012            Q17012          wikibase-entity…
##  6 Q458  P150     Q142         P1012            Q17054          wikibase-entity…
##  7 Q458  P150     Q142         P1012            Q17063          wikibase-entity…
##  8 Q458  P150     Q142         P1012            Q17070          wikibase-entity…
##  9 Q458  P150     Q142         P1012            Q126125         wikibase-entity…
## 10 Q458  P150     Q38          P580             +1958-01-01T00… time            
## # … with 27 more rows, and 2 more variables: rank <chr>, set <dbl>
```
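---

# Working with time qualifiers

The start times in the previous output come back as Wikidata time strings such as `+1958-01-01T00:00:00Z`. A minimal, illustrative sketch for turning these into dates (the helper below is not part of `tidywikidatar` and assumes common-era dates):

```r
wikidata_time_to_date <- function(x) {
  # keep the "YYYY-MM-DD" part, dropping the leading "+" and the time
  as.Date(substr(x, start = 2, stop = 11))
}

wikidata_time_to_date("+1958-01-01T00:00:00Z")
```

```
## [1] "1958-01-01"
```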
---

# What about qualifiers?

```r
tw_get_qualifiers(id = "Q458", # European Union
                  p = "P150") %>% # contains administrative territorial entity
  filter(qualifier_property == "P580") %>% # keep only "start time"
* transmute(country = tw_get_label(qualifier_id),
*           start_time = qualifier_value)
```

```
## # A tibble: 28 × 2
##    country                    start_time           
##    <chr>                      <chr>                
##  1 Belgium                    +1958-01-01T00:00:00Z
##  2 Germany                    +1958-01-01T00:00:00Z
##  3 France                     +1958-01-01T00:00:00Z
##  4 Italy                      +1958-01-01T00:00:00Z
##  5 Luxembourg                 +1958-01-01T00:00:00Z
##  6 Kingdom of the Netherlands +1958-01-01T00:00:00Z
##  7 Denmark                    +1973-01-01T00:00:00Z
##  8 Republic of Ireland        +1973-01-01T00:00:00Z
##  9 Greece                     +1981-01-01T00:00:00Z
## 10 Portugal                   +1986-01-01T00:00:00Z
## # … with 18 more rows
```

- for more, check `tw_get_property_with_details()`

---

### Dealing with multiple results when only one is needed

Easy questions can be difficult: in which country is London?

```r
tibble::tibble(city_qid = c("Q84")) %>% 
  dplyr::mutate(city_label = tw_get_label(city_qid),
                country_qid = tw_get_p(id = city_qid, p = "P17")) %>% 
  tidyr::unnest(country_qid) %>% 
  mutate(country = tw_get_label(country_qid))
```

```
## # A tibble: 8 × 4
##   city_qid city_label country_qid country                                     
##   <chr>    <chr>      <chr>       <chr>                                       
## 1 Q84      London     Q2277       Roman Empire
## 2 Q84      London     Q110888     Kingdom of Essex
## 3 Q84      London     Q105092     Kingdom of Mercia
## 4 Q84      London     Q105313     Kingdom of Wessex
## 5 Q84      London     Q179876     Kingdom of England
## 6 Q84      London     Q161885     Great Britain
## 7 Q84      London     Q174193     United Kingdom of Great Britain and Ireland
## 8 Q84      London     Q145        United Kingdom
```

---

### Dealing with multiple results when only one is needed

- keeping the first result is tricky
- keeping only preferred results may still give more than one result
- people who love tabular data often want just one result, and it needs to be "good enough"

```r
tibble::tibble(city_qid = c("Q84", "Q220")) %>% 
  dplyr::mutate(city_label = tw_get_label(city_qid),
                country_qid = tw_get_p(id = city_qid,
                                       p = "P17",
*                                      preferred = TRUE,
*                                      latest_start_time = TRUE,
*                                      only_first = TRUE)) %>% 
  dplyr::mutate(country_label = tw_get_label(country_qid))
```

```
## # A tibble: 2 × 4
##   city_qid city_label country_qid country_label 
##   <chr>    <chr>      <chr>       <chr>         
## 1 Q84      London     Q145        United Kingdom
## 2 Q220     Rome       Q38         Italy
```

---
class: middle, center
background-image: url(img/DRD_background_lower.png)
background-size: contain

# Different entry points

---

### Search

- `tw_search()` - search strings

### Query

- `tw_query()` - simple queries based on property/value pairs (see the sketch on the next slide)
- `tw_get_all_with_p()` - get all items that have a given property, irrespective of their value

### Based on Wikipedia

- `tw_get_wikipedia_page_links()` - get the Wikidata Q identifiers of all Wikipedia pages linked from the input page
- `tw_get_wikipedia_page_section_links()` - all identifiers found in a specific section of a Wikipedia page
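---

### e.g. a simple query

A minimal sketch of `tw_query()`: the query is expressed as property/value pairs, here passed as a data frame with `p` and `q` columns (check the package documentation for the exact interface). Identifiers used: P463 = "member of", Q458 = European Union.

```r
library("tidywikidatar")

# all items that are a "member of" (P463) the European Union (Q458);
# additional property/value rows would further restrict the results
tw_query(
  query = tibble::tribble(
    ~p,     ~q,
    "P463", "Q458"
  )
)
```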
---
class: middle, center
background-image: url(img/DRD_background_lower.png)
background-size: contain

# An example starting from Wikipedia

---

# Election of the President of the Republic in Italy

<div style ="float: right;" ><img src="img/IV_scrutinio.png" style = "width:256px;"></img></div>

- [Election of the President of the Republic in Italy](https://it.wikipedia.org/wiki/Elezione_del_Presidente_della_Repubblica_Italiana_del_2022)
- the electoral college can vote literally for whoever they like
- the list ends up including very different candidates, from respected intellectuals to football players and porn actors
- almost all of them with one thing in common: they are on Wikidata, but Wikidata does not know they have something in common.

---

## Wikidata identifiers

.pull-left[

Take a single section:

```r
df <- tw_get_wikipedia_page_section_links(
  title = "Elezione del Presidente della Repubblica Italiana del 2022",
  section_title = "IV scrutinio",
  language = "it")

df %>% 
  select(wikipedia_title, qid)
```

]

.pull-right[

|wikipedia_title      |qid       |
|:--------------------|:---------|
|Adnkronos            |Q359875   |
|Alberto Airola       |Q14636378 |
|Aldo Giannuli        |Q3609233  |
|Alessandro Altobelli |Q346945   |
|Alessandro Barbero   |Q960451   |
|Carlo Nordio         |Q19357364 |
|Dino Zoff            |Q180661   |
|Domenico De Masi     |Q3713005  |

]

---

# Find out more

```r
pob_df <- df %>% 
  select(qid) %>% 
  mutate(name = tw_get_label(qid)) %>% 
  mutate(place_of_birth_id = tw_get_p(id = qid, p = "P19", only_first = TRUE)) %>% 
  mutate(place_of_birth = tw_get_label(place_of_birth_id)) %>% 
  mutate(place_of_birth_coordinates = tw_get_p(id = place_of_birth_id,
                                               p = "P625",
                                               only_first = TRUE))

pob_df
```

```
## # A tibble: 36 × 5
##    qid       name               place_of_birth_… place_of_birth place_of_birth_…
##    <chr>     <chr>              <chr>            <chr>          <chr>           
##  1 Q359875   Adnkronos          <NA>             <NA>           <NA>            
##  2 Q14636378 Alberto Airola     Q9474            Moncalieri     45,7.683333     
##  3 Q3609233  Aldo Giannuli      Q3519            Bari           41.125277777778…
##  4 Q346945   Alessandro Altobe… Q128211          Sonnino        41.414458333333…
##  5 Q960451   Alessandro Barbero Q495             Turin          45.066666666667…
##  6 Q19357364 Carlo Nordio       Q5475            Treviso        45.672219444444…
##  7 Q180661   Dino Zoff          Q53131           Mariano del F… 45.916666666667…
##  8 Q3713005  Domenico de Masi   Q277969          Rotello        41.7475,15.0041…
##  9 Q3723207  Elisabetta Belloni Q220             Rome           41.893055555556…
## 10 Q726247   Franco Grillini    Q94979           Pianoro        44.383333333333…
## # … with 26 more rows
```

---

### Here they are on a map

.pull-left[

```r
pob_sf <- pob_df %>% 
  tidyr::separate(
    col = place_of_birth_coordinates,
    into = c("pob_latitude", "pob_longitude"),
    sep = ",",
    remove = TRUE,
    convert = TRUE) %>% 
  filter(is.na(pob_latitude) == FALSE) %>% 
  sf::st_as_sf(
    coords = c("pob_longitude", "pob_latitude"),
    crs = 4326)

library("ggplot2")

pop_gg <- ggplot() +
  geom_sf(data = ll_get_nuts_it(level = 3, no_check_certificate = TRUE)) +
  geom_sf(data = pob_sf, colour = "deeppink4") +
  theme_minimal()
```

```
## ℹ Source: https://www.istat.it/it/archivio/222527
## ℹ Istat (CC-BY)
```

]

.pull-right[

![](2022-03-17-wikidata_reuse_days_files/figure-html/unnamed-chunk-20-1.png)<!-- -->

]

---

### All the usual things we expect from Wikidata

.pull-left[

```r
occupation_df <- df %>% 
  pull(qid) %>% 
  tw_get_property(p = "P31") %>% # get "instance of"
  filter(value == "Q5") %>% # keep only humans
  tw_get_property(p = "P106") %>% # get occupation
  # filter(value != "Q82955") %>% # exclude politicians
  group_by(value) %>% 
  count(sort = TRUE) %>% 
  ungroup() %>% 
  transmute(occupation = tw_get_label(value), n)
```

]

.pull-right[

|occupation                   |  n|
|:----------------------------|--:|
|politician                   | 18|
|university teacher           |  5|
|judge                        |  5|
|association football player  |  5|
|jurist                       |  4|
|journalist                   |  4|
|lawyer                       |  3|
|association football manager |  3|
|historian                    |  2|
|sociologist                  |  2|
|writer                       |  2|
|high civil servant           |  1|
|deputy chairperson           |  1|
|essayist                     |  1|
|music director               |  1|
|professor                    |  1|
|entrepreneur                 |  1|
|media proprietor             |  1|
|conductor                    |  1|
|economist                    |  1|
|diplomat                     |  1|
|psychologist                 |  1|
|film director                |  1|
|sports executive             |  1|
|radio personality            |  1|
|medievalist                  |  1|
|theatrical director          |  1|
|academic                     |  1|
|basketball player            |  1|
|physician                    |  1|
|film critic                  |  1|
|magistrate                   |  1|
|musician                     |  1|
|clerk                        |  1|
|banker                       |  1|
|television presenter         |  1|

]
---

### And other things useful for data visualisation and interactive interfaces, e.g. quick access to images...

```r
president_df <- tw_search("Sergio Mattarella") %>% 
  tw_filter_first(p = "P31", q = "Q5")

president_df %>% 
  tw_get_image()
```

```
## # A tibble: 1 × 2
##   id       image                    
##   <chr>    <chr>                    
## 1 Q3956186 Presidente Mattarella.jpg
```

```r
president_df %>% 
  tw_get_image(format = "embed", width = 300) %>% 
  pull(image)
```

```
## [1] "https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file/Presidente Mattarella.jpg&width=300"
```

---

### ...with metadata and credits line

```r
president_df %>% 
  tw_get_image_metadata() %>% 
  tidyr::pivot_longer(cols = -1,
                      names_to = "property",
                      values_transform = as.character)
```

```
## # A tibble: 18 × 3
##    id       property                   value                                     
##    <chr>    <chr>                      <chr>                                     
##  1 Q3956186 image_filename             "Presidente Mattarella.jpg"               
##  2 Q3956186 object_name                "Presidente Mattarella"                   
##  3 Q3956186 image_description          "<a href=\"https://en.wikipedia.org/wiki… 
##  4 Q3956186 categories                 "Attribution only license|Images from th… 
##  5 Q3956186 assessments                ""                                        
##  6 Q3956186 credit                     "<a rel=\"nofollow\" class=\"external te… 
##  7 Q3956186 artist                     "Unknown author<span style=\"display: no… 
##  8 Q3956186 permission                 <NA>                                      
##  9 Q3956186 license_short_name         "Attribution"                             
## 10 Q3956186 license_url                <NA>                                      
## 11 Q3956186 license                    <NA>                                      
## 12 Q3956186 usage_terms                <NA>                                      
## 13 Q3956186 attribution_required       <NA>                                      
## 14 Q3956186 copyrighted                <NA>                                      
## 15 Q3956186 restrictions               "personality"                             
## 16 Q3956186 date_time                  "2020-04-14 18:33:11"                     
## 17 Q3956186 date_time_original         "4 March 2015"                            
## 18 Q3956186 commons_metadata_extension "1.2"                                     
```
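---

### Building a plain-text credit line

A minimal, illustrative sketch: some of the fields above (e.g. `artist`, `credit`) come back as HTML snippets, so for a plain-text credit line the tags can be stripped and combined with the licence name. The `strip_html()` helper below is not part of the package, and a real credit line should of course follow the requirements of the specific file's licence.

```r
img_df <- president_df %>% 
  tw_get_image_metadata()

# crude helper: drop HTML tags and squeeze whitespace
strip_html <- function(x) {
  trimws(gsub("\\s+", " ", gsub("<[^>]+>", " ", x)))
}

credit_line <- paste0(strip_html(img_df$artist),
                      " (", img_df$license_short_name, ")",
                      ", via Wikimedia Commons")
credit_line
```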
---

### Back and forth between Wikidata and Wikipedia

This gets the Q identifiers of all pages linked from the Wikipedia page of a given Q identifier. Easy peasy :-)

```r
president_df %>% 
  tw_get_wikipedia(language = "it") %>% # gets url of Wikipedia page from QID
  tw_get_wikipedia_page_links(language = "it") %>% 
  select(wikipedia_title, qid)
```

```
## # A tibble: 489 × 2
##    wikipedia_title                             qid  
##    <chr>                                       <chr>
##  1 Fabio Vander                                <NA> 
##  2 Ordine per Meriti Eccezionali               <NA> 
##  3 Discussioni template:Capi di Stato d'Europa <NA> 
##  4 1941                                        Q5231
##  5 1987                                        Q2429
##  6 1989                                        Q2425
##  7 1990                                        Q2064
##  8 1998                                        Q2089
##  9 1999                                        Q2091
## 10 2001                                        Q1988
## # … with 479 more rows
```

---
class: middle, center
background-image: url(img/DRD_background_lower.png)
background-size: contain

# Integration with interactive interfaces

---
class: middle, center
background-image: url(img/screenshot_mapping_diversity_01.png)
background-size: contain

---
class: middle, center
background-image: url(img/screenshot_mapping_diversity_02.png)
background-size: contain

---
class: middle, center
background-image: url(img/DRD_background_lower.png)
background-size: contain

# Issues and next steps

---

# General issues

- if you are processing many thousands of items, the current approach can be very slow when run for the first time (and near instant afterwards, thanks to caching)
- no obvious long-term solution, but a future version will allow for an easier way to share the cache, to make sure others can also run the script instantly
- no easy way to "give back" to Wikidata
  - I'm considering some sort of integration with [QuickStatements](https://quickstatements.toolforge.org/), and welcome recommendations on best practices (not so much on the technical implementation, but rather on what kind of instructions or warnings I should present to users)

---

# Issues related to the web interface and mapping

- finalise the web interface; hopefully it will be out and public by the summer
- make it more generic, for easier checks of arbitrary sets of strings
- consider data issues, in particular the fact that many streets have their own item on Wikidata
- consider ways to "give back" newly gathered data to Wikidata (the data generated via the web interface will in any case be public)
- consider what to do with indirect outputs, e.g. lists of people who have streets dedicated to them but who apparently are not on Wikidata (also considering the inherent biases)

---

# `tidywikidatar`

<div style ="float: right;" ><img src="img/tidywikidatar_logo.png" style = "width:256px;"></img></div>

Check out the website with documentation and examples: https://edjnet.github.io/tidywikidatar/

[![CRAN status](https://www.r-pkg.org/badges/version/tidywikidatar)](https://cran.r-project.org/package=tidywikidatar) [![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/last-month/tidywikidatar?color=blue)](https://r-pkg.org/pkg/tidywikidatar) [![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/grand-total/tidywikidatar?color=blue)](https://r-pkg.org/pkg/tidywikidatar)

- everything in tabular format: one row, one piece of information
- easy local caching
- easy integration with `dplyr` piped routines
- get image credits from Wikimedia Commons
- include Wikipedia in the exploration, or use it as a starting point