3 Extracting data from Google Maps
Google Maps does not allow extracting all the street names for a given country, or all streets with a given name in a country. While there may be more efficient ways (suggestions welcome), we proceed by extracting the names of all villages, towns and cities in the relevant region from OpenStreetMap, and then querying Google Maps for Tito street (or similar) in each of them.
3.1 Extracting all places from OpenStreetMap
We use previously downloaded OpenStreetMap dumps with different filters.
# filter only places
dir.create(path = file.path("data", "o5m-places"), showWarnings = FALSE)
for (i in countries) {
if (file.exists(file.path("data", "o5m-places", paste0(i, "-places.o5m")))==FALSE) {
system(paste0('./osmfilter data/o5m/', i, '-latest.o5m --keep="place=*" --drop-version > ', 'data/o5m-places/', i, '-places.o5m'))
}
}
# export to csv only id, lat/lon, place type, and name
dir.create(path = file.path("data", "csv-places"), showWarnings = FALSE)
for (i in countries) {
if (file.exists(file.path("data", "csv-places", paste0(i, "-places.csv")))==FALSE) {
system(paste0('./osmconvert64 data/o5m-places/', i, '-places.o5m --all-to-nodes --csv="@id @lat @lon place name" > data/csv-places/', i, '-places.csv', " --csv-separator='; '"))
}
}
all_places <- data_frame()
for (i in countries) {
# Import from csv
places <- read_delim(file = file.path("data", "csv-places", paste0(i, "-places.csv")), delim = "; ", col_names = FALSE, locale = locale(decimal_mark = "."), trim_ws = TRUE)
places <- cbind(places, i)
all_places <- bind_rows(all_places, places)
}
colnames(all_places) <- c("id", "lat", "lon", "place", "name", "country")
all_places <- all_places %>% filter(is.na(name)==FALSE)
ExportData(data = all_places, "all_places")
The data are available as a spreadsheet in .csv, .xlsx, and as a data frame in R’s .rds format.
This filter provides a list of 43826 place names; testing multiple street names for each of them would require a large (and costly) number of queries to the Google API. We therefore filter the data to include only place names that are tagged as city, town, suburb, or village. This should include all inhabited locations with more than 1000 residents, and exclude places tagged as “locality”, “isolated_dwelling”, and “hamlet”, which are expected to be mostly irrelevant.
# http://wiki.openstreetmap.org/wiki/Tag:place%3Dvillage
all_places_over1000 <- all_places %>% filter(is.na(name)==FALSE) %>% filter(place == "city" | place == "town" | place == "suburb" | place == "village") %>% distinct()
This more restrictive filter provides a sizable, but somewhat more manageable dataset of 25881 place names.
The data are available as a spreadsheet in .csv, .xlsx, and as a data frame in R’s .rds format.
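The chained `|` comparisons in the filter can also be expressed with `%in%`, which is easier to extend if further place types need to be kept. The following is a minimal base-R sketch on a toy data frame; `toy_places`, `kept_types` and `toy_over1000` are made-up names with made-up values, and only the `name` and `place` columns mirror `all_places`.

```r
# Toy stand-in for all_places (hypothetical values, for illustration only)
toy_places <- data.frame(
  name  = c("Ljubljana", "Piran", "Banovci", "Nowhere", NA),
  place = c("city", "town", "village", "hamlet", "village"),
  stringsAsFactors = FALSE
)

# Place types assumed to have more than 1000 residents
kept_types <- c("city", "town", "suburb", "village")

# Keep named places of the selected types, dropping exact duplicate rows
keep_row <- !is.na(toy_places$name) & toy_places$place %in% kept_types
toy_over1000 <- unique(toy_places[keep_row, ])
print(toy_over1000)
```

The hamlet and the unnamed village are dropped, while the city, town and named village are kept.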
3.2 What are potential street names that should be queried?
By simply querying “tito” for all place names emerging from the filter, we would likely still receive meaningful results. However, querying for potential street names should give more accurate results. We can base a list of potential street names in each country on previously extracted OpenStreetMap data.
OSM_tito_all <- ImportData("OSM_tito_all")
for (i in unique(OSM_tito_all$country)) {
ShowTable(
OSM_tito_all %>% filter(country==i) %>% select(streetname, country) %>% count(streetname, country, sort = TRUE) %>% select(streetname, n, country)
)
}
streetname | n | country |
---|---|---|
Titova ulica | 3 | slovenia |
Cesta maršala Tita | 2 | slovenia |
Titova cesta | 2 | slovenia |
Trg maršala Tita | 2 | slovenia |
Titov most | 1 | slovenia |
Titov trg | 1 | slovenia |
Titov trg / Piazza Tito | 1 | slovenia |
Titova - Nasipna | 1 | slovenia |
Ulica Josipa Broza-Tita | 1 | slovenia |
streetname | n | country |
---|---|---|
Maršala Tita | 6 | croatia |
Ulica Maršala Tita | 4 | croatia |
Obala Maršala Tita | 2 | croatia |
Titov trg | 2 | croatia |
Hodaliste marsala tita | 1 | croatia |
Josipa Broza Tita | 1 | croatia |
Obala Josipa Broza Tita | 1 | croatia |
Obala m. Tita | 1 | croatia |
Poljana maršala Tita | 1 | croatia |
Trg J. B. Tita | 1 | croatia |
Trg Josipa Broza Tita | 1 | croatia |
Trg maršala Tita | 1 | croatia |
Trg Maršala Tita | 1 | croatia |
Ulica Josipa Broza Tita | 1 | croatia |
Ulica Josipa Broza-Tita | 1 | croatia |
streetname | n | country |
---|---|---|
Maršala Tita | 11 | bosnia-herzegovina |
Titova | 4 | bosnia-herzegovina |
Titova ili Put Oficirske Škole | 1 | bosnia-herzegovina |
Trg maršala Tita | 1 | bosnia-herzegovina |
Ul Maršala Tita | 1 | bosnia-herzegovina |
ul. Maršala Tita | 1 | bosnia-herzegovina |
streetname | n | country |
---|---|---|
Maršala Tita | 5 | serbia |
Титоградска | 4 | serbia |
Aleja Maršala Tita | 1 | serbia |
Marsala Tita | 1 | serbia |
Ulica Marsala Tita | 1 | serbia |
Тито Маршал | 1 | serbia |
Титова | 1 | serbia |
streetname | n | country |
---|---|---|
Marsala Tita | 2 | montenegro |
Gjergj Kastrioti - Skënderbeu / Maršal Tito | 1 | montenegro |
Josipa Broza Tita | 1 | montenegro |
Maršala Tita | 1 | montenegro |
Titove Korenice | 1 | montenegro |
trg Maršala Tita | 1 | montenegro |
streetname | n | country |
---|---|---|
Маршал Тито | 14 | macedonia |
bul. Marsal Tito | 1 | macedonia |
Marsal Tito | 1 | macedonia |
Marshal Tito | 1 | macedonia |
ul. Marshal Tito | 1 | macedonia |
Кеј Маршал Тито | 1 | macedonia |
Титова Митровачка | 1 | macedonia |
Титовелешка | 1 | macedonia |
Титово Ужице | 1 | macedonia |
ул. Маршал Тито | 1 | macedonia |
Улица Маршал Тито | 1 | macedonia |
Since Google Maps offers a similar result when it does not find an exact match (and deals with transliteration when needed), querying for ‘Titov’ and ‘Maršala Tita’ should provide an almost complete set of cases.
Shortcomings of this approach:
- if there are towns/villages with the same name in the same country, but in different regions, only one is counted (Google decides which)
- if there is more than one street in the same village with a similar name (say, both a “Marshal Tito street” and a “Marshal Tito Boulevard”), then only one is counted.
3.3 Finding “titov” on Google Maps
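To gauge how much the first shortcoming matters, one could count how many place names occur more than once within the same country before deduplicating. The following is a minimal base-R sketch on a toy data frame; `toy_places` and its values are made up for illustration, and only the `name` and `country` columns mirror `all_places_over1000`.

```r
# Toy stand-in for all_places_over1000 (hypothetical values, for illustration only)
toy_places <- data.frame(
  name    = c("Brod", "Brod", "Piran", "Novo Selo", "Novo Selo", "Novo Selo"),
  country = c("croatia", "croatia", "slovenia", "serbia", "serbia", "croatia"),
  stringsAsFactors = FALSE
)

# Count how many times each (name, country) pair occurs
name_counts <- aggregate(
  x = list(n = rep(1L, nrow(toy_places))),
  by = list(name = toy_places$name, country = toy_places$country),
  FUN = sum
)

# Pairs occurring more than once would be collapsed into a single query
ambiguous <- name_counts[name_counts$n > 1, ]

# Number of places that would not get a query of their own
places_lost <- sum(ambiguous$n - 1L)
print(ambiguous)
print(places_lost)
```

Running the same count on the real data would show how many places end up sharing a query with a namesake.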
all_places_over1000 <- all_places_over1000 %>% filter(is.na(name)==FALSE) %>% distinct(name, country, .keep_all = TRUE)
titovQuery <- paste("titov", all_places_over1000$name, all_places_over1000$country, sep = ", ")
These are the kinds of queries that will be made:
ShowTable(head(data_frame(Query = titovQuery)))
Query |
---|
titov, Ljubljana, slovenia |
titov, Banovci, slovenia |
titov, Postojna, slovenia |
titov, Piran, slovenia |
titov, Izola, slovenia |
titov, Kranj, slovenia |
The Google Maps API has a quota of 2500 free queries per day. We can either make 2500 queries per day (checking only “Titov” streets would take more than a week) or pay a fee of 0.50 USD per 1000 queries. In this case, querying for all “Titov” streets should cost less than 10 USD.
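The trade-off between waiting and paying can be worked out with a few lines of arithmetic. The sketch below uses the quota and price quoted above; the helper names `days_needed` and `cost_usd` and the value of `n_queries` are hypothetical, and the cost assumes a flat fee applied to every query. In practice `n_queries` would be `length(titovQuery)`.

```r
# Quota and price as stated in the text
free_queries_per_day <- 2500
usd_per_1000_queries <- 0.50

# Days needed when relying only on the free daily quota
days_needed <- function(n_queries, per_day = free_queries_per_day) {
  ceiling(n_queries / per_day)
}

# Cost when paying for every query (assumes a flat per-query fee)
cost_usd <- function(n_queries, price = usd_per_1000_queries) {
  n_queries / 1000 * price
}

n_queries <- 19000  # hypothetical number of queries, for illustration only
print(days_needed(n_queries))
print(cost_usd(n_queries))
```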
### if using API, uncomment this section
# saveRDS(object = "<API>", file = "GoogleApiKey.rds")
# register_google(key = readRDS("GoogleApiKey.rds"), account_type = "premium", day_limit = 50000)
## this just prepares a properly structured data frame
# titovResults <- cbind(geocode("Titov, Sarajevo, Bosnia and Herzegovina", output='more', messaging=TRUE, override_limit=TRUE), Query = "Titov, Sarajevo, Bosnia and Herzegovina")
# if (file.exists(file.path("data", "titovResults.rds"))==FALSE) {
# for (i in seq_along(titovQuery)) {
# temp <- try(geocode(location = titovQuery[i], output='more', messaging=TRUE))
# Sys.sleep(time = 1.5) #wait in order to stay within API limits
# if (is.na(temp$lon)==FALSE) {
# temp <- cbind(temp, Query = titovQuery[i])
# titovResults <- bind_rows(titovResults, temp)
# # saves the results as the process goes (just in case)
# ExportData(data = titovResults, filename = "titovResults", xlsx = FALSE)
# }
# }
# }
### This makes only the number of queries allowed in a given day, then it stops.
### If you re-run this another day it will pick up from where it left.
dir.create(path = "temp", showWarnings = FALSE)
# do nothing if already no free queries available
if (geocodeQueryCheck()>1) {
if (file.exists(file.path("data", "titovResults.rds"))==FALSE) {
#this simply aims to prepare a properly structured data frame
titovResults <- cbind(geocode("Titov, Sarajevo, Bosnia and Herzegovina", output='more'),
Query = "Titov, Sarajevo, Bosnia and Herzegovina", QueryId = 0)
# first run: no progress has been saved yet, so start from the first query
start <- 1
stop <- geocodeQueryCheck()
if (stop>length(titovQuery)) {
stop <- length(titovQuery)
}
temp <- data_frame(lon = "")
for (i in start:stop) {
temp <- geocode(location = titovQuery[i], output='more', messaging=FALSE)
Sys.sleep(time = 1.5) #wait in order to stay within API limits
if (is.na(temp$lon)==FALSE) {
temp <- cbind(temp, Query = titovQuery[i], QueryId = i)
titovResults <- bind_rows(titovResults, temp)
# saves the results as the process goes (just in case)
ExportData(data = titovResults, filename = "titovResults", xlsx = FALSE, showDataLink = FALSE)
}
saveRDS(object = i, file = file.path("temp", "titovProgressId.rds"))
}
} else {
# If this script has already been run, start from where it was last interrupted due to query limit
titovResults <- ImportData(filename = "titovResults")
titovProgressId <- readRDS(file = file.path("temp", "titovProgressId.rds"))
if (titovProgressId<length(titovQuery)) {
start <- sum(titovProgressId, 1)
stop <- sum(titovProgressId, geocodeQueryCheck())
if (stop>length(titovQuery)) {
stop <- length(titovQuery)
}
temp <- data_frame(lon = "")
for (i in start:stop) {
# If it receives an "over_query_limit" warning then skip
if (temp$lon!="OVER_QUERY_LIMIT") {
# makes sure over quota is properly dealt with: if over quota, just skips
temp <- tryCatch(expr = geocode(location = titovQuery[i], output='more', messaging=FALSE), warning = function(w) {
msg <- conditionMessage(w)
if (grepl(pattern = "OVER_QUERY_LIMIT", x = msg) == TRUE) {
return(data_frame(lon = "OVER_QUERY_LIMIT", lat = "OVER_QUERY_LIMIT"))
} else if (grepl(pattern = "ZERO_RESULTS", x = msg) == TRUE) {
return(data_frame(lon = "ZERO_RESULTS", lat = "ZERO_RESULTS"))
} else {
return(data_frame(lon = msg, lat = msg))
}
})
if (temp$lon=="OVER_QUERY_LIMIT") {
# do nothing really
} else {
Sys.sleep(time = 1.5) #wait in order to stay within API limits
if (is.na(temp$lon)==FALSE & temp$lon!="ZERO_RESULTS") {
temp <- cbind(temp, Query = titovQuery[i], QueryId = i)
titovResults <- bind_rows(titovResults, temp)
# saves the results as the process goes, so it can be stopped anytime and nothing is lost
ExportData(data = titovResults, filename = "titovResults", xlsx = FALSE, showDataLink = FALSE)
}
saveRDS(object = i, file = file.path("temp", "titovProgressId.rds"))
}
}
}
}
}
}
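The core of the script above is a checkpoint-and-resume pattern: process at most as many queries as the quota allows, save the index of the last processed query, and pick up from there on the next run. The following is a stripped-down base-R sketch of that pattern only, not the script itself; `run_batch`, `fake_geocode` and the toy queries are hypothetical names introduced for illustration, and the stub stands in for the real `geocode()` call so that no API request is made.

```r
# Process up to `quota` queries, resuming from the index saved in `progress_file`.
# `geocode_fun` is a placeholder for the real geocoding call.
run_batch <- function(queries, quota, progress_file, geocode_fun) {
  # Index of the last query already processed (0 on the first run)
  done <- if (file.exists(progress_file)) readRDS(progress_file) else 0L
  if (done >= length(queries) || quota < 1) {
    return(invisible(done))
  }
  last <- min(done + quota, length(queries))
  for (i in seq(from = done + 1L, to = last)) {
    geocode_fun(queries[i])
    # Save progress after every query, so an interruption loses nothing
    saveRDS(object = i, file = progress_file)
  }
  invisible(last)
}

# Stub geocoder that only records which queries it has seen
seen <- character(0)
fake_geocode <- function(q) seen <<- c(seen, q)

queries <- paste("titov", c("A", "B", "C", "D", "E"), "slovenia", sep = ", ")
progress_file <- tempfile(fileext = ".rds")

first_stop  <- run_batch(queries, quota = 2, progress_file, fake_geocode)  # queries 1-2
second_stop <- run_batch(queries, quota = 2, progress_file, fake_geocode)  # queries 3-4
third_stop  <- run_batch(queries, quota = 2, progress_file, fake_geocode)  # query 5
```

Each call stands for one day's run: with a quota of two, the five toy queries are completed over three calls, and no query is sent twice.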
3.4 Querying separately for ‘Marshal Tito’ (‘Маршал Тито’)
Given the “popularity” of Maršala Tita/Маршал Тито, the same process is now repeated for Maršala Tita (‘Маршал Тито’ in Macedonia), even if many of these streets have already been captured by querying for Titov. Here are some examples of the queries that will be made.
marsalaTitaQuery <- paste("Maršala Tita", all_places_over1000$name, all_places_over1000$country, sep = ", ")
marsalaTitaQuery[grepl(pattern = ", macedonia", x = marsalaTitaQuery)] <- gsub(pattern = "Maršala Tita, ", replacement = "Маршал Тито, ", marsalaTitaQuery[grepl(pattern = ", macedonia", x = marsalaTitaQuery)])
head(x = marsalaTitaQuery)
[1] “Maršala Tita, Ljubljana, slovenia” “Maršala Tita, Banovci, slovenia”
[3] “Maršala Tita, Postojna, slovenia” “Maršala Tita, Piran, slovenia”
[5] “Maršala Tita, Izola, slovenia” “Maršala Tita, Kranj, slovenia”
head(x = marsalaTitaQuery[grepl(pattern = ", macedonia", x = marsalaTitaQuery)])
[1] “Маршал Тито, Гевгелија, macedonia” “Маршал Тито, Скопје, macedonia”
[3] “Маршал Тито, Струмица, macedonia” “Маршал Тито, Неготино, macedonia”
[5] “Маршал Тито, Габрене, macedonia” “Маршал Тито, Струга, macedonia”
dir.create(path = "temp", showWarnings = FALSE)
# do nothing if already no free queries available
if (geocodeQueryCheck()>1) {
if (file.exists(file.path("data", "marsalaTitaResults.rds"))==FALSE) {
#this simply aims to prepare a properly structured data frame
marsalaTitaResults <- cbind(geocode("Maršala Tita, Sarajevo, Bosnia and Herzegovina", output='more'), Query = "Maršala Tita, Sarajevo, Bosnia and Herzegovina", QueryId = 0)
start <- sum(1)
stop <- geocodeQueryCheck()
if (stop>length(marsalaTitaQuery)) {
stop <- length(marsalaTitaQuery)
}
temp <- data_frame(lon = "")
for (i in start:stop) {
temp <- geocode(location = marsalaTitaQuery[i], output='more', messaging=FALSE)
Sys.sleep(time = 1.5) #wait in order to stay within API limits
if (is.na(temp$lon)==FALSE) {
temp <- cbind(temp, Query = marsalaTitaQuery[i], QueryId = i)
marsalaTitaResults <- bind_rows(marsalaTitaResults, temp)
# saves the results as the process goes (just in case)
ExportData(data = marsalaTitaResults, filename = "marsalaTitaResults", xlsx = FALSE, showDataLink = FALSE)
}
saveRDS(object = i, file = file.path("temp", "marsalaTitaProgressId.rds"))
}
} else {
# If this script has already been run, start from where it was last interrupted due to query limit
marsalaTitaResults <- ImportData(filename = "marsalaTitaResults")
marsalaTitaProgressId <- readRDS(file = file.path("temp", "marsalaTitaProgressId.rds"))
if (marsalaTitaProgressId<length(marsalaTitaQuery)) {
start <- sum(marsalaTitaProgressId, 1)
stop <- sum(marsalaTitaProgressId, geocodeQueryCheck())
if (stop>length(marsalaTitaQuery)) {
stop <- length(marsalaTitaQuery)
}
temp <- data_frame(lon = "")
if (start!=stop) {
for (i in start:stop) {
# If it receives an "over_query_limit" warning then skip
if (temp$lon!="OVER_QUERY_LIMIT") {
# makes sure over quota is properly dealt with: if over quota, just skips
temp <- tryCatch(expr = geocode(location = marsalaTitaQuery[i], output='more', messaging=FALSE), warning = function(w) {
msg <- conditionMessage(w)
if (grepl(pattern = "OVER_QUERY_LIMIT", x = msg) == TRUE) {
return(data_frame(lon = "OVER_QUERY_LIMIT", lat = "OVER_QUERY_LIMIT"))
} else if (grepl(pattern = "ZERO_RESULTS", x = msg) == TRUE) {
return(data_frame(lon = "ZERO_RESULTS", lat = "ZERO_RESULTS"))
} else {
return(data_frame(lon = msg, lat = msg))
}
})
if (temp$lon=="OVER_QUERY_LIMIT") {
# do nothing really
} else {
Sys.sleep(time = 1.5) #wait in order to stay within API limits
if (is.na(temp$lon)==FALSE & temp$lon!="ZERO_RESULTS") {
temp <- cbind(temp, Query = marsalaTitaQuery[i], QueryId = i)
marsalaTitaResults <- bind_rows(marsalaTitaResults, temp)
# saves the results as the process goes, so it can be stopped anytime and nothing is lost
ExportData(data = marsalaTitaResults, filename = "marsalaTitaResults", xlsx = FALSE, showDataLink = FALSE)
}
saveRDS(object = i, file = file.path("temp", "marsalaTitaProgressId.rds"))
}
}
}
}
}
}
}
3.5 Polishing the results
We remove results that are included multiple times, as well as results that are not streets or squares.
titovResults <- ImportData(filename = "titovResults")
marsalaTitaResults <- ImportData(filename = "marsalaTitaResults" )
TitoGmapsResults <- bind_rows(titovResults, marsalaTitaResults) %>%
filter(type=="route", country != "Italy") %>% # exclude non-YU and non streets/squares
filter(grepl(pattern = "Tit|Тит", x = route)) %>% # remove most non-Tito: keep Latin "Tit" or Cyrillic "Тит"
filter(!grepl(pattern = "Strozzi|Brezova", x = route)) %>% # remove remaining non-Tito
distinct(address, .keep_all = TRUE) %>% # remove those with same address
distinct(locality, route, .keep_all = TRUE) %>% #remove same locality, same street name
distinct(lon, lat, route)
ExportData(data = TitoGmapsResults, filename = "TitoGmapsResults")
The data are available as a spreadsheet in .csv, .xlsx, and as a data frame in R’s .rds format.