Automated edits/JaTrainWikipedia
Goal
This is a proposal for a mechanical edit that completes the "wikipedia" tag for Japanese train stations, under the following account:
There are about 9500 train station nodes for Japan in OpenStreetMap, but only 1000 of them have a "wikipedia" tag linking to their corresponding page. The goals of this script are to:
- fill the tag "wikipedia" if not present
- convert the outdated tag "wikipedia:ja" (pointing to a URL) to the right format
- automate most of the work
- be safe: point to the right page (never to a disambiguation page) and give up if we cannot be sure
- provide some samples for the local mappers, so they can discuss the validity of the project
- allow the user to review the modifications before any commit to the server
- prevent the commits from being a burden on the server
Implementation
The process will perform the following steps.
Creating a station extract from Wikipedia
The first step is to retrieve the Japanese Wikipedia dump and extract from it the list of stations and their locations. This extract will only be used to check that we picked the right Wikipedia page for the OSM station node. No Wikipedia data will be put in the OpenStreetMap database (the licenses are not compatible).
Listing the train station nodes
Then we will retrieve all the OSM nodes to update, by:
- retrieving a recent XML dump of the Japan data
- filtering it using Osmosis and keeping only the nodes with the tag «railway=station» (I don't have the command line, I used OSMembrane)
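For readers without Osmosis or OSMembrane at hand, the same filter can be sketched in plain Python. This is only an illustration, not the tool chain used here: the function name `iter_station_nodes` and the sample data are hypothetical, and the sketch assumes an uncompressed OSM XML input.

```python
import io
import xml.etree.ElementTree as ET

def iter_station_nodes(source):
    """Yield (id, lat, lon, tags) for every node tagged railway=station.

    `source` is a file name or a binary file object containing OSM XML.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            if tags.get("railway") == "station":
                yield (int(elem.get("id")), float(elem.get("lat")),
                       float(elem.get("lon")), tags)
        if elem.tag in ("node", "way", "relation"):
            elem.clear()  # drop children we no longer need to limit memory use

# Tiny made-up sample standing in for the Japan extract.
SAMPLE_XML = """<?xml version='1.0' encoding='UTF-8'?>
<osm version="0.6">
  <node id="1" lat="35.681" lon="139.767">
    <tag k="railway" v="station"/>
    <tag k="name" v="東京"/>
  </node>
  <node id="2" lat="35.0" lon="139.0"/>
</osm>""".encode("utf-8")

stations = list(iter_station_nodes(io.BytesIO(SAMPLE_XML)))
```

Streaming with `iterparse` avoids loading the whole country extract into memory, which matters for a file of that size.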
Processing the stations in batch
Starting from here, the job will be done by a Python script.
The tool will not process all the stations at once, but in batches. Each batch will contain a number of stations low enough:
- to allow a human review (in JOSM)
- not to be a burden for the servers
The stations of a batch will be in roughly the same region. The result of each batch will be saved in an OSM XML file.
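One possible way to build such batches is sketched below. The grid-cell ordering, the cell size of one degree, and the batch size of 50 are all assumptions made for illustration; the text above does not specify how regions or batch sizes are chosen.

```python
def make_batches(stations, batch_size=50, cell_deg=1.0):
    """Group (node_id, lat, lon) tuples into batches of nearby stations.

    Stations are ordered by a coarse grid cell (hypothetical choice), so that
    consecutive stations tend to lie in the same region, then cut into
    fixed-size chunks.
    """
    def cell_key(station):
        _node_id, lat, lon = station
        return (int(lat // cell_deg), int(lon // cell_deg), lat, lon)

    ordered = sorted(stations, key=cell_key)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

# Made-up coordinates: two stations near Tokyo, two near Sapporo.
sample = [(1, 35.6, 139.7), (2, 43.0, 141.3), (3, 35.7, 139.8), (4, 43.1, 141.4)]
batches = make_batches(sample, batch_size=2)
```

With this sample, each batch ends up holding the two stations of one area, which is the property a reviewer needs when opening a batch in JOSM.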
Processing a station
If the "wikipedia" tag of a station is already filled, we can skip this station.
Retrieving the latest version of the node
The script will get it by using the API 0.6 (for example, using this download URL). We will retrieve many nodes at once.
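As an illustration of retrieving many nodes at once, API 0.6 offers a multi-fetch call of the form `/api/0.6/nodes?nodes=id1,id2,...`. The sketch below builds such URLs; the helper names and the chunk size of 100 are assumptions, not values taken from the script.

```python
import urllib.request

API_BASE = "https://api.openstreetmap.org/api/0.6"

def nodes_url(node_ids):
    """Build the API 0.6 multi-fetch URL for the given node ids."""
    return f"{API_BASE}/nodes?nodes={','.join(str(i) for i in node_ids)}"

def chunked(ids, size=100):
    """Split a list of ids into chunks (size is an arbitrary assumption)."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def fetch_nodes(node_ids):
    """Download the latest version of the nodes as OSM XML bytes."""
    with urllib.request.urlopen(nodes_url(node_ids)) as response:
        return response.read()
```

Fetching in chunks keeps each request URL short and limits the number of requests sent to the server.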
Converting the tag
If a tag «wikipedia:ja -> URL» is present, we convert it to the format «wikipedia -> ja:name_of_page».
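A minimal sketch of this conversion follows. It assumes the old tag holds a URL of the form `http://ja.wikipedia.org/wiki/<title>`, possibly percent-encoded; the function name `convert_wikipedia_ja` and the choice to replace underscores with spaces are illustrative assumptions.

```python
from urllib.parse import unquote, urlparse

def convert_wikipedia_ja(tags):
    """Return a copy of `tags` with «wikipedia:ja -> URL» converted.

    The tags are returned unchanged when there is nothing to convert, when a
    "wikipedia" tag already exists, or when the URL has an unexpected shape.
    """
    url = tags.get("wikipedia:ja")
    if url is None or "wikipedia" in tags:
        return dict(tags)
    parsed = urlparse(url)
    prefix = "/wiki/"
    if parsed.netloc != "ja.wikipedia.org" or not parsed.path.startswith(prefix):
        return dict(tags)  # not safe to guess, leave it alone
    title = unquote(parsed.path[len(prefix):]).replace("_", " ")
    if not title:
        return dict(tags)
    new_tags = {k: v for k, v in tags.items() if k != "wikipedia:ja"}
    new_tags["wikipedia"] = f"ja:{title}"
    return new_tags

old = {"name": "東京",
       "wikipedia:ja": "http://ja.wikipedia.org/wiki/%E6%9D%B1%E4%BA%AC%E9%A7%85"}
new = convert_wikipedia_ja(old)
```

Returning a copy rather than editing in place makes it easy to compare the old and new tags when writing the modification log.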
Finding the right Wikipedia page (current implementation)
Most of the time, the page name is the «kanji name» + «駅». But we have to make sure that it does not point to a Wikipedia disambiguation page. To do so, we can compare the coordinates from OpenStreetMap with those from Wikipedia.
- From the Wikipedia extract, get a list of potentially matching stations. Each one will have coordinates and a page name
- If several candidate stations are within a short distance (arbitrary threshold: 500 m), skip this station: it is not safe, since there may be, for instance, a JR station and another one.
- If exactly one station is within range, pick this one
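The selection rule above can be sketched as follows. This is an interpretation, not the script itself: it assumes "within range" means within 500 m of the OSM node, uses the haversine great-circle formula for the distance, and the names `haversine_m` and `pick_page` are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def pick_page(osm_lat, osm_lon, candidates, max_dist_m=500.0):
    """Return the page name of the only candidate within range, else None.

    `candidates` is a list of (page_name, lat, lon) from the Wikipedia extract.
    Zero or several candidates in range both mean "give up".
    """
    in_range = [name for name, lat, lon in candidates
                if haversine_m(osm_lat, osm_lon, lat, lon) <= max_dist_m]
    return in_range[0] if len(in_range) == 1 else None

# Made-up candidates: one close to the OSM node, one about 2 km away.
candidates = [("A駅", 35.6812, 139.7671), ("B駅", 35.7000, 139.7700)]
chosen = pick_page(35.6813, 139.7670, candidates)
```

Returning `None` both when nothing matches and when the match is ambiguous keeps the "give up if we cannot be sure" goal in one place.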
Finding the right Wikipedia page (previous implementation)
Most of the time, the page name is the «kanji name» + «駅». But we have to make sure that it does not point to a Wikipedia disambiguation page. To do so, we can compare the coordinates from OpenStreetMap with those from Wikipedia.
- retrieve the Japanese edit page for this station
- look in the local cache if we already have it
- otherwise, download it. Its URL is http://ja.wikipedia.org/w/index.php?title= 'node name' 駅&action=edit
- If this page contains the text «{{aimai}}», it is a disambiguation page, and we can skip this node.
- we can easily extract the coordinates from the page (if present) and check them against the OSM coordinates, because they are present in a line starting with «|座標»
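The checks of this earlier approach can be sketched as below. The helper names are hypothetical, and the sketch stops at returning the raw «|座標» line, because the exact coordinate syntax inside that line is not described here.

```python
from urllib.parse import urlencode

def edit_url(node_name):
    """Build the Japanese Wikipedia edit-page URL for «node name» + «駅»."""
    query = urlencode({"title": node_name + "駅", "action": "edit"})
    return "http://ja.wikipedia.org/w/index.php?" + query

def is_disambiguation(page_text):
    """A page carrying the {{aimai}} template is a disambiguation page."""
    return "{{aimai}}" in page_text

def coordinate_line(page_text):
    """Return the first line starting with «|座標», or None if absent."""
    for line in page_text.splitlines():
        if line.lstrip().startswith("|座標"):
            return line.strip()
    return None

# Made-up page text; the coordinate value is a placeholder, not real syntax.
sample_page = "{{駅情報\n|駅名 = 東京駅\n|座標 = (placeholder)\n}}"
```

A local cache, as mentioned in the list, could simply map the result of `edit_url` to the downloaded text so that each page is fetched only once.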
Submitting
After the whole input file has been processed, the user will have a number of output XML files. For each of them, they can:
- open one of them in JOSM
- see the modifications (search for «railway station» and see the history)
- if ok, submit
Source code
The source code is available here:
https://github.com/Fabiensk/osm-enrich
Sample output file
You can get the current OSM output files. To check the modifications, you can open them in JOSM and look at the history of the nodes.
A log file is also generated to explain why each station is updated or not (distance between Wikipedia and OpenStreetMap too big, disambiguation page).
Improvements and other tasks
- clean the code of the script, make it more modular and put it on GitHub
Status
Done
- principle is accepted
- preliminary output files and modification log are available
- out of 9241 station nodes, 8197 could be modified
- a first edit has been made public
To do
- get feedback from the local mappers (Japanese mailing list)
- decide how many nodes we will have per commit
- commit the rest
- clean the code, put it on GitHub