Andhra Pradesh/Notes/Arjunaraoc/Improving geodata accuracy on OSM and Wikidata
Author:arjunaraoc
Last major content update: 2024-04-10
Geodata of places is available from multiple crowdsourced platforms such as Wikipedia, Wikidata and OSM. Using Wikidata identifiers on OSM helps in identifying inaccuracies and fixing them. This post is a how-to guide.
Andhra Pradesh places data - Background info
Places can be categorized into urban and rural. Rural places consist of revenue villages and the hamlets affiliated to some of them. Information on about 15K revenue villages of Andhra Pradesh is available from the 2011 census documents. Hamlets are approximately twice that number, taking the total place count to more than 40K. Hamlets are more common in hilly areas.
Wikipedians started creating articles for places around 2007. While the articles for towns and cities had substantial content, village articles were very brief, mostly limited to the location and the administrative hierarchy of the village, such as mandal and district. Article creation continued for several years until enforcement of notability reviews began. Soon after the launch of Wikidata in 2012, Wikidata items were created based on the infobox data of the village articles. Further information was added from census data. During 2019, Telugu Wikipedians took up a project to create articles for all revenue villages, with content based on census data. Several hamlet articles were added in English Wikipedia and Telugu Wikipedia as well.
All revenue villages are identified by a census 2011 location code, which is added as a property on Wikidata. During 2018, one OSM editor uploaded data about places from the Bhuvan portal to OSM, bringing the number of places to 30K. During 2019, mandals and their administrative headquarters (670) were added to OSM along with their Wikidata items. During the same time, about 800 places on OSM were populated with Wikidata values manually. As the data on all these platforms is crowdsourced, and as several places have similar names, sometimes within a district and mostly across districts, there are several mismatches between the location shown on Wikidata and the actual location in OSM. On 4 April 2022, Andhra Pradesh was reorganised, with 13 districts becoming 26 districts. While all Wikidata items and Telugu Wikipedia articles were updated systematically to reflect this, English Wikipedia articles may have errors.
As of April 2024, there were 33939 places in OSM, with 1739 of them having a Wikidata attribute. There were 16058 revenue villages in Wikidata, of which 1594 had English Wikipedia articles, amounting to 9.93% of total revenue villages. The number of active contributors to Wikipedia, Wikidata and OSM is only 2-3 people. Thus it is not feasible to run projects with fixed timelines to improve coverage and accuracy. At the same time, working efficiently with Wikidata and OSM requires esoteric scripting skills and complex software tools.
While commercial map providers provide map data in local languages, the local language labels are usually transliterated from English, resulting in errors. The OSM, Wikidata and Wikipedia platforms provide a way to improve local language maps by leveraging Wikidata values on OSM, through semi-automated updates of Telugu names. This work grew from such a need. The idea is to document as much and as clearly as possible, so that even users with limited programming skills and limited exposure to web-based OSM tools, who are interested in improving the maps, can do a lot of work. The initial scope is to work on the revenue villages among the 1739 places with Wikidata on OSM.
Analysis of places data in AP
Scripts
- Places count in OSM by district
- Places with Wikidata count in OSM by district
- Count of revenue villages in Wikidata by district in AP
- Count of revenue villages in Wikidata with enwiki article links by district in AP
Data
as on 2024-04-23
Wikidata | District name | osm_places | osm_places_wd | wd_places_RV | enwiki_places_RV | enwiki% of wd |
---|---|---|---|---|---|---|
Q110714850 | Alluri Sitharama Raju | 3785 | 2351 | 2737 | 24 | 0.88% |
Q110714857 | Anakapalli | 1176 | 60 | 651 | 38 | 5.84% |
Q15212 | Anantapur | 1143 | 39 | 486 | 33 | 6.79% |
Q110714854 | Annamayya | 3095 | 38 | 443 | 36 | 8.13% |
Q110876712 | Bapatla | 735 | 117 | 268 | 75 | 27.99% |
Q15213 | Chittoor | 2558 | 41 | 779 | 49 | 6.29% |
Q110714859 | Dr. B. R. Ambedkar Konaseema | 666 | 108 | 303 | 104 | 34.32% |
Q15338 | East Godavari | 521 | 57 | 259 | 53 | 20.46% |
Q110714851 | Eluru | 1568 | 137 | 647 | 149 | 23.03% |
Q15341 | Guntur | 371 | 114 | 192 | 71 | 36.98% |
Q110714860 | Kakinada | 575 | 64 | 385 | 71 | 18.44% |
Q15382 | Krishna | 998 | 117 | 455 | 118 | 25.93% |
Q15381 | Kurnool | 715 | 35 | 432 | 26 | 6.02% |
Q110714861 | Nandyal | 717 | 46 | 441 | 36 | 8.16% |
Q110876763 | NTR | 505 | 81 | 292 | 86 | 29.45% |
Q110714862 | Palnadu | 723 | 118 | 350 | 87 | 24.86% |
Q110714856 | Parvathipuram Manyam | 1196 | 147 | 902 | 36 | 3.99% |
Q15390 | Prakasam | 1632 | 76 | 784 | 79 | 10.08% |
Q15383 | Sri Potti Sriramulu Nellore | 1708 | 76 | 637 | 62 | 9.73% |
Q110714863 | Sri Sathya Sai | 1899 | 38 | 445 | 28 | 6.29% |
Q15395 | Srikakulam | 2376 | 112 | 1237 | 65 | 5.25% |
Q110714853 | Tirupati | 2096 | 61 | 994 | 58 | 5.84% |
Q15394 | Visakhapatnam | 407 | 119 | 87 | 8 | 9.20% |
Q15392 | Vizianagaram | 1239 | 590 | 914 | 89 | 9.74% |
Q15404 | West Godavari | 691 | 70 | 272 | 73 | 26.84% |
Q15342 | YSR | 1436 | 56 | 666 | 40 | 6.01% |
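The last column of the table is plain arithmetic on the two columns before it. As a minimal sketch (the function name `enwiki_share` is made up for illustration and is not part of the scripts listed above), it can be recomputed like this:

```python
def enwiki_share(enwiki_places_rv, wd_places_rv):
    """Percentage of Wikidata revenue villages that have an enwiki article."""
    if wd_places_rv <= 0:
        raise ValueError("wd_places_RV must be positive")
    return round(100 * enwiki_places_rv / wd_places_rv, 2)


# Guntur row of the table: 71 enwiki articles for 192 revenue villages.
guntur_share = enwiki_share(71, 192)

# State-level figures quoted in the background section.
state_share = enwiki_share(1594, 16058)
```

Recomputing the column this way is a cheap check for copy errors when the table is refreshed.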
Identifying errors in Geodata
Simple Visual identification of potential errors
The geodata is presented on the Wikidata page and on the corresponding English Wikipedia article page, using OSM as the background map. If the marker is not near the place name shown on the OSM map, there is a possibility of an error. Even if the name appears on the OSM background map, checking at different zoom levels helps confirm whether the place is in the correct location.
Query based on the distance between Wikidata and OSM locations
A combined Wikidata and OSM query is useful to identify potential errors. A sample query that compares village location data in a district of Andhra Pradesh and displays the top 10 places by distance between the Wikidata and OSM locations is available at https://w.wiki/9hYy. It provides a map view of the places. The table view shows the Wikidata link, place name, Wikidata location, OSM URL for the place, OSM location, and the distance between the Wikidata and OSM locations in kilometres. Places are sorted by decreasing distance. Villages are usually small, less than about 1 sq km in area, and separated from the nearest village by at least a kilometre. So all places with an error of more than 2 kilometres are suspects. The table view is useful for inspecting the data.
To fix these errors, open OSM for the village and open the Wikimedia map from the coordinate location property on Wikidata. On the OSM map, find the mandal and district by selecting "Query features" and clicking a point close to the original location. Compare these with the values listed on the Wikidata page. If there are no differences and the distance between the locations is more than 2 km, the locations are in error.
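The 2 km rule can also be applied offline to rows exported from the table view. The following is a minimal sketch, assuming each row is a dictionary with hypothetical keys `wd_lat`, `wd_lon`, `osm_lat` and `osm_lon` in decimal degrees. The sample villages and coordinates are invented, and the haversine formula stands in for whatever distance function the query itself uses.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def suspects(rows, threshold_km=2.0):
    """Rows whose Wikidata and OSM points differ by more than threshold_km,
    sorted by decreasing distance."""
    scored = [
        (haversine_km(r["wd_lat"], r["wd_lon"], r["osm_lat"], r["osm_lon"]), r)
        for r in rows
    ]
    flagged = [(d, r) for d, r in scored if d > threshold_km]
    flagged.sort(key=lambda pair: pair[0], reverse=True)
    return [dict(r, distance_km=round(d, 2)) for d, r in flagged]


# Invented sample rows, for illustration only.
sample_rows = [
    {"name": "Village A", "wd_lat": 16.300, "wd_lon": 80.450,
     "osm_lat": 16.301, "osm_lon": 80.451},
    {"name": "Village B", "wd_lat": 16.300, "wd_lon": 80.450,
     "osm_lat": 16.400, "osm_lon": 80.450},
]
flagged_rows = suspects(sample_rows)
```

Here only "Village B" is flagged, because its two points are about 11 km apart, while the points for "Village A" are a few hundred metres apart.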
Query points overlaid on boundary of district for errors
Let's consider an example to understand the need for this. Jonnalagadda is part of Palnadu district, which was newly created from the erstwhile Guntur district. There were at least two places by the same name in Guntur district. The Wikidata identifier added in OSM by matching on name within the district turned out to be attached to the wrong place. The error distance between the Wikidata and OSM locations cannot uncover such an error, since the two locations can agree with each other while both being wrong. So all the Wikidata and OSM locations need to be overlaid on the boundary of the district to find such errors. For Guntur district, I uncovered three such errors, with two in a nearby district and another in a faraway district.
Though the Wikidata map view displays the point results from Wikidata and OSM, it cannot add the boundary of the district. So the output of the Wikidata query is transformed into GeoJSON using the OpenRefine templating export.[1] (See the Appendix for the customised template.) That file and the boundary of the district are overlaid in JOSM to identify the errors.
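The visual overlay in JOSM can be complemented by a scripted check. The sketch below is not the workflow described above. It is a plain ray-casting point-in-polygon test, under the simplifying assumptions that the district boundary is a single outer ring of (long, lat) pairs with no holes or multipolygon parts, and that planar geometry is accurate enough at district scale. The square boundary and the two features are invented for illustration.

```python
def point_in_ring(lon, lat, ring):
    """Ray casting: True if (lon, lat) lies inside the ring of (lon, lat) pairs."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast towards increasing longitude.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def outside_district(features, boundary_ring):
    """GeoJSON point features whose coordinates fall outside the boundary ring."""
    return [
        f for f in features
        if not point_in_ring(*f["geometry"]["coordinates"], boundary_ring)
    ]


# Invented square "district" and two invented point features.
boundary = [(80.0, 16.0), (81.0, 16.0), (81.0, 17.0), (80.0, 17.0)]
features = [
    {"type": "Feature", "properties": {"name": "Inside village"},
     "geometry": {"type": "Point", "coordinates": [80.5, 16.5]}},
    {"type": "Feature", "properties": {"name": "Outside village"},
     "geometry": {"type": "Point", "coordinates": [79.5, 16.5]}},
]
strays = outside_district(features, boundary)
```

Any feature returned by `outside_district` is a candidate for the "outside district" cases described later and still needs to be verified by hand.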
Fixing errors
To fix the errors in geodata, we need a way to gather the place data from Wikidata and OSM and to present the distance between the data points. For villages, if the distance is more than two kilometres, there is potential for error. The data in OSM and Wikidata needs to be verified against primary, open and freely licensed sources such as Bharatmaps (the AP portal from Bharatmaps is used by government departments and includes several POI layers) and Who's On First (https://spelunker.whosonfirst.org), which was updated in 2023 and utilised various free information sources from the Government of India.
Use the StateGIS portal of Bharatmaps (the AP portal from Bharatmaps) to find the location of the village by searching in the Geocode locator menu. As soon as you start typing the name, potential matches are shown in the drop-down, with details of the admin hierarchy such as mandal and district. Select the best match. Then use the Measure tool and select the node corresponding to the place to get the location in the measurement window. The location will be in long, lat format, which needs to be converted to lat, long format when entering it in the Wikidata property.
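The order swap is easy to get wrong when done by hand. The following is a minimal sketch of the conversion, assuming the portal value is copied as two comma-separated decimal-degree numbers (the exact text format shown by the Measure tool is an assumption here). The bounding box for Andhra Pradesh is a rough, generous approximation chosen for this sketch. Because latitudes and longitudes in the state fall in non-overlapping ranges, it also catches input that is already in lat, long order.

```python
# Approximate bounding box for Andhra Pradesh, an assumption of this sketch.
AP_LAT_RANGE = (12.0, 20.0)
AP_LON_RANGE = (76.0, 85.5)


def lonlat_to_latlon(text):
    """Convert a 'long, lat' string into a (lat, long) tuple of floats."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected 'long, lat', got {text!r}")
    lon, lat = float(parts[0]), float(parts[1])
    lat_ok = AP_LAT_RANGE[0] <= lat <= AP_LAT_RANGE[1]
    lon_ok = AP_LON_RANGE[0] <= lon <= AP_LON_RANGE[1]
    if not (lat_ok and lon_ok):
        raise ValueError(
            f"{text!r} does not look like 'long, lat' inside Andhra Pradesh"
        )
    return lat, lon


# Invented coordinates, for illustration only.
lat, lon = lonlat_to_latlon("80.45, 16.30")
```

The returned pair can then be entered as latitude followed by longitude in the Wikidata coordinate location property.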
Fixing errors in mismatch between Wikidata and OSM
When using the error distance, fix the locations in Wikidata and OSM as required, if the error exceeds 2 km. Some examples are provided in the following sections.
Wikidata location is incorrect
The Wikidata location is far from the actual place on OSM. Just updating the Wikidata location is sufficient to fix the issue. Use the StateGIS portal of Bharatmaps to update the location, as it is the easiest option. Use the Who's On First Spelunker tool to get a unique ID for the place, which can be added as an additional identifier on Wikidata. The images show how the Wikidata item for a place called Ananthavaram was fixed.
OSM location is incorrect
The OSM location is far from the actual place as shown by Wikidata. Locate the village in the likely place through the Bharatmaps state portal, and create or update the OSM place with the Wikidata value.
Wikidata, OSM location at wrong place outside district as correct OSM location does not have Wikidata
In this case, the Wikidata and OSM locations are at a wrong place outside the district, as the correct OSM place does not have a Wikidata attribute. Update both places with the proper Wikidata value.
Wikidata and OSM incorrect, No such place with Wikidata within district
This represents a case where the Wikidata location and the corresponding OSM location are pointing to a place outside the district. As discussed in the previous section, this error can be identified only by overlaying the district boundary with the points from Wikidata+OSM query.
The Wikidata value of the corresponding point in OSM needs to be updated, and a new place needs to be created inside the district.
Lessons learnt from a trial on Guntur district
Corrections data
- Number of places for the district in OSM: About 80
- Wikidata location property corrections: About 12 inside district, About 6 outside district
- OSM corrections: About 10
- Effort spent: About 16 person hours (includes effort spent towards learning different tools for the first time such as spelunker, Bharatmaps, Openrefine export)
Lessons Learnt
- Clean up Wikidata as much as possible before starting work on fixing mismatches between Wikidata and OSM
  - If two or more values are present for a property (coordinate location, instance of, located in the administrative territorial entity), reduce them to one as far as possible
  - While deleting additional values, retain those sourced from GNS and delete those imported from Wikipedia
  - If the same value is present more than once, delete the one without a source
- Fix villages falling outside the selected district first, as they may not be detected properly based on distance measure
- Take up fixing the mismatches based on distance measure
- The Bharatmaps state GIS portal is a good source for fixes, as it has information from several government departments. POIs such as banks and post offices are available, which helps to confirm the right match for the place. Purple triangles represent information from habitations, while green circles or stars represent information from the census. The purple triangle is to be preferred for location information. In OSM, look for any nearby POIs with the same village name and delete them, as they are nearby habitations and part of the same village.
- While copying a location from the Bharatmaps state GIS portal to Wikidata, change the order to lat, long
- The Spelunker tool of Who's On First is also useful, though it is not user-friendly, as it does not have a geocoder.
References
- [1] Tony Hirst (May 19, 2014). “Putting Points on Maps Using GeoJSON Created by Open Refine”.
Bibliography
- Introducing Karmashapes by justinelliotmeyers, stepps00 and nvkelso, blog post on Who's On First (Jun 19, 2023)
- CC0 compliant India specific datasets and viewable links by ramSeraph (Dec 9, 2023)
- Discussion on license for import from GOI sources (Jan 8, 2024)
- OSM India telegram group discussion 1 (Jan 9 2024)
- OSM India telegram group discussion 2 (Bharatmaps data ODbL compatibility) (Feb 19, 2024)
- OSM India telegram group discussion 3 (Validity of upload of Andhra Pradesh places from Bhuvan in 2018, in view of NSDAP compliant Bharatmaps in 2021) (April 4 2024)
See Also
- OSM Diary entry titled "Improving geodata accuracy on OSM and Wikidata" (Date: 2024-04-10)
- Villages_in_Andhra_Pradesh#Initiatives
Appendix
OpenRefine GeoJSON template for use with a Wikidata query (revenue village locations in an admin area as per Wikidata), to generate GeoJSON using OpenRefine. You can edit the query for the desired admin area and run it.
prefix:
{"features": [
row template:
{"geometry": { "coordinates": [ {{cells["long"].value}}, {{cells["lat"].value}} ], "type": "Point"}, "id": {{jsonize(cells["item"].value)}}, "properties": { "name": {{jsonize(cells["itemLabel"].value)}}, "wikidata": {{jsonize(cells["item"].value)}}, "osmid": {{jsonize(cells["osmid"].value)}} }, "type": "Feature" }
row separator:
,
suffix:
], "type": "FeatureCollection"}
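For readers who prefer a script to the OpenRefine export, the same structure can be produced in a few lines. This is a sketch rather than the method used above. It assumes the query results are already loaded as dictionaries keyed by the column names used in the template (`long`, `lat`, `item`, `itemLabel`, `osmid`). The sample row, including its identifiers, is a made-up placeholder.

```python
import json


def rows_to_geojson(rows):
    """Build a GeoJSON FeatureCollection mirroring the OpenRefine template."""
    features = []
    for row in rows:
        features.append({
            # GeoJSON order is [longitude, latitude].
            "geometry": {
                "coordinates": [float(row["long"]), float(row["lat"])],
                "type": "Point",
            },
            "id": row["item"],
            "properties": {
                "name": row["itemLabel"],
                "wikidata": row["item"],
                "osmid": row["osmid"],
            },
            "type": "Feature",
        })
    return {"features": features, "type": "FeatureCollection"}


# Placeholder row, for illustration only (not a real item or OSM id).
sample = [{
    "item": "Q0000000",
    "itemLabel": "Example village",
    "long": "80.45",
    "lat": "16.30",
    "osmid": "node/0",
}]
collection = rows_to_geojson(sample)
geojson_text = json.dumps(collection, ensure_ascii=False, indent=2)
```

Writing `geojson_text` to a `.geojson` file gives a layer that can be opened in JOSM alongside the district boundary, in the same way as the OpenRefine output.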