Streetmangler
Streetmangler is a project created in the Russian OSM community for canonicalizing street names and fixing errors in them.
- OSM is a crowdsourced project
- there are usually several ways to write a street name
- put these two together and you get a map with every possible street naming style
- this is bad
- the map is less readable, and the principle of least astonishment (POLA) is violated when the naming style changes from town to town
- map data is less suitable for machine processing
- for countries which use the Karlsruhe schema (which uses street names to match buildings to their associated streets) this is extremely bad, as differently written names on streets and in addresses mean broken address search
- prone to secondary errors
As an example, here's an (incomplete) list of ways "Lenin street" may be written in Russian:
- Улица Ленина
- ул. Ленина
- ул Ленина
- ул.Ленина
- Ленина,ул.
- Ленина улица
- Ленина, Улица
- Ленина Ул
- ЛЕНИНА УЛИЦА
- ...
As you can see, the number of combinations is much higher than what is listed here, and higher still if we take into account spelling errors and varying numbers of spaces (a pretty common problem as well). Most of these variants were actually encountered in the map data.
Solution
Obviously, some kind of standard is needed for street names, and data should be converted to match it. In the Russian community, the "улица Ленина" form was chosen, for these reasons:
- Natural form
- Corresponds to toponymists' suggestions
- No abbreviations -> fewer ambiguities (a word may be abbreviated in different ways, but there's only one full form)
The first approach to data normalization I tried was a heuristic one. It helped to fix most errors, but it had its limitations.
- Too complex code, as there were too many rules and exceptions
- Too many false positives
- Too hard to maintain, output needs manual checking and too much handwork
Thus another approach was chosen, one based on a dictionary. It requires the extra work of filling the dictionary, but that work only needs to be done once, after which normalization based on the dictionary may be considered safe and may run unattended. The dictionary can also be reused for other purposes, the code is simpler (basically, only matching against the dictionary is needed), processing is faster, and the approach can be extended to other languages.
So far this approach has proven its worth: after normalization, the number of mismatches in the Karlsruhe schema addressing used in Russia dropped from >120000 to ~23000 (and those that remain are not really naming errors but buildings addressed by non-streets).
Principle
...is simple. A street name usually consists of a "name part" and a "status part". The first is usually constant, while the second may be written in many different ways. For example, in "Cromwell Road", "Road" is the status part and may be written as "Rd.", "rd" or "road". There may be more variants (see the example above). Once the status part is extracted, the status/name pair can be matched against the dictionary of street names. The comparison has several possible outcomes; the first match wins:
- Exact match. The street name matches a dictionary entry letter for letter, so we take it as correctly named.
- Name parts match, but the status part is written differently ("Cromwell Rd." vs. "Cromwell Road"). We consider it the same name and may safely suggest changing it to the canonical form from the dictionary. If the dictionary is of high quality, such changes may be performed without additional checks (though checking won't hurt).
- Same as above, but with fuzzy matching. This catches spelling errors ("Cromwel Road" vs. "Cromwell Road") and suggests the correct form. Such changes are less safe and should be checked by hand. There may also be multiple suggestions. A spelling error here means one letter removed, added or changed, and it's possible to specify the "depth" of fuzzy matching.
- If none of the above matched and the name HAS a status part, the name is missing from the dictionary. The name should then be checked by hand and added to the dictionary if it's correct.
- If none of the above matched and the name has NO status part, but its name part matches some name part from the dictionary, we assume the status part is missing (common in Russia). For example, addr:street="Redcliffe" may really mean "Redcliffe Gardens", "Redcliffe Mews", "Redcliffe Road", "Redcliffe Square", "Redcliffe Street" or even something else. It's safer not to offer suggestions here; such errors should be fixed on the map by hand, and a resurvey may be needed.
- Otherwise, we consider it a non-name. It may actually be a unique name without a status part, or it may be garbage of the kind sometimes encountered in the map data (like a path named "path").
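The cascade above can be sketched in a few dozen lines of C++. This is a minimal illustration under stated assumptions, not Streetmangler's actual code: the names Match, Parsed, parse, within_one_edit and classify are hypothetical, the status table is a toy one, the status part is assumed to be the last word (reasonable for English names, not for Russian ones), only ASCII is handled, the fuzzy depth is fixed at one edit, and fuzzy matching is assumed to require identical status parts.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical outcome categories, in the order the checks are tried.
enum class Match { Exact, CanonicalForm, SpellingFixed, NoMatch, StatusMissing, NonName };

struct Parsed {
    std::string name;    // name part, e.g. "Cromwell"
    std::string status;  // full status part, e.g. "Road"; empty if none was found
};

// Toy status table (assumption): lowercase variant without dots -> full form.
const std::map<std::string, std::string> kStatusVariants = {
    {"road", "Road"}, {"rd", "Road"},
    {"gardens", "Gardens"}, {"square", "Square"}, {"sq", "Square"},
};

// ASCII-only lowercasing; real data would need Unicode-aware handling.
std::string lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Split on whitespace and treat the last word as the status part if it is a
// known variant. A lone word is never treated as a status part.
Parsed parse(const std::string& raw) {
    std::istringstream in(raw);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    Parsed p;
    if (words.size() > 1) {
        std::string last = lower(words.back());
        if (last.back() == '.') last.pop_back();  // "Rd." -> "rd"
        auto it = kStatusVariants.find(last);
        if (it != kStatusVariants.end()) {
            p.status = it->second;
            words.pop_back();
        }
    }
    for (const auto& w : words) p.name += (p.name.empty() ? "" : " ") + w;
    return p;
}

// True if a and b are equal or differ by one inserted, removed or changed letter.
bool within_one_edit(const std::string& a, const std::string& b) {
    if (a.size() > b.size()) return within_one_edit(b, a);  // make a the shorter one
    assert(a.size() <= b.size());
    if (b.size() - a.size() > 1) return false;
    std::size_t i = 0;
    while (i < a.size() && a[i] == b[i]) ++i;  // skip the common prefix
    if (a.size() == b.size())                  // equal, or one changed letter
        return i == a.size() ||
               a.compare(i + 1, std::string::npos, b, i + 1, std::string::npos) == 0;
    // b has one extra letter at position i
    return a.compare(i, std::string::npos, b, i + 1, std::string::npos) == 0;
}

// dict holds full, canonical names such as "Cromwell Road".
Match classify(const std::string& raw, const std::set<std::string>& dict) {
    if (dict.count(raw)) return Match::Exact;
    const Parsed p = parse(raw);
    if (!p.status.empty()) {
        if (dict.count(p.name + " " + p.status)) return Match::CanonicalForm;
        for (const auto& entry : dict) {
            const Parsed d = parse(entry);
            if (d.status == p.status && within_one_edit(p.name, d.name))
                return Match::SpellingFixed;
        }
        return Match::NoMatch;
    }
    for (const auto& entry : dict)
        if (parse(entry).name == p.name) return Match::StatusMissing;
    return Match::NonName;
}
```

With a dictionary containing "Cromwell Road", "Redcliffe Gardens" and "Redcliffe Square", classify returns Exact for "Cromwell Road", CanonicalForm for "Cromwell Rd.", SpellingFixed for "Cromwel Road", NoMatch for "Abbey Road", StatusMissing for "Redcliffe" and NonName for "path". The linear scan over the dictionary is there only to keep the sketch short; a real implementation would want an index for the fuzzy and name-only lookups.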
Implementation
Streetmangler, thus, is a library which implements the logic described above. A utility to process OSM data is bundled, and here's an example of how to use it:
- get and compile streetmangler (see below). Your language may not be supported yet, so you may need to write a locale and create an empty dictionary (see below).
- get an OSM XML dump of your area of interest: save it from JOSM, export it through openstreetmap.ru, or take a whole-country dump (processing is pretty fast: it takes 8 minutes on a Core i5 650 to process a 15GB dump of Russia)
- chdir into some clean directory and run
./process_names -s -d -c -l <your locale> dump.osm
The utility will:
- parse OSM data
- extract street names from name tags of highway=* objects and addr:street (and addrN:streetN variants) tags of any objects
- match extracted names through the logic described above
- produce dumps with lists of street names falling into each classification category:
- dump.canonical_form.txt: canonical form suggestions which may be used to fix OSM data
- dump.spelling_fixed.txt: spelling fix suggestions, which need to be checked by hand and either added to the dictionary (if they're false positives) or fixed in OSM
- dump.no_match.txt: names not found in the dictionary; these are candidates for adding, but it's better to use the next one
- dump.no_match.full.txt: same as above, but with the status part expanded, since Streetmangler expects the dictionary to contain unabbreviated names
- dump.counts.*: counterparts of the above with a counter for each name. These tell you how many times a name was used, so you can e.g. focus on processing widely used names
- and some others
- use the produced data to fix OSM and to make the dictionary more complete
- repeat
Building streetmangler
First, get the source code:
git clone git://github.com/AMDmi3/streetmangler.git
Install required dependencies: cmake, libicu, libexpat2.
Build (you probably don't need Perl/Python bindings now, so let's disable them):
cmake -DWITH_PERL=NO -DWITH_PYTHON=NO . && make
Adding support for new language
So far Streetmangler is widely used only for Russia, so it only contains rules and dictionaries for Russian names. You will probably need to add support for your language, and it's really easy.
Locale
A locale is a set of rules for a specific language. For now, it only stores data on a set of status parts.
cd into lib/locales and create a file named after your language. You may just copy en.cc. Inside, you'll see an array of status part info:
/* 1        2     3      4                         5 */
{ "Street", NULL, "St.", { "street", "st", NULL }, 0 },
{ "Square", NULL, "Sq.", { "square", "sq", NULL }, 0 },
{ "Way",    NULL, NULL,  { "way", NULL },          0 },
Fill it with all status parts for your language. The columns, in the order numbered in the comment, mean:
- Full name of the status part. This is used when streetmangler converts a name to full form (for example, when preparing names for addition to the dictionary, as the dictionary uses full forms)
- Canonical name of the status part. This is used when streetmangler converts a name to canonical form (canonical is the form which is used in OSM). Use NULL if it's the same as the full form.
- Short name of the status part. This is used when streetmangler converts a name to short form. Use NULL if it's the same as the canonical form.
- All possible variations of the status part (including the full/canonical/short forms). Used for status part extraction. Don't use dots here. Don't forget the terminating NULL. Must be lowercase.
- Flags. Used to relax word ordering restrictions in some cases, see ru.cc
This is an example for London (maybe the whole of GB, not sure). Status parts are not abbreviated there, so the canonical form is the same as the full form, hence NULL in the second column. Some status parts do have a short form, though. Column 4 just lists all possible forms so that differently abbreviated status parts can be detected. See ru.cc for a more complex example with more possible abbreviations.
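To make the NULL fallbacks between columns concrete, here is a small self-contained sketch. It is an assumption-laden illustration rather than Streetmangler's own code: the StatusPart struct, its field names, the fixed-size variants array and the helpers to_full, to_canonical, to_short and find_status are all hypothetical, and only the three table rows are taken from the example above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical mirror of one table row; the real struct in lib/locales
// may be named and laid out differently.
struct StatusPart {
    const char* full;         // column 1: full form, always present
    const char* canonical;    // column 2: NULL means "same as full"
    const char* abbrev;       // column 3: NULL means "same as canonical"
    const char* variants[8];  // column 4: lowercase, no dots, NULL-terminated
                              // (at most 7 variants fit in this sketch)
    int flags;                // column 5: unused here
};

const StatusPart status_parts[] = {
    { "Street", NULL, "St.", { "street", "st", NULL }, 0 },
    { "Square", NULL, "Sq.", { "square", "sq", NULL }, 0 },
    { "Way",    NULL, NULL,  { "way", NULL },          0 },
};

std::string to_full(const StatusPart& s) {
    assert(s.full != NULL);  // the full form is the one mandatory column
    return s.full;
}

// Canonical form falls back to the full form when column 2 is NULL.
std::string to_canonical(const StatusPart& s) {
    return s.canonical ? std::string(s.canonical) : to_full(s);
}

// Short form falls back to the canonical form when column 3 is NULL.
std::string to_short(const StatusPart& s) {
    return s.abbrev ? std::string(s.abbrev) : to_canonical(s);
}

// Look up a word that has already been lowercased and stripped of dots.
// Returns NULL if the word is not a known status part variant.
const StatusPart* find_status(const std::string& word) {
    for (const StatusPart& s : status_parts)
        for (const char* const* v = s.variants; *v != NULL; ++v)
            if (word == *v) return &s;
    return NULL;
}
```

Under these assumptions, looking up "st" finds the "Street" row, whose short form is "St." and whose canonical form falls back to "Street"; looking up "way" finds a row where all three forms come out as "Way"; and looking up an unlisted word such as "lane" finds nothing, which is the signal that the locale is missing a status part.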
Next, you should add this helper object for Streetmangler to know your locale by name:
Locale::Registrar registrars[] = { Locale::Registrar("en_GB", status_parts), };
Change en_GB to the locale code for your language.
Next, add your locale to the build by listing it in LOCALE_SRCS in the CMakeLists.txt file.
Dictionary
Finally, you need an empty dictionary. Create an empty file data/en_GB.txt (again, change en_GB to your locale). You may prefill it with names from some trusted source with a compatible license.
Testing
You're done! After recompiling, pass your locale in the -l argument of the process_names utility and run it on some OSM data. You'll get a big dump.no_match.full.txt file which you may use to fill your dictionary (after checking each name, of course!), and dump.non_name.txt may contain names with status parts you forgot to add to the locale. As the dictionary fills up, the other dumps will grow with the errors found.