CNEFE data, IBGE, Brasil import
The aim of this project is to import/integrate data on Brazilian road names and addresses from the Cadastro Nacional de Endereços para Fins Estatísticos (CNEFE, National Registry of Addresses for Statistical Purposes), which is produced by the Brazilian Institute of Geography and Statistics (IBGE). CNEFE is a list of all 81.5 million addresses surveyed in the 2010 Census. This is important because in Brazil most traced roads remain unnamed and there is almost no address data.
With CNEFE, OSM data in Brazil could greatly improve, leveraging Brazilian mappers' efforts and making OSM more useful to end users in the country.
CNEFE includes information on municipality, enumeration district code, city block and block edge. However, the data is not spatially bound. The only spatial information given is the shapefiles of enumeration district perimeters (each district covers about 5 to 10 blocks) and coordinates for a small subset of addresses that contains rural households and non-residential addresses (schools, hospitals, businesses).
In summary there are two projects:
- Use CNEFE street names to name unnamed "roads" in OSM-Brasil.
- In places with streets already named, add CNEFE addresses to OSM.
We are seeking help from people in the international OSM community who have experience with SQL/PostGIS and address registry imports to help import/integrate CNEFE into OSM. CNEFE data is not perfect and will require some (hopefully automated) curation. And because of the lack of road polygons, matching CNEFE data with OSM will challenge your coding skills if we are to extract as much useful data as possible from CNEFE.
Below I describe in more detail the available datasets and possible integration strategies.
Data sources
All the data sources below were produced by the Brazilian statistical agency, Instituto Brasileiro de Geografia e Estatística (IBGE), and are by-products of the 2010 Population Census.
CNEFE
Cadastro Nacional de Endereços para Fins Estatísticos (National Registry of Addresses for Statistical Purposes) is a database of all addresses that were visited during the 2010 Census.
Source: main page with documentation here. Data is available here from this FTP server, with one file per municipal district (or subdistrict, when applicable), grouped into one folder per state. Each file contains all addresses in that district or subdistrict. Files are provided as .txt in a fixed-width format, and a layout file indicates the column positions needed for the import.
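As a minimal sketch of how such a fixed-width file could be read, the Python below slices each line according to a layout table. The field names, column positions and the latin-1 encoding are all hypothetical placeholders; the real values must be taken from the layout file that IBGE ships with the data.

```python
# Hypothetical layout: (field name, start, end), 0-based and end-exclusive.
# These positions are NOT the real CNEFE layout; replace them with the
# positions given in IBGE's layout file.
LAYOUT = [
    ("ed_code", 0, 15),
    ("road_type", 15, 25),
    ("road_name", 25, 55),
    ("house_number", 55, 61),
]


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of stripped strings."""
    return {name: line[start:end].strip() for name, start, end in layout}


def parse_file(path, layout=LAYOUT, encoding="latin-1"):
    """Yield one dict per non-empty line. The encoding is an assumption."""
    with open(path, encoding=encoding) as handle:
        for raw in handle:
            line = raw.rstrip("\r\n")
            if line:
                yield parse_line(line, layout)
```

Keeping the layout in a plain table means that only the table needs to change once the real column positions are known.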
Details: IBGE releases a public version of the registry, which consists of a list of 81.4 million addresses. Addresses were recorded in the field by Census enumerators during the 2010 Demographic Census, so there can be some differences in the spelling of street names, although these are somewhat rare.
The address is divided into several fields (see image below), including: road type (road, avenue, highway, etc.), road name title (president, governor), the road name itself, address number in the street, postal code and identification number within the building. Besides the address information, CNEFE also contains the enumeration district code and the block and block face of each address (described below), which make it possible to locate the approximate area of the address.
The data set includes not only residential households but also public and private entities (firms, hospitals, schools). For these non-residential addresses, and also for rural residential addresses, the coordinates of the address are included; they were recorded by the GPS-enabled PDAs used by Census enumerators.
Warnings and problems with the dataset: Addresses were recorded by enumerators to facilitate the Census operation. IBGE does not guarantee the accuracy of the dataset. Street naming is a municipal responsibility, not under the jurisdiction of IBGE, the federal statistics agency. Furthermore, IBGE mentions that it does not curate the addresses recorded by enumerators. There are cases of differences in the spelling of street names. In order to conflate the dataset into a street names dataset, these spelling differences must be identified (by some sort of fuzzy/Levenshtein string matching).
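One possible way to spot spelling variants is sketched below in Python: names are normalized (uppercase, accents stripped, whitespace collapsed) and then greedily grouped by Levenshtein distance. The function names, the greedy grouping strategy and the max_distance threshold of 2 are illustrative assumptions, not a tested curation procedure; in practice the comparison would probably be restricted to names within the same enumeration district or municipality.

```python
import unicodedata


def normalize(name):
    """Uppercase, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.upper().split())


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def group_variants(names, max_distance=2):
    """Greedily group names whose normalized form is within max_distance
    edits of the first member of an existing group (threshold is assumed)."""
    groups = []
    for name in names:
        key = normalize(name)
        for group in groups:
            if levenshtein(key, normalize(group[0])) <= max_distance:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

Each resulting group is a candidate set of spellings for one street; a human or a reference list would still have to decide which spelling is correct.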
Enumeration Districts Shapefile
"Setores Censitários" (enumeration districts, EDs) are a partition of the national territory into small areas, each to be surveyed by just one Census enumerator. They must be contiguous and non-overlapping and contain from 250 to 350 households in urban areas and ~150 households in rural areas. Therefore, the area of an enumeration district can vary considerably, depending on the population density. Enumeration district limits tend to follow roads, highways, rivers, etc. In urban areas enumeration districts normally comprise 5 to 8 city blocks. For the 2010 Census Brazil was divided into 316,574 EDs.
Source: FTP server here. There is one shapefile per state (27 states), each in its own state folder. Within each shapefile, each polygon represents one enumeration district.
Details: Enumeration districts are identified by a 15-digit code: 7 digits for the municipality, 2 digits for the district, 2 digits for the subdistrict, and the remaining 4 digits to identify the enumeration district within the subdistrict.
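The code structure described above can be split mechanically. The short Python sketch below does so; the function name, the dictionary keys and the sample code used in the usage example are made up for illustration.

```python
def split_ed_code(code):
    """Split a 15-digit enumeration district code into its four parts:
    7 digits municipality, 2 district, 2 subdistrict, 4 district number."""
    if len(code) != 15 or not (code.isascii() and code.isdigit()):
        raise ValueError("expected a 15-digit code, got %r" % (code,))
    return {
        "municipality": code[0:7],
        "district": code[7:9],
        "subdistrict": code[9:11],
        "enumeration_district": code[11:15],
    }


# Usage with a made-up code, for illustration only.
parts = split_ed_code("355030801000001")
print(parts["municipality"], parts["enumeration_district"])
```

Keeping the code as a string rather than an integer preserves any leading zeros in the individual parts.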
Warnings and problems with the dataset: Although enumeration districts are contiguous and non-overlapping, the shapefiles that represent them have some problems:
- The most common problem is that polygons can be displaced (drifted in one direction) when compared to a satellite image. This distortion varies from municipality to municipality and may have arisen from the quality of the underlying maps and aerial images used to create the shapefiles. It would be nice to correct their placement, or at least to allow a buffer for drift when spatially matching with OSM data.
- There are also some topological problems:
  - A small fraction of enumeration districts are contained in another ED. This is because the larger ED incorrectly includes the area of the smaller ED.
  - ED borders are consistent within states, but polygons of contiguous EDs along state borders can overlap or have gaps in between.
Ideally these problems should be corrected in order to recast the EDs as a consistent topology, covering the country in a contiguous and non-overlapping manner.
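As a rough illustration of how EDs nested inside other EDs might be flagged, the Python sketch below uses a plain ray-casting point-in-polygon test and reports pairs where every vertex of one polygon falls inside another. This is only a candidate filter under simplifying assumptions (single-ring polygons given as lists of (x, y) vertices, no holes, a quadratic pairwise loop); the function names are hypothetical, and a real pipeline would more likely rely on a spatial database or geometry library.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test. polygon is a list of (x, y) vertices, with no
    repeated closing vertex. Points exactly on an edge are not handled
    specially."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_contained(districts):
    """Return (inner, outer) code pairs where every vertex of the inner
    polygon lies inside the outer polygon. districts maps ED code to a
    vertex list."""
    pairs = []
    for inner_code, inner in districts.items():
        for outer_code, outer in districts.items():
            if inner_code == outer_code:
                continue
            if all(point_in_polygon(p, outer) for p in inner):
                pairs.append((inner_code, outer_code))
    return pairs
```

Pairs flagged this way would still need to be inspected before the outer polygon is corrected, since a vertex-only test can misjudge strongly concave shapes.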
PDF images of Enumeration districts with Blocks and road names
Internally IBGE has a dataset of roads and blocks. However, because these were acquired from private vendors, IBGE cannot publish the digital files associated with the roads and blocks. IBGE does, however, provide a PDF image of the roads and blocks in each enumeration district.
Source: the PDF files are available here.
Mapper tmpsantos has created IBGETools, which extracts the images from within the PDFs and stitches them together to create a background layer for mappers using the JOSM or iD editors. The layers for urban and rural areas can be seen here. As you can see in the image below, mappers can transcribe road names from this background image. The stitching of images is not perfect, as you can see from the discontinuities in roads and the blank spaces. This quickly became a fundamental tool for mappers in Brazil, especially in smaller towns.
Other Datasets
- Correios, the national post office, provides a list of all street names. Although their georeferenced dataset is private and sold commercially, this list could be used to validate whether street names in CNEFE are spelled correctly. (to be completed... )
Projects
In summary there are two projects: (need to describe here with more detail...)
Import Road Names from CNEFE
Use CNEFE street names to name unnamed "roads" in OSM-Brasil. IBGE provides a PDF image with blocks and road names. There already exists a JOSM layer that, by stitching the PDF files together, provides a reference for users to manually transcribe street names. But that can be error-prone and slow. By spatially matching roads with enumeration district polygons we would create a suggestion list of existing road names for that area. Combining the PDF layer with the street name suggestion list, some sort of "MapRoulette" tasks could be created for users around the world to fill in the road names.
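The suggestion-list idea could be prototyped roughly as follows. This Python sketch takes the nodes of an unnamed OSM way and returns the CNEFE street names of every ED whose bounding box, enlarged by a drift buffer, contains at least one node. The function names, the data structures and the default buffer of 0.0005 degrees (about 55 m of latitude) are assumptions made for illustration; a bounding box is only a coarse prefilter, and an exact polygon test would be needed to narrow the list.

```python
def bbox(polygon, buffer=0.0):
    """Bounding box (min_x, min_y, max_x, max_y) of a vertex list,
    enlarged by buffer on every side."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs) - buffer, min(ys) - buffer,
            max(xs) + buffer, max(ys) + buffer)


def suggest_names(way_nodes, districts, names_by_district, buffer=0.0005):
    """Return the sorted CNEFE street names of every ED whose buffered
    bounding box contains at least one node of the way.

    way_nodes:          list of (lon, lat) tuples of one OSM way
    districts:          dict ED code -> list of (lon, lat) vertices
    names_by_district:  dict ED code -> iterable of street names
    buffer:             drift tolerance in degrees (assumed value)
    """
    suggestions = set()
    for code, polygon in districts.items():
        min_x, min_y, max_x, max_y = bbox(polygon, buffer)
        if any(min_x <= x <= max_x and min_y <= y <= max_y
               for x, y in way_nodes):
            suggestions.update(names_by_district.get(code, ()))
    return sorted(suggestions)
```

The resulting list could be shown next to the PDF background layer so that a mapper only has to pick the right name instead of typing it.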
Import Addresses from CNEFE
For roads that are already named, matching roads to CNEFE data would make it possible to insert street addresses. In order to match more ...
Code
Code will be available at this GitHub page: https://github.com/lucasmation/osm_cnefe_import
Legality
IBGE has not provided a clear license statement, but informal communications have been recorded showing that IBGE's data is indeed considered "public domain" [1] [2] [3].