Fortaleza/PMF Addresses Import/Data Conversion

From OpenStreetMap Wiki
Jump to navigation Jump to search
Proposal Data Conversion Work Plan Import Guide Progress

This step serves to describe how the acquisition and conversion of the original data was performed, converting them into OSM tag pattern. This step is only made once.

Data download

Raw data can be downloaded from the aforementioned SDI, through the following procedure:

  • Click on Downloads;
  • Cartografia Base (Base Cartography);
  • Quadras Fiscais (Tax Parcels) and;
  • Select CSV or SHP file, and then download it.

Warning: it could take a while to start the download, it is normal.

Conversion of address numbers

The conversion described here is from the CSV file due to the greater familiarity with this type of file. However, the same procedure can be done with QGIS using SHP, or any programming language.

  • Open Microsoft Excel (or any other spreadsheet editor) without opening the file
  • Go to the Data tab and click on Get Data - From Text/CSV
  • Select the 65001: UTF-8 encoding in the File Source tab so that all accents marks are read correctly
  • Load the data
  • Delete all columns, keeping only the following columns: endereco (address), bairro (neighborhood) and the_geom
  • Save the file in CSV format by clicking on File - Save As - CSV (Comma delimited). Warning: do not choose the CSV UTF-8 option as the accent mark problem will persist.
  • Close Excel and open the saved file directly (without performing the import procedure) by double clicking on the file.

General data cleaning

Initially we need to clean and fix the data first. For this, we will carry out the following procedure:

  • Add a new column (B) next to the endereco column and use the following formula: =TRIM(A2). Do this for all lines.
  • Copy all the lines of the created column and paste on top of the column endereco. A REF error will appear, but click on the paste options (press Ctrl) and select Values.
  • Clear the previously created column and now use the following formula, copying it to all rows in this column: =LEN(A2)-LEN(SUBSTITUTE(A2,",",""))>1' '
    • This procedure is for checking cells that contain more than one comma, as this will interfere with the separation of the street name and the house number.
  • Add another column next to the one created previously (C) and create a filter for each column, selecting the headers (first row) and going to the Home Page tab - Sort and filter - Filter
  • Filter only the TRUE results from the first column created (B) by clicking the arrow in the header of this column and selecting TRUE only.
  • With the data filtered, add the following formula in the first cell of the empty column on the side, then copy it to all the filtered cells below: =SUBSTITUTE(A2071,",","",1) (change A2071 by the first filtered cell).
  • Remove the filter created previously and create a new column next to the one created previously.
  • In this new column (D) use the following formula for all rows: =IF(A2="","",IF(B2=TRUE,C2,A2))
  • Copy all data from column D into column A, with the same procedure as done before (paste options - values).
  • Delete the created columns (B, C and D).


After having the data already cleaned, there is a need to adjust it to the OSM scheme. We start with the column Endereco. Initially, let's separate the street name and house number as follows:

  • Add a column to the right side of the Endereco column.
  • Select the entire column Endereco and go to the Data tab.
  • Click on Text to Columns - Delimited - Comma.
  • Create a new column again between the first and the created column and add the following formula in the first empty row, copying it to the ones below: =PROPER(A2)
  • Copy the adjusted values to the first column, delete the created column, and do the same procedure with the Bairros column.

Identification of incorrect values

We now have to identify house numbers that are not in a correct format, such as LOT or LIVRE (empty). For this, we have to carry out the following operation:

  • Convert column B (house number) to Text
  • Perform, one by one, in this order, the replacement of these terms to empty. For this, select the column B, go to Replace (Ctrl+H) and in the first option enter the terms below, and in the second option leave blank.
  • - (dash) (space) LOTE LOT LOO LO LT . (dot) _ (underline) ) (right parenthesis) ( (left parenthesis) L000 L00 L0 = (equal sign) LJ AREAL BL CL ESCOL ETEL ETL GLEBA GL INSTITUCIONAL LARGO LA LBQD LBD LB LC LD/C LD LE LF LG/H LG LH LIVRE LIII LI LK LL LM LN LP LQAT LQ LR LSN LS LU LV MERCA MON AREAVERDE AREA AZIO VAGO NESGA OOO OSN OVII P/00 P/0 CASA CC CEMIT CENTR CS D00 D0 ESTAC ESTAD ETE FUNDO G0 G2P G4M GOA IB IGREJA IGREJ II INST J027 JARDI P/L P/Q0 P/Q P/ PARQU PARTE PATIO PA PG10S PGB9S PL\ PL P9E11 P0 PORCA PQ PRACA PT0 PT PRAC Q/0 Q]0 Q000 Q00 Q0 R] REMANESCENTE REMAN RESER RQ SAO SITIO /0 TERRENOC TERRENOB TERRE TERF TERE TERB TERA TER TE TD TF TG VAZIP VAZI VAZ VAO VAIZO VAGA VAG VA VC VERDE VIZIN VI VZAIO VZ TV0 TVB TV ] {{ | + « 9EP10 9Q A0 A560 AP102 B12 B3 B7 A1 A2 /17 C0 D170 D4/5 D760 D881 D913 E01 E0 N04 N32 NS6


Some numbers were automatically converted to date due to the way Excel handles the data. We have to consult the original database to correct these data. For this, we do the following:

  • In column B filter date values only
  • Format selected cells as TEXT
  • Replace the values of the following cells, respectively:
    • A83074 - 11
    • A101470 - 6
    • A107869 - 1
    • A131008 - 6
    • A202050 - 14
    • A242250 - 1
    • A246407 - 7
    • A271889 - 7
    • A289598 - 12
    • A293310 - 4
    • A296494 - 2
  • After conversion Excel will return an error for each cell. Accept the suggestion to convert to number.

Manual identification of incorrect values

We now have to perform a manual cleaning of values that were not possible to be cleaned previously. For this, we will carry out the following procedure:

  • We initially filter, in column B, values ​​that have a slash (/).
    • For values ​​that have only one slash or slash and only one number (/4, for example), manually remove the slashes.
    • Cells that have something like 2/A, remove the slash, leaving as 2A.
    • For those cells with something like T/V or QD6, delete them as well.
    • Finally, for values ​​that look like 14/15 or L4/5, remove the letter, the slash and the number after it.
  • Now we will filter the values ​​with the letter L.
    • Initially we filter the values ​​that contain the letter L in column B, performing the simple filtering on this column.
    • Now we need to separate the values ​​that only start with the letter L. To do this, we create a new column and insert the following formula in all cells: =LEFT(B2,LEN("L"))="L"
    • In this new column, filter the FALSE values ​​and manually clear the existing values ​​(i.e. leave only the values ​​that look like 46L).
    • Now filter the TRUE values ​​in the created column and manually search for extraneous values ​​such as L4A6 or L1OA. When making the correction, accept the suggestion to convert the cell to number.
    • Finally, in column B select all values ​​and remove all L, with the Replace function.

We will perform the same type of procedure with all other remaining letters. For this, we will proceed as follows:

  • Clear the entire previously created column.
  • Now we perform the same procedure, looking for the letter P.
    • We filter only the TRUE values from column C and then search for the letter P in column B.
    • Add the following formula to the empty column: =LEFT(B734,LEN("P"))="P" and copy this formula to all the rows below.
    • In this new column, filter the FALSE values and manually clear the existing values (i.e. leave only the values that look like 4P).
    • Now filter the TRUE values in the created column and manually search for extraneous values such as P1E2 or L1OA. When making the correction, accept the suggestion to convert the cell to number.
    • Finally, in column B select all values and remove all P, with the Replace function.

Now do the same procedure as before, in this order, with the letters T and V.

With the letter Q, we will do a slightly different procedure:

  • Clear the entire previously created column.
  • Now we perform the same procedure, looking for the letter Q.
    • We filter only the TRUE values from column C and then search for the letter Q in column B.
    • Add the following formula in the empty column: =LEFT(B132,LEN("Q"))="Q" and copy this formula to all the rows below.
    • In this new column, filter the FALSE values and manually clear the existing values (ie leave only the values that look like 4P).
    • Now filter the TRUE values in the created column and delete all found values.

Finally, do a manual cleanup of the awkward values that are left, which are now few.

Identification of addresses without number

Let's now filter out addresses without numbers. For this, we have to proceed as follows:

  • Filter, in column B, addresses with number 0.
  • After filtered, delete all
  • Re-filter values that contain SN and delete all of them too
  • Re-filter the values containing Y/N and delete all of them
  • Also delete the value S\N

Separation of numbers with addr:unit

We will now identify the addresses that contain details in the numbering (such as 50A), and separate them (such as addr:housenumber=* to 50 and addr:unit=* to A) as follows:

  • Add four columns to the right side of the port numbering column (C, D, E and F) and add the following formula in the first empty line, copying to the ones below: =ISTEXT(B2).
  • Filter only the TRUE results from the first column created (C) by clicking the arrow in the header of this column and selecting TRUE only.
  • With the data filtered, add the following formula in the cell of the first empty column next to it (D), copying then to all the filtered cells below: =LEFT(B13,SUM(LEN(B13)-LEN(SUBSTITUTE(B13,{"0","1","2","3","4","5","6","7","8","9"},""))))
  • Add the following formula in column AND: =RIGHT(B13,LEN(B13)-SUM(LEN(B13)-LEN(SUBSTITUTE(B13,{"0","1","2","3","4","5","6","7","8","9"},""))))
  • Remove all active filters and in the column F add the formula =IF(C2=TRUE,D2,(IF(ISBLANK(B2),"",B2)))
  • Create a new column G and copy and paste all the lines from column F into it. When pasting, just paste the values ​​as described in previous steps.
  • Again, create a new column H and copy and paste all the lines from column E into it. When pasting, paste only the values ​​as described in previous steps.
  • Delete columns B to F, and in the remaining columns add the tags addr:street=*, addr:housenumber=*, in the first row, addr:unit=* and addr:suburb=*, respectively.
  • Finally, add a new column labeled addr:city=* and add Fortaleza to every row.

Great, now we have the numbers correctly in our file! We then proceeded to correct the names of neighborhoods and streets.

Name conversion

Standardization of neighborhood names

Initially we will standardize the names of the neighborhoods, which are not entirely correct due to minor problems with accent marks or capital letters. For this, we will do practically all the work manually.

  • In the column of neighborhoods select all values.
  • Go to Find & Select and then Replace (or the command Ctrl+H).
  • Replace the following values, in this order (value to be fetched and value to be replaced, respectively), and then press Replace All:
    • Alto Da Balança - Alto da Balança
    • Barra Do Ceará - Barra do Ceará
    • Boa Vista / Castelão - Boa Vista/Castelão
    • Cais Do Porto - Cais do Porto
    • Cidade Dos Funcionários - Cidade dos Funcionários
    • Conjunto Ceará Ii - Conjunto Ceará II
    • Jardim Das Oliveiras - Jardim das Oliveiras
    • João Xxiii - João XXIII
    • José De Alencar - José de Alencar
    • Pirambú - Pirambu
    • Praia De Iracema - Praia de Iracema
    • Praia Do Futuro I - Praia do Futuro I
    • Praia Do Futuro Ii - Praia do Futuro II
    • Sapiranga / Coité - Sapiranga/Coité
    • Vicente Pinzon - Vicente Pinzón

Standardization of street names

Finally, we must correct the street names. We suggest doing this by neighborhood (smallest to largest, as suggested in the work plan), so that the work is more organized and can be done concurrently with the import.

  • We initially put the first column (name of the streets) in alphabetical order.
  • Then we filter the desired neighborhood.
  • We then corrected the street names manually. Anyway, as the names are arranged in alphabetical order, just correct the first value and copy the name to the cells below.

Doing this in this way, separating the work by neighborhoods, the work is not so heavy and it is possible to do this in stages (fix a neighborhood, the import is carried out, then correct another neighborhood successively).

Data conversion to OSM standard

Finally, with the data already cleaned and standardized, we have to convert this CSV file to an OSM file. For this, we will use QGIS:

  • After opening QGIS, go to menu - Layer - Add Layer - Add Delimited Text Layer (Ctrl+ Shift+T).
  • Open previously created file, select semicolon as delimiter, first record has field names, and in Geometry Definition select Well known text (WKT), then select last column that contains polygon data, and select the type of geometry as polygon.
  • When adding the polygons with the addresses, we now have to convert them to points and leave it in a file ready to be read in JOSM. For this, we do it like this:
    • Menu - Vector - Geometries - Centroids
    • Select data layer and perform processing.
    • With the nodes layer ready, export the file as SHP. It can be saved to another type of file, but current experience recommends saving to SHP so JOSM can open it without problems.