Talk:Hamilton County Building Import

Building use

The buildings dataset's buildingusecat field has C and R values that pretty obviously mean "commercial" and "residential", respectively, but what about M and V? "Manufacturing" and "various"? – Minh Nguyễn (talk, contribs) 22:43, 24 October 2018 (UTC)

After looking at this again, I'm wondering if we can actually use this information. I'm seeing R buildings in Spring Grove Cemetery and C buildings where people actually live like retirement homes. My best guess is that M means something like 'multiple', but that seems a bit iffy too. Multiple units? Multiple uses? I'm seeing this one in business districts and apartment buildings. As for V, I see that on some churches and libraries, but also on university and public buildings, but I'm not sure this is consistent either. Nate Wessel (talk) 17:53, 25 October 2018 (UTC)

cwwuse=MLTFM looks like it's for multiple-family buildings, such as duplexes. building=semidetached_house exists for mapping each individual unit in a duplex. building=multifamily is undocumented and used less than a hundred times. If there isn't anything else that jumps out as a suitable value, I guess building=residential is OK. – Minh Nguyễn (talk, contribs) 03:05, 6 November 2018 (UTC)

Are we sure that cwwuse=GENBUS should consistently be building=commercial, versus building=retail? I guess that's a tough call to make from the data we have; maybe a good opportunity for a MapRoulette challenge after the import. If the distinction between INDUST and MNFTRG is about light versus heavy manufacturing, we might be able to add man_made=works or something to heavy manufacturing buildings. – Minh Nguyễn (talk, contribs) 03:11, 6 November 2018 (UTC)

Multiple-address buildings

I'm guessing the buildings and parcels datasets don't provide detail down to individual units within an apartment building or office building. But how do they represent buildings with multiple addresses, particularly duplexes and condominiums? – Minh Nguyễn (talk, contribs) 22:45, 24 October 2018 (UTC)

It looks like there are actually repeated entries in the parcel dataset for buildings like this. The parcel geometry appears to be identical but the attributes associated with each record are different and list a few different adjacent addresses. I can't say it does that consistently, but where I see it happening, it looks like a set of rowhouses or connected apartments. Nate Wessel (talk) 03:51, 30 October 2018 (UTC)

Once I conflate multiple parcel addresses with a single building, how should I structure the tag? It seems there isn't consensus on this. For example, if I have the following: 11813;11815;11817;11819;11821;11823;11825;11827;11829;11831 should I make the tag addr:housenumber=11813-11831? I suppose we could always come back and fix these later. Nate Wessel (talk) 14:52, 31 October 2018 (UTC)

In that example, the semicolon-delimited list is unwieldy but avoids a loss of detail that would result from putting in just a single range. So I'd keep the list for now. Another approach is to keep the addresses as free-floating address nodes and distribute them evenly throughout the building. That's more or less how the municipal data was already structured for the New Orleans building import, but I can see how that would be challenging given this parcel dataset. – Minh Nguyễn (talk, contribs) 02:35, 1 November 2018 (UTC)
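To illustrate the trade-off described above, here is a minimal sketch comparing the two forms. The function names are hypothetical, the input is assumed to be a non-empty list of house number strings for one building, and the sample numbers are shortened from the example in this thread.

```python
def sort_key(number):
    # Purely numeric house numbers sort by value; anything else
    # (for example "1220-1") sorts after them, alphabetically.
    return (0, int(number), "") if number.isdigit() else (1, 0, number)


def as_list(numbers):
    """Semicolon-delimited value: keeps every individual house number."""
    return ";".join(sorted(set(numbers), key=sort_key))


def as_range(numbers):
    """Single range: shorter, but drops everything between the endpoints."""
    ordered = sorted(set(numbers), key=sort_key)
    return ordered[0] if len(ordered) == 1 else f"{ordered[0]}-{ordered[-1]}"


numbers = ["11817", "11813", "11815", "11813"]
print(as_list(numbers))   # 11813;11815;11817
print(as_range(numbers))  # 11813-11817
```

The range form cannot tell a data consumer whether 11814 or 11816 exist, which is the loss of detail mentioned above.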

There appear to be 4695 buildings with potentially multiple addresses. Sometimes it's one building per parcel, sometimes they are totally separate. I think we'll need to give this a bit more thought... I don't want random address points floating around, and I don't want to assign the same list of addresses to multiple buildings. Nate Wessel (talk) 15:01, 1 November 2018 (UTC)

Vintage

The Building Footprints dataset says the footprints are primarily based on aerial imagery from spring 2011, with a promise of updates from spring 2017 being available in April 2018. Have the spring 2017 updates been added to this dataset or somewhere else? Or can we count on the dataset's timestamp of August 20, 2018 (same as the parcels dataset), to be accurate? – Minh Nguyễn (talk, contribs) 22:52, 24 October 2018 (UTC)

I haven't kept up on developments since I moved, but I do see that the new buildings south of the UC's main campus aren't included in this dataset. That development was finished about four years ago I think. The new shops and residences between Calhoun and McMillan. Nate Wessel (talk)

Also, there is no Casino and none of the new buildings in Washington Park. I think that makes it at least five years out of date if my timeline is right. Nate Wessel (talk) 18:07, 25 October 2018 (UTC)

Conflation

The proposal currently states that the import will leave any existing building alone and avoid importing any building that intersects. While that simplifies the import considerably, I think we should try harder to use CAGIS building footprints in cases of conflicts, for the following reasons:

A great many of the buildings in Cincinnati and Loveland were very crudely drawn by me in Potlatch against Yahoo! imagery at z17. I don't feel any attachment to the shapes of the buildings I drew, just the tags.
Ensuring the use of CAGIS building geometries can simplify address assignment. Especially in Over-the-Rhine, I probably made lots of errors identifying outer walls, such that some buildings are inappropriately combined while others are inappropriately split.
Even where buildings are correctly drawn in OSM, there are likely many that could use the addition of heights, floor counts, and names from CAGIS data.
If we don't deal with these conflicting buildings now, it becomes more difficult to bring in these attributes later in a separate import.

Here are two possible approaches to handling conflation:

Include conflicting buildings in the initial import, relying on contributors to manually conflate one building at a time within each task. Rely on JOSM validation to detect conflicting buildings. Contributors will need to split up tasks where OSM already has dense building coverage.
Exclude conflicting buildings from the initial import round, but set up a separate Tasking Manager project for manual conflation of those buildings afterwards. By and large, this becomes an exercise in copying tags from OSM buildings to CAGIS buildings and deleting the OSM buildings.

– Minh Nguyễn (talk, contribs) 01:27, 25 October 2018 (UTC)

I ran some numbers and Minh, you are the last editor on 40,903 buildings that intersect with buildings in the import dataset... quite the accomplishment! That's 65% of all conflated buildings. Most of those (32k) are from 2011 or earlier. I'm not sure I love the idea of updating geometries across the board though - I'd rather keep things that were done by OSM users unless we can really clearly say that their geometries are inferior. Perhaps we could make a list of buildings to check by searching for a lack of right angles and/or exceptionally poor overlap between the two datasets. Nate Wessel (talk) 16:13, 26 October 2018 (UTC)
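One way the "lack of right angles" check could be sketched, assuming each building is available as a closed ring of projected (x, y) coordinates. The function names and the 10-degree tolerance are hypothetical, and the overlap comparison is not covered here.

```python
import math


def corner_angles(ring):
    """Angle in degrees between the two edges meeting at each corner.

    `ring` is a closed ring [(x, y), ...] whose last point repeats the first.
    """
    pts = ring[:-1]
    n = len(pts)
    angles = []
    for i in range(n):
        ax, ay = pts[i - 1]
        bx, by = pts[i]
        cx, cy = pts[(i + 1) % n]
        v1 = (ax - bx, ay - by)
        v2 = (cx - bx, cy - by)
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        cross = v1[0] * v2[1] - v1[1] * v2[0]
        angles.append(abs(math.degrees(math.atan2(cross, dot))))
    return angles


def lacks_right_angles(ring, tolerance=10.0):
    """True if no corner is within `tolerance` degrees of 90."""
    return all(abs(a - 90.0) > tolerance for a in corner_angles(ring))


square = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
blob = [(0, 0), (4, 0), (1, 1), (0, 0)]
print(lacks_right_angles(square))  # False
print(lacks_right_angles(blob))    # True
```

Buildings flagged this way would only be candidates for a manual look, since plenty of real buildings have no square corners.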

I'd also like to find a way to keep the way history on any OSM buildings while updating only the tags and nodelist. That way we can maintain continuity in the dataset. I really don't want to erase the history of any contributions by OSM users, including the famous Minh Nguyễn. If you were the first to create a particular building, I want that to show in the version history. Nate Wessel (talk) 16:13, 26 October 2018 (UTC)

I appreciate your interest in keeping my editing legacy intact, as messy as that legacy may be. ;^) It looks like this JOSM plugin lets you replace the geometry of a way with that of another way while preserving (as much as possible) the history of the original way and its nodes. That said, it's totally reasonable to save conflicting buildings for a second pass where we can conflate more slowly and manually. Given that excluding conflicting buildings would omit well over a third of the CAGIS dataset, let's document that second pass in this proposal, so that we don't have to go through the whole import process all over again. – Minh Nguyễn (talk, contribs) 23:21, 28 October 2018 (UTC)

Agreed. It makes sense to cover it here since it's the same datasource and all that, and I do think it will be easier to do one thing at a time. Can we set up two things in the tasking manager? I could try to partition it so that the conflation part is done in smaller chunks since I would expect that to go a lot slower than a plain import. Nate Wessel (talk) 03:56, 30 October 2018 (UTC)

Sure, we can have two separate projects in the tasking manager, one that we create now for straightforward additions and another for cases that need manual attention. The latter project can be created now but left private until we're ready to start working on it. – Minh Nguyễn (talk, contribs) 10:14, 31 October 2018 (UTC)

Should we also consider a building to conflict if it intersects with a building that was recently deleted in OSM? For example, Erkenbrecker Avenue near Children's Hospital was recently rerouted over top of several houses; if the CAGIS data is old enough, it'll restore buildings that have already been demolished and deleted. – Minh Nguyễn (talk, contribs) 01:33, 25 October 2018 (UTC)

The parcel dataset actually has a field for the market value of the improvement (i.e. building). Searching for 0 values makes a half decent map of vacant lots, though there are places where that doesn't seem to hold, like the UC main campus where it expanded into Corryville. That might actually be helpful in deleting more demolished OSM buildings. I imagine we'll mostly be in the clear though if we try to follow imagery that's recent to the last year or so when importing buildings. If you know how to get geometries of recently deleted buildings, that would be great too! Nate Wessel (talk) 04:06, 30 October 2018 (UTC)

This Overpass query on augmented diffs returns buildings changed in Cincinnati since the beginning of the year. Deletions are contained in <action type="delete"> elements. Unfortunately, expanding the query to the rest of Hamilton County exceeds the API's memory limit, so you'd have to query other parts of the county independently and combine the results. Even for cases where a building has been deleted from OSM, I think we'll still need to include the intersecting CAGIS building in the second pass. The OSM deletion could've merely been someone replacing a way with another way representing the same physical building, so height and other attributes might still be applicable. – Minh Nguyễn (talk, contribs) 10:14, 31 October 2018 (UTC)
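A minimal sketch of pulling deleted building geometries out of such a response. It assumes each <action type="delete"> element wraps the last visible version of the way in an <old> child with coordinates on its <nd> elements. The sample document, IDs, and coordinates below are made up for illustration and are not taken from the actual query.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the assumed augmented diff layout.
SAMPLE_ADIFF = """<osm version="0.6">
  <action type="delete">
    <old>
      <way id="1001" version="2">
        <nd ref="1" lat="39.1400" lon="-84.5000"/>
        <nd ref="2" lat="39.1400" lon="-84.4999"/>
        <nd ref="3" lat="39.1401" lon="-84.4999"/>
        <nd ref="1" lat="39.1400" lon="-84.5000"/>
        <tag k="building" v="house"/>
      </way>
    </old>
    <new>
      <way id="1001" version="3" visible="false"/>
    </new>
  </action>
  <action type="modify">
    <old><way id="1002" version="1"><tag k="building" v="yes"/></way></old>
    <new><way id="1002" version="2"><tag k="building" v="retail"/></way></new>
  </action>
</osm>"""


def deleted_buildings(adiff_xml):
    """Return (way id, [(lat, lon), ...]) for every deleted building way."""
    root = ET.fromstring(adiff_xml)
    results = []
    for action in root.iter("action"):
        if action.get("type") != "delete":
            continue
        for way in action.findall("old/way"):
            tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
            if "building" not in tags:
                continue
            ring = [
                (float(nd.get("lat")), float(nd.get("lon")))
                for nd in way.findall("nd")
                if nd.get("lat") is not None
            ]
            results.append((way.get("id"), ring))
    return results


for way_id, ring in deleted_buildings(SAMPLE_ADIFF):
    print(way_id, len(ring))
```

Because the county has to be queried in pieces, the lists from each response would need to be concatenated and de-duplicated by way ID before intersecting them with the CAGIS footprints.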

Dividing tasks

The tasking manager lets us either divide the project into tasks of equal area or import arbitrarily shaped tasks. The New Orleans building import was divided into precincts, which made things go really smoothly (since each precinct is about equal in population). The San José sidewalk import was divided into TAZs (similar to census blocks) for the same effect. Is there a granular enough set of boundaries covering the whole county that we could use for this purpose? – Minh Nguyễn (talk, contribs) 19:44, 25 October 2018 (UTC)

I like those recursive grids I've seen used in the tasking manager before. That would probably be the best way to divide up the work about evenly, and it would have the benefit of fitting neatly in a square editing window. Nate Wessel (talk) 20:05, 25 October 2018 (UTC)

By default, the tasking manager divides the project into large squares the size of townships. As the project creator, I can make the squares more granular, but they all have to be the same size. Afterwards, a participant can take an individual task and subdivide it into smaller squares. We'd want participants to avoid doing this after the fact, because the tasks won't line up with the data being imported. But we could quickly split up urban tasks before finalizing the data to import and opening the project up to contributors. – Minh Nguyễn (talk, contribs) 23:26, 28 October 2018 (UTC)

Conflated building review

I spotted the following issues skimming over this file. – Minh Nguyễn (talk, contribs) 18:51, 9 November 2018 (UTC)

Thanks for being so thorough! I'll try and respond to most of these. – Nate Wessel (talk) 19:59, 9 November 2018 (UTC)

I rethreaded the conversation, because my numbering system fell apart quickly. :^) – Minh Nguyễn ^💬 04:47, 13 November 2018 (UTC)

Numbered streets are spelled out in CAGIS, while in OSM we're putting digits in name=* and sometimes spelled-out names in alt_name=*. Maybe it isn't a problem; not sure. – Minh Nguyễn (talk, contribs) 18:51, 9 November 2018 (UTC)

This is probably mostly a conflated buildings issue as downtown buildings are pretty much mapped. – Nate Wessel (talk) 19:59, 9 November 2018 (UTC)

"St" at the beginning of the street name should become "Saint". – Minh Nguyễn (talk, contribs) 18:51, 9 November 2018 (UTC)

There are at least 1,000 cases of this. Good catch! – Nate Wessel (talk) 19:59, 9 November 2018 (UTC)

The full list of abbreviations is in this specification, under "Rules for abbreviations and spellings of street names". To be super accurate, we'd want to somehow join on the MAPLABEL column of the centerline dataset. But since the addr:street=* values would at best match nearby OSM streets that we've cleaned up, joining probably isn't worth the additional effort. – Minh Nguyễn ^💬 05:19, 13 November 2018 (UTC)

I just checked for the rest of the abbreviations and apparently we had a lot of 'Mount's as well! These should all be corrected now. Nate Wessel (talk) 01:15, 13 December 2018 (UTC)
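For reference, the leading-abbreviation fix could look roughly like the sketch below. Only "St" → "Saint" is stated explicitly in this thread; the "Mt" spelling for "Mount" is an assumption, the full table lives in the specification mentioned above, and the function name and sample street names are hypothetical.

```python
# Abbreviations expanded only when they are the FIRST word of the name.
# "St" is from this discussion; "Mt" is an assumed spelling for "Mount".
LEADING_ABBREVIATIONS = {"St": "Saint", "Mt": "Mount"}


def expand_leading_abbreviation(street_name):
    """Expand an abbreviation at the start of a street name, if present."""
    first, sep, rest = street_name.partition(" ")
    key = first.rstrip(".")
    if sep and key in LEADING_ABBREVIATIONS:
        return LEADING_ABBREVIATIONS[key] + " " + rest
    return street_name


print(expand_leading_abbreviation("St Clair Avenue"))  # Saint Clair Avenue
print(expand_leading_abbreviation("Main St"))          # Main St
```

Restricting the rule to the first word keeps a trailing "St" (Street) from being turned into "Saint".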

"McGregor" is spelled "Mcgregor". (By comparison, in TIGER data, the street name is all caps but there's a space after "Mc", making this case easy to detect.) – Minh Nguyễn (talk, contribs) 18:51, 9 November 2018 (UTC)

I take it that "McGregor" is correct? – Nate Wessel (talk) 19:59, 9 November 2018 (UTC)

If it was "Mc Gregor" or "MC GREGOR" in the dataset, then it should be "McGregor". Probably the same for "Mcgregor", but "Macgregor" would be less certain. – Minh Nguyễn ^💬 05:19, 13 November 2018 (UTC)

Fun facts. It seems like capitalizing the third or fourth letter isn't always correct, but looking through some of the 'mc' street names I'm seeing what looks to me like a lot of proper names, so I'll go ahead and capitalize the third letter for 'Mc*' names. As for 'Mac*', most of these are fine as is, but I'll capitalize Mac(N)icholas and Mac(A)rthur. Nate Wessel (talk) 01:27, 13 December 2018 (UTC)
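The rule described above could be sketched as follows: capitalize the third letter of every 'Mc*' word, and fix only the two named 'Mac*' exceptions. The function names and the sample street types are hypothetical.

```python
import re

# Only these two 'Mac*' names are changed; all other 'Mac*' words stay as is.
MAC_EXCEPTIONS = {"Macnicholas": "MacNicholas", "Macarthur": "MacArthur"}


def fix_word(word):
    if word in MAC_EXCEPTIONS:
        return MAC_EXCEPTIONS[word]
    if re.fullmatch(r"Mc[a-z]+", word):
        # Capitalize the third letter: "Mcgregor" -> "McGregor".
        return "Mc" + word[2].upper() + word[3:]
    return word


def fix_street_name(name):
    return " ".join(fix_word(word) for word in name.split(" "))


print(fix_street_name("Mcgregor Avenue"))  # McGregor Avenue
print(fix_street_name("Macarthur Court"))  # MacArthur Court
print(fix_street_name("Macgregor Lane"))   # Macgregor Lane
```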

Especially downtown, a building may occupy an entire block, resulting in multiple values in addr:housenumber=* and often addr:street=*, for example addr:housenumber=1212;123 addr:street=Clay Street;Thirteenth Street. The worst examples I found were the Convention Center, Children's Hospital, and several buildings on the UC campus, including Marge Schott Stadium, the combined Rec Center / Armory Fieldhouse / Shoemaker Center building, and the Campus Green and University Avenue garages. I think we should always avoid multiple values in addr:street=* at all costs, even if that means leaving bare address nodes in the centroid of the parcel that we have to manually move later. With the current tagging, a geocoder won't be able to associate one of the house numbers with one of the street names. – Minh Nguyễn (talk, contribs) 18:51, 9 November 2018 (UTC)

I would suggest removing address information from these buildings entirely for the purposes of this import. There appear to be about 1,200 buildings with multiple street addresses. Or could this be flagged during validation and cleaned manually? – Nate Wessel (talk) 19:59, 9 November 2018 (UTC)

There's little a human reviewer could do about a way with eight values in addr:housenumber=* and two in addr:street=*. We're already able to associate each of these addresses with a building, so if it doesn't seem feasible to distribute the addresses evenly throughout the building, I think it'd be fine to plop them on the building's centroid for now and spread them out a bit during validation. – Minh Nguyễn ^💬 05:19, 13 November 2018 (UTC)

I think what we're seeing with these though is that the address conflation has failed to properly map an address from a parcel to a building. This may be something to look at systematically in a later step, but I think for this import these just need to be dropped until we either find a better address dataset or a better way of mapping buildings to parcels. Better not to have an address than a potentially wrong address! Nate Wessel (talk) 00:57, 13 December 2018 (UTC)
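A minimal sketch of the "drop it for now" option, assuming each building's tags are held in a plain dictionary and that a semicolon in addr:street=* is the signal of a failed parcel-to-building match. The function name is hypothetical, and the sample values come from the example earlier in this thread.

```python
def drop_ambiguous_address(tags):
    """Remove all addr:* tags when addr:street lists more than one street."""
    if ";" in tags.get("addr:street", ""):
        return {k: v for k, v in tags.items() if not k.startswith("addr:")}
    return dict(tags)


building = {
    "building": "yes",
    "addr:housenumber": "1212;123",
    "addr:street": "Clay Street;Thirteenth Street",
}
print(drop_ambiguous_address(building))  # {'building': 'yes'}
```

Buildings with several house numbers on a single street are left untouched by this check, since a geocoder can still pair each number with the one street.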

I noticed some buildings that have addr:housenumber=* but no addr:street=*, such as one building at the corner of Clay and 13th streets. – Minh Nguyễn (talk, contribs) 18:51, 9 November 2018 (UTC)

These are assigned from parcels at the same time, so this probably indicates missing data in the parcel dataset. – Nate Wessel (talk) 19:59, 9 November 2018 (UTC)

Way 32895664 ended up with addr:housenumber=11294-. – Minh Nguyễn (talk, contribs) 18:51, 9 November 2018 (UTC)

I'm seeing a couple hundred buildings with addresses like "1220-1". This is the more typical pattern for housenumbers with hyphens, and the one you saw seems to be missing a number at the end. I'm not sure how these should be interpreted though, as it's unlikely a building has both "1220" and "1221". Many of these have even-odd combinations like that. – Nate Wessel (talk) 19:59, 9 November 2018 (UTC)

From this query, I think we can be confident that this column doesn't contain any ranges and that the hyphens are part of the address. I suppose we could put the second part in addr:unit=*, but leaving it in addr:housenumber=* is the safer option. – Minh Nguyễn ^💬 05:19, 13 November 2018 (UTC)

On at least one street in Loveland, there were conflicts between house numbers in CAGIS and those I guessed based on TIGER address interpolation ranges. I'm glad we're going with CAGIS in these cases. :^) – Minh Nguyễn (talk, contribs) 18:51, 9 November 2018 (UTC)

Hopefully it's more accurate! ;-) – Nate Wessel (talk) 19:59, 9 November 2018 (UTC)

For the addresses that didn't match any buildings, can we plop a standalone address node at the centroid of the parcel? We could set up a MapRoulette challenge to match them up to buildings or landuse areas afterwards, if that seems like it'll yield good results, but even bare address nodes are valid in OSM. – Minh Nguyễn (talk, contribs) 18:51, 9 November 2018 (UTC)

Based on what I've seen so far, my gut says that parcel centroids would end up in some pretty wild places. Wasn't there an address point dataset from CAGIS? If so, it would probably be best to import such points later and conflate them with yet unaddressed buildings. – Nate Wessel (talk) 19:59, 9 November 2018 (UTC)

There is the master address file that's part of the CAGIS quarterly release, but I haven't seen any indication that it's suitably licensed for OSM. – Minh Nguyễn ^💬 05:19, 13 November 2018 (UTC)

Some street addresses contain fractions or errant slashes. [1] – Minh Nguyễn ^💬 05:19, 13 November 2018 (UTC)

Thankfully I only see four of these in the whole dataset. Nate Wessel (talk) 00:57, 13 December 2018 (UTC)

Driveways

669 buildings, including a bunch of houses in Mt. Lookout and Cheviot, have driveways connecting to them. The conflated building dataset omits the nodes that connect the driveways to the buildings, but I think we can address the issue manually after the import. Overall, 1,543 buildings have highways of some sort connecting to them (which may be footways). – Minh Nguyễn ^💬 17:50, 13 December 2018 (UTC)

I think manual conflation is the only/best way to address this. The connecting nodes would have to be added to the import data somehow as they're not there already. And if the alignment isn't perfect (which it won't be) the highway will need to be edited as well. These actually look like pretty solid buildings for the most part - perhaps we can just copy any new tags over for most of these. Nate Wessel (talk) 18:02, 13 December 2018 (UTC)

That's fine. But we should make sure to run the query above right before starting the second phase of the import. We could manually add entrance=* nodes at the same time we reconnect the buildings. – Minh Nguyễn ^💬 07:24, 24 December 2018 (UTC)

Congratulations, well done so far

Simple congratulations from me for a well-researched, well-documented and well-communicated Import Plan. May it be well-implemented as you unroll your plans — good luck! Happy mapping, Stevea (talk) 20:18, 13 December 2018 (UTC)