Talk:HIFLD/United States Poultry Import

From OpenStreetMap Wiki

Names

Thank you for posting this file for inspection. It would've been great if you had done this at the outset; you might've earned some goodwill with the community for engaging early. But what's done is done. I've taken a glance at your file, and some things stick out to me regarding the name=* values you've assigned to each feature.

It's good that you've considered normalizing the case of each name. However, title-casing is not enough, and you haven't done proper title-casing. Title case in English is not as simple as capitalizing the first letter of every word. In fact, there's no one standard for capitalization in American English, and this dataset (like many government datasets) only stores the names in uppercase, resulting in data loss. Strictly speaking, it isn't possible to get these names correct without detailed research on a case-by-case basis. However, it would be possible to produce a capitalization good enough for OSM by taking some additional steps:

  • Lowercase articles, conjunctions, and prepositions shorter than four characters. For example, 4,412 features contain "And" in the middle of their name=* values, 131 contain "Of", and 49 contain "The".
  • Uppercase acronyms and initialisms, including but not limited to "Amco", "Bmc", "Ok", and "Usa".
  • Uppercase the letter following the Irish and Scottish prefixes "Mac" and "Mc".

There are popular, well-tested libraries for performing title casing in many programming languages. I'd suggest looking into one before spending too much time on this problem.
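As a rough illustration of the three steps listed above, here is a minimal Python sketch. The word lists, the `fix_case` name, and the `MAC_SURNAMES` allowlist are assumptions made for this example, not part of the dataset or of any library. "Mc" is handled by a pattern, while "Mac" is limited to an explicit list because a blanket rule would turn "Machine" into "MacHine".

```python
import re

# Illustrative word lists (assumptions); a real import would need longer ones.
SMALL_WORDS = {"a", "an", "the", "and", "but", "or", "nor", "for",
               "of", "in", "on", "at", "to", "by"}
ACRONYMS = {"AMCO", "BMC", "OK", "USA"}
# "Mac" is only safe for known surnames (hypothetical allowlist).
MAC_SURNAMES = {"MACDONALD", "MACGREGOR"}

def fix_case(name):
    """Re-case an all-uppercase or naively title-cased name."""
    words = name.split()
    fixed = []
    for i, word in enumerate(words):
        upper, lower = word.upper(), word.lower()
        if upper in ACRONYMS:
            # Acronyms and initialisms stay fully uppercase.
            fixed.append(upper)
        elif upper in MAC_SURNAMES:
            # "MACDONALD" -> "MacDonald"
            fixed.append("Mac" + lower[3:].capitalize())
        elif 0 < i < len(words) - 1 and lower in SMALL_WORDS:
            # Short articles, conjunctions, and prepositions are lowercased,
            # except as the first or last word.
            fixed.append(lower)
        else:
            cased = lower.capitalize()
            # Uppercase the letter after the "Mc" prefix.
            cased = re.sub(r"^Mc([a-z])",
                           lambda m: "Mc" + m.group(1).upper(), cased)
            fixed.append(cased)
    return " ".join(fixed)

print(fix_case("SMITH AND SONS OF THE USA"))
print(fix_case("Mcdonald Meats"))
```

This sketch splits on whitespace only, so hyphenated words and words wrapped in punctuation would still need separate handling. An established title-casing library is likely to cover more of those cases.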

Aside from case normalization, here are some other significant issues with name=* as you're using it:

  • Most of these features' name=* values are not the names of the facilities themselves, but rather the names of the companies that own and/or operate the facilities. These values should go in operator=*. Some may be appropriate for name=* as well, such as "Omaha Steaks International Incorporated - F Street". But in general, we need to discuss as a community whether we want these facilities to go nameless or have names corresponding to their operators.
  • 144 features' name=* values contain "Doing Business As". The operator=* should be either the official legal registered name that precedes this phrase or the trade name that follows this phrase, but not both. It's never correct to refer to the full name with d/b/a outside of a handful of government forms. The dataset does have a separate DBA1 field, but it mostly goes unused, a reflection of the quick-and-dirty nature of this dataset.
  • Names are truncated at 50 characters, resulting in 362 names that end mid-word. This should hopefully be less of a problem after addressing the "Doing Business As" issue, but some truncated values will remain, such as "Original Taco House Catering And Annex Food Compan". Perhaps you could extract these values, stick them in a spreadsheet, and invite the community to help fix these values before importing.
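The last two points could be handled along these lines. This is only a sketch: `split_dba` and `maybe_truncated` are hypothetical helper names, and the company name in the usage lines is made up. A name that exactly fills the 50-character field may still be complete, so the length check only flags candidates for the review spreadsheet.

```python
import re

# Matches the phrase regardless of how the name was cased.
DBA_PATTERN = re.compile(r"\s+doing business as\s+", re.IGNORECASE)
MAX_LEN = 50  # truncation length mentioned in the discussion

def split_dba(name):
    """Return (legal_name, trade_name); trade_name is None without a d/b/a."""
    parts = DBA_PATTERN.split(name, maxsplit=1)
    if len(parts) == 2:
        return parts[0].strip(), parts[1].strip()
    return name.strip(), None

def maybe_truncated(name):
    """Flag names that fill the whole field, for manual review."""
    return len(name) >= MAX_LEN

# Hypothetical example name, not taken from the dataset.
legal, trade = split_dba("Acme Meats Inc Doing Business As Joe's Smokehouse")
print(legal, "|", trade)
print(maybe_truncated("Original Taco House Catering And Annex Food Compan"))
```

Whether operator=* gets the legal name or the trade name remains a decision for the community; the helper only separates the two.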

 – Minh Nguyễn 💬 23:48, 15 April 2022 (UTC)

Thanks for your comments. I have fixed those names now. Updated file here. I personally think that all of those facilities should be named, i.e. the current name=* should be used as name=*. However, if a name ends with "Incorporated" or "Limited Liability Corporation" (which I shortened to Inc./LLC), the full version could go in an operator=* tag and the version without LLC/Inc. in the name field. Example: name=Echo Lake Foods, operator=Echo Lake Foods, Inc.
I replaced "Doing Business As" with a "-". Either we leave it like this, or we use one of the two names. I don't really mind which. I also removed the 50-character limit. All names should be complete now. Hiausirg (talk) 18:16, 16 April 2022 (UTC)
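The name/operator split described in this reply might look like the following sketch. The function name `name_and_operator` and the choice to put a comma before both abbreviations are assumptions; only the "Echo Lake Foods, Inc." form appears in the discussion.

```python
# Long legal suffixes and the abbreviations mentioned in the reply.
SUFFIXES = {
    "Incorporated": "Inc.",
    "Limited Liability Corporation": "LLC",
}

def name_and_operator(raw):
    """Derive name=* and operator=* from a name with a legal suffix."""
    for long_form, short_form in SUFFIXES.items():
        if raw.endswith(" " + long_form):
            base = raw[: -len(long_form)].rstrip(" ,")
            # Assumption: a comma separates the base name from the suffix.
            return {"name": base, "operator": f"{base}, {short_form}"}
    # No recognized suffix: keep the value as name=* only.
    return {"name": raw}

print(name_and_operator("Echo Lake Foods Incorporated"))
```

The match is case-sensitive and only looks at the end of the string, so it would run after case normalization and after the "Doing Business As" values have been split.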

Meat processing

As indicated in this proposal, the majority of this dataset (actually 3,283 features) is classified as "TEXT35=PROCESSING", meaning some form of retail, commercial, or industrial meat processing. You're proposing to tag these facilities as industrial=meat_processing, which is a tag that has never been used before in OSM. There's clearly a gap in the available tags in OSM, but I'd suggest asking the tagging mailing list or OSM Community whether industrial=meat_processing is the best tag to use or whether we should coin some other tag.

Part of the problem is that many of these meat processing facilities are not industrial by any means. Each POI was included in the dataset purely by virtue of its NAICS code being either 311615 or (in a few cases) 311999. You can see from the descriptions of those NAICS codes that they're quite expansive. Here are some counterexamples, but this is by no means an issue limited to just a few POIs in one area:

Issues like these may not be serious enough to block the import, but I think the community would feel better about it if they were tagged generically enough. This also seems like a good candidate for a MapRoulette cleanup challenge – or even to replace the bulk import with a MapRoulette challenge, as we did with the Silicon Valley POI "import".

 – Minh Nguyễn 💬 00:22, 16 April 2022 (UTC)

Okay, you're right, I won't add an industrial=meat_processing tag to the "PROCESSING" POIs. A few, maybe 10-15%, are indeed not industrial. I would prefer to tag them with man_made=works and product=meat only. Maybe add a "fixme=check if shop=butcher applies" or something similar? There was a bike repair station import a while ago, and OSM notes with "verify location" or similar were created on every POI. I don't believe that a MapRoulette challenge is required here, as there are more important things to check than whether they process meat or only sell it. Particularly since, if I understood correctly, MapRoulette is made for tasks that anyone can do quickly and easily. To know if shop=butcher fits better, you need to have some local knowledge or at least access to legally usable street view photos. A fixme fits better if you ask me, but I could be wrong. Hiausirg (talk) 18:16, 16 April 2022 (UTC)
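For reference, the tagging suggested in this reply could be expressed as a small mapping like the one below. It is a sketch, assuming each record is a dict keyed by the dataset's field names; `processing_tags` is a hypothetical helper name.

```python
def processing_tags(record):
    """Return the generic tags proposed for TEXT35=PROCESSING records."""
    if record.get("TEXT35") != "PROCESSING":
        return {}
    return {
        "man_made": "works",
        "product": "meat",
        # Asks local mappers to verify whether the POI is really a butcher.
        "fixme": "check if shop=butcher applies",
    }

print(processing_tags({"TEXT35": "PROCESSING"}))
```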

Addresses

The dataset comes with addresses! Strongly consider adding address tags to the features, to make them easier to conflate with address and building imports. There's only a single field for the first line of the address, which is actually a much better practice than what OSM does. But since OSM tools expect a more structured format, consider parsing the ADDRESS field into addr:housenumber=* and addr:street=*. addr:city=* and addr:postcode=* are important too. These values are uppercase, so you'll need to match them to existing streets and cities in OSM. Tagging TELEPHONE as phone=+1-* would be cool, but only if we can trust that most of these phone numbers are for the front office, not the cell phone of a manager or whoever filled out the government form that got slurped up by DHS. – Minh Nguyễn 💬 00:27, 16 April 2022 (UTC)
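A minimal sketch of the parsing suggested above, assuming ADDRESS holds a simple "number street" string and TELEPHONE holds a US number. The helper names and the sample values are made up for illustration. Anything that doesn't fit the pattern (PO boxes, missing house numbers) is returned empty so it can be reviewed by hand.

```python
import re

# House number: digits, an optional letter, an optional "-digits" range.
ADDRESS_RE = re.compile(
    r"^(?P<number>\d+[A-Za-z]?(?:-\d+)?)\s+(?P<street>.+)$"
)

def split_address(address):
    """Split the first address line into addr:housenumber and addr:street."""
    match = ADDRESS_RE.match(address.strip())
    if match is None:
        return {}
    return {
        "addr:housenumber": match.group("number"),
        "addr:street": match.group("street"),
    }

def format_phone(raw):
    """Format a US number as +1-XXX-XXX-XXXX, or None if it looks wrong."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:]}"

print(split_address("123 MAIN ST"))
print(format_phone("(402) 555-0123"))
```

The street value is still uppercase and abbreviated at this point, so it would need to be matched against existing street names in OSM before upload. The phone helper only normalizes the format; it says nothing about whether the number belongs to the front office.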

Thanks for motivating me! I have now managed to split the single address field into housenumber and street. For the result, see the new file linked above. I'm not sure about the phone numbers, so I'd better leave them out. Hiausirg (talk) 18:16, 16 April 2022 (UTC)