Talk:Overpass API/status

From OpenStreetMap Wiki

Maintenance work 2016-07-13

To fix data damage caused by a software bug, the main instance and later on the Rambler instance will lag behind by a few hours, for a few hours. The general availability is hopefully not affected. Work is scheduled to start at around 07:00 UTC. --Roland
The main instance is back to normal operations.

Yes but it is still extremely slow, and too many requests are failing before even reaching completion (no HTTP reply, failure to connect, or fatal error in execution due to lack of memory). I suspect there's heavy load caused by many background processes for some reconstruction or indexing (e.g. to reconstruct the preset "areas"). Only very simple queries are working, provided they don't return more than about 1MB of data. — Verdy_p (talk) 20:02, 26 July 2016 (UTC)
That's no longer the case. Mmd (talk) 07:46, 4 August 2016 (UTC)
It is still the case. I have examples of queries that need to return about 6 MB of data after about 20 seconds of processing, and that currently work perfectly with the French instance but fail with the German and Russian instances (probably because of a lack of memory/resources or some internal SQL deadlock causing an uncaught exception): there is not even an HTTP status returned; the connection is accepted but gets closed abruptly after a few seconds, before anything is returned. — Verdy_p (talk) 14:47, 4 August 2016 (UTC)
overpass turbo shortlink, please. Mmd (talk) 15:08, 4 August 2016 (UTC)
If the maintenance work is finished (which requires internal knowledge of the local processing logs), please update the status page (independently for the German and Russian instances). This will tell when to start the same maintenance on the French instance (which for now is not affected at all: I don't know which kind of bug you're correcting, I've not seen anything wrong or missing on any instance, except some normal delays in reflecting recent changes in the main DB). — Verdy_p (talk) 16:34, 4 August 2016 (UTC)
Again my question: can you please post an overpass turbo shortlink to support your statement "I have examples of queries that need to return about 6MB of data after about 20 seconds of processing, that currently work perfectly with the French instance but fail with the German and Russian instance"? Mmd (talk) 16:37, 4 August 2016 (UTC)

Rambler will still take a while. --Roland

Same effect for now as on the German instance. For now the French instance has no such problem and may be used if your queries are failing on the Russian or German instances. It's unfortunate that two major instances are under maintenance at the same time (but the French instance still seems to absorb the shock). We really need to secure this Overpass service (which is essential for development, QA tools, checking map progress, or just for some simple apps) with more instances. But who can provide and host them? Shouldn't the OSM Foundation run its own instance (recovering some servers that are now disused, such as the one that was running the OWL Beta)?
I would like this OSM support to be proposed at the next SOTM World conference (where other developments should also be proposed to facilitate setting up additional mirrors and distributing the work load). This is more critical now that MapQuest has stopped its free service and is now selling similar services for building custom maps with custom queries. Of course we'll now also need to strengthen the tile serving CDN (with more caches). — Verdy_p (talk) 20:02, 26 July 2016 (UTC)
Basically, the French instance is affected by the same bug as the main instance and rambler. Sylvain is already informed, but hasn't yet started to rebuild the database. Expect some maintenance work for the French instance as well in the very near future. Also, I don't really see a massive impact on the French instance according to munin stats. Mmd (talk) 21:05, 26 July 2016 (UTC)
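For anyone scripting around the failures described above, here is a minimal sketch of trying one instance and falling back to the next when the connection fails or is closed without a reply. The endpoint URLs, the function names and the 60 second timeout are assumptions made for illustration, not an official client; check the status page for the instances that are currently usable.

```python
import http.client
import urllib.parse
import urllib.request

# Assumed endpoint URLs, listed in order of preference.
ENDPOINTS = [
    "https://overpass-api.de/api/interpreter",
    "https://overpass.openstreetmap.fr/api/interpreter",
]


def default_fetch(url, query, timeout):
    """POST an Overpass QL query to one instance and return the raw response body."""
    data = urllib.parse.urlencode({"data": query}).encode("utf-8")
    with urllib.request.urlopen(url, data=data, timeout=timeout) as resp:
        return resp.read()


def query_with_fallback(query, endpoints=ENDPOINTS, fetch=default_fetch, timeout=60.0):
    """Try each endpoint in turn; return (url, body) from the first one that answers."""
    errors = []
    for url in endpoints:
        try:
            return url, fetch(url, query, timeout)
        except (OSError, http.client.HTTPException) as exc:
            # Covers refused or reset connections, timeouts, HTTP error
            # statuses and connections closed before a complete reply.
            errors.append((url, exc))
    raise RuntimeError("all instances failed: %r" % (errors,))
```

A call such as query_with_fallback('[out:json];node(1);out;') returns the URL that answered together with the response body, so it is easy to log which instance actually served the request.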

July 20 2017 incident

It is interesting to know that the filesystem will not be able to recover from a power outage. Maybe it's not just the filesystem or OS, but the hardware which is too sensitive and can write blocks of zeroes instead of the correct sector or nothing at all (and then let the OS recover the uncommitted transaction automatically from the last known good state). I suspect you have a hidden defect in some hard drive controller if it cannot detect power problems and erases its internal memory, or cannot properly avoid writing something to the disk when there's a power emergency and not enough energy to complete a transaction with the spinning motors, magnetic heads and internal memory caches, and enter "safe" mode instead, or cannot monitor the power state of its internal memory buffers (not necessarily slow Flash memory on the last stage but probably fast DRAM, which should still be able to keep enough energy for about 1 second without refresh, when it should be fast CMOS registers with a large enough reserve of energy from local capacitors...) Even on my desktop PC, which I can abruptly power off by unplugging the cable, I've never seen any corrupted disk sector. But I've never made such a test by abruptly unplugging the small SATA power cord on the hard disk drive itself while this disk was in write access: normally the disk controller should detect it immediately and enter safe mode instead of writing random data. If this occurs, the small board of the disk controller has a problem, and it would be good to know in order not to buy another similar model from a bad manufacturer... On my own PC, I regularly have crashes due to experiments, but I've always seen the OS recover and restore a stable state (even when this occurred while updating to a new OS image that did not complete correctly: all I had to do was press the reset button and check in the logs what went wrong and what to fix first).
Anyway, I'm far less convinced that this is a problem of the hard disk controller or the hard disk interface or hardware; it is more probably a bug in the OS or its SATA drivers, which caused a power event to crash a system service with an uncaught exception followed by an unchecked assumption that all was OK, so that the OS was able to continue a write transaction before an internal memory transfer was complete to fill in the buffer to transfer to the disk drive, or which caused too many emergency I/O operations to compete for completion while trying to "swap out" other services to disk before committing other pending writes, which were not treated atomically as they should have been. So you should check your drivers to see if there are problems (notably PCI bus drivers and ACPI/power management drivers, or possibly even some unsafe power tweaks in BIOS settings, only meant to improve performance a bit, at the price of such a failure in case of power loss). Write consistency is a requirement on all servers, and notably on database servers. I doubt this is a bug in the SQL server; most often these bugs are in the OS, in incorrect BIOS settings, or in broken drivers not tested correctly on all platforms or processors. I've also read recently that there were severe bugs in some recent ranges of Intel processors, notably in managing power states, managing bus priorities and synchronizing their internal pipelines, with no easy way to predict and work around these bugs; Intel had to remove many processors from the market or to fuse out some functionalities to remarket them for lower ranges (this was a design bug affecting all die masks, causing severe production delays with no immediate replacement possible, and the dies had to be redesigned: OEMs had to choose other models; this affected models normally made for high-end servers). — Verdy_p (talk) 11:20, 20 July 2017 (UTC)

Thank you for the hints. The precise circumstances are: the file is the target of a lot of write-append activity. Due to a crontab entry, the file was closed after the restart, then moved and compressed. The compressed file then shows the zero bytes on decompression. This looks like a contradiction between the file content and the stored file size. Judging from the content of the file, it has really lost some data at the end, probably more than the zero bytes.
All in all, the server is scheduled to be replaced in a month or two. Hence, I will try to recover without much investigation. I'm aware that this is not best practice. It is rather an effort-yield tradeoff.
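For checking other files for the same symptom, a minimal sketch that counts how many zero bytes a file ends with. The function name is hypothetical, and treating a ".gz" suffix as gzip compression is an assumption, since the compression tool actually used on the server is not stated here.

```python
import gzip


def count_trailing_zero_bytes(path, chunk_size=1 << 20):
    """Return the number of consecutive zero bytes at the end of the file.

    Files whose name ends in ".gz" are read through gzip (assumed format);
    all other files are read as plain binary.
    """
    opener = gzip.open if path.endswith(".gz") else open
    trailing = 0
    with opener(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            stripped = chunk.rstrip(b"\x00")
            if len(stripped) == len(chunk):
                # Chunk ends with real data, so any earlier zero run is interior.
                trailing = 0
            elif stripped:
                # Real data followed by zeros: a new run starts in this chunk.
                trailing = len(chunk) - len(stripped)
            else:
                # Chunk is all zeros: extend the current run.
                trailing += len(chunk)
    return trailing
```

A nonzero result only shows that the file ends in padding. It cannot tell how much real data was lost before the zeros.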


More details on power outage:

Preliminary findings on the failure in the data center 21: In the night there was a storm at the Falkenstein site, whereby the power supply from the grid operator was interrupted briefly. With the exception of data center 21, the backup power supply by UPS / NEA worked smoothly in the other data centers at the Falkenstein site. In data center 21, several modules of the UPS system had been switched to fault mode for reasons which have not yet been clarified. The remaining modules could not process the load and switched to bypass mode. This switching process resulted in a voltage drop. This resulted in most of the servers in data center 21 rebooting. Since the UPS modules are currently still in the fault mode, there is currently no UPS protection in data center 21. Technicians from the UPS manufacturer will arrive today at about 1 PM at the site to repair the facilities as well as to perform further analysis on the cause of failure.