Talk:Platform Status

Discuss Platform Status here

Can the Platform Status page be categorized?

I would like to categorize the Platform Status page (initially under Category:Technical, and later possibly under a new subcategory), but I don't know whether the page gets overwritten periodically by an automatic process which would wipe out a category link at the bottom. Is it possible to manually categorize the page? I suppose I could just try it, and see what happens, but I thought I would ask here first. Teratornis 12:02, 29 Jul 2006 (BST)

Think it's just updated by hand at the moment isn't it? -- Harry Wood 11:40, 22 Aug 2006 (BST)


Could link to munin stats for each server here? Ojw 08:19, 23 December 2006 (UTC)

Connection established but no data arrives: -> Server too busy?

Using JOSM to download even a small area, it sometimes takes ages before anything arrives at all. This is annoying. The average load should somehow be shown on the stats page. --Gerchla 10:21, 28 January 2007 (UTC)

Yeah it would be interesting to know when the busy periods are, so we can avoid them. Judging by map/applet performance right now, I'm guessing Sunday night is a busy period. Anyone have stats which we could use to make a graph? It could show busy times of the day and busy days of the week (Based on past averages. No need for realtime stats) -- Harry Wood 21:54, 4 February 2007 (UTC)

Ah, we do have some stats graphs. Realtime generated too! Found them linked off the Servers page: db, dev, tile, wiki, www

A lot of these graphs show a nice quiet period at about 4a.m. each day (UK time) as I would expect. Most of them don't particularly show much change in traffic at the weekend, which I was surprised about. A notable exception is this tile server netstat graph, which (as I look at it today at least) is showing very busy periods over Saturday and Sunday, particularly on the afternoons.

Well I've made 'see also Servers' link. Maybe we should link direct to these graphs on this Platform Status page -- Harry Wood 14:59, 26 February 2007 (UTC)

cascading status

Even if a service is marked with {{Yellow | something}} it doesn't cascade to the actual status. That seems to come from somewhere else. Not much point marking things down or yellow if the top level just stays green anyway :( - User:Kpalsson 18:35, 30 October 2007

Edit Template:Platform Status to change the overall status (which features prominently on the Main Page)
Edit this Platform Status page to give more specific details.
As you say, there's no cascade mechanism. We could frig around with MediaWiki templates to achieve something like that, but do we always want it to cascade? Anyway there's less confusing black magic with the way it currently works.
- Harry Wood 13:42, 26 November 2007 (UTC)

Not getting e-mails when watched Wiki pages are updated?

Is anybody else not getting e-mails when a Wiki page is updated that they are watching? It was working for me a few days ago and then all of a sudden I'm not getting any e-mails about them. In fact, the last one I got arrived at least 2-3 days after somebody updated a page. --Rickmastfan67 04:23, 13 October 2010 (BST)

Started again now. I'm not sure if it was switched off deliberately (maybe for performance reasons) or something broke, but according to my tests it's now emailing again -- Harry Wood 10:20, 14 October 2010 (BST)
I fixed it sometime around 2am. The problem was a HUGE mediawiki job was stuck in the job-queue. It required > 150MB which is the default memory limit. I have increased the memory limit and jobs are now running as per normal. See job queue depth here -- Firefishy 17:00, 14 October 2010 (BST)
Late correction: the job queue length is no longer displayed in the Special:statistics page, but still displayed here (as JSON):
If this counter keeps growing constantly, the job queue executor processes are stalled. This is frequently because of high server load (generally from the tile server, sometimes from complex database maintenance/cleanup, monthly full dumps and archiving, or processes requiring a lot of memory). If this happens, you'll also see that pages added to categories (or whose categories have changed) are still not listed in their new category (even after performing null edits several minutes or hours later). — Verdy_p (talk) 05:34, 26 February 2017 (UTC)
E-mails are again messed up. I just received an e-mail from the 16th on the 19th, which is 3 days late. So, the queue seems to be jammed up again. --Rickmastfan67 11:33, 20 October 2010 (BST)
Seems to be caught back up again. --Rickmastfan67 00:37, 22 October 2010 (BST)
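
The "counter keeps growing" check described above can be sketched roughly as follows. This is only an illustration, assuming the JSON has the shape returned by MediaWiki's `action=query&meta=siteinfo&siprop=statistics` API call, which reports the queue length in a `jobs` field. The function names and the three-sample threshold are made up for this example and are not taken from any tool used on this wiki.

```python
import json


def job_queue_length(payload: str) -> int:
    """Extract the job count from a siteinfo statistics JSON response.

    Assumes the payload looks like {"query": {"statistics": {"jobs": N}}}.
    """
    data = json.loads(payload)
    return int(data["query"]["statistics"]["jobs"])


def looks_stalled(samples: list[int], min_samples: int = 3) -> bool:
    """Return True if the queue length rose between every pair of samples.

    `samples` is a list of queue lengths taken at regular intervals,
    oldest first. With fewer than `min_samples` readings (a hypothetical
    threshold) no conclusion is drawn.
    """
    if len(samples) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


# One reading parsed from a sample response, then a series of readings.
reading = job_queue_length('{"query": {"statistics": {"jobs": 42}}}')
stalled = looks_stalled([reading, 180, 950])
```

A queue that grows for a while and then shrinks again, as during a busy period that later clears, is not reported as stalled by this check.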

I'm curious, why does there seem to be a major delay sometimes (like several hours) to sending out an e-mail when a page has been updated? Before I created this subsection back in October, that was never the case. Back then, when people made an edit to a page, the e-mail was almost instantaneous. --Rickmastfan67 00:44, 19 December 2010 (UTC)

atd decided it wouldn't process jobs anymore because of a broken job, fixed now. -- Firefishy 09:52, 19 December 2010 (UTC)

E-mails from the Wiki seem to be jammed up again. :( -- rickmastfan67 09:50, 13 January 2011 (UTC)

Nothing in the work queue. Checked spam? -- Firefishy 11:07, 13 January 2011 (UTC)
Did that before I posted this. -- rickmastfan67 13:09, 13 January 2011 (UTC)
OK, found the problem. Cannot fix today. Will do in morning. -- Firefishy 00:43, 14 January 2011 (UTC)
Well, you must have done something because I just got an e-mail about your comment here. ;) -- rickmastfan67 00:48, 14 January 2011 (UTC)
Yip, I couldn't wait and put a temporary fix in place. ;-) -- Firefishy 01:18, 14 January 2011 (UTC)

E-mails are delayed again. Getting e-mails for changes about ~5 hours (just an estimate) after the page update is made. -- rickmastfan67 12:36, 21 April 2011 (BST)

The wiki is currently sharing hardware with the tile server proxy, so during periods of super-heavy tile load, the wiki is affected, and I guess slowdowns and failures of the background processing job queue would be an early symptom of that. Tile load today is due to the popularity of 'iPhoneTracker' (quite a cool use of OpenStreetMap actually), which is presumably going to die down gradually over the coming days anyway, but there is a plan to shift the tile proxy onto new hardware in the coming weeks. When the emails will start coming through again I'm not exactly sure. -- Harry Wood 13:45, 21 April 2011 (BST)
Harry's reply is correct. The low priority work queue is processed when the system load is low; with the load so high, it does not get processed. There are a few forced queue runs during the day but these are not keeping pace with the queue length. Should settle down in a few days' time. --Firefishy 16:46, 21 April 2011 (BST)

Just wanted to make a comment that the e-mails are really backed up right now. Getting e-mails 5-6 days after a change on the wiki has been made (example, change was made on the 19th, got e-mail on the 25th). Just thought you guys would like to know about this. -- rickmastfan67 06:43, 27 September 2011 (BST)

Server has been overloaded and the job queue has been backing up. I have increased the job queue run size and frequency. -- Firefishy 11:19, 27 September 2011 (BST)

map call rate-limited queue note

User:NE2 wants to add the following text alongside the API entry: "note that the map call (used by most editors, but not Potlatch 1, to load all the objects in an area) uses a rate-limited queue but other actions do not"

Nobody disputes that this is true, but is this text necessary/a good idea on the platform status page? Discuss...

-- Harry Wood 12:00, 7 December 2010 (UTC)

The queue, or the limit, or the chance of hitting the limit, aren't likely(?) to go away - and there's no tool to see if the limit is currently cutting traffic - or even if it has been in the past (minutes). IMO that means it has nothing to do with the current platform status; it's a feature, even if it can be annoying sometimes. Alv 15:19, 7 December 2010 (UTC)
I think this page is a simple traffic light-style indicator of whether there are genuine problems with OSM's systems. A discussion of the whys and wherefores of those problems just muddies the issue. A user visiting the page, wanting a simple yes or no answer, is left having to work out whether or not this is their problem. Leave the text out. Jonathan Bennett 15:49, 7 December 2010 (UTC)
Thinking about this further, the text definitely shouldn't be there since the two things (platform status, usage cap) are separate and possibly orthogonal things. The platform status is something the user can't do anything about, and the admins will do something about; the usage cap is something the user can do something about, and the admins won't do something about. Jonathan Bennett 17:25, 7 December 2010 (UTC)
That's what I thought too. The note User:NE2 is proposing (forcibly inserting) is a statement about some aspect of the normal operation of the API. At worst it could be reported/discussed as a bug or suggestion for improvement (Is there a bug filed for something to fix? Should there be?) But this page is for describing more temporary operational problems, descriptions of platform stability or interruptions in service. It is not for reporting perceived problems (bugs) in the normal operation of services.
Therefore the note doesn't belong here. Don't you agree User:NE2 ?
-- Harry Wood 17:57, 7 December 2010 (UTC)
If further information is needed from the Platform Status page, then it should be given as a link to where this information can be obtained (e.g. a schedule for planned maintenance). This page should be as simple as possible: is it working or not? A simple green light for working, red light for not working, and a borderline yellow light for working with problems. The yellow light should be followed by a brief note of the problem, and a link to further details if needed. Built-in limitations are not considered working with problems (though they might at times be annoying). --Skippern 18:40, 7 December 2010 (UTC)

You're not understanding. It's the fact that there's no rate limiting for most API queries that's worth noting. This can cause hour-long periods of wicked slowness in those queries, while the map call remains fine (since it is rate-limited). --NE2 18:44, 7 December 2010 (UTC)

Only the tiles have a method that slows down access. The API DOES NOT have a slow down method. If it is slow it could be networking issues or the server-side API processing queues are long. This would affect EVERYONE. -- Firefishy 19:41, 7 December 2010 (UTC)
The API queue for map calls is rate-limited, and is different from the queue for other calls: --NE2 19:46, 7 December 2010 (UTC)
It's an interesting thing to note I suppose. I wasn't really aware of it. But we don't want this permanently flagged as a note on this page. It's fairly impossible to explain in a short note, so will create confusion, and in general we want this page to report a clean sheet most of the time. How about if we bung the text on there during periods when the non-map API calls are experiencing slowness. Keep it clear the rest of the time ? -- Harry Wood 01:12, 8 December 2010 (UTC)
The reason for my putting it there was so people could report details on what exactly was slow - if it's the non-limited calls, then it's a fairly normal abnormality, but if it's the map call, it may be indicative of bigger problems. --NE2 01:56, 8 December 2010 (UTC)

the non-map API calls are subject to queues and do not have rate-limit limits. The 'rate-limit' is on the MAP api call, if you trigger it all your map queries are returned with a hard HTTP 509 Error. -- Firefishy 09:39, 8 December 2010 (UTC)

...but the queues cause them to slow down too presumably -- Harry Wood 09:51, 8 December 2010 (UTC)
Yes, much more than the map call. Right now, try loading an area in Potlatch or downloading referrers in JOSM, and it will take a long time. --NE2 20:46, 8 December 2010 (UTC)
I guess there is a connection between the current slowness and the fact that Bing coverage has been made available over large parts of the world. This new, extremely large data source has increased activity in large parts of the world. --Skippern 07:38, 9 December 2010 (UTC)

NE2, seriously. MAP API QUEUE DOES NOT USE A RATE LIMIT! IT HAS A SIZE-BASED USAGE LIMIT (X sized MAP API responses multiplied by Y number of requests in timeframe >= Z byte limit = 509 ERROR) WHICH, IF TRIGGERED, RETURNS A HTTP 509 STATUS ERROR - Bandwidth Limit Exceeded instead of ANY data. The map API queue gets processed as fast as our hardware setup allows; a failed cache battery + potlatch2 + high bing load is making things slow (due to long queues) for everyone at the moment. Please stop putting factually inaccurate information on the status page. KTHXBYE -- Firefishy 16:53, 9 December 2010 (UTC)
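
For readers trying to picture the difference between a rate limit and the size-based limit described above, here is a minimal sketch of a byte budget over a time window. It is not the actual OSM implementation. The class name, the 60-second window and the 1000-byte limit are all hypothetical values chosen for illustration; only the rule itself (total response bytes within the timeframe reaching the limit yields HTTP 509 instead of data) comes from the comment above.

```python
from collections import deque


class ByteBudget:
    """Track bytes served to one client within a sliding time window."""

    def __init__(self, max_bytes: int, window_seconds: float) -> None:
        self.max_bytes = max_bytes            # the "Z byte limit" (hypothetical value)
        self.window_seconds = window_seconds  # the "timeframe" (hypothetical value)
        self._events = deque()                # (timestamp, response size) pairs
        self._total = 0                       # bytes currently counted in the window

    def record(self, now: float, response_bytes: int) -> None:
        """Account for a response of `response_bytes` sent at time `now`."""
        self._events.append((now, response_bytes))
        self._total += response_bytes

    def status_for(self, now: float) -> int:
        """Return 509 if the budget is used up at time `now`, else 200."""
        # Forget responses that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, size = self._events.popleft()
            self._total -= size
        return 509 if self._total >= self.max_bytes else 200


budget = ByteBudget(max_bytes=1000, window_seconds=60)
budget.record(now=0, response_bytes=600)
budget.record(now=1, response_bytes=400)
blocked = budget.status_for(now=2)    # budget reached inside the window
recovered = budget.status_for(now=61) # both responses have aged out
```

Note that nothing here slows a request down: a client is either served at full speed or refused outright, which is the distinction being drawn in this thread.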

NE2, when several different people, all knowledgeable long-time OSMers, revert your edit on a wiki page, that should be a pretty strong hint to you that you should not make the edit again. The way you behave, it comes across as if you are just trying to annoy the system administrators and making a nuisance of yourself. I'm sure that isn't the case, so please demonstrate your respect and appreciation to our hard working volunteer sysadmins by engaging in discussion here instead of making any further edits to the Platform Status page.
The text of the note "uses a rate-limited queue" is arguably inaccurate, and overall it's a confusing note. I have an idea for an alternative. Suggestion moved to a new discussion below
-- Harry Wood 13:16, 17 December 2010 (UTC)
OK, so Tom was wrong at Thank you for pointing that out calmly. --NE2 18:26, 9 December 2010 (UTC)

The stupid thing is, you're right. There is an intermittent API slowness problem, and maybe we can/should find a way of indicating this on the page. Given that you are right, it should be easy for you to come to some agreement with others, about what needs to change on the page. I could even take your side on this discussion, but now, like everyone else, I'm just intensely irritated by you and the manner in which you conduct yourself. I asked you very politely not to edit the page any more, and today you've ignored this, and added nothing to the discussion by way of explanation. That's the behaviour of somebody who is deliberately trying to annoy everyone. -- Harry Wood 13:44, 17 December 2010 (UTC)

Separate status rows for map and non-map API calls?

Although we externalise the 'map' call as if it is part of the API, it's operationally quite separate, with different performance characteristics. How about if we split the API row into two rows: "API map call" and "API other" to reflect this ?

-- Harry Wood 17:12, 9 December 2010 (UTC)

That might be a good way to do it. --NE2 18:26, 9 December 2010 (UTC)
The map call and the rest of the API still use the same database, the same machines and the same network connections. As Firefishy has pointed out, there is no rate limiting on any system, so if the machines are being hit hard, any jobs on those machines will suffer. This is still too complex a technical area for what's meant to be a simple traffic light-style indicator to everyone using OSM -- how does a non-technical mapper know if Potlatch/JOSM/Merkaartor/whatever is using a map call or not? Jonathan Bennett 11:23, 10 December 2010 (UTC)
How does it work actually? The API is listed as 'puff & fuchur', but then there's the "Rails application servers" draco, sarel, & norbert. Is it those servers which are queuing up non-map API calls? Where is map cgi running? -- Harry Wood 12:49, 10 December 2010 (UTC)
Some more details from chatting to TomH about how this works (at the moment) :
Hardware-wise the split is the same whether it's a map call or a non-map call. Requests come in via the front web servers puff & fuchur, and processing is split between draco, sarel, & norbert (the ones described as the "Rails application servers")
Non-map calls hit the rails code running on those machines, and during busy periods the requests get queued by the passenger server on those machines.
For map calls the frontend web servers try to connect to cgimap. This is not rails code, but also running on draco, sarel, & norbert. During busy periods it may be that the connect won't succeed until there is an instance ready to process the request.
So yes... Jonathan Bennett is correct to say same database, the same machines and the same network connections. ... and yes it is quite complex.
-- Harry Wood 13:16, 17 December 2010 (UTC)
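
As a rough illustration of the split described above, the sketch below labels a request path by the software that would handle it. It assumes the map call is the `/api/<version>/map` path, as in API v0.6. The function name is made up, and the labels "cgimap" and "rails" simply reuse the names from this discussion; this is not the frontend servers' real configuration.

```python
from urllib.parse import urlsplit


def backend_for(url: str) -> str:
    """Return "cgimap" for a map call and "rails" for any other API call."""
    # Ignore the query string (e.g. ?bbox=...) and any trailing slash.
    path = urlsplit(url).path.rstrip("/")
    parts = path.split("/")  # e.g. ['', 'api', '0.6', 'map']
    if len(parts) == 4 and parts[1] == "api" and parts[3] == "map":
        return "cgimap"
    return "rails"


map_call = backend_for("/api/0.6/map?bbox=-0.1,51.5,0.0,51.6")
node_call = backend_for("/api/0.6/node/1")
```

Either way the request ends up on the same application servers and database, so this only tells you which queueing behaviour applies, not which hardware is busy.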

Detailed log for offtimes

Hi, is there any log where we list the reasons for the downtimes, e.g. the wiki shutdown this morning? --!i! 14:33, 2 March 2011 (UTC)

Hi, can anybody give us the reasons for the outages of the last few days, please? Then we can inform the community in newsletters. --!i! 14:06, 26 July 2011 (BST)
Yesterday we had a RAID controller failure on the main tile server. Thankfully the sysadmins have been gradually scaling up various servers and introducing redundancy, so on this occasion tiles could be served from a back-up server while this unexpected problem was tackled. Running on this backup server, contributors didn't get to see their data changes rendered. It also meant some errors ("More OSM coming soon" message) for requests at higher zoom levels and less popular bits of the map. This is because the back-up server does not have a rendering engine and all of the postgres database and replication stuff. It only serves up tiles which were already rendered in the cache.
Thanks to Firefishy for working on this problem. He got the main tile server back up and running in a few hours, partly by dashing to the server room on his lunch break!
The history of this wiki page. D'you think we should put a note on the blog about it? It was a pretty minor outage in the end.
-- Harry Wood 16:22, 26 July 2011 (BST)
Well, to me it sounds quite good to bring a bit of transparency to the hard jobs of the admins, so the community might respect their job a bit more. On the other hand, some people already expected another app killing our servers, which was obviously not the case :) --!i! 07:42, 27 July 2011 (BST)

Job queue stalled on this wiki

@Harry Wood:, @Firefishy:.

Apparently this wiki has had its job queue stalled for about 5 days, with jobs constantly accumulating. If I look at the server statistics on Munin, its server has its CPU constantly 100% used, and an unusually high temperature due to this constant usage.

(This is also correlated with an apparent change in the network topology, with an internal link to a server whose IPv4 roundtrip time has increased from about 20ms to 30ms: possibly some network service is no longer responding correctly, such as a service for the SQL database replication, or for performing backups/dumps.)

It seems that one job in the MediaWiki job queue is never ending and has entered into some infinite processing loop. As a result, various pages on the wiki are not reflecting changes made, or categories are not updated correctly (missing pages that are categorized in them), or the lists of pages linking to another are not correctly updated, or some pages tested with #ifexist are not found even if they exist, or some redirects are not followed. It seems related to some background job trying to reconstruct an index on the SQL engine, and the job stalled in the Mediawiki job queue being locked there, blocking all other jobs from starting.

The job queue was clean one week ago. But now it slowly but constantly grows to critical levels.

Maybe the wiki server should be restarted after looking at which job in the MediaWiki job queue is stalled, and looking at what may be locking on the underlying SQL server (such as table indexing processes).

It may be an internal issue in MediaWiki (note that the current version used here is in "end of life", according to the MediaWiki site). But the constant high temperature on the server CPU (above 60°C instead of about 40°C) may cause hardware problems later. This morning apparently the server was restarted (causing a cooler temperature, but the job queue is still stalled, and temperature is growing again). — Verdy_p (talk) 19:02, 1 December 2016 (UTC)

You did not reply at all. The job queue is now back to normal. Visibly some background processes were killed, and immediately the CPU usage dropped, as well as the temperature and the VM usage. The job queue length has now decreased significantly (by more than two thirds of the maximum length it reached yesterday) and continues decreasing (generally slowly, sometimes in larger steps for simpler tasks). This also has a small benefit for the wiki's performance and response time (most probably because the heavy never-ending tasks that were locked in the job queue are no longer running, and most other tasks are much less CPU/memory intensive and cause less swapping in the VM). I also see that there are many fewer NATed connections through the firewall (this suggests that these heavy background tasks were caused by some wiki admin leaving a remote terminal connected for several days to perform some intensive technical tests on the wiki, or possibly using a session to reindex some huge tables in the database, or trying to run some home-made bots or internal analysis of the wiki content; I just hope this was not caused by an external attack through unauthorized access or security loopholes). — Verdy_p (talk) 19:29, 2 December 2016 (UTC)

Page for the platform, not servers

Unlike Munin, this is a page for users, not a server by server listing. Only parts of the "platform" should be listed, not servers that may not be in use for anything user-facing. I'm not sure how useful listing "servers" is, because these can change at any time and it's transparent to the users. Pnorman (talk) 01:59, 5 July 2018 (UTC)

It's not so transparent when services are down, or seem to be down; getting Munin details may give info that they are up and that a problem is elsewhere, or that there's an overload.
Maybe you'd want to integrate this mapping of services to servers on the "Hardware" info site (probably managed from chef), and then we could link that page. For now it's hard to check where a service is hosted. Anyway, the list of servers is also visible in Munin (which is also hard to navigate, with very long status lists and no evident sort or search option).
I edited that to relocate the services on servers according to their hardware "roles". It seems that the hardware site could list these servers by assigned "role", but roles specific and dedicated to a single server (named like the server itself) should be hidden from this list.
Also, the list of servers on the CDN gives hints to users when one of them is failing. Users may want to retry by bypassing the default geoDNS mapping via proxies in another country. I've observed various countries changing from one server to another, and the geoDNS does not correctly distribute the workload according to the servers' effective capacities or response times, so the response time of the map varies a lot depending on where users are connecting from. The mapping is extremely broad when done only by whole country, and the most local servers are not always used: e.g. the tile cache in France does not serve France, which is now served from one of the two caches in Germany, with somewhat lower bandwidth and longer download sessions, which cause a higher workload on the servers. It seems that improving the CDN and making it a bit more dynamic could help stabilize the service and detect local server overloads. As these servers have Munin stats available, these stats could be used to regularly redistribute the workload with some preferred secondary failover mappings. For large countries with many users (e.g. USA, Russia, or even Germany), maybe there's a way to remap by state/region identifiable by geoIP or, if that is not available, by source ISP/network (i.e. by AS) if geolocation is unreliable and concentrates all users of a large ISP in the same country at the same hotspot.
Note that this page is also a form of promotion, informing possible service providers that they can help improve the infrastructure by donating resources. For now it just asks all users for financial donations. It also shows that the OSMF has limited resources, and demonstrates why some usage policies limit what third parties can do (I think we should link to a page showing how they can create their own server and contribute it to the community; I think that existing chapters could work on this, as our infrastructure is still very fragile). Maybe we should also link the online services hosted by Wikimedia. — Verdy_p (talk) 08:11, 5 July 2018 (UTC)

Remove Status Column for OSMF hardware

Reason: nobody maintains it, a list of servers is automatically generated, up to date information is provided via Twitter @osm_tech. Let's just drop it, it adds no value. Mmd (talk) 11:42, 15 July 2018 (UTC)

Website Status

The website stopped working temporarily at 3:50 PM PST on 7/14/20.

Status Rendering Queue

Is there currently an easy way to see the status/delay of the rendering queue, as there is apparently quite a big lag there? --Gkai (talk) 10:58, 16 July 2020 (UTC)