Date: Tue, 16 Dec 2014 18:44:22 -0600
From: Rob Landley
To: DreamHost Customer Support Team
Subject: Re: [roblan21 97401440] lists.landley.net is borked

The archive has now caught up to December 2nd. It's currently December 16th.

The "this support ticket has been moved to a specific user's queue" message (#6603199 has been moved) was from Friday. I have not heard back on this issue since.

Currently, this is the most recent message listed:

http://lists.landley.net/pipermail/toybox-landley.net/2014-December/003802.html

There have been a lot of posts since then, including one from a contributor telling me that my project's mailing list is not up to date, and my reply to him cutting and pasting the back and forth on this issue going back to August, and his reply back to me.

My wild guess at what's going on, given the way "wget" on any of these pages that haven't been recently loaded takes anywhere from 30 to 90 seconds, but then the second wget of the same page is almost instant, is that your server is swap thrashing itself to death.

Specifically, it looks like you haven't set any physical memory or I/O bandwidth low water marks on the host system outside all the containers at all, and it hasn't got enough physical memory for the load you've given it, so individual lxc containers (openvz containers?) are getting swapped out, and then when it tries to swap them back _in_ the server winds up swap-thrashing or similar (hugely I/O bound either way), and the cron jobs _in_ the containers are essentially never running, because when they try to run the system detects it's overloaded and reschedules the regular cron jobs for a "later" that never comes because it'll be overloaded then too. (It may do a trivial amount of work, such as process one message per day, and then go "well, I'm over my allocated time, catch up next time this cron job runs"... in 24 hours. At which point it does one message again.)
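
The cold-versus-warm fetch timing described above can be reproduced with a short shell sketch. This is only an illustration: the helper names elapsed_seconds and cold_vs_warm are made up here, and it assumes wget is installed and that date supports +%s.

```shell
# Hypothetical helper: run a command, discard its output, and print how
# many whole seconds it took.
elapsed_seconds() {
  es_start=$(date +%s)
  "$@" >/dev/null 2>&1
  es_end=$(date +%s)
  echo $((es_end - es_start))
}

# Hypothetical helper: fetch the same URL twice and report both timings.
# A slow first fetch followed by a fast second one is consistent with the
# page having to be pulled back in from disk (or swap) the first time.
cold_vs_warm() {
  echo "first fetch:  $(elapsed_seconds wget -q -O /dev/null "$1")s"
  echo "second fetch: $(elapsed_seconds wget -q -O /dev/null "$1")s"
}

# Usage (not run automatically):
#   cold_vs_warm http://lists.landley.net/pipermail/toybox-landley.net/2014-December/date.html
```

The timings are wall-clock seconds, so they include network latency; only the difference between the first and second fetch says anything about the server.
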

Have you considered running top and iotop on the host system outside these containers to see how much memory is available and what's using the I/O bandwidth? (Or maybe the cron jobs are synchronized between containers and that's what causes an I/O storm? Something is swapping everything out, and it's messing up cron and disabling any mail processing triggers you might have...)

Rob

========================================================================
To: rob@landley.net
From: DreamHost Customer Support Team
Subject: Re: [roblan21 97858499] lists.landley.net is borked
Message-Id: <20141220015213.E211858160@badtzmaru-new.sd.dreamhost.com>
Date: Fri, 19 Dec 2014 17:52:13 -0800 (PST)

------------------------------------------------------------------------
- After reading this response, please consider visiting
- the survey below to comment on its quality. Thanks!
- http://www.dreamhost.com/survey.cgi?n=97858499&m=1638992
-
- If the service you received from us was exceptional, please consider
- tweeting your love for @dreamhost. It'll warm our hearts, soothe
- our souls, and get you good karma at some point down the road.
------------------------------------------------------------------------

Hello,

I apologize sincerely for the delays in getting back to you here, and for the inconveniences this has caused for you as well.

Admittedly, our mailman servers in use for discussion and announcement lists really could use an upgrade. New hardware and routing plans are in the works.

Indexing can run slowly at times, and occasionally may be terminated or killed by a process watcher depending on the servers' load. It *should* run again the next night when it picks back up, but there may be times like this one where it needs an extra kick from one of us. It appears our admin already took care of it two days ago, but I pushed the index update again for each of your three discussion lists just in case. It should be caught up at this time.

Your patience has been very appreciated.
============================================================================
Date: Sat, 20 Dec 2014 19:10:21 -0600
From: Rob Landley
To: DreamHost Customer Support Team
Subject: Re: [roblan21 97858499] lists.landley.net is borked

On 12/19/14 19:52, DreamHost Customer Support Team wrote:
> one where it needs an extra kick from one of us. It appears our admin
> already took care of it two days ago, but I pushed the index update again
> for each of your three discussion lists just in case. It should be caught
> up at this time.

It's not caught up. I've received dozens of messages from this list since the last one the archive shows. Down at the bottom of the December page it says:

  Last message date: Wed Dec 3 17:00:08 PST 2014
  Archived on: Sat Dec 20 01:23:35 PST 2014

This is the disconnect. The "last message date" was Dec 1 a week ago, now it's Dec 3. I have not gone back and retroactively generated messages, they're in the system and not getting processed.

Two new messages have shown up on http://lists.landley.net/pipermail/toybox-landley.net/2014-December in the past few days, which is progress. But the most recent message in http://lists.landley.net/pipermail/toybox-landley.net/2014-December/date.html is from December 3, which was 17 days ago.

I received 14 messages from this list _yesterday_, via email. At this rate, the first of yesterday's messages might show up in the web archive sometime in late February. Before that I received multiple messages on the 18th, and on the 17th, and on the 16th, and on the 15th...

Moving forward from where the archive is, there should be a reply to http://lists.landley.net/pipermail/toybox-landley.net/2014-December/003802.html from me, and then a reply to THAT from Felix Janda dated December 4. Those may trickle in over the next week? (I don't know.)

This doesn't appear to be an indexing problem, this is a delivery problem. Whatever's feeding the messages _into_ the mailman system is delivering them one at a time, half a month late.
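
The "late February" figure above is back-of-the-envelope arithmetic, and it can be written out explicitly. A minimal sketch, assuming the archive keeps advancing at the rate reported in this message (about 2 days of list traffic per 7 wall-clock days, Dec 1 to Dec 3 in a week) and that yesterday's messages are about 16 days ahead of the archive (Dec 3 to Dec 19). The function name is made up for illustration.

```shell
# Hypothetical helper: days_until_archived BACKLOG ADVANCED ELAPSED
#   BACKLOG  = days of list traffic between the archive's newest message
#              and the message you are waiting for
#   ADVANCED = days of list traffic the archive gained during the sample
#   ELAPSED  = wall-clock days the sample covered
# Prints the wall-clock days to wait, rounded down (integer arithmetic).
days_until_archived() {
  echo $(( $1 * $3 / $2 ))
}

days_until_archived 16 2 7   # prints 56
```

Fifty-six days after December 20 is February 14, which is in the same ballpark as the guess above. The estimate is only as good as the assumed rate, which comes from a single week's observation.
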

It's not the actual list mailserver. I get messages to my gmail account fairly promptly (although I'm cc:'d on half of them and gmail filters out duplicates, but I haven't received any via email noticeably delayed.) But the mailman mbox text files (accessible as "gzipped text" on the right from http://lists.landley.net/pipermail/toybox-landley.net/) do not receive them for a very long time, and the indexes can't index data they haven't received yet.

I'm glad that at least the messages don't seem to be getting permanently lost. (Yet.) But in addition to my open source developers using the archive, I personally value the web archive precisely _because_ gmail filters out duplicates (and thus breaks threading, making my local copy of this data live half in my personal inbox and half in my list folder; of the 14 messages from yesterday 6 are in the list folder and the rest in my inbox, thank you gmail).

When sites like https://lwn.net/Articles/616272/ cover my project, they copy messages like https://lwn.net/Articles/616315/ from the web archive, which they can't do if it's not there.

I note that the http://landley.net/dreamhost.txt reply I copied to the website so I could refer to it was in reply to the Android Bionic maintainer, because this project has just been merged into the Android base system for release in the next Android version, which means my code would be running on a billion phones. I'd kind of like to avoid losing this opportunity, for reasons I described in my presentation on it at last year's Embedded Linux Conference:

https://www.youtube.com/watch?v=SGmtP5Lg_t0

One of the messages I'd very much like to link to is the Android maintainer describing his progress merging toybox into Android, and his remaining todo list items. Unfortunately, I can't do so because he posted it yesterday. (See "February", above.)

Yes, everybody thinks their stuff is important, but I'm fairly certain this isn't just inconveniencing me personally.

I'm glad the website is working. I'm glad the mercurial repository is working. I'm glad the actual email part of the mailing list is working. I'm sorry to keep pestering you about this, but it's an issue for me that the mailing list archive work.

Could you ask whoever fixed this when it happened back in August what they did? (Or did it just resolve itself...?)

Rob

=======================================================================
To: rob@landley.net
From: DreamHost Customer Support Team
Subject: [roblan21 97957266] Message from support.
Message-Id: <20141222051900.0997458137@badtzmaru-new.sd.dreamhost.com>
Date: Sun, 21 Dec 2014 21:19:00 -0800 (PST)

------------------------------------------------------------------------
- After reading this response, please consider visiting
- the survey below to comment on its quality. Thanks!
- http://www.dreamhost.com/survey.cgi?n=97957266&m=6549624
-
- If the service you received from us was exceptional, please consider
- tweeting your love for @dreamhost. It'll warm our hearts, soothe
- our souls, and get you good karma at some point down the road.
------------------------------------------------------------------------

Hi there Rob!

I'm terribly sorry for the delay in response to this matter...

I've been noticing some issues with list archives not properly being updated, but hearing that the messages are making it through, though. I'm planning on placing a bug into our Admins tomorrow with details to this matter so we can get that fixed as I'm not sure what's causing this or how to resolve it myself. I'll hold onto this ticket in my personal queue so I can update you with further details as I get them.

I'll be in touch with you again.

Thank you kindly,

Jin K.
DreamHost Support Team + support@dreamhost.com
Earn over $97 for each referral: http://www.dreamhost.com/affiliates/

To continue this support case, just reply to this email.
Open a new case at: https://panel.dreamhost.com/?tab=support

=========================================================================
To: rob@landley.net
From: DreamHost Customer Support Team
Subject: Re: [roblan21 98044907] lists.landley.net is borked
Date: Wed, 24 Dec 2014 12:20:42 -0800 (PST)

------------------------------------------------------------------------
- After reading this response, please consider visiting
- the survey below to comment on its quality. Thanks!
- http://www.dreamhost.com/survey.cgi?n=98044907&m=6075228
-
- If the service you received from us was exceptional, please consider
- tweeting your love for @dreamhost. It'll warm our hearts, soothe
- our souls, and get you good karma at some point down the road.
------------------------------------------------------------------------

Hi again!

Thanks for your patience!

Our Admins were able to find the cause of the issue here. It looks like the issue is with regards to a couple of things:

1. The list server was backed up with a large log of messages that it needed to archive, so the archiving was being done, but much slower than the amount of incoming/outgoing messages.

2. The hardware for the list server is somewhat old, and has a higher load due to this that can cause the processes on those machines to run slower.

For the first problem, we were able to spot the issue and our Admins were able to increase the archive qrunner process count to help this run faster. The server is now playing catch-up, so it may be a bit of time before that's all included in the archive due to the backlog of requests that are still on the machine now.

But, this did help with the processes running much faster. It will still need some time to get caught up to the current date, though. Please monitor the archiving over the next week and let me know if you notice further issues.

As for the second issue, we are looking into upgrading the hardware on the list servers as a couple of the machines are a kind of old.
I don't have an ETA on when this will all take place, but there are internal talks about getting the hardware replaced for the list service, so once that's done, that'll help greatly with the list service, and will certainly help to keep the archives up to date in a timely manner.

The issue I assisted with back in October was a different issue with the 'scooter' mount not being properly mounted. This issue is with the archives simply not being processed and just sitting there waiting to be archived, but due to the backlog of requests the machine has to handle, that didn't get done in a timely manner, but the fix the Admins did should help get the requests processed faster.

The archiving of the list posts should be processing much faster now that our Admins were able to increase the processes on the server for archiving, so please keep an eye on that and let us know if you notice any further issues so we can check that out then. Hopefully, you'll notice the updates within the next day or so.

If you have any other questions at all, please reply back. I'd be happy to help!

Thank you kindly,

Jin K.
DreamHost Support Team + support@dreamhost.com
Earn over $97 for each referral: http://www.dreamhost.com/affiliates/

To continue this support case, just reply to this email.
Open a new case at: https://panel.dreamhost.com/?tab=support

===========================================================================
Date: Wed, 24 Dec 2014 22:57:09 -0600
From: Rob Landley
To: DreamHost Customer Support Team
Subject: Re: [roblan21 98044907] lists.landley.net is borked

On 12/24/14 14:20, DreamHost Customer Support Team wrote:
> ------------------------------------------------------------------------
> - After reading this response, please consider visiting
> - the survey below to comment on its quality. Thanks!
> - http://www.dreamhost.com/survey.cgi?n=98044907&m=6075228
> -
> - If the service you received from us was exceptional, please consider
> - tweeting your love for @dreamhost.
It'll warm our hearts, soothe
> - our souls, and get you good karma at some point down the road.
> ------------------------------------------------------------------------
>
> Hi again!
>
> Thanks for your patience!

Merry Christmas!

> Our Admins were able to find the cause of the issue here. It looks like
> the issue is with regards to a couple of things:
>
> 1. The list server was backed up with a large log of messages that it
> needed to archive, so the archiving was being done, but much slower than
> the amount of incoming/outgoing messages.
>
> 2. The hardware for the list server is somewhat old, and has a higher
> load due to this that can cause the processes on those machines to run
> slower.

Something like swap-thrashing has to be occurring to transition "keeping up with realtime input for weeks at a time" into "multiple weeks behind, lucky to get a message a day". This isn't even 1/10th normal speed, this is "no visible progress for multiple days, system is paralyzed".

This is not a CPU scheduling issue for a constantly running daemon, this is a 3 orders of magnitude screwup and that means a rotating disk has stopped doing streaming I/O and started introducing a seek between each sector read. That's just about the only place in a modern computer you _get_ that kind of giant latency insertion. (Ok, there's stuff like reverse DNS lookup failures with the insane 15 second timeout in ssh logging, but that doesn't apply here.)

You don't happen to have an admin who thinks "10 gigs of swap space means the machine will never run out of memory, so this service will always stay up", do you? I've seen the result of that. It's a system thrashing so badly you spend half an hour waiting for bash to make its way through /etc/profile when you ssh into it.

Have you tried (as root):

  echo 1 > /proc/sys/vm/drop_caches

It's counterintuitive, but it can slap some sense into a hysterical box long enough for you to fix the overload.
(It says free all clean disk cache _now_, then processes and disk cache race to claim the freed memory. It means the processes can establish a working set for a few seconds and maybe progress long enough to clear the backlog.)

I've sometimes done:

  while true; do sync; echo 1 > /proc/sys/vm/drop_caches; sleep 30; done

And left that running on a thrashing box until it was out of the woods. Not a long-term solution but as shock therapy it works surprisingly well...

> For the first problem, we were able to spot the issue and our Admins were
> able to increase the archive qrunner process count to help this run
> faster.

This seems unlikely to be an issue of a box being CPU starved. Not on Linux boxes, they ensure non-realtime threads get a tick each scheduler slice refill precisely to avoid that martian space probe problem. (Google "mars pathfinder priority inversion".) So you can crush processes to about 1/20th of their normal speed with priorities, but not 1/10,000th. Not unless you're running 10,000 CPU hog processes on a single box?

Ever since the O(1) scheduler went in (like 15 years ago now) scheduler tick refills won't give _any_ process zero ticks for an entire period unless you're doing SCHED_RR which _nobody_ is crazy enough to run on a server... But maybe you're running BSD or something?

I've only encountered this kind of behavior due to memory starvation resulting in pathologically bad I/O scheduling decisions resulting in a LOT of thrashing (you can even do it without swap, since executable pages are mmapped and can be dropped and individually faulted back in and if there's no _other_ source of physical pages to handle a fault...). But you know your config better than I do. Either way, yay progress.

Ah, one other thing that can do it is if the system has multiple I/O busses of different speeds and you're running an old 2.6 kernel where calls to sync aren't per-queue.
So you can have a USB2 disk with 16 gigs of buffered writes waiting to go out (I hit this running mke2fs on a terabyte USB disk) and then when anything calls sync (thank you vim for doing that every 100 keystrokes if you let it make .swp files; I wrote an LD_PRELOAD .so to stub out the sync calls for it to get it to STOP doing that), the old sync() implementation inserts a global barrier where all new I/O is blocked until _all_ pending I/O from before the sync call goes out, and that includes flushing all the data to the slow device. So the machine can go bye-bye for 15 minutes flushing the data in that case. I vaguely recall they fixed that somewhere around... 3.4 maybe? But again, unlikely to be the case on this server (are you writing gigabytes of data through USB2 ?).

It _really_ sounds like some variant of swap thrashing...

> The server is now playing catch-up, so it may be a bit of time
> before that's all included in the archive due to the backlog of requests
> that are still on the machine now.
>
> But, this did help with the processes running much faster. It will still
> need some time to get caught up to the current date, though. Please
> monitor the archiving over the next week and let me know if you notice
> further issues.

According to http://lists.landley.net/pipermail/toybox-landley.net/2014-December/date.html

  Last message date: Wed Dec 3 23:37:33 PST 2014
  Archived on: Sun Dec 21 02:52:08 PST 2014

It's now Wednesday evening on the 24th, so it's saying the list index was last rebuilt just under 4 days ago. Presumably the index and the data being indexed will meet up at some point...

> As for the second issue, we are looking into upgrading the hardware on
> the list servers as a couple of the machines are a kind of old.
I don't
> have an ETA on when this will all take place, but there are internal
> talks about getting the hardware replaced for the list service, so once
> that's done, that'll help greatly with the list service, and will
> certainly help to keep the archives up to date in a timely manner.
>
> The issue I assisted with back in October was a different issue with the
> 'scooter' mount not being properly mounted. This issue is with the
> archives simply not being processed and just sitting there waiting to be
> archived, but due to the backlog of requests the machine has to handle,
> that didn't get done in a timely manner, but the fix the Admins did
> should help get the requests processed faster.

Indeed, there have been at least 4 things wrong with the list processing over the past few months:

1) That mount issue you mentioned, rendering it 404.

2) The index not updating regularly, with the index page's timestamp saying it's several days old.

3) Messages not being delivered to the list archive for processing in a timely manner. (So when index page updates _do_ happen, they don't index current data.)

4) End-user page loads (wget from out in the world) taking anywhere from 30 to 70 seconds to get a response if that specific page hasn't been accessed in the last hour or so.

Presumably the last three are related. (Again, I imagine your poor server swap thrashing its guts out. Have you run iotop on it?)

Note that 2 and 3 are distinct issues. Both have recurred over the past few months. You say you just fixed #3 (yay!), and I await the next time #2 unblocks so I can see the result.

Speaking of #2: Note there's a top level index and a per-month index. The December index linked above says December 21. The top level index http://lists.landley.net/pipermail/toybox-landley.net/ says November 28th, which seems unlikely to be accurate if it's listing December as a month...
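
The per-month footer quoted in this thread ("Last message date" and "Archived on") is what tells problem 2 apart from problem 3, so it can help to pull those two fields out mechanically. A rough sketch, assuming the page text contains those labels exactly as quoted and that each value ends with a four-digit year. footer_field is a made-up helper name, not part of mailman.

```shell
# Hypothetical helper: footer_field LABEL
# Reads a page on stdin, strips HTML tags, and prints the text that
# follows "LABEL:" up to and including the first four-digit year.
footer_field() {
  sed 's/<[^>]*>//g' | awk -v label="$1:" '
    {
      pos = index($0, label)
      if (pos == 0) next
      rest = substr($0, pos + length(label))
      if (match(rest, /[0-9][0-9][0-9][0-9]/)) {
        value = substr(rest, 1, RSTART + 3)
        sub(/^[ \t]+/, "", value)
        print value
        exit
      }
    }'
}

# Usage (not run automatically):
#   wget -q -O - http://lists.landley.net/pipermail/toybox-landley.net/2014-December/date.html \
#     | footer_field "Archived on"
```

A stale "Archived on" value points at the index rebuild (problem 2), while a "Last message date" far behind "Archived on" points at delivery into the archive (problem 3).
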

> The archiving of the list posts should be processing much faster now that
> our Admins were able to increase the processes on the server for
> archiving,

Unless swap-thrashing (or similar rotating media I/O contention breaking up all streaming I/O and inserting a seek between every sector read) was the problem, in which case adding more competing processes would probably make it worse. But presumably they would have noticed that, so...

> so please keep an eye on that and let us know if you notice
> any further issues so we can check that out then. Hopefully, you'll
> notice the updates within the next day or so.

If the index rebuilds so I can see it, sure.

> If you have any other questions at all, please reply back. I'd be happy
> to help!

I hope you get Christmas off. :)

The archive's been down all month, another couple days doesn't make a huge difference at this point. (I'm not sure what happens when messages from the previous month start trickling in, but if it's processing messages from the 3rd after the 19th and that works out ok, I guess it's fine?)

Thanks,

Rob

===========================================================================
To: rob@landley.net
From: DreamHost Customer Support Team
Subject: [roblan21 98134032] Message from support.
Date: Sat, 27 Dec 2014 14:01:38 -0800 (PST)

------------------------------------------------------------------------
- After reading this response, please consider visiting
- the survey below to comment on its quality. Thanks!
- http://www.dreamhost.com/survey.cgi?n=98134032&m=4414604
-
- If the service you received from us was exceptional, please consider
- tweeting your love for @dreamhost. It'll warm our hearts, soothe
- our souls, and get you good karma at some point down the road.
------------------------------------------------------------------------

Hi Rob!

Thank you for the update! And I hope you had a wonderful Christmas!

I have to admit, your detailed response made my head spin.
=) To be honest, the details are far beyond my knowledge being a front-line support guy. Unfortunately, the Admin that handles the bulk of our discussion and announcement list server issues won't be in until tomorrow. I'll keep this open until then and run this by the Admin to check that out further as well as look into the in-depth details you've graciously provided us. Thank you very much for that!

To check into this again, I looked at the archives and while the date looks to have updated by a day, the last message is still the same message from before, so it doesn't look like much has changed since my last reply before Christmas, unfortunately. I've looked into the archive qrunner processes and still see that it's backed up with almost 10K. It doesn't look like the processes are being completed faster despite the increase our Admin previously set.

I'm really sorry for the continued issues... I will have our Admins take another look as well as check out the information you've sent so we can see if we can find another way to get this resolved. We hope to get the hardware replaced for the list services as that will certainly help resolve these types of issues. Hopefully, we'll have an ETA on that project soon.

I'll be in touch with you again, Rob. Thank you so much for your help and patience with this matter!

Kind regards,

Jin K.
DreamHost Support Team + support@dreamhost.com
Earn over $97 for each referral: http://www.dreamhost.com/affiliates/

To continue this support case, just reply to this email.
Open a new case at: https://panel.dreamhost.com/?tab=support