Date: Tue, 16 Dec 2014 18:44:22 -0600
From: Rob Landley <rob@landley.net>
To: DreamHost Customer Support Team <support@dreamhost.com>
Subject: Re: [roblan21 97401440] lists.landley.net is borked

The archive has now caught up to December 2nd. It's currently December 16th.

The "this support ticket has been moved to a specific user's queue"
message (#6603199 has been moved) was from Friday. I have not heard back
on this issue since.

Currently, this is the most recent message listed:

http://lists.landley.net/pipermail/toybox-landley.net/2014-December/003802.html

There have been a lot of posts since then, including one from a
contributor telling me that my project's mailing list is not up to date,
and my reply to him cutting and pasting the back and forth on this issue
going back to August, and his reply back to me.

My wild guess at what's going on, given the way "wget" on any of these
pages that haven't been recently loaded takes anywhere from 30 to 90
seconds, but then the second wget of the same page is almost instant, is
that your server is swap thrashing itself to death.
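
For reference, a rough sketch of how that cold-versus-warm measurement can
be reproduced. The helper name elapsed_seconds is made up for illustration,
and it assumes a "date" that understands +%s (GNU, BSD and busybox all do):

```shell
# Print how many whole seconds a command takes to run.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Usage (fetch the same archive page twice, back to back):
#   url=http://lists.landley.net/pipermail/toybox-landley.net/2014-December/003802.html
#   echo "cold: $(elapsed_seconds wget -q -O /dev/null "$url")s"
#   echo "warm: $(elapsed_seconds wget -q -O /dev/null "$url")s"
```

A large gap between the first and second number is what points at the page
being faulted back in from somewhere slow, not at the network.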

Specifically, it looks like you haven't set any physical memory or I/O
bandwidth low water marks on the host system outside the containers, and
it hasn't got enough physical memory for the load you've given it. So
individual lxc containers (openvz containers?) are getting swapped out,
and then when it tries to swap them back _in_ the server winds up
swap-thrashing or similar (hugely I/O bound either way). The cron jobs
_in_ the containers are essentially never running, because when they try
to run the system detects it's overloaded and reschedules them for a
"later" that never comes because it'll be overloaded then too. (It may do
a trivial amount of work, such as process one message per day, and then
go "well, I'm over my allocated time, catch up next time this cron job
runs"... in 24 hours. At which point it does one message again.)

Have you considered running top and iotop on the host system outside
these containers to see how much memory is available and what's using
the I/O bandwidth? (Or maybe the cron jobs are synchronized between
containers and that's what causes an I/O storm? Something is swapping
everything out, and it's messing up cron and disabling any mail
processing triggers you might have...)
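
If top and iotop aren't handy, the kernel's own swap counters give a
quick answer. A minimal sketch, assuming a Linux host that exposes pswpin
and pswpout in /proc/vmstat (the function names are just for
illustration):

```shell
# Print the pswpin and pswpout counters (pages swapped in/out since boot)
# from a vmstat-format file; defaults to the live /proc/vmstat.
swap_counters() {
    awk '$1 == "pswpin"  { in_pages = $2 }
         $1 == "pswpout" { out_pages = $2 }
         END { print in_pages + 0, out_pages + 0 }' "${1:-/proc/vmstat}"
}

# Print how many pages were swapped in and out during an interval
# (default 10 seconds) by sampling the counters twice.
swap_delta() {
    interval=${1:-10}
    set -- $(swap_counters)
    before_in=$1 before_out=$2
    sleep "$interval"
    set -- $(swap_counters)
    echo "swapped in: $(( $1 - before_in )) pages, out: $(( $2 - before_out )) pages"
}
```

Running "swap_delta 10" on the host a few times should show numbers near
zero on a healthy box. Steady traffic in both directions at once would be
consistent with the thrashing guess above.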

Rob

========================================================================

To: rob@landley.net
From: DreamHost Customer Support Team <support@dreamhost.com>
Subject: Re: [roblan21 97858499] lists.landley.net is borked
Message-Id: <20141220015213.E211858160@badtzmaru-new.sd.dreamhost.com>
Date: Fri, 19 Dec 2014 17:52:13 -0800 (PST)

------------------------------------------------------------------------
- After reading this response, please consider visiting
- the survey below to comment on its quality. Thanks!
- http://www.dreamhost.com/survey.cgi?n=97858499&m=1638992
-
- If the service you received from us was exceptional, please consider
- tweeting your love for @dreamhost.  It'll warm our hearts, soothe
- our souls, and get you good karma at some point down the road.
------------------------------------------------------------------------

Hello,

I apologize sincerely for the delays in getting back to you here, and for
the inconveniences this has caused for you as well. Admittedly, our
mailman servers in use for discussion and announcement lists really could
use an upgrade. New hardware and routing plans are in the works. Indexing
can run slowly at times, and occasionally may be terminated or killed by
a process watcher depending on the servers' load. It *should* run again
the next night when it picks back up, but there may be times like this
one where it needs an extra kick from one of us. It appears our admin
already took care of it two days ago, but I pushed the index update again
for each of your three discussion lists just in case. It should be caught
up at this time.

Your patience has been very appreciated.

============================================================================

Date: Sat, 20 Dec 2014 19:10:21 -0600
From: Rob Landley <rob@landley.net>
To: DreamHost Customer Support Team <support@dreamhost.com>
Subject: Re: [roblan21 97858499] lists.landley.net is borked


On 12/19/14 19:52, DreamHost Customer Support Team wrote:
> one where it needs an extra kick from one of us. It appears our admin
> already took care of it two days ago, but I pushed the index update again
> for each of your three discussion lists just in case. It should be caught
> up at this time.

It's not caught up. I've received dozens of messages from this list
since the last one the archive shows.

Down at the bottom of the December page it says:

  Last message date: Wed Dec 3 17:00:08 PST 2014
  Archived on: Sat Dec 20 01:23:35 PST 2014

This is the disconnect. The "last message date" was Dec 1 a week ago,
now it's Dec 3. I have not gone back and retroactively generated
messages, they're in the system and not getting processed.

Two new messages have shown up on
http://lists.landley.net/pipermail/toybox-landley.net/2014-December in
the past few days, which is progress. But the most recent message in
http://lists.landley.net/pipermail/toybox-landley.net/2014-December/date.html
is from December 3, which was 17 days ago.

I received 14 messages from this list _yesterday_, via email. At this
rate, the first of yesterday's messages might show up in the web archive
sometime in late February. Before that I received multiple messages on
the 18th, and on the 17th, and on the 16th, and on the 15th...
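
Back-of-envelope version of that estimate, as a sketch. It assumes messages
get archived in order and the rate stays where it has been: the archive's
"last message date" moved from Dec 1 to Dec 3 (2 days of list traffic) over
roughly 7 real days, and yesterday's mail is 16 days past Dec 3. The
function name is made up for illustration:

```shell
# Estimate the real days needed to work through a backlog, given that the
# archive advanced `advanced` days of mail during `elapsed` real days.
# Integer math, constant rate assumed.
catchup_days() {
    backlog=$1 advanced=$2 elapsed=$3
    echo $(( backlog * elapsed / advanced ))
}

catchup_days 16 2 7   # Dec 3 -> Dec 19 at 2 days of mail per week
```

That comes out to 56 days, which counted from December 20th is the middle
of February if nothing slows down further.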

Moving forward from where the archive is, there should be a reply to
http://lists.landley.net/pipermail/toybox-landley.net/2014-December/003802.html
from me, and then a reply to THAT from Felix Janda dated December 4.
Those may trickle in over the next week? (I don't know.)

This doesn't appear to be an indexing problem, this is a delivery
problem. Whatever's feeding the messages _into_ the mailman system is
delivering them one at a time, half a month late.

It's not the actual list mailserver. I get messages to my gmail account
fairly promptly. (I'm cc:'d on half of them and gmail filters out
duplicates, but I haven't received any via email noticeably delayed.)

But the mailman mbox text files (accessible as "gzipped text" on the
right from http://lists.landley.net/pipermail/toybox-landley.net/) do
not receive them for a very long time, and the indexes can't index data
they haven't received yet.
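
A sketch of how to check what the mbox file itself has received,
independent of the HTML index. The function name is made up, and the
2014-December.txt.gz file name in the usage comment is my assumption about
how pipermail names the "gzipped text" download:

```shell
# Print the Date: header of the last message in an mbox read from stdin.
# Uses the mbox "From " separator lines to tell where headers begin, so a
# "Date:" line inside a message body is ignored.
last_mbox_date() {
    awk '/^From / { in_headers = 1; next }
         /^$/     { in_headers = 0 }
         in_headers && /^Date: / { last = substr($0, 7) }
         END { print last }'
}

# Usage:
#   wget -q -O - http://lists.landley.net/pipermail/toybox-landley.net/2014-December.txt.gz |
#       gunzip -c | last_mbox_date
```

If that date matches the "Last message date" on the index page, the index
is current with its input and the delay is upstream of the archiver.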

I'm glad that at least the messages don't seem to be getting permanently
lost. (Yet.) But in addition to my open source developers using the
archive, I personally value the web archive precisely _because_ gmail
filters out duplicates (and thus breaks threading, making my local copy
of this data live half in my personal inbox and half in my list folder;
of the 14 messages from yesterday 6 are in the list folder and the rest
in my inbox, thank you gmail).

When sites like https://lwn.net/Articles/616272/ cover my project, they
copy messages like https://lwn.net/Articles/616315/ from the web
archive, which they can't do if it's not there.

I note that the http://landley.net/dreamhost.txt reply I copied to the
website so I could refer to it was in reply to the Android Bionic
maintainer, because this project has just been merged into the Android
base system for release in the next Android version, which means my code
would be running on a billion phones. I'd kind of like to avoid losing
this opportunity, for reasons I described in my presentation on it at
last year's Embedded Linux Conference:

https://www.youtube.com/watch?v=SGmtP5Lg_t0

One of the messages I'd very much like to link to is the Android
maintainer describing his progress merging toybox into Android, and his
remaining todo list items. Unfortunately, I can't do so because he
posted it yesterday. (See "february", above.)

Yes, everybody thinks their stuff is important, but I'm fairly certain
this isn't just inconveniencing me personally. I'm glad the website is
working. I'm glad the mercurial repository is working. I'm glad the
actual email part of the mailing list is working. I'm sorry to keep
pestering you about this, but it matters to me that the mailing list
archive works.

Could you ask whoever fixed this when it happened back in August what
they did? (Or did it just resolve itself...?)

Rob

=======================================================================

To: rob@landley.net
From: DreamHost Customer Support Team <support@dreamhost.com>
Subject: [roblan21 97957266] Message from support.
Message-Id: <20141222051900.0997458137@badtzmaru-new.sd.dreamhost.com>
Date: Sun, 21 Dec 2014 21:19:00 -0800 (PST)

Hi there Rob!

I'm terribly sorry for the delay in response to this matter...

I've been noticing some issues with list archives not being updated
properly, even though I'm hearing that the messages themselves are making
it through. I'm planning on filing a bug with our Admins tomorrow with the
details of this matter so we can get it fixed, as I'm not sure what's
causing this or how to resolve it myself.

I'll hold onto this ticket in my personal queue so I can update you with
further details as I get them. I'll be in touch with you again.


Thank you kindly,

Jin K.

DreamHost Support Team + support@dreamhost.com
Earn over $97 for each referral: http://www.dreamhost.com/affiliates/
To continue this support case, just reply to this email.
Open a new case at: https://panel.dreamhost.com/?tab=support

=========================================================================

To: rob@landley.net
From: DreamHost Customer Support Team <support@dreamhost.com>
Subject: Re: [roblan21 98044907] lists.landley.net is borked
Date: Wed, 24 Dec 2014 12:20:42 -0800 (PST)

Hi again!

Thanks for your patience!

Our Admins were able to find the cause of the issue here. It looks like
the issue is with regards to a couple of things:

1. The list server was backed up with a large log of messages that it
needed to archive, so the archiving was being done, but much slower than
the amount of incoming/outgoing messages.

2. The hardware for the list server is somewhat old, and has a higher
load due to this that can cause the processes on those machines to run
slower.

For the first problem, we were able to spot the issue and our Admins were
able to increase the archive qrunner process count to help this run
faster. The server is now playing catch-up, so it may be a bit of time
before that's all included in the archive due to the backlog of requests
that are still on the machine now.

But, this did help with the processes running much faster. It will still
need some time to get caught up to the current date, though. Please
monitor the archiving over the next week and let me know if you notice
further issues.

As for the second issue, we are looking into upgrading the hardware on
the list servers as a couple of the machines are kind of old. I don't
have an ETA on when this will all take place, but there are internal
talks about getting the hardware replaced for the list service, so once
that's done, that'll help greatly with the list service, and will
certainly help to keep the archives up to date in a timely manner.

The issue I assisted with back in October was a different issue with the
'scooter' mount not being properly mounted. This issue is with the
archives simply not being processed and just sitting there waiting to be
archived, but due to the backlog of requests the machine has to handle,
that didn't get done in a timely manner, but the fix the Admins did
should help get the requests processed faster.

The archiving of the list posts should be processing much faster now that
our Admins were able to increase the processes on the server for
archiving, so please keep an eye on that and let us know if you notice
any further issues so we can check that out then. Hopefully, you'll
notice the updates within the next day or so.

If you have any other questions at all, please reply back. I'd be happy
to help!


Thank you kindly,

Jin K.

DreamHost Support Team + support@dreamhost.com

===========================================================================

Date: Wed, 24 Dec 2014 22:57:09 -0600
From: Rob Landley <rob@landley.net>
To: DreamHost Customer Support Team <support@dreamhost.com>
Subject: Re: [roblan21 98044907] lists.landley.net is borked

On 12/24/14 14:20, DreamHost Customer Support Team wrote:
> Hi again!
> 
> Thanks for your patience!

Merry Christmas!

> Our Admins were able to find the cause of the issue here. It looks like
> the issue is with regards to a couple of things:
> 
> 1. The list server was backed up with a large log of messages that it
> needed to archive, so the archiving was being done, but much slower than
> the amount of incoming/outgoing messages.
> 
> 2. The hardware for the list server is somewhat old, and has a higher
> load due to this that can cause the processes on those machines to run
> slower.

Something like swap-thrashing has to be occurring to transition "keeping
up with realtime input for weeks at a time" into "multiple weeks behind,
lucky to get a message a day".

This isn't even 1/10th normal speed, this is "no visible progress for
multiple days, system is paralyzed". This is not a CPU scheduling issue
for a constantly running daemon, this is a three-orders-of-magnitude
screwup, and that means a rotating disk has stopped doing streaming I/O
and started introducing a seek between each sector read. That's just
about the only place in a modern computer you _get_ that kind of giant
latency insertion. (Ok, there's stuff like reverse DNS lookup failures
with the insane 15 second timeout in ssh logging, but that doesn't apply
here.)
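
To put a number on "3 orders of magnitude", a quick sketch with assumed
round figures for a rotating disk: about 100 MB/s when streaming, about
10 ms per seek, one 512-byte sector read per seek. None of these are
measurements from your hardware, and the function name is made up:

```shell
# Ratio of streaming throughput to seek-per-read throughput.
# Args: streaming bytes/sec, milliseconds per seek, bytes read per seek.
slowdown_ratio() {
    stream_bps=$1 seek_ms=$2 bytes_per_seek=$3
    thrash_bps=$(( bytes_per_seek * 1000 / seek_ms ))
    echo $(( stream_bps / thrash_bps ))
}

slowdown_ratio 100000000 10 512   # assumed: 100 MB/s, 10 ms seek, 512-byte sector
```

With those inputs the disk delivers about 51 KB/s instead of 100 MB/s, a
factor of roughly 1950.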

You don't happen to have an admin who thinks "10 gigs of swap space
means the machine will never run out of memory, so this service will
always stay up", do you? I've seen the result of that. It's a system
thrashing so badly you spend half an hour waiting for bash to make its
way through /etc/profile when you ssh into it.

Have you tried (as root):

echo 1 > /proc/sys/vm/drop_caches

It's counterintuitive, but it can slap some sense into a hysterical
box long enough for you to fix the overload. (It says free all clean
disk cache _now_, then processes and disk cache race to claim the freed
memory. It means the processes can establish a working set for a few
seconds and maybe progress long enough to clear the backlog.)

I've sometimes done:

while true; do sync; echo 1 > /proc/sys/vm/drop_caches; sleep 30; done

And left that running on a thrashing box until it was out of the woods.
Not a long-term solution but as shock therapy it works surprisingly well...
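
The loop above runs until somebody kills it. A bounded variant, as a
sketch: the function name and its arguments are made up for illustration,
and the target path is only a parameter so it can be dry-run against a
scratch file first. On a real box it needs root.

```shell
# Flush dirty data, then drop clean page cache, `count` times with
# `interval` seconds between rounds. Stops early if the write fails.
drop_caches_loop() {
    count=$1 interval=$2 target=${3:-/proc/sys/vm/drop_caches}
    i=0
    while [ "$i" -lt "$count" ]; do
        sync
        echo 1 > "$target" || return 1
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Usage (as root): drop_caches_loop 20 30    # about ten minutes, then stop
```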

> For the first problem, we were able to spot the issue and our Admins were
> able to increase the archive qrunner process count to help this run
> faster.

This seems unlikely to be an issue of a box being CPU starved. Not on
Linux boxes, they ensure non-realtime threads get a tick each scheduler
slice refill precisely to avoid that martian space probe problem.
(Google "mars pathfinder priority inversion".) So you can crush
processes to about 1/20th of their normal speed with priorities, but not
1/10,000th. Not unless you're running 10,000 CPU hog processes on a
single box? Ever since the O(1) scheduler went in (like 15 years ago
now) scheduler tick refills won't give _any_ process zero ticks for an
entire period unless you're doing SCHED_RR which _nobody_ is crazy
enough to run on a server...

But maybe you're running BSD or something? I've only encountered this
kind of behavior due to memory starvation resulting in pathologically
bad I/O scheduling decisions resulting in a LOT of thrashing (you can
even do it without swap, since executable pages are mmapped and can be
dropped and individually faulted back in and if there's no _other_
source of physical pages to handle a fault...).

But you know your config better than I do. Either way, yay progress.

Ah, one other thing that can do it is if the system has multiple I/O
busses of different speeds and you're running an old 2.6 kernel where
calls to sync aren't per-queue. So you can have a USB2 disk with 16 gigs
of buffered writes waiting to go out (I hit this running mke2fs on a
terabyte USB disk) and then when anything calls sync (thank you vim for
doing that every 100 keystrokes if you let it make .swp files; I wrote
an LD_PRELOAD .so to stub out the sync calls for it to get it to STOP
doing that), the old sync() implementation inserts a global barrier
where all new I/O is blocked until _all_ pending I/O from before the
sync call goes out, and that includes flushing all the data to the slow
device. So the machine can go bye-bye for 15 minutes flushing the data
in that case.
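
A sketch for checking whether a pile of buffered writes like that is
sitting around, assuming a Linux box whose /proc/meminfo has the usual
Dirty and Writeback lines (the function name is made up):

```shell
# Print the Dirty and Writeback sizes in kB from a meminfo-format file
# (defaults to the live /proc/meminfo).
pending_writeback_kb() {
    awk '$1 == "Dirty:"     { dirty = $2 }
         $1 == "Writeback:" { writeback = $2 }
         END { print dirty + 0, writeback + 0 }' "${1:-/proc/meminfo}"
}

# Usage: pending_writeback_kb
```

Dirty is data waiting to be written out and Writeback is data being
written right now. Gigabytes in the first number would fit the scenario
above, small numbers would rule it out.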

I vaguely recall they fixed that somewhere around... 3.4 maybe?

But again, unlikely to be the case on this server (are you writing
gigabytes of data through USB2?). It _really_ sounds like some variant
of swap thrashing...

> The server is now playing catch-up, so it may be a bit of time
> before that's all included in the archive due to the backlog of requests
> that are still on the machine now.
> 
> But, this did help with the processes running much faster. It will still
> need some time to get caught up to the current date, though. Please
> monitor the archiving over the next week and let me know if you notice
> further issues.

According to
http://lists.landley.net/pipermail/toybox-landley.net/2014-December/date.html

  Last message date: Wed Dec 3 23:37:33 PST 2014
  Archived on: Sun Dec 21 02:52:08 PST 2014

It's now Wednesday evening on the 24th, so it's saying the list index
was last rebuilt just under 4 days ago.

Presumably the index and the data being indexed will meet up at some
point...

> As for the second issue, we are looking into upgrading the hardware on
> the list servers as a couple of the machines are kind of old. I don't
> have an ETA on when this will all take place, but there are internal
> talks about getting the hardware replaced for the list service, so once
> that's done, that'll help greatly with the list service, and will
> certainly help to keep the archives up to date in a timely manner.
> 
> The issue I assisted with back in October was a different issue with the
> 'scooter' mount not being properly mounted. This issue is with the
> archives simply not being processed and just sitting there waiting to be
> archived, but due to the backlog of requests the machine has to handle,
> that didn't get done in a timely manner, but the fix the Admins did
> should help get the requests processed faster.

Indeed, there have been at least 4 things wrong with the list processing
over the past few months:

1) That mount issue you mentioned, rendering it 404.

2) The index not updating regularly, with the index page's timestamp
saying it's several days old.

3) Messages not being delivered to the list archive for processing in a
timely manner. (So when index page updates _do_ happen, they don't index
current data.)

4) End-user page loads (wget from out in the world) taking anywhere from
30 to 70 seconds to get a response if that specific page hasn't been
accessed in the last hour or so.

Presumably the last three are related. (Again, I imagine your poor
server swap thrashing its guts out. Have you run iotop on it?)

Note that 2 and 3 are distinct issues. Both have recurred over the past
few months. You say you just fixed #3 (yay!), and I await the next time
#2 unblocks so I can see the result.

Speaking of #2: Note there's a top level index and a per-month index.
The December index linked above says December 21. The top level index
http://lists.landley.net/pipermail/toybox-landley.net/ says November
28th, which seems unlikely to be accurate if it's listing December as a
month...

> The archiving of the list posts should be processing much faster now that
> our Admins were able to increase the processes on the server for
> archiving,

Unless swap-thrashing (or similar rotating media I/O contention breaking
up all streaming I/O and inserting a seek between every sector read) was
the problem, in which case adding more competing processes would
probably make it worse. But presumably they would have noticed that, so...
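
One way to tell which it is would be to watch the archive queue depth over
time. A sketch, assuming Mailman 2.x, which keeps each pending queue entry
as a .pck file under qfiles/. The function name is made up and the
directory in the usage comment is a guess at a typical install, not your
actual layout:

```shell
# Count pending entries (.pck files) in a Mailman 2.x queue directory.
queue_depth() {
    find "$1" -maxdepth 1 -type f -name '*.pck' | wc -l | tr -d ' '
}

# Usage (log the depth once an hour):
#   while true; do
#       echo "$(date) $(queue_depth /var/lib/mailman/qfiles/archive)"
#       sleep 3600
#   done
```

If the count falls faster after the qrunner change, more processes helped.
If it stays flat or climbs, the extra processes are just competing for the
same disk.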

> so please keep an eye on that and let us know if you notice
> any further issues so we can check that out then. Hopefully, you'll
> notice the updates within the next day or so.

If the index rebuilds so I can see it, sure.

> If you have any other questions at all, please reply back. I'd be happy
> to help!

I hope you get Christmas off. :)

The archive's been down all month, another couple days doesn't make a
huge difference at this point. (I'm not sure what happens when messages
from the previous month start trickling in, but if it's processing
messages from the 3rd after the 19th and that works out ok, I guess it's
fine?)

Thanks,

Rob

===========================================================================

To: rob@landley.net
From: DreamHost Customer Support Team <support@dreamhost.com>
Subject: [roblan21 98134032] Message from support.
Date: Sat, 27 Dec 2014 14:01:38 -0800 (PST)

Hi Rob!

Thank you for the update! And I hope you had a wonderful Christmas! 

I have to admit, your detailed response made my head spin. =) 

To be honest, the details are far beyond my knowledge being a front-line
support guy. Unfortunately, the Admin that handles the bulk of our
discussion and announcement list server issues won't be in until
tomorrow. I'll keep this open until then and run this by the Admin to
check that out further as well as look into the in-depth details you've
graciously provided us. Thank you very much for that!

To check into this again, I looked at the archives and while the date
looks to have updated by a day, the last message is still the same
message from before, so it doesn't look like much has changed since my
last reply before Christmas, unfortunately.

I've looked into the archive qrunner processes and still see that it's
backed up with almost 10K. It doesn't look like the processes are being
completed faster despite the increase our Admin previously set. 

I'm really sorry for the continued issues... I will have our Admins take
another look as well as check out the information you've sent so we can
see if we can find another way to get this resolved. 

We hope to get the hardware replaced for the list services as that will
certainly help resolve these types of issues. Hopefully, we'll have an
ETA on that project soon. 

I'll be in touch with you again, Rob. Thank you so much for your help and
patience with this matter!


Kind regards,

Jin K.

DreamHost Support Team + support@dreamhost.com