* High Net and Disk Use == stuck domain
@ 2008-11-21 16:54 Christopher S. Aker
2008-11-21 17:07 ` Stefan de Konink
2008-12-01 15:05 ` Christopher S. Aker
0 siblings, 2 replies; 6+ messages in thread
From: Christopher S. Aker @ 2008-11-21 16:54 UTC (permalink / raw)
To: xen devel; +Cc: Jeremy Fitzhardinge
For the past year or so we've been seeing a bug whereby a domU's CPU
would spin up to a steady 100, 200, 300 or 400% (4 vcpus), console would
freeze, and some or all of the network-facing services within the domU
would connect but block without any output. Disk IO would flatline.
The domU would never recover and required rebooting.
Since pv_ops hasn't always been around, we previously had only seen this
behavior with xen-patched domUs (2.6.18.x), but now we're seeing it with
pv_ops. Identical symptoms. And, I have a user that is able to
reliably reproduce it on 2.6.27.4!
His recipe is downloading an ISO from a very fast and close-by news
server using nzbget. The trigger appears to be a combination of high
network use and high disk use (like download from a very fast mirror) --
because we weren't able to reproduce the problem when saving to a tmpfs
mount.
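
A minimal sketch of the kind of load described above, assuming the trigger really is a fast incoming stream being written to a disk-backed filesystem. This is not what nzbget does internally; `stream_to_disk`, `chunk_size` and `sync_every` are hypothetical names chosen for illustration. It copies any readable binary stream to a file and forces the data out to the block device periodically, so the writes cannot simply sit in the page cache:

```python
import os


def stream_to_disk(src, dest_path, chunk_size=1 << 20, sync_every=16):
    """Copy a readable binary stream to dest_path and return bytes written.

    Every `sync_every` chunks the file is flushed and fsync'ed so that the
    data is pushed to the underlying block device rather than only cached.
    """
    total = 0
    chunks = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # end of stream
            out.write(chunk)
            total += len(chunk)
            chunks += 1
            if chunks % sync_every == 0:
                out.flush()
                os.fsync(out.fileno())
        # Final flush so the tail of the file also reaches the device.
        out.flush()
        os.fsync(out.fileno())
    return total
```

In a real attempt `src` would be a network source, for example the response object returned by `urllib.request.urlopen()` for an ISO on a nearby fast mirror, and `dest_path` would point at a filesystem on the domU's virtual disk. Pointing `dest_path` at a tmpfs mount would take the block device out of the picture, which matches the case where the problem did not reproduce.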
I was able to grab the output of sysrq t while it was in the bad state:
http://theshore.net/~caker/xen/BUGS/D-state/console.log
The number of processes in D state (39) is quite suspicious.
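
For anyone who wants to watch the D-state count climb before the domU locks up completely, here is a rough sketch that counts processes in uninterruptible sleep by reading `/proc/<pid>/stat`. The function names are hypothetical. It only helps while a shell inside the domU still responds, and it counts one entry per process rather than per task, so its number can be lower than what sysrq t prints:

```python
import os


def parse_stat_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may itself contain
    # spaces or ')', so the state is the first token after the LAST ')'.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def count_d_state(proc_root="/proc"):
    """Count processes under proc_root whose state is 'D'."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                line = f.read()
        except OSError:
            continue  # the process exited between listdir() and open()
        if parse_stat_state(line) == "D":
            count += 1
    return count
```

Calling `count_d_state()` in a loop with a short sleep while the download runs would show whether processes pile up in D state gradually or all at once. Once the console is frozen, the sysrq t dump remains the only view available.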
Let me know if there's anything else I can provide.
-Chris
* Re: High Net and Disk Use == stuck domain
2008-11-21 16:54 High Net and Disk Use == stuck domain Christopher S. Aker
@ 2008-11-21 17:07 ` Stefan de Konink
2008-11-21 17:16 ` Christopher S. Aker
2008-12-01 15:05 ` Christopher S. Aker
1 sibling, 1 reply; 6+ messages in thread
From: Stefan de Konink @ 2008-11-21 17:07 UTC (permalink / raw)
To: Christopher S. Aker; +Cc: Jeremy Fitzhardinge, xen devel
Christopher S. Aker wrote:
> Let me know if there's anything else I can provide.
iSCSI/loop/blktap?
Stefan
* Re: High Net and Disk Use == stuck domain
2008-11-21 17:07 ` Stefan de Konink
@ 2008-11-21 17:16 ` Christopher S. Aker
0 siblings, 0 replies; 6+ messages in thread
From: Christopher S. Aker @ 2008-11-21 17:16 UTC (permalink / raw)
To: Stefan de Konink; +Cc: Jeremy Fitzhardinge, xen devel
Stefan de Konink wrote:
> iSCSI/loop/blktap?
Local LVM volumes exported via "phy:" in the domU's config.
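
For reference, a "phy:" disk line in an xm-style domU config generally looks like the sketch below. xm config files use Python syntax. The volume group, logical volume and device names here are made up for illustration and are not the ones from the affected host:

```python
# Hypothetical domU config fragment: each entry is "phy:<dom0 device>,<domU device>,<mode>".
# /dev/vg0/guest-root and /dev/vg0/guest-swap are placeholder LVM volumes.
disk = [
    "phy:/dev/vg0/guest-root,xvda,w",
    "phy:/dev/vg0/guest-swap,xvdb,w",
]
```

With "phy:" the dom0 block device is handed to the block backend directly, so no loop device or blktap process sits in the I/O path.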
-Chris
* Re: High Net and Disk Use == stuck domain
2008-11-21 16:54 High Net and Disk Use == stuck domain Christopher S. Aker
2008-11-21 17:07 ` Stefan de Konink
@ 2008-12-01 15:05 ` Christopher S. Aker
2008-12-01 20:19 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 6+ messages in thread
From: Christopher S. Aker @ 2008-12-01 15:05 UTC (permalink / raw)
To: Jeremy Fitzhardinge; +Cc: xen devel
Christopher S. Aker wrote:
> For the past year or so we've been seeing a bug whereby a domU's CPU
> would spin up to a steady 100, 200, 300 or 400% (4 vcpus), console would
> freeze, and some or all of the network-facing services within the domU
> would connect but block without any output. Disk IO would flatline. The
> domU would never recover and required rebooting.
>
> Since pv_ops hasn't always been around, we previously had only seen this
> behavior with xen-patched domUs (2.6.18.x), but now we're seeing it with
> pv_ops. Identical symptoms. And, I have a user that is able to
> reliably reproduce it on 2.6.27.4!
>
> His recipe is downloading an ISO from a very fast and close-by news
> server using nzbget. The trigger appears to be a combination of high
> network use and high disk use (like download from a very fast mirror) --
> because we weren't able to reproduce the problem when saving to a tmpfs
> mount.
>
> I was able to grab the output of sysrq t while it was in the bad state:
>
> http://theshore.net/~caker/xen/BUGS/D-state/console.log
>
> The number of processes in D state (39) is quite suspicious.
>
> Let me know if there's anything else I can provide.
>
> -Chris
Jeremy,
Did this one slip by you? I figured a reproducible bug would be just
too tantalizing to resist.
What's the correct venue for these issues that overlap xen-devel, lkml,
and virtualization/pv_ops stuff -- should I be blasting these to everybody?
-Chris
* Re: High Net and Disk Use == stuck domain
2008-12-01 15:05 ` Christopher S. Aker
@ 2008-12-01 20:19 ` Jeremy Fitzhardinge
2008-12-01 21:00 ` Christopher S. Aker
0 siblings, 1 reply; 6+ messages in thread
From: Jeremy Fitzhardinge @ 2008-12-01 20:19 UTC (permalink / raw)
To: Christopher S. Aker; +Cc: xen devel
Christopher S. Aker wrote:
> Christopher S. Aker wrote:
>> For the past year or so we've been seeing a bug whereby a domU's CPU
>> would spin up to a steady 100, 200, 300 or 400% (4 vcpus), console
>> would freeze, and some or all of the network-facing services within
>> the domU would connect but block without any output. Disk IO would
>> flatline. The domU would never recover and required rebooting.
>>
>> Since pv_ops hasn't always been around, we previously had only seen
>> this behavior with xen-patched domUs (2.6.18.x), but now we're seeing
>> it with pv_ops. Identical symptoms. And, I have a user that is able
>> to reliably reproduce it on 2.6.27.4!
>>
>> His recipe is downloading an ISO from a very fast and close-by news
>> server using nzbget. The trigger appears to be a combination of high
>> network use and high disk use (like download from a very fast mirror)
>> -- because we weren't able to reproduce the problem when saving to a
>> tmpfs mount.
>>
>> I was able to grab the output of sysrq t while it was in the bad state:
>>
>> http://theshore.net/~caker/xen/BUGS/D-state/console.log
>>
>> The number of processes in D state (39) is quite suspicious.
>>
>> Let me know if there's anything else I can provide.
>>
>> -Chris
>
> Jeremy,
>
> Did this one slip by you? I figured a reproducible bug would be just
> too tantalizing to resist.
Hoping it would go away by itself? ;)
I'm trying to repro it now, copying ISOs at 25 Mbytes/sec. How long
does it take to happen?
> What's the correct venue for these issues that overlap xen-devel,
> lkml, and virtualization/pv_ops stuff -- should I be blasting these to
> everybody?
Me and xen-devel are a good start, and posting in a bugzilla cc:ing me
if it looks like it's been dropped on the floor.
J
* Re: High Net and Disk Use == stuck domain
2008-12-01 20:19 ` Jeremy Fitzhardinge
@ 2008-12-01 21:00 ` Christopher S. Aker
0 siblings, 0 replies; 6+ messages in thread
From: Christopher S. Aker @ 2008-12-01 21:00 UTC (permalink / raw)
To: Jeremy Fitzhardinge; +Cc: xen devel
Jeremy Fitzhardinge wrote:
>> Did this one slip by you? I figured a reproducible bug would be just
>> too tantalizing to resist.
>
> Hoping it would go away by itself? ;)
>
> I'm trying to repro it now, copying ISOs at 25 Mbytes/sec. How long
> does it take to happen?
Under a few minutes, usually within 30 seconds. The affected kernel
binary is here:
http://theshore.net/~caker/xen/BUGS/D-state/2.6.27.4-linode14
This was built with my non-broken toolchain, too, btw :)
Meanwhile, I'll try to reproduce it in a new environment and come up
with a better recipe.
>> What's the correct venue for these issues that overlap xen-devel,
>> lkml, and virtualization/pv_ops stuff -- should I be blasting these to
>> everybody?
>
> Me and xen-devel are a good start, and posting in a bugzilla cc:ing me
> if it looks like it's been dropped on the floor.
OK -- targets acquired!
Thanks,
-Chris