* High Net and Disk Use == stuck domain
@ 2008-11-21 16:54 Christopher S. Aker
  2008-11-21 17:07 ` Stefan de Konink
  2008-12-01 15:05 ` Christopher S. Aker
  0 siblings, 2 replies; 6+ messages in thread
From: Christopher S. Aker @ 2008-11-21 16:54 UTC (permalink / raw)
  To: xen devel; +Cc: Jeremy Fitzhardinge

For the past year or so we've been seeing a bug whereby a domU's CPU 
would spin up to a steady 100, 200, 300 or 400% (4 vcpus), console would 
freeze, and some or all of the network-facing services within the domU 
would connect but block without any output.  Disk IO would flatline. 
The domU would never recover and required rebooting.

Since pv_ops hasn't always been around, we previously had only seen this 
behavior with xen-patched domUs (2.6.18.x), but now we're seeing it with 
pv_ops.  Identical symptoms.  And I have a user who is able to 
reliably reproduce it on 2.6.27.4!

His recipe is downloading an ISO from a very fast and close-by news 
server using nzbget.  The trigger appears to be a combination of high 
network use and high disk use (like downloading from a very fast mirror) -- 
because we weren't able to reproduce the problem when saving to a tmpfs 
mount.

I was able to grab the output of sysrq t while it was in the bad state:

http://theshore.net/~caker/xen/BUGS/D-state/console.log

The number of processes in D state (39) is quite suspicious.
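
For anyone who wants to watch that count from inside a guest before it wedges, here is a minimal sketch. It is not part of the original report: it assumes a Linux /proc filesystem and a domU that is still responsive enough to run Python, and the function names are made up for illustration. Once the console is frozen, sysrq t remains the only way to get this information.

```python
import os


def state_from_stat(stat_line: str) -> str:
    """Return the one-letter task state from the contents of /proc/<pid>/stat."""
    # The second field (comm) is wrapped in parentheses and may itself
    # contain spaces or ')', so split after the LAST closing parenthesis.
    rest = stat_line[stat_line.rfind(")") + 1:].split()
    return rest[0]


def count_d_state(proc_root: str = "/proc") -> int:
    """Count tasks currently in uninterruptible sleep (state 'D')."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                if state_from_stat(f.read()) == "D":
                    count += 1
        except OSError:
            continue  # the process exited while we were scanning
    return count


if __name__ == "__main__":
    print(count_d_state())
```

Run in a loop during the download, a count that climbs and never drops back would match the picture in the sysrq t output.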

Let me know if there's anything else I can provide.

-Chris

* Re: High Net and Disk Use == stuck domain
  2008-11-21 16:54 High Net and Disk Use == stuck domain Christopher S. Aker
@ 2008-11-21 17:07 ` Stefan de Konink
  2008-11-21 17:16   ` Christopher S. Aker
  2008-12-01 15:05 ` Christopher S. Aker
  1 sibling, 1 reply; 6+ messages in thread
From: Stefan de Konink @ 2008-11-21 17:07 UTC (permalink / raw)
  To: Christopher S. Aker; +Cc: Jeremy Fitzhardinge, xen devel

Christopher S. Aker wrote:
> Let me know if there's anything else I can provide.

iSCSI/loop/blktap?


Stefan

* Re: High Net and Disk Use == stuck domain
  2008-11-21 17:07 ` Stefan de Konink
@ 2008-11-21 17:16   ` Christopher S. Aker
  0 siblings, 0 replies; 6+ messages in thread
From: Christopher S. Aker @ 2008-11-21 17:16 UTC (permalink / raw)
  To: Stefan de Konink; +Cc: Jeremy Fitzhardinge, xen devel

Stefan de Konink wrote:
> iSCSI/loop/blktap?

Local LVM volumes exported via "phy:" in the domU's config.

-Chris

* Re: High Net and Disk Use == stuck domain
  2008-11-21 16:54 High Net and Disk Use == stuck domain Christopher S. Aker
  2008-11-21 17:07 ` Stefan de Konink
@ 2008-12-01 15:05 ` Christopher S. Aker
  2008-12-01 20:19   ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 6+ messages in thread
From: Christopher S. Aker @ 2008-12-01 15:05 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: xen devel

Jeremy,

Did this one slip by you?  I figured a reproducible bug would be just 
too tantalizing to resist.

What's the correct venue for these issues that overlap xen-devel, lkml, 
and virtualization/pv_ops stuff -- should I be blasting these to everybody?

-Chris

* Re: High Net and Disk Use == stuck domain
  2008-12-01 15:05 ` Christopher S. Aker
@ 2008-12-01 20:19   ` Jeremy Fitzhardinge
  2008-12-01 21:00     ` Christopher S. Aker
  0 siblings, 1 reply; 6+ messages in thread
From: Jeremy Fitzhardinge @ 2008-12-01 20:19 UTC (permalink / raw)
  To: Christopher S. Aker; +Cc: xen devel

Christopher S. Aker wrote:
> Jeremy,
>
> Did this one slip by you?  I figured a reproducible bug would be just 
> too tantalizing to resist.

Hoping it would go away by itself? ;)

I'm trying to repro it now, copying ISOs at 25 Mbytes/sec.  How long 
does it take to happen?

> What's the correct venue for these issues that overlap xen-devel, 
> lkml, and virtualization/pv_ops stuff -- should I be blasting these to 
> everybody?

Me and xen-devel are a good start, and posting in a bugzilla cc:ing me 
if it looks like it's been dropped on the floor.


    J

* Re: High Net and Disk Use == stuck domain
  2008-12-01 20:19   ` Jeremy Fitzhardinge
@ 2008-12-01 21:00     ` Christopher S. Aker
  0 siblings, 0 replies; 6+ messages in thread
From: Christopher S. Aker @ 2008-12-01 21:00 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: xen devel

Jeremy Fitzhardinge wrote:
>> Did this one slip by you?  I figured a reproducible bug would be just 
>> too tantalizing to resist.
> 
> Hoping it would go away by itself? ;)
> 
> I'm trying to repro it now, copying ISOs at 25 Mbytes/sec.  How long 
> does it take to happen?

Under a few minutes, usually within 30 seconds.  The affected kernel 
binary is here:

http://theshore.net/~caker/xen/BUGS/D-state/2.6.27.4-linode14

This was built with my non-broken toolchain, too, btw :)

Meanwhile, I'll try to reproduce it in a new environment and come up 
with a better recipe.

>> What's the correct venue for these issues that overlap xen-devel, 
>> lkml, and virtualization/pv_ops stuff -- should I be blasting these to 
>> everybody?
> 
> Me and xen-devel are a good start, and posting in a bugzilla cc:ing me 
> if it looks like it's been dropped on the floor.

OK -- targets acquired!

Thanks,
-Chris
