bug # 477

* bug # 477
@ 2006-03-18  4:00 Jason
  2006-03-29  5:24 ` Florian Kirstein
  0 siblings, 1 reply; 6+ messages in thread
From: Jason @ 2006-03-18  4:00 UTC (permalink / raw)
  To: xen-devel

Hey guys, I just posted an addition to bug #477 regarding xenconsoled and
lots of data being pasted into a DomU. This seems to affect pastes as small
as 60 lines into the console of a DomU, and it causes a "BUG: soft lockup
detected on CPU#0!" message in the DomU into which you are pasting the
data. I have confirmed this behaviour in the March 14th pull as well as
the March 17 pull from xen-unstable-hg. I have also found that if
xenconsoled does not crash, it goes to 100% CPU in dom0. Any thoughts?

-- 
Jason
The place where you made your stand never mattered,
only that you were there... and still on your feet

* Re: bug # 477
  2006-03-18  4:00 bug # 477 Jason
@ 2006-03-29  5:24 ` Florian Kirstein
  2006-03-29  9:56   ` Ewan Mellor
  0 siblings, 1 reply; 6+ messages in thread
From: Florian Kirstein @ 2006-03-29  5:24 UTC (permalink / raw)
  To: xen-devel

Hello,

> data. I have confirmed this behaviour in the March 14th pull as well as
> the March 17 pull from xen-unstable-hg.    I have also found that if
> xenconsoled does not crash, it goes to 100% CPU in dom0.   Any thoughts?
Sorry, none, but I can easily reproduce this in the current unstable version
(2006/04/28) and also had it in older versions. I can't reproduce it
in Xen 2; there the console works smoothly. The test system was a P4
without hyperthreading.

Easiest way seems to be to open a console to a domU, start "cat" and paste
some larger block of text (I used the content of a full 90x80 xterm) and it
brings xenconsoled in Dom0 to 100% CPU reliably, either for quite some
time (30 seconds or so) or even until I kill it. In both cases the DomU
gets the "soft lockup detected" message (see below). I tried various
console buffer sizes in the xend config, but it doesn't seem to make a
difference.

OK, you could say "then don't do this", but I'd like to give some users
console access to their VMs, and it wouldn't be nice if they could
bring my Dom0 into such trouble by simply pasting a block of text
into the console...

Best regards, Florian Kirstein

P.S: the error I get every time:
Mar 29 06:55:52 vmtest kernel: BUG: soft lockup detected on CPU#0!
Mar 29 06:55:52 vmtest kernel: 
Mar 29 06:55:52 vmtest kernel: Pid: 0, comm:              swapper
Mar 29 06:55:52 vmtest kernel: EIP: 0061:[<c01010c7>] CPU: 0
Mar 29 06:55:52 vmtest kernel: EIP is at 0xc01010c7
Mar 29 06:55:52 vmtest kernel:  EFLAGS: 00000246    Not tainted  (2.6.16-xenU #1)
Mar 29 06:55:52 vmtest kernel: EAX: 00000000 EBX: 00000001 ECX: 00000000 EDX: 000027d5
Mar 29 06:55:52 vmtest kernel: ESI: c035a000 EDI: 00000001 EBP: c035bfa8 DS: 007b ES: 007b
Mar 29 06:55:52 vmtest kernel: CR0: 8005003b CR2: 0804e85c CR3: 0035d000 CR4: 00000640
Mar 29 06:55:52 vmtest kernel:  [<c01053c1>] show_trace+0x21/0x30
Mar 29 06:55:52 vmtest kernel:  [<c0102f80>] show_regs+0x1a0/0x1c8
Mar 29 06:55:52 vmtest kernel:  [<c013de51>] softlockup_tick+0x81/0x90
Mar 29 06:55:52 vmtest kernel:  [<c01280cf>] do_timer+0x3f/0xd0
Mar 29 06:55:52 vmtest kernel:  [<c0108f5c>] timer_interrupt+0x1cc/0x610
Mar 29 06:55:52 vmtest kernel:  [<c013e163>] handle_IRQ_event+0x73/0xc0
Mar 29 06:55:52 vmtest kernel:  [<c013e239>] __do_IRQ+0x89/0x100
Mar 29 06:55:52 vmtest kernel:  [<c0106b20>] do_IRQ+0x20/0x30
Mar 29 06:55:52 vmtest kernel:  [<c025414e>] evtchn_do_upcall+0x8e/0x110
Mar 29 06:55:52 vmtest kernel:  [<c01050fc>] hypervisor_callback+0x2c/0x34
Mar 29 06:55:52 vmtest kernel:  [<c0102c05>] cpu_idle+0x85/0xf0
Mar 29 06:55:52 vmtest kernel:  [<c0102035>] rest_init+0x35/0x40
Mar 29 06:55:52 vmtest kernel:  [<c035c9a5>] start_kernel+0x1c5/0x210
Mar 29 06:55:52 vmtest kernel:  [<c010006f>] 0xc010006f

* Re: bug # 477
  2006-03-29  5:24 ` Florian Kirstein
@ 2006-03-29  9:56   ` Ewan Mellor
  2006-03-29 22:39     ` Florian Kirstein
  0 siblings, 1 reply; 6+ messages in thread
From: Ewan Mellor @ 2006-03-29  9:56 UTC (permalink / raw)
  To: Florian Kirstein; +Cc: xen-devel

On Wed, Mar 29, 2006 at 07:24:01AM +0200, Florian Kirstein wrote:

> Hallo,
> 
> > data. I have confirmed this behaviour in the March 14th pull as well as
> > the March 17 pull from xen-unstable-hg.    I have also found that if
> > xenconsoled does not crash, it goes to 100% CPU in dom0.   Any thoughts?
> Sorry, none, but I can reproduce this in the current unstable version
> (2006/04/28) easily and also had it in older versions. Can't reproduce it
> in Xen 2, there the console works smoothly. Test-System was a P4
> without hyperthreading.
> 
> Easiest way seems to be to open a console to a domU, start "cat" and paste
> some larger block of text (I used the content of a full 90x80 xterm) and it
> brings xenconsoled in Dom0 to 100% CPU reliably, either for quite some
> time (30 seconds or so) or even until I kill it.

When xenconsoled gets into this state, spinning using 100% CPU, could you use
gdb to find out where it is spinning?  We've not managed to reproduce this, so
any clue as to where it's getting stuck would be useful -- a backtrace or a
core dump would be most valuable.

Thanks,

Ewan.

* Re: bug # 477
  2006-03-29  9:56   ` Ewan Mellor
@ 2006-03-29 22:39     ` Florian Kirstein
  2006-03-30  8:57       ` Keir Fraser
  0 siblings, 1 reply; 6+ messages in thread
From: Florian Kirstein @ 2006-03-29 22:39 UTC (permalink / raw)
  To: xen-devel; +Cc: Ewan Mellor

Hello,

OK, I found at least a kludge to work around this, see below. I'm not sure if
it qualifies as a clean solution, but it works for me so far...
I added it as a comment in bugzilla, hope that's OK.

> When xenconsoled gets into this state, spinning using 100% CPU, could you use
> gdb to find out where it is spinning?  We've not managed to reproduce this
Oh, and I thought it was easy to reproduce :) Now I even had difficulty
getting it into the "really hung" state, but the "hung for 30 seconds"
state was enough for a first analysis:

I used strace to see what xenconsoled is doing while consuming 100% CPU,
and what it does is "select" all the time:

select(20, [16 18 19], [], NULL, NULL)  = 1 (in [19])
select(20, [16 18 19], [], NULL, NULL)  = 1 (in [19])
select(20, [16 18 19], [], NULL, NULL)  = 1 (in [19])
select(20, [16 18 19], [], NULL, NULL)  = 1 (in [19])
select(20, [16 18 19], [], NULL, NULL)  = 1 (in [19])
select(20, [16 18 19], [], NULL, NULL)  = 1 (in [19])

Using gdb, I identified this as the select in
tools/console/daemon/io.c, line 572, in handle_io(void):
    ret = select(max_fd + 1, &readfds, &writefds, 0, NULL);
after which xenconsoled seems to iterate through the domains
to handle the input, or something like that.

My idea now was that the select might return before
the domU has really made the data available or something, and then by
running in a select loop xenconsoled slows down the machine even more,
so it takes even longer for the data to become available. Just wild
guesses, I haven't looked into the details of the console code :) So
I simply added:
 usleep(100);
after the select in io.c to slow down the select loop and give the machine
time to do other things. Possibly this is why you can't reproduce it:
because you don't have machines slow enough? :)

The result is satisfying: the console accepts the paste of even large blocks
more or less immediately, I now can't get xenconsoled to consume any
relevant amount of CPU, and I could not reproduce the soft lockup kernel
message either. Of course this patch slows down the consoles a bit, but I am
thinking of using even 1000 in the usleep; 1 ms should be a fair response time
for a console, and it prevents users from stealing Dom0 CPU by flooding
the console :)
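
To illustrate where such a delay sits, here is a minimal C sketch of a
select() loop that sleeps after every select(). It is not the xenconsoled
code: the function names, the pipe-based demo, and the zero select() timeout
(chosen only so the demo terminates) are assumptions made for illustration.

```c
#define _DEFAULT_SOURCE   /* exposes usleep() in strict -std= modes on glibc */
#include <sys/select.h>
#include <unistd.h>

/* Upper bound on loop rounds per second when every round sleeps for
 * at least delay_us microseconds; returns -1 when there is no delay. */
long max_rounds_per_second(long delay_us)
{
    return delay_us > 0 ? 1000000L / delay_us : -1;
}

/* Hypothetical stand-in for the daemon loop: poll fd with select() for
 * `rounds` rounds and sleep delay_us after each one. The data is never
 * read, so a readable fd stays readable on every round. Returns the
 * number of rounds in which select() reported fd readable. */
int throttled_select_loop(int fd, int rounds, unsigned int delay_us)
{
    int ready = 0;

    for (int i = 0; i < rounds; i++) {
        fd_set readfds;
        struct timeval tv = { 0, 0 };   /* zero timeout so the demo cannot hang */

        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        if (select(fd + 1, &readfds, NULL, NULL, &tv) == 1)
            ready++;
        usleep(delay_us);               /* give the CPU away before the next round */
    }
    return ready;
}

/* Demo helper: put one byte into a pipe and run the throttled loop on it. */
int demo_throttled_loop(int rounds, unsigned int delay_us)
{
    int fds[2];
    int ready = -1;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "x", 1) == 1)
        ready = throttled_select_loop(fds[0], rounds, delay_us);
    close(fds[0]);
    close(fds[1]);
    return ready;
}
```

The sleep does not stop select() from returning immediately; it only caps
how often the loop can run. With usleep(100) that is at most 10,000 rounds
per second, and with usleep(1000) at most 1,000.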

Possibly there's a nicer fix for this possible race condition, but for
that I don't have the insight into the inner workings of the console
mechanism (yet :).

Oh, and for the record: I never could really crash xenconsoled in my
setup (just hang it at 100% CPU), so I'm not sure if this also fixes the
initial problem Alex Kelly had in Bug #477. Possibly he could test this?

(:ul8er, r@y

* Re: bug # 477
  2006-03-29 22:39     ` Florian Kirstein
@ 2006-03-30  8:57       ` Keir Fraser
  2006-03-30 10:41         ` Keir Fraser
  0 siblings, 1 reply; 6+ messages in thread
From: Keir Fraser @ 2006-03-30  8:57 UTC (permalink / raw)
  To: Florian Kirstein; +Cc: xen-devel, Ewan Mellor

On 29 Mar 2006, at 23:39, Florian Kirstein wrote:

> My idea now was that it could be possible, that the select returns before
> the domU really made the data available or something, and then by
> running in an select-loop xenconsoled even slows down the machine more
> so it takes even longer for the data to become available. Just wild
> guesses, I haven't looked into the details of the console code :)

I bet this behaviour is due to two problems.
  1. The daemon is probably looking for console data before it clears
the event fd. So if no data is available, the fd doesn't get cleared
and the select() doesn't block next time round.
  2. I think we still have a stupid scheduler default for domain0 where
it gets all the CPU it wants and can starve out domUs. So once
xenconsoled starts spinning, that's it for you if the domain it's
spinning on is supposed to run on the same CPU.

I'll look at fixing both of these.
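
The first point can be illustrated with a minimal C sketch: select() keeps
reporting a descriptor as readable until the pending data has been consumed.
This is a generic demonstration on a pipe, not the xenconsoled code or its
event channel fd; the function names are hypothetical, and a zero timeout is
used so that the demo terminates instead of blocking.

```c
#include <sys/select.h>
#include <unistd.h>

/* Poll fd with select() max_polls times using a zero timeout and return
 * how many polls reported it readable. If drain is nonzero, one byte is
 * read after each ready report, the way a handler would clear an event. */
int count_ready_polls(int fd, int max_polls, int drain)
{
    int ready = 0;

    for (int i = 0; i < max_polls; i++) {
        fd_set readfds;
        struct timeval tv = { 0, 0 };   /* poll only, never block */

        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        if (select(fd + 1, &readfds, NULL, NULL, &tv) != 1)
            continue;                   /* not readable (or an error) */
        ready++;
        if (drain) {
            char c;
            if (read(fd, &c, 1) != 1)
                break;
        }
    }
    return ready;
}

/* Demo helper: write one byte to a pipe, then poll the read end 5 times. */
int demo_ready_polls(int drain)
{
    int fds[2];
    int ready = -1;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "x", 1) == 1)
        ready = count_ready_polls(fds[0], 5, drain);
    close(fds[0]);
    close(fds[1]);
    return ready;
}
```

Without draining, all five polls report the descriptor as readable, so a
loop built on a blocking select() would never actually block. With draining,
only the first poll does.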

  -- Keir

* Re: bug # 477
  2006-03-30  8:57       ` Keir Fraser
@ 2006-03-30 10:41         ` Keir Fraser
  0 siblings, 0 replies; 6+ messages in thread
From: Keir Fraser @ 2006-03-30 10:41 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Florian Kirstein, xen-devel, Ewan Mellor

On 30 Mar 2006, at 09:57, Keir Fraser wrote:

> I bet this behaviour is due to two problems.
>  1. The daemon is probably looking for console data before it clears 
> the event fd. So if no data available then the fd doesn't get cleared 
> and the select() doesn't block next time round.
>  2. I think we still have a stupid scheduler default for domain0 where 
> it gets all the CPU it wants and can starve out domUs. So once 
> xenconsoled starts spinning, that's it for you if the domain it's 
> spinning on is supposed to run on the same CPU.
>
> I'll look at fixing both of these.

Actually the first issue was a bit different from what I described above,
but it's now fixed in changeset 9475.

The second issue is a bug in the SEDF scheduler, I think. Domain0 is supposed
to be guaranteed 75% of the CPU, plus a 'fair' share of the remaining 25%
alongside other domains. But it seems that domains that are given
a real-time guarantee plus extratime basically get strict preference
when extratime is allocated. I haven't looked at fixing this -- it would
be great if someone else wants to take a look.
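
As a rough worked example of the difference, the C sketch below compares the
two outcomes for a domain0 that always wants more CPU. It is a toy model and
not SEDF code: the function name is hypothetical, 'fair' is assumed to mean
an equal split of the extratime among the domains competing for it, and the
other domains are assumed to have no guarantee of their own.

```c
/* CPU percentage domain0 ends up with when it always wants more CPU.
 * guarantee_pct:     its real-time guarantee (75 in the case above)
 * extratime_domains: domains competing for extratime, domain0 included
 * strict_preference: nonzero models the suspected bug, in which the
 *                    domain holding a guarantee takes all extratime */
double dom0_cpu_pct(double guarantee_pct, int extratime_domains,
                    int strict_preference)
{
    double extratime = 100.0 - guarantee_pct;

    if (strict_preference || extratime_domains <= 1)
        return 100.0;                   /* nothing is left for anyone else */
    /* Assumed meaning of 'fair': equal split of the extratime. */
    return guarantee_pct + extratime / extratime_domains;
}
```

Under these assumptions, with one domU competing, domain0 would get 87.5%
and leave 12.5% for the domU, whereas strict preference gives domain0 100%
and the domU nothing.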

  -- Keir

end of thread, other threads:[~2006-03-30 10:41 UTC | newest]

Thread overview: 6+ messages
2006-03-18  4:00 bug # 477 Jason
2006-03-29  5:24 ` Florian Kirstein
2006-03-29  9:56   ` Ewan Mellor
2006-03-29 22:39     ` Florian Kirstein
2006-03-30  8:57       ` Keir Fraser
2006-03-30 10:41         ` Keir Fraser
