* blocking Xen 3.X production use: soft lockup bugs
From: Steve Traugott @ 2006-08-02 20:54 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

Hi All,

I hate to say it, but it's starting to look like soft lockup bug(s)
are turning into a serious roadblock for general production use of Xen
3.X, on a wide range of hardware.  I've been using Xen since the 1.0
days, and I have to say that this is the most serious showstopper bug
I've ever hit -- it usually manifests itself during the first
significant network and/or disk I/O after starting a second or third
domU on the same box, and is the only bug I've ever hit that has
caused permanent damage -- it tends to corrupt guest filesystems.  In
my case it's stopped a deployment dead in its tracks, and our only
options at this point are to go back to Xen 2.X or (horrors) to native
Linux kernels.

The problem (or something that looks identical) is described in
several tickets, status currently NEW or REOPENED, no clear
resolution:
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=543
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=690
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=697
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=705

In our own shop, we consistently hit soft lockups while running on
both IBM x330's and older Netengines (similar to an IBM 4000R).  We've
found no workaround.  We're on xen-3.0-testing, changeset 9732, kernel
2.6.6.13.  On April 6th, Keir posted a note saying this was fixed as
of a blkif_schedule() fix, which we already have because that was way
back in changeset 9587...
http://lists.xensource.com/archives/html/xen-devel/2006-04/msg00121.html.

The most recent devel list traffic I've found which covers this is
July 7th:
http://lists.xensource.com/archives/html/xen-users/2006-07/msg00134.html
...this message referred back to Keir's comment as describing a fix,
but that doesn't look true; while Keir's 9587 checkin may have fixed a
soft lockup problem, there appear to be more out there, or else
there's been a regression.

Do we have any consensus that this bug is fixed at all in
xen-3.0-testing, or even unstable?  Is anyone who was hitting soft
lockups in testing *not* hitting them any more on the same hardware?
If so, what changeset are you on now?

If anyone needs any more information, just let me know.  As usual, if
anyone wants login and console server access to one of these boxes to
chase this down, I'm more than happy to provide that.

Thanks, 

Steve
-- 
Stephen G. Traugott  (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@TerraLuna.Org 
http://www.stevegt.com -- http://Infrastructures.Org

* RE: blocking Xen 3.X production use: soft lockup bugs
From: Ian Pratt @ 2006-08-02 22:25 UTC (permalink / raw)
  To: Steve Traugott, Keir Fraser; +Cc: xen-devel

> The problem (or something that looks identical) is described in
> several tickets, status currently NEW or REOPENED, no clear
> resolution:
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=543
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=690
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=697
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=705

There's very little to go on here. Two of the bugs are actually from the
same guy. One of the others is x86_64; the other two are 32-bit.

The only thing in common about the stack traces is that networking
functions seem to feature.

Taking a wild guess, are you doing some kind of unusual networking setup
involving iptables rules?

 
> Do we have any consensus that this bug is fixed at all in
> xen-3.0-testing, or even unstable?  Is anyone who was hitting soft
> lockups in testing *not* hitting them any more on the same hardware?
> If so, what changeset are you on now?

Soft lockups could be due to a huge variety of causes. It's unlikely to
be a hardware issue, and since the problems seem to be experienced by a
very small number of users my guess would be that it's configuration
dependent, most likely networking.

> If anyone needs any more information, just let me know.  As usual, if
> anyone wants login and console server access to one of these boxes to
> chase this down, I'm more than happy to provide that.

Having a really detailed bug report would really be the best way of
proceeding.

When this happens, does it just affect one guest? What's the stack
trace? How many VCPUs has the guest got? Is the guest completely hosed
or is it still pingable? What about guest console echo? What about 'xm
sysrq'? Looking in dom0, are you still seeing packets go to/from the
associated VIF? How many network interfaces has the guest got? What's
the precise networking setup in dom0? Can you come up with a recipe for
reproduction, ideally with a single guest? 
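
For the stack-trace and "does it just affect one guest" questions, a
captured console log can be summarised before it goes into a report. The
following is only a minimal sketch: it assumes the kernel prints soft
lockups as "BUG: soft lockup detected on CPU#N!" (the wording used by
2.6.16-era kernels; adjust the pattern if yours differs). The function
name count_soft_lockups and the sample log are hypothetical and made up
for illustration.

```python
import re
from collections import Counter

# Assumed message format (2.6.16-era kernels); adjust the pattern if your
# kernel words the report differently.
SOFT_LOCKUP_RE = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")


def count_soft_lockups(log_text):
    """Return a Counter mapping CPU number -> number of soft lockup reports."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SOFT_LOCKUP_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Made-up sample for illustration only; not taken from any real report.
SAMPLE_LOG = """\
eth0: link up
BUG: soft lockup detected on CPU#0!
BUG: soft lockup detected on CPU#1!
BUG: soft lockup detected on CPU#0!
"""

if __name__ == "__main__":
    for cpu, n in sorted(count_soft_lockups(SAMPLE_LOG).items()):
        print(f"CPU#{cpu}: {n} soft lockup report(s)")
```

Running this once per guest console log (and once for dom0) gives a quick
per-CPU tally to include alongside the full traces.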

Thanks,
Ian

* RE: blocking Xen 3.X production use: soft lockup bugs
From: Ian Pratt @ 2006-08-05  7:38 UTC (permalink / raw)
  To: Steve Traugott, Keir Fraser; +Cc: xen-devel

> So I built -unstable changeset 10868, and ran an even heavier workload
> (the above, plus 'bonnie' in the guests) on dom0 and two guests
> overnight, and they experienced no soft lockups; running -unstable,
> changeset 10868, credit scheduler.  This same workload would have
> caused soft lockups within seconds in -testing changeset 9732 using
> the sedf scheduler; I may not have been able to get it started at all.
> Response time remained subsecond under -unstable; -testing would have
> been on its knees.

That's good to hear. 3.0.3 is going to be a big leap forward in many
ways.

Ian

* blocking Xen 3.X production use: soft lockup bugs
From: Harry Butterworth @ 2006-08-07 14:15 UTC (permalink / raw)
  To: keir.fraser, ian.pratt, xen-devel

So when I wrote this...

> On Sat, 2006-08-05 at 14:45 +0100, Keir Fraser wrote:
> > On 5/8/06 12:59 pm, "Harry Butterworth"
> > <harry@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > 
> > > Another data point: Yesterday I was working with an unstable changeset
> > > from the morning (I think about halfway through the qemu patches) and
> > > running HVM xm-test to try to debug the create-concurrent failures.
> > > qemu_dm was taking 100% of one core and I got about 6 soft lockups in
> > > dom0 and 2 dom0 hangs.
> > > 
> > > I'm not sure exactly why HVM testing is all over the floor for me, maybe
> > > I picked a bad changeset or perhaps the recent ubuntu updates have
> > > broken something.
> > > 
> > > It's possible that there are still some lurking soft lockup issues
> > > anyway.
> > 
> > Well, I believe the issues are sorted out for paravirtualised guests at
> > least. Maybe there are lurkers for HVM guests -- if so, and they're of the
> > scale of hangs and softlockups, we'd really like detailed info so we could
> > try to repro.
> 
> I'll post the changeset and any more details I can when I get back into
> work on Monday but dd was segfaulting for me due to a locale issue after
> the ubuntu update so I don't really have a lot of confidence that it's
> even a xen problem yet.
> 
> Harry.

...the changeset was 10927 which was after 10921 where Christian changed
the HVM cdrom configuration and I was using this patch
http://lists.xensource.com/archives/html/xen-devel/2006-07/msg01052.html
which uses the old style configuration for which there is no backwards
compatibility.

So that explains why the HVM testing was all over the floor.  But it
doesn't really explain the softlockups or the dom0 hangs.  The bad config
must have been provoking some bad behaviour from something.

The HVM testing is working again for me now I have updated the above patch.
I've moved on a few changesets and I'm not getting soft lockups any more
either so for the time being I'm going back to the create-concurrent failure
that I was originally investigating.

Harry.

