* RE: RE: wget and Zope crashes on post-2.0.6 -testing
@ 2005-06-08 21:19 Ian Pratt
  2005-06-08 21:42 ` Kurt Garloff
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Pratt @ 2005-06-08 21:19 UTC (permalink / raw)
  To: Kurt Garloff; +Cc: xen-devel, Kip Macy, Rich Persaud, Osma Suominen, Gerd Knorr

 
> I observed that the first userspace process that uses the FPU 
> will SIGFPE once. Afterwards everything runs just fine ...
> 
> Your description looks like it matches exactly the 
> misbehaviour I've been seeing.

Got any more critical bugs you're not telling us about? :-)

> Is attached patch the right way to fix this?

I think that should work (with the obvious kernel_ prefix), but I've
appended what we've gone for.

Best,
Ian 

--- linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/bugs.h	2005-06-08 22:08:52.000000000 +0100
+++ linux-2.6.11-xen0/include/asm-i386/bugs.h	2005-03-02 07:37:49.000000000 +0000
@@ -107,7 +107,6 @@
                "fninit"
                : "=m" (*&boot_cpu_data.fdiv_bug)
                : "m" (*&x), "m" (*&y));
+       stts();
        if (boot_cpu_data.fdiv_bug)
                printk("Hmm, FPU with FDIV bug.\n");
 }

* RE: wget and Zope crashes on post-2.0.6 -testing
@ 2005-06-10 20:01 Ian Pratt
  0 siblings, 0 replies; 13+ messages in thread
From: Ian Pratt @ 2005-06-10 20:01 UTC (permalink / raw)
  To: Robbie Dinn, xen-devel

> > The last point is where we can fall down: if the TS bit in fact
> > *isn't* set, then we are screwed for all time. The kernel will never
> > realise a process is using the FPU because we will never take the TS
> > fault, because the TS bit is clear. Thus state doesn't get
> > saved/restored during context switch and the TS bit never gets set. So
> > it's a self-perpetuating state once you're in it.
> 
> Say I have a Xen machine with several domains, some with 
> kernels that have the FPU bug fix and some without. Can a 
> domain with the buggy kernel upset a domain with a bug free kernel?
> Or does this just affect processes within one domain?

It just affects the one domain. 

Best,
Ian

> I might want to be a bit more hasty in upgrading all the 
> kernels if a buggy kernel/domain can upset a good kernel/domain.

* RE: wget and Zope crashes on post-2.0.6 -testing
@ 2005-06-08 17:58 Ian Pratt
  0 siblings, 0 replies; 13+ messages in thread
From: Ian Pratt @ 2005-06-08 17:58 UTC (permalink / raw)
  To: Keir Fraser, Osma Suominen
  Cc: xen-devel, Kip Macy, Kurt Garloff, Rich Persaud, Gerd Knorr

> On 3 Jun 2005, at 10:04, Osma Suominen wrote:
> 
> > When you've had wget crash, you can try some of the other tests in
> > http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628
> >
> > Since this happens on a random PC with the demo CD, I'll bet that this
> > is not some obscure problem with the specific hardware or software
> > installation but a real bug in Xen.
> 
> This bug should now be fixed in our xen-2.0.testing.bk repository.

This deserves a bit more explanation, as it probably affects all vendor
kernels based on Xen 2.0 (SuSE 9.3 Pro, Debian, demo CD, Gentoo, etc.)
It does *not* affect the kernel we ship in our 2.0 source and binary
tarballs, which is why it's taken so long to pin down. It does *not*
affect the unstable branch.

The reason the bug is not present in our kernels is due to the kernel
config: we enable CONFIG_MD_RAID5=y in our config which hides the bug,
whereas most distros have this as a module.

The root cause of the bug is that during the boot sequence Linux tests
to see whether the processor has the fdiv bug. This involves doing some
floating point operations. Unfortunately, they are not wrapped in the
kernel_fpu_begin()/end() calls that normally surround use of fp in the
kernel. Native Linux gets away with this because it happens so early in
the boot process that no-one else can be using the fpu. However, on Xen
this gets us into a bad state, which will come back to haunt us much
later on, resulting in fpu state corruption in user processes. The fix
in 2.0-testing is simply to 'wrap' the fdiv test.

The reason the bug is not present on unstable is that the fpu code had
already been rejigged so that we were immune to this kind of problem as
it had been identified as a potential fragility.

Since this bug hadn't been widely reported we probably won't rush to
release a 2.0.6a demo CD, but vendor kernel maintainers should
definitely pick up the fix.

Best,
Ian 

* RE: wget and Zope crashes on post-2.0.6 -testing
@ 2005-06-02 14:32 Ian Pratt
  2005-06-02 15:07 ` Osma Suominen
  2005-06-03  9:04 ` Osma Suominen
  0 siblings, 2 replies; 13+ messages in thread
From: Ian Pratt @ 2005-06-02 14:32 UTC (permalink / raw)
  To: Osma Suominen; +Cc: xen-devel

> Unfortunately the machine is not completely in my control. It 
> is owned by another company and I only have access to dom1 
> and dom2, not dom0 (no other domains on the machine). But I 
> will try to reproduce this on another machine. It might be a 
> lot easier if there was a 2.0.6 demo CD, though...

Funny you should say that....

Please don't all download it at once, but there's a preview available at:
http://www.cl.cam.ac.uk/Research/SRG/netos/xen/downloads/xendemo-2.0.6.iso
 
> Anyway, thanks for your input. I will look at whether NTP is 
> involved and do some further investigation.

If you're running NTP in your local domain you should enable
independent_wallclock 

e.g. echo 1 > /proc/sys/xen/independent_wallclock or put
independent_wallclock=1 on your kernel command line. [NB: someone should
document the kernel config option]

I'll wager that this is your problem. Hmm, that's a pretty nasty failure
mode. Without doing something gross and intercepting the adjtimex
syscall there's not a lot we can do about it.

Ian 

* RE: wget and Zope crashes on post-2.0.6 -testing
@ 2005-06-02 13:50 Ian Pratt
  2005-06-02 14:21 ` Osma Suominen
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Pratt @ 2005-06-02 13:50 UTC (permalink / raw)
  To: Osma Suominen, xen-devel

 
> I reported about time-related problems some days ago, with no replies:
> http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628
> 
> I have problems with e.g. wget and Zope crashing on domU on a 
> recent -testing build. This is on a Debian Sarge system, with 
> kernel 2.6.11.11 and a Xen -testing snapshot from two days 
> ago (2005-05-31). The problems are not as easy to trigger as 
> with earlier versions (e.g. the 2.0.5 demo CD), but they do happen.
> 
> The symptom is that during heavy load, wget crashes with the message
> "acalc_rate: Assertion `msecs >= 0' failed", which probably 
> means that time has stepped backwards (looking at earlier 
> xen-devel posts).

I can't reproduce this wget crash, even running seti@home in the
background as you suggest.
I'm running this in dom0 on an SMP Xeon box.

Are you running NTP on your system? If so, what does "echo peers | ntpq"
show? What happens if you disable it?

Is there anything odd about the system? Is the CPU clock speed correctly
identified?

You could try the unstable tree -- it would certainly be interesting to
know if there was a difference. The issue must be really quite specific
to your machine or setup (e.g. the crystal is completely knackered) as
otherwise lots of people would be complaining.

Best,
Ian

* wget and Zope crashes on post-2.0.6 -testing
@ 2005-06-02 10:22 Osma Suominen
  0 siblings, 0 replies; 13+ messages in thread
From: Osma Suominen @ 2005-06-02 10:22 UTC (permalink / raw)
  To: xen-devel

Hello,

I reported about time-related problems some days ago, with no replies:
http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628

I have problems with e.g. wget and Zope crashing on domU on a recent 
-testing build. This is on a Debian Sarge system, with kernel 2.6.11.11 
and a Xen -testing snapshot from two days ago (2005-05-31). The problems 
are not as easy to trigger as with earlier versions (e.g. the 2.0.5 demo 
CD), but they do happen.

The symptom is that during heavy load, wget crashes with the message 
"acalc_rate: Assertion `msecs >= 0' failed", which probably means that 
time has stepped backwards (looking at earlier xen-devel posts).

Also, Zope frequently dies with different time-related error messages. 
Here's the end of a typical traceback:

   File "/usr/lib/zope2.7/lib/python/DateTime/DateTime.py", line 694, in _parse_args
     lt = safelocaltime(t)
   File "/usr/lib/zope2.7/lib/python/DateTime/DateTime.py", line 437, in safelocaltime
     raise TimeError, 'The time %f is beyond the range ' \
TimeError: The time nan is beyond the range of this Python implementation.

It is fairly easy to crash Zope this way by using a tool such as apache's 
benchmarking utility ab/ab2 or wget to pound on it. It usually takes a few 
minutes on an otherwise unloaded machine to bring down Zope. Note that 
Zope runs just fine on a similar native Linux system, and after running 
production Zope systems for more than a year, I have never seen the kind 
of errors Zope on Xen brings up.

To cause the wget error (which I think is a symptom of a very similar 
problem), it is easiest to run SETI@Home which will put enough load on the 
system. It might take a few attempts but I can always crash wget this way 
when SETI is running.

It is my impression that these problems occur during bursts of high timer 
interrupt activity, but I haven't made detailed studies.

Is there anything I can do to help sort this out? For example, would it be 
a good idea to test unstable to see if it exhibits this behavior? Any help 
is appreciated, and since I soon need to run a production Zope system on 
several Xen hosts, I would like to find a solution to the frequent 
crashes.

-Osma

-- 
*** Osma Suominen / MB Concert Ky *** osma.suominen@mbconcert.fi ***


end of thread, other threads:[~2005-06-10 20:01 UTC | newest]

Thread overview: 13+ messages
2005-06-08 21:19 RE: wget and Zope crashes on post-2.0.6 -testing Ian Pratt
2005-06-08 21:42 ` Kurt Garloff
2005-06-08 21:55   ` Keir Fraser
2005-06-10 19:52     ` Robbie Dinn
  -- strict thread matches above, loose matches on Subject: below --
2005-06-10 20:01 Ian Pratt
2005-06-08 17:58 Ian Pratt
2005-06-02 14:32 Ian Pratt
2005-06-02 15:07 ` Osma Suominen
2005-06-03  9:04 ` Osma Suominen
2005-06-08 17:44   ` Keir Fraser
2005-06-02 13:50 Ian Pratt
2005-06-02 14:21 ` Osma Suominen
2005-06-02 10:22 Osma Suominen
