wget and Zope crashes on post-2.0.6 -testing

* wget and Zope crashes on post-2.0.6 -testing
@ 2005-06-02 10:22 Osma Suominen
  0 siblings, 0 replies; 13+ messages in thread
From: Osma Suominen @ 2005-06-02 10:22 UTC (permalink / raw)
  To: xen-devel

Hello,

I reported time-related problems a few days ago, but got no replies:
http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628

I have problems with e.g. wget and Zope crashing on domU on a recent 
-testing build. This is on a Debian Sarge system, with kernel 2.6.11.11 
and a Xen -testing snapshot from two days ago (2005-05-31). The problems 
are not as easy to trigger as with earlier versions (e.g. the 2.0.5 demo 
CD), but they do happen.

The symptom is that during heavy load, wget crashes with the message 
"acalc_rate: Assertion `msecs >= 0' failed", which probably means that 
time has stepped backwards (looking at earlier xen-devel posts).

Also, Zope frequently dies with different time-related error messages. 
Here's the end of a typical traceback:

   File "/usr/lib/zope2.7/lib/python/DateTime/DateTime.py", line 694, in _parse_args
     lt = safelocaltime(t)
   File "/usr/lib/zope2.7/lib/python/DateTime/DateTime.py", line 437, in safelocaltime
     raise TimeError, 'The time %f is beyond the range ' \
TimeError: The time nan is beyond the range of this Python implementation.
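
Both messages have the shape of a sanity check rejecting a bad floating-point value. The sketch below is only an illustration of that shape, not the actual wget or Zope code: check_elapsed, safelocaltime_sketch, the local TimeError class and the 32-bit range limits are all hypothetical stand-ins. It assumes the elapsed time is held in a float, and shows that a negative value and a NaN both fail a `msecs >= 0` test, and that a NaN timestamp falls outside any range check because every comparison with NaN is false.

```python
class TimeError(Exception):
    """Stand-in for the exception named in the traceback (hypothetical)."""


# Hypothetical limits for a 32-bit time_t; the real check may differ.
LOW, HIGH = -2**31, 2**31


def check_elapsed(msecs):
    """Same shape as the failed assertion `msecs >= 0`."""
    return msecs >= 0


def safelocaltime_sketch(t):
    """Reject timestamps outside the representable range.

    NaN compares false against everything, so it never passes
    the range test and always lands in the error branch.
    """
    if not (LOW <= t < HIGH):
        raise TimeError(
            'The time %f is beyond the range of this Python implementation.' % t
        )
    return int(t)


if __name__ == "__main__":
    for value in (12.0, -3.0, float("nan")):
        print(value, check_elapsed(value))
    try:
        safelocaltime_sketch(float("nan"))
    except TimeError as exc:
        print(exc)
```

A clock stepping backwards would give the negative case; a corrupted floating-point value would give the NaN case. The sketch cannot tell which one is happening here.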

It is fairly easy to crash Zope this way by using a tool such as apache's 
benchmarking utility ab/ab2 or wget to pound on it. It usually takes a few 
minutes on an otherwise unloaded machine to bring down Zope. Note that 
Zope runs just fine on a similar native Linux system, and after running 
production Zope systems for more than a year, I have never seen the kind 
of errors Zope on Xen brings up.

To cause the wget error (which I think is a symptom of a very similar 
problem), it is easiest to run SETI@Home which will put enough load on the 
system. It might take a few attempts but I can always crash wget this way 
when SETI is running.

It is my impression that these problems occur during bursts of high timer 
interrupt activity, but I haven't made detailed studies.

Is there anything I can do to help sort this out? For example, would it be 
a good idea to test unstable to see if it exhibits this behavior? Any help 
is appreciated, and since I soon need to run a production Zope system on 
several Xen hosts, I would like to find a solution to the frequent 
crashes.

-Osma

-- 
*** Osma Suominen / MB Concert Ky *** osma.suominen@mbconcert.fi ***


* RE: wget and Zope crashes on post-2.0.6 -testing
@ 2005-06-02 13:50 Ian Pratt
  2005-06-02 14:21 ` Osma Suominen
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Pratt @ 2005-06-02 13:50 UTC (permalink / raw)
  To: Osma Suominen, xen-devel

> I reported about time-related problems some days ago, with no replies:
> http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628
> 
> I have problems with e.g. wget and Zope crashing on domU on a 
> recent -testing build. This is on a Debian Sarge system, with 
> kernel 2.6.11.11 and a Xen -testing snapshot from two days 
> ago (2005-05-31). The problems are not as easy to trigger as 
> with earlier versions (e.g. the 2.0.5 demo CD), but they do happen.
> 
> The symptom is that during heavy load, wget crashes with the message
> "acalc_rate: Assertion `msecs >= 0' failed", which probably 
> means that time has stepped backwards (looking at earlier 
> xen-devel posts).

I can't reproduce this wget crash, even running seti@home in the
background as you suggest.
I'm running this in dom0 on an SMP Xeon box.

Are you running NTP on your system? If so, what does "echo peers | ntpq"
show? What happens if you disable it?

Is there anything odd about the system? Is the CPU clock speed correctly
identified?

You could try the unstable tree -- it would certainly be interesting to
know if there was a difference. The issue must be really quite specific
to your machine or setup (e.g. the crystal is completely knackered) as
otherwise lots of people would be complaining.

Best,
Ian


* RE: wget and Zope crashes on post-2.0.6 -testing
  2005-06-02 13:50 Ian Pratt
@ 2005-06-02 14:21 ` Osma Suominen
  0 siblings, 0 replies; 13+ messages in thread
From: Osma Suominen @ 2005-06-02 14:21 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel

On Thu, 2 Jun 2005, Ian Pratt wrote:

> I can't reproduce this wget crash, even running seti@home in the
> background as you suggest.
> I'm running this in dom0 on an SMP Xeon box.
>
> Are you running NTP on your system? If so, what does "echo peers | ntpq"
> show? What happens if you disable it?
>
> Is there anything odd about the system? Is the CPU clock speed correctly
> identified?

Unfortunately the machine is not completely in my control. It is owned by 
another company and I only have access to dom1 and dom2, not dom0 (no 
other domains on the machine). But I will try to reproduce this on another 
machine. It might be a lot easier if there was a 2.0.6 demo CD, though...

It was easy to cause this crash with the 2.0.5 demo CD. I succeeded on all 
3 machines (most of them old) I tried it on. The recipe was in a previous 
post to the list. But maybe Xen has changed so much that it's not relevant 
anymore.

Anyway, the machine I'm now observing is a Pentium IV server with 
HyperThreading. ntpd is running in dom0. As far as I can tell the clock 
speed (3.0GHz) is correctly reported. There are two identical machines, 
and the problem occurs on both (and in both dom1 and dom2), so broken 
hardware is likely not to blame.

AFAICT there is nothing odd with these machines; in fact the company 
owning them seems to make a good business out of renting out Xen domains 
to customers like me. However, since others aren't complaining loudly, the 
problem could be something related to the specific workload I'm putting on 
the machines.

> You could try the unstable tree -- it would certainly be interesting to
> know if there was a difference. The issue must be really quite specific
> to your machine or setup (e.g. the crystal is completely knackered) as
> otherwise lots of people would be complaining.

As I said the specific machine is not entirely in my control but I have 
the feeling I might be able to reproduce this on a spare machine, since it 
was so easy with 2.0.5. In that case I will try unstable as well.

Anyway, thanks for your input. I will look at whether NTP is involved and 
do some further investigation.

-Osma

-- 
*** Osma Suominen / MB Concert Ky *** osma.suominen@mbconcert.fi ***


* RE: wget and Zope crashes on post-2.0.6 -testing
@ 2005-06-02 14:32 Ian Pratt
  2005-06-02 15:07 ` Osma Suominen
  2005-06-03  9:04 ` Osma Suominen
  0 siblings, 2 replies; 13+ messages in thread
From: Ian Pratt @ 2005-06-02 14:32 UTC (permalink / raw)
  To: Osma Suominen; +Cc: xen-devel

> Unfortunately the machine is not completely in my control. It 
> is owned by another company and I only have access to dom1 
> and dom2, not dom0 (no other domains on the machine). But I 
> will try to reproduce this on another machine. It might be a 
> lot easier if there was a 2.0.6 demo CD, though...

Funny you should say that....

Please don't all download it at once, but there's a preview available at:
http://www.cl.cam.ac.uk/Research/SRG/netos/xen/downloads/xendemo-2.0.6.iso
 
> Anyway, thanks for your input. I will look at whether NTP is 
> involved and do some further investigation.

If you're running NTP in your local domain you should enable
independent_wallclock 

e.g. echo 1 > /proc/sys/xen/independent_wallclock or put
independent_wallclock=1 on your kernel command line. [NB: someone should
document the kernel config option]

I'll wager that this is your problem. Hmm, that's a pretty nasty failure
mode. Without doing something gross and intercepting the adjtimex
syscall there's not a lot we can do about it.

Ian 


* RE: wget and Zope crashes on post-2.0.6 -testing
  2005-06-02 14:32 Ian Pratt
@ 2005-06-02 15:07 ` Osma Suominen
  2005-06-03  9:04 ` Osma Suominen
  1 sibling, 0 replies; 13+ messages in thread
From: Osma Suominen @ 2005-06-02 15:07 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel

On Thu, 2 Jun 2005, Ian Pratt wrote:

> Funny you should say that....
>
> Please don't all download it at once, but there's a preview avilable at:
> http://www.cl.cam.ac.uk/Research/SRG/netos/xen/downloads/xendemo-2.0.6.iso

Wow! Thanks... I'll look into that.

> If you're running NTP in your local domain you should enable
> independent_wallclock
>
> e.g. echo 1 > /proc/sys/xen/independent_wallclock or put
> independent_wallclock=1 on your kernel command line. [NB: someone should
> document the kernel config option]
>
> I'll wager that this is your problem. Hmm, that's a pretty nasty failure
> mode. Without doing something gross and intercepting the adjtimex
> syscall there's not a lot we can do about it.

I'm not running NTP in domU, but it should be running in dom0, although it 
seems it's not working since the clock is out of sync.

I turned on independent_wallclock and was just about to report that it 
fixed the problem, and then it happened again. Twice. That is, SETI broke 
wget, even with independent_wallclock=1.

Also, with "apt-get install ntpdate ntp-simple" and SETI running I get 
this interesting Perl error (which I've seen before, during high load):

--clip--
debconf: Perl may be unconfigured (Global symbol "%priorities" requires explicit package name at /usr/share/perl5/Debconf/Priority.pm line 16.
Compilation failed in require at /usr/share/perl5/Debconf/Config.pm line 7.
BEGIN failed--compilation aborted at /usr/share/perl5/Debconf/Config.pm line 7.
Compilation failed in require at /usr/share/perl5/Debconf/Log.pm line 8.
Compilation failed in require at (eval 1) line 4.
BEGIN failed--compilation aborted at (eval 1) line 4.
) -- aborting
--clip--

And wget occasionally dies with "malloc: not enough memory", even though 
the machine has 1 GB of free RAM (1.5 GB total) plus 3 GB of swap. This 
is getting really weird...

I installed ntp on the domU in question and the problem remains, with 
independent_wallclock=1, ntp running and the clock in sync with the 
world.

-Osma

-- 
*** Osma Suominen / MB Concert Ky *** osma.suominen@mbconcert.fi ***


* RE: wget and Zope crashes on post-2.0.6 -testing
  2005-06-02 14:32 Ian Pratt
  2005-06-02 15:07 ` Osma Suominen
@ 2005-06-03  9:04 ` Osma Suominen
  2005-06-08 17:44   ` Keir Fraser
  1 sibling, 1 reply; 13+ messages in thread
From: Osma Suominen @ 2005-06-03  9:04 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel

On Thu, 2 Jun 2005, Ian Pratt wrote:

>> and dom2, not dom0 (no other domains on the machine). But I
>> will try to reproduce this on another machine. It might be a
>> lot easier if there was a 2.0.6 demo CD, though...
>
> Funny you should say that....
>
> Please don't all download it at once, but there's a preview avilable at:

I tried the demo CD and was able to reproduce this wget crash with it on 
an old Pentium III desktop PC. The recipe is basically the same as before, 
but I'll repeat it here:

1. boot the 2.0.6 demo CD in text mode
2. ifup eth0
3. wget ftp://alien.ssl.berkeley.edu/pub/setiathome-3.08.i686-pc-linux-gnu.tar
4. untar, run, and background setiathome (with ^Z and bg)
5. run wget a few times (took some half a dozen attempts for me)

When you've had wget crash, you can try some of the other tests in
http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628

Since this happens on a random PC with the demo CD, I'll bet that this is 
not some obscure problem with the specific hardware or software 
installation but a real bug in Xen.

-Osma

-- 
*** Osma Suominen / MB Concert Ky *** osma.suominen@mbconcert.fi ***


* Re: wget and Zope crashes on post-2.0.6 -testing
  2005-06-03  9:04 ` Osma Suominen
@ 2005-06-08 17:44   ` Keir Fraser
  0 siblings, 0 replies; 13+ messages in thread
From: Keir Fraser @ 2005-06-08 17:44 UTC (permalink / raw)
  To: Osma Suominen; +Cc: Ian Pratt, xen-devel


On 3 Jun 2005, at 10:04, Osma Suominen wrote:

> When you've had wget crash, you can try some of the other tests in
> http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628
>
> Since this happens on a random PC with the demo CD, I'll bet that this 
> is not some obscure problem with the specific hardware or software 
> installation but a real bug in Xen.

This bug should now be fixed in our xen-2.0.testing.bk repository.

  -- Keir


* RE: wget and Zope crashes on post-2.0.6 -testing
@ 2005-06-08 17:58 Ian Pratt
  0 siblings, 0 replies; 13+ messages in thread
From: Ian Pratt @ 2005-06-08 17:58 UTC (permalink / raw)
  To: Keir Fraser, Osma Suominen
  Cc: xen-devel, Kip Macy, Kurt Garloff, Rich Persaud, Gerd Knorr

> On 3 Jun 2005, at 10:04, Osma Suominen wrote:
> 
> > When you've had wget crash, you can try some of the other tests in
> > http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628
> >
> > Since this happens on a random PC with the demo CD, I'll bet that this 
> > is not some obscure problem with the specific hardware or software 
> > installation but a real bug in Xen.
> 
> This bug should now be fixed in our xen-2.0.testing.bk repository.

This deserves a bit more explanation, as it probably affects all vendor
kernels based on Xen 2.0 (SuSE 9.3 Pro, Debian, demo CD, Gentoo, etc.).
It does *not* affect the kernel we ship in our 2.0 source and binary
tarballs, which is why it has taken so long to pin down. It does *not*
affect the unstable branch.

The reason the bug is not present in our kernels is the kernel config:
we set CONFIG_MD_RAID5=y, which hides the bug, whereas most distros
build it as a module.

The root cause of the bug is that during the boot sequence Linux tests
whether the processor has the FDIV bug. This involves doing some
floating point operations. Unfortunately, they are not wrapped in the
kernel_fpu_begin()/kernel_fpu_end() calls that normally surround use of
the FPU in the kernel. Native Linux gets away with this because it
happens so early in the boot process that no-one else can be using the
FPU. However, on Xen this gets us into a bad state, which comes back to
haunt us much later on, resulting in FPU state corruption in user
processes. The fix in 2.0-testing is simply to 'wrap' the FDIV test.
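
To illustrate what 'wrapping' buys, here is a toy user-space model in Python. It is a sketch under stated assumptions, not kernel code: ToyCpu, kernel_fpu_section, unwrapped_boot_test and wrapped_boot_test are made-up names, the TS bit is reduced to a single boolean, and the model simply assumes that the unwrapped probe ends up running with TS clear and that nothing sets it again. The probe itself uses the well-known FDIV operands; in Python it is only there to make the example concrete.

```python
from contextlib import contextmanager


class ToyCpu:
    """Toy model: one flag standing in for the TS bit in CR0."""

    def __init__(self):
        # Assumed invariant: TS is set whenever no task owns the FPU.
        self.ts = True

    def clts(self):
        self.ts = False

    def stts(self):
        self.ts = True


@contextmanager
def kernel_fpu_section(cpu):
    """Simplified model of the kernel_fpu_begin()/kernel_fpu_end() pairing."""
    cpu.clts()       # allow FPU use without faulting
    try:
        yield
    finally:
        cpu.stts()   # restore the invariant before anything else runs


def fdiv_probe():
    """Classic FDIV check: the rounded remainder is 0 on a correct FPU."""
    x, y = 4195835.0, 3145727.0
    return round(x - (x / y) * y) != 0


def unwrapped_boot_test(cpu):
    # Assumption of this model: the probe runs with TS clear...
    cpu.clts()
    # ...and nothing sets TS again afterwards.
    return fdiv_probe()


def wrapped_boot_test(cpu):
    with kernel_fpu_section(cpu):
        return fdiv_probe()


if __name__ == "__main__":
    for test in (unwrapped_boot_test, wrapped_boot_test):
        cpu = ToyCpu()
        print(test.__name__, "fdiv_bug:", test(cpu), "TS set afterwards:", cpu.ts)
```

In this model both variants report no FDIV bug, but only the wrapped one leaves the TS flag set when it returns.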

The reason the bug is not present on unstable is that the fpu code had
already been rejigged so that we were immune to this kind of problem as
it had been identified as a potential fragility.

Since this bug hadn't been widely reported we probably won't rush to
release a 2.0.6a demo CD, but vendor kernel maintainers should
definitely pick up the fix.

Best,
Ian 


* RE: RE: wget and Zope crashes on post-2.0.6 -testing
@ 2005-06-08 21:19 Ian Pratt
  2005-06-08 21:42 ` Kurt Garloff
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Pratt @ 2005-06-08 21:19 UTC (permalink / raw)
  To: Kurt Garloff; +Cc: xen-devel, Kip Macy, Rich Persaud, Osma Suominen, Gerd Knorr

 
> I observed that the first userspace process that uses the FPU 
> will SIGFPE once. Afterwards everything runs just fine ...
> 
> You description looks like it matches exactly the 
> misbehaviour I've been seeing.

Got any more critical bugs you're not telling us about? :-)

> Is attached patch the right way to fix this?

I think that should work (with the obvious kernel_ prefix), but I've
appended what we've gone for.

Best,
Ian 

--- linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/bugs.h   2005-06-08 22:08:52.000000000 +0100
+++ linux-2.6.11-xen0/include/asm-i386/bugs.h   2005-03-02 07:37:49.000000000 +0000
@@ -107,6 +107,7 @@
                "fninit"
                : "=m" (*&boot_cpu_data.fdiv_bug)
                : "m" (*&x), "m" (*&y));
+       stts();
        if (boot_cpu_data.fdiv_bug)
                printk("Hmm, FPU with FDIV bug.\n");
 }


* Re: RE: RE: wget and Zope crashes on post-2.0.6 -testing
  2005-06-08 21:19 RE: wget and Zope crashes on post-2.0.6 -testing Ian Pratt
@ 2005-06-08 21:42 ` Kurt Garloff
  2005-06-08 21:55   ` Keir Fraser
  0 siblings, 1 reply; 13+ messages in thread
From: Kurt Garloff @ 2005-06-08 21:42 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel, Kip Macy, Rich Persaud, Osma Suominen, Gerd Knorr


Hi Ian,

On Wed, Jun 08, 2005 at 10:19:35PM +0100, Ian Pratt wrote:
>  
> > I observed that the first userspace process that uses the FPU 
> > will SIGFPE once. Afterwards everything runs just fine ...
> > 
> > You description looks like it matches exactly the 
> > misbehaviour I've been seeing.
> 
> Got any more critical bugs you're not telling us about? :-)

I wanted to hunt that one down myself ... obviously overestimating
the amount of time and expertise I can devote to it :(

OK, you want another one:
Well, xenified SLES9 oopses on ballooning :-)
But I hope that Christian, Kip, /me will track this one
down soon.

> > Is attached patch the right way to fix this?
> 
> I think that should work (with the obvious kernel_ prefix), but I've
> appeneded what we've gone for.

Not having a CPU manual close to my desk: What do we achieve by setting
bit 3 (TS) in CR0? Why does it help to get the FPU back to a sane state?

Regards,
-- 
Kurt Garloff, Director SUSE Labs, Novell Inc.


* Re: wget and Zope crashes on post-2.0.6 -testing
  2005-06-08 21:42 ` Kurt Garloff
@ 2005-06-08 21:55   ` Keir Fraser
  2005-06-10 19:52     ` Robbie Dinn
  0 siblings, 1 reply; 13+ messages in thread
From: Keir Fraser @ 2005-06-08 21:55 UTC (permalink / raw)
  To: Kurt Garloff
  Cc: Ian Pratt, xen-devel, Kip Macy, Rich Persaud, Osma Suominen,
	Gerd Knorr

On 8 Jun 2005, at 22:42, Kurt Garloff wrote:

>> I think that should work (with the obvious kernel_ prefix), but I've
>> appeneded what we've gone for.
>
> Not having a CPU manual close to my desk: What do we achieve by setting
> bit 3 (TS) in CR0? Why does it help to get the FPU back to a sane 
> state?

When set, it causes a fault whenever the FPU is accessed. We use it to 
lazily initialise the FPU for the currently running process. At 
context-switch time we look at the process we are descheduling and, if 
it hasn't used the FPU in its time slice, we don't save FPU state and 
we don't set the TS bit (because we assume it must be already set).

The last point is where we can fall down: if the TS bit in fact *isn't* 
set, then we are screwed for all time. The kernel will never realise a 
process is using the FPU because we will never take the TS fault, 
because the TS bit is clear. Thus state doesn't get saved/restored 
during context switch and the TS bit never gets set. So it's a 
self-perpetuating state once you're in it.
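
The logic above can be played through in a small user-space simulation. This is only a sketch of the description, not the real context-switch code: LazyFpuModel, run and the single-float "FPU register" are invented for illustration, and the fault handler is reduced to clearing TS, reloading the process's saved value and marking the process as an FPU user.

```python
class LazyFpuModel:
    """Toy model of lazy FPU switching driven by a TS flag."""

    def __init__(self, ts_set_at_boot):
        self.ts = ts_set_at_boot
        self.fpu_reg = 0.0      # the one physical FPU register
        self.current = None     # process currently running
        self.saved = {}         # per-process saved FPU state
        self.used_fpu = set()   # processes that faulted in this time slice
        self.faults = 0

    def _ts_fault(self):
        # TS fault: hand the FPU to the current process.
        self.faults += 1
        self.ts = False
        self.fpu_reg = self.saved.get(self.current, 0.0)
        self.used_fpu.add(self.current)

    def fpu_store(self, value):
        if self.ts:
            self._ts_fault()
        self.fpu_reg = value

    def fpu_load(self):
        if self.ts:
            self._ts_fault()
        return self.fpu_reg

    def switch_to(self, pid):
        prev = self.current
        if prev in self.used_fpu:
            # Outgoing process used the FPU: save its state, set TS.
            self.saved[prev] = self.fpu_reg
            self.used_fpu.discard(prev)
            self.ts = True
        # Otherwise: no save, and TS is assumed to be set already.
        self.current = pid


def run(ts_set_at_boot):
    """Two processes each store a value, then read it back later."""
    m = LazyFpuModel(ts_set_at_boot)
    m.switch_to("A")
    m.fpu_store(1.5)
    m.switch_to("B")
    m.fpu_store(2.5)
    m.switch_to("A")
    a_sees = m.fpu_load()
    m.switch_to("B")
    b_sees = m.fpu_load()
    return a_sees, b_sees, m.faults


if __name__ == "__main__":
    print("TS set at boot:  ", run(True))
    print("TS clear at boot:", run(False))
```

With TS set at the start, each process reads back its own value and every first FPU use after a switch takes a fault. With TS clear at the start, no fault is ever taken, nothing is saved or restored, and process A reads the value that process B stored.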

  -- Keir


* Re: wget and Zope crashes on post-2.0.6 -testing
  2005-06-08 21:55   ` Keir Fraser
@ 2005-06-10 19:52     ` Robbie Dinn
  0 siblings, 0 replies; 13+ messages in thread
From: Robbie Dinn @ 2005-06-10 19:52 UTC (permalink / raw)
  To: xen-devel

Hi all

Keir Fraser wrote:
> 
> When set it causes a fault whenever the FPU is accessed. We use it to 
> lazily initialise the FPU for the currently running process. At 
> context-switch time we look at the process we are descheduling and, if 
> it hasn't used the FPU in its time slice, we don't save FPU state and we 
> don't set the TS bit (because we assume it must be already set).
> 
> The last point is where we can fall down: if the TS bit in fact *isn't* 
> set, then we are screwed for all time. The kernel will never realise a 
> process is using the FPU because we will never take the TS fault, 
> because the TS bit is clear. Thus state doesn't get saved/restored 
> during context switch and the TS bit never gets set. So its a self 
> perpetuating state once you're in it.

I have an end-user question rather than a developer's question.

Say I have a Xen machine with several domains, some with kernels
that have the FPU bug fix and some without. Can a domain with
the buggy kernel upset a domain with a bug free kernel?
Or does this just affect processes within one domain?

I might want to be a bit more hasty in upgrading all the kernels
if a buggy kernel/domain can upset a good kernel/domain.


* RE: wget and Zope crashes on post-2.0.6 -testing
@ 2005-06-10 20:01 Ian Pratt
  0 siblings, 0 replies; 13+ messages in thread
From: Ian Pratt @ 2005-06-10 20:01 UTC (permalink / raw)
  To: Robbie Dinn, xen-devel

> > The last point is where we can fall down: if the TS bit in fact 
> > *isn't* set, then we are screwed for all time. The kernel will never 
> > realise a process is using the FPU because we will never take the TS 
> > fault, because the TS bit is clear. Thus state doesn't get 
> > saved/restored during context switch and the TS bit never gets set. So 
> > its a self perpetuating state once you're in it.
> 
> Say I have an xen machine with several domains, some with 
> kernels that have the FPU bug fix and some without. Can a 
> domain with the buggy kernel upset a domain with a bug free kernel?
> Or does this just affect processes within one domain?

It just affects the one domain. 

Best,
Ian

> I might want to be a bit more hasty in upgrading all the 
> kernels if a buggy kernel/domain can upset a good kernel/domain.
