segfault in VM


* segfault in VM
@ 2004-07-19  5:22 Derek Glidden
  2004-07-19  5:50 ` James Harper
  0 siblings, 1 reply; 64+ messages in thread
From: Derek Glidden @ 2004-07-19  5:22 UTC (permalink / raw)
  To: xen-devel

Maybe related or maybe not, but it was the same VM getting all the 
scheduling time in my previous post.  (SMP Celeron box with 512M of 
RAM, no himem enabled.)

At the time, four VMs were all compiling, with dom0 copying a Linux
source tree from one place to another with rsync.  Everything was
copacetic until I started the big rsync in dom0; within a minute or so,
vm2 bombed.  There were no messages on the dom0 console or in the VM
other than the "Segmentation Fault" in the VM during compilation.

However, the Xen console (Xen compiled with debug=y) spits out:

(XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 4GB segment.

at the time of the segmentation fault.

(and there are lots of these, pretty much any time there is heavy I/O
on the machine, all with the same values:)

(XEN) (file=traps.c, line=466) GPF (0004): fc5277a8 -> fc52a294

Any further activity inside vm2 results in more segmentation faults and 
more "Bailing" messages.  The other VMs and dom0 seem to be ok.
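
One way to check that the GPF lines really do carry the same values every time is to tally a saved copy of the Xen console output. This is a minimal sketch, not something Xen provides: the function name tally_xen_messages and the idea of a saved log file are assumptions, and it expects each message on a single line (the "Bailing" message above is only wrapped by the mail client).

```shell
#!/bin/sh
# Tally identical "(XEN)" diagnostic lines in a saved console capture.
# Assumption: the capture is plain text with one message per line.
tally_xen_messages() {
    log_file=$1
    # Keep only the GPF and "Bailing" lines, then count identical lines,
    # most frequent first.
    grep -E 'GPF \([0-9a-f]+\)|Bailing:' "$log_file" | sort | uniq -c | sort -rn
}
```

Calling it with the path of the capture prints one line per distinct message, prefixed by how often it occurred, so a single GPF line in the output would confirm that the addresses never vary.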

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        http://www.eff.org/
in blood. But if you live your     |  http://www.anti-dmca.org/
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould

-------------------------------------------------------
This SF.Net email is sponsored by BEA Weblogic Workshop
FREE Java Enterprise J2EE developer tools!
Get your free copy of BEA WebLogic Workshop 8.1 today.
http://ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-19  5:22 segfault in VM Derek Glidden
@ 2004-07-19  5:50 ` James Harper
  2004-07-19  7:27   ` Keir Fraser
  2004-07-19 18:52   ` Derek Glidden
  0 siblings, 2 replies; 64+ messages in thread
From: James Harper @ 2004-07-19  5:50 UTC (permalink / raw)
  To: Derek Glidden, xen-devel

That sounds like the same sort of errors I'm getting, which appeared to be filesystem corruption. First the corruption starts, then everything you do causes a segfault, although I've only seen funny things happen in dom0.

In the limited testing I've done, it looks like dom0 by itself is stable, but crashes start occurring once I start up other domains and work dom0 hard (other domains running under light load). I'm running this script in dom0:

#!/bin/sh
while [ 1 = 1 ]
do
 diff file3 file4 && echo okay
done

where file3 and file4 are files of around 300 MB each, and the VM has 128 MB of memory with no swap. This ensures the files never stay cached, so there's lots of I/O.
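
A variant of the loop that stops at the first mismatch and reports how many passes succeeded can make a failure easier to spot in an unattended run. This is only a sketch: the function name compare_until_mismatch and the optional pass limit are assumptions added for illustration, and cmp -s stands in for diff since only the exit status matters.

```shell
#!/bin/sh
# Repeatedly compare two files; stop at the first mismatch.
# Usage: compare_until_mismatch FILE_A FILE_B [MAX_PASSES]
# MAX_PASSES of 0 (the default) means loop until a mismatch occurs.
compare_until_mismatch() {
    file_a=$1
    file_b=$2
    max_passes=${3:-0}
    passes=0
    while [ "$max_passes" -eq 0 ] || [ "$passes" -lt "$max_passes" ]; do
        # cmp -s is silent and returns non-zero if the files differ.
        if ! cmp -s "$file_a" "$file_b"; then
            echo "mismatch after $passes good passes"
            return 1
        fi
        passes=$((passes + 1))
    done
    echo "okay: $passes passes"
    return 0
}
```

With the files from the script above this would be called as compare_until_mismatch file3 file4, which runs until the two copies stop comparing equal.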

Where I've seen it crash most readily is when I'm running a few other domains and then start running dom0 out of memory, but nothing conclusive yet.

I'll let this test keep running for another hour or so (otherwise idle, no other domains running), then start my running-out-of-memory program.

I wonder if it is a coincidence that we both have SMP boxes... each of the domains only sees 1 CPU, so I wouldn't have thought that would be a problem unless there's a race in Xen itself.

James


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-19  5:50 ` James Harper
@ 2004-07-19  7:27   ` Keir Fraser
  2004-07-19  8:28     ` Chris Andrews
  2004-07-19 18:56     ` Derek Glidden
  2004-07-19 18:52   ` Derek Glidden
  1 sibling, 2 replies; 64+ messages in thread
From: Keir Fraser @ 2004-07-19  7:27 UTC (permalink / raw)
  To: James Harper; +Cc: Derek Glidden, xen-devel


Clearly there's some fairly random memory corruption going on, which
then causes segfaults (if the corruption hits code pages) and
filesystem corruption (if the corruption hits buffer-cache pages).

The "Bailing: not a -ve offset" and "GPF (0004):" messages are almost
certainly just symptoms of executing a corrupted block of code, i.e.,
the bug had already triggered some time earlier and probably corrupted
a page of glibc or the kernel.
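
If the corruption is landing in cached pages, one way to catch it before it shows up as a segfault is to record checksums of a few files while the system is known-good and re-check them under load. The following is a minimal sketch, assuming the POSIX cksum and diff utilities are available in the domain; the function names and the baseline file are hypothetical.

```shell
#!/bin/sh
# Record a checksum baseline for a set of files.
# Usage: record_baseline BASELINE_FILE FILE...
record_baseline() {
    baseline=$1
    shift
    cksum "$@" > "$baseline"
}

# Re-checksum the same files and compare against the baseline.
# Prints the differing lines and returns non-zero if anything changed.
# The files must be given in the same order as when recording.
verify_baseline() {
    baseline=$1
    shift
    cksum "$@" | diff "$baseline" -
}
```

A mismatch here only says that what was read back differs from the baseline; it cannot by itself tell a corrupted cache page from a corrupted block on disk.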

It would be interesting to see whether or not this is SMP-related.
It's also interesting that someone said they couldn't reproduce
corruption when using 2.6.7 for the non-privileged guest OSes.

 -- Keir

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-19  7:27   ` Keir Fraser
@ 2004-07-19  8:28     ` Chris Andrews
  2004-07-19  8:57       ` Keir Fraser
  2004-07-19 18:58       ` Derek Glidden
  2004-07-19 18:56     ` Derek Glidden
  1 sibling, 2 replies; 64+ messages in thread
From: Chris Andrews @ 2004-07-19  8:28 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

Keir Fraser wrote:
> Clearly there's some fairly random memory corruption going on, which
> then causes segfaults (if the corruption hits code pages) and
> filesystem corruption (if the corruption hits buffer-cache pages).
 >
> The "Bailing: not a -ve offset" and "GPF (0004):" messages are almost
> certainly just symptoms of executing a corrupted block of code. i.e.,
> the bug has already triggered some time ago - probably corrupted a
> page of glibc or the kernel.
> 
> It would be interesting to see whether or not this is SMP-related.
> It's also interesting that someone said they couldn't reproduce
> corruption when using 2.6.7 for the non-privileged guest OSes.

I'm seeing this corruption on a single CPU machine, with a single 2.4 
guest running but idle. I only ran one 2.6.7 guest, and I didn't give it 
any work, but it didn't take any load in the 2.4 guest to provoke problems.

The machine uses devicemapper, so I'm going to move some partitions 
around and see if I still get corruption without it. I can also build 
Xen with debug=y and try that, once I've got the disk sorted.


Chris.



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-19  8:28     ` Chris Andrews
@ 2004-07-19  8:57       ` Keir Fraser
  2004-07-19  9:01         ` Chris Andrews
  2004-07-19 18:58       ` Derek Glidden
  1 sibling, 1 reply; 64+ messages in thread
From: Keir Fraser @ 2004-07-19  8:57 UTC (permalink / raw)
  To: Chris Andrews; +Cc: Keir Fraser, xen-devel

> Keir Fraser wrote:
> > Clearly there's some fairly random memory corruption going on, which
> > then causes segfaults (if the corruption hits code pages) and
> > filesystem corruption (if the corruption hits buffer-cache pages).
>  >
> > The "Bailing: not a -ve offset" and "GPF (0004):" messages are almost
> > certainly just symptoms of executing a corrupted block of code. i.e.,
> > the bug has already triggered some time ago - probably corrupted a
> > page of glibc or the kernel.
> > 
> > It would be interesting to see whether or not this is SMP-related.
> > It's also interesting that someone said they couldn't reproduce
> > corruption when using 2.6.7 for the non-privileged guest OSes.
> 
> I'm seeing this corruption on a single CPU machine, with a single 2.4 
> guest running but idle. I only ran one 2.6.7 guest, and I didn't give it 
> any work, but it didn't take any load in the 2.4 guest to provoke problems.

Do you mean a single 2.4 or 2.6 guest in addition to your 2.4 DOM0?

 -- Keir



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-19  8:57       ` Keir Fraser
@ 2004-07-19  9:01         ` Chris Andrews
  2004-07-19 12:48           ` Wm
  0 siblings, 1 reply; 64+ messages in thread
From: Chris Andrews @ 2004-07-19  9:01 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

Keir Fraser wrote:
> Do you mean a single 2.4 or 2.6 guest in addition to your 2.4 DOM0?

Yes, that's right. With just the 2.4 domain0 on its own, everything 
seems fine.


Chris.



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-19  9:01         ` Chris Andrews
@ 2004-07-19 12:48           ` Wm
  2004-07-19 13:22             ` Keir Fraser
  0 siblings, 1 reply; 64+ messages in thread
From: Wm @ 2004-07-19 12:48 UTC (permalink / raw)
  To: xen-devel

On Mon, Jul 19, 2004 at 10:01:54AM +0100, Chris Andrews wrote:
> Keir Fraser wrote:
> >>Keir Fraser wrote:
> >Do you mean a single 2.4 or 2.6 guest in addition to your 2.4 DOM0?
> 
> Yes, that's right. With just the 2.4 domain0 on its own, everything 
> seems fine.

OK, using an image of dom1 from Chris, I have been able to semi-reliably
cause all sorts of corruption, including Oopses in dom0 and other domains.

This is on 2 different machines, one of which is thought to be at least
semi-reliable.

I first noticed it when doing a bk pull: BitKeeper decided that my tree
was rather corrupt (in dom0, but with other domains running).

Running while (:) do  tar cpf - . | gzip -3vc | cat >/dev/null; done

http://www.yuri.org.uk/~murble/boom/ has some output, with domain0
repeatedly trying to access beyond the end of the device, but this
could be caused by random memory corruption.

Trying to build something in dom0 seems a fairly reliable way
of triggering it.

Also, my dom0 only had 48 MB or so of RAM, but plenty of swap.

Before the crash I noticed user programs that allocated lots of memory
in dom0 randomly segfaulting, including apt-get update and apt-get
build-dep.

Again, the oopsen I get were generally rather random, although I
have noticed another possible XenoLinux bug: when you boot with
panic=30, it takes ages for dom0 to reboot, far longer than 30 seconds,
even though after the panic it says it is rebooting in 30 seconds.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-19 12:48           ` Wm
@ 2004-07-19 13:22             ` Keir Fraser
  2004-07-19 19:06               ` Derek Glidden
  0 siblings, 1 reply; 64+ messages in thread
From: Keir Fraser @ 2004-07-19 13:22 UTC (permalink / raw)
  To: Wm; +Cc: xen-devel


It strikes me that many people have started seeing this bug all of a
sudden, so it has probably been introduced in the last week. Perhaps
it is worth someone backing off to an older repository version and
seeing whether they can reproduce the problems?

If we can 'binary chop' the changesets to isolate the bad one, it
would be a much easier bug to fix. ;-) Sounds like it would be a
fairly tedious process though...

The first person to complain I think was Jody Belka, who was using 
the changeset with comment 'Fairly major fixes to the network frontend
driver...' (2004-07-13 18:24:48). Perhaps backing off to a day before
that would be a sensible place to start? 
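
The "binary chop" can be sketched as a search over an ordered list of changesets for the first one at which the corruption reproduces. This is an illustration only: first_bad_changeset and the test command passed to it are hypothetical names, and in practice each test step would mean checking out, building, booting and stressing Xen by hand. It assumes the list is ordered oldest first, that the last changeset is bad, and that once the bug appears it stays.

```shell
#!/bin/sh
# Find the first "bad" changeset in an ordered list (oldest first).
# Usage: first_bad_changeset TEST_CMD CHANGESET...
# TEST_CMD is called with one changeset; exit status 0 means "bad".
# Assumes every changeset after the first bad one is also bad.
first_bad_changeset() {
    test_cmd=$1
    shift
    lo=1            # lowest index that could be the first bad one
    hi=$#           # highest index; assumed bad
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "candidate=\${$mid}"
        if "$test_cmd" "$candidate"; then
            hi=$mid              # bad: first bad one is at or before mid
        else
            lo=$((mid + 1))      # good: first bad one is after mid
        fi
    done
    eval "echo \"\${$lo}\""
}
```

Each pass halves the remaining range, so a week of changesets needs only a handful of build-and-test cycles rather than one per changeset.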

 -- Keir

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-19  5:50 ` James Harper
  2004-07-19  7:27   ` Keir Fraser
@ 2004-07-19 18:52   ` Derek Glidden
  1 sibling, 0 replies; 64+ messages in thread
From: Derek Glidden @ 2004-07-19 18:52 UTC (permalink / raw)
  To: xen-devel


On Jul 19, 2004, at 1:50 AM, James Harper wrote:

> where file3 and file4 are around 300mb files, and the vm has 128mb of 
> memory with no swap. This ensures that none of the file is cached so 
> there's lots of I/O.
>  
> When i've seen it crash most readily has been when i'm running a few 
> other domains and then start running dom0 out of memory, but nothing 
> conclusive yet.
>  
> I'll let this test keep running for another hour (otherwise idle, no 
> other domains running) or so then start my running-out-of-memory 
> program.

Similarly, I can reproduce it reasonably reliably if I wait until all 
the VMs are busy with either I/O or high CPU utilization and then 
start dom0 doing lots of I/O, through an rsync or something along 
those lines.  If I let the system run for a little while to "prime" it, 
so far I think I can pretty much crash it whenever I want.

>  
> I wonder if it is coincidence that we both have smp boxes... each of 
> the domains only sees 1 cpu so I wouldn't have thought that would be a 
> problem unless there's a race in xen itself.

I have another, single-CPU box that I can play with; I'll try to 
build and deploy Xen on it tonight and see if it makes any 
difference.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        http://www.eff.org/
in blood. But if you live your     |  http://www.anti-dmca.org/
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-19  7:27   ` Keir Fraser
  2004-07-19  8:28     ` Chris Andrews
@ 2004-07-19 18:56     ` Derek Glidden
  2004-07-19 23:06       ` Derek Glidden
  1 sibling, 1 reply; 64+ messages in thread
From: Derek Glidden @ 2004-07-19 18:56 UTC (permalink / raw)
  To: xen-devel


On Jul 19, 2004, at 3:27 AM, Keir Fraser wrote:

>
> Clearly there's some fairly random memory corruption going on, which
> then causes segfaults (if the corruption hits code pages) and
> filesystem corruption (if the corruption hits buffer-cache pages).
>
> The "Bailing: not a -ve offset" and "GPF (0004):" messages are almost
> certainly just symptoms of executing a corrupted block of code. i.e.,
> the bug has already triggered some time ago - probably corrupted a
> page of glibc or the kernel.
>
> It would be interesting to see whether or not this is SMP-related.
> It's also interesting that someone said they couldn't reproduce
> corruption when using 2.6.7 for the non-privileged guest OSes.

I'll be building Xen on a non-SMP box I also have at home tonight, with 
any luck.

I'll also be running memtest on the SMP box that's been seeing the 
corruption when I get home as well, probably followed by CTCS.  It was 
stable for a week or so under reasonably heavy load before I installed 
Xen on it, but you never know...

If it passes all the testing, I'll build a 2.6.7 guest kernel and give 
that a try.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by   |
"nickels a day can feed a child."   |       http://www.eff.org/
I thought, "How can food be so      | http://www.anti-dmca.org/
cheap over there?"  It's not, they  |--------------------------
just eat the nickels." -- Peter Nguyen




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-19  8:28     ` Chris Andrews
  2004-07-19  8:57       ` Keir Fraser
@ 2004-07-19 18:58       ` Derek Glidden
  2004-07-19 19:34         ` Chris Andrews
  1 sibling, 1 reply; 64+ messages in thread
From: Derek Glidden @ 2004-07-19 18:58 UTC (permalink / raw)
  To: xen-devel


On Jul 19, 2004, at 4:28 AM, Chris Andrews wrote:

>
> I'm seeing this corruption on a single CPU machine, with a single 2.4 
> guest running but idle. I only ran one 2.6.7 guest, and I didn't give 
> it any work, but it didn't take any load in the 2.4 guest to provoke 
> problems.

I've not really tried real hard at not loading the VMs or dom0 OS yet.  
I've got too much I want to make them do.   :)

> The machine uses devicemapper, so I'm going to move some partitions 
> around and see if I still get corruption without it. I can also build 
> Xen with debug=y and try that, once I've got the disk sorted.

This box uses dm as well...  Clue or coincidence?  Probably 
coincidence...  I've had the segfaults in different VMs and dom0, so I 
doubt it's related to any specific LV or disk sector or anything.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        http://www.eff.org/
in blood. But if you live your     |  http://www.anti-dmca.org/
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-19 13:22             ` Keir Fraser
@ 2004-07-19 19:06               ` Derek Glidden
  2004-07-20  0:01                 ` James Harper
  0 siblings, 1 reply; 64+ messages in thread
From: Derek Glidden @ 2004-07-19 19:06 UTC (permalink / raw)
  To: xen-devel

On Jul 19, 2004, at 9:22 AM, Keir Fraser wrote:
>
> The first person to complain I think was Jody Belka, who was using
> the changeset with comment 'Fairly major fixes to the network frontend
> driver...' (2004-07-13 18:24:48). Perhaps backing off to a day before
> that would be a sensible place to start?

I'm either going to blow your theory out of the water or help a lot 
because my first "real" build of all the Xen tools & kernel & linux 
kernels where I actually booted into a dom0 kernel from Xen was from a 
checkout on either the 12th or 13th.  Prior to that I was working out 
getting everything built under gentoo and not actually running it.  And 
that's what I've been using until I checked out and rebuilt everything 
fresh this sunday afternoon and still have the problem as of last 
night.  Although a VM will segfault while dom0 seems to panic, it's 
probably the same root problem.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        http://www.eff.org/
in blood. But if you live your     |  http://www.anti-dmca.org/
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-19 18:58       ` Derek Glidden
@ 2004-07-19 19:34         ` Chris Andrews
  2004-07-20  0:04           ` James Harper
  0 siblings, 1 reply; 64+ messages in thread
From: Chris Andrews @ 2004-07-19 19:34 UTC (permalink / raw)
  To: Derek Glidden; +Cc: xen-devel


On 19 Jul 2004, at 19:58, Derek Glidden wrote:

>
> On Jul 19, 2004, at 4:28 AM, Chris Andrews wrote:
>
>
>> The machine uses devicemapper, so I'm going to move some partitions 
>> around and see if I still get corruption without it. I can also build 
>> Xen with debug=y and try that, once I've got the disk sorted.
>
> This box uses dm as well...  Clue or coincidence?  Probably 
> coincidence...  I've had the segfaults in different VMs and dom0, so I 
> doubt it's related to any specific LV or disk sector or anything.

I've moved stuff around my machine's disk so I don't need dm and 
recompiled without it, and I've seen the same crashes with the guest fs 
on a loop device, and with the guest fs on an ordinary disk partition, 
so I guess it's not specific to dm.

Chris.




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-19 18:56     ` Derek Glidden
@ 2004-07-19 23:06       ` Derek Glidden
  2004-07-20  1:01         ` Derek Glidden
  0 siblings, 1 reply; 64+ messages in thread
From: Derek Glidden @ 2004-07-19 23:06 UTC (permalink / raw)
  To: xen-devel


On Jul 19, 2004, at 2:56 PM, Derek Glidden wrote:

> I'll also be running memtest on the SMP box that's been seeing the 
> corruption when I get home as well, probably followed by CTCS.  It was 
> stable for a week or so under reasonably heavy load before I installed 
> Xen on it, but you never know...

FWIW - memtest ran for a couple of hours with no trouble.

I've booted Xen with "nosmp" and will do the same things I've been 
doing to it to make it break and see what happens.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by   |
"nickels a day can feed a child."   |       http://www.eff.org/
I thought, "How can food be so      | http://www.anti-dmca.org/
cheap over there?"  It's not, they  |--------------------------
just eat the nickels." -- Peter Nguyen




^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-19 19:06               ` Derek Glidden
@ 2004-07-20  0:01                 ` James Harper
  2004-07-20  1:04                   ` James Harper
  0 siblings, 1 reply; 64+ messages in thread
From: James Harper @ 2004-07-20  0:01 UTC (permalink / raw)
  To: Derek Glidden, xen-devel

[-- Attachment #1: Type: text/plain, Size: 2330 bytes --]

I'm pretty sure i've seen it earlier than that, but couldn't be certain. Initially I more or less expected instabilities and so wasn't really taking much notice.

so I guess my comments above are of absolutely no help at all. :)

i'll be trying a bk pull and build today (under normal linux - 2 cpus and max memory = faster builds) then verify that i can still make it crash, then try nosmp, although i've seen a few posts about single cpu crashes.

james

From: Derek Glidden
Sent: Tue 20/07/2004 5:06 AM
To: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

On Jul 19, 2004, at 9:22 AM, Keir Fraser wrote:
>
> The first person to complain I think was Jody Belka, who was using
> the changeset with comment 'Fairly major fixes to the network frontend
> driver...' (2004-07-13 18:24:48). Perhaps backing off to a day before
> that would be a sensible place to start?

I'm either going to blow your theory out of the water or help a lot 
because my first "real" build of all the Xen tools & kernel & linux 
kernels where I actually booted into a dom0 kernel from Xen was from a 
checkout on either the 12th or 13th.  Prior to that I was working out 
getting everything built under gentoo and not actually running it.  And 
that's what I've been using until I checked out and rebuilt everything 
fresh this sunday afternoon and still have the problem as of last 
night.  Although a VM will segfault while dom0 seems to panic, it's 
probably the same root problem.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        http://www.eff.org/
in blood. But if you live your     |  http://www.anti-dmca.org/
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xen-devel

[-- Attachment #2: Type: text/html, Size: 2991 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-19 19:34         ` Chris Andrews
@ 2004-07-20  0:04           ` James Harper
  0 siblings, 0 replies; 64+ messages in thread
From: James Harper @ 2004-07-20  0:04 UTC (permalink / raw)
  To: Chris Andrews, Derek Glidden; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1440 bytes --]

i'm not using dm and see lots of crashes.



From: Chris Andrews
Sent: Tue 20/07/2004 5:34 AM
To: Derek Glidden
Cc: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM


On 19 Jul 2004, at 19:58, Derek Glidden wrote:

>
> On Jul 19, 2004, at 4:28 AM, Chris Andrews wrote:
>
>
>> The machine uses devicemapper, so I'm going to move some partitions 
>> around and see if I still get corruption without it. I can also build 
>> Xen with debug=y and try that, once I've got the disk sorted.
>
> This box uses dm as well...  Clue or coincidence?  Probably 
> coincidence...  I've had the segfaults in different VMs and dom0, so I 
> doubt it's related to any specific LV or disk sector or anything.

I've moved stuff around my machine's disk so I don't need dm and 
recompiled without it, and I've seen the same crashes with the guest fs 
on a loop device, and with the guest fs on an ordinary disk partition, 
so I guess it's not specific to dm.

Chris.




[-- Attachment #2: Type: text/html, Size: 1811 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-19 23:06       ` Derek Glidden
@ 2004-07-20  1:01         ` Derek Glidden
  2004-07-20  6:56           ` Keir Fraser
  2004-07-20 15:51           ` Derek Glidden
  0 siblings, 2 replies; 64+ messages in thread
From: Derek Glidden @ 2004-07-20  1:01 UTC (permalink / raw)
  To: xen-devel


On Jul 19, 2004, at 7:06 PM, Derek Glidden wrote:

>
> I've booted Xen with "nosmp" and will do the same things I've been 
> doing to it to make it break and see what happens.

hmm.  Running this same box, same Xen kernel, same linux kernel, but 
with "nosmp", just creating a domain gives me about two dozen of these:

(XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 
4GB segment.
(XEN) (file=x86_32/emulate.c, line=235) !!!! DISALLOWING UNSAFE ACCESS 
!!!!

But so far, no crashes.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by   |
"nickels a day can feed a child."   |       http://www.eff.org/
I thought, "How can food be so      | http://www.anti-dmca.org/
cheap over there?"  It's not, they  |--------------------------
just eat the nickels." -- Peter Nguyen




^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-20  0:01                 ` James Harper
@ 2004-07-20  1:04                   ` James Harper
  2004-07-20  7:59                     ` Keir Fraser
  0 siblings, 1 reply; 64+ messages in thread
From: James Harper @ 2004-07-20  1:04 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 3575 bytes --]

bk pull only showed 2 patches, neither of which affected kernels so I didn't bother recompiling.

I have seen an error (shown by my diff script 'compare' or by xend doing silly things like crashing), by simply starting another domain and pinging it with something like:

ping -s 1400 -i 0.001 192.168.200.200

(ping -f might do it but I think it goes a bit fast)

That occurred once after about 5 minutes, but then not again for the 10 or so minutes I left it running.

running it out of memory with this code:

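#include <stdio.h>
#include <stdlib.h>
#include <string.h>     /* memset */

int main(void) {
        char *buf;
        int mem = 0;    /* MB allocated so far */
        int size = 1;   /* MB per allocation */
        char rnd;
        rnd = rand() & 255;
        while(1) {
                /* deliberately never freed: the point is to exhaust memory */
                buf = (char *)malloc(size*1024*1024);
                if (buf != NULL) {
                        /* touch every page so it is really backed by memory */
                        memset(buf, rnd, size*1024*1024);
                        mem += size;
                        printf("%d\n", mem);
                }
        }
}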

causes a crash far more quickly. I guess it's possible that those are two different errors though...
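
One way to tell silent corruption apart from a plain out-of-memory crash might be to fill each block with a known byte and re-read it later. A minimal sketch of that idea follows; the helper names fill_block and verify_block are hypothetical and not part of the program above:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fill a block with a known byte value. */
void fill_block(unsigned char *buf, size_t len, unsigned char pattern)
{
    memset(buf, pattern, len);
}

/* Hypothetical helper: return how many bytes no longer hold the pattern.
 * A non-zero result means something other than this program wrote to
 * the block after it was filled. */
size_t verify_block(const unsigned char *buf, size_t len, unsigned char pattern)
{
    size_t i, bad = 0;

    for (i = 0; i < len; i++)
        if (buf[i] != pattern)
            bad++;
    return bad;
}
```

Keeping the allocated blocks in a list and calling verify_block on each of them every so often would show whether memory is being changed underneath the process, assuming the corruption lands in its pages at all.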

James

From: James Harper
Sent: Tue 20/07/2004 10:01 AM
To: Derek Glidden; xen-devel@lists.sourceforge.net
Subject: RE: [Xen-devel] segfault in VM

I'm pretty sure i've seen it earlier than that, but couldn't be certain. Initially I more or less expected instabilities and so wasn't really taking much notice.

so I guess my comments above are of absolutely no help at all. :)

i'll be trying a bk pull and build today (under normal linux - 2 cpus and max memory = faster builds) then verify that i can still make it crash, then try nosmp, although i've seen a few posts about single cpu crashes.

james

From: Derek Glidden
Sent: Tue 20/07/2004 5:06 AM
To: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

On Jul 19, 2004, at 9:22 AM, Keir Fraser wrote:
>
> The first person to complain I think was Jody Belka, who was using
> the changeset with comment 'Fairly major fixes to the network frontend
> driver...' (2004-07-13 18:24:48). Perhaps backing off to a day before
> that would be a sensible place to start?

I'm either going to blow your theory out of the water or help a lot 
because my first "real" build of all the Xen tools & kernel & linux 
kernels where I actually booted into a dom0 kernel from Xen was from a 
checkout on either the 12th or 13th.  Prior to that I was working out 
getting everything built under gentoo and not actually running it.  And 
that's what I've been using until I checked out and rebuilt everything 
fresh this sunday afternoon and still have the problem as of last 
night.  Although a VM will segfault while dom0 seems to panic, it's 
probably the same root problem.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        http://www.eff.org/
in blood. But if you live your     |  http://www.anti-dmca.org/
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould


[-- Attachment #2: Type: text/html, Size: 6449 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-20  1:01         ` Derek Glidden
@ 2004-07-20  6:56           ` Keir Fraser
  2004-07-20 15:51           ` Derek Glidden
  1 sibling, 0 replies; 64+ messages in thread
From: Keir Fraser @ 2004-07-20  6:56 UTC (permalink / raw)
  To: Derek Glidden; +Cc: xen-devel


This could be harmless, or indicate memory corruption.

 -- Keir


> 
> On Jul 19, 2004, at 7:06 PM, Derek Glidden wrote:
> 
> >
> > I've booted Xen with "nosmp" and will do the same things I've been 
> > doing to it to make it break and see what happens.
> 
> hmm.  Running this same box, same Xen kernel, same linux kernel, but 
> with "nosmp", just creating a domain gives me about two dozen of these:
> 
> (XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 
> 4GB segment.
> (XEN) (file=x86_32/emulate.c, line=235) !!!! DISALLOWING UNSAFE ACCESS 
> !!!!
> 
> But so far, no crashes.
> 
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> "I think that's what they mean by   |
> "nickels a day can feed a child."   |       http://www.eff.org/
> I thought, "How can food be so      | http://www.anti-dmca.org/
> cheap over there?"  It's not, they  |--------------------------
> just eat the nickels." -- Peter Nguyen
> 
> 
> 




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-20  1:04                   ` James Harper
@ 2004-07-20  7:59                     ` Keir Fraser
  2004-07-20 10:42                       ` James Harper
  2004-07-20 10:52                       ` Keir Fraser
  0 siblings, 2 replies; 64+ messages in thread
From: Keir Fraser @ 2004-07-20  7:59 UTC (permalink / raw)
  To: James Harper; +Cc: xen-devel


I've just checked in a few networking fixes that should make things
rather more robust in low-memory conditions. I suspect there are still
some bugs lurking somewhere, but hopefully this has thinned out the
bugs somewhat.

 -- Keir

> bk pull only showed 2 patches, neither of which affected kernels so
> I didn't bother recompiling.
> 
> I have seen an error (shown by my diff script 'compare' or by xend
> doing silly things like crashing), by simply starting another domain
> and pinging it with something like:
> 
> ping -s 1400 -i 0.001 192.168.200.200
> 
> (ping -f might do it but I think it goes a bit fast)
> 
> That occurred once after about 5 minutes, but then not again for the 10 or so minutes I left it running.
> 
> running it out of memory with this code:
> 
> #include <stdio.h>
> #include <stdlib.h>
> int main() {
>         char *buf;
>         int mem = 0;
>         int size = 1;
>         char rnd;
>         rnd = rand() & 255;
>         while(1) {
>                 buf = (char *)malloc(size*1024*1024);
>                 memset(buf, rnd, size*1024*1024);
>                 if (buf != NULL) {
>                         mem += size;
>                         printf("%d\n", mem);
>                 }
>         }
> }
> 
> causes a crash far more quickly. I guess it's possible that those are two different errors though...
> 
> James



^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-20  7:59                     ` Keir Fraser
@ 2004-07-20 10:42                       ` James Harper
  2004-07-20 10:52                       ` Keir Fraser
  1 sibling, 0 replies; 64+ messages in thread
From: James Harper @ 2004-07-20 10:42 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1745 bytes --]

I still get corruption with these latest patches. In this case I had started 2 domains and was pinging them both fairly hard, I didn't get as far as running it out of memory.

hth

James



From: Keir Fraser
Sent: Tue 20/07/2004 5:59 PM
To: James Harper
Cc: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM


I've just checked in a few networking fixes that should make things
rather more robust in low-memory conditions. I suspect there are still
some bugs lurking somewhere, but hopefully this has thinned out the
bugs somewhat.

 -- Keir

> bk pull only showed 2 patches, neither of which affected kernels so
> I didn't bother recompiling.
> 
> I have seen an error (shown by my diff script 'compare' or by xend
> doing silly things like crashing), by simply starting another domain
> and pinging it with something like:
> 
> ping -s 1400 -i 0.001 192.168.200.200
> 
> (ping -f might do it but I think it goes a bit fast)
> 
> That occurred once after about 5 minutes, but then not again for the 10 or so minutes I left it running.
> 
> running it out of memory with this code:
> 
> #include <stdio.h>
> #include <stdlib.h>
> int main() {
>         char *buf;
>         int mem = 0;
>         int size = 1;
>         char rnd;
>         rnd = rand() & 255;
>         while(1) {
>                 buf = (char *)malloc(size*1024*1024);
>                 memset(buf, rnd, size*1024*1024);
>                 if (buf != NULL) {
>                         mem += size;
>                         printf("%d\n", mem);
>                 }
>         }
> }
> 
> causes a crash far more quickly. I guess it's possible that those are two different errors though...
> 
> James

[-- Attachment #2: Type: text/html, Size: 2402 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-20  7:59                     ` Keir Fraser
  2004-07-20 10:42                       ` James Harper
@ 2004-07-20 10:52                       ` Keir Fraser
  2004-07-20 13:38                         ` Christian Limpach
  2004-07-21  1:14                         ` James Harper
  1 sibling, 2 replies; 64+ messages in thread
From: Keir Fraser @ 2004-07-20 10:52 UTC (permalink / raw)
  To: Keir Fraser; +Cc: James Harper, xen-devel


I've checked in some more fixes that might entirely solve the problems
that everyone has been seeing.

Unfortunately xen.bkbits.net is down and I'm about to leave for
Canada. :-( Hopefully it will be possible to push to bkbits in a few
hours... 

The Changesets that will hopefully fix everything are:
   1.1116 04/07/20 11:32:39 kaf24@scramble.cl.cam.ac.uk +2 -0
   More backend driver fixes and robustifying.

   1.1115 04/07/20 11:14:24 kaf24@scramble.cl.cam.ac.uk +0 -0
   Merge scramble.cl.cam.ac.uk:/auto/groups/xeno/BK/xeno.bk
   into scramble.cl.cam.ac.uk:/local/scratch/kaf24/xeno

So keep an eye out for these when you pull --- we're very interested
to hear of further bugs in builds /with/ these changesets. :-)

 -- Keir

> I've just checked in a few networking fixes that should make things
> rather more robust in low-memory conditions. I suspect there are still
> some bugs lurking somewhere, but hopefully this has thinned out the
> bugs somewhat.
> 
>  -- Keir
> 
> > bk pull only showed 2 patches, neither of which affected kernels so
> > I didn't bother recompiling.
> > 
> > I have seen an error (shown by my diff script 'compare' or by xend
> > doing silly things like crashing), by simply starting another domain
> > and pinging it with something like:
> > 
> > ping -s 1400 -i 0.001 192.168.200.200
> > 
> > (ping -f might do it but I think it goes a bit fast)
> > 
> > That occurred once after about 5 minutes, but then not again for the 10 or so minutes I left it running.
> > 
> > running it out of memory with this code:
> > 
> > #include <stdio.h>
> > #include <stdlib.h>
> > int main() {
> >         char *buf;
> >         int mem = 0;
> >         int size = 1;
> >         char rnd;
> >         rnd = rand() & 255;
> >         while(1) {
> >                 buf = (char *)malloc(size*1024*1024);
> >                 memset(buf, rnd, size*1024*1024);
> >                 if (buf != NULL) {
> >                         mem += size;
> >                         printf("%d\n", mem);
> >                 }
> >         }
> > }
> > 
> > causes a crash far more quickly. I guess it's possible that those are two different errors though...
> > 
> > James
> 
> 




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-20 10:52                       ` Keir Fraser
@ 2004-07-20 13:38                         ` Christian Limpach
  2004-07-21  1:14                         ` James Harper
  1 sibling, 0 replies; 64+ messages in thread
From: Christian Limpach @ 2004-07-20 13:38 UTC (permalink / raw)
  To: Keir Fraser; +Cc: James Harper, xen-devel

On Tue, Jul 20, 2004 at 11:52:39AM +0100, Keir Fraser wrote:
> I've checked in some more fixes that might entirely solve the problems
> that everyone has been seeing.
> 
> Unfortunately xen.bkbits.net is down and I'm about to leave for
> Canada. :-( Hopefully it will be possible to push to bkbits in a few
> hours... 
> 
> The Changesets that will hopefully fix everything are:
>    1.1116 04/07/20 11:32:39 kaf24@scramble.cl.cam.ac.uk +2 -0
>    More backend driver fixes and robustifying.
> 
>    1.1115 04/07/20 11:14:24 kaf24@scramble.cl.cam.ac.uk +0 -0
>    Merge scramble.cl.cam.ac.uk:/auto/groups/xeno/BK/xeno.bk
>    into scramble.cl.cam.ac.uk:/local/scratch/kaf24/xeno
> 
> So keep an eye out for these when you pull --- we're very interested
> to hear of further bugs in builds /with/ these changesets. :-)

I've now pushed these changesets to the xen.bkbits repository.

    christian




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-20  1:01         ` Derek Glidden
  2004-07-20  6:56           ` Keir Fraser
@ 2004-07-20 15:51           ` Derek Glidden
  2004-07-20 18:10             ` Chris Andrews
  2004-07-21 23:39             ` Derek Glidden
  1 sibling, 2 replies; 64+ messages in thread
From: Derek Glidden @ 2004-07-20 15:51 UTC (permalink / raw)
  To: xen-devel


On Jul 19, 2004, at 9:01 PM, Derek Glidden wrote:

> (XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 
> 4GB segment.
> (XEN) (file=x86_32/emulate.c, line=235) !!!! DISALLOWING UNSAFE ACCESS 
> !!!!

After pounding on that box pretty much all evening with "nosmp", I 
wasn't able to make it crash, either in dom0 or a VM, like I had been 
able to do in SMP mode.

I had some weirdness in dom0 when I woke up and checked on it this 
morning - a compile had failed that shouldn't have, but there were no 
log messages either from Xen or dom0, so I'm not really sure what that 
was.

Tonight I'll pull the latest changes and rebuild everything, reboot it 
without "nosmp" (make it SMP again) and see what happens.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by   |
"nickels a day can feed a child."   |       http://www.eff.org/
I thought, "How can food be so      | http://www.anti-dmca.org/
cheap over there?"  It's not, they  |--------------------------
just eat the nickels." -- Peter Nguyen


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        http://www.eff.org/
in blood. But if you live your     |  http://www.anti-dmca.org/
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-20 15:51           ` Derek Glidden
@ 2004-07-20 18:10             ` Chris Andrews
  2004-07-21 23:39             ` Derek Glidden
  1 sibling, 0 replies; 64+ messages in thread
From: Chris Andrews @ 2004-07-20 18:10 UTC (permalink / raw)
  To: Derek Glidden; +Cc: xen-devel

Derek Glidden wrote:
> 
> On Jul 19, 2004, at 9:01 PM, Derek Glidden wrote:
> 
>> (XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 
>> 4GB segment.
>> (XEN) (file=x86_32/emulate.c, line=235) !!!! DISALLOWING UNSAFE ACCESS 
>> !!!!
> 
> 
> After pounding on that box pretty much all evening with "nosmp", I 
> wasn't able to make it crash, either in dom0 or a VM, like I had been 
> able to do in SMP mode.

Which revision of the code were you running there? I'd like to give it a 
go...

> I had some weirdness in dom0 when I woke up and checked on it this 
> morning - a compile had failed that shouldn't have, but there were no 
> log messages either from Xen or dom0, so I'm not really sure what that was.
> 
> Tonight I'll pull the latest changes and rebuild everything, reboot it 
> without "nosmp" (make it SMP again) and see what happens.

I've been trying various old revisions as far back as 1.1068[*] (so 
far), and I can't find one that doesn't blow up.

My test is to run James' 'compare' script in domain0 on two large 
identical files of randomness, and compile various things continuously 
in a 2.4.26 domain1. It usually takes only a few minutes to start 
showing differences, and if I leave it I'll get segfaults in domain0, 
then (with at least one revision) a panic in domain0 and reboot.
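
For anyone wanting to reproduce this kind of check, here is a minimal sketch in C of a file comparison that reports where two supposedly identical files first diverge. It is not James' actual 'compare' script, which is not shown in this part of the thread; the function name first_difference and the 4096-byte chunk size are assumptions made purely for illustration.

```c
#include <stdio.h>

/* Compare two files chunk by chunk.
 * Returns the byte offset of the first difference, -1 if the files are
 * identical, or -2 if either file cannot be opened.
 * (Hypothetical helper; not taken from the thread.) */
static long long first_difference(const char *path_a, const char *path_b)
{
    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");
    unsigned char buf_a[4096], buf_b[4096];
    long long offset = 0, result = -1;

    if (a == NULL || b == NULL) {
        if (a) fclose(a);
        if (b) fclose(b);
        return -2;
    }

    for (;;) {
        size_t na = fread(buf_a, 1, sizeof buf_a, a);
        size_t nb = fread(buf_b, 1, sizeof buf_b, b);
        size_t n = na < nb ? na : nb;
        size_t i;

        /* Scan the bytes both chunks have in common. */
        for (i = 0; i < n; i++) {
            if (buf_a[i] != buf_b[i])
                break;
        }
        if (i < n || na != nb) {
            /* Either the content differs or one file is shorter. */
            result = offset + (long long)i;
            break;
        }
        if (na == 0)
            break;              /* both files ended together */
        offset += (long long)n;
    }

    fclose(a);
    fclose(b);
    return result;
}
```

Calling such a function repeatedly on the two files while the other domain is under load, and logging any non-negative result, would approximate the test described above.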

Just now I tried the latest code (post Keir's 1.1116/1.1117 csets) and 
I'm seeing much the same results.

Hardware is a Dell 1650, single CPU, 1G RAM, aacraid controller. I've 
got rid of the devicemapper stuff I was running before, and domain1's 
root is on an ordinary disk partition.

Chris.

[*] is this a suitably precise way of specifying revision? 1.1068 is 
based on the list from:
http://xen.bkbits.net:8080/xeno-unstable.bk/ChangeSet@-2w?nav=index.html


^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-20 10:52                       ` Keir Fraser
  2004-07-20 13:38                         ` Christian Limpach
@ 2004-07-21  1:14                         ` James Harper
  2004-07-21 10:12                           ` Christian Limpach
  2004-07-21 13:30                           ` Keir Fraser
  1 sibling, 2 replies; 64+ messages in thread
From: James Harper @ 2004-07-21  1:14 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 3578 bytes --]

I downloaded these (from a tgz that Keir had given me a link to as bk was down - I assume it's identical to his latest fixes) and started my tests running and went to bed, but it looks like I got errors within a very short time.
The tests I was running were my 'compare' script and pinging the two domains I had running with
ping -q -i 0.01 -s 1400 <ip address>

Lots of oopses in the logs, most are probably as a result of the corruption and not indicative of the cause. They look similar to Jody's dump so I won't bother sending them unless someone thinks they might be useful.

btw, can the install be modified to give us a System.map-2.4.26-xen[0U] in /boot? ksymoops would be much happier.

James




From: Keir Fraser
Sent: Tue 20/07/2004 8:52 PM
To: Keir Fraser
Cc: James Harper; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM


I've checked in some more fixes that might entirely solve the problems
that everyone has been seeing.

Unfortunately xen.bkbits.net is down and I'm about to leave for
Canada. :-( Hopefully it will be possible to push to bkbits in a few
hours... 

The Changesets that will hopefully fix everything are:
   1.1116 04/07/20 11:32:39 kaf24@scramble.cl.cam.ac.uk +2 -0
   More backend driver fixes and robustifying.

   1.1115 04/07/20 11:14:24 kaf24@scramble.cl.cam.ac.uk +0 -0
   Merge scramble.cl.cam.ac.uk:/auto/groups/xeno/BK/xeno.bk
   into scramble.cl.cam.ac.uk:/local/scratch/kaf24/xeno

So keep an eye out for these when you pull --- we're very interested
to hear of further bugs in builds /with/ these changesets. :-)

 -- Keir

> I've just checked in a few networking fixes that should make things
> rather more robust in low-memory conditions. I suspect there are still
> some bugs lurking somewhere, but hopefully this has thinned out the
> bugs somewhat.
> 
>  -- Keir
> 
> > bk pull only showed 2 patches, neither of which affected kernels so
> > I didn't bother recompiling.
> > 
> > I have seen an error (shown by my diff script 'compare' or by xend
> > doing silly things like crashing), by simply starting another domain
> > and pinging it with something like:
> > 
> > ping -s 1400 -i 0.001 192.168.200.200
> > 
> > (ping -f might do it but I think it goes a bit fast)
> > 
> > That occurred once after about 5 minutes, but then not again for the 10 or so minutes I left it running.
> > 
> > running it out of memory with this code:
> > 
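> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>     /* memset */
> > int main() {
> >         char *buf;
> >         int mem = 0;
> >         int size = 1;   /* MB per allocation */
> >         char rnd;
> >         rnd = rand() & 255;
> >         while(1) {
> >                 /* never freed on purpose: the aim is to exhaust memory */
> >                 buf = (char *)malloc(size*1024*1024);
> >                 if (buf != NULL) {
> >                         /* touch every page so it is really backed */
> >                         memset(buf, rnd, size*1024*1024);
> >                         mem += size;
> >                         printf("%d\n", mem);
> >                 }
> >         }
> > }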
> > 
> > causes a crash far more quickly. I guess it's possible that those are two different errors though...
> > 
> > James
> 
> 

[-- Attachment #2: Type: text/html, Size: 4695 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-21  1:14                         ` James Harper
@ 2004-07-21 10:12                           ` Christian Limpach
  2004-07-21 13:30                           ` Keir Fraser
  1 sibling, 0 replies; 64+ messages in thread
From: Christian Limpach @ 2004-07-21 10:12 UTC (permalink / raw)
  To: James Harper; +Cc: Keir Fraser, xen-devel

On Wed, Jul 21, 2004 at 11:14:48AM +1000, James Harper wrote:
> btw, can the install be modified to give us a System.map-2.4.26-xen[0U]
> in /boot? ksymoops would be much happier.

done, the install target will now install the System.map along with
the kernel and config file.

    christian




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-21  1:14                         ` James Harper
  2004-07-21 10:12                           ` Christian Limpach
@ 2004-07-21 13:30                           ` Keir Fraser
  2004-07-21 13:47                             ` James Harper
                                               ` (3 more replies)
  1 sibling, 4 replies; 64+ messages in thread
From: Keir Fraser @ 2004-07-21 13:30 UTC (permalink / raw)
  To: James Harper; +Cc: Keir Fraser, xen-devel

Could someone try to isolate this to either the network backend driver
or the blkdev backend driver?

The best way to do this is to disable the frontend drivers so that
they never try to connect to the backend driver...

To disable networking:
Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to
always 'return 0;'.

To disable block devices:
Edit arch/xen/drivers/blkif/frontend/main.c. Change xlblk_init() to
always 'return 0;'.

Oh yes -- the 2.4 sparse tree no longer contains the net frontend
driver - you'll find the build tree symlinks to
linux-2.6.7-xen-sparse/drivers/xen/net/network.c. So you might want to
edit that instead...

Obviously, if you disable blkdevs you'll need to boot off a ramdisk
or via a networked mount. :-)
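
As a rough illustration of the "always 'return 0;'" change described above, here is a self-contained C model of an init routine that bails out before any connection is attempted. This is only a sketch: frontend_init, connect_to_backend and backend_connections are hypothetical stand-ins and are not names from the Xen tree; the real netif_init() and xlblk_init() bodies are not shown in this thread, and the real change is a source edit rather than a runtime flag.

```c
#include <stdio.h>

/* Hypothetical stand-ins; none of these names come from the Xen sources. */
static int backend_connections = 0;

/* Models whatever work the real frontend does to reach the backend. */
static void connect_to_backend(void)
{
    backend_connections++;
}

/* Models a frontend init routine such as netif_init() or xlblk_init().
 * The 'disabled' flag exists only so both behaviours can be exercised
 * in one program. */
static int frontend_init(int disabled)
{
    if (disabled)
        return 0;   /* report success, but never set up a device channel */

    connect_to_backend();
    return 0;
}
```

Because the stub still returns 0, the rest of the boot sequence sees a successful init while the backend data path is never exercised.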

 Cheers,
 Keir

> I downloaded these (from a tgz that Keir had given me a link to as bk was down - I assume it's identical to his latest fixes) and started my tests running and went to bed, but it looks like I got errors within a very short time.
> The tests I was running were my 'compare' script and pinging the two domains I had running with
> ping -q -i 0.01 -s 1400 <ip address>
> 
> Lots of oopses in the logs, most are probably as a result of the corruption and not indicative of the cause. They look similar to Jody's dump so I won't bother sending them unless someone thinks they might be useful.
> 
> btw, can the install be modified to give us a System.map-2.4.26-xen[0U] in /boot? ksymoops would be much happier.
> 
> James


^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-21 13:30                           ` Keir Fraser
@ 2004-07-21 13:47                             ` James Harper
  2004-07-21 14:17                               ` Keir Fraser
  2004-07-22  1:48                             ` segfault in VM Derek Glidden
                                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 64+ messages in thread
From: James Harper @ 2004-07-21 13:47 UTC (permalink / raw)
  Cc: Keir Fraser, xen-devel

[-- Attachment #1: Type: text/plain, Size: 1765 bytes --]

i'll try this out tomorrow morning (too late tonight - need sleep!)

From: Keir Fraser
Sent: Wed 21/07/2004 11:30 PM
To: James Harper
Cc: Keir Fraser; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

Could someone try to isolate this to either the network backend driver
or the blkdev backend driver?

The best way to do this is to disable the frontend drivers so that
they never try to connect to the backend driver...

To disable networking:
Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to
always 'return 0;'.

To disable block devices:
Edit arch/xen/drivers/blkif/frontend/main.c. Change xlblk_init() to
always 'return 0;'.

Oh yes -- the 2.4 sparse tree no longer contains the net frontend
driver - you'll find the build tree symlinks to
linux-2.6.7-xen-sparse/drivers/xen/net/network.c. So you might want to
edit that instead...

Obviously, if you disable blkdevs you'll need to boot off a ramdisk
or via a networked mount. :-)

 Cheers,
 Keir

> I downloaded these (from a tgz that Keir had given me a link to as bk was down - I assume it's identical to his latest fixes) and started my tests running and went to bed, but it looks like I got errors within a very short time.
> The tests I was running were my 'compare' script and pinging the two domains I had running with
> ping -q -i 0.01 -s 1400 <ip address>
> 
> Lots of oopses in the logs, most are probably as a result of the corruption and not indicative of the cause. They look similar to Jody's dump so I won't bother sending them unless someone thinks they might be useful.
> 
> btw, can the install be modified to give us a System.map-2.4.26-xen[0U] in /boot? ksymoops would be much happier.
> 
> James

[-- Attachment #2: Type: text/html, Size: 2119 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-21 13:47                             ` James Harper
@ 2004-07-21 14:17                               ` Keir Fraser
  2004-07-22  4:36                                 ` James Harper
  0 siblings, 1 reply; 64+ messages in thread
From: Keir Fraser @ 2004-07-21 14:17 UTC (permalink / raw)
  To: James Harper; +Cc: Keir Fraser, xen-devel

That would be extremely helpful! If it turns out to be the net backend
(probably most likely, although I guess it may not be a backend
problem at all, which would be harder to debug), then we can isolate
it to the receive or transmit path as follows:

To disable the receive path for guest OSes:
Edit netif_be_start_xmit in arch/xen/drivers/netif/backend/main.c to
always 'goto drop;'.

To disable the transmit path for guest OSes:
Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the
call to netif_schedule_work(), add:
  make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
  netif_put(netif);
  continue;

With one half of the network path disabled, to load up the remaining
direction you'll need to flood ping from an external machine to the
guest OS (when you disable the guest's transmit path) or flood ping
out from the guest (when you disable its rx path). I guess in both
cases you'll need a broadcast ping (yuk!) since ARP won't work (needs
both tx and rx).
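
To make the intent of the transmit-path change concrete, here is a small self-contained C model of a request loop that acknowledges every request straight away and skips the data path. It is a sketch under stated assumptions, not the actual net_tx_action code: struct tx_request, struct tx_stats and process_tx are hypothetical names invented for illustration, and the tx_disabled flag stands in for the source edit described above.

```c
#include <stddef.h>

/* Hypothetical model of a backend transmit loop; the types and names
 * below are illustrative and are not taken from the Xen sources. */
struct tx_request {
    int id;
};

struct tx_stats {
    size_t acked;        /* requests answered with an OK response */
    size_t transmitted;  /* requests whose payload was actually sent */
    int last_acked_id;   /* id of the most recently answered request */
};

static void process_tx(const struct tx_request *reqs, size_t n,
                       int tx_disabled, struct tx_stats *st)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (tx_disabled) {
            /* Acknowledge immediately and skip the data path, in the
             * spirit of the respond / put / continue sequence above. */
            st->acked++;
            st->last_acked_id = reqs[i].id;
            continue;
        }

        /* Normal path: "send" the packet, then acknowledge it. */
        st->transmitted++;
        st->acked++;
        st->last_acked_id = reqs[i].id;
    }
}
```

The guest still sees each request completed successfully, so it keeps generating load, but nothing is ever pushed through the transmit data path.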

 -- Keir

> i'll try this out tomorrow morning (too late tonight - need sleep!)
> 
> 
> 
> From: Keir Fraser
> Sent: Wed 21/07/2004 11:30 PM
> To: James Harper
> Cc: Keir Fraser; xen-devel@lists.sourceforge.net
> Subject: Re: [Xen-devel] segfault in VM
> 
> 
> Could someone try to isolate this to either the network backend driver
> or the blkdev backend driver?
> 
> The best way to do this is to disable the frontend drivers so that
> they never try to connect to the backend driver...
> 
> To disable networking:
> Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to
> always 'return 0;'.
> 
> To disable block devices:
> Edit arch/xen/drivers/blkif/frontend/main.c. Change xlblk_init() to
> always 'return 0;'.
> 
> Oh yes -- the 2.4 sparse tree no longer contains the net frontend
> driver - you'll find the build tree symlinks to
> linux-2.6.7-xen-sparse/drivers/xen/net/network.c. So you might want to
> edit that instead...
> 
> Obviously, if you disable blkdevs you'll need to boot off a ramdisk
> or via a networked mount. :-)
> 
>  Cheers,
>  Keir
> 
> 
> > I downloaded these (from a tgz that Keir had given me a link to as bk was down - I assume it's identical to his latest fixes) and started my tests running and went to bed, but it looks like I got errors within a very short time.
> > The tests I was running were my 'compare' script and pinging the two domains I had running with
> > ping -q -i 0.01 -s 1400 <ip address>
> > 
> > Lots of oopses in the logs, most are probably as a result of the corruption and not indicative of the cause. They look similar to Jody's dump so I won't bother sending them unless someone thinks they might be useful.
> > 
> > btw, can the install be modified to give us a System.map-2.4.26-xen[0U] in /boot? ksymoops would be much happier.
> > 
> > James



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-20 15:51           ` Derek Glidden
  2004-07-20 18:10             ` Chris Andrews
@ 2004-07-21 23:39             ` Derek Glidden
  1 sibling, 0 replies; 64+ messages in thread
From: Derek Glidden @ 2004-07-21 23:39 UTC (permalink / raw)
  To: xen-devel


FWIW: after coming home last night from work, dom0 crashed right away 
as soon as I logged in.

I rebooted, repaired, and checked out and rebuilt everything and so 
far, so good.  It hasn't generated those same "Bailing" messages when I 
create a domain at least.

If I can keep everything up and running tonight, I'll start hammering 
on them using the compare/ping thing and see what I can make break.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        http://www.eff.org/
in blood. But if you live your     |  http://www.anti-dmca.org/
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-21 13:30                           ` Keir Fraser
  2004-07-21 13:47                             ` James Harper
@ 2004-07-22  1:48                             ` Derek Glidden
  2004-07-22  1:54                               ` Keir Fraser
  2004-07-22  1:57                             ` James Harper
  2004-07-22  5:28                             ` Derek Glidden
  3 siblings, 1 reply; 64+ messages in thread
From: Derek Glidden @ 2004-07-22  1:48 UTC (permalink / raw)
  To: xen-devel


On Jul 21, 2004, at 9:30 AM, Keir Fraser wrote:

>
> Could someone try to isolate this to either the network backend driver
> or the blkdev backend driver?
>
> The best way to do this is to disable the frontend drivers so that
> they never try to connect to the backend driver...

I'll give this a go as well, but, is this for dom0 or domU kernels?

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by   |
"nickels a day can feed a child."   |       http://www.eff.org/
I thought, "How can food be so      | http://www.anti-dmca.org/
cheap over there?"  It's not, they  |--------------------------
just eat the nickels." -- Peter Nguyen




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-22  1:48                             ` segfault in VM Derek Glidden
@ 2004-07-22  1:54                               ` Keir Fraser
  2004-07-22  2:39                                 ` Derek Glidden
  0 siblings, 1 reply; 64+ messages in thread
From: Keir Fraser @ 2004-07-22  1:54 UTC (permalink / raw)
  To: Derek Glidden; +Cc: xen-devel

> 
> On Jul 21, 2004, at 9:30 AM, Keir Fraser wrote:
> 
> >
> > Could someone try to isolate this to either the network backend driver
> > or the blkdev backend driver?
> >
> > The best way to do this is to disable the frontend drivers so that
> > they never try to connect to the backend driver...
> 
> I'll give this a go as well, but, is this for dom0 or domU kernels?

It's modifying the frontend drivers in the domU kernel so that the
data paths in the dom0 backend drivers do not get executed.

i.e., it's the domU kernel that needs recompiling.

 -- Keir



^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-21 13:30                           ` Keir Fraser
  2004-07-21 13:47                             ` James Harper
  2004-07-22  1:48                             ` segfault in VM Derek Glidden
@ 2004-07-22  1:57                             ` James Harper
  2004-07-22  2:03                               ` Keir Fraser
  2004-07-22  5:28                             ` Derek Glidden
  3 siblings, 1 reply; 64+ messages in thread
From: James Harper @ 2004-07-22  1:57 UTC (permalink / raw)
  Cc: Keir Fraser, xen-devel

[-- Attachment #1: Type: text/plain, Size: 2792 bytes --]

i'm building this now, and am just thinking about how to test this... I was using a ping as my test mechanism. I guess i'll do lots of block device copies. I guess this lends weight to your thoughts that it probably is a net problem and not a block problem.

Instead of changing the source code to disable the net stuff, would it work if I just specified 'nics=0' or is some part of the net subsystem still activated? I'll test this too anyway.

In order to test disabling send or receive, this might be a bit trickier than you first make out. Send-only should be easy enough, just start another domain and then ping it (a manual ARP table entry should alleviate the need to broadcast). Receive-only will be trickier. How do you get a domain to send to it? This problem of course assumes that corruption is not limited to the domain... if it is limited to the domain then you should be able to have a send/receive domain and ignore crashes in there, just focus on the crashes in the receive-only domain.

i'm almost confused, but am about to start testing - firstly with no network.

James

From: Keir Fraser
Sent: Wed 21/07/2004 11:30 PM
To: James Harper
Cc: Keir Fraser; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

Could someone try to isolate this to either the network backend driver
or the blkdev backend driver?

The best way to do this is to disable the frontend drivers so that
they never try to connect to the backend driver...

To disable networking:
Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to
always 'return 0;'.

To disable block devices:
Edit arch/xen/drivers/blkif/frontend/main.c. Change xlblk_init() to
always 'return 0;'.

Oh yes -- the 2.4 sparse tree no longer contains the net frontend
driver - you'll find the build tree symlinks to
linux-2.6.7-xen-sparse/drivers/xen/net/network.c. So you might want to
edit that instead...

Obviously, if you disable blkdevs you'll need to boot off a ramdisk
or via a networked mount. :-)

 Cheers,
 Keir

> I downloaded these (from a tgz that Keir had given me a link to as bk was down - I assume it's identical to his latest fixes) and started my tests running and went to bed, but it looks like I got errors within a very short time.
> The tests I was running were my 'compare' script and pinging the two domains I had running with
> ping -q -i 0.01 -s 1400 <ip address>
> 
> Lots of oopses in the logs, most are probably as a result of the corruption and not indicative of the cause. They look similar to Jody's dump so I won't bother sending them unless someone thinks they might be useful.
> 
> btw, can the install be modified to give us a System.map-2.4.26-xen[0U] in /boot? ksymoops would be much happier.
> 
> James

[-- Attachment #2: Type: text/html, Size: 3634 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-22  1:57                             ` James Harper
@ 2004-07-22  2:03                               ` Keir Fraser
  2004-07-22  2:48                                 ` James Harper
  0 siblings, 1 reply; 64+ messages in thread
From: Keir Fraser @ 2004-07-22  2:03 UTC (permalink / raw)
  To: James Harper; +Cc: Keir Fraser, xen-devel

> i'm building this now, and am just thinking about how to test this... I was using a ping as my test mechanism. I guess i'll do lots of block device copies. I guess this lends weight to your thoughts that it probably is a net problem and not a block problem.
> 
> Instead of changing the source code to disable the net stuff, would it work if I just specified 'nics=0' or is some part of the net subsystem still activated? I'll test this too anyway.

I think the source will need to be changed. In any case, it's a
trivial change and then we can be certain that no device channel is
being set up.

> In order to test disabling send or receive, this might be a bit trickier than you first make out. Send-only should be easy enough, just start another domain and then ping it (a manual ARP table entry should alleviate the need to broadcast). Receive-only will be trickier. How do you get a domain to send to it? This problem of course assumes that corruption is not limited to the domain... if it is limited to the domain then you should be able to have a send/receive domain and ignore crashes in there, just focus on the crashes in the receive-only domain.

That's the reason for the broadcast ping. Unfortunately I'm not sure
how useful that will turn out to be -- e.g., we may just end up hosing
DOM0. 

> i'm almost confused, but am about to start testing - firstly with no network.

Stage 1 (isolating blkdev and network) shouldn't be too
hard. Basically we're ensuring the data paths in the backend drivers
do not get executed -- they will only ever execute if there is a
device channel set up to a frontend in another guest, so disabling the
frontend drivers ensures this.

 -- Keir

> James
> 
> 
> From: Keir Fraser
> Sent: Wed 21/07/2004 11:30 PM
> To: James Harper
> Cc: Keir Fraser; xen-devel@lists.sourceforge.net
> Subject: Re: [Xen-devel] segfault in VM
> 
> 
> Could someone try to isolate this to either the network backend driver
> or the blkdev backend driver?
> 
> The best way to do this is to disable the frontend drivers so that
> they never try to connect to the backend driver...
> 
> To disable networking:
> Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to
> always 'return 0;'.
> 
> To disable block devices:
> Edit arch/xen/drivers/blkif/frontend/main.c. Change xlblk_init() to
> always 'return 0;'.
> 
> Oh yes -- the 2.4 sparse tree no longer contains the net frontend
> driver - you'll find the build tree symlinks to
> linux-2.6.7-xen-sparse/drivers/xen/net/network.c. So you might want to
> edit that instead...
> 
> Obviously, if you disable blkdevs you'll need to boot off a ramdisk
> or via a networked mount. :-)
> 
>  Cheers,
>  Keir
> 
> 
> > I downloaded these (from a tgz that Keir had given me a link to as bk was down - I assume it's identical to his latest fixes) and started my tests running and went to bed, but it looks like I got errors within a very short time.
> > The tests I was running were my 'compare' script and pinging the two domains I had running with
> > ping -q -i 0.01 -s 1400 <ip address>
> > 
> > Lots of oopses in the logs, most are probably as a result of the corruption and not indicative of the cause. They look similar to Jody's dump so I won't bother sending them unless someone thinks they might be useful.
> > 
> > btw, can the install be modified to give us a System.map-2.4.26-xen[0U] in /boot? ksymoops would be much happier.
> > 
> > James

--_6A1C7D2E-1D2E-47A8-818D-57D5389770AA_
Content-Type: text/html;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<HTML><HEAD></HEAD>
<BODY>
<DIV id=3DidOWAReplyText8898 dir=3Dltr>
<DIV dir=3Dltr><FONT face=3DArial color=3D#000000 size=3D2>i'm building thi=
s now, and am</FONT><FONT face=3DArial size=3D2> just thinking about how to=
 test this... I was using a ping as my test mechanism. I guess i'll do lots=
 of block device copies. I guess this lends weight to your thoughts that it=
 probably is a net problem and not a block problem.</FONT></DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2>Instead of changing the source c=
ode to disable the net stuff, would it work if I just specified 'nics=3D0' =
or is some part of the net subsystem still activated? </FONT><FONT face=3DA=
rial size=3D2>I'll test this too anyway.</FONT></DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2>In order to test disabling send =
or receive, this might be a bit trickier than you first make out. Send-only=
 should be easy enough, just start another domain and then ping it (a manua=
l arp table entry should alleviate the need to broadcast). Receive-only wil=
l be tricker. How do you get a domain to send to it? This problem of course=
 assumes that corruption is not&nbsp;limited to the domain... if it is limi=
ted to the domain then you should be able to have a send/receive domain and=
 ignore crashes in there, just focus on the crashes in the receive-only dom=
ain.</FONT></DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2>i'm almost confused, but am abou=
t to start testing - firstly with no network.</FONT></DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2>James</FONT></DIV></DIV>
<DIV dir=3Dltr>
<HR tabIndex=3D-1>
<FONT face=3DTahoma size=3D2><B>From:</B> Keir Fraser<BR><B>Sent:</B> Wed 2=
1/07/2004 11:30 PM<BR><B>To:</B> James Harper<BR><B>Cc:</B> Keir Fraser; xe=
n-devel@lists.sourceforge.net<BR><B>Subject:</B> Re: [Xen-devel] segfault i=
n VM<BR></FONT><BR></DIV>
<DIV><PRE style=3D"WORD-WRAP: break-word">Could someone try to isolate this=
 to either the network backend driver
or the blkdev backend driver?

The best way to do this is to disable the frontend drivers so that
they never try to coinnect to the backend driver...

To disable networking:
Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to
always 'return 0;'.

To disable block devices:
Edit arch/xen/drivers/blkif/frontend/main.c. Change xlblk_init() to
always 'return 0;'.

Oh yes -- the 2.4 sparse tree no longer contains the net frontend
driver - you'll find the build tree symlinks to
linux-2.6.7-xen-sparse/drivers/xen/net/network.c. So you might want to
edit that instead...

Obviously, if you disable blkdevs you'll need to boot off a ramdisk
or via a networked mount. :-)

 Cheers,
 Keir

&gt; I downloaded these (from a tgz that Keir had given me a link to as bk =
was down - I assume it's identical to his latest fixes) and started my test=
s running and went to bed, but it looks like I got errors within a very sho=
rt time.
&gt; The tests I was running were my 'compare' script and pinging the two d=
omains I had running with
&gt; ping -q -i 0.01 -s 1400 &lt;ip address&gt;
&gt;=20
&gt; Lots of oopses in the logs, most are probably as a result of the corru=
ption and not indicative of the cause. They look similar to Jody's dump so =
I won't bother sending them unless someone thinks they might be useful.
&gt;=20
&gt; btw, can the install be modified to give us a System.map-2.4.26-xen[0U=
] in /boot? ksymoops would be much happier.
&gt;=20
&gt; James
</PRE></DIV></BODY></HTML>

--_6A1C7D2E-1D2E-47A8-818D-57D5389770AA_--


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-22  1:54                               ` Keir Fraser
@ 2004-07-22  2:39                                 ` Derek Glidden
  0 siblings, 0 replies; 64+ messages in thread
From: Derek Glidden @ 2004-07-22  2:39 UTC (permalink / raw)
  To: xen-devel


On Jul 21, 2004, at 9:54 PM, Keir Fraser wrote:

> It's modifying the frontend drivers in the domU kernel so that the
> data paths in the dom0 backend drivers do not get executed.
>
> i.e., it's the domU kernel that needs recompiling.

Got it.  domU kernel recompiled and now running large amounts of block 
i/o while dom0 gets pinged and also does large amounts of block i/o.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by   |
"nickels a day can feed a child."   |       http://www.eff.org/
I thought, "How can food be so      | http://www.anti-dmca.org/
cheap over there?"  It's not, they  |--------------------------
just eat the nickels." -- Peter Nguyen




^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-22  2:03                               ` Keir Fraser
@ 2004-07-22  2:48                                 ` James Harper
  2004-07-22  2:56                                   ` Keir Fraser
  0 siblings, 1 reply; 64+ messages in thread
From: James Harper @ 2004-07-22  2:48 UTC (permalink / raw)
  Cc: Keir Fraser, xen-devel

[-- Attachment #1: Type: text/plain, Size: 11696 bytes --]

As a first test I have just disabled networking via nics=0 in the config, and am running this script in dom1:
#!/bin/sh
while [ 1 = 1 ]
do
  dd if=/dev/sda1 of=/dev/null bs=1024 count=128K &
  dd if=/dev/sda1 of=/dev/null bs=1024 skip=256K count=256K
done

It tells me 'ioctl 801c6d02 not supported by XL blkif' but that doesn't seem to matter. Anyway, there are no crashes so far, so I'm thinking at this stage that the block interface stuff is probably fine and I should now concentrate on the network. Disabling the block stuff will be a huge hassle at this stage so I'll have to let it go for the moment.

I think I need a crash course in how all this hangs together before I can understand what I'm testing... My understanding is as follows:

packets sent to dom0.vif1.0 appear at dom1.eth0.
packets sent to dom1.eth0 appear at dom0.vif1.0.

and that's about it. Are they symmetrical? Is the transmit code for dom0.vif1.0 the same as the transmit code for dom1.eth0? Ditto for receive?

James

From: Keir Fraser
Sent: Thu 22/07/2004 12:03 PM
To: James Harper
Cc: Keir Fraser; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

> i'm building this now, and am just thinking about how to test this... I was using a ping as my test mechanism. I guess i'll do lots of block device copies. I guess this lends weight to your thoughts that it probably is a net problem and not a block problem.
> 
> Instead of changing the source code to disable the net stuff, would it work if I just specified 'nics=0' or is some part of the net subsystem still activated? I'll test this too anyway.

I think the source will need to be changed. In any case, it's a
trivial change and then we can be certain that no device channel is
being set up.

> In order to test disabling send or receive, this might be a bit trickier than you first make out. Send-only should be easy enough, just start another domain and then ping it (a manual arp table entry should alleviate the need to broadcast). Receive-only will be tricker. How do you get a domain to send to it? This problem of course assumes that corruption is not limited to the domain... if it is limited to the domain then you should be able to have a send/receive domain and ignore crashes in there, just focus on the crashes in the receive-only domain.

That's the reason for the broadcast ping. Unfortunately I'm not sure
how useful that will turn out to be -- e.g., we may just end up hosing
DOM0. 

> i'm almost confused, but am about to start testing - firstly with no network.

Stage 1 (isolating blkdev and network) shouldn't be too
hard. Basically we're ensuring the data paths in the backend drivers
do not get executed -- they will only ever execute if there is a
device channel set up to a frontend in another guest, so disabling the
frontend drivers ensures this.

 -- Keir

> James
> 
> 
> From: Keir Fraser
> Sent: Wed 21/07/2004 11:30 PM
> To: James Harper
> Cc: Keir Fraser; xen-devel@lists.sourceforge.net
> Subject: Re: [Xen-devel] segfault in VM
> 
> 
> Could someone try to isolate this to either the network backend driver
> or the blkdev backend driver?
> 
> The best way to do this is to disable the frontend drivers so that
> they never try to connect to the backend driver...
> 
> To disable networking:
> Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to
> always 'return 0;'.
> 
> To disable block devices:
> Edit arch/xen/drivers/blkif/frontend/main.c. Change xlblk_init() to
> always 'return 0;'.
> 
> Oh yes -- the 2.4 sparse tree no longer contains the net frontend
> driver - you'll find the build tree symlinks to
> linux-2.6.7-xen-sparse/drivers/xen/net/network.c. So you might want to
> edit that instead...
> 
> Obviously, if you disable blkdevs you'll need to boot off a ramdisk
> or via a networked mount. :-)
> 
>  Cheers,
>  Keir
> 
> 
> > I downloaded these (from a tgz that Keir had given me a link to as bk was down - I assume it's identical to his latest fixes) and started my tests running and went to bed, but it looks like I got errors within a very short time.
> > The tests I was running were my 'compare' script and pinging the two domains I had running with
> > ping -q -i 0.01 -s 1400 <ip address>
> > 
> > Lots of oopses in the logs, most are probably as a result of the corruption and not indicative of the cause. They look similar to Jody's dump so I won't bother sending them unless someone thinks they might be useful.
> > 
> > btw, can the install be modified to give us a System.map-2.4.26-xen[0U] in /boot? ksymoops would be much happier.
> > 
> > James

[-- Attachment #2: Type: text/html, Size: 13487 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-22  2:48                                 ` James Harper
@ 2004-07-22  2:56                                   ` Keir Fraser
  2004-07-22  3:49                                     ` James Harper
  0 siblings, 1 reply; 64+ messages in thread
From: Keir Fraser @ 2004-07-22  2:56 UTC (permalink / raw)
  To: James Harper; +Cc: Keir Fraser, xen-devel

> As a first test I have just disabled networking via nics=0 in the config, and running this script in dom1:
> #!/bin/sh
> while [ 1 = 1 ]
> do
>   dd if=/dev/sda1 of=/dev/null bs=1024 count=128K &
>   dd if=/dev/sda1 of=/dev/null bs=1024 skip=256K count=256K
> done
> 
> it tells me 'ioctl 801c6d02 not supported by XL blkif' but that doesn't seem to matter. Anyway, there are no crashes so far so i'm thinking at this stage that the block interface stuff is probably fine and I should now concentrate on the network. Disabling the block stuff will be a huge hassle at this stage so i'll have to let it go for the moment.

It does seem more likely that the network backend driver is to blame
-- it's considerably more complicated than the blkdev driver.

> I think i need a crash course in how all this hangs together before I can understand what i'm testing... My understanding is as follows:
> 
> packets sent to dom0.vif1.0 appear at dom1.eth0.
> packets sent to dom1.eth0 appear at dom0.vif1.0.

Yes, it's basically a point-to-point link. The transmit side on each
interface is directly linked to the receive side on the other.

> and that's about it. Are they symmetrical? Is the transmit code for dom0.vif1.0 the same as the transmit code for dom1.eth0? Ditto for receive?

No. dom1.eth0 is implemented by the frontend driver
arch/xen/drivers/netif/frontend/main.c
dom0.vif* is implemented by arch/xen/drivers/netif/backend/main.c 

So they look symmetric to users, but the implementation is not
symmetric. 

 -- Keir
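
As an illustration of the point-to-point description above, here is a toy C model (not Xen code; every name in it is hypothetical). Two endpoints are joined so that transmitting on one bumps the receive counter of the other. It only models what users observe, not the separate frontend and backend implementations.

```c
#include <stddef.h>

/* Toy model only: none of these names exist in the Xen tree. */
struct link_end {
    const char *name;         /* e.g. "dom0.vif1.0" or "dom1.eth0" */
    struct link_end *peer;    /* interface at the other end of the link */
    unsigned long tx_packets; /* packets sent from this end */
    unsigned long rx_packets; /* packets that arrived at this end */
};

/* Join two ends into a point-to-point link. */
void link_connect(struct link_end *a, struct link_end *b)
{
    a->peer = b;
    b->peer = a;
}

/* Transmit on one end: the packet shows up as a receive on the peer.
 * Returns 0 on success, -1 if the end is not connected. */
int link_send(struct link_end *from)
{
    if (from->peer == NULL)
        return -1;
    from->tx_packets++;
    from->peer->rx_packets++;
    return 0;
}
```

Connecting a "dom0.vif1.0" end to a "dom1.eth0" end and calling link_send() on either one increments the receive counter on the other, which is the symmetry users see.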



^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-22  2:56                                   ` Keir Fraser
@ 2004-07-22  3:49                                     ` James Harper
  2004-07-22 11:54                                       ` Keir Fraser
  0 siblings, 1 reply; 64+ messages in thread
From: James Harper @ 2004-07-22  3:49 UTC (permalink / raw)
  Cc: Keir Fraser, xen-devel

[-- Attachment #1: Type: text/plain, Size: 2497 bytes --]

Okay, I have made the following change in dom0:

To disable the transmit path for guest OSes:
Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the
call to netif_schedule_work(), add:
  make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
  netif_put(netif);
  continue;

Compiled and rebooted with the new kernel. Booted dom1, removed vif1.0 from the bridge, gave it its own IP address, added a static ARP entry and pinged away. I could see the packet counters for dom0 and dom1 climbing rapidly, indicating that dom0 was sending packets and dom1 was receiving them, but that packets sent by dom1 were unable to reach dom0. I got the same sort of crashes after about 10 minutes.

I'm now testing the other half.

James
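
The effect of that three-line change can be sketched with a toy loop in plain C. This is a simulation under assumptions, not the backend source: toy_tx_request, toy_tx_stats and toy_net_tx_action are made-up names, and only the shape (acknowledge the request, then 'continue' before the packet is forwarded) is taken from the instructions above.

```c
#include <stddef.h>

/* Hypothetical stand-ins; these are not the real netif structures. */
struct toy_tx_request { unsigned short id; };

struct toy_tx_stats {
    unsigned long responses; /* requests acknowledged back to the guest */
    unsigned long forwarded; /* packets passed on towards dom0's stack */
    unsigned short last_id;  /* id of the most recently acknowledged request */
};

/* Process n transmit requests. With drop_tx set, each request is
 * acknowledged and then skipped, so the guest believes the packet was
 * sent but nothing is ever forwarded. */
void toy_net_tx_action(const struct toy_tx_request *reqs, size_t n,
                       int drop_tx, struct toy_tx_stats *st)
{
    for (size_t i = 0; i < n; i++) {
        if (drop_tx) {
            /* Mirrors the added lines: respond OKAY, then skip the rest. */
            st->responses++;
            st->last_id = reqs[i].id;
            continue;
        }
        /* Normal path: forward the packet, then acknowledge it. */
        st->forwarded++;
        st->responses++;
        st->last_id = reqs[i].id;
    }
}
```

With drop_tx set, the response count climbs while the forwarded count stays at zero, which matches the observation that packets sent by dom1 never reached dom0.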

From: Keir Fraser
Sent: Thu 22/07/2004 12:56 PM
To: James Harper
Cc: Keir Fraser; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

> As a first test I have just disabled networking via nics=0 in the config, and running this script in dom1:
> #!/bin/sh
> while [ 1 = 1 ]
> do
>   dd if=/dev/sda1 of=/dev/null bs=1024 count=128K &
>   dd if=/dev/sda1 of=/dev/null bs=1024 skip=256K count=256K
> done
> 
> it tells me 'ioctl 801c6d02 not supported by XL blkif' but that doesn't seem to matter. Anyway, there are no crashes so far so i'm thinking at this stage that the block interface stuff is probably fine and I should now concentrate on the network. Disabling the block stuff will be a huge hassle at this stage so i'll have to let it go for the moment.

It does seem more likely that the network backend driver is to blame
-- it's considerably more complicated than the blkdev driver.

> I think i need a crash course in how all this hangs together before I can understand what i'm testing... My understanding is as follows:
> 
> packets sent to dom0.vif1.0 appear at dom1.eth0.
> packets sent to dom1.eth0 appear at dom0.vif1.0.

Yes, it's basically a point-to-point link. The transmit side on each
interface is directly linked to the receive side on the other.

> and that's about it. Are they symmetrical? Is the transmit code for dom0.vif1.0 the same as the transmit code for dom1.eth0? Ditto for receive?

No. dom1.eth0 is implemented by the frontend driver
arch/xen/drivers/netif/frontend/main.c
dom0.vif* is implemented by arch/xen/drivers/netif/backend/main.c 

So they look symmetric to users, but the implementation is not
symmetric. 

 -- Keir

[-- Attachment #2: Type: text/html, Size: 3756 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-21 14:17                               ` Keir Fraser
@ 2004-07-22  4:36                                 ` James Harper
  2004-07-22 11:22                                   ` Keir Fraser
  0 siblings, 1 reply; 64+ messages in thread
From: James Harper @ 2004-07-22  4:36 UTC (permalink / raw)
  Cc: Keir Fraser, xen-devel

[-- Attachment #1: Type: text/plain, Size: 7872 bytes --]

At this stage, it looks like disabling the receive path for the guest OS (i.e. making netif_be_start_xmit always 'goto drop') means that I can ping from the guest OS all I like with no crashes. I hope that's the right way around to do it...

I'm just looking at that procedure. How is the ring actually managed, and what do all the _prod and _cons variables actually represent? And how is synchronisation handled between the domains? I notice there is no spinlock in there; is this done by the calling function?

james
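
This message leaves the ring questions open. As general background only: names ending in _prod and _cons are conventionally producer and consumer indices into a shared ring. The producer advances its index after filling a slot, the consumer advances its own after emptying one, and because each index is written by one side only, rings of this kind can typically rely on memory barriers rather than a spinlock. The sketch below shows that convention in single-threaded C. toy_ring and its functions are hypothetical, the barriers are only marked in comments, and it should not be read as the actual netif ring layout.

```c
#include <stdint.h>

#define TOY_RING_SIZE 8u /* must be a power of two so masking works */

/* Hypothetical ring; not the Xen netif ring structure. */
struct toy_ring {
    uint32_t prod;            /* free-running count of slots produced */
    uint32_t cons;            /* free-running count of slots consumed */
    int slots[TOY_RING_SIZE];
};

/* Entries produced but not yet consumed. Unsigned subtraction stays
 * correct even after the counters wrap around. */
uint32_t toy_ring_pending(const struct toy_ring *r)
{
    return r->prod - r->cons;
}

/* Producer side: returns 0 on success, -1 if the ring is full. */
int toy_ring_put(struct toy_ring *r, int value)
{
    if (toy_ring_pending(r) == TOY_RING_SIZE)
        return -1;
    r->slots[r->prod & (TOY_RING_SIZE - 1)] = value;
    /* A real shared ring needs a write barrier here, before publishing. */
    r->prod++;
    return 0;
}

/* Consumer side: returns 0 and stores the value, or -1 if empty. */
int toy_ring_get(struct toy_ring *r, int *value)
{
    if (toy_ring_pending(r) == 0)
        return -1;
    /* A real shared ring needs a read barrier here, after reading prod. */
    *value = r->slots[r->cons & (TOY_RING_SIZE - 1)];
    r->cons++;
    return 0;
}
```

The indices are never reset: only their difference and their low bits matter, so an empty ring (prod == cons) and a full one (prod - cons == size) are always distinguishable.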

From: Keir Fraser
Sent: Thu 22/07/2004 12:17 AM
To: James Harper
Cc: Keir Fraser; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

That would be extremely helpful! If it turns out to be the net backend
(probably most likely, although I guess it may not be a backend
problem at all, which would be harder to debug), then we can isolate
it to the receive or transmit path as follows:

To disable the receive path for guest OSes:
Edit netif_be_start_xmit in arch/xen/drivers/netif/backend/main.c to
always 'goto drop;'.

To disable the transmit path for guest OSes:
Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the
call to netif_schedule_work(), add:
  make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
  netif_put(netif);
  continue;

With one half of the network path disabled, to load up the remaining
direction you'll need to flood ping from an external machine to the
guest OS (when you disable the guest's transmit path) or flood ping
out from the guest (when you disable its rx path). I guess in both
cases you'll need a broadcast ping (yuk!) since ARP won't work (needs
both tx and rx).

 -- Keir
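
The "always 'goto drop;'" edit for the receive path can be pictured with the toy C function below. It is only a sketch of the idiom: toy_packet, toy_be_start_xmit and the force_drop switch are hypothetical and are not taken from the backend driver.

```c
#include <stdlib.h>

/* Hypothetical stand-ins; not the kernel's sk_buff or netif types. */
struct toy_packet { size_t len; };

struct toy_rx_stats {
    unsigned long delivered; /* packets handed on to the guest */
    unsigned long dropped;   /* packets discarded in the backend */
};

/* Takes ownership of pkt. With force_drop set, every packet jumps
 * straight to the drop label, so the guest's receive path never runs. */
int toy_be_start_xmit(struct toy_packet *pkt, int force_drop,
                      struct toy_rx_stats *st)
{
    if (force_drop || pkt->len == 0)
        goto drop;

    /* Normal path: count the delivery, then release the packet. */
    st->delivered++;
    free(pkt);
    return 0;

drop:
    /* Drop path: count it and release the packet without delivering. */
    st->dropped++;
    free(pkt);
    return 0;
}
```

Both paths return 0 and free the packet, so the caller cannot tell the difference; only the guest stops seeing traffic, which is what makes this a convenient way to take one direction out of the picture.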

> i'll try this out tomorrow morning (too late tonight - need sleep!)
> 
> 
> 
> From: Keir Fraser
> Sent: Wed 21/07/2004 11:30 PM
> To: James Harper
> Cc: Keir Fraser; xen-devel@lists.sourceforge.net
> Subject: Re: [Xen-devel] segfault in VM
> 
> 
> Could someone try to isolate this to either the network backend driver
> or the blkdev backend driver?
> 
> The best way to do this is to disable the frontend drivers so that
> they never try to connect to the backend driver...
> 
> To disable networking:
> Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to
> always 'return 0;'.
> 
> To disable block devices:
> Edit arch/xen/drivers/blkif/frontend/main.c. Change xlblk_init() to
> always 'return 0;'.
> 
> Oh yes -- the 2.4 sparse tree no longer contains the net frontend
> driver - you'll find the build tree symlinks to
> linux-2.6.7-xen-sparse/drivers/xen/net/network.c. So you might want to
> edit that instead...
> 
> Obviously, if you disable blkdevs you'll need to boot off a ramdisk
> or via a networked mount. :-)
> 
>  Cheers,
>  Keir
> 
> 
> > I downloaded these (from a tgz that Keir had given me a link to as bk was down - I assume it's identical to his latest fixes) and started my tests running and went to bed, but it looks like I got errors within a very short time.
> > The tests I was running were my 'compare' script and pinging the two domains I had running with
> > ping -q -i 0.01 -s 1400 <ip address>
> > 
> > Lots of oopses in the logs, most are probably as a result of the corruption and not indicative of the cause. They look similar to Jody's dump so I won't bother sending them unless someone thinks they might be useful.
> > 
> > btw, can the install be modified to give us a System.map-2.4.26-xen[0U] in /boot? ksymoops would be much happier.
> > 
> > James

[-- Attachment #2: Type: text/html, Size: 8943 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-21 13:30                           ` Keir Fraser
                                               ` (2 preceding siblings ...)
  2004-07-22  1:57                             ` James Harper
@ 2004-07-22  5:28                             ` Derek Glidden
  3 siblings, 0 replies; 64+ messages in thread
From: Derek Glidden @ 2004-07-22  5:28 UTC (permalink / raw)
  To: xen-devel

On Jul 21, 2004, at 9:30 AM, Keir Fraser wrote:
>
> To disable networking:
> Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to
> always 'return 0;'.

Changing netif_init() so that every return is "return 0;" doesn't
seem to do much: the VMs still get network access, everything looks
and acts normal, and there's still corruption after a few minutes of
stress testing and network traffic.  (It does seem to be network
related, though.  It ran for a while with no network traffic and no
corruption, and within a minute or two of starting the pings it
started to flake out.)

Changing netif_init() so that it immediately does "return 0;" runs for
a good long time with no corruption, unless you try to send data to one
of the vifs, which makes dom0 blow up real good.  Running it for a while
with just block I/O and ping traffic to dom0 didn't result in any
obvious corruption while running, but I did get these messages when I
rebooted:

(XEN) (file=/opt/src/xeno/xeno-unstable.bk/xen/include/asm/mm.h, line=215) Unexpected type (saw c0000000 != exp e0000000) for pfn 000032db
(XEN) DOM0: (file=memory.c, line=249) Bad page type for pfn 000032db (d0000005)
(XEN) (file=traps.c, line=466) GPF (0004): fc5277c8 -> fc52a094
Kernel panic: Failed to execute MMU updates
(XEN) Domain 0 shutdown: rebooting machine!

which I've only seen on a reboot when there has been corruption.

Disabling the receive path still seems to let packets through and
shows signs of corruption, even with very little network traffic.  I'm
not sure if that's because I have everything doing NAT instead of
bridging, although that doesn't really make sense since it's still the
same interface and the code looks like it should simply drop the packets...

It's getting late so I'll have to work on disabling the transmit path 
and working out how to go about testing the blockdev backend tomorrow.

I'll let it run for a while without even up'ing the vifs on the dom0 
side, which should preclude any network traffic at all getting to the 
VMs and see if there's any corruption going on running overnight or 
longer.

Can anyone else corrupt their systems with no network traffic?

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by   |
"nickels a day can feed a child."   |       http://www.eff.org/
I thought, "How can food be so      | http://www.anti-dmca.org/
cheap over there?"  It's not, they  |--------------------------
just eat the nickels." -- Peter Nguyen


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-22  4:36                                 ` James Harper
@ 2004-07-22 11:22                                   ` Keir Fraser
  2004-07-22 15:38                                     ` Derek Glidden
  0 siblings, 1 reply; 64+ messages in thread
From: Keir Fraser @ 2004-07-22 11:22 UTC (permalink / raw)
  To: James Harper; +Cc: Keir Fraser, xen-devel

> At this stage, it looks like disabling the receive path for the
> guest os eg netif_be_start_xmit 'goto drop' means that I can ping
> from the guest OS all i like with no crashes. I hope that's the
> right way around to do it...

Yep, an unconditional 'goto drop;' at the start of netif_be_start_xmit
will prevent the guest from ever receiving packets.

How did you send packets from the guest -- did you poke an ARP
entry, or send broadcast packets?

Anyway - currently sounds like the bug resides in the most complex
half of the most complex driver. Who'd've thought it? ;-)

> I'm just looking at that procedure,
> how is the ring actually managed - what do all the _prod and _cons
> variables actually represent? And how is synchronisation handled
> between the domains? i notice there is no spinlock in there, is this
> done by the calling function?

Synchronisation between backend and frontend is lock-free --- for each
ring one guy is producer and the other is consumer so they each update
a disjoint set of ring indexes.

Within the backend, there is implicit per-interface locking on
netif_be_start_xmit so we'll never reenter for the same
interface. Then when we batch stuff up for a tasklet we're still okay
because tasklets are guaranteed non-reentrant also.
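
As an illustration of the single-producer/single-consumer idea described
above, here is a minimal C sketch. It is not the netif ring code: the
struct, the ring size and the function names are assumptions made up for
this example, and the memory barriers a real inter-domain ring would need
are only marked in comments.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u  /* hypothetical size; must be a power of two */

/* Hypothetical shared ring, not the real netif ring layout. */
struct ring {
    uint32_t prod;             /* written only by the producer */
    uint32_t cons;             /* written only by the consumer */
    int      slot[RING_SIZE];
};

/* Producer side: returns 1 on success, 0 if the ring is full. */
int ring_put(struct ring *r, int v)
{
    assert((RING_SIZE & (RING_SIZE - 1u)) == 0u);
    if (r->prod - r->cons == RING_SIZE)
        return 0;
    r->slot[r->prod % RING_SIZE] = v;
    /* A real cross-domain ring needs a write barrier here, so that the
     * slot contents are visible before the new producer index. */
    r->prod++;
    return 1;
}

/* Consumer side: returns 1 and stores the value, or 0 if the ring is empty. */
int ring_get(struct ring *r, int *v)
{
    if (r->cons == r->prod)
        return 0;
    /* A real ring needs a read barrier here before reading the slot. */
    *v = r->slot[r->cons % RING_SIZE];
    r->cons++;
    return 1;
}
```

Because ring_put() only ever writes prod and ring_get() only ever writes
cons, neither side needs a lock as long as each index has exactly one
writer.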

 -- Keir


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-22  3:49                                     ` James Harper
@ 2004-07-22 11:54                                       ` Keir Fraser
  2004-07-22 12:53                                         ` James Harper
  0 siblings, 1 reply; 64+ messages in thread
From: Keir Fraser @ 2004-07-22 11:54 UTC (permalink / raw)
  To: James Harper; +Cc: Keir Fraser, xen-devel

> Okay, I have made the following change in dom0:
> 
> To disable the transmit path for guest OSes:
> Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the
> call to netif_schedule_work(), add:
>   make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
>   netif_put(netif);
>   continue;
> 
> Compiled and rebooted with the new kernel. Booted dom1, removed vif1.0 from the bridge, gave it its own IP address, added a static ARP entry and pinged away. I could see the packet counters for dom0 and dom1 climbing rapidly, indicating that dom0 was sending packets and dom1 was receiving them, but that a packet sent by dom1 was unable to reach dom0 again. I got the same sort of crashes after about 10 minutes.

If you do a test with DPRINTK enabled in
linux-2.4.26-xen-sparse/arch/xen/drivers/netif/backend/common.h
and with debugging enabled in Xen 'debug=y make'
then you may get some useful debugging out of the machine when it all
goes horribly wrong. e.g., perhaps something is failing apparently
spuriously... one example would be that a page reassignment (from dom0
to the other guest) is failing for some weird reason.

If we can get some debugging out when things first go wrong, that
would be very useful indeed.

 Thanks,
 Keir



^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-22 11:54                                       ` Keir Fraser
@ 2004-07-22 12:53                                         ` James Harper
  2004-07-22 13:09                                           ` Keir Fraser
  2004-07-22 15:32                                           ` Derek Glidden
  0 siblings, 2 replies; 64+ messages in thread
From: James Harper @ 2004-07-22 12:53 UTC (permalink / raw)
  Cc: Keir Fraser, xen-devel

[-- Attachment #1: Type: text/plain, Size: 2027 bytes --]

I am trying this now. Within a few seconds of starting the flood ping, dom1 rebooted. No messages in the logs to give any hint as to why though. I tried again and didn't get anything useful either once I started getting noticeable corruption.

Just on the subject of page reassignment, I'm trying to figure out what the code is doing.

In netif_be_start_xmit, there is a check to make sure that the packet is entirely on 1 page. What happens if the packet is too big for one page, or if there is other data on the same page? (It's all black magic to me at the moment!)

James

From: Keir Fraser
Sent: Thu 22/07/2004 9:54 PM
To: James Harper
Cc: Keir Fraser; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

> Okay, I have made the following change in dom0:
> 
> To disable the transmit path for guest OSes:
> Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the
> call to netif_schedule_work(), add:
>   make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
>   netif_put(netif);
>   continue;
> 
> Compiled and rebooted with the new kernel. Booted dom1, removed vif1.0 from the bridge, gave it its own IP address, added a static ARP entry and pinged away. I could see the packet counters for dom0 and dom1 climbing rapidly, indicating that dom0 was sending packets and dom1 was receiving them, but that a packet sent by dom1 was unable to reach dom0 again. I got the same sort of crashes after about 10 minutes.

If you do a test with DPRINTK enabled in
linux-2.4.26-xen-sparse/arch/xen/drivers/netif/backend/common.h
and with debugging enabled in Xen 'debug=y make'
then you may get some useful debugging out of the machine when it all
goes horribly wrong. e.g., perhaps something is failing apparently
spuriously... one example would be that a page reassignment (from dom0
to the other guest) is failing for some weird reason.

If we can get some debugging out when things first go wrong, that
would be very useful indeed.

 Thanks,
 Keir

[-- Attachment #2: Type: text/html, Size: 2727 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-22 12:53                                         ` James Harper
@ 2004-07-22 13:09                                           ` Keir Fraser
  2004-07-22 15:32                                           ` Derek Glidden
  1 sibling, 0 replies; 64+ messages in thread
From: Keir Fraser @ 2004-07-22 13:09 UTC (permalink / raw)
  To: James Harper; +Cc: Keir Fraser, xen-devel

> I am trying this now. Within a few seconds of starting the flood ping,
> dom1 rebooted. No messages in the logs to give any hint as to why
> though. I tried again and didn't get anything useful either once I
> started getting noticeable corruption.

Hmmm.... I guess maybe there's a race somewhere, rather than the
problem being a broken error-handling path. Which is a shame, as it's
bound to be harder to track down. :-(

> Just on the subject of page reassignment, I'm trying to figure out
> what the code is doing.
>
> In netif_be_start_xmit, there is a check to make sure that the packet
> is entirely on 1 page. What happens if the packet is too big for one
> page, or if there is other data on the same page? (It's all black
> magic to me at the moment!)

Unless you're using jumbo Ethernet frames (which you're almost
certainly not) then the packet will certainly fit in a page. We also
check that the packet buffer is at least half a page in size --- since
the slab allocator allocates in powers-of-two, that means the packet
buffer must actually be a full aligned page in size.
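
The size argument above can be sketched as follows, assuming a 4096-byte
page and an allocator whose size classes are exact powers of two.
size_class() and occupies_whole_page() are hypothetical helpers for
illustration, not kernel functions, and the sketch uses a strict "more
than half a page" test, since a buffer of exactly half a page would fit
the half-page class. It does not model alignment, which additionally
depends on the allocator handing out page-sized objects on page
boundaries.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u  /* assumed x86 page size */

/* Round a request up to the next power of two, as a power-of-two
 * size-class allocator would (hypothetical helper, not a kernel API). */
size_t size_class(size_t bytes)
{
    size_t c = 1;
    while (c < bytes)
        c <<= 1;
    return c;
}

/* A buffer larger than half a page, and no larger than a page, can only
 * have been served from the PAGE_SIZE class. */
int occupies_whole_page(size_t bytes)
{
    return bytes > PAGE_SIZE / 2 && size_class(bytes) == PAGE_SIZE;
}
```

Under these assumptions a 1500-byte buffer would land in the 2048-byte
class and share its page with another object, while anything from 2049
to 4096 bytes gets a page-sized object to itself.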

If our checks are insufficient and a few packets that are sharing
their data page are getting thru, for example, then we would be pretty
screwed! This might be another area to explore -- whether there are a
few skbuffs coming thru now and then that are of a layout that we 
mishandle. 

 -- Keir


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-22 12:53                                         ` James Harper
  2004-07-22 13:09                                           ` Keir Fraser
@ 2004-07-22 15:32                                           ` Derek Glidden
  1 sibling, 0 replies; 64+ messages in thread
From: Derek Glidden @ 2004-07-22 15:32 UTC (permalink / raw)
  To: xen-devel


On Jul 22, 2004, at 8:53 AM, James Harper wrote:

> I am trying this now. Within a few seconds of starting the flood ping,
> dom1 rebooted. No messages in the logs to give any hint as to why
> though. I tried again and didn't get anything useful either once I
> started getting noticeable corruption.

Just to corroborate, I've been able to pretty reliably induce 
corruption and I have my Xen kernel compiled with "debug=y".   Xen will 
pretty much continuously spit out "GPF (0004)" messages, but I've only 
ever seen it output "Bailing" a couple of times on a corruption.  Most 
of the time there's nothing when the corruption starts.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        http://www.eff.org/
in blood. But if you live your     |  http://www.anti-dmca.org/
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-22 11:22                                   ` Keir Fraser
@ 2004-07-22 15:38                                     ` Derek Glidden
  2004-07-22 17:48                                       ` Keir Fraser
  0 siblings, 1 reply; 64+ messages in thread
From: Derek Glidden @ 2004-07-22 15:38 UTC (permalink / raw)
  To: xen-devel


On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
>
> Anyway - currently sounds like the bug resides in the most complex
> half of the most complex driver. Who'd've thought it? ;-)

At this point this data is surely redundant but...

When I went to sleep last night I let my box run dom0 and four VMs 
doing md5sum checks on a couple of large files, hammering the heck out 
of the block i/o drivers and CPU but with all the ifaces/vifs on the 
machine down.  When I woke up, all compares had been correct for the 
six hours or so it ran.  I re-upped the ifaces and started to ping dom0 
and the VMs and within a minute of the pings starting dom0 started to 
report incorrect md5sums.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        http://www.eff.org/
in blood. But if you live your     |  http://www.anti-dmca.org/
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-22 15:38                                     ` Derek Glidden
@ 2004-07-22 17:48                                       ` Keir Fraser
  2004-07-23  1:03                                         ` James Harper
  0 siblings, 1 reply; 64+ messages in thread
From: Keir Fraser @ 2004-07-22 17:48 UTC (permalink / raw)
  To: Derek Glidden; +Cc: xen-devel

It's useful to have the extra data points -- it adds to our confidence
that it's the network driver that is somehow at fault here.

Quite how to proceed in narrowing down the problem is
unclear. One approach is to perturb the backend driver's data path
(e.g., always copying packets into a known-safe page-sized buffer, as
a check that our current copy-avoidance checks are not at fault; and
replacing the current high-performance but convoluted code for
batching hypercalls with something slower but easier to grok). The
latter is useful because if the bug goes away then we have a smaller
chunk of code to look at; if the bug remains then we end up with a
less complex data path that is easier to instrument and bughunt.
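
A rough userspace sketch of the "always copy into a known-safe
page-sized buffer" experiment is below. copy_to_private_page() is a
hypothetical name, aligned_alloc() (C11) stands in for whatever page
allocator the backend would actually use, and the 4096-byte page size is
assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096u  /* assumed page size */

/* Copy len bytes of packet data into a freshly allocated, page-aligned,
 * page-sized buffer that nothing else shares. Returns NULL if the packet
 * does not fit in one page or the allocation fails. The caller releases
 * the buffer with free(). */
void *copy_to_private_page(const void *data, size_t len)
{
    void *page;

    if (len > PAGE_SIZE)
        return NULL;
    page = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
    if (page == NULL)
        return NULL;
    memset(page, 0, PAGE_SIZE);   /* no stale data travels with the packet */
    memcpy(page, data, len);
    return page;
}
```

The point of such a copy is purely diagnostic: if corruption persists
even when every packet sits alone in its own page, the copy-avoidance
checks are unlikely to be the only problem.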

If anyone is interested in pursuing this bug independently, the
functions most under suspicion are netif_be_start_xmit and
net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c.
These two form the data path for packets getting sent to guest OSes.

 -- Keir

> 
> On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
> >
> > Anyway - currently sounds like the bug resides in the most complex
> > half of the most complex driver. Who'd've thought it? ;-)
> 
> At this point this data is surely redundant but...
> 
> When I went to sleep last night I let my box run dom0 and four VMs 
> doing md5sum checks on a couple of large files, hammering the heck out 
> of the block i/o drivers and CPU but with all the ifaces/vifs on the 
> machine down.  When I woke up, all compares had been correct for the 
> six hours or so it ran.  I re-upped the ifaces and started to ping dom0 
> and the VMs and within a minute of the pings starting dom0 started to 
> report incorrect md5sums.
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-22 17:48                                       ` Keir Fraser
@ 2004-07-23  1:03                                         ` James Harper
  2004-07-23  1:11                                           ` Keir Fraser
  0 siblings, 1 reply; 64+ messages in thread
From: James Harper @ 2004-07-23  1:03 UTC (permalink / raw)
  To: Keir Fraser, Derek Glidden; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 4019 bytes --]

I just made a change so that the skbuf is always copied in netif_be_start_xmit but it still crashes, which means most likely that bit is fine or at least isn't the only code containing bugs.

As another test I also put the 'goto done;' after the 'if ( skb_shared(skb) || skb_cloned(skb) || ...' block (still blocking the receive, but doing it later) and there were no crashes, so I'm comfortable that we've exhausted netif_be_start_xmit as a source of bugs.

So I guess that leaves net_rx_action. I'm unsure about one thing though: how and where do the pages that get passed from dom0 to domU get recycled back to dom0? Is it possible that domU could still write to a page that dom0 thought it had free to use for something else? If so, where would that be?

Keir: have you been able to reproduce these errors at all?

James

From: Keir Fraser
Sent: Fri 23/07/2004 3:48 AM
To: Derek Glidden
Cc: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

It's useful to have the extra data points -- it adds to our confidence
that it's the network driver that is somehow at fault here.

Quite how to proceed in narrowing down the problem is
unclear. One approach is to perturb the backend driver's data path
(e.g., always copying packets into a known-safe page-sized buffer, as
a check that our current copy-avoidance checks are not at fault; and
replacing the current high-performance but convoluted code for
batching hypercalls with something slower but easier to grok). The
latter is useful because if the bug goes away then we have a smaller
chunk of code to look at; if the bug remains then we end up with a
less complex data path that is easier to instrument and bughunt.

If anyone is interested in pursuing this bug independently, the
functions most under suspicion are netif_be_start_xmit and
net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c.
These two form the data path for packets getting sent to guest OSes.

 -- Keir


[-- Attachment #2: Type: text/html, Size: 4947 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM
  2004-07-23  1:03                                         ` James Harper
@ 2004-07-23  1:11                                           ` Keir Fraser
  2004-07-23  4:49                                             ` James Harper
  2004-07-23 16:01                                             ` segfault in VM - FIXED! Keir Fraser
  0 siblings, 2 replies; 64+ messages in thread
From: Keir Fraser @ 2004-07-23  1:11 UTC (permalink / raw)
  To: James Harper; +Cc: Keir Fraser, Derek Glidden, xen-devel

Yeah, it turns out I can reproduce this bug trivially by md5summing a
file just slightly bigger than dom0's memory allocation, while
floodpinging dom1.

I'm trying out a few things right now, so hopefully I'll be able to
report progress on this evil bug r.s.n. :-)

 -- Keir

> I just made a change so that the skbuf is always copied in netif_be_start_xmit but it still crashes, which means most likely that bit is fine or at least isn't the only code containing bugs.
> 
> As another test I also put the 'goto done;' after the 'if ( skb_shared(skb) || skb_cloned(skb) || ...' block (still blocking the receive, but doing it later) and there were no crashes, so I'm comfortable that we've exhausted netif_be_start_xmit as a source of bugs.
> 
> So I guess that leaves net_rx_action. I'm unsure about one thing though: how and where do the pages that get passed from dom0 to domU get recycled back to dom0? Is it possible that domU could still write to a page that dom0 thought it had free to use for something else? If so, where would that be?
> 
> Keir: have you been able to reproduce these errors at all?
> 
> James

--_9E21B0BB-4B74-4723-AD6C-A6A06B6BFD7A_
Content-Type: text/html;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<HTML><HEAD></HEAD>
<BODY>
<DIV id=3DidOWAReplyText58627 dir=3Dltr>
<DIV dir=3Dltr><FONT face=3DArial color=3D#000000 size=3D2>I just made a ch=
ange so that the skbuf is always copied in netif_be_start_xmit but it still=
 crashes, which means most likely that bit is fine or at least isn't the on=
ly code containing bugs.</FONT></DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2>As another test I also put the '=
goto done;' after the 'if ( skb_shared(skb) || skb_cloned(skb) || ...' bloc=
k, (still block the receive but do it later) and there were no crashes, so =
i'm comfortable that we've exhausted netif_be_start_xmit as a source for bu=
gs.</FONT></DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2>So I guess that leaves net_rx_ac=
tion. I'm unsure on one thing though, the pages that get passed from dom0 t=
o domU, how/where/do they get recycled back to dom0? Is it possible that do=
mU could still write to a page that dom0 thought it had free to use for som=
ething else? If so, where would that be?</FONT></DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2>Keir: have you been able to repr=
oduce these errors at all?</FONT></DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2>James</FONT></DIV>
<DIV dir=3Dltr><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV></DIV>
<DIV dir=3Dltr><BR>
<HR tabIndex=3D-1>
<FONT face=3DTahoma size=3D2><B>From:</B> Keir Fraser<BR><B>Sent:</B> Fri 2=
3/07/2004 3:48 AM<BR><B>To:</B> Derek Glidden<BR><B>Cc:</B> xen-devel@lists=
.sourceforge.net<BR><B>Subject:</B> Re: [Xen-devel] segfault in VM<BR></FON=
T><BR></DIV>
<DIV><PRE style=3D"WORD-WRAP: break-word">It's useful to have the extra dat=
a points -- it adds to our confidence
that it's the network driver that is somehow at fault here.

Quite how to proceed in narrowing down the problem is
unclear. One approach is to perturb the backend driver's data path
(e.g., always copying packets into a known-safe page-sized buffer, as
a check that our current copy-avoidancxe checks are not at fault; and
replacing the current high-performance but convoluted code for
batching hypercalls with something slower but easier to grok). The
latter is useful because if the bug goes away then we have a smaller
chunk of code to look at; if the bug remains then we end up with a
less complex data path that is easier to instrument and bughunt.

If anyone is interested in pursuing this bug independently, the
functions most under suspicion are netif_be_start_xmit and
net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c.
These two form the data path for packets getting sent to guest OSes.

 -- Keir

&gt;=20
&gt; On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
&gt; &gt;
&gt; &gt; Anyway - currently sounds like teh bug resides in the most comple=
x
&gt; &gt; half of the most complex driver. Who'd've thought it? ;-)
&gt;=20
&gt; At this point this data is surely redundant but...
&gt;=20
&gt; When I went to sleep last night I let my box run dom0 and four VMs=20
&gt; doing md5sum checks on a couple of large files, hammering the heck out=
=20
&gt; of the block i/o drivers and CPU but with all the ifaces/vifs on the=20
&gt; machine down.  When I woke up, all compares had been correct for the=20
&gt; six hours or so it ran.  I re-upped the ifaces and started to ping dom=
0=20
&gt; and the VMs and within a minute of the pings starting dom0 started to=
=20
&gt; report incorrect md5sums.
&gt;=20
&gt; -=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=
=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-
&gt; "We all enter this world in the    | Support Electronic Freedom
&gt; same way: naked; screaming; soaked |        http://www.eff.org/
&gt; in blood. But if you live your     |  http://www.anti-dmca.org/
&gt; life right, that kind of thing     |---------------------------
&gt; doesn't have to stop there." -- Dana Gould
&gt;=20
&gt;=20
&gt;=20
&gt; -------------------------------------------------------
&gt; This SF.Net email is sponsored by BEA Weblogic Workshop
&gt; FREE Java Enterprise J2EE developer tools!
&gt; Get your free copy of BEA WebLogic Workshop 8.1 today.
&gt; http://ads.osdn.com/?ad_id=3D4721&amp;alloc_id=3D10040&amp;op=3Dclick
&gt; _______________________________________________
&gt; Xen-devel mailing list
&gt; Xen-devel@lists.sourceforge.net
&gt; https://lists.sourceforge.net/lists/listinfo/xen-devel

-------------------------------------------------------
This SF.Net email is sponsored by BEA Weblogic Workshop
FREE Java Enterprise J2EE developer tools!
Get your free copy of BEA WebLogic Workshop 8.1 today.
http://ads.osdn.com/?ad_id=3D4721&amp;alloc_id=3D10040&amp;op=3Dclick
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xen-devel
</PRE></DIV></BODY></HTML>

--_9E21B0BB-4B74-4723-AD6C-A6A06B6BFD7A_--

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM
  2004-07-23  1:11                                           ` Keir Fraser
@ 2004-07-23  4:49                                             ` James Harper
  2004-07-23 16:01                                             ` segfault in VM - FIXED! Keir Fraser
  1 sibling, 0 replies; 64+ messages in thread
From: James Harper @ 2004-07-23  4:49 UTC (permalink / raw)
  Cc: Keir Fraser, Derek Glidden, xen-devel

[-- Attachment #1: Type: text/plain, Size: 14647 bytes --]

That's comforting. I was starting to think of looking for gcc bugs and the like.

Even so, it might be useful to collect the gcc versions from anyone who has either seen the bug or tried and failed to reproduce it. Mine reports itself as "gcc (GCC) 3.3.4 (Debian 1:3.3.4-2)" with "gcc --version".

James

From: Keir Fraser
Sent: Fri 23/07/2004 11:11 AM
To: James Harper
Cc: Keir Fraser; Derek Glidden; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

Yeah, it turns out I can reproduce this bug trivially by md5summing a
file just slightly bigger than dom0's memory allocation, while
floodpinging dom1.
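
For illustration only, the md5sum half of the recipe above can be approximated with a short Python sketch that hashes the same file several times and counts digests that disagree with the first pass. This is not the script used in the thread; the function names, chunk size, and pass count are assumptions. The idea is that a file slightly larger than dom0's memory cannot stay fully cached, so each pass is re-read from disk and any corruption shows up as a changed digest.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_mismatches(path, passes=10):
    """Hash the file `passes` times; count digests that differ from the first."""
    reference = md5_of_file(path)
    mismatches = 0
    for _ in range(passes - 1):
        if md5_of_file(path) != reference:
            mismatches += 1
    return mismatches
```

On a healthy system count_mismatches() should return 0. The network load (for example a flood ping of dom1 from another shell) has to be generated separately; the sketch covers only the checksum comparison.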

I'm trying out a few things right now, so hopefully I'll be able to
report progress on this evil bug r.s.n. :-)

 -- Keir

> I just made a change so that the skbuf is always copied in netif_be_start_xmit but it still crashes, which means most likely that bit is fine or at least isn't the only code containing bugs.
> 
> As another test I also put the 'goto done;' after the 'if ( skb_shared(skb) || skb_cloned(skb) || ...' block (still blocking the receive, but doing it later), and there were no crashes, so I'm comfortable that we've exhausted netif_be_start_xmit as a source for bugs.
> 
> So I guess that leaves net_rx_action. I'm unsure on one thing though, the pages that get passed from dom0 to domU, how/where/do they get recycled back to dom0? Is it possible that domU could still write to a page that dom0 thought it had free to use for something else? If so, where would that be?
> 
> Keir: have you been able to reproduce these errors at all?
> 
> James
> 
> 
> 
> 
> From: Keir Fraser
> Sent: Fri 23/07/2004 3:48 AM
> To: Derek Glidden
> Cc: xen-devel@lists.sourceforge.net
> Subject: Re: [Xen-devel] segfault in VM
> 
> 
> It's useful to have the extra data points -- it adds to our confidence
> that it's the network driver that is somehow at fault here.
> 
> Quite how to proceed in narrowing down the problem is
> unclear. One approach is to perturb the backend driver's data path
> (e.g., always copying packets into a known-safe page-sized buffer, as
> a check that our current copy-avoidance checks are not at fault; and
> replacing the current high-performance but convoluted code for
> batching hypercalls with something slower but easier to grok). The
> latter is useful because if the bug goes away then we have a smaller
> chunk of code to look at; if the bug remains then we end up with a
> less complex data path that is easier to instrument and bughunt.
> 
> If anyone is interested in pursuing this bug independently, the
> functions most under suspicion are netif_be_start_xmit and
> net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c.
> These two form the data path for packets getting sent to guest OSes.
> 
>  -- Keir
> 
> 
> > 
> > On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
> > >
> > > Anyway - currently sounds like the bug resides in the most complex
> > > half of the most complex driver. Who'd've thought it? ;-)
> > 
> > At this point this data is surely redundant but...
> > 
> > When I went to sleep last night I let my box run dom0 and four VMs 
> > doing md5sum checks on a couple of large files, hammering the heck out 
> > of the block i/o drivers and CPU but with all the ifaces/vifs on the 
> > machine down.  When I woke up, all compares had been correct for the 
> > six hours or so it ran.  I re-upped the ifaces and started to ping dom0 
> > and the VMs and within a minute of the pings starting dom0 started to 
> > report incorrect md5sums.

[-- Attachment #2: Type: text/html, Size: 16361 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM - FIXED!
  2004-07-23  1:11                                           ` Keir Fraser
  2004-07-23  4:49                                             ` James Harper
@ 2004-07-23 16:01                                             ` Keir Fraser
  2004-07-23 17:44                                               ` Derek Glidden
                                                                 ` (2 more replies)
  1 sibling, 3 replies; 64+ messages in thread
From: Keir Fraser @ 2004-07-23 16:01 UTC (permalink / raw)
  To: Keir Fraser; +Cc: James Harper, Derek Glidden, xen-devel


Okay, so I found that the problem is due to overly-aggressive merging
of block requests in the IDE driver. The code assumes that if buffers
are adjacent in virtual or physical address space then they can be
merged --- this isn't always the case over Xen since those physical
addresses may map to different real machine pages.
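
To make the failure mode concrete, here is a minimal Python model of why adjacency in a guest's (pseudo-)physical address space does not imply adjacency in machine memory under Xen. It is neither the kernel code nor the checked-in fix; the `p2m` table, the (address, length) segment tuples, and the function names are all assumptions made for illustration, and each segment is assumed not to cross a page boundary.

```python
PAGE_SIZE = 4096


def machine_addr(phys_addr, p2m):
    """Translate a pseudo-physical address to a machine address.

    `p2m` maps pseudo-physical frame numbers to machine frame numbers.
    """
    pfn, offset = divmod(phys_addr, PAGE_SIZE)
    return p2m[pfn] * PAGE_SIZE + offset


def can_merge(seg_a, seg_b, p2m):
    """Segments are (phys_addr, length) tuples.

    Merging is safe only if seg_b follows seg_a in machine space as well
    as in pseudo-physical space.
    """
    a_addr, a_len = seg_a
    b_addr, _ = seg_b
    if a_addr + a_len != b_addr:
        return False  # not even adjacent in pseudo-physical space
    last_byte_a = machine_addr(a_addr + a_len - 1, p2m)
    return last_byte_a + 1 == machine_addr(b_addr, p2m)
```

With p2m = {0: 10, 1: 11, 2: 50}, frames 0 and 1 are neighbours in both address spaces and can be merged, while frames 1 and 2 look adjacent to the guest but land on machine frames 11 and 50, so a merged request would touch the wrong memory.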

I've checked in a fix that I think is safe for IDE --- in the
occasional instances that a merged scatter-gather list is invalid, we
should now cause IDE to fall back to a super-safe mode (basically
PIO). On my system this happens so rarely that performance
shouldn't be affected.

If this also turns out to be a problem for SCSI then we may need to do
some more work --- our safety check will still trigger and we will
still fail the scatter-gather list, but it doesn't look as though many
SCSI drivers pick up the error return code and do anything sane. This
is a bug in those drivers, but this is small comfort to us in our aim
to work with the full range of Linux SCSI drivers.

What we need now is some more checking, particularly with SCSI block
devices, to see whether there are any more bugs to shake out.

 -- Keir


> 
> Yeah, it turns out I can reproduce this bug trivially by md5summing a
> file just slightly bigger than dom0's memory allocation, while
> floodpinging dom1.
> 
> I'm trying out a few things right now, so hopefully I'll be able to
> report progress on this evil bug r.s.n. :-)
> 
>  -- Keir
> 
> > I just made a change so that the skbuf is always copied in netif_be_start_xmit but it still crashes, which means most likely that bit is fine or at least isn't the only code containing bugs.
> > 
> > As another test I also put the 'goto done;' after the 'if ( skb_shared(skb) || skb_cloned(skb) || ...' block (still blocking the receive, but doing it later), and there were no crashes, so I'm comfortable that we've exhausted netif_be_start_xmit as a source for bugs.
> > 
> > So I guess that leaves net_rx_action. I'm unsure on one thing though, the pages that get passed from dom0 to domU, how/where/do they get recycled back to dom0? Is it possible that domU could still write to a page that dom0 thought it had free to use for something else? If so, where would that be?
> > 
> > Keir: have you been able to reproduce these errors at all?
> > 
> > James
> > 
> > 
> > 
> > 
> > From: Keir Fraser
> > Sent: Fri 23/07/2004 3:48 AM
> > To: Derek Glidden
> > Cc: xen-devel@lists.sourceforge.net
> > Subject: Re: [Xen-devel] segfault in VM
> > 
> > 
> > It's useful to have the extra data points -- it adds to our confidence
> > that it's the network driver that is somehow at fault here.
> > 
> > Quite how to proceed in narrowing down the problem is
> > unclear. One approach is to perturb the backend driver's data path
> > (e.g., always copying packets into a known-safe page-sized buffer, as
> > a check that our current copy-avoidance checks are not at fault; and
> > replacing the current high-performance but convoluted code for
> > batching hypercalls with something slower but easier to grok). The
> > latter is useful because if the bug goes away then we have a smaller
> > chunk of code to look at; if the bug remains then we end up with a
> > less complex data path that is easier to instrument and bughunt.
> > 
> > If anyone is interested in pursuing this bug independently, the
> > functions most under suspicion are netif_be_start_xmit and
> > net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c.
> > These two form the data path for packets getting sent to guest OSes.
> > 
> >  -- Keir
> > 
> > 
> > > 
> > > On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
> > > >
> > > > Anyway - currently sounds like the bug resides in the most complex
> > > > half of the most complex driver. Who'd've thought it? ;-)
> > > 
> > > At this point this data is surely redundant but...
> > > 
> > > When I went to sleep last night I let my box run dom0 and four VMs 
> > > doing md5sum checks on a couple of large files, hammering the heck out 
> > > of the block i/o drivers and CPU but with all the ifaces/vifs on the 
> > > machine down.  When I woke up, all compares had been correct for the 
> > > six hours or so it ran.  I re-upped the ifaces and started to ping dom0 
> > > and the VMs and within a minute of the pings starting dom0 started to 
> > > report incorrect md5sums.
> > > 
> > > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> > > "We all enter this world in the    | Support Electronic Freedom
> > > same way: naked; screaming; soaked |        http://www.eff.org/
> > > in blood. But if you live your     |  http://www.anti-dmca.org/
> > > life right, that kind of thing     |---------------------------
> > > doesn't have to stop there." -- Dana Gould
> > > 
> > > 
> > > 
> > > -------------------------------------------------------
> > > This SF.Net email is sponsored by BEA Weblogic Workshop
> > > FREE Java Enterprise J2EE developer tools!
> > > Get your free copy of BEA WebLogic Workshop 8.1 today.
> > > http://ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click
> > > _______________________________________________
> > > Xen-devel mailing list
> > > Xen-devel@lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/xen-devel
> > 
> > 
> > 
> > -------------------------------------------------------
> > This SF.Net email is sponsored by BEA Weblogic Workshop
> > FREE Java Enterprise J2EE developer tools!
> > Get your free copy of BEA WebLogic Workshop 8.1 today.
> > http://ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/xen-devel
> \x1f -=- MIME -=- \x1f\f


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM - FIXED!
  2004-07-23 16:01                                             ` segfault in VM - FIXED! Keir Fraser
@ 2004-07-23 17:44                                               ` Derek Glidden
  2004-07-23 17:55                                                 ` Keir Fraser
  2004-07-23 19:14                                               ` Chris Andrews
  2004-07-24  8:52                                               ` James Harper
  2 siblings, 1 reply; 64+ messages in thread
From: Derek Glidden @ 2004-07-23 17:44 UTC (permalink / raw)
  To: xen-devel


On Jul 23, 2004, at 12:01 PM, Keir Fraser wrote:

>
> Okay, so I found that the problem is due to overly-aggressive merging
> of block requests in the IDE driver. The code assumes that if buffers
> are adjacent in virtual or physical address space then they can be
> merged --- this isn't always the case over Xen since those physical
> addresses may map to different real machine pages.

And there was much rejoicing!

Thanks Keir for working so hard on digging this problem out and getting 
a fix in.

Other than the doms not dying after a halt, for which you said you 
checked in a fix, and the occasional strange unbalanced dom scheduling, 
which I understand is being addressed by the work on the scheduler, the 
-unstable branch has worked very well for me so far.  (Well, outside of 
the random crashes... :)

I'll do a pull tonight when I get home and rebuild everything and start 
hammering on it some more.

> I've checked in a fix that I think is safe for IDE --- in the
> occasional instances that a merged scatter-gather list is invalid, we
> should now cause IDE to fall back to a super-safe mode (basically
> PIO). On my system this happens so occasionally that performance
> shouldn't be affected.

Does it revert back to "normal" behaviour for subsequent operations?  
i.e. is the "basically PIO" mode just for the operation that fails?

> What we need now is some more checking, particularly with SCSI block
> devices, to see whether there are any more bugs to shake out.

Would it help at all for me to set up a box as ide-scsi, or is it 
strictly the data path inside the individual SCSI drivers that could 
cause problems?

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by   |
"nickels a day can feed a child."   |       http://www.eff.org/
I thought, "How can food be so      | http://www.anti-dmca.org/
cheap over there?"  It's not, they  |--------------------------
just eat the nickels." -- Peter Nguyen


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        http://www.eff.org/
in blood. But if you live your     |  http://www.anti-dmca.org/
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM - FIXED!
  2004-07-23 17:44                                               ` Derek Glidden
@ 2004-07-23 17:55                                                 ` Keir Fraser
  0 siblings, 0 replies; 64+ messages in thread
From: Keir Fraser @ 2004-07-23 17:55 UTC (permalink / raw)
  To: Derek Glidden; +Cc: xen-devel

> > I've checked in a fix that I think is safe for IDE --- in the
> > occasional instances that a merged scatter-gather list is invalid, we
> > should now cause IDE to fall back to a super-safe mode (basically
> > PIO). On my system this happens so occasionally that performance
> > shouldn't be affected.
> 
> Does it revert back to "normal" behaviour for subsequent operations?  
> i.e. is the "basically PIO" mode just for the operation that fails?

That is correct -- in practice very very few requests should end up
using PIO.
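
The per-request behaviour confirmed above can be pictured with a small
model. This is only a sketch and not the code that was checked in: the
names demo_xfer_mode, demo_request, demo_pick_mode and demo_count_pio
are made up for illustration, and sg_valid stands in for whatever the
real safety check reports about a merged scatter-gather list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transfer modes; names are illustrative only. */
enum demo_xfer_mode { DEMO_XFER_DMA, DEMO_XFER_PIO };

/* Hypothetical request: sg_valid records whether the merged
 * scatter-gather list passed the safety check for this request. */
struct demo_request {
    bool sg_valid;
};

/* The mode is derived from the request alone; nothing is remembered
 * from earlier requests, so one failure cannot "stick". */
static enum demo_xfer_mode demo_pick_mode(const struct demo_request *rq)
{
    return rq->sg_valid ? DEMO_XFER_DMA : DEMO_XFER_PIO;
}

/* Count how many of n requests would take the slow PIO path. */
static size_t demo_count_pio(const struct demo_request *rqs, size_t n)
{
    size_t pio = 0;

    for (size_t i = 0; i < n; i++)
        if (demo_pick_mode(&rqs[i]) == DEMO_XFER_PIO)
            pio++;
    return pio;
}
```

Because the decision carries no state between requests, a request that
follows a failed one goes back to the normal path as long as its own
list is valid.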

> > What we need now is some more checking, particularly with SCSI block
> > devices, to see whether there are any more bugs to shake out.
> 
> Would it help at all for me to set up a box as ide-scsi, or is it 
> strictly the data path inside the individual SCSI drivers that could 
> cause problems?

Stress-testing in as many environments and setups as possible is very
welcome! 

I've also contacted the linux-kernel mailing list to find out whether
anyone there has a better fix, or would be amenable to some
Xen-friendly patches being sent their way. ;-)

 -- Keir



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM - FIXED!
  2004-07-23 16:01                                             ` segfault in VM - FIXED! Keir Fraser
  2004-07-23 17:44                                               ` Derek Glidden
@ 2004-07-23 19:14                                               ` Chris Andrews
  2004-07-26 12:07                                                 ` Keir Fraser
  2004-07-24  8:52                                               ` James Harper
  2 siblings, 1 reply; 64+ messages in thread
From: Chris Andrews @ 2004-07-23 19:14 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

On 23 Jul 2004, at 17:01, Keir Fraser wrote:
>
> What we need now is some more checking, particularly with SCSI block
> devices, to see whether there are any more bugs to shake out.

I've given this change a go on my PE1650 (aacraid driver). 
Unfortunately this seems to be one of the SCSI drivers that doesn't 
correctly handle the error condition.

Running my usual test ('compare' in dom0, compiles in other domains), I 
don't see any differences in the compares, but after a few minutes, I 
get the following on the console, and everything is stuck waiting for 
disk.

aacraid: cmd len 00000000 cmd underflow 00010000
aacraid: Host adapter reset request. SCSI hang ?

The latter message repeats every few seconds. I rebooted the box with 
the Xen console after a few lines. I'm going to try the aacraid driver 
from 2.4.27-rc3, which I believe has had some attention recently.

Chris.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM - FIXED!
  2004-07-23 16:01                                             ` segfault in VM - FIXED! Keir Fraser
  2004-07-23 17:44                                               ` Derek Glidden
  2004-07-23 19:14                                               ` Chris Andrews
@ 2004-07-24  8:52                                               ` James Harper
  2004-07-24 12:47                                                 ` Chris Andrews
  2 siblings, 1 reply; 64+ messages in thread
From: James Harper @ 2004-07-24  8:52 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Derek Glidden, xen-devel

[-- Attachment #1: Type: text/plain, Size: 7567 bytes --]

My system doesn't have any IDE devices; it's SCSI only. The SCSI driver is aic7xxx, and I'm still having crashes even with the latest checkout. For the first time I noticed some SCSI errors in the logs in amongst all the others, but given the nature of the crash I don't know if that means anything.

Is this the same problem that we thought was in the network code? I could not readily induce the crash without creating lots of network traffic.

James



From: Keir Fraser
Sent: Sat 24/07/2004 2:01 AM
To: Keir Fraser
Cc: James Harper; Derek Glidden; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM - FIXED!


Okay, so I found that the problem is due to overly-aggressive merging
of block requests in the IDE driver. The code assumes that if buffers
are adjacent in virtual or physical address space then they can be
merged --- this isn't always the case over Xen since those physical
addresses may map to different real machine pages.

I've checked in a fix that I think is safe for IDE --- in the
occasional instances that a merged scatter-gather list is invalid, we
should now cause IDE to fall back to a super-safe mode (basically
PIO). On my system this happens so occasionally that performance
shouldn't be affected.

If this also turns out to be a problem for SCSI then we may need to do
some more work --- our safety check will still trigger and we will
still fail the scatter-gather list, but it doesn't look as though many
SCSI drivers pick up the error return code and do anything sane. This
is a bug in those drivers, but this is small comfort to us in our aim
to work with the full range of Linux SCSI drivers.

What we need now is some more checking, particularly with SCSI block
devices, to see whether there are any more bugs to shake out.

 -- Keir


> 
> Yeah, it turns out I can reproduce this bug trivially by md5summing a
> file just slightly bigger than dom0's memory allocation, while
> floodpinging dom1.
> 
> I'm trying out a few things right now, so hopefully I'll be able to
> report progress on this evil bug r.s.n. :-)
> 
>  -- Keir
> 
> > I just made a change so that the skbuf is always copied in netif_be_start_xmit but it still crashes, which means most likely that bit is fine or at least isn't the only code containing bugs.
> > 
> > As another test I also put the 'goto done;' after the 'if ( skb_shared(skb) || skb_cloned(skb) || ...' block, (still block the receive but do it later) and there were no crashes, so i'm comfortable that we've exhausted netif_be_start_xmit as a source for bugs.
> > 
> > So I guess that leaves net_rx_action. I'm unsure on one thing though, the pages that get passed from dom0 to domU, how/where/do they get recycled back to dom0? Is it possible that domU could still write to a page that dom0 thought it had free to use for something else? If so, where would that be?
> > 
> > Keir: have you been able to reproduce these errors at all?
> > 
> > James
> > 
> > 
> > 
> > 
> > From: Keir Fraser
> > Sent: Fri 23/07/2004 3:48 AM
> > To: Derek Glidden
> > Cc: xen-devel@lists.sourceforge.net
> > Subject: Re: [Xen-devel] segfault in VM
> > 
> > 
> > It's useful to have the extra data points -- it adds to our confidence
> > that it's the network driver that is somehow at fault here.
> > 
> > Quite how to proceed in narrowing down the problem is
> > unclear. One approach is to perturb the backend driver's data path
> > (e.g., always copying packets into a known-safe page-sized buffer, as
> > a check that our current copy-avoidance checks are not at fault; and
> > replacing the current high-performance but convoluted code for
> > batching hypercalls with something slower but easier to grok). The
> > latter is useful because if the bug goes away then we have a smaller
> > chunk of code to look at; if the bug remains then we end up with a
> > less complex data path that is easier to instrument and bughunt.
> > 
> > If anyone is interested in pursuing this bug independently, the
> > functions most under suspicion are netif_be_start_xmit and
> > net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c.
> > These two form the data path for packets getting sent to guest OSes.
> > 
> >  -- Keir
> > 
> > 
> > > 
> > > On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
> > > >
> > > > Anyway - currently sounds like the bug resides in the most complex
> > > > half of the most complex driver. Who'd've thought it? ;-)
> > > 
> > > At this point this data is surely redundant but...
> > > 
> > > When I went to sleep last night I let my box run dom0 and four VMs 
> > > doing md5sum checks on a couple of large files, hammering the heck out 
> > > of the block i/o drivers and CPU but with all the ifaces/vifs on the 
> > > machine down.  When I woke up, all compares had been correct for the 
> > > six hours or so it ran.  I re-upped the ifaces and started to ping dom0 
> > > and the VMs and within a minute of the pings starting dom0 started to 
> > > report incorrect md5sums.
> > > 

[-- Attachment #2: Type: text/html, Size: 8851 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM - FIXED!
  2004-07-24  8:52                                               ` James Harper
@ 2004-07-24 12:47                                                 ` Chris Andrews
  2004-07-24 15:54                                                   ` Chris Andrews
  0 siblings, 1 reply; 64+ messages in thread
From: Chris Andrews @ 2004-07-24 12:47 UTC (permalink / raw)
  To: James Harper; +Cc: xen-devel

On 24 Jul 2004, at 09:52, James Harper wrote:

> My system doesn't have any ide devices, it's scsi only. The scsi 
> driver is aic7xxx, and i'm still having crashes even with the latest 
> checkout. I noticed in the logs for the first time some scsi errors in 
> amongst all the others, but given the nature of the crash i don't know 
> if that means anything.

You might want to try disabling merging in the scsi driver altogether 
and see if that produces a stable system... I ran the aacraid driver 
like that and couldn't provoke any problems -- although the performance 
was of course *horrible*. If you want to try it I've attached an 
untested but trivial patch below for your aic7xxx driver.

I referred to the aacraid driver having had some work since 2.4.26, but 
it hasn't actually been touched in 2.4; all the work is happening in 
2.6. I added a bit of code to try and handle the zero return from 
pci_map_sg, and pass the error condition up, but I couldn't find 
anything to pass back that would make the scsi layer do something 
useful (plenty of everything-stuck-in-D, or a nice tight loop leading 
to an oops with a *long* call trace, depending on which error code I 
used).

I'm just testing a patch which disables merging in the scsi layer when 
it believes it has contiguous requests in different pages. I think this 
is more pessimistic than it needs to be, as the pages may after all be 
contiguous, but it does allow some merging to happen and so far seems 
to be stable.

Chris.

--- aic7xxx_osm.c       2003-11-28 18:26:20.000000000 +0000
+++ aic7xxx_osm_hacked.c        2004-07-24 11:07:06.000000000 +0100
@@ -1309,7 +1309,7 @@
         .can_queue              = AHC_MAX_QUEUE,
         .this_id                = -1,
         .cmd_per_lun            = 2,
-       .use_clustering         = ENABLE_CLUSTERING,
+       .use_clustering         = DISABLE_CLUSTERING,
  #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,4,7)
         /*
          * We can only map 16MB per-SG


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM - FIXED!
  2004-07-24 12:47                                                 ` Chris Andrews
@ 2004-07-24 15:54                                                   ` Chris Andrews
  2004-07-25  9:27                                                     ` James Harper
  0 siblings, 1 reply; 64+ messages in thread
From: Chris Andrews @ 2004-07-24 15:54 UTC (permalink / raw)
  To: xen-devel


On 24 Jul 2004, at 13:47, I wrote:

> I'm just testing a patch which disables merging in the scsi layer when 
> it believes it has contiguous requests in different pages. I think 
> this is more pessimistic than it needs to be, as the pages may after 
> all be contiguous, but it does allow some merging to happen and so 
> far seems to be stable.

I've given this a bit more testing, and it seems to be working fine - 
the machine is now running a dom0 kernel built while running the patch. 
As for performance, it's 'not bad' -- I've just done a bonnie++ run, 
and some compiles. Based on sticking printks in and watching the 
console, it's allowing merges much more often than not, but still I 
suspect not as much as it could. Probably it should use something with 
more arch-knowledge than page_to_phys().
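
To show what "more arch-knowledge" could mean here, below is a minimal
sketch of a merge test that looks at machine addresses instead of
stopping at pseudo-physical ones. It is not taken from the patch linked
below. The table demo_p2m and the helpers demo_phys_to_machine and
demo_can_merge are hypothetical stand-ins for whatever
physical-to-machine translation the Xen port of the kernel provides,
and the frame numbers are invented. It only tests the boundary between
the two buffers; each buffer on its own is assumed to be valid already.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical table: index is the guest pseudo-physical frame number,
 * value is the machine frame backing it. Frames 1 and 2 are adjacent
 * to the guest but not in machine memory. */
static const uint32_t demo_p2m[] = { 100, 101, 57, 58 };
#define DEMO_P2M_FRAMES (sizeof(demo_p2m) / sizeof(demo_p2m[0]))

/* Translate a pseudo-physical address to a machine address.
 * The caller must make sure the frame is inside the table. */
static uint64_t demo_phys_to_machine(uint64_t phys)
{
    uint64_t pfn = phys >> PAGE_SHIFT;

    return ((uint64_t)demo_p2m[pfn] << PAGE_SHIFT) |
           (phys & (PAGE_SIZE - 1));
}

/* True only if buffer b starts right after buffer a both in
 * pseudo-physical space and in machine space. */
static bool demo_can_merge(uint64_t phys_a, size_t len_a, uint64_t phys_b)
{
    if (len_a == 0)
        return false;
    if (phys_a + len_a != phys_b)
        return false;               /* not adjacent even for the guest */
    if ((phys_b >> PAGE_SHIFT) >= DEMO_P2M_FRAMES)
        return false;               /* outside the demo table */

    /* Compare the machine address just past a's last byte with b. */
    return demo_phys_to_machine(phys_a + len_a - 1) + 1 ==
           demo_phys_to_machine(phys_b);
}
```

With this table, two buffers that meet at the boundary between frames 0
and 1 may be merged, while two that meet between frames 1 and 2 may
not, even though both pairs look contiguous to the guest.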

patch for linux-2.4.26-xen0: 
http://munky.nodnol.org/~chris/xen_scsi_merge.diff
bonnie++ stats: http://munky.nodnol.org/~chris/munkyII_stats.txt

Chris.




^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM - FIXED!
  2004-07-24 15:54                                                   ` Chris Andrews
@ 2004-07-25  9:27                                                     ` James Harper
  2004-07-25 11:24                                                       ` James Harper
  0 siblings, 1 reply; 64+ messages in thread
From: James Harper @ 2004-07-25  9:27 UTC (permalink / raw)
  To: Chris Andrews, xen-devel

[-- Attachment #1: Type: text/plain, Size: 2132 bytes --]

I'm building this now.

The way I see it, currently it must be incorrect for only a very very small number of cases or the system would crash and burn almost instantly. So in theory, unless these cases are undetectable, or the cost of detecting them is high for some reason, the performance difference should be almost unnoticeable.

I assume the patch would only affect dom0, so it shouldn't matter whether domU is patched or not. Is there a way of installing a patch so that it's picked up by 'make world'?

I'll follow up with results shortly.

James

From: Chris Andrews
Sent: Sun 25/07/2004 1:54 AM
To: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM - FIXED!

On 24 Jul 2004, at 13:47, I wrote:

> I'm just testing a patch which disables merging in the scsi layer when 
> it believes it has contiguous requests in different pages. I think 
> this is more pessimistic than it needs to be, as the pages may after 
> all be contiguous, but it does allow some merging to happen and so 
> far seems to be stable.

I've given this a bit more testing, and it seems to be working fine - 
the machine is now running a dom0 kernel built while running the patch. 
As for performance, it's 'not bad' -- I've just done a bonnie++ run, 
and some compiles. Based on sticking printks in and watching the 
console, it's allowing merges much more often than not, but still I 
suspect not as much as it could. Probably it should use something with 
more arch-knowledge than page_to_phys().

patch for linux-2.4.26-xen0: 
http://munky.nodnol.org/~chris/xen_scsi_merge.diff
bonnie++ stats: http://munky.nodnol.org/~chris/munkyII_stats.txt

Chris.


[-- Attachment #2: Type: text/html, Size: 3022 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM - FIXED!
  2004-07-25  9:27                                                     ` James Harper
@ 2004-07-25 11:24                                                       ` James Harper
  2004-07-25 15:08                                                         ` Chris Andrews
  0 siblings, 1 reply; 64+ messages in thread
From: James Harper @ 2004-07-25 11:24 UTC (permalink / raw)
  To: James Harper, Chris Andrews, xen-devel

[-- Attachment #1: Type: text/plain, Size: 2422 bytes --]

So far so good. It's been running for a while now with no errors, much longer than it would have survived previously.

James

From: James Harper
Sent: Sun 25/07/2004 7:27 PM
To: Chris Andrews; xen-devel@lists.sourceforge.net
Subject: RE: [Xen-devel] segfault in VM - FIXED!

I'm building this now.

The way I see it, currently it must be incorrect for only a very very small number of cases or the system would crash and burn almost instantly. So in theory, unless these cases are undetectable, or the cost of detecting them is high for some reason, the performance difference should be almost unnoticeable.

I assume the patch would only affect dom0, so it shouldn't matter whether domU is patched or not. Is there a way of installing a patch so that it's picked up by 'make world'?

I'll follow up with results shortly.

James

From: Chris Andrews
Sent: Sun 25/07/2004 1:54 AM
To: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM - FIXED!

On 24 Jul 2004, at 13:47, I wrote:

> I'm just testing a patch which disables merging in the scsi layer when 
> it believes it has contiguous requests in different pages. I think 
> this is more pessimistic than it needs to be, as the pages may after 
> all be contiguous, but it does allow some merging to happen and so 
> far seems to be stable.

I've given this a bit more testing, and it seems to be working fine - 
the machine is now running a dom0 kernel built while running the patch. 
As for performance, it's 'not bad' -- I've just done a bonnie++ run, 
and some compiles. Based on sticking printks in and watching the 
console, it's allowing merges much more often than not, but still I 
suspect not as much as it could. Probably it should use something with 
more arch-knowledge than page_to_phys().

patch for linux-2.4.26-xen0: 
http://munky.nodnol.org/~chris/xen_scsi_merge.diff
bonnie++ stats: http://munky.nodnol.org/~chris/munkyII_stats.txt

Chris.


[-- Attachment #2: Type: text/html, Size: 3646 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM - FIXED!
  2004-07-25 11:24                                                       ` James Harper
@ 2004-07-25 15:08                                                         ` Chris Andrews
  2004-07-25 23:23                                                           ` James Harper
  0 siblings, 1 reply; 64+ messages in thread
From: Chris Andrews @ 2004-07-25 15:08 UTC (permalink / raw)
  To: James Harper; +Cc: xen-devel


On 25 Jul 2004, at 12:24, James Harper wrote:

> so far so good. It's been running for a while now with no errors. much 
> longer than it would have survived previously.

It's broken for me - I suspect it's that although it checks that 
requests to be merged begin in the same page, it doesn't also check 
they end in that same page. I'm testing a version now that tries to do 
that.
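
The gap described above can be sketched as follows. This is a guess at
the shape of the check, not the contents of the patch: the helper names
demo_pfn, demo_begin_only_check and demo_merge_within_one_page are
hypothetical, and a 4KB page size is assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Page frame number that contains the given address. */
static uint64_t demo_pfn(uint64_t addr)
{
    return addr >> PAGE_SHIFT;
}

/* Incomplete rule: only compares the pages in which the two
 * adjacent requests begin. */
static bool demo_begin_only_check(uint64_t a, size_t len_a, uint64_t b)
{
    return len_a != 0 && a + len_a == b && demo_pfn(a) == demo_pfn(b);
}

/* Stricter rule: the whole merged region, from the first byte of a
 * to the last byte of b, must sit inside a single page. */
static bool demo_merge_within_one_page(uint64_t a, size_t len_a,
                                       uint64_t b, size_t len_b)
{
    if (len_a == 0 || len_b == 0)
        return false;
    if (a + len_a != b)
        return false;
    return demo_pfn(a) == demo_pfn(b + len_b - 1);
}
```

A request of 0x200 bytes at 0x1C00 followed by 0x400 bytes at 0x1E00
passes the begin-only rule, because both start in the same page, but
the second one runs into the next page, so the stricter rule refuses
the merge.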

Chris.




^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: segfault in VM - FIXED!
  2004-07-25 15:08                                                         ` Chris Andrews
@ 2004-07-25 23:23                                                           ` James Harper
  2004-07-26 12:12                                                             ` Keir Fraser
  0 siblings, 1 reply; 64+ messages in thread
From: James Harper @ 2004-07-25 23:23 UTC (permalink / raw)
  To: Chris Andrews; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2672 bytes --]

I was running my diff script all night which itself reported no errors, but this morning I have the following in dom0's kern.log:

Jul 25 21:53:58 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 25 23:02:49 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 25 23:31:25 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 01:07:55 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 01:38:59 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 02:35:21 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 02:47:33 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 04:55:37 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 06:32:56 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 06:59:22 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 08:00:19 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 08:24:50 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2

and in dom2:

Jul 25 21:53:58 mail2 kernel: bad buffer on RX ring!(-1)
Jul 25 23:02:49 mail2 kernel: bad buffer on RX ring!(-1)
Jul 25 23:31:25 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 01:07:55 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 01:38:59 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 02:35:21 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 02:47:33 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 04:55:37 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 06:32:56 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 06:59:22 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 08:00:19 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 08:24:50 mail2 kernel: bad buffer on RX ring!(-1)

So something funny is going on. I started my diff and ping scripts at about 21:20. At least the above error is detected though.

James



From: Chris Andrews
Sent: Mon 26/07/2004 1:08 AM
To: James Harper
Cc: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM - FIXED!


On 25 Jul 2004, at 12:24, James Harper wrote:

> so far so good. It's been running for a while now with no errors. much 
> longer than it would have survived previously.

It's broken for me - I suspect it's that although it checks that 
requests to be merged begin in the same page, it doesn't also check 
they end in that same page. I'm testing a version now that tries to do 
that.

Chris.

[-- Attachment #2: Type: text/html, Size: 3521 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM - FIXED!
  2004-07-23 19:14                                               ` Chris Andrews
@ 2004-07-26 12:07                                                 ` Keir Fraser
  0 siblings, 0 replies; 64+ messages in thread
From: Keir Fraser @ 2004-07-26 12:07 UTC (permalink / raw)
  To: Chris Andrews; +Cc: Keir Fraser, xen-devel

> 
> On 23 Jul 2004, at 17:01, Keir Fraser wrote:
> >
> > What we need now is some more checking, particularly with SCSI block
> > devices, to see whether there are any more bugs to shake out.
> 
> I've given this change a go on my PE1650 (aacraid driver). 
> Unfortunately this seems to be one of the SCSI drivers that doesn't 
> correctly handle the error condition.
> 
> Running my usual test ('compare' in dom0, compiles in other domains), I 
> don't see any differences in the compares, but after a few minutes, I 
> get the following on the console, and everything is stuck waiting for 
> disk.
> 
> aacraid: cmd len 00000000 cmd underflow 00010000
> aacraid: Host adapter reset request. SCSI hang ?
> 
> The latter message repeats every few seconds. I rebooted the box with 
> the Xen console after a few lines. I'm going to try the aacraid driver 
> from 2.4.27-rc3, which I believe has had some attention recently.

Looks like the SCSI-merge code will have to be modified. I'll do
IDE-merge code at the same time, so that we just merge less
aggressively rather than falling back to PIO transfers.

I'll leave the check in pci_map_sg(), but it shouldn't ever trigger
after I patch the IDE and SCSI merge routines, so I'll add a warning
message if an invalid scatter-gather list is detected.
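
As a rough picture of that kind of backstop, here is a user-space
sketch of a mapping routine that validates each scatter-gather entry
and warns when one is invalid. Everything in it is an assumption made
for illustration: demo_sg_entry, demo_p2m, demo_entry_is_machine_contiguous
and demo_map_sg are invented names, not the kernel's structures or the
real pci_map_sg(). The only thing borrowed from the thread is the
convention of returning zero when the list cannot be mapped.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

/* Hypothetical scatter-gather entry: pseudo-physical start and length. */
struct demo_sg_entry {
    uint64_t phys;
    size_t   len;
};

/* Hypothetical pseudo-physical to machine frame table. */
static const uint32_t demo_p2m[] = { 100, 101, 57, 58 };
#define DEMO_P2M_FRAMES (sizeof(demo_p2m) / sizeof(demo_p2m[0]))

/* Return 1 if every page the entry touches is backed by consecutive
 * machine frames, 0 otherwise. */
static int demo_entry_is_machine_contiguous(const struct demo_sg_entry *e)
{
    uint64_t first, last, pfn;

    if (e->len == 0)
        return 0;
    first = e->phys >> PAGE_SHIFT;
    last = (e->phys + e->len - 1) >> PAGE_SHIFT;
    if (last >= DEMO_P2M_FRAMES)
        return 0;
    for (pfn = first; pfn < last; pfn++)
        if (demo_p2m[pfn] + 1 != demo_p2m[pfn + 1])
            return 0;
    return 1;
}

/* Return nents if the whole list is valid; otherwise print a warning
 * and return 0, the failure convention mentioned in this thread. */
static int demo_map_sg(const struct demo_sg_entry *sg, int nents)
{
    int i;

    for (i = 0; i < nents; i++) {
        if (!demo_entry_is_machine_contiguous(&sg[i])) {
            fprintf(stderr,
                    "demo_map_sg: WARNING: invalid scatter-gather "
                    "entry %d (phys=0x%llx len=%zu)\n",
                    i, (unsigned long long)sg[i].phys, sg[i].len);
            return 0;
        }
    }
    return nents;
}
```

If the merge routines never build an entry that crosses a
non-contiguous machine boundary, the warning never fires, which is the
point of keeping the check as a backstop.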

 -- Keir



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: segfault in VM - FIXED!
  2004-07-25 23:23                                                           ` James Harper
@ 2004-07-26 12:12                                                             ` Keir Fraser
  0 siblings, 0 replies; 64+ messages in thread
From: Keir Fraser @ 2004-07-26 12:12 UTC (permalink / raw)
  To: James Harper; +Cc: Chris Andrews, xen-devel

Looks like this is a very occasional failure, from the timestamps
between messages. If you make a debug build of Xen then we'll get some
info as to why the page transfer is failing.

 -- Keir

> I was running my diff script all night which itself reported no errors, but this morning I have the following in dom0's kern.log:
> 
> Jul 25 21:53:58 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 25 23:02:49 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 25 23:31:25 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 01:07:55 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 01:38:59 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 02:35:21 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 02:47:33 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 04:55:37 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 06:32:56 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 06:59:22 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 08:00:19 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 08:24:50 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> 
> and in dom2:
> 
> Jul 25 21:53:58 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 25 23:02:49 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 25 23:31:25 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 01:07:55 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 01:38:59 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 02:35:21 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 02:47:33 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 04:55:37 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 06:32:56 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 06:59:22 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 08:00:19 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 08:24:50 mail2 kernel: bad buffer on RX ring!(-1)
> 
> so something funny is going on. i started my diff and ping scripts at about 21:20. At least the above error is detected though.
> 
> James
> 
> 
> 
> From: Chris Andrews
> Sent: Mon 26/07/2004 1:08 AM
> To: James Harper
> Cc: xen-devel@lists.sourceforge.net
> Subject: Re: [Xen-devel] segfault in VM - FIXED!
> 
> 
> On 25 Jul 2004, at 12:24, James Harper wrote:
> 
> > so far so good. It's been running for a while now with no errors. much 
> > longer than it would have survived previously.
> 
> It's broken for me - I suspect it's that although it checks that 
> requests to be merged begin in the same page, it doesn't also check 
> they end in that same page. I'm testing a version now that tries to do 
> that.
> 
> Chris.
\x1f -=- MIME -=- \x1f\f


Thread overview: 64+ messages
2004-07-19  5:22 segfault in VM Derek Glidden
2004-07-19  5:50 ` James Harper
2004-07-19  7:27   ` Keir Fraser
2004-07-19  8:28     ` Chris Andrews
2004-07-19  8:57       ` Keir Fraser
2004-07-19  9:01         ` Chris Andrews
2004-07-19 12:48           ` Wm
2004-07-19 13:22             ` Keir Fraser
2004-07-19 19:06               ` Derek Glidden
2004-07-20  0:01                 ` James Harper
2004-07-20  1:04                   ` James Harper
2004-07-20  7:59                     ` Keir Fraser
2004-07-20 10:42                       ` James Harper
2004-07-20 10:52                       ` Keir Fraser
2004-07-20 13:38                         ` Christian Limpach
2004-07-21  1:14                         ` James Harper
2004-07-21 10:12                           ` Christian Limpach
2004-07-21 13:30                           ` Keir Fraser
2004-07-21 13:47                             ` James Harper
2004-07-21 14:17                               ` Keir Fraser
2004-07-22  4:36                                 ` James Harper
2004-07-22 11:22                                   ` Keir Fraser
2004-07-22 15:38                                     ` Derek Glidden
2004-07-22 17:48                                       ` Keir Fraser
2004-07-23  1:03                                         ` James Harper
2004-07-23  1:11                                           ` Keir Fraser
2004-07-23  4:49                                             ` James Harper
2004-07-23 16:01                                             ` segfault in VM - FIXED! Keir Fraser
2004-07-23 17:44                                               ` Derek Glidden
2004-07-23 17:55                                                 ` Keir Fraser
2004-07-23 19:14                                               ` Chris Andrews
2004-07-26 12:07                                                 ` Keir Fraser
2004-07-24  8:52                                               ` James Harper
2004-07-24 12:47                                                 ` Chris Andrews
2004-07-24 15:54                                                   ` Chris Andrews
2004-07-25  9:27                                                     ` James Harper
2004-07-25 11:24                                                       ` James Harper
2004-07-25 15:08                                                         ` Chris Andrews
2004-07-25 23:23                                                           ` James Harper
2004-07-26 12:12                                                             ` Keir Fraser
2004-07-22  1:48                             ` segfault in VM Derek Glidden
2004-07-22  1:54                               ` Keir Fraser
2004-07-22  2:39                                 ` Derek Glidden
2004-07-22  1:57                             ` James Harper
2004-07-22  2:03                               ` Keir Fraser
2004-07-22  2:48                                 ` James Harper
2004-07-22  2:56                                   ` Keir Fraser
2004-07-22  3:49                                     ` James Harper
2004-07-22 11:54                                       ` Keir Fraser
2004-07-22 12:53                                         ` James Harper
2004-07-22 13:09                                           ` Keir Fraser
2004-07-22 15:32                                           ` Derek Glidden
2004-07-22  5:28                             ` Derek Glidden
2004-07-19 18:58       ` Derek Glidden
2004-07-19 19:34         ` Chris Andrews
2004-07-20  0:04           ` James Harper
2004-07-19 18:56     ` Derek Glidden
2004-07-19 23:06       ` Derek Glidden
2004-07-20  1:01         ` Derek Glidden
2004-07-20  6:56           ` Keir Fraser
2004-07-20 15:51           ` Derek Glidden
2004-07-20 18:10             ` Chris Andrews
2004-07-21 23:39             ` Derek Glidden
2004-07-19 18:52   ` Derek Glidden
