From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jeremy Fitzhardinge <jeremy@goop.org>
Subject: Re: Xen 4.0.0x allows for data corruption in Dom0
Date: Mon, 08 Mar 2010 14:24:20 -0800
Message-ID: <4B957914.4050408@goop.org>
References: <4B922A89.2060105@invisiblethingslab.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <4B922A89.2060105@invisiblethingslab.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Joanna Rutkowska <joanna@invisiblethingslab.com>
Cc: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

On 03/06/2010 02:12 AM, Joanna Rutkowska wrote:
> There is a nasty data corruption problem most likely allowed by a bug in
> the Xen 4.0.0-x hypervisors.
>
> The problem occurs with a frequency of "a few chunks per 10 GB of data
> copied", and only when running a VM (PV domU) with a specific kernel.
> The problem, however, affects not only the VM but also the Dom0, which
> is of significant importance.
>
> How to reproduce:
>
> 1) Start at least one Xen PV VM with a pvops0 kernel. One kernel known
> to demonstrate the problem is the one built by Michael Young, based on
> xen/master git from Dec 23. It has recently been replaced by a newer
> kernel, which doesn't always show the problem, but I uploaded the
> previous one at the URL below, so people can use it for testing:
>
> http://invisiblethingslab.com/pub/kernel-2.6.31.9-1.2.82.xendom0.fc12.x86_64.rpm
>
> Now you can start a dummy VM with this kernel, e.g.:
>
> # xm create -c /dev/null memory=400 kernel=<path/to/kernel>
> extra="rootdelay=1000"
>
> 2) Now, in Dom0, after having started this dummy VM, create a big test
> file, filled all with zeros. Make sure to choose a size bigger than your
> DRAM size, to avoid fs caching effect, e.g.:
>
> $ dd if=/dev/zero of=test bs=1M count=10000
>
> That should create a 10GB file. Make sure to use /dev/zero and not
> /dev/null!
>
> 3) Once the test file has been created, check that it really consists
> of zeros only:
>
> $ xxd test | grep -v "0000 0000 0000 0000 0000 0000 0000 0000"
>
> Normally you should not get any output. However, I consistently get
> something like this:
>
> 4593a000:940d 0000 0000 0000 2d40 d6fc c803 0000  ........-@......
> 4593a010:00f6 1f52 b301 0000 b620 dcd5 ff00 0000  ...R..... ......
> a5df0000:e542 712c 77da c9f9 a429 4b85 ecc4 9395  .Bq,w....)K.....
> a5df0010:d9d6 971f 0d58 5c70 aba6 387d 805f 09e2  .....X\p..8}._..
> ceecb000:f80d 0000 0000 0000 096e 1cdc e403 0000  .........n......
> ceecb010:2460 7ef6 be01 0000 b620 dcd5 ff00 0000  $`~...... ......
> 148432000580e 0000 0000 0000 5665 ed9d ff03 0000  X.......Ve......
> 1484320107bcc a023 ca01 0000 b620 dcd5 ff00 0000  {..#..... ......
> 1c548b000bc0e 0000 0000 0000 6942 387d 1b04 0000  ........iB8}....
> 1c548b010872b 01c8 d501 0000 b620 dcd5 ff00 0000  .+....... ......
> 225d450004448 27cd b966 b37e 1f0c e9e3 c2db b6ee  DH'..f.~........
> 225d45010d2b2 55b8 9ef1 e818 a7e3 364d 2322 dc75  ..U.......6M#".u
> 242056000140f 0000 0000 0000 0bb0 3704 3404 0000  ..........7.4...
> 2420560109601 b606 e001 0000 b620 dcd5 ff00 0000  ......... ......
>
> The actual data vary between tests; however, the "dcd5 ff00 0000"
> pattern seems to be repeatable on a given system with a given hypervisor
> binary (the above numbers are for Xen-4.0.0-rc5 built from Michael
> Young's SRPM). The errors always occur in chunks of 32 bytes.
>
> We have tested this in our lab on three different machines, with various
> Dom0 kernels -- based on xen/master (AKA xen/stable-2.6.31) and
> xen/stable (AKA xen/stable-2.6.32) -- and with a few Xen 4 hypervisors
> (rc2, rc4, rc5). Not every kernel allows for reproducing the error with
> such a simple "dummy" VM as the one given above -- e.g. the 2.6.32-based
> kernels required some more regular VMs to be started for the problem to
> be noticeable. However, with the previously mentioned kernel (M. Young
> Dec23), the problem has been 100% reproducible for us.
>
> When downgraded to Xen 3.4.2 the problem went away.
>
> Of course this problem cannot be attributed to a buggy VM kernel, as the
> hypervisor should be resistant to any kind of "wrong" software (buggy or
> malicious) that executes in a VM.
>    

Why "of course"?  Your report looks to me like a bug in dom0 which is 
causing data corruption when there's another domain running.  I don't 
see anything that specifically implicates Xen.  The fact that the 
symptoms change with a different Xen version could mean a kernel bug is 
affected by the Xen version (different memory layout, for example, or 
different paths in the kernel caused by different feature availability).
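
If it helps narrow this down, here is a rough sketch of the step 3 check 
that reports exact byte offsets rather than relying on xxd and grep.  It 
is not part of the original report: the function name 
find_nonzero_chunks and the 1 MiB read size are my own assumptions, and 
only the 32-byte chunk size is taken from the report.

```python
import sys

CHUNK = 32  # the report says corruption shows up in 32-byte chunks


def find_nonzero_chunks(path, chunk=CHUNK, bufsize=1 << 20):
    """Yield (offset, data) for each chunk-aligned block that is not all zeros.

    bufsize is assumed to be a multiple of chunk so blocks stay aligned.
    """
    zero_buf = bytes(bufsize)
    offset = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(bufsize)
            if not buf:
                break
            # Fast path: skip read buffers that are entirely zero.
            if buf != zero_buf[:len(buf)]:
                for i in range(0, len(buf), chunk):
                    block = buf[i:i + chunk]
                    if any(block):
                        yield offset + i, block
            offset += len(buf)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python scan_zeros.py test
    for off, block in find_nonzero_chunks(sys.argv[1]):
        print(f"{off:#x}: {block.hex()}")
```

Running this against the same file several times, with and without the 
dummy domain started, should show whether the bad offsets are stable on 
disk or change between reads.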

     J