xen-devel.lists.xenproject.org archive mirror
* VMX status report. Xen:26323 & Dom0:3.7.1
@ 2013-01-10  7:51 Ren, Yongjie
  2013-01-10  8:57 ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Ren, Yongjie @ 2013-01-10  7:51 UTC (permalink / raw)
  To: xen-devel@lists.xensource.com
  Cc: Dai, Yan, Xu, YongweiX, Liu, SongtaoX, Zhou, Chao

Hi, All,
This is the test report of xen-unstable tree on some Intel platforms.
We found 1 new bug about live migration.

Version Info:
=================================================================
xen changeset:    26323:64b36dde26bc
Dom0:           Linux 3.7.1
=================================================================

New issue(1)
==============
1. sometimes live migration failed and reported call trace in dom0
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841

Fixed issue(0)
==============

Old issues(11)
==============
1. [ACPI] Dom0 can't resume from S3 sleep
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
2. [XL]"xl vcpu-set" causes dom0 crash or panic
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
  -- Dom0 will not panic with the latest linux.git tree (3.8-rc).
  -- We'll have more detailed update on that bugzilla.
3. [VT-D] VNC console broken after detach NIC from guest
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1736
4. Sometimes Xen panic on ia32pae Sandybridge when restore guest
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1747
5. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
6. Dom0 cannot be shutdown before PCI device detachment from guest
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
7. xl pci-list shows one PCI device (PF or VF) could be assigned to two different guests
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1834
8. [upstream qemu] Guest free memory with upstream qemu is 14MB lower than that with qemu-xen-unstable.git
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1836
9. [upstream qemu]'maxvcpus=NUM' item is not supported in upstream QEMU
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1837
10. [upstream qemu] Guest console hangs after save/restore or live-migration when setting 'hpet=0' in guest config file
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1838
11. [upstream qemu] 'xen_platform_pci=0' setting cannot make the guest use emulated PCI devices by default
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1839


Best Regards,
     Yongjie (Jay)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-10  7:51 VMX status report. Xen:26323 & Dom0:3.7.1 Ren, Yongjie
@ 2013-01-10  8:57 ` Jan Beulich
  2013-01-10 17:10   ` Andres Lagar-Cavilla
  2013-01-10 19:23   ` David Vrabel
  0 siblings, 2 replies; 18+ messages in thread
From: Jan Beulich @ 2013-01-10  8:57 UTC (permalink / raw)
  To: Yongjie Ren, Andres Lagar-Cavilla, Konrad Rzeszutek Wilk
  Cc: xen-devel, Chao Zhou, Yan Dai, YongweiX Xu, SongtaoX Liu

>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
> New issue(1)
> ==============
> 1. sometimes live migration failed and reported call trace in dom0
>   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841 

For the failed allocation, the only obvious candidate appears to be

	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);

which quite obviously can be of (almost) arbitrary size because

	nr_pages = m.num;
	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
		return -EINVAL;

really only checks for completely insane values.

This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
support autotranslated physmap guests", which added another
similar (twice as large) allocation in alloc_empty_pages().

I'd like to note that the forward ported kernels don't appear to
have a similar issue, as they never allocate more than a page at
a time. Was that code consulted at all when that addition was
done?

Jan


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-10  8:57 ` Jan Beulich
@ 2013-01-10 17:10   ` Andres Lagar-Cavilla
  2013-01-10 17:27     ` Mats Petersson
  2013-01-11  8:34     ` Jan Beulich
  2013-01-10 19:23   ` David Vrabel
  1 sibling, 2 replies; 18+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-10 17:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Yongjie Ren, Konrad Rzeszutek Wilk, xen-devel, Chao Zhou, Yan Dai,
	YongweiX Xu, SongtaoX Liu, Andres Lagar-Cavilla

On Jan 10, 2013, at 3:57 AM, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>> New issue(1)
>> ==============
>> 1. sometimes live migration failed and reported call trace in dom0
>>  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841 
> 
> For the failed allocation, the only obvious candidate appears to be
> 
> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
> 
> which quite obviously can be of (almost) arbitrary size because
> 
> 	nr_pages = m.num;
> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
> 		return -EINVAL;
> 
> really only checks for completely insane values.
> 
> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
> support autotranslated physmap guests", which added another
> similar (twice as large) allocation in alloc_empty_pages().

Perhaps it's the err_array in this case, since alloc_empty_pages only happens for auto-translated dom0s.

I'm not familiar with whether libxl changes (or is even capable of changing) the parameters of the migration code. But right now in libxc, mapping is done in MAX_BATCH_SIZE batches, which are of size 1024. So we are talking about 1024 ints, which is *one* page.

So is the kernel really incapable of allocating one measly page?

This leads me to think that it might be gather_array, but that one would allocate a grand total of two pages.

In any case, both functions allocate an arbitrary number of pages, and that is the fundamental problem.

What is the approach in the forward-ported kernel wrt gather_array?

The cleanest alternative I can think of is to refactor the body of mmap_batch to allocate one page for each array, and iteratively call traverse_pages, recycling the local arrays and advancing the pointers into the source user-space arrays.

Having said that, that would allocate two pages (always), and the code right now allocates a maximum of three (for libxc-driven migrations). So maybe the problem is elsewhere…

Thanks,
Andres

> 
> I'd like to note that the forward ported kernels don't appear to
> have a similar issue, as they never allocates more than a page at
> a time. Was that code consulted at all when that addition was
> done?
> 
> Jan
> 


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-10 17:10   ` Andres Lagar-Cavilla
@ 2013-01-10 17:27     ` Mats Petersson
  2013-01-11  8:34     ` Jan Beulich
  1 sibling, 0 replies; 18+ messages in thread
From: Mats Petersson @ 2013-01-10 17:27 UTC (permalink / raw)
  To: xen-devel

On 10/01/13 17:10, Andres Lagar-Cavilla wrote:
> On Jan 10, 2013, at 3:57 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>>> New issue(1)
>>> ==============
>>> 1. sometimes live migration failed and reported call trace in dom0
>>>   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841
>> For the failed allocation, the only obvious candidate appears to be
>>
>> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>>
>> which quite obviously can be of (almost) arbitrary size because
>>
>> 	nr_pages = m.num;
>> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>> 		return -EINVAL;
>>
>> really only checks for completely insane values.
>>
>> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
>> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
>> support autotranslated physmap guests", which added another
>> similar (twice as large) allocation in alloc_empty_pages().
> Perhaps the err_array in this case, since alloc_empty_pages only happens for auto translated dom0s.
>
> Not familiar wether libxl changes (or is even capable of changing) parameters of the migration code. But right now in libxc, mapping is done in MAX_BATCH_SIZE batches, which are of size 1024. So we are talking about 1024 ints, which is *one* page.
>
> So is really the kernel incapable of allocating one measly page?
>
> This leads me to think that it might be gather_array, but that one would allocate a grand total of two pages.
>
> In any case, both functions allocate arbitrary number of pages, and that is the fundamental problem.
>
> What is the approach in the forward ported kernel wrt to gather_array?
>
> The cleanest alternative I can think of is to refactor the the body of mmap_batch to allocate one page for each array, and iteratively call traverse_pages recycling the local arrays and increasing the pointers in the source user space arrays.
>
> Having said that, that would allocate two pages (always), and the code right now allocates max three (for libxc driven migrations). So maybe the problem is elsewhere….
>
> Thanks,
> Andres

Whilst this may not add much to the discussion: while working on the 
improved privcmd.c, I have been using 3.7.0-rc5 and 3.8.0. 
Both of these seem to work fine for migration using the libxc interface 
(since I've been using the XenServer build, the migration is not done 
through libxl).

I have not had a single failure to allocate pages during migration. I 
have a script that loops around migrating the guest to the same host as 
quickly as it can, and I have used guests up to 64GB (and left that 
running overnight - a migration takes about 3 minutes, so a night gives 
several hundred iterations).

So I'm wondering what is different between my setup and this one...

--
Mats
>
>> I'd like to note that the forward ported kernels don't appear to
>> have a similar issue, as they never allocates more than a page at
>> a time. Was that code consulted at all when that addition was
>> done?
>>
>> Jan
>>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
>


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-10  8:57 ` Jan Beulich
  2013-01-10 17:10   ` Andres Lagar-Cavilla
@ 2013-01-10 19:23   ` David Vrabel
  2013-01-11 15:55     ` Andres Lagar-Cavilla
  1 sibling, 1 reply; 18+ messages in thread
From: David Vrabel @ 2013-01-10 19:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Yongjie Ren, Konrad Rzeszutek Wilk, xen-devel, Chao Zhou, Yan Dai,
	YongweiX Xu, SongtaoX Liu, Andres Lagar-Cavilla

On 10/01/13 08:57, Jan Beulich wrote:
>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>> New issue(1)
>> ==============
>> 1. sometimes live migration failed and reported call trace in dom0
>>   http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841 
> 
> For the failed allocation, the only obvious candidate appears to be
> 
> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
> 
> which quite obviously can be of (almost) arbitrary size because
> 
> 	nr_pages = m.num;
> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
> 		return -EINVAL;
> 
> really only checks for completely insane values.
> 
> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
> support autotranslated physmap guests", which added another
> similar (twice as large) allocation in alloc_empty_pages().
> 
> I'd like to note that the forward ported kernels don't appear to
> have a similar issue, as they never allocates more than a page at
> a time. Was that code consulted at all when that addition was
> done?

I did highlight this at the time[1].

See [2] for how I avoided this allocation.

David

[1] http://lists.xen.org/archives/html/xen-devel/2012-08/msg02208.html
[2] http://lists.xen.org/archives/html/xen-devel/2012-08/msg02092.html


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-10 17:10   ` Andres Lagar-Cavilla
  2013-01-10 17:27     ` Mats Petersson
@ 2013-01-11  8:34     ` Jan Beulich
  2013-01-11 16:05       ` Andres Lagar-Cavilla
  2013-01-14  4:29       ` Andres Lagar-Cavilla
  1 sibling, 2 replies; 18+ messages in thread
From: Jan Beulich @ 2013-01-11  8:34 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Yongjie Ren, Konrad Rzeszutek Wilk, xen-devel, Chao Zhou, Yan Dai,
	YongweiX Xu, SongtaoX Liu, Andres Lagar-Cavilla

>>> On 10.01.13 at 18:10, Andres Lagar-Cavilla <andreslc@gridcentric.ca> wrote:
> On Jan 10, 2013, at 3:57 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
> 
>>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>>> New issue(1)
>>> ==============
>>> 1. sometimes live migration failed and reported call trace in dom0
>>>  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841 
>> 
>> For the failed allocation, the only obvious candidate appears to be
>> 
>> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>> 
>> which quite obviously can be of (almost) arbitrary size because
>> 
>> 	nr_pages = m.num;
>> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>> 		return -EINVAL;
>> 
>> really only checks for completely insane values.
>> 
>> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
>> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
>> support autotranslated physmap guests", which added another
>> similar (twice as large) allocation in alloc_empty_pages().
> 
> Perhaps the err_array in this case, since alloc_empty_pages only happens for 
> auto translated dom0s.
> 
> Not familiar wether libxl changes (or is even capable of changing) 
> parameters of the migration code. But right now in libxc, mapping is done in 
> MAX_BATCH_SIZE batches, which are of size 1024. So we are talking about 1024 
> ints, which is *one* page.
> 
> So is really the kernel incapable of allocating one measly page?
> 
> This leads me to think that it might be gather_array, but that one would 
> allocate a grand total of two pages.

The log indicated a failed order-4 allocation though. So maybe not
that allocation after all (but the basically unbounded size here is a
problem in any case).

> In any case, both functions allocate arbitrary number of pages, and that is 
> the fundamental problem.
> 
> What is the approach in the forward ported kernel wrt to gather_array?

There's no gather_array there. It simply sets up things one page
at a time. (And you can always go and check linux-2.6.18-xen.hg).

Jan


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-10 19:23   ` David Vrabel
@ 2013-01-11 15:55     ` Andres Lagar-Cavilla
  0 siblings, 0 replies; 18+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-11 15:55 UTC (permalink / raw)
  To: David Vrabel
  Cc: Yongjie Ren, Konrad Rzeszutek Wilk, xen-devel, Chao Zhou,
	Jan Beulich, Yan Dai, YongweiX Xu, SongtaoX Liu,
	Andres Lagar-Cavilla

On Jan 10, 2013, at 2:23 PM, David Vrabel <david.vrabel@citrix.com> wrote:

> On 10/01/13 08:57, Jan Beulich wrote:
>>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>>> New issue(1)
>>> ==============
>>> 1. sometimes live migration failed and reported call trace in dom0
>>>  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841 
>> 
>> For the failed allocation, the only obvious candidate appears to be
>> 
>> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>> 
>> which quite obviously can be of (almost) arbitrary size because
>> 
>> 	nr_pages = m.num;
>> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>> 		return -EINVAL;
>> 
>> really only checks for completely insane values.
>> 
>> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
>> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
>> support autotranslated physmap guests", which added another
>> similar (twice as large) allocation in alloc_empty_pages().
>> 
>> I'd like to note that the forward ported kernels don't appear to
>> have a similar issue, as they never allocates more than a page at
>> a time. Was that code consulted at all when that addition was
>> done?
> 
> I did highlight this at the time[1].
> 
> See [2] for how I avoided this allocation.

Well, you should have pushed more forcefully back then! I remember convincing you otherwise ;)

In any case see follow-up to Jan.

Andres
> 
> David
> 
> [1] http://lists.xen.org/archives/html/xen-devel/2012-08/msg02208.html
> [2] http://lists.xen.org/archives/html/xen-devel/2012-08/msg02092.html


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-11  8:34     ` Jan Beulich
@ 2013-01-11 16:05       ` Andres Lagar-Cavilla
  2013-01-14  4:29       ` Andres Lagar-Cavilla
  1 sibling, 0 replies; 18+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-11 16:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Yongjie Ren, Konrad Rzeszutek Wilk, xen-devel, Chao Zhou, Yan Dai,
	YongweiX Xu, SongtaoX Liu, Andres Lagar-Cavilla

On Jan 11, 2013, at 3:34 AM, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 10.01.13 at 18:10, Andres Lagar-Cavilla <andreslc@gridcentric.ca> wrote:
>> On Jan 10, 2013, at 3:57 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
>> 
>>>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>>>> New issue(1)
>>>> ==============
>>>> 1. sometimes live migration failed and reported call trace in dom0
>>>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841 
>>> 
>>> For the failed allocation, the only obvious candidate appears to be
>>> 
>>> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>>> 
>>> which quite obviously can be of (almost) arbitrary size because
>>> 
>>> 	nr_pages = m.num;
>>> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>>> 		return -EINVAL;
>>> 
>>> really only checks for completely insane values.
>>> 
>>> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
>>> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
>>> support autotranslated physmap guests", which added another
>>> similar (twice as large) allocation in alloc_empty_pages().
>> 
>> Perhaps the err_array in this case, since alloc_empty_pages only happens for 
>> auto translated dom0s.
>> 
>> Not familiar wether libxl changes (or is even capable of changing) 
>> parameters of the migration code. But right now in libxc, mapping is done in 
>> MAX_BATCH_SIZE batches, which are of size 1024. So we are talking about 1024 
>> ints, which is *one* page.
>> 
>> So is really the kernel incapable of allocating one measly page?
>> 
>> This leads me to think that it might be gather_array, but that one would 
>> allocate a grand total of two pages.
> 
> The log indicated a failed order-4 allocation though. So maybe not
> that allocation after all (be the basically unbounded size here is a
> problem in any case).

Unless somehow the batch size got changed to 16384, this most definitely is not err_array. Would be good to ascertain that.

I do see how gather_array avoids allocations larger than order 0. Let me look into cooking up a similar solution for err_array.

Andres
> 
>> In any case, both functions allocate arbitrary number of pages, and that is 
>> the fundamental problem.
>> 
>> What is the approach in the forward ported kernel wrt to gather_array?
> 
> There's no gather_array there. It simply sets up things one page
> at a time. (And you can always go and check linux-2.6.18-xen.hg).
> 
> Jan
> 


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-11  8:34     ` Jan Beulich
  2013-01-11 16:05       ` Andres Lagar-Cavilla
@ 2013-01-14  4:29       ` Andres Lagar-Cavilla
  2013-01-14  5:03         ` Andres Lagar-Cavilla
  2013-01-14 13:59         ` David Vrabel
  1 sibling, 2 replies; 18+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-14  4:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Yongjie Ren, Ian Campbell, Konrad Wilk, Mats Petersson, xen-devel,
	Chao Zhou, Yan Dai, YongweiX Xu, SongtaoX Liu,
	Andres Lagar-Cavilla

On Jan 11, 2013, at 3:34 AM, Jan Beulich <jbeulich@suse.com> wrote:

>>>> On 10.01.13 at 18:10, Andres Lagar-Cavilla <andreslc@gridcentric.ca> wrote:
>> On Jan 10, 2013, at 3:57 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
>> 
>>>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>>>> New issue(1)
>>>> ==============
>>>> 1. sometimes live migration failed and reported call trace in dom0
>>>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841 
>>> 
>>> For the failed allocation, the only obvious candidate appears to be
>>> 
>>> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>>> 
>>> which quite obviously can be of (almost) arbitrary size because
>>> 
>>> 	nr_pages = m.num;
>>> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>>> 		return -EINVAL;
>>> 
>>> really only checks for completely insane values.
>>> 
>>> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
>>> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
>>> support autotranslated physmap guests", which added another
>>> similar (twice as large) allocation in alloc_empty_pages().
>> 
>> Perhaps the err_array in this case, since alloc_empty_pages only happens for 
>> auto translated dom0s.
>> 
>> Not familiar wether libxl changes (or is even capable of changing) 
>> parameters of the migration code. But right now in libxc, mapping is done in 
>> MAX_BATCH_SIZE batches, which are of size 1024. So we are talking about 1024 
>> ints, which is *one* page.
>> 
>> So is really the kernel incapable of allocating one measly page?
>> 
>> This leads me to think that it might be gather_array, but that one would 
>> allocate a grand total of two pages.
> 
> The log indicated a failed order-4 allocation though. So maybe not
> that allocation after all (be the basically unbounded size here is a
> problem in any case).

In my mind this casts serious doubt on whether err_array is the culprit of this test result. I would look into other variables.

Having said that, a kcalloc of a potentially multi-page-sized array is a problem.

Below you'll find pasted an RFC patch to fix this. I've expanded the cc line to add Mats Petersson, who is also looking into some improvements to privcmd (and IanC for general feedback).

The RFC patch cuts down code overall and cleans up logic too. I did change the behavior wrt classic implementations when it comes to handling errors & EFAULT. Instead of doing all the mapping work and then copying back to user, I copy back each individual mapping error as soon as it arises. And short-circuit and quit the whole operation as soon as the first EFAULT arises.

Short-circuiting on the first EFAULT is the right thing to do IMHO. There is no point in continuing to work if we can't tell the caller about the result. Further, libxc will just munmap the vma and undo all work upon receiving EFAULT. So any further work is wasted work.

Please advise, thanks
Andres

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 3421f0d..9433396 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c 
@@ -254,18 +254,16 @@ struct mmap_batch_state {
    unsigned long va;
    struct vm_area_struct *vma;
    int index;
-   /* A tristate: 
-    *      0 for no errors                 
-    *      1 if at least one error has happened (and no
-    *          -ENOENT errors have happened)
-    *      -ENOENT if at least 1 -ENOENT has happened.
-    */             
-   int global_error;   
+   /* Whether at least one enoent has happened. */
+   int enoent;     
    /* An array for individual errors */
    int *err;           
                    
-   /* User-space mfn array to store errors in the second pass for V1. */
+   int version;
+   /* User-space mfn array to store errors for V1. */
    xen_pfn_t __user *user_mfn;
+   /* User space int array to store errors for V2. */
+   int __user *user_err;
 };
 
 /* auto translated dom0 note: if domU being created is PV, then mfn is
@@ -287,40 +285,35 @@ static int mmap_batch_fn(void *data, void *state)
                     st->vma->vm_page_prot, st->domain,
                     &cur_page);
 
-   /* Store error code for second pass. */
-   *(st->err++) = ret;
-
-   /* And see if it affects the global_error. */
-   if (ret < 0) {
-       if (ret == -ENOENT)
-           st->global_error = -ENOENT;
-       else {
-           /* Record that at least one error has happened. */
-           if (st->global_error == 0)
-               st->global_error = 1;
+   /* See if error affects the global enoent flag. */
+   if (ret == -ENOENT) {
+       st->enoent = 1;
+   }
+
+   /* And, if error, store in user space (version dependent). */
+   if (ret) {
+       int efault = 0;
+       if (st->version == 1) {
+           xen_pfn_t mfn_err = *mfnp;
+           mfn_err |= (ret == -ENOENT) ?
+                       PRIVCMD_MMAPBATCH_PAGED_ERROR :
+                       PRIVCMD_MMAPBATCH_MFN_ERROR;
+           efault = __put_user(mfn_err, st->user_mfn++);
+       } else { /* st->version == 2 */
+           efault = __put_user(ret, st->user_err++);
        }
+       /* If copy to user failed, short circuit now. */
+       if (efault)
+           return efault;
+   } else {
+       st->user_mfn++;
+       st->user_err++;
    }
-   st->va += PAGE_SIZE;

+   st->va += PAGE_SIZE;
    return 0;
 }

-static int mmap_return_errors_v1(void *data, void *state)
-{
-   xen_pfn_t *mfnp = data;
-   struct mmap_batch_state *st = state;
-   int err = *(st->err++);
-
-   /*
-    * V1 encodes the error codes in the 32bit top nibble of the
-    * mfn (with its known limitations vis-a-vis 64 bit callers).
-    */
-   *mfnp |= (err == -ENOENT) ?
-               PRIVCMD_MMAPBATCH_PAGED_ERROR :
-               PRIVCMD_MMAPBATCH_MFN_ERROR;
-   return __put_user(*mfnp, st->user_mfn++);
-}
-
 /* Allocate pfns that are then mapped with gmfns from foreign domid. Update
  * the vma with the page info to use later.
  * Returns: 0 if success, otherwise -errno
@@ -357,7 +350,6 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
    struct vm_area_struct *vma;
    unsigned long nr_pages;
    LIST_HEAD(pagelist);
-   int *err_array = NULL;
    struct mmap_batch_state state;

    if (!xen_initial_domain())
@@ -396,10 +388,12 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
        goto out;
    }

-   err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
-   if (err_array == NULL) {
-       ret = -ENOMEM;
-       goto out;
+   /* Zero the V2 array of errors, so we only write return codes on error. */
+   if (version == 2) {
+       if (clear_user(m.err, sizeof(int) * m.num)) {
+           ret = -EFAULT;
+           goto out;
+       }
    }

    down_write(&mm->mmap_sem);
@@ -426,38 +420,22 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
    state.vma           = vma;
    state.va            = m.addr;
    state.index         = 0;
-   state.global_error  = 0;
-   state.err           = err_array;
+   state.enoent        = 0;
+   state.user_mfn      = (xen_pfn_t *)m.arr;
+   state.user_err      = m.err;
+   state.version       = version;

-   /* mmap_batch_fn guarantees ret == 0 */
-   BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
-                &pagelist, mmap_batch_fn, &state));
+   ret = traverse_pages(m.num, sizeof(xen_pfn_t),
+                        &pagelist, mmap_batch_fn, &state);

    up_write(&mm->mmap_sem);

-   if (version == 1) {
-       if (state.global_error) {
-           /* Write back errors in second pass. */
-           state.user_mfn = (xen_pfn_t *)m.arr;
-           state.err      = err_array;
-           ret = traverse_pages(m.num, sizeof(xen_pfn_t),
-                        &pagelist, mmap_return_errors_v1, &state);
-       } else
-           ret = 0;
-
-   } else if (version == 2) {
-       ret = __copy_to_user(m.err, err_array, m.num * sizeof(int));
-       if (ret)
-           ret = -EFAULT;
-   }
-
    /* If we have not had any EFAULT-like global errors then set the global
     * error to -ENOENT if necessary. */
-   if ((ret == 0) && (state.global_error == -ENOENT))
+   if ((ret == 0) && (state.enoent))
        ret = -ENOENT;

 out:
-   kfree(err_array);
    free_page_list(&pagelist);

    return ret;


> 
>> In any case, both functions allocate arbitrary number of pages, and that is 
>> the fundamental problem.
>> 
>> What is the approach in the forward ported kernel wrt to gather_array?
> 
> There's no gather_array there. It simply sets up things one page
> at a time. (And you can always go and check linux-2.6.18-xen.hg).
> 
> Jan
> 


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-14  4:29       ` Andres Lagar-Cavilla
@ 2013-01-14  5:03         ` Andres Lagar-Cavilla
  2013-01-14 13:59         ` David Vrabel
  1 sibling, 0 replies; 18+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-14  5:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Yongjie Ren, Ian Campbell, Konrad Wilk, Mats Petersson, xen-devel,
	Chao Zhou, Yan Dai, YongweiX Xu, SongtaoX Liu,
	Andres Lagar-Cavilla

On Jan 13, 2013, at 11:29 PM, Andres Lagar-Cavilla <andreslc@gridcentric.ca> wrote:

> On Jan 11, 2013, at 3:34 AM, Jan Beulich <jbeulich@suse.com> wrote:
> 
>>>>> On 10.01.13 at 18:10, Andres Lagar-Cavilla <andreslc@gridcentric.ca> wrote:
>>> On Jan 10, 2013, at 3:57 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
>>> 
>>>>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>>>>> New issue(1)
>>>>> ==============
>>>>> 1. sometimes live migration failed and reported call trace in dom0
>>>>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841 
>>>> 
>>>> For the failed allocation, the only obvious candidate appears to be
>>>> 
>>>> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>>>> 
>>>> which quite obviously can be of (almost) arbitrary size because
>>>> 
>>>> 	nr_pages = m.num;
>>>> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>>>> 		return -EINVAL;
>>>> 
>>>> really only checks for completely insane values.
>>>> 
>>>> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
>>>> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
>>>> support autotranslated physmap guests", which added another
>>>> similar (twice as large) allocation in alloc_empty_pages().
>>> 
>>> Perhaps the err_array in this case, since alloc_empty_pages only happens for 
>>> auto translated dom0s.
>>> 
>>> Not familiar wether libxl changes (or is even capable of changing) 
>>> parameters of the migration code. But right now in libxc, mapping is done in 
>>> MAX_BATCH_SIZE batches, which are of size 1024. So we are talking about 1024 
>>> ints, which is *one* page.
>>> 
>>> So is really the kernel incapable of allocating one measly page?
>>> 
>>> This leads me to think that it might be gather_array, but that one would 
>>> allocate a grand total of two pages.
>> 
>> The log indicated a failed order-4 allocation though. So maybe not
>> that allocation after all (be the basically unbounded size here is a
>> problem in any case).
> 
> In my mind this casts serious doubt on whether err_array is the culprit of this test result. I would look into other variables.
> 
> Having said that, a kcalloc of a potentially multi-page-sized array is a problem.
> 
> Below you'll find pasted an RFC patch to fix this. I've expanded the cc line to add Mats Peterson, who is also looking into some improvements to privcmd (and IanC for general feedback).
> 
> The RFC patch cuts down code overall and cleans up logic too. I did change the behavior wrt classic implementations when it comes to handling errors & EFAULT. Instead of doing all the mapping work and then copying back to user, I copy back each individual mapping error as soon as it arises. And short-circuit and quit the whole operation as soon as the first EFAULT arises.
> 
> Short-circuiting on the first EFAULT is the right thing to do IMHO. There is no point in continuing to work if we can't tell the caller about the result. Further, libxc will just munmap the vma and undo all work upon receiving EFAULT. So any further work is wasted work.

Copying back one error at a time is what the current code does for the ioctl V1 in any case.

I've noticed a glaring logic error in the V1 case. If there is any mapping error, a second pass to copy back errors will be done. That second pass always sets the top nibble for each mfn to some error, whether one has happened for that mapping or not. Uggh.

The patch I just posted fixes this. Some version of a fix has to go in soon.

Andres
> 
> Please advise, thanks
> Andres
> 
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 3421f0d..9433396 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c 
> @@ -254,18 +254,16 @@ struct mmap_batch_state {
>    unsigned long va;
>    struct vm_area_struct *vma;
>    int index;
> -   /* A tristate: 
> -    *      0 for no errors                 
> -    *      1 if at least one error has happened (and no
> -    *          -ENOENT errors have happened)
> -    *      -ENOENT if at least 1 -ENOENT has happened.
> -    */             
> -   int global_error;   
> +   /* Whether at least one enoent has happened. */
> +   int enoent;     
>    /* An array for individual errors */
>    int *err;           
> 
> -   /* User-space mfn array to store errors in the second pass for V1. */
> +   int version;
> +   /* User-space mfn array to store errors for V1. */
>    xen_pfn_t __user *user_mfn;
> +   /* User space int array to store errors for V2. */
> +   int __user *user_err;
> };
> 
> /* auto translated dom0 note: if domU being created is PV, then mfn is
> @@ -287,40 +285,35 @@ static int mmap_batch_fn(void *data, void *state)
>                     st->vma->vm_page_prot, st->domain,
>                     &cur_page);
> 
> -   /* Store error code for second pass. */
> -   *(st->err++) = ret;
> -
> -   /* And see if it affects the global_error. */
> -   if (ret < 0) {
> -       if (ret == -ENOENT)
> -           st->global_error = -ENOENT;
> -       else {
> -           /* Record that at least one error has happened. */
> -           if (st->global_error == 0)
> -               st->global_error = 1;
> +   /* See if error affects the global enoent flag. */
> +   if (ret == -ENOENT) {
> +       st->enoent = 1;
> +   }
> +
> +   /* And, if error, store in user space (version dependent). */
> +   if (ret) {
> +       int efault = 0;
> +       if (st->version == 1) {
> +           xen_pfn_t mfn_err = *mfnp;
> +           mfn_err |= (ret == -ENOENT) ?
> +                       PRIVCMD_MMAPBATCH_PAGED_ERROR :
> +                       PRIVCMD_MMAPBATCH_MFN_ERROR;
> +           efault = __put_user(mfn_err, st->user_mfn++);
> +       } else { /* st->version == 2 */
> +           efault = __put_user(ret, st->user_err++);
>        }
> +       /* If copy to user failed, short circuit now. */
> +       if (efault)
> +           return efault;
> +   } else {
> +       st->user_mfn++;
> +       st->user_err++;
>    }
> -   st->va += PAGE_SIZE;
> 
> +   st->va += PAGE_SIZE;
>    return 0;
> }
> 
> -static int mmap_return_errors_v1(void *data, void *state)
> -{
> -   xen_pfn_t *mfnp = data;
> -   struct mmap_batch_state *st = state;
> -   int err = *(st->err++);
> -
> -   /*
> -    * V1 encodes the error codes in the 32bit top nibble of the
> -    * mfn (with its known limitations vis-a-vis 64 bit callers).
> -    */
> -   *mfnp |= (err == -ENOENT) ?
> -               PRIVCMD_MMAPBATCH_PAGED_ERROR :
> -               PRIVCMD_MMAPBATCH_MFN_ERROR;
> -   return __put_user(*mfnp, st->user_mfn++);
> -}
> -
> /* Allocate pfns that are then mapped with gmfns from foreign domid. Update
>  * the vma with the page info to use later.
>  * Returns: 0 if success, otherwise -errno
> @@ -357,7 +350,6 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>    struct vm_area_struct *vma;
>    unsigned long nr_pages;
>    LIST_HEAD(pagelist);
> -   int *err_array = NULL;
>    struct mmap_batch_state state;
> 
>    if (!xen_initial_domain())
> @@ -396,10 +388,12 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>        goto out;
>    }
> 
> -   err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
> -   if (err_array == NULL) {
> -       ret = -ENOMEM;
> -       goto out;
> +   /* Zero the V2 array of errors, so we only write return codes on error. */
> +   if (version == 2) {
> +       if (clear_user(m.err, sizeof(int) * m.num)) {
> +           ret = -EFAULT;
> +           goto out;
> +       }
>    }
> 
>    down_write(&mm->mmap_sem);
> @@ -426,38 +420,22 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>    state.vma           = vma;
>    state.va            = m.addr;
>    state.index         = 0;
> -   state.global_error  = 0;
> -   state.err           = err_array;
> +   state.enoent        = 0;
> +   state.user_mfn      = (xen_pfn_t *)m.arr;
> +   state.user_err      = m.err;
> +   state.version       = version;
> 
> -   /* mmap_batch_fn guarantees ret == 0 */
> -   BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
> -                &pagelist, mmap_batch_fn, &state));
> +   ret = traverse_pages(m.num, sizeof(xen_pfn_t),
> +                        &pagelist, mmap_batch_fn, &state);
> 
>    up_write(&mm->mmap_sem);
> 
> -   if (version == 1) {
> -       if (state.global_error) {
> -           /* Write back errors in second pass. */
> -           state.user_mfn = (xen_pfn_t *)m.arr;
> -           state.err      = err_array;
> -           ret = traverse_pages(m.num, sizeof(xen_pfn_t),
> -                        &pagelist, mmap_return_errors_v1, &state);
> -       } else
> -           ret = 0;
> -
> -   } else if (version == 2) {
> -       ret = __copy_to_user(m.err, err_array, m.num * sizeof(int));
> -       if (ret)
> -           ret = -EFAULT;
> -   }
> -
>    /* If we have not had any EFAULT-like global errors then set the global
>     * error to -ENOENT if necessary. */
> -   if ((ret == 0) && (state.global_error == -ENOENT))
> +   if ((ret == 0) && (state.enoent))
>        ret = -ENOENT;
> 
> out:
> -   kfree(err_array);
>    free_page_list(&pagelist);
> 
>    return ret;
> 
> 
>> 
>>> In any case, both functions allocate arbitrary number of pages, and that is 
>>> the fundamental problem.
>>> 
>>> What is the approach in the forward ported kernel wrt to gather_array?
>> 
>> There's no gather_array there. It simply sets up things one page
>> at a time. (And you can always go and check linux-2.6.18-xen.hg).
>> 
>> Jan
>> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-14  4:29       ` Andres Lagar-Cavilla
  2013-01-14  5:03         ` Andres Lagar-Cavilla
@ 2013-01-14 13:59         ` David Vrabel
  2013-01-14 15:06           ` Andres Lagar-Cavilla
  1 sibling, 1 reply; 18+ messages in thread
From: David Vrabel @ 2013-01-14 13:59 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Yongjie Ren, Ian Campbell, Konrad Wilk, Mats Petersson, xen-devel,
	Chao Zhou, Jan Beulich, Yan Dai, YongweiX Xu, SongtaoX Liu,
	Andres Lagar-Cavilla

On 14/01/13 04:29, Andres Lagar-Cavilla wrote:
> 
> Below you'll find pasted an RFC patch to fix this. I've expanded the
> cc line to add Mats Peterson, who is also looking into some improvements
> to privcmd (and IanC for general feedback).
> 
> The RFC patch cuts down code overall and cleans up logic too. I did
> change the behavior wrt classic implementations when it comes to
> handling errors & EFAULT. Instead of doing all the mapping work and then
> copying back to user, I copy back each individual mapping error as soon
> as it arises. And short-circuit and quit the whole operation as soon as
> the first EFAULT arises.

Which is broken.  Please just look at my v3 patch and implement that method.

> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 3421f0d..9433396 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c 
[...]
> @@ -287,40 +285,35 @@ static int mmap_batch_fn(void *data, void *state)
[...]
> +           efault = __put_user(mfn_err, st->user_mfn++);
> +       } else { /* st->version == 2 */
> +           efault = __put_user(ret, st->user_err++);

You can't use __put_user() or any other function accessing user memory
while holding mmap_sem or you will occasionally deadlock in the page
fault handler (depending on whether the user page is currently present
or not).

David


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-14 13:59         ` David Vrabel
@ 2013-01-14 15:06           ` Andres Lagar-Cavilla
  2013-01-14 16:03             ` David Vrabel
  0 siblings, 1 reply; 18+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-14 15:06 UTC (permalink / raw)
  To: David Vrabel
  Cc: Yongjie Ren, Ian Campbell, Konrad Wilk, Andres Lagar-Cavilla,
	xen-devel, Chao Zhou, Jan Beulich, Mats Petersson, Yan Dai,
	YongweiX Xu, SongtaoX Liu, Andres Lagar-Cavilla

On Jan 14, 2013, at 8:59 AM, David Vrabel <david.vrabel@citrix.com> wrote:

> On 14/01/13 04:29, Andres Lagar-Cavilla wrote:
>> 
>> Below you'll find pasted an RFC patch to fix this. I've expanded the
>> cc line to add Mats Peterson, who is also looking into some improvements
>> to privcmd (and IanC for general feedback).
>> 
>> The RFC patch cuts down code overall and cleans up logic too. I did
>> change the behavior wrt classic implementations when it comes to
>> handling errors & EFAULT. Instead of doing all the mapping work and then
>> copying back to user, I copy back each individual mapping error as soon
>> as it arises. And short-circuit and quit the whole operation as soon as
>> the first EFAULT arises.
> 
> Which is broken.
Certainly, due to copy-on-write faulting while holding the mmap semaphore. Unfortunately I didn't have time last night to post the fix; pardon the noise.
>  Please just look at my v3 patch and implement that method.
The one nit I have is that it does an unnecessary get_user of the mfn on the second pass for V1. How about this?
Andres

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 3421f0d..fc4952d 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -261,11 +261,12 @@ struct mmap_batch_state {
 	 *      -ENOENT if at least 1 -ENOENT has happened.
 	 */
 	int global_error;
-	/* An array for individual errors */
-	int *err;
+	int version;
 
 	/* User-space mfn array to store errors in the second pass for V1. */
 	xen_pfn_t __user *user_mfn;
+	/* User-space int array to store errors in the second pass for V2. */
+	int __user *user_err;
 };
 
 /* auto translated dom0 note: if domU being created is PV, then mfn is
@@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
 					 &cur_page);
 
 	/* Store error code for second pass. */
-	*(st->err++) = ret;
+	if (st->version == 1) {
+		if (ret < 0) {
+			/*
+			 * V1 encodes the error codes in the 32bit top nibble of the
+			 * mfn (with its known limitations vis-a-vis 64 bit callers).
+			 */
+			*mfnp |= (ret == -ENOENT) ?
+						PRIVCMD_MMAPBATCH_PAGED_ERROR :
+						PRIVCMD_MMAPBATCH_MFN_ERROR;
+		}
+	} else { /* st->version == 2 */
+		*((int *) mfnp) = ret;
+	}
 
 	/* And see if it affects the global_error. */
 	if (ret < 0) {
@@ -305,20 +318,25 @@ static int mmap_batch_fn(void *data, void *state)
 	return 0;
 }
 
-static int mmap_return_errors_v1(void *data, void *state)
+static int mmap_return_errors(void *data, void *state)
 {
-	xen_pfn_t *mfnp = data;
 	struct mmap_batch_state *st = state;
-	int err = *(st->err++);
 
-	/*
-	 * V1 encodes the error codes in the 32bit top nibble of the
-	 * mfn (with its known limitations vis-a-vis 64 bit callers).
-	 */
-	*mfnp |= (err == -ENOENT) ?
-				PRIVCMD_MMAPBATCH_PAGED_ERROR :
-				PRIVCMD_MMAPBATCH_MFN_ERROR;
-	return __put_user(*mfnp, st->user_mfn++);
+	if (st->version == 1) {
+		xen_pfn_t mfnp = *((xen_pfn_t *) data);
+		if (mfnp & PRIVCMD_MMAPBATCH_MFN_ERROR)
+			return __put_user(mfnp, st->user_mfn++);
+		else
+			st->user_mfn++;
+	} else { /* st->version == 2 */
+		int err = *((int *) data);
+		if (err)
+			return __put_user(err, st->user_err++);
+		else
+			st->user_err++;
+	}
+
+	return 0;
 }
 
 /* Allocate pfns that are then mapped with gmfns from foreign domid. Update
@@ -357,7 +375,6 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 	struct vm_area_struct *vma;
 	unsigned long nr_pages;
 	LIST_HEAD(pagelist);
-	int *err_array = NULL;
 	struct mmap_batch_state state;
 
 	if (!xen_initial_domain())
@@ -396,10 +413,12 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 		goto out;
 	}
 
-	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
-	if (err_array == NULL) {
-		ret = -ENOMEM;
-		goto out;
+	if (version == 2) {
+		/* Zero error array now to only copy back actual errors. */
+		if (clear_user(m.err, sizeof(int) * m.num)) {
+			ret = -EFAULT;
+			goto out;
+		}
 	}
 
 	down_write(&mm->mmap_sem);
@@ -427,7 +446,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 	state.va            = m.addr;
 	state.index         = 0;
 	state.global_error  = 0;
-	state.err           = err_array;
+	state.version       = version;
 
 	/* mmap_batch_fn guarantees ret == 0 */
 	BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
@@ -435,21 +454,14 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 
 	up_write(&mm->mmap_sem);
 
-	if (version == 1) {
-		if (state.global_error) {
-			/* Write back errors in second pass. */
-			state.user_mfn = (xen_pfn_t *)m.arr;
-			state.err      = err_array;
-			ret = traverse_pages(m.num, sizeof(xen_pfn_t),
-					     &pagelist, mmap_return_errors_v1, &state);
-		} else
-			ret = 0;
-
-	} else if (version == 2) {
-		ret = __copy_to_user(m.err, err_array, m.num * sizeof(int));
-		if (ret)
-			ret = -EFAULT;
-	}
+	if (state.global_error) {
+		/* Write back errors in second pass. */
+		state.user_mfn = (xen_pfn_t *)m.arr;
+		state.user_err = m.err;
+		ret = traverse_pages(m.num, sizeof(xen_pfn_t),
+							 &pagelist, mmap_return_errors, &state);
+	} else
+		ret = 0;
 
 	/* If we have not had any EFAULT-like global errors then set the global
 	 * error to -ENOENT if necessary. */
@@ -457,7 +469,6 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 		ret = -ENOENT;
 
 out:
-	kfree(err_array);
 	free_page_list(&pagelist);
 
 	return ret;


> 
>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>> index 3421f0d..9433396 100644
>> --- a/drivers/xen/privcmd.c
>> +++ b/drivers/xen/privcmd.c 
> [...]
>> @@ -287,40 +285,35 @@ static int mmap_batch_fn(void *data, void *state)
> [...]
>> +           efault = __put_user(mfn_err, st->user_mfn++);
>> +       } else { /* st->version == 2 */
>> +           efault = __put_user(ret, st->user_err++);
> 
> You can't use __put_user() or any other function accessing user memory
> while holding mmap_sem or you will occasionally deadlock in the page
> fault handler (depending on whether the user page is currently present
> or not).
> 
> David


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-14 15:06           ` Andres Lagar-Cavilla
@ 2013-01-14 16:03             ` David Vrabel
  2013-01-14 16:14               ` Jan Beulich
  2013-01-14 16:18               ` Andres Lagar-Cavilla
  0 siblings, 2 replies; 18+ messages in thread
From: David Vrabel @ 2013-01-14 16:03 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Yongjie Ren, Ian Campbell, Konrad Wilk, Andres Lagar-Cavilla,
	xen-devel, Chao Zhou, Jan Beulich, Mats Petersson, Yan Dai,
	YongweiX Xu, SongtaoX Liu, Andres Lagar-Cavilla

On 14/01/13 15:06, Andres Lagar-Cavilla wrote:
> On Jan 14, 2013, at 8:59 AM, David Vrabel <david.vrabel@citrix.com> wrote:
> 
>> On 14/01/13 04:29, Andres Lagar-Cavilla wrote:
>>>
>>> Below you'll find pasted an RFC patch to fix this. I've expanded the
>>> cc line to add Mats Peterson, who is also looking into some improvements
>>> to privcmd (and IanC for general feedback).
>>>
>>> The RFC patch cuts down code overall and cleans up logic too. I did
>>> change the behavior wrt classic implementations when it comes to
>>> handling errors & EFAULT. Instead of doing all the mapping work and then
>>> copying back to user, I copy back each individual mapping error as soon
>>> as it arises. And short-circuit and quit the whole operation as soon as
>>> the first EFAULT arises.
>>
>> Which is broken.
> Certainly, due to copy-on-write faulting while holding the mmap semaphore. Unfortunately I didn't have time last night to post the fix; pardon the noise.
>>  Please just look at my v3 patch and implement that method.

... but be aware that I messed up mmap_return_errors() for V1 and set
all MFNs as having errors.  Oops.

> The one nit I have is that it does an unnecessary get_user of the mfn on the second pass for V1. How about this?

__get_user() and __put_user() are actually cheap (provided they don't
fault).

This looks ok except for one thing.

> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 3421f0d..fc4952d 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
[...]
> @@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
>  					 &cur_page);
>  
>  	/* Store error code for second pass. */
> -	*(st->err++) = ret;
> +	if (st->version == 1) {
> +		if (ret < 0) {
> +			/*
> +			 * V1 encodes the error codes in the 32bit top nibble of the
> +			 * mfn (with its known limitations vis-a-vis 64 bit callers).
> +			 */
> +			*mfnp |= (ret == -ENOENT) ?
> +						PRIVCMD_MMAPBATCH_PAGED_ERROR :
> +						PRIVCMD_MMAPBATCH_MFN_ERROR;

You also need to clear the top nibble on success (ret >= 0) so large
PFNs with the top nibble already set don't give false positives of errors.

David


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-14 16:03             ` David Vrabel
@ 2013-01-14 16:14               ` Jan Beulich
  2013-01-14 16:15                 ` David Vrabel
  2013-01-14 16:18               ` Andres Lagar-Cavilla
  1 sibling, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2013-01-14 16:14 UTC (permalink / raw)
  To: David Vrabel, Andres Lagar-Cavilla
  Cc: Yongjie Ren, Ian Campbell, Konrad Wilk, Mats Petersson, xen-devel,
	Chao Zhou, Andres Lagar-Cavilla, Yan Dai, YongweiX Xu,
	SongtaoX Liu, Andres Lagar-Cavilla

>>> On 14.01.13 at 17:03, David Vrabel <david.vrabel@citrix.com> wrote:
> On 14/01/13 15:06, Andres Lagar-Cavilla wrote:
>> @@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
>>  					 &cur_page);
>>  
>>  	/* Store error code for second pass. */
>> -	*(st->err++) = ret;
>> +	if (st->version == 1) {
>> +		if (ret < 0) {
>> +			/*
>> +			 * V1 encodes the error codes in the 32bit top nibble of the
>> +			 * mfn (with its known limitations vis-a-vis 64 bit callers).
>> +			 */
>> +			*mfnp |= (ret == -ENOENT) ?
>> +						PRIVCMD_MMAPBATCH_PAGED_ERROR :
>> +						PRIVCMD_MMAPBATCH_MFN_ERROR;
> 
> You also need to clear the top nibble on success (ret >= 0) so large
> PFNs with the top nibble already set don't give false positives of errors.

Not really - that's what v2 was added for (the caller, unless
keeping a second array with the original MFNs, wouldn't be able
to match things up in that case).

Jan


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-14 16:14               ` Jan Beulich
@ 2013-01-14 16:15                 ` David Vrabel
  2013-01-14 16:19                   ` Andres Lagar-Cavilla
  0 siblings, 1 reply; 18+ messages in thread
From: David Vrabel @ 2013-01-14 16:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Yongjie Ren, Ian Campbell, Konrad Wilk, Mats Petersson, xen-devel,
	Chao Zhou, Andres Lagar-Cavilla, Yan Dai, YongweiX Xu,
	Andres Lagar-Cavilla, SongtaoX Liu, Andres Lagar-Cavilla

On 14/01/13 16:14, Jan Beulich wrote:
>>>> On 14.01.13 at 17:03, David Vrabel <david.vrabel@citrix.com> wrote:
>> On 14/01/13 15:06, Andres Lagar-Cavilla wrote:
>>> @@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
>>>  					 &cur_page);
>>>  
>>>  	/* Store error code for second pass. */
>>> -	*(st->err++) = ret;
>>> +	if (st->version == 1) {
>>> +		if (ret < 0) {
>>> +			/*
>>> +			 * V1 encodes the error codes in the 32bit top nibble of the
>>> +			 * mfn (with its known limitations vis-a-vis 64 bit callers).
>>> +			 */
>>> +			*mfnp |= (ret == -ENOENT) ?
>>> +						PRIVCMD_MMAPBATCH_PAGED_ERROR :
>>> +						PRIVCMD_MMAPBATCH_MFN_ERROR;
>>
>> You also need to clear the top nibble on success (ret >= 0) so large
>> PFNs with the top nibble already set don't give false positives of errors.
> 
> Not really - that's what v2 was added for (the caller, unless
> keeping a second array with the original MFNs, wouldn't be able
> to match things up in that case).

Ok, I can agree with that.

David


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-14 16:03             ` David Vrabel
  2013-01-14 16:14               ` Jan Beulich
@ 2013-01-14 16:18               ` Andres Lagar-Cavilla
  1 sibling, 0 replies; 18+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-14 16:18 UTC (permalink / raw)
  To: David Vrabel
  Cc: Yongjie Ren, Ian Campbell, Konrad Wilk, Andres Lagar-Cavilla,
	xen-devel, Chao Zhou, Jan Beulich, Mats Petersson, Yan Dai,
	YongweiX Xu, SongtaoX Liu, Andres Lagar-Cavilla


On Jan 14, 2013, at 11:03 AM, David Vrabel <david.vrabel@citrix.com> wrote:

> On 14/01/13 15:06, Andres Lagar-Cavilla wrote:
>> On Jan 14, 2013, at 8:59 AM, David Vrabel <david.vrabel@citrix.com> wrote:
>> 
>>> On 14/01/13 04:29, Andres Lagar-Cavilla wrote:
>>>> 
>>>> Below you'll find pasted an RFC patch to fix this. I've expanded the
>>>> cc line to add Mats Peterson, who is also looking into some improvements
>>>> to privcmd (and IanC for general feedback).
>>>> 
>>>> The RFC patch cuts down code overall and cleans up logic too. I did
>>>> change the behavior wrt classic implementations when it comes to
>>>> handling errors & EFAULT. Instead of doing all the mapping work and then
>>>> copying back to user, I copy back each individual mapping error as soon
>>>> as it arises. And short-circuit and quit the whole operation as soon as
>>>> the first EFAULT arises.
>>> 
>>> Which is broken.
>> Certainly, due to copy-on-write faulting while holding the mmap semaphore. Unfortunately I didn't have time last night to post the fix; pardon the noise.
>>> Please just look at my v3 patch and implement that method.
> 
> ... but be aware that I messed up mmap_return_errors() for V1 and set
> all MFNs as having errors.  Oops.
> 
>> The one nit I have is that it does an unnecessary get_user of the mfn on the second pass for V1. How about this?
> 
> __get_user() and __put_user() are actually cheap (provided they don't
> fault).
> 
> This looks ok except for one thing.
> 
>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>> index 3421f0d..fc4952d 100644
>> --- a/drivers/xen/privcmd.c
>> +++ b/drivers/xen/privcmd.c
> [...]
>> @@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
>> 					 &cur_page);
>> 
>> 	/* Store error code for second pass. */
>> -	*(st->err++) = ret;
>> +	if (st->version == 1) {
>> +		if (ret < 0) {
>> +			/*
>> +			 * V1 encodes the error codes in the 32bit top nibble of the
>> +			 * mfn (with its known limitations vis-a-vis 64 bit callers).
>> +			 */
>> +			*mfnp |= (ret == -ENOENT) ?
>> +						PRIVCMD_MMAPBATCH_PAGED_ERROR :
>> +						PRIVCMD_MMAPBATCH_MFN_ERROR;
> 
> You also need to clear the top nibble on success (ret >= 0) so large
> PFNs with the top nibble already set don't give false positives of errors.

But classic kernels don't do this either, afaict (e.g. http://xenbits.xen.org/hg/linux-2.6.18-xen.hg/file/c340a22a3a63/drivers/xen/privcmd/privcmd.c#l282 or XenServer's 6.1 kernel).

The key reason for V2 is to get rid of all these limitations and stop trying to fix them in V1.

I'm open to whichever fix, though. It'd be just one line for the else case. I'd just like some feedback before "officially" submitting a patch.

Thanks!
Andres
> 
> David
> 




* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-14 16:15                 ` David Vrabel
@ 2013-01-14 16:19                   ` Andres Lagar-Cavilla
  2013-01-14 21:21                     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 18+ messages in thread
From: Andres Lagar-Cavilla @ 2013-01-14 16:19 UTC (permalink / raw)
  To: David Vrabel
  Cc: Yongjie Ren, Ian Campbell, Konrad Wilk, Mats Petersson, xen-devel,
	Chao Zhou, Jan Beulich, Andres Lagar-Cavilla, Yan Dai,
	YongweiX Xu, SongtaoX Liu, Andres Lagar-Cavilla

On Jan 14, 2013, at 11:15 AM, David Vrabel <david.vrabel@citrix.com> wrote:

> On 14/01/13 16:14, Jan Beulich wrote:
>>>>> On 14.01.13 at 17:03, David Vrabel <david.vrabel@citrix.com> wrote:
>>> On 14/01/13 15:06, Andres Lagar-Cavilla wrote:
>>>> @@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
>>>> 					 &cur_page);
>>>> 
>>>> 	/* Store error code for second pass. */
>>>> -	*(st->err++) = ret;
>>>> +	if (st->version == 1) {
>>>> +		if (ret < 0) {
>>>> +			/*
>>>> +			 * V1 encodes the error codes in the 32bit top nibble of the
>>>> +			 * mfn (with its known limitations vis-a-vis 64 bit callers).
>>>> +			 */
>>>> +			*mfnp |= (ret == -ENOENT) ?
>>>> +						PRIVCMD_MMAPBATCH_PAGED_ERROR :
>>>> +						PRIVCMD_MMAPBATCH_MFN_ERROR;
>>> 
>>> You also need to clear the top nibble on success (ret >= 0) so large
>>> PFNs with the top nibble already set don't give false positives of errors.
>> 
>> Not really - that's what v2 was added for (the caller, unless
>> keeping a second array with the original MFNs, wouldn't be able
>> to match things up in that case).
> 
> Ok, I can agree with that.

Ok, cool, thanks David. Jan, Konrad, is the last patch sent this (EST) morning decent enough?

Andres
> 
> David


* Re: VMX status report. Xen:26323 & Dom0:3.7.1
  2013-01-14 16:19                   ` Andres Lagar-Cavilla
@ 2013-01-14 21:21                     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 18+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-01-14 21:21 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Yongjie Ren, Ian Campbell, Andres Lagar-Cavilla,
	Andres Lagar-Cavilla, xen-devel, Chao Zhou, Jan Beulich,
	Mats Petersson, Yan Dai, YongweiX Xu, SongtaoX Liu, David Vrabel

On Mon, Jan 14, 2013 at 11:19:18AM -0500, Andres Lagar-Cavilla wrote:
> On Jan 14, 2013, at 11:15 AM, David Vrabel <david.vrabel@citrix.com> wrote:
> 
> > On 14/01/13 16:14, Jan Beulich wrote:
> >>>>> On 14.01.13 at 17:03, David Vrabel <david.vrabel@citrix.com> wrote:
> >>> On 14/01/13 15:06, Andres Lagar-Cavilla wrote:
> >>>> @@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
> >>>> 					 &cur_page);
> >>>> 
> >>>> 	/* Store error code for second pass. */
> >>>> -	*(st->err++) = ret;
> >>>> +	if (st->version == 1) {
> >>>> +		if (ret < 0) {
> >>>> +			/*
> >>>> +			 * V1 encodes the error codes in the 32bit top nibble of the
> >>>> +			 * mfn (with its known limitations vis-a-vis 64 bit callers).
> >>>> +			 */
> >>>> +			*mfnp |= (ret == -ENOENT) ?
> >>>> +						PRIVCMD_MMAPBATCH_PAGED_ERROR :
> >>>> +						PRIVCMD_MMAPBATCH_MFN_ERROR;
> >>> 
> >>> You also need to clear the top nibble on success (ret >= 0) so large
> >>> PFNs with the top nibble already set don't give false positives of errors.
> >> 
> >> Not really - that's what v2 was added for (the caller, unless
> >> keeping a second array with the original MFNs, wouldn't be able
> >> to match things up in that case).
> > 
> > Ok, I can agree with that.
> 
> Ok, cool, thanks David. Jan, Konrad, is the last patch sent this (EST) morning decent enough?


Hm, I am not seeing it in my mailbox. Was I on the 'To' or 'CC' list?


Thread overview: 18+ messages
2013-01-10  7:51 VMX status report. Xen:26323 & Dom0:3.7.1 Ren, Yongjie
2013-01-10  8:57 ` Jan Beulich
2013-01-10 17:10   ` Andres Lagar-Cavilla
2013-01-10 17:27     ` Mats Petersson
2013-01-11  8:34     ` Jan Beulich
2013-01-11 16:05       ` Andres Lagar-Cavilla
2013-01-14  4:29       ` Andres Lagar-Cavilla
2013-01-14  5:03         ` Andres Lagar-Cavilla
2013-01-14 13:59         ` David Vrabel
2013-01-14 15:06           ` Andres Lagar-Cavilla
2013-01-14 16:03             ` David Vrabel
2013-01-14 16:14               ` Jan Beulich
2013-01-14 16:15                 ` David Vrabel
2013-01-14 16:19                   ` Andres Lagar-Cavilla
2013-01-14 21:21                     ` Konrad Rzeszutek Wilk
2013-01-14 16:18               ` Andres Lagar-Cavilla
2013-01-10 19:23   ` David Vrabel
2013-01-11 15:55     ` Andres Lagar-Cavilla
