From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: Xen-unstable-staging: Xen BUG at iommu_map.c:455
Date: Sat, 11 Apr 2015 18:35:57 +0100
Message-ID: <55295B7D.4020107@citrix.com>
References: <1864736142.20150331231153@eikelenboom.it>
	<551B2FFA.6080001@citrix.com>
	<10810464855.20150410122433@eikelenboom.it>
	<55281C9F.7080409@citrix.com> <467745093.20150411161154@eikelenboom.it>
	<55292E04.9000302@citrix.com>
	<1246180833.20150411182125@eikelenboom.it>
	<55294CA8.2070208@citrix.com> <55294DF9.9040207@citrix.com>
	<1189087521.20150411192519@eikelenboom.it>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
Received: from mail6.bemta3.messagelabs.com ([195.245.230.39])
	by lists.xen.org with esmtp (Exim 4.72)
	(envelope-from <Andrew.Cooper3@citrix.com>) id 1YgzK7-00005A-GN
	for xen-devel@lists.xenproject.org; Sat, 11 Apr 2015 17:36:03 +0000
In-Reply-To: <1189087521.20150411192519@eikelenboom.it>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Sander Eikelenboom <linux@eikelenboom.it>
Cc: xen-devel <xen-devel@lists.xenproject.org>, Paul Durrant <paul.durrant@citrix.com>, Jan Beulich <jbeulich@suse.com>
List-Id: xen-devel@lists.xenproject.org

On 11/04/15 18:25, Sander Eikelenboom wrote:
> Saturday, April 11, 2015, 6:38:17 PM, you wrote:
>
>> On 11/04/15 17:32, Andrew Cooper wrote:
>>> On 11/04/15 17:21, Sander Eikelenboom wrote:
>>>> Saturday, April 11, 2015, 4:21:56 PM, you wrote:
>>>>
>>>>> On 11/04/15 15:11, Sander Eikelenboom wrote:
>>>>>> Friday, April 10, 2015, 8:55:27 PM, you wrote:
>>>>>>
>>>>>>> On 10/04/15 11:24, Sander Eikelenboom wrote:
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>> Finally got some time to figure this out .. and i have narrowed it down to:
>>>>>>>> git://xenbits.xen.org/staging/qemu-upstream-unstable.git
>>>>>>>> commit 7665d6ba98e20fb05c420de947c1750fd47e5c07 "Xen: Use the ioreq-server API when available"
>>>>>>>> A straight revert of this commit prevents the issue from happening.
>>>>>>>>
>>>>>>>> The reason i had a hard time figuring this out was:
>>>>>>>> - I wasn't aware of this earlier, since git pulling the main xen tree, doesn't 
>>>>>>>>   auto update the qemu-* trees.
>>>>>>> This has caught me out so many times.  It is very non-obvious behaviour.
>>>>>>>> - So i happen to get this when i cloned a fresh tree to try to figure out the 
>>>>>>>>   other issue i was seeing.
>>>>>>>> - After that checking out previous versions of the main xen tree didn't resolve 
>>>>>>>>   this new issue, because the qemu tree doesn't get auto updated and is set 
>>>>>>>>   "master".
>>>>>>>> - Cloning a xen-stable-4.5.0 made it go away .. because that has a specific 
>>>>>>>>   git://xenbits.xen.org/staging/qemu-upstream-unstable.git tag which is not 
>>>>>>>>   master.
>>>>>>>>
>>>>>>>> *sigh* 
>>>>>>>>
>>>>>>>> This is tested with xen main tree at last commit 3a28f760508fb35c430edac17a9efde5aff6d1d5
>>>>>>>> (normal xen-unstable, not the staging branch)
>>>>>>>>
>>>>>>>> Ok so i have added some extra debug info (see attached diff) and this is the 
>>>>>>>> output when it crashes due to something the commit above triggered, the 
>>>>>>>> level is out of bounds and the pfn looks fishy too.
>>>>>>>> Complete serial log from both bad and good (specific commit reverted) are 
>>>>>>>> attached.
>>>>>>> Just to confirm, you are positively identifying a qemu changeset as
>>>>>>> causing this crash?
>>>>>>> If so, the qemu change has discovered a pre-existing issue in the
>>>>>>> toolstack pci-passthrough interface.  Whatever qemu is or isn't doing,
>>>>>>> it should not be able to cause a crash like this.
>>>>>>> With this in mind, I need to brush up on my AMD-Vi details.
>>>>>>> In the meantime, can you run with the following patch to identify what
>>>>>>> is going on, domctl wise?  I assume it is the assign_device which is
>>>>>>> failing, but it will be nice to observe the differences between the
>>>>>>> working and failing case, which might offer a hint.
>>>>>> Hrrm with your patch i end up with a fatal page fault in iommu_do_pci_domctl:
>>>>>>
>>>>>> (XEN) [2015-04-11 14:03:31.833] ----[ Xen-4.6-unstable  x86_64  debug=y  Tainted:    C ]----
>>>>>> (XEN) [2015-04-11 14:03:31.857] CPU:    5
>>>>>> (XEN) [2015-04-11 14:03:31.868] RIP:    e008:[<ffff82d08014c52c>] iommu_do_pci_domctl+0x2dc/0x740
>>>>>> (XEN) [2015-04-11 14:03:31.894] RFLAGS: 0000000000010256   CONTEXT: hypervisor
>>>>>> (XEN) [2015-04-11 14:03:31.915] rax: 0000000000000008   rbx: 0000000000000800   rcx: ffffffffffebe5ed
>>>>>> (XEN) [2015-04-11 14:03:31.942] rdx: 0000000000000800   rsi: 0000000000000000   rdi: ffff830256ef7e38
>>>>>> (XEN) [2015-04-11 14:03:31.968] rbp: ffff830256ef7c98   rsp: ffff830256ef7c08   r8:  00000000deadbeef
>>>>>> (XEN) [2015-04-11 14:03:31.995] r9:  00000000deadbeef   r10: ffff82d08024e500   r11: 0000000000000282
>>>>>> (XEN) [2015-04-11 14:03:32.022] r12: 0000000000000000   r13: 0000000000000008   r14: 0000000000000000
>>>>>> (XEN) [2015-04-11 14:03:32.049] r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000000006f0
>>>>>> (XEN) [2015-04-11 14:03:32.076] cr3: 00000002336a6000   cr2: 0000000000000000
>>>>>> (XEN) [2015-04-11 14:03:32.096] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>>>>>> (XEN) [2015-04-11 14:03:32.121] Xen stack trace from rsp=ffff830256ef7c08:
>>>>>> (XEN) [2015-04-11 14:03:32.141]    ffff830256ef7c78 ffff82d08012c178 ffff830256ef7c28 ffff830256ef7c28
>>>>>> (XEN) [2015-04-11 14:03:32.168]    0000000000000010 0000000000000000 0000000000000000 0000000000000000
>>>>>> (XEN) [2015-04-11 14:03:32.195]    00000000000006f0 00007fe300000000 ffff830256eb7790 ffff83025cc6d300
>>>>>> (XEN) [2015-04-11 14:03:32.222]    ffff82d080330c60 00007fe396bab004 0000000000000000 00007fe396bab004
>>>>>> (XEN) [2015-04-11 14:03:32.249]    0000000000000000 0000000000000005 ffff830256ef7ca8 ffff82d08014900b
>>>>>> (XEN) [2015-04-11 14:03:32.276]    ffff830256ef7d98 ffff82d080161f2d 0000000000000010 0000000000000000
>>>>>> (XEN) [2015-04-11 14:03:32.303]    0000000000000000 ffff830256ef7ce8 ffff82d08018b655 ffff830256ef7d48
>>>>>> (XEN) [2015-04-11 14:03:32.330]    ffff830256ef7cf8 ffff82d08018b66a ffff830256ef7d38 ffff82d08012925e
>>>>>> (XEN) [2015-04-11 14:03:32.357]    ffff830256efc068 0000000800000001 800000022e12c167 0000000000000000
>>>>>> (XEN) [2015-04-11 14:03:32.384]    0000000000000002 ffff830256ef7e38 0000000800000000 800000022e12c167
>>>>>> (XEN) [2015-04-11 14:03:32.411]    0000000000000003 ffff830256ef7db8 0000000000000000 00007fe396780eb0
>>>>>> (XEN) [2015-04-11 14:03:32.439]    0000000000000202 ffffffffffffffff 0000000000000000 00007fe396bab004
>>>>>> (XEN) [2015-04-11 14:03:32.466]    0000000000000000 0000000000000005 ffff830256ef7ef8 ffff82d08010497f
>>>>>> (XEN) [2015-04-11 14:03:32.493]    0000000000000001 0000000000100001 800000022e12c167 ffff88001f7ecc00
>>>>>> (XEN) [2015-04-11 14:03:32.520]    00007fe396780eb0 ffff88001c849508 0000000e00000007 ffffffff8105594a
>>>>>> (XEN) [2015-04-11 14:03:32.547]    000000000000e033 0000000000000202 ffff88001ece3d40 000000000000e02b
>>>>>> (XEN) [2015-04-11 14:03:32.574]    ffff830256ef7e28 ffff82d080194933 000000000000beef ffffffff81bd6c85
>>>>>> (XEN) [2015-04-11 14:03:32.601]    ffff830256ef7f08 ffff82d080193edd 0000000b0000002d 0000000000000001
>>>>>> (XEN) [2015-04-11 14:03:32.628]    0000000100000800 00007fe3962abbd0 ffff000a81050001 00007fe39656ce6e
>>>>>> (XEN) [2015-04-11 14:03:32.655]    00007ffdf2a654f0 00007fe39656d0c9 00007fe39656ce6e 00007fe3969a9a55
>>>>>> (XEN) [2015-04-11 14:03:32.682] Xen call trace:
>>>>>> (XEN) [2015-04-11 14:03:32.695]    [<ffff82d08014c52c>] iommu_do_pci_domctl+0x2dc/0x740
>>>>>> (XEN) [2015-04-11 14:03:32.718]    [<ffff82d08014900b>] iommu_do_domctl+0x17/0x1a
>>>>>> (XEN) [2015-04-11 14:03:32.739]    [<ffff82d080161f2d>] arch_do_domctl+0x2469/0x26e1
>>>>>> (XEN) [2015-04-11 14:03:32.762]    [<ffff82d08010497f>] do_domctl+0x1a1f/0x1d60
>>>>>> (XEN) [2015-04-11 14:03:32.783]    [<ffff82d080234c6b>] syscall_enter+0xeb/0x145
>>>>>> (XEN) [2015-04-11 14:03:32.804] 
>>>>>> (XEN) [2015-04-11 14:03:32.813] Pagetable walk from 0000000000000000:
>>>>>> (XEN) [2015-04-11 14:03:32.831]  L4[0x000] = 0000000234075067 000000000001f2a8
>>>>>> (XEN) [2015-04-11 14:03:32.852]  L3[0x000] = 0000000229ad4067 0000000000014c49
>>>>>> (XEN) [2015-04-11 14:03:32.873]  L2[0x000] = 0000000000000000 ffffffffffffffff 
>>>>>> (XEN) [2015-04-11 14:03:32.894] 
>>>>>> (XEN) [2015-04-11 14:03:32.903] ****************************************
>>>>>> (XEN) [2015-04-11 14:03:32.922] Panic on CPU 5:
>>>>>> (XEN) [2015-04-11 14:03:32.935] FATAL PAGE FAULT
>>>>>> (XEN) [2015-04-11 14:03:32.948] [error_code=0000]
>>>>>> (XEN) [2015-04-11 14:03:32.961] Faulting linear address: 0000000000000000
>>>>>> (XEN) [2015-04-11 14:03:32.981] ****************************************
>>>>>> (XEN) [2015-04-11 14:03:33.000] 
>>>>>> (XEN) [2015-04-11 14:03:33.009] Reboot in five seconds...
>>>>>>
>>>>>> The RIP resolves to the prink added by your patch in:
>>>>>>
>>>>>>     case XEN_DOMCTL_test_assign_device:
>>>>>>         ret = xsm_test_assign_device(XSM_HOOK, domctl->u.assign_device.machine_sbdf);
>>>>>>         if ( ret )
>>>>>>             break;
>>>>>>
>>>>>>         seg = domctl->u.assign_device.machine_sbdf >> 16;
>>>>>>         bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
>>>>>>         devfn = domctl->u.assign_device.machine_sbdf & 0xff;
>>>>>>
>>>>>>         printk("*** %pv->d%d: test_assign_device({%04x:%02x:%02x.%u})\n",
>>>>>>                current, d->domain_id,
>>>>>>                seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>>>>>
>>>>>>         if ( device_assigned(seg, bus, devfn) )
>>>>>>         {
>>>>>>             printk(XENLOG_G_INFO
>>>>>>                    "%04x:%02x:%02x.%u already assigned, or non-existent\n",
>>>>>>                    seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>>>>>             ret = -EINVAL;
>>>>>>         }
>>>>>>         break;
>>>>> hmm - 'd' is NULL.  This ought to work better.
>>>>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>>>>> index 9f3413c..85ff1fc 100644
>>>>> --- a/xen/drivers/passthrough/pci.c
>>>>> +++ b/xen/drivers/passthrough/pci.c
>>>>> @@ -1532,6 +1532,11 @@ int iommu_do_pci_domctl(
>>>>>          max_sdevs = domctl->u.get_device_group.max_sdevs;
>>>>>          sdevs = domctl->u.get_device_group.sdev_array;
>>>>>  
>>>>> +        printk("*** %pv->d%d: get_device_group({%04x:%02x:%02x.%u, %u})\n",
>>>>> +               current, d->domain_id,
>>>>> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
>>>>> +               max_sdevs);
>>>>> +
>>>>>          ret = iommu_get_device_group(d, seg, bus, devfn, sdevs, max_sdevs);
>>>>>          if ( ret < 0 )
>>>>>          {
>>>>> @@ -1558,6 +1563,9 @@ int iommu_do_pci_domctl(
>>>>>          bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
>>>>>          devfn = domctl->u.assign_device.machine_sbdf & 0xff;
>>>>>  
>>>>> +        printk("*** %pv: test_assign_device({%04x:%02x:%02x.%u})\n",
>>>>> +               current, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>>>> +
>>>>>          if ( device_assigned(seg, bus, devfn) )
>>>>>          {
>>>>>              printk(XENLOG_G_INFO
>>>>> @@ -1582,6 +1590,10 @@ int iommu_do_pci_domctl(
>>>>>          bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
>>>>>          devfn = domctl->u.assign_device.machine_sbdf & 0xff;
>>>>>  
>>>>> +        printk("*** %pv->d%d: assign_device({%04x:%02x:%02x.%u})\n",
>>>>> +               current, d->domain_id,
>>>>> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>>>> +
>>>>>          ret = device_assigned(seg, bus, devfn) ?:
>>>>>                assign_device(d, seg, bus, devfn);
>>>>>          if ( ret == -ERESTART )
>>>>> @@ -1604,6 +1616,10 @@ int iommu_do_pci_domctl(
>>>>>          bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
>>>>>          devfn = domctl->u.assign_device.machine_sbdf & 0xff;
>>>>>  
>>>>> +        printk("*** %pv->d%d: deassign_device({%04x:%02x:%02x.%u})\n",
>>>>> +               current, d->domain_id,
>>>>> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>>>> +
>>>>>          spin_lock(&pcidevs_lock);
>>>>>          ret = deassign_device(d, seg, bus, devfn);
>>>>>          spin_unlock(&pcidevs_lock);
>>>> Hi Andrew,
>>>>
>>>> Attached are the serial logs good (with revert) and bad (without):
>>>>
>>>> Some things that seems strange to me:
>>>> - The numerous calls to get the device 08:00.0 assigned ... for 0a:00.0 there 
>>>>   was only one call to both test assign and assign.
>>>> - However these numerous calls are there in both the good and the bad case,
>>>>   so perhaps it's strange and wrong .. but not the cause ..
>>>> - I had a hunch it could be due to the 08:00.0 using MSI-X, but when only 
>>>>   passing through 0a:00.0, i get the same numerous calls but now for the 
>>>>   0a:00.0 which uses IntX, so I think that is more related to being the *first*
>>>>   device to be passed through to a guest.
>>> I have also observed this behaviour, but not had time to investigate. 
>>> It doesn't appear problematic in the longrun but it probably a toolstack
>>> issue which wants fixing (if only in the name of efficiency).
>> And just after I sent this email, I have realised why.
>> The first assign device will have to build IO pagetables, which is a
>> long operation and subject to hypercall continuations.  The second
>> device will reused the same pagetables, so is quick to complete.
> So .. is the ioreq patch from Paul involved in providing something used in building 
> the pagetables .. and could it have say some off by one resulting in the 
> 0xffffffffffff .. which could lead to the pagetable building going beserk, 
> requiring a paging_mode far greater than normally would be required .. which 
> get's set .. since that isn't checked properly .. leading to things breaking 
> a bit further when it does get checked ? 

A -1 is slipping in somewhere and ending up in the gfn field.

The result is that update_paging_mode() attempts to construct
iopagetables to cover a 76bit address space, which is how level ends up
at 8.  (Note that a level of 7 is reserved, and a level of anything
greater than 4 is implausible on your system.)

I think the crash is collateral damage following on from
update_paging_mode() not properly sanitising its input, but that there
is still some other issue causing -1 to be passed in the first place.

I am still trying to locate where a -1 might plausibly be coming from.

~Andrew