* Re: [PATCH v9 00/12] Support PPTT for ARM64
From: Sudeep Holla @ 2018-05-29 13:18 UTC (permalink / raw)
To: Geert Uytterhoeven
Cc: Sudeep Holla, Catalin Marinas, Jeremy Linton,
ACPI Devel Maling List, Mark Rutland, austinwc, tnowicki,
Palmer Dabbelt, Will Deacon, linux-riscv, Morten.Rasmussen,
vkilari, Lorenzo Pieralisi, jhugo, Al Stone, Len Brown,
John Garry, wangxiongfeng2, Dietmar Eggemann, Linux ARM,
Ard Biesheuvel, Greg KH
In-Reply-To: <CAMuHMdXgiMeD4uF+j8W+CpNwYYK2W_8xqk_=vGBiW=bUvKeq7w@mail.gmail.com>
On 29/05/18 12:56, Geert Uytterhoeven wrote:
> Hi Sudeep,
>
> On Tue, May 29, 2018 at 1:14 PM, Sudeep Holla <sudeep.holla@arm.com> wrote:
>> On 29/05/18 11:48, Geert Uytterhoeven wrote:
>>> On Thu, May 17, 2018 at 7:05 PM, Catalin Marinas
>>> <catalin.marinas@arm.com> wrote:
>>>> On Fri, May 11, 2018 at 06:57:55PM -0500, Jeremy Linton wrote:
>>>>> Jeremy Linton (12):
>>>>> drivers: base: cacheinfo: move cache_setup_of_node()
>>>>> drivers: base: cacheinfo: setup DT cache properties early
>>>>> cacheinfo: rename of_node to fw_token
>>>>> arm64/acpi: Create arch specific cpu to acpi id helper
>>>>> ACPI/PPTT: Add Processor Properties Topology Table parsing
>>>>> ACPI: Enable PPTT support on ARM64
>>>>> drivers: base cacheinfo: Add support for ACPI based firmware tables
>>>>> arm64: Add support for ACPI based firmware tables
>>>>> arm64: topology: rename cluster_id
>>>>> arm64: topology: enable ACPI/PPTT based CPU topology
>>>>> ACPI: Add PPTT to injectable table list
>>>>> arm64: topology: divorce MC scheduling domain from core_siblings
>>>>
>>>> Queued for 4.18 (without Sudeep's latest property_read_u64 cacheinfo
>>>> patch - http://lkml.kernel.org/r/20180517154701.GA20281@e107155-lin; I
>>>> can add it separately).
>>>
>>> This is now commit 37c3ec2d810f87ea ("arm64: topology: divorce MC
>>> scheduling domain from core_siblings") in arm64/for-next/core, causing
>>> system suspend on big.LITTLE systems to hang after shutting down the first
>>> CPU:
>>>
>>> $ echo mem > /sys/power/state
>>> PM: suspend entry (deep)
>>> PM: Syncing filesystems ... done.
>>> Freezing user space processes ... (elapsed 0.001 seconds) done.
>>> OOM killer disabled.
>>> Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
>>> Disabling non-boot CPUs ...
>>> CPU1: shutdown
>>> psci: CPU1 killed.
>>>
>>
>> Is it OK to assume the suspend failed just after shutting down one CPU
>> or it's failing during resume ? It depends on whether you had console
>> disabled or not.
>
> I have no-console-suspend enabled.
> It's failing during suspend, the next lines should be:
>
> CPU2: shutdown
> psci: CPU2 killed.
> ...
>
OK, I was hoping to be something during resume as this patch has nothing
executed during suspend. Do you see any change in topology before and
after this patch applied. I am interested in the output of:
$ grep "" /sys/devices/system/cpu/cpu*/topology/*
>>> For me, it fails on the following big.LITTLE systems:
>>>
>>> R-Car H3 ES2.0 (4xCA57 + 4xCA53)
>>> R-Car M3-W (2xCA57 + 4xCA53)
>>>
>>
>> Interesting, is it PSCI based system suspend ?
>
> Yes it is.
>
> Suspend-to-idle, which doesn't offline CPUs, still works.
>
>From DT, I guess this platform doesn't have any idle states.
Does this use genpd power domains ? I see power-domains in the DT, so
asking to get more info. Do you have any out of tree patches especially
if they are depending on some topology cpumasks ?
>>> System supend still works fine on systems with big cores only:
>>>
>>> R-Car H3 ES1.0 (4xCA57 (4xCA53 disabled in firmware))
>>> R-Car M3-N (2xCA57)
>>>
>>> Reverting this commit fixes the issue for me.
>>
>> I can't find anything that relates to system suspend in these patches
>> unless they are messing with something during CPU hot plug-in back
>> during resume.
>
> It's only the last patch that introduces the breakage.
>
As specified in the commit log, it won't change any behavior for DT
systems if it's non-NUMA or single node system. So I am still wondering
what could trigger this regression.
--
Regards,
Sudeep
^ permalink raw reply
* Re: [PATCH v2 13/40] vfio: Add support for Shared Virtual Addressing
From: Xu Zaibo @ 2018-05-29 12:24 UTC (permalink / raw)
To: Jean-Philippe Brucker,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-acpi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
Cc: ashok.raj-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
ilias.apalodimas-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org,
Will Deacon, okaya-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org,
rgummal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org,
rfranz-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org,
liguozhu-C8/M+/jPZTeaMJb+Lgu22Q@public.gmane.org,
christian.koenig-5C7GfCeVMHo@public.gmane.org
In-Reply-To: <99ff4f89-86ef-a251-894c-8aa8a47d2a69-5wv7dgnIgG8@public.gmane.org>
Hi,
On 2018/5/29 19:55, Jean-Philippe Brucker wrote:
> (If possible, please reply in plain text to the list. Reading this in a
> text-only reader is confusing, because all quotes have the same level)
Sorry for that, I have reset the thunderbird, :) thanks.
> On 26/05/18 04:53, Xu Zaibo wrote:
>> I guess there may be some misunderstandings :).
>>
>> From the current patches, 'iommu_sva_device_shutdown' is called by 'vfio_iommu_sva_shutdown', which
>> is mainly used by 'vfio_iommu_type1_detach_group' that is finally called by processes' release of vfio facilitiy
>> automatically or called by 'VFIO_GROUP_UNSET_CONTAINER' manually in the processes.
>>
>> So, just image that 2 processes is working on the device with IOPF feature, and the 2 do the following to enable SVA:
>>
>> 1.open("/dev/vfio/vfio") ;
>>
>> 2.open the group of the devcie by calling open("/dev/vfio/x"), but now,
>> I think the second processes will fail to open the group because current VFIO thinks that one group only can be used by one process/vm,
>> at present, I use mediated device to create more groups on the parent device to support multiple processes;
>>
>> 3.VFIO_GROUP_SET_CONTAINER;
>>
>> 4.VFIO_SET_IOMMU;
>>
>> 5.VFIO_IOMMU_BIND;
> I have a concern regarding your driver. With mdev you can't allow
> processes to program the PASID themselves, since the parent device has a
> single PASID table. You lose all isolation since processes could write
> any value in the PASID field and access address spaces of other
> processes bound to this parent device (even if the BIND call was for
> other mdevs).
Yes, if wrapdrive do nothing on this PASID setting procedure in kernel
space, I think
there definitely exists this security risk.
> The wrap driver has to mediate calls to bind(), and either program the
> PASID into the device itself, or at least check that, when receiving a
> SET_PASID ioctl from a process, the given PASID was actually allocated
> to the process.
Yes, good advice, thanks.
>
>> 6.Do some works with the hardware working unit filled by PASID on the device;
>>
>> 7.VFIO_IOMMU_UNBIND;
>>
>> 8.VFIO_GROUP_UNSET_CONTAINER;---here, have to sleep to wait another process to finish works of the step 6;
>>
>> 9. close(group); close(containner);
>>
>>
>> So, my idea is: If it is possible to just release the params or facilities that only belong to the current process while the process shutdown the device,
>> and while the last process releases them all. Then, as in the above step 8, we
>> don't need to wait, or maybe wait for a very long time if the other process is doing lots of work on the device.
> Given that you need to notify the mediating driver of IOMMU_BIND calls
> as explained above, you could do something similar for shutdown: from
> the mdev driver, call iommu_sva_shutdown_device() only for the last mdev.
Currently, I add an API to check if it is the last mdev in wrapdrive, as
vfio shutdowns the device,
it call the API to do the check at first.
Thanks
Zaibo
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
^ permalink raw reply
* [PATCH v3 2/2] PCI: Check phys_addr for pci_remap_iospace
From: Yisheng Xie @ 2018-05-29 12:18 UTC (permalink / raw)
To: bhelgaas, rjw, lenb, guohanjun, jcm, toshi.kani, tanxiaojun,
wangzhou1
Cc: linux-pci, linux-acpi, linux-kernel, Yisheng Xie
In-Reply-To: <1527596299-57567-1-git-send-email-xieyisheng1@huawei.com>
If phys_addr is not page aligned, ioremap_page_range() will align down it
when get pfn by phys_addr >> PAGE_SHIFT. An example in arm64 system with
64KB page size:
phys_addr: 0xefff8000
res->start: 0x0
res->end: 0x0ffff
PCI_IOBASE: 0xffff7fdffee00000
This will remap virtual address 0xffff7fdffee00000 to phys_addr 0xefff0000,
but what we really want is 0xefff8000, which makes later IO access to a
mess. And users may even donot know this until find some odd phenemenon.
This patch checks whether phys_addr is PAGE_ALIGNED or not to find the
primary scene.
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
---
drivers/pci/pci.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 0eb0381..117ca6a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3547,6 +3547,9 @@ int pci_remap_iospace(const struct resource *res, phys_addr_t phys_addr)
if (!PAGE_ALIGNED(vaddr) || !PAGE_ALIGNED(resource_size(res)))
return -EINVAL;
+ if (!PAGE_ALIGNED(phys_addr))
+ return -EINVAL;
+
return ioremap_page_range(vaddr, vaddr + resource_size(res), phys_addr,
pgprot_device(PAGE_KERNEL));
#else
--
1.7.12.4
^ permalink raw reply related
* [PATCH v3 1/2] PCI: Avoid panic when PCI IO resource's size is not page aligned
From: Yisheng Xie @ 2018-05-29 12:18 UTC (permalink / raw)
To: bhelgaas, rjw, lenb, guohanjun, jcm, toshi.kani, tanxiaojun,
wangzhou1
Cc: linux-pci, linux-acpi, linux-kernel, Yisheng Xie
Zhou reported a bug on Hisilicon arm64 D06 platform with 64KB page size:
[ 2.470908] kernel BUG at lib/ioremap.c:72!
[ 2.475079] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 2.480551] Modules linked in:
[ 2.483594] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.16.0-rc7-00062-g0b41260-dirty #23
[ 2.491756] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI Nemo 2.0 RC0 - B120 03/23/2018
[ 2.500614] pstate: 80c00009 (Nzcv daif +PAN +UAO)
[ 2.505395] pc : ioremap_page_range+0x268/0x36c
[ 2.509912] lr : pci_remap_iospace+0xe4/0x100
[...]
[ 2.603733] Call trace:
[ 2.606168] ioremap_page_range+0x268/0x36c
[ 2.610337] pci_remap_iospace+0xe4/0x100
[ 2.614334] acpi_pci_probe_root_resources+0x1d4/0x214
[ 2.619460] pci_acpi_root_prepare_resources+0x18/0xa8
[ 2.624585] acpi_pci_root_create+0x98/0x214
[ 2.628843] pci_acpi_scan_root+0x124/0x20c
[ 2.633013] acpi_pci_root_add+0x224/0x494
[ 2.637096] acpi_bus_attach+0xf8/0x200
[ 2.640918] acpi_bus_attach+0x98/0x200
[ 2.644740] acpi_bus_attach+0x98/0x200
[ 2.648562] acpi_bus_scan+0x48/0x9c
[ 2.652125] acpi_scan_init+0x104/0x268
[ 2.655948] acpi_init+0x308/0x374
[ 2.659337] do_one_initcall+0x48/0x14c
[ 2.663160] kernel_init_freeable+0x19c/0x250
[ 2.667504] kernel_init+0x10/0x100
[ 2.670979] ret_from_fork+0x10/0x18
The cause is the size of PCI IO resource is 32KB, which is 4K aligned but
not 64KB aligned, however, ioremap_page_range() request the range as page
aligned or it will trigger a BUG_ON() on ioremap_pte_range() it calls, as
ioremap_pte_range increase the addr by PAGE_SIZE, which makes addr != end
until trigger BUG_ON, if its incoming end is not page aligned. More detail
trace is as following:
ioremap_page_range
-> ioremap_p4d_range
-> ioremap_p4d_range
-> ioremap_pud_range
-> ioremap_pmd_range
-> ioremap_pte_range
This patch avoid panic by return -EINVAL if vaddr or resource size is not
page aligned.
Reported-by: Zhou Wang <wangzhou1@hisilicon.com>
Tested-by: Xiaojun Tan <tanxiaojun@huawei.com>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
---
v3:
- pci_remap_iospace() sanitize its arguments instead - per Rafael
v2:
- Let the caller of ioremap_page_range() align the request by PAGE_SIZE - per Toshi
drivers/pci/pci.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index dbfe7c4..0eb0381 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3544,6 +3544,9 @@ int pci_remap_iospace(const struct resource *res, phys_addr_t phys_addr)
if (res->end > IO_SPACE_LIMIT)
return -EINVAL;
+ if (!PAGE_ALIGNED(vaddr) || !PAGE_ALIGNED(resource_size(res)))
+ return -EINVAL;
+
return ioremap_page_range(vaddr, vaddr + resource_size(res), phys_addr,
pgprot_device(PAGE_KERNEL));
#else
--
1.7.12.4
^ permalink raw reply related
* Re: [PATCH v9 00/12] Support PPTT for ARM64
From: Geert Uytterhoeven @ 2018-05-29 11:56 UTC (permalink / raw)
To: Sudeep Holla
Cc: Catalin Marinas, Jeremy Linton, ACPI Devel Maling List,
Mark Rutland, austinwc, tnowicki, Palmer Dabbelt, Will Deacon,
linux-riscv, Morten.Rasmussen, vkilari, Lorenzo Pieralisi, jhugo,
Al Stone, Len Brown, John Garry, wangxiongfeng2, Dietmar Eggemann,
Linux ARM, Ard Biesheuvel, Greg KH
In-Reply-To: <09fb3fe7-d703-43f1-74f7-f8cb5ff1f67a@arm.com>
Hi Sudeep,
On Tue, May 29, 2018 at 1:14 PM, Sudeep Holla <sudeep.holla@arm.com> wrote:
> On 29/05/18 11:48, Geert Uytterhoeven wrote:
>> On Thu, May 17, 2018 at 7:05 PM, Catalin Marinas
>> <catalin.marinas@arm.com> wrote:
>>> On Fri, May 11, 2018 at 06:57:55PM -0500, Jeremy Linton wrote:
>>>> Jeremy Linton (12):
>>>> drivers: base: cacheinfo: move cache_setup_of_node()
>>>> drivers: base: cacheinfo: setup DT cache properties early
>>>> cacheinfo: rename of_node to fw_token
>>>> arm64/acpi: Create arch specific cpu to acpi id helper
>>>> ACPI/PPTT: Add Processor Properties Topology Table parsing
>>>> ACPI: Enable PPTT support on ARM64
>>>> drivers: base cacheinfo: Add support for ACPI based firmware tables
>>>> arm64: Add support for ACPI based firmware tables
>>>> arm64: topology: rename cluster_id
>>>> arm64: topology: enable ACPI/PPTT based CPU topology
>>>> ACPI: Add PPTT to injectable table list
>>>> arm64: topology: divorce MC scheduling domain from core_siblings
>>>
>>> Queued for 4.18 (without Sudeep's latest property_read_u64 cacheinfo
>>> patch - http://lkml.kernel.org/r/20180517154701.GA20281@e107155-lin; I
>>> can add it separately).
>>
>> This is now commit 37c3ec2d810f87ea ("arm64: topology: divorce MC
>> scheduling domain from core_siblings") in arm64/for-next/core, causing
>> system suspend on big.LITTLE systems to hang after shutting down the first
>> CPU:
>>
>> $ echo mem > /sys/power/state
>> PM: suspend entry (deep)
>> PM: Syncing filesystems ... done.
>> Freezing user space processes ... (elapsed 0.001 seconds) done.
>> OOM killer disabled.
>> Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
>> Disabling non-boot CPUs ...
>> CPU1: shutdown
>> psci: CPU1 killed.
>>
>
> Is it OK to assume the suspend failed just after shutting down one CPU
> or it's failing during resume ? It depends on whether you had console
> disabled or not.
I have no-console-suspend enabled.
It's failing during suspend, the next lines should be:
CPU2: shutdown
psci: CPU2 killed.
...
>> For me, it fails on the following big.LITTLE systems:
>>
>> R-Car H3 ES2.0 (4xCA57 + 4xCA53)
>> R-Car M3-W (2xCA57 + 4xCA53)
>>
>
> Interesting, is it PSCI based system suspend ?
Yes it is.
Suspend-to-idle, which doesn't offline CPUs, still works.
>> System supend still works fine on systems with big cores only:
>>
>> R-Car H3 ES1.0 (4xCA57 (4xCA53 disabled in firmware))
>> R-Car M3-N (2xCA57)
>>
>> Reverting this commit fixes the issue for me.
>
> I can't find anything that relates to system suspend in these patches
> unless they are messing with something during CPU hot plug-in back
> during resume.
It's only the last patch that introduces the breakage.
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
^ permalink raw reply
* Re: [PATCH v2 13/40] vfio: Add support for Shared Virtual Addressing
From: Jean-Philippe Brucker @ 2018-05-29 11:55 UTC (permalink / raw)
To: Xu Zaibo,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-acpi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
Cc: ashok.raj-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
ilias.apalodimas-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org,
Will Deacon, okaya-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org,
rgummal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org,
rfranz-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org,
liguozhu-C8/M+/jPZTeaMJb+Lgu22Q@public.gmane.org,
christian.koenig-5C7GfCeVMHo@public.gmane.org
In-Reply-To: <5B08DA21.3070507-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
(If possible, please reply in plain text to the list. Reading this in a
text-only reader is confusing, because all quotes have the same level)
On 26/05/18 04:53, Xu Zaibo wrote:
> I guess there may be some misunderstandings :).
>
> From the current patches, 'iommu_sva_device_shutdown' is called by 'vfio_iommu_sva_shutdown', which
> is mainly used by 'vfio_iommu_type1_detach_group' that is finally called by processes' release of vfio facilitiy
> automatically or called by 'VFIO_GROUP_UNSET_CONTAINER' manually in the processes.
>
> So, just image that 2 processes is working on the device with IOPF feature, and the 2 do the following to enable SVA:
>
> 1.open("/dev/vfio/vfio") ;
>
> 2.open the group of the devcie by calling open("/dev/vfio/x"), but now,
> I think the second processes will fail to open the group because current VFIO thinks that one group only can be used by one process/vm,
> at present, I use mediated device to create more groups on the parent device to support multiple processes;
>
> 3.VFIO_GROUP_SET_CONTAINER;
>
> 4.VFIO_SET_IOMMU;
>
> 5.VFIO_IOMMU_BIND;
I have a concern regarding your driver. With mdev you can't allow
processes to program the PASID themselves, since the parent device has a
single PASID table. You lose all isolation since processes could write
any value in the PASID field and access address spaces of other
processes bound to this parent device (even if the BIND call was for
other mdevs).
The wrap driver has to mediate calls to bind(), and either program the
PASID into the device itself, or at least check that, when receiving a
SET_PASID ioctl from a process, the given PASID was actually allocated
to the process.
> 6.Do some works with the hardware working unit filled by PASID on the device;
>
> 7.VFIO_IOMMU_UNBIND;
>
> 8.VFIO_GROUP_UNSET_CONTAINER;---here, have to sleep to wait another process to finish works of the step 6;
>
> 9. close(group); close(containner);
>
>
> So, my idea is: If it is possible to just release the params or facilities that only belong to the current process while the process shutdown the device,
> and while the last process releases them all. Then, as in the above step 8, we
> don't need to wait, or maybe wait for a very long time if the other process is doing lots of work on the device.
Given that you need to notify the mediating driver of IOMMU_BIND calls
as explained above, you could do something similar for shutdown: from
the mdev driver, call iommu_sva_shutdown_device() only for the last mdev.
Thanks,
Jean
^ permalink raw reply
* Re: [PATCH v9 00/12] Support PPTT for ARM64
From: Sudeep Holla @ 2018-05-29 11:14 UTC (permalink / raw)
To: Geert Uytterhoeven, Catalin Marinas, Jeremy Linton
Cc: Sudeep Holla, ACPI Devel Maling List, Mark Rutland, austinwc,
tnowicki, Palmer Dabbelt, Will Deacon, linux-riscv,
Morten.Rasmussen, vkilari, Lorenzo Pieralisi, jhugo, Al Stone,
Len Brown, John Garry, wangxiongfeng2, Dietmar Eggemann,
Linux ARM, Ard Biesheuvel, Greg KH, Rafael J. Wysocki
In-Reply-To: <CAMuHMdWJWj3a0MZgEi7VJTUJRNoeR+X3eoN8A-sW6fwimEr6Fg@mail.gmail.com>
On 29/05/18 11:48, Geert Uytterhoeven wrote:
> Hi Catalin, Jeremy,
>
> On Thu, May 17, 2018 at 7:05 PM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
>> On Fri, May 11, 2018 at 06:57:55PM -0500, Jeremy Linton wrote:
>>> Jeremy Linton (12):
>>> drivers: base: cacheinfo: move cache_setup_of_node()
>>> drivers: base: cacheinfo: setup DT cache properties early
>>> cacheinfo: rename of_node to fw_token
>>> arm64/acpi: Create arch specific cpu to acpi id helper
>>> ACPI/PPTT: Add Processor Properties Topology Table parsing
>>> ACPI: Enable PPTT support on ARM64
>>> drivers: base cacheinfo: Add support for ACPI based firmware tables
>>> arm64: Add support for ACPI based firmware tables
>>> arm64: topology: rename cluster_id
>>> arm64: topology: enable ACPI/PPTT based CPU topology
>>> ACPI: Add PPTT to injectable table list
>>> arm64: topology: divorce MC scheduling domain from core_siblings
>>
>> Queued for 4.18 (without Sudeep's latest property_read_u64 cacheinfo
>> patch - http://lkml.kernel.org/r/20180517154701.GA20281@e107155-lin; I
>> can add it separately).
>
> This is now commit 37c3ec2d810f87ea ("arm64: topology: divorce MC
> scheduling domain from core_siblings") in arm64/for-next/core, causing
> system suspend on big.LITTLE systems to hang after shutting down the first
> CPU:
>
> $ echo mem > /sys/power/state
> PM: suspend entry (deep)
> PM: Syncing filesystems ... done.
> Freezing user space processes ... (elapsed 0.001 seconds) done.
> OOM killer disabled.
> Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> Disabling non-boot CPUs ...
> CPU1: shutdown
> psci: CPU1 killed.
>
Is it OK to assume the suspend failed just after shutting down one CPU
or it's failing during resume ? It depends on whether you had console
disabled or not.
> For me, it fails on the following big.LITTLE systems:
>
> R-Car H3 ES2.0 (4xCA57 + 4xCA53)
> R-Car M3-W (2xCA57 + 4xCA53)
>
Interesting, is it PSCI based system suspend ?
> System supend still works fine on systems with big cores only:
>
> R-Car H3 ES1.0 (4xCA57 (4xCA53 disabled in firmware))
> R-Car M3-N (2xCA57)
>
> Reverting this commit fixes the issue for me.
>
I can't find anything that relates to system suspend in these patches
unless they are messing with something during CPU hot plug-in back
during resume.
--
Regards,
Sudeep
^ permalink raw reply
* Re: [PATCH v9 00/12] Support PPTT for ARM64
From: Geert Uytterhoeven @ 2018-05-29 10:48 UTC (permalink / raw)
To: Catalin Marinas, Jeremy Linton
Cc: ACPI Devel Maling List, Mark Rutland, austinwc, tnowicki,
Palmer Dabbelt, Will Deacon, linux-riscv, Morten.Rasmussen,
vkilari, Lorenzo Pieralisi, jhugo, Al Stone, Len Brown,
John Garry, wangxiongfeng2, Dietmar Eggemann, Linux ARM,
Ard Biesheuvel, Greg KH, Rafael J. Wysocki,
Linux Kernel Mailing List, Hanju
In-Reply-To: <20180517170523.h7tuvbzdfluuidcz@armageddon.cambridge.arm.com>
Hi Catalin, Jeremy,
On Thu, May 17, 2018 at 7:05 PM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Fri, May 11, 2018 at 06:57:55PM -0500, Jeremy Linton wrote:
>> Jeremy Linton (12):
>> drivers: base: cacheinfo: move cache_setup_of_node()
>> drivers: base: cacheinfo: setup DT cache properties early
>> cacheinfo: rename of_node to fw_token
>> arm64/acpi: Create arch specific cpu to acpi id helper
>> ACPI/PPTT: Add Processor Properties Topology Table parsing
>> ACPI: Enable PPTT support on ARM64
>> drivers: base cacheinfo: Add support for ACPI based firmware tables
>> arm64: Add support for ACPI based firmware tables
>> arm64: topology: rename cluster_id
>> arm64: topology: enable ACPI/PPTT based CPU topology
>> ACPI: Add PPTT to injectable table list
>> arm64: topology: divorce MC scheduling domain from core_siblings
>
> Queued for 4.18 (without Sudeep's latest property_read_u64 cacheinfo
> patch - http://lkml.kernel.org/r/20180517154701.GA20281@e107155-lin; I
> can add it separately).
This is now commit 37c3ec2d810f87ea ("arm64: topology: divorce MC
scheduling domain from core_siblings") in arm64/for-next/core, causing
system suspend on big.LITTLE systems to hang after shutting down the first
CPU:
$ echo mem > /sys/power/state
PM: suspend entry (deep)
PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.001 seconds) done.
OOM killer disabled.
Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Disabling non-boot CPUs ...
CPU1: shutdown
psci: CPU1 killed.
For me, it fails on the following big.LITTLE systems:
R-Car H3 ES2.0 (4xCA57 + 4xCA53)
R-Car M3-W (2xCA57 + 4xCA53)
System supend still works fine on systems with big cores only:
R-Car H3 ES1.0 (4xCA57 (4xCA53 disabled in firmware))
R-Car M3-N (2xCA57)
Reverting this commit fixes the issue for me.
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
^ permalink raw reply
* Re: [PATCH v2 39/40] iommu/arm-smmu-v3: Add support for PRI
From: Jean-Philippe Brucker @ 2018-05-29 10:27 UTC (permalink / raw)
To: Bharat Kumar Gogada,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-acpi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
Cc: xuzaibo-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Will Deacon,
okaya-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org,
ashok.raj-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
rfranz-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org,
Ravikiran Gummaluri,
ilias.apalodimas-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org,
dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
christian.koenig-5C7GfCeVMHo@public.gmane.org
In-Reply-To: <BLUPR0201MB150513BBAA161355DE9B3A48A5690-hRBPhS1iNj/g9tdZWAsUFxrHTHEw16jenBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
On 25/05/18 15:08, Bharat Kumar Gogada wrote:
>> + master->can_fault = true;
>> + master->ste.prg_resp_needs_ssid =
>> pci_prg_resp_requires_prefix(pdev);
>
> Any reason why this is not cleared in arm_smmu_disable_pri ?
Actually, setting it here is wrong. Since we now call enable_pri()
lazily, prg_resp_needs_ssid isn't initialized when writing the STE. That
bit is read by the SMMU when the PRIQ is full and it needs to
auto-respond. Fortunately the PRI doesn't need to be enabled in order to
read this bit, so we can move pci_prg_resp_requires_prefix() to
add_device() and clear the bit in remove_device(). Thanks for catching this.
Jean
^ permalink raw reply
* Re: [PATCH v2 07/40] iommu: Add a page fault handler
From: Jean-Philippe Brucker @ 2018-05-29 10:00 UTC (permalink / raw)
To: Jacob Pan
Cc: ashok.raj-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-acpi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
xuzaibo-hv44wF8Li93QT0dZR+AlfA@public.gmane.org,
ilias.apalodimas-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org,
Will Deacon,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
okaya-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
rgummal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org,
rfranz-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org,
dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
christian.koenig-5C7GfCeVMHo@public.gmane.org,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
In-Reply-To: <20180525173544.05638510@jacob-builder>
On 26/05/18 01:35, Jacob Pan wrote:
>>>> Not exactly, it's the IOMMU driver that makes sure all LPIG in its
>>>> queues are submitted by the above flush call. In more details the
>>>> flow is:
>>>>
>>>> * Either device driver calls unbind()/sva_device_shutdown(), or the
>>>> process exits.
>>>> * If the device driver called, then it already told the device to
>>>> stop using the PASID. Otherwise we use the mm_exit() callback to
>>>> tell the device driver to stop using the PASID.
> Sorry I still need more clarification. For the PASID termination
> initiated by vfio unbind, I don't see device driver given a chance to
> stop PASID. Seems just call __iommu_sva_unbind_device() which already
> assume device stopped issuing DMA with the PASID.
> So it is the vfio unbind caller responsible for doing driver callback
> to stop DMA on a given PASID?
Yes, but I don't know how to implement this. Since PCI doesn't formalize
the PASID stop mechanism and the device doesn't have a kernel driver,
VFIO would need help from the userspace driver for stopping PASID
(notify the userspace driver when an other process exits).
>>>> * In either case, when receiving a stop request from the driver,
>>>> the device sends the LPIGs to the IOMMU queue.
>>>> * Then, the flush call above ensures that the IOMMU reports the
>>>> LPIG with iommu_report_device_fault.
>>>> * While submitting all LPIGs for this PASID to the work queue,
>>>> ipof_queue_fault also picked up all partial faults, so the partial
>>>> list is clean.
>>>>
>>>> Maybe I should improve this comment?
>>>>
>>> thanks for explaining. LPIG submission is done by device
>>> asynchronously w.r.t. driver stopping/decommission PASID.
>>
>> Hmm, it should really be synchronous, otherwise there is no way to
>> know when the PASID can be decommissioned. We need a guarantee such
>> as the one in 6.20.1 of the PCIe spec, "Managing PASID TLP Prefix
>> Usage":
>>
>> "When the stop request mechanism indicates completion, the Function
>> has:
>> * Completed all Non-Posted Requests associated with this PASID.
>> * Flushed to the host all Posted Requests addressing host memory in
>> all TCs that were used by the PASID."
>>
>> That's in combination with "The function shall [...] finish
>> transmitting any multi-page Page Request Messages for this PASID
>> (i.e. send the Page Request Message with the L bit Set)." from the
>> ATS spec.
>>
> I am not contesting on the device side, what I meant was from the
> host IOMMU driver perspective, LPIG is received via IOMMU host queue,
> therefore asynchronous. Not sure about ARM, but on VT-d LPIG submission
> could meet queue full condition. So per VT-d spec, iommu will generate a
> successful auto response to the device. At this point, assume we
> already stopped the given PASID on the device, there might not be
> another LPIG sent for the device. Therefore, you could have a partial
> list. I think we can just drop the requests in the partial list for
> that PASID until the PASID gets re-allocated.
Indeed, I'll add this in next version. For a complete solution to the
queue-full condition (which seems to behave the same way on ARM) I was
thinking the IOMMU driver should also have a method for removing all
partial faults when detecting a queue overflow. Since it doesn't know
which PRGs did receive an auto-response, all it can do is remove all
partial faults, for all devices using this queue. But freeing the stuck
partial faults in flush() and remove_device() should be good enough
Thanks,
Jean
^ permalink raw reply
* Re: [RFC] acpi: Declare memory allocations in acpi_ut_create_internal_object_dbg as non-leaks
From: Rafael J. Wysocki @ 2018-05-29 9:03 UTC (permalink / raw)
To: Larry Finger; +Cc: linux-kernel, linux-acpi, Erik Schmauss
In-Reply-To: <20180512145938.7936-1-Larry.Finger@lwfinger.net>
On Saturday, May 12, 2018 4:59:38 PM CEST Larry Finger wrote:
> In kernel 4.17.0-rcX, kmemleak reports 9 leaks with tracebacks similat to
> the following:
>
> unreferenced object 0xffff880224a077e0 (size 72):
> comm "swapper/0", pid 1, jiffies 4294892358 (age 1022.636s)
> hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 0e 01 01 00 00 00 00 01 ................
> 00 00 00 80 00 00 00 00 00 00 00 00 00 00 00 00 ................
> backtrace:
> [<000000004f506615>] acpi_ut_create_internal_object_dbg+0x4d/0x10e
> [<000000006e7730e3>] acpi_ds_build_internal_object+0xed/0x1cd
> [<00000000272b7c73>] acpi_ds_build_internal_package_obj+0x245/0x3a2
> [<000000000b64c50e>] acpi_ds_eval_data_object_operands+0x17b/0x21b
> [<00000000589647ac>] acpi_ds_exec_end_op+0x433/0x6c1
> [<000000001d69bcbf>] acpi_ps_parse_loop+0x926/0x9be
> [<000000005d6fa97d>] acpi_ps_parse_aml+0x1a2/0x4af
> [<00000000c4bef823>] acpi_ps_execute_table+0xbb/0x119
> [<00000000fd9632e4>] acpi_ns_execute_table+0x20c/0x260
> [<00000000e6ae17ac>] acpi_ns_parse_table+0x7d/0x1b3
> [<0000000008e1e148>] acpi_ns_load_table+0x8d/0x1c0
> [<000000009fc8346f>] acpi_tb_load_namespace+0x176/0x278
> [<0000000073f98b3b>] acpi_load_tables+0x6e/0xfd
> [<00000000d2ef13d2>] acpi_init+0x8c/0x340
> [<000000007da19d8d>] do_one_initcall+0x46/0x1fa
> [<0000000024681a1d>] kernel_init_freeable+0x1a2/0x237
>
> According to gdb, the offending code is
> object =
> acpi_ut_allocate_object_desc_dbg(module_name, line_number,
> component_id);
>
> As it is not possible to unload the acpi code to test that this is a real
> leak and not a false positive, and that only these 9 appear no matter how
> long the system is up, a kmemleak_not_leak(object) call is inserted.
>
> Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> Cc: Len Brown <lenb@kernel.org>
> Cc: linux-acpi@vger.kernel.org
> Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
> ---
> drivers/acpi/acpica/utobject.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/acpi/acpica/utobject.c b/drivers/acpi/acpica/utobject.c
> index 5b78fe08d7d7..ae6d8cc18cec 100644
> --- a/drivers/acpi/acpica/utobject.c
> +++ b/drivers/acpi/acpica/utobject.c
> @@ -8,6 +8,7 @@
> *****************************************************************************/
>
> #include <acpi/acpi.h>
> +#include <linux/kmemleak.h>
> #include "accommon.h"
> #include "acnamesp.h"
>
> @@ -70,6 +71,7 @@ union acpi_operand_object *acpi_ut_create_internal_object_dbg(const char
> if (!object) {
> return_PTR(NULL);
> }
> + kmemleak_not_leak(object);
>
> switch (type) {
> case ACPI_TYPE_REGION:
>
I've decided to apply this patch, thanks!
^ permalink raw reply
* [RESEND RFC PATCH] acpi/processor: Remove the check of duplicate processors
From: Dou Liyang @ 2018-05-28 9:08 UTC (permalink / raw)
To: linux-kernel, x86, linux-acpi; +Cc: tglx, mingo, rafael, lenb, hpa, Dou Liyang
We found there are some processors which have the same processor ID
but in different PXM in the ACPI namespace. such as this:
proc_id | pxm
--------------------
0 <-> 0
1 <-> 0
2 <-> 1
3 <-> 1
......
89 <-> 0
89 <-> 1
89 <-> 2
89 <-> 3
......
So we create a mechanism to validate them. make the processor(ID=89)
as invalid. And once a processor be hotplugged physically, we check its
processor id.
Commit 8e089eaa1999 ("acpi: Provide mechanism to validate processors in the ACPI tables")
Commit a77d6cd96849 ("acpi/processor: Check for duplicate processor ids at hotplug time")
Recently, I found the check mechanism has a bug, it didn't use the
acpi_processor_ids_walk() right and always gave us a wrong result.
First, I fixed it by modifying the value with AE_OK which is the
standard acpi_status value.
https://lkml.org/lkml/2018/3/20/273
But, now, I even think this check is useless. my reasons are following:
1). Based on the practical tests, It works well, and no bug be reported
2). Based on the code, the duplicate cases can be dealed with by
if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id))
That seems more reasonable, let's see the following case:
Before the patch, After the patch
the first processor(ID=89)
hot-add failed success
the others processor(ID=89)
hot-add failed failed
So, Remove the check code.
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
---
drivers/acpi/acpi_processor.c | 126 ------------------------------------------
include/linux/acpi.h | 3 -
2 files changed, 129 deletions(-)
diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 449d86d39965..8358708e0bbb 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -281,13 +281,6 @@ static int acpi_processor_get_info(struct acpi_device *device)
pr->acpi_id = value;
}
- if (acpi_duplicate_processor_id(pr->acpi_id)) {
- dev_err(&device->dev,
- "Failed to get unique processor _UID (0x%x)\n",
- pr->acpi_id);
- return -ENODEV;
- }
-
pr->phys_id = acpi_get_phys_id(pr->handle, device_declaration,
pr->acpi_id);
if (invalid_phys_cpuid(pr->phys_id))
@@ -579,127 +572,8 @@ static struct acpi_scan_handler processor_container_handler = {
.attach = acpi_processor_container_attach,
};
-/* The number of the unique processor IDs */
-static int nr_unique_ids __initdata;
-
-/* The number of the duplicate processor IDs */
-static int nr_duplicate_ids;
-
-/* Used to store the unique processor IDs */
-static int unique_processor_ids[] __initdata = {
- [0 ... NR_CPUS - 1] = -1,
-};
-
-/* Used to store the duplicate processor IDs */
-static int duplicate_processor_ids[] = {
- [0 ... NR_CPUS - 1] = -1,
-};
-
-static void __init processor_validated_ids_update(int proc_id)
-{
- int i;
-
- if (nr_unique_ids == NR_CPUS||nr_duplicate_ids == NR_CPUS)
- return;
-
- /*
- * Firstly, compare the proc_id with duplicate IDs, if the proc_id is
- * already in the IDs, do nothing.
- */
- for (i = 0; i < nr_duplicate_ids; i++) {
- if (duplicate_processor_ids[i] == proc_id)
- return;
- }
-
- /*
- * Secondly, compare the proc_id with unique IDs, if the proc_id is in
- * the IDs, put it in the duplicate IDs.
- */
- for (i = 0; i < nr_unique_ids; i++) {
- if (unique_processor_ids[i] == proc_id) {
- duplicate_processor_ids[nr_duplicate_ids] = proc_id;
- nr_duplicate_ids++;
- return;
- }
- }
-
- /*
- * Lastly, the proc_id is a unique ID, put it in the unique IDs.
- */
- unique_processor_ids[nr_unique_ids] = proc_id;
- nr_unique_ids++;
-}
-
-static acpi_status __init acpi_processor_ids_walk(acpi_handle handle,
- u32 lvl,
- void *context,
- void **rv)
-{
- acpi_status status;
- acpi_object_type acpi_type;
- unsigned long long uid;
- union acpi_object object = { 0 };
- struct acpi_buffer buffer = { sizeof(union acpi_object), &object };
-
- status = acpi_get_type(handle, &acpi_type);
- if (ACPI_FAILURE(status))
- return false;
-
- switch (acpi_type) {
- case ACPI_TYPE_PROCESSOR:
- status = acpi_evaluate_object(handle, NULL, NULL, &buffer);
- if (ACPI_FAILURE(status))
- goto err;
- uid = object.processor.proc_id;
- break;
-
- case ACPI_TYPE_DEVICE:
- status = acpi_evaluate_integer(handle, "_UID", NULL, &uid);
- if (ACPI_FAILURE(status))
- goto err;
- break;
- default:
- goto err;
- }
-
- processor_validated_ids_update(uid);
- return true;
-
-err:
- acpi_handle_info(handle, "Invalid processor object\n");
- return false;
-
-}
-
-static void __init acpi_processor_check_duplicates(void)
-{
- /* check the correctness for all processors in ACPI namespace */
- acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT,
- ACPI_UINT32_MAX,
- acpi_processor_ids_walk,
- NULL, NULL, NULL);
- acpi_get_devices(ACPI_PROCESSOR_DEVICE_HID, acpi_processor_ids_walk,
- NULL, NULL);
-}
-
-bool acpi_duplicate_processor_id(int proc_id)
-{
- int i;
-
- /*
- * compare the proc_id with duplicate IDs, if the proc_id is already
- * in the duplicate IDs, return true, otherwise, return false.
- */
- for (i = 0; i < nr_duplicate_ids; i++) {
- if (duplicate_processor_ids[i] == proc_id)
- return true;
- }
- return false;
-}
-
void __init acpi_processor_init(void)
{
- acpi_processor_check_duplicates();
acpi_scan_add_handler_with_hotplug(&processor_handler, "processor");
acpi_scan_add_handler(&processor_container_handler);
}
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 15bfb15c2fa5..068dcfe6768b 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -285,9 +285,6 @@ static inline bool invalid_phys_cpuid(phys_cpuid_t phys_id)
return phys_id == PHYS_CPUID_INVALID;
}
-/* Validate the processor object's proc_id */
-bool acpi_duplicate_processor_id(int proc_id);
-
#ifdef CONFIG_ACPI_HOTPLUG_CPU
/* Arch dependent functions for cpu hotplug support */
int acpi_map_cpu(acpi_handle handle, phys_cpuid_t physid, u32 acpi_id,
--
2.14.3
^ permalink raw reply related
* Re: [PATCH 0/1] Remove the check of duplicate processors
From: Dou Liyang @ 2018-05-28 8:51 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linux Kernel Mailing List, the arch/x86 maintainers,
ACPI Devel Maling List, Thomas Gleixner, Ingo Molnar,
Rafael J. Wysocki, Len Brown, H. Peter Anvin, Peter Zijlstra
In-Reply-To: <CAJZ5v0hpEf3gxfAtfufGQVyGEWoszMfEciNKiYf0TrUWNXtiuQ@mail.gmail.com>
Hi Rafael,
At 05/28/2018 04:40 PM, Rafael J. Wysocki wrote:
>
> Can you please resend the patch with the above information in the
> changelog of it?
Yes, I resend the patch right now.
>
> That would make it easier to review and discuss it.
>
> Also I think that we need some sort of a check against duplicate IDs.
>
Thanks,
dou
>
>
^ permalink raw reply
* Re: [PATCH 0/1] Remove the check of duplicate processors
From: Rafael J. Wysocki @ 2018-05-28 8:40 UTC (permalink / raw)
To: Dou Liyang
Cc: Linux Kernel Mailing List, the arch/x86 maintainers,
ACPI Devel Maling List, Thomas Gleixner, Ingo Molnar,
Rafael J. Wysocki, Len Brown, H. Peter Anvin, Peter Zijlstra
In-Reply-To: <20180517094658.18707-1-douly.fnst@cn.fujitsu.com>
On Thu, May 17, 2018 at 11:46 AM, Dou Liyang <douly.fnst@cn.fujitsu.com> wrote:
> There is a bug in CPU hotplug code, I have two simple fix method, but I can't
> ensure which one is better. So sent it out, seek help.
>
> We found some processors which have the same processor ID but in different
> PXM in ACPI namespace. such as this:
>
> proc_id | pxm
> --------------------
> 0 <-> 0
> 1 <-> 0
> 2 <-> 1
> 3 <-> 1
> ......
> 89 <-> 0
> 89 <-> 1
> 89 <-> 2
> 89 <-> 3
> ......
>
> So we create a mechanism to validate them. make the processor(ID=89)
> as invalid. And once a processor be hotplugged physically, we check its
> processor id.
>
> Commit 8e089eaa1999 ("acpi: Provide mechanism to validate processors in the ACPI tables")
> Commit a77d6cd96849 ("acpi/processor: Check for duplicate processor ids at hotplug time")
>
> Recently, I found the check mechanism has a bug, it didn't use the
> acpi_processor_ids_walk() right and always gave us a wrong result.
> First, I fixed it by modifying the value with AE_OK which is the
> standard acpi_status value.
>
> https://lkml.org/lkml/2018/3/20/273
>
> But, now, I even think this check is useless. my reasons are following:
>
> 1). Based on the practical effect, It works well, and no bug be reported
> 2). Based on the code, the duplicate cases can be dealed with by
>
> if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id))
>
> That seems more reasonable, let's see the following case:
>
> Before the patch, After the patch
>
> the first processor(ID=89)
> hot-add failed success
>
> the others processor(ID=89)
> hot-add failed failed
>
>
> So, What's your idea about it.
>
> Dou Liyang (1):
> acpi/processor: Remove the check of duplicates processor ids
Can you please resend the patch with the above information in the
changelog of it?
That would make it easier to review and discuss it.
Also I think that we need some sort of a check against duplicate IDs.
^ permalink raw reply
* Re: [PATCH v7 0/3] acpi: apei: Drop panic() on fatal errors policy
From: Rafael J. Wysocki @ 2018-05-27 9:36 UTC (permalink / raw)
To: Alexandru Gagniuc
Cc: ACPI Devel Maling List, alex_gagniuc, austin_bolen, shyam_iyer,
Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, the arch/x86 maintainers, Rafael J. Wysocki,
Len Brown, Mauro Carvalho Chehab, Robert Moore, Erik Schmauss,
Tyler Baicar, Will Deacon, James Morse, Jonathan (Zhixiong) Zhang,
Dongjiu Geng, EDAC-CO
In-Reply-To: <20180525155352.22350-1-mr.nuke.me@gmail.com>
On Fri, May 25, 2018 at 5:53 PM, Alexandru Gagniuc <mr.nuke.me@gmail.com> wrote:
> FFS (firmware-first) handling through APEI seems to have developed a
> policy to panic() on any fatal errors. This policy is completely
> independent of the non-FFS case. It is also inconsistent with how the
> native error handlers, a number of which will recover the system from
> fatal errors.
>
> The purpose of this series is to obsolete this idiotic policy, with
> the motivation to enable identical handling of PCIe errors to native
> reporting.
>
>
> Rafael, this is copypaste from the previous patch series. I suspect
> you might have missed it last time, because you asked questions which
> were answered here. I've included it so you don't have to go digging
> old emails:
I didn't miss it, but I didn't like your answers.
Anyway, as a rule, no GHES/APEI patches are applied without an ACK
from either Boris or Tony, so you need to talk to them about the
patches.
Thanks,
Rafael
^ permalink raw reply
* Re: [PATCH v2 13/40] vfio: Add support for Shared Virtual Addressing
From: Xu Zaibo @ 2018-05-26 3:53 UTC (permalink / raw)
To: Jean-Philippe Brucker,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-acpi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
Cc: ashok.raj-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
ilias.apalodimas-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org,
Will Deacon, okaya-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org,
rgummal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org,
rfranz-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org,
liguozhu-C8/M+/jPZTeaMJb+Lgu22Q@public.gmane.org,
christian.koenig-5C7GfCeVMHo@public.gmane.org
In-Reply-To: <fea420ff-e738-e2ed-ab1e-a3f4dde4b6ef-5wv7dgnIgG8@public.gmane.org>
[-- Attachment #1.1: Type: text/plain, Size: 5992 bytes --]
Hi,
On 2018/5/25 17:47, Jean-Philippe Brucker wrote:
> On 25/05/18 03:39, Xu Zaibo wrote:
>> Hi,
>>
>> On 2018/5/24 23:04, Jean-Philippe Brucker wrote:
>>> On 24/05/18 13:35, Xu Zaibo wrote:
>>>>> Right, sva_init() must be called once for any device that intends to use
>>>>> bind(). For the second process though, group->sva_enabled will be true
>>>>> so we won't call sva_init() again, only bind().
>>>> Well, while I create mediated devices based on one parent device to support multiple
>>>> processes(a new process will create a new 'vfio_group' for the corresponding mediated device,
>>>> and 'sva_enabled' cannot work any more), in fact, *sva_init and *sva_shutdown are basically
>>>> working on parent device, so, as a result, I just only need sva initiation and shutdown on the
>>>> parent device only once. So I change the two as following:
>>>>
>>>> @@ -551,8 +565,18 @@ int iommu_sva_device_init(struct device *dev, unsigned long features,
>>>> if (features & ~IOMMU_SVA_FEAT_IOPF)
>>>> return -EINVAL;
>>>>
>>>> + /* If already exists, do nothing */
>>>> + mutex_lock(&dev->iommu_param->lock);
>>>> + if (dev->iommu_param->sva_param) {
>>>> + mutex_unlock(&dev->iommu_param->lock);
>>>> + return 0;
>>>> + }
>>>> + mutex_unlock(&dev->iommu_param->lock);
>>>>
>>>> if (features & IOMMU_SVA_FEAT_IOPF) {
>>>> ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf,
>>>>
>>>>
>>>> @@ -621,6 +646,14 @@ int iommu_sva_device_shutdown(struct device *dev)
>>>> if (!domain)
>>>> return -ENODEV;
>>>>
>>>> + /* If any other process is working on the device, shut down does nothing. */
>>>> + mutex_lock(&dev->iommu_param->lock);
>>>> + if (!list_empty(&dev->iommu_param->sva_param->mm_list)) {
>>>> + mutex_unlock(&dev->iommu_param->lock);
>>>> + return 0;
>>>> + }
>>>> + mutex_unlock(&dev->iommu_param->lock);
>>> I don't think iommu-sva.c is the best place for this, it's probably
>>> better to implement an intermediate layer (the mediating driver), that
>>> calls iommu_sva_device_init() and iommu_sva_device_shutdown() once. Then
>>> vfio-pci would still call these functions itself, but for mdev the
>>> mediating driver keeps a refcount of groups, and calls device_shutdown()
>>> only when freeing the last mdev.
>>>
>>> A device driver (non mdev in this example) expects to be able to free
>>> all its resources after sva_device_shutdown() returns. Imagine the
>>> mm_list isn't empty (mm_exit() is running late), and instead of waiting
>>> in unbind_dev_all() below, we return 0 immediately. Then the calling
>>> driver frees its resources, and the mm_exit callback along with private
>>> data passed to bind() disappear. If a mm_exit() is still running in
>>> parallel, then it will try to access freed data and corrupt memory. So
>>> in this function if mm_list isn't empty, the only thing we can do is wait.
>>>
>> I still don't understand why we should 'unbind_dev_all', is it possible
>> to do a 'unbind_dev_pasid'?
> Not in sva_device_shutdown(), it needs to clean up everything. For
> example you want to physically unplug the device, or assign it to a VM.
> To prevent any leak sva_device_shutdown() needs to remove all bonds. In
> theory there shouldn't be any, since either the driver did unbind_dev(),
> or all process exited. This is a safety net.
>
>> Then we can do other things instead of waiting that user may not like. :)
> They may not like it, but it's for their own good :) At the moment we're
> waiting that:
>
> * All exit_mm() callback for this device have finished. If we don't wait
> then the caller will free the private data passed to bind and the
> mm_exit() callback while they are still being used.
>
> * All page requests targeting this device are dealt with. If we don't
> wait then some requests, that are lingering in the IOMMU PRI queue,
> may hit the next contexts bound to this device, possibly in a
> different VM. It may not be too risky (though probably exploitable in
> some way), but is incredibly messy.
>
> All of this is bounded in time, and normally should be over pretty fast
> unless the device driver's exit_mm() does something strange. If the
> driver did the right thing, there shouldn't be any wait here (although
> there may be one in unbind_dev() for the same reasons - prevent use
> after free).
>
I guess there may be some misunderstandings :).
From the current patches, '/iommu_sva_device_shutdown/' is called by
'/vfio_iommu_sva_shutdown/', which
is mainly used by '/vfio_iommu_type1_detach_group/' that is finally
called by processes' release of vfio facilitiy
automatically or called by 'VFIO_GROUP_UNSET_CONTAINER' manually in the
processes.
So, just image that 2 processes is working on the device with IOPF
feature, and the 2 do the following to enable SVA:
/ 1.open("/dev/vfio/vfio") ;//
//
// 2.open the group of the devcie by calling open("/dev/vfio/x"), but
now, //
// I think the second processes will fail to open the group because
current VFIO thinks that one group only can be used by one process/vm,
at present, I use mediated device to create more groups on the
parent device to support multiple processes;//
/ //
/ 3.VFIO_GROUP_SET_CONTAINER;/
/ 4.VFIO_SET_IOMMU;
/
/// 5.VFIO_IOMMU_BIND;
6.Do some works with the hardware working unit filled by PASID on the
device;
7.//VFIO_IOMMU_UNBIND;
//***8.VFIO_GROUP_UNSET_CONTAINER;---here, have to sleep to wait another
process to finish works of the step 6;
* 9. close(group); close(containner);
/So, my idea is: If it is possible to just release the params or
facilities that only belong to the current process while the process
shutdown the device,
and while the last process releases them all. Then, as in the above step
8, we
don't need to wait, or maybe wait for a very long time if the other
process is doing lots of work on the device.
Thanks
Zaibo
>
[-- Attachment #1.2: Type: text/html, Size: 46082 bytes --]
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply
* Re: [PATCH v2 03/40] iommu/sva: Manage process address spaces
From: Kenneth Lee @ 2018-05-26 2:24 UTC (permalink / raw)
To: Jonathan Cameron
Cc: kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
xuzaibo-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Will Deacon,
okaya-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
liguozhu-C8/M+/jPZTeaMJb+Lgu22Q,
ashok.raj-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
linux-acpi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
rfranz-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org,
devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
rgummal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
Ilias Apalodimas, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
christian.koenig-5C7GfCeVMHo@public.gmane.org
In-Reply-To: <20180525093959.000040a7-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
On Fri, May 25, 2018 at 09:39:59AM +0100, Jonathan Cameron wrote:
> Date: Fri, 25 May 2018 09:39:59 +0100
> From: Jonathan Cameron <Jonathan.Cameron-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> To: Ilias Apalodimas <ilias.apalodimas-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
> CC: Jean-Philippe Brucker <jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org>,
> "xieyisheng1-hv44wF8Li93QT0dZR+AlfA@public.gmane.org" <xieyisheng1-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>, "kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
> <kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
> <linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "xuzaibo-hv44wF8Li93QT0dZR+AlfA@public.gmane.org" <xuzaibo-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>,
> Will Deacon <Will.Deacon-5wv7dgnIgG8@public.gmane.org>, "okaya-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org"
> <okaya-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>, "linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org" <linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org>,
> "yi.l.liu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org" <yi.l.liu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>, "ashok.raj-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org"
> <ashok.raj-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>, "tn-nYOzD4b6Jr9Wk0Htik3J/w@public.gmane.org" <tn-nYOzD4b6Jr9Wk0Htik3J/w@public.gmane.org>,
> "joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org" <joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org>, "robdclark-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org"
> <robdclark-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, "bharatku-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org" <bharatku-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org>,
> "linux-acpi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org" <linux-acpi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
> "liudongdong3-hv44wF8Li93QT0dZR+AlfA@public.gmane.org" <liudongdong3-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>, "rfranz-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org"
> <rfranz-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org>, "devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
> <devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "kevin.tian-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org"
> <kevin.tian-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>, Jacob Pan <jacob.jun.pan-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>,
> "alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org" <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
> "rgummal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org" <rgummal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org>, "thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org"
> <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>, "linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org"
> <linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org>, "shunyong.yang-PT9Dzx9SjPiXmMXjJBpWqg@public.gmane.org"
> <shunyong.yang-PT9Dzx9SjPiXmMXjJBpWqg@public.gmane.org>, "dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org"
> <dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>, "liubo95-hv44wF8Li93QT0dZR+AlfA@public.gmane.org" <liubo95-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>,
> "jcrouse-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org" <jcrouse-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>,
> "iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org" <iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>,
> Robin Murphy <Robin.Murphy-5wv7dgnIgG8@public.gmane.org>, "christian.koenig-5C7GfCeVMHo@public.gmane.org"
> <christian.koenig-5C7GfCeVMHo@public.gmane.org>, "nwatters-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org"
> <nwatters-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>, "baolu.lu-VuQAYsv1563Yd54FQh9/CA@public.gmane.org"
> <baolu.lu-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>, liguozhu-C8/M+/jPZTeaMJb+Lgu22Q@public.gmane.org,
> kenneth-lee-2012-H32Fclmsjq1BDgjK7y7TUQ@public.gmane.org
> Subject: Re: [PATCH v2 03/40] iommu/sva: Manage process address spaces
> Message-ID: <20180525093959.000040a7-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>
> +CC Kenneth Lee
>
> On Fri, 25 May 2018 09:33:11 +0300
> Ilias Apalodimas <ilias.apalodimas-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>
> > On Thu, May 24, 2018 at 04:04:39PM +0100, Jean-Philippe Brucker wrote:
> > > On 24/05/18 12:50, Ilias Apalodimas wrote:
> > > >> Interesting, I hadn't thought about this use-case before. At first I
> > > >> thought you were talking about mdev devices assigned to VMs, but I think
> > > >> you're referring to mdevs assigned to userspace drivers instead? Out of
> > > >> curiosity, is it only theoretical or does someone actually need this?
> > > >
> > > > There has been some non upstreamed efforts to have mdev and produce userspace
> > > > drivers. Huawei is using it on what they call "wrapdrive" for crypto devices and
> > > > we did a proof of concept for ethernet interfaces. At the time we choose not to
> > > > involve the IOMMU for the reason you mentioned, but having it there would be
> > > > good.
> > >
> > > I'm guessing there were good reasons to do it that way but I wonder, is
> > > it not simpler to just have the kernel driver create a /dev/foo, with a
> > > standard ioctl/mmap/poll interface? Here VFIO adds a layer of
> > > indirection, and since the mediating driver has to implement these
> > > operations already, what is gained?
> > The best reason i can come up with is "common code". You already have one API
> > doing that for you so we replicate it in a /dev file?
> > The mdev approach still needs extentions to support what we tried to do (i.e
> > mdev bus might need yo have access on iommu_ops), but as far as i undestand it's
> > a possible case.
Hi, Jean, Please allow me to share my understanding here:
https://zhuanlan.zhihu.com/p/35489035
The reason we do not use the /dev/foo scheme is that the devices to be
shared are programmable accelerators. We cannot fix up the kernel driver for them.
> > >
> > > Thanks,
> > > Jean
>
>
--
-Kenneth Lee (Hisilicon)
^ permalink raw reply
* Re: [PATCH v2 03/40] iommu/sva: Manage process address spaces
From: Kenneth Lee @ 2018-05-26 2:24 UTC (permalink / raw)
To: Jonathan Cameron
Cc: xieyisheng1@huawei.com, liubo95@huawei.com, kvm@vger.kernel.org,
linux-pci@vger.kernel.org, xuzaibo@huawei.com, Will Deacon,
okaya@codeaurora.org, linux-mm@kvack.org, liguozhu,
yi.l.liu@intel.com, ashok.raj@intel.com, Jean-Philippe Brucker,
tn@semihalf.com, joro@8bytes.org,
iommu@lists.linux-foundation.org, bharatku@xilinx.com,
linux-acpi@vger.kernel.org, liudongdong3@huawei.com,
rfranz@cavium.com, devic
In-Reply-To: <20180525093959.000040a7@huawei.com>
On Fri, May 25, 2018 at 09:39:59AM +0100, Jonathan Cameron wrote:
> Date: Fri, 25 May 2018 09:39:59 +0100
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> To: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> CC: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>,
> "xieyisheng1@huawei.com" <xieyisheng1@huawei.com>, "kvm@vger.kernel.org"
> <kvm@vger.kernel.org>, "linux-pci@vger.kernel.org"
> <linux-pci@vger.kernel.org>, "xuzaibo@huawei.com" <xuzaibo@huawei.com>,
> Will Deacon <Will.Deacon@arm.com>, "okaya@codeaurora.org"
> <okaya@codeaurora.org>, "linux-mm@kvack.org" <linux-mm@kvack.org>,
> "yi.l.liu@intel.com" <yi.l.liu@intel.com>, "ashok.raj@intel.com"
> <ashok.raj@intel.com>, "tn@semihalf.com" <tn@semihalf.com>,
> "joro@8bytes.org" <joro@8bytes.org>, "robdclark@gmail.com"
> <robdclark@gmail.com>, "bharatku@xilinx.com" <bharatku@xilinx.com>,
> "linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
> "liudongdong3@huawei.com" <liudongdong3@huawei.com>, "rfranz@cavium.com"
> <rfranz@cavium.com>, "devicetree@vger.kernel.org"
> <devicetree@vger.kernel.org>, "kevin.tian@intel.com"
> <kevin.tian@intel.com>, Jacob Pan <jacob.jun.pan@linux.intel.com>,
> "alex.williamson@redhat.com" <alex.williamson@redhat.com>,
> "rgummal@xilinx.com" <rgummal@xilinx.com>, "thunder.leizhen@huawei.com"
> <thunder.leizhen@huawei.com>, "linux-arm-kernel@lists.infradead.org"
> <linux-arm-kernel@lists.infradead.org>, "shunyong.yang@hxt-semitech.com"
> <shunyong.yang@hxt-semitech.com>, "dwmw2@infradead.org"
> <dwmw2@infradead.org>, "liubo95@huawei.com" <liubo95@huawei.com>,
> "jcrouse@codeaurora.org" <jcrouse@codeaurora.org>,
> "iommu@lists.linux-foundation.org" <iommu@lists.linux-foundation.org>,
> Robin Murphy <Robin.Murphy@arm.com>, "christian.koenig@amd.com"
> <christian.koenig@amd.com>, "nwatters@codeaurora.org"
> <nwatters@codeaurora.org>, "baolu.lu@linux.intel.com"
> <baolu.lu@linux.intel.com>, liguozhu@hisilicon.com,
> kenneth-lee-2012@foxmail.com
> Subject: Re: [PATCH v2 03/40] iommu/sva: Manage process address spaces
> Message-ID: <20180525093959.000040a7@huawei.com>
>
> +CC Kenneth Lee
>
> On Fri, 25 May 2018 09:33:11 +0300
> Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
>
> > On Thu, May 24, 2018 at 04:04:39PM +0100, Jean-Philippe Brucker wrote:
> > > On 24/05/18 12:50, Ilias Apalodimas wrote:
> > > >> Interesting, I hadn't thought about this use-case before. At first I
> > > >> thought you were talking about mdev devices assigned to VMs, but I think
> > > >> you're referring to mdevs assigned to userspace drivers instead? Out of
> > > >> curiosity, is it only theoretical or does someone actually need this?
> > > >
> > > > There has been some non upstreamed efforts to have mdev and produce userspace
> > > > drivers. Huawei is using it on what they call "wrapdrive" for crypto devices and
> > > > we did a proof of concept for ethernet interfaces. At the time we choose not to
> > > > involve the IOMMU for the reason you mentioned, but having it there would be
> > > > good.
> > >
> > > I'm guessing there were good reasons to do it that way but I wonder, is
> > > it not simpler to just have the kernel driver create a /dev/foo, with a
> > > standard ioctl/mmap/poll interface? Here VFIO adds a layer of
> > > indirection, and since the mediating driver has to implement these
> > > operations already, what is gained?
> > The best reason i can come up with is "common code". You already have one API
> > doing that for you so we replicate it in a /dev file?
> > The mdev approach still needs extentions to support what we tried to do (i.e
> > mdev bus might need yo have access on iommu_ops), but as far as i undestand it's
> > a possible case.
Hi, Jean, Please allow me to share my understanding here:
https://zhuanlan.zhihu.com/p/35489035
The reason we do not use the /dev/foo scheme is that the devices to be
shared are programmable accelerators. We cannot fix up the kernel driver for them.
> > >
> > > Thanks,
> > > Jean
>
>
--
-Kenneth Lee (Hisilicon)
^ permalink raw reply
* Re: [PATCH v2 07/40] iommu: Add a page fault handler
From: Jacob Pan @ 2018-05-26 0:35 UTC (permalink / raw)
To: Jean-Philippe Brucker
Cc: ashok.raj-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-acpi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
xuzaibo-hv44wF8Li93QT0dZR+AlfA@public.gmane.org,
ilias.apalodimas-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org,
Will Deacon,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
okaya-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
rgummal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org,
rfranz-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org,
dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
christian.koenig-5C7GfCeVMHo@public.gmane.org,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
In-Reply-To: <bdf9f221-ab97-2168-d072-b7f6a0dba840-5wv7dgnIgG8@public.gmane.org>
On Thu, 24 May 2018 12:44:38 +0100
Jean-Philippe Brucker <jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org> wrote:
> On 23/05/18 00:35, Jacob Pan wrote:
> >>>> + /* Insert *before* the last fault */
> >>>> + list_move(&fault->head, &group->faults);
> >>>> + }
> >>>> +
> >>> If you already sorted the group list with last fault at the end,
> >>> why do you need a separate entry to track the last fault?
> >>
> >> Not sure I understand your question, sorry. Do you mean why the
> >> iopf_group.last_fault? Just to avoid one more kzalloc.
> >>
> > kind of :) what i thought was that why can't the last_fault
> > naturally be the last entry in your group fault list? then there is
> > no need for special treatment in terms of allocation of the last
> > fault. just my preference.
>
> But we need a kzalloc for the last fault anyway, so I thought I'd just
> piggy-back on the group allocation. And if I don't add the last fault
> at the end of group->faults, then it's iopf_handle_group that requires
> special treatment. I'm still not sure I understood your question
> though, could you send me a patch that does it?
>
> >>>> +
> >>>> + queue->flush(queue->flush_arg, dev);
> >>>> +
> >>>> + /*
> >>>> + * No need to clear the partial list. All PRGs
> >>>> containing the PASID that
> >>>> + * needs to be decommissioned are whole (the device
> >>>> driver made sure of
> >>>> + * it before this function was called). They have been
> >>>> submitted to the
> >>>> + * queue by the above flush().
> >>>> + */
> >>> So you are saying device driver need to make sure LPIG PRQ is
> >>> submitted in the flush call above such that partial list is
> >>> cleared?
> >>
> >> Not exactly, it's the IOMMU driver that makes sure all LPIG in its
> >> queues are submitted by the above flush call. In more details the
> >> flow is:
> >>
> >> * Either device driver calls unbind()/sva_device_shutdown(), or the
> >> process exits.
> >> * If the device driver called, then it already told the device to
> >> stop using the PASID. Otherwise we use the mm_exit() callback to
> >> tell the device driver to stop using the PASID.
Sorry I still need more clarification. For the PASID termination
initiated by vfio unbind, I don't see device driver given a chance to
stop PASID. Seems just call __iommu_sva_unbind_device() which already
assume device stopped issuing DMA with the PASID.
So it is the vfio unbind caller responsible for doing driver callback
to stop DMA on a given PASID?
> >> * In either case, when receiving a stop request from the driver,
> >> the device sends the LPIGs to the IOMMU queue.
> >> * Then, the flush call above ensures that the IOMMU reports the
> >> LPIG with iommu_report_device_fault.
> >> * While submitting all LPIGs for this PASID to the work queue,
> >> ipof_queue_fault also picked up all partial faults, so the partial
> >> list is clean.
> >>
> >> Maybe I should improve this comment?
> >>
> > thanks for explaining. LPIG submission is done by device
> > asynchronously w.r.t. driver stopping/decommission PASID.
>
> Hmm, it should really be synchronous, otherwise there is no way to
> know when the PASID can be decommissioned. We need a guarantee such
> as the one in 6.20.1 of the PCIe spec, "Managing PASID TLP Prefix
> Usage":
>
> "When the stop request mechanism indicates completion, the Function
> has:
> * Completed all Non-Posted Requests associated with this PASID.
> * Flushed to the host all Posted Requests addressing host memory in
> all TCs that were used by the PASID."
>
> That's in combination with "The function shall [...] finish
> transmitting any multi-page Page Request Messages for this PASID
> (i.e. send the Page Request Message with the L bit Set)." from the
> ATS spec.
>
I am not contesting on the device side, what I meant was from the
host IOMMU driver perspective, LPIG is received via IOMMU host queue,
therefore asynchronous. Not sure about ARM, but on VT-d LPIG submission
could meet queue full condition. So per VT-d spec, iommu will generate a
successful auto response to the device. At this point, assume we
already stopped the given PASID on the device, there might not be
another LPIG sent for the device. Therefore, you could have a partial
list. I think we can just drop the requests in the partial list for
that PASID until the PASID gets re-allocated.
> (If I remember correctly a PRI Page Request is a Posted Request.) Only
> after this stop request completes can the driver call unbind(), or
> return from exit_mm(). Then we know that if there was a LPIG for that
> PASID, it is in the IOMMU's PRI queue (or already completed) once we
> call flush().
>
agreed.
> > so if we were to use this
> > flow on vt-d, which does not stall page request queue, then we
> > should use the iommu model specific flush() callback to ensure LPIG
> > is received? There could be queue full condition and retry. I am
> > just trying to understand how and where we can make sure LPIG is
> > received and all groups are whole.
>
> For SMMU in patch 30, the flush() callback waits until the PRI queue
> is empty or, when the PRI thread is running in parallel, until the
> thread has done a full circle (handled as many faults as the queue
> size). It's really unpleasant to implement because the flush()
> callback takes a lock to inspect the hardware state, but I don't
> think we have a choice.
>
yes, vt-d has similar situation in page request queue. one option is to
track queue head (SW update) to make sure one complete cycle when queue
tail(HW update) crosses. Or we(suggested by Ashok Raj) can take a
snapshot of the entire queue and process (drops PRQs belong to the
terminated PASID) without holding the queue.
Thanks,
Jacob
^ permalink raw reply
* [PATCH v7 3/3] acpi: apei: Do not panic() on PCIe errors reported through GHES
From: Alexandru Gagniuc @ 2018-05-25 15:53 UTC (permalink / raw)
To: linux-acpi
Cc: alex_gagniuc, austin_bolen, shyam_iyer, Alexandru Gagniuc,
Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, x86, Rafael J. Wysocki, Len Brown,
Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar,
Will Deacon, James Morse, Jonathan (Zhixiong) Zhang, Dongjiu Geng,
linux-edac, linux-kernel
In-Reply-To: <20180525155352.22350-1-mr.nuke.me@gmail.com>
As previously noted, the policy to panic on any "Fatal" GHES error is
not suitable for several classes of errors. The most notable is
error containment. The correct policy is to achieve identical behavior
to native error handling -- i.e. when not reported through GHES. This,
in special cases, may not be possible, as we have to exit NMIs, which
requires these special considerations
PCIe AER errors are contained and reported at the root port. On DPC
capable hardware, containment can be done by all downstream ports. DPC
also has the added advantage of preventing future errors. Since these
errors stop at the root port, we can do all the work we need to exit
NMI and reach the error handler.
This patch does away with the mindless crashing of the system, and
correctly invokes the AER handler. When AER is not enabled, or the
firmware doesn't provide sufficient information to identify the source
of the error, the original panic() behavior is maintained.
Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
drivers/acpi/apei/ghes.c | 43 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 41 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1b22e18168f5..f7126f6d8d52 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -425,7 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
* GHES_SEV_RECOVERABLE -> AER_NONFATAL
* GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
* These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_FATAL does not make it to this handler
+ * GHES_SEV_FATAL -> AER_FATAL
*/
static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
{
@@ -837,6 +837,45 @@ static inline void ghes_sea_remove(struct ghes *ghes) { }
static struct llist_head ghes_estatus_llist;
static struct irq_work ghes_proc_irq_work;
+/* PCIe AER errors are safe if AER section contains enough info. */
+static int ghes_pcie_has_safe_handler(struct acpi_hest_generic_data *gdata)
+{
+ struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+ if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+ pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
+ IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
+ return true;
+
+ return false;
+}
+
+/*
+ * Do we have an error handler that we can safely reach? We're concerned with
+ * being able to notify an error handler by crossing the NMI/IRQ boundary,
+ * being able to schedule_work, and so forth.
+ */
+static int ghes_has_fatal_handler(struct ghes *ghes)
+{
+ int worst_sev, sec_sev;
+ bool safe = true;
+ struct acpi_hest_generic_data *gdata;
+ const guid_t *section_type;
+ const struct acpi_hest_generic_status *estatus = ghes->estatus;
+
+ apei_estatus_for_each_section(estatus, gdata) {
+ section_type = (guid_t *)gdata->section_type;
+
+ if (guid_equal(section_type, &CPER_SEC_PCIE))
+ safe = ghes_pcie_has_safe_handler(gdata);
+
+ if (!safe)
+ break;
+ }
+
+ return safe;
+}
+
/*
* NMI may be triggered on any CPU, so ghes_in_nmi is used for
* having only one concurrent reader.
@@ -944,7 +983,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
}
sev = ghes_cper_severity(ghes->estatus->error_severity);
- if (sev >= GHES_SEV_FATAL) {
+ if ((sev >= GHES_SEV_FATAL) && !ghes_has_fatal_handler(ghes)) {
oops_begin();
ghes_print_queued_estatus();
__ghes_panic(ghes);
--
2.14.3
^ permalink raw reply related
* [PATCH v7 2/3] acpi: apei: Rename ghes_severity() to ghes_cper_severity()
From: Alexandru Gagniuc @ 2018-05-25 15:53 UTC (permalink / raw)
To: linux-acpi
Cc: alex_gagniuc, austin_bolen, shyam_iyer, Alexandru Gagniuc,
Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, x86, Rafael J. Wysocki, Len Brown,
Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar,
Will Deacon, James Morse, Jonathan (Zhixiong) Zhang, Dongjiu Geng,
linux-edac, linux-kernel
In-Reply-To: <20180525155352.22350-1-mr.nuke.me@gmail.com>
ghes_severity() is a misnomer in this case, as it implies the severity
of the entire GHES structure. Instead, it maps one CPER value to one
GHES_SEV* value. ghes_cper_severity() is clearer.
Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
drivers/acpi/apei/ghes.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 26a41bbe222b..1b22e18168f5 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -271,7 +271,7 @@ static void ghes_fini(struct ghes *ghes)
unmap_gen_v2(ghes);
}
-static inline int ghes_severity(int severity)
+static inline int ghes_cper_severity(int severity)
{
switch (severity) {
case CPER_SEV_INFORMATIONAL:
@@ -388,7 +388,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
#ifdef CONFIG_ACPI_APEI_MEMORY_FAILURE
unsigned long pfn;
int flags = -1;
- int sec_sev = ghes_severity(gdata->error_severity);
+ int sec_sev = ghes_cper_severity(gdata->error_severity);
struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
@@ -467,10 +467,10 @@ static void ghes_do_proc(struct ghes *ghes,
guid_t *fru_id = &NULL_UUID_LE;
char *fru_text = "";
- sev = ghes_severity(estatus->error_severity);
+ sev = ghes_cper_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
sec_type = (guid_t *)gdata->section_type;
- sec_sev = ghes_severity(gdata->error_severity);
+ sec_sev = ghes_cper_severity(gdata->error_severity);
if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
fru_id = (guid_t *)gdata->fru_id;
@@ -511,7 +511,7 @@ static void __ghes_print_estatus(const char *pfx,
char pfx_seq[64];
if (pfx == NULL) {
- if (ghes_severity(estatus->error_severity) <=
+ if (ghes_cper_severity(estatus->error_severity) <=
GHES_SEV_CORRECTED)
pfx = KERN_WARNING;
else
@@ -533,7 +533,7 @@ static int ghes_print_estatus(const char *pfx,
static DEFINE_RATELIMIT_STATE(ratelimit_uncorrected, 5*HZ, 2);
struct ratelimit_state *ratelimit;
- if (ghes_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
+ if (ghes_cper_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
ratelimit = &ratelimit_corrected;
else
ratelimit = &ratelimit_uncorrected;
@@ -704,9 +704,8 @@ static int ghes_proc(struct ghes *ghes)
if (rc)
goto out;
- if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_FATAL) {
+ if (ghes_cper_severity(ghes->estatus->error_severity) >= GHES_SEV_FATAL)
__ghes_panic(ghes);
- }
if (!ghes_estatus_cached(ghes->estatus)) {
if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
@@ -944,7 +943,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
ret = NMI_HANDLED;
}
- sev = ghes_severity(ghes->estatus->error_severity);
+ sev = ghes_cper_severity(ghes->estatus->error_severity);
if (sev >= GHES_SEV_FATAL) {
oops_begin();
ghes_print_queued_estatus();
--
2.14.3
^ permalink raw reply related
* [PATCH v7 1/3] acpi: apei: Rename GHES_SEV_PANIC to GHES_SEV_FATAL
From: Alexandru Gagniuc @ 2018-05-25 15:53 UTC (permalink / raw)
To: linux-acpi
Cc: alex_gagniuc, austin_bolen, shyam_iyer, Alexandru Gagniuc,
Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, x86, Rafael J. Wysocki, Len Brown,
Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar,
Will Deacon, James Morse, Jonathan (Zhixiong) Zhang, Dongjiu Geng,
linux-edac, linux-kernel
In-Reply-To: <20180525155352.22350-1-mr.nuke.me@gmail.com>
'GHES_SEV_PANIC' implies that the kernel must panic. That was true
many years ago when fatal errors could not be handled and recovered.
However, this is no longer the case with PCIe AER and DPC errors. The
latter class of errors are contained at the hardware level.
'GHES_SEV_PANIC' is confusing because it implies a policy to crash the
system on fatal errors. Drop this questionable policy, and rename the
enum to 'GHES_SEV_FATAL' to better convey the meaning.
Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
arch/x86/kernel/cpu/mcheck/mce-apei.c | 2 +-
drivers/acpi/apei/ghes.c | 11 +++++------
drivers/edac/ghes_edac.c | 2 +-
include/acpi/ghes.h | 2 +-
4 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
index 2eee85379689..cbec89f5cdf0 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-apei.c
@@ -53,7 +53,7 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
if (severity >= GHES_SEV_RECOVERABLE)
m.status |= MCI_STATUS_UC;
- if (severity >= GHES_SEV_PANIC) {
+ if (severity >= GHES_SEV_FATAL) {
m.status |= MCI_STATUS_PCC;
m.tsc = rdtsc();
}
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1efefe919555..26a41bbe222b 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -281,10 +281,10 @@ static inline int ghes_severity(int severity)
case CPER_SEV_RECOVERABLE:
return GHES_SEV_RECOVERABLE;
case CPER_SEV_FATAL:
- return GHES_SEV_PANIC;
+ return GHES_SEV_FATAL;
default:
/* Unknown, go panic */
- return GHES_SEV_PANIC;
+ return GHES_SEV_FATAL;
}
}
@@ -425,8 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
* GHES_SEV_RECOVERABLE -> AER_NONFATAL
* GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
* These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_PANIC does not make it to this handling since the kernel must
- * panic.
+ * GHES_SEV_FATAL does not make it to this handler
*/
static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
{
@@ -705,7 +704,7 @@ static int ghes_proc(struct ghes *ghes)
if (rc)
goto out;
- if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
+ if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_FATAL) {
__ghes_panic(ghes);
}
@@ -946,7 +945,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
}
sev = ghes_severity(ghes->estatus->error_severity);
- if (sev >= GHES_SEV_PANIC) {
+ if (sev >= GHES_SEV_FATAL) {
oops_begin();
ghes_print_queued_estatus();
__ghes_panic(ghes);
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 68b6ee18bea6..8455758327d4 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -220,7 +220,7 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
case GHES_SEV_RECOVERABLE:
type = HW_EVENT_ERR_UNCORRECTED;
break;
- case GHES_SEV_PANIC:
+ case GHES_SEV_FATAL:
type = HW_EVENT_ERR_FATAL;
break;
default:
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 8feb0c866ee0..322f7ede24bd 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -49,7 +49,7 @@ enum {
GHES_SEV_NO = 0x0,
GHES_SEV_CORRECTED = 0x1,
GHES_SEV_RECOVERABLE = 0x2,
- GHES_SEV_PANIC = 0x3,
+ GHES_SEV_FATAL = 0x3,
};
/* From drivers/edac/ghes_edac.c */
--
2.14.3
^ permalink raw reply related
* [PATCH v7 0/3] acpi: apei: Drop panic() on fatal errors policy
From: Alexandru Gagniuc @ 2018-05-25 15:53 UTC (permalink / raw)
To: linux-acpi
Cc: alex_gagniuc, austin_bolen, shyam_iyer, Alexandru Gagniuc,
Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, x86, Rafael J. Wysocki, Len Brown,
Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar,
Will Deacon, James Morse, Jonathan (Zhixiong) Zhang, Dongjiu Geng,
linux-edac, linux-kernel
FFS (firmware-first) handling through APEI seems to have developed a
policy to panic() on any fatal errors. This policy is completely
independent of the non-FFS case. It is also inconsistent with how the
native error handlers, a number of which will recover the system from
fatal errors.
The purpose of this series is to obsolete this idiotic policy, with
the motivation to enable identical handling of PCIe errors to native
reporting.
Rafael, this is copypaste from the previous patch series. I suspect
you might have missed it last time, because you asked questions which
were answered here. I've included it so you don't have to go digging
old emails:
"
The purpose of these changes is to see if we can safely de-escalate
the situation and notify the appropriate error handler. Since FFS
reports errors through NMIs or other non-standard mechanism, we have
to be just a little more careful with reporting the error.
We're concerned with things, such as being able to cross the NMI/IRQ
boundary, or being able to safely schedule work and notify the
appropriate subsystem. Once the notification is sent, our job is done.
I'm explicitly _NOT_ concerned with whether the error is handled or
not, especially since such concern reduces to a call to __ghes_panic().
There are rare cases that prevent us from de-escalating to lesser
contexts, such as uncorrectable memory errors in kernel. In these sort
of cases, trying to leave the NMI might cause a triple fault. James
Morse explained this very well when discussing v1 of this series. In
and only in such cases, we are justified to panic().
Once the error is safely sent its merry way, it's really up to the
error handler to panic() or continue. For example, aer_recover_queue()
might for ungodly reasons fail. However, it's up to the AER code to
decide whether failing to queue an error for handling is panic worthy.
"
Changes since v6:
- Fixed silly compilation warning
- Dropped concept of
Changes since v5:
- Removed zoological references from commit message
Changes since v4:
- Fix Freudian slip and use GHES_ instead of CPER_ enum
- Rephrased comments to clarify what we don't care about
Changes since v3:
- Renamed ghes_severity to something more concrete
- Reorganized code to make it look like more than just a rename
- Remembered to remove last patch in the series
Changes since v2:
- Due to popular request, simple is chosen over flexible
- Removed splitting of handlers into irq safe portion.
- Change behavior only for PCIe errors
Changes since v1:
- Due to popular request, the panic() is left in the NMI handler
- GHES AER handler is split into NMI and non-NMI portions
- ghes_notify_nmi() does not panic on deferrable errors
- The handlers are put in a mapping and given a common call signature
Alexandru Gagniuc (3):
acpi: apei: Rename GHES_SEV_PANIC to GHES_SEV_FATAL
acpi: apei: Rename ghes_severity() to ghes_cper_severity()
acpi: apei: Do not panic() on PCIe errors reported through GHES
arch/x86/kernel/cpu/mcheck/mce-apei.c | 2 +-
drivers/acpi/apei/ghes.c | 65 +++++++++++++++++++++++++++--------
drivers/edac/ghes_edac.c | 2 +-
include/acpi/ghes.h | 2 +-
4 files changed, 54 insertions(+), 17 deletions(-)
--
2.14.3
^ permalink raw reply
* RE: [PATCH v2 39/40] iommu/arm-smmu-v3: Add support for PRI
From: Bharat Kumar Gogada @ 2018-05-25 14:08 UTC (permalink / raw)
To: Jean-Philippe Brucker,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-acpi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
Cc: xuzaibo-hv44wF8Li93QT0dZR+AlfA@public.gmane.org,
will.deacon-5wv7dgnIgG8@public.gmane.org,
okaya-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org,
ashok.raj-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
rfranz-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org,
Ravikiran Gummaluri,
ilias.apalodimas-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org,
dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
christian.koenig-5C7GfCeVMHo@public.gmane.org
In-Reply-To: <20180511190641.23008-40-jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org>
>
> For PCI devices that support it, enable the PRI capability and handle PRI Page
> Requests with the generic fault handler. It is enabled on demand by
> iommu_sva_device_init().
>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org>
>
> ---
> v1->v2:
> * Terminate the page request and disable PRI if no handler is registered
> * Enable and disable PRI in sva_device_init/shutdown, instead of
> add/remove_device
> ---
> drivers/iommu/arm-smmu-v3.c | 192 +++++++++++++++++++++++++++-------
> --
> 1 file changed, 145 insertions(+), 47 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 6cb69ace371b..0edbb8d19579 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -248,6 +248,7 @@
> #define STRTAB_STE_1_S1COR GENMASK_ULL(5, 4)
> #define STRTAB_STE_1_S1CSH GENMASK_ULL(7, 6)
>
> +#define STRTAB_STE_1_PPAR (1UL << 18)
> #define STRTAB_STE_1_S1STALLD (1UL << 27)
>
> #define STRTAB_STE_1_EATS GENMASK_ULL(29, 28)
> @@ -309,6 +310,9 @@
> #define CMDQ_PRI_0_SID GENMASK_ULL(63, 32)
> #define CMDQ_PRI_1_GRPID GENMASK_ULL(8, 0)
> #define CMDQ_PRI_1_RESP GENMASK_ULL(13, 12)
> +#define CMDQ_PRI_1_RESP_FAILURE
> FIELD_PREP(CMDQ_PRI_1_RESP, 0UL)
> +#define CMDQ_PRI_1_RESP_INVALID
> FIELD_PREP(CMDQ_PRI_1_RESP, 1UL)
> +#define CMDQ_PRI_1_RESP_SUCCESS
> FIELD_PREP(CMDQ_PRI_1_RESP, 2UL)
>
> #define CMDQ_RESUME_0_SID GENMASK_ULL(63, 32)
> #define CMDQ_RESUME_0_ACTION_RETRY (1UL << 12)
> @@ -383,12 +387,6 @@ module_param_named(disable_ats_check,
> disable_ats_check, bool, S_IRUGO);
> MODULE_PARM_DESC(disable_ats_check,
> "By default, the SMMU checks whether each incoming transaction
> marked as translated is allowed by the stream configuration. This option
> disables the check.");
>
> -enum pri_resp {
> - PRI_RESP_DENY = 0,
> - PRI_RESP_FAIL = 1,
> - PRI_RESP_SUCC = 2,
> -};
> -
> enum arm_smmu_msi_index {
> EVTQ_MSI_INDEX,
> GERROR_MSI_INDEX,
> @@ -471,7 +469,7 @@ struct arm_smmu_cmdq_ent {
> u32 sid;
> u32 ssid;
> u16 grpid;
> - enum pri_resp resp;
> + enum page_response_code resp;
> } pri;
>
> #define CMDQ_OP_RESUME 0x44
> @@ -556,6 +554,7 @@ struct arm_smmu_strtab_ent {
> struct arm_smmu_s2_cfg *s2_cfg;
>
> bool can_stall;
> + bool prg_resp_needs_ssid;
> };
>
> struct arm_smmu_strtab_cfg {
> @@ -907,14 +906,18 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd,
> struct arm_smmu_cmdq_ent *ent)
> cmd[0] |= FIELD_PREP(CMDQ_PRI_0_SID, ent->pri.sid);
> cmd[1] |= FIELD_PREP(CMDQ_PRI_1_GRPID, ent->pri.grpid);
> switch (ent->pri.resp) {
> - case PRI_RESP_DENY:
> - case PRI_RESP_FAIL:
> - case PRI_RESP_SUCC:
> + case IOMMU_PAGE_RESP_FAILURE:
> + cmd[1] |= CMDQ_PRI_1_RESP_FAILURE;
> + break;
> + case IOMMU_PAGE_RESP_INVALID:
> + cmd[1] |= CMDQ_PRI_1_RESP_INVALID;
> + break;
> + case IOMMU_PAGE_RESP_SUCCESS:
> + cmd[1] |= CMDQ_PRI_1_RESP_SUCCESS;
> break;
> default:
> return -EINVAL;
> }
> - cmd[1] |= FIELD_PREP(CMDQ_PRI_1_RESP, ent->pri.resp);
> break;
> case CMDQ_OP_RESUME:
> cmd[0] |= FIELD_PREP(CMDQ_RESUME_0_SID, ent-
> >resume.sid); @@ -1114,8 +1117,15 @@ static int
> arm_smmu_page_response(struct device *dev,
> cmd.resume.sid = sid;
> cmd.resume.stag = resp->page_req_group_id;
> cmd.resume.resp = resp->resp_code;
> + } else if (master->can_fault) {
> + cmd.opcode = CMDQ_OP_PRI_RESP;
> + cmd.substream_valid = resp->pasid_present &&
> + master->ste.prg_resp_needs_ssid;
> + cmd.pri.sid = sid;
> + cmd.pri.ssid = resp->pasid;
> + cmd.pri.grpid = resp->page_req_group_id;
> + cmd.pri.resp = resp->resp_code;
> } else {
> - /* TODO: put PRI response here */
> return -ENODEV;
> }
>
> @@ -1236,6 +1246,9 @@ static void arm_smmu_write_strtab_ent(struct
> arm_smmu_device *smmu, u32 sid,
> FIELD_PREP(STRTAB_STE_1_S1CSH,
> ARM_SMMU_SH_ISH) |
> FIELD_PREP(STRTAB_STE_1_STRW, strw));
>
> + if (ste->prg_resp_needs_ssid)
> + dst[1] |= STRTAB_STE_1_PPAR;
> +
> if (smmu->features & ARM_SMMU_FEAT_STALLS &&
> !(smmu->features & ARM_SMMU_FEAT_STALL_FORCE) &&
> !ste->can_stall)
> @@ -1471,39 +1484,54 @@ static irqreturn_t arm_smmu_evtq_thread(int
> irq, void *dev)
>
> static void arm_smmu_handle_ppr(struct arm_smmu_device *smmu, u64
> *evt) {
> - u32 sid, ssid;
> - u16 grpid;
> - bool ssv, last;
> -
> - sid = FIELD_GET(PRIQ_0_SID, evt[0]);
> - ssv = FIELD_GET(PRIQ_0_SSID_V, evt[0]);
> - ssid = ssv ? FIELD_GET(PRIQ_0_SSID, evt[0]) : 0;
> - last = FIELD_GET(PRIQ_0_PRG_LAST, evt[0]);
> - grpid = FIELD_GET(PRIQ_1_PRG_IDX, evt[1]);
> -
> - dev_info(smmu->dev, "unexpected PRI request received:\n");
> - dev_info(smmu->dev,
> - "\tsid 0x%08x.0x%05x: [%u%s] %sprivileged %s%s%s access
> at iova 0x%016llx\n",
> - sid, ssid, grpid, last ? "L" : "",
> - evt[0] & PRIQ_0_PERM_PRIV ? "" : "un",
> - evt[0] & PRIQ_0_PERM_READ ? "R" : "",
> - evt[0] & PRIQ_0_PERM_WRITE ? "W" : "",
> - evt[0] & PRIQ_0_PERM_EXEC ? "X" : "",
> - evt[1] & PRIQ_1_ADDR_MASK);
> -
> - if (last) {
> - struct arm_smmu_cmdq_ent cmd = {
> - .opcode =
> CMDQ_OP_PRI_RESP,
> - .substream_valid = ssv,
> - .pri = {
> - .sid = sid,
> - .ssid = ssid,
> - .grpid = grpid,
> - .resp = PRI_RESP_DENY,
> - },
> + u32 sid = FIELD_PREP(PRIQ_0_SID, evt[0]);
> +
> + struct arm_smmu_master_data *master;
> + struct iommu_fault_event fault = {
> + .type = IOMMU_FAULT_PAGE_REQ,
> + .last_req = FIELD_GET(PRIQ_0_PRG_LAST,
> evt[0]),
> + .pasid_valid = FIELD_GET(PRIQ_0_SSID_V, evt[0]),
> + .pasid = FIELD_GET(PRIQ_0_SSID, evt[0]),
> + .page_req_group_id = FIELD_GET(PRIQ_1_PRG_IDX,
> evt[1]),
> + .addr = evt[1] & PRIQ_1_ADDR_MASK,
> + };
> +
> + if (evt[0] & PRIQ_0_PERM_READ)
> + fault.prot |= IOMMU_FAULT_READ;
> + if (evt[0] & PRIQ_0_PERM_WRITE)
> + fault.prot |= IOMMU_FAULT_WRITE;
> + if (evt[0] & PRIQ_0_PERM_EXEC)
> + fault.prot |= IOMMU_FAULT_EXEC;
> + if (evt[0] & PRIQ_0_PERM_PRIV)
> + fault.prot |= IOMMU_FAULT_PRIV;
> +
> + /* Discard Stop PASID marker, it isn't used */
> + if (!(fault.prot & (IOMMU_FAULT_READ|IOMMU_FAULT_WRITE)) &&
> + fault.last_req)
> + return;
> +
> + master = arm_smmu_find_master(smmu, sid);
> + if (WARN_ON(!master))
> + return;
> +
> + if (iommu_report_device_fault(master->dev, &fault)) {
> + /*
> + * No handler registered, so subsequent faults won't produce
> + * better results. Try to disable PRI.
> + */
> + struct page_response_msg page_response = {
> + .addr = fault.addr,
> + .pasid = fault.pasid,
> + .pasid_present = fault.pasid_valid,
> + .page_req_group_id = fault.page_req_group_id,
> + .resp_code =
> IOMMU_PAGE_RESP_FAILURE,
> };
>
> - arm_smmu_cmdq_issue_cmd(smmu, &cmd);
> + dev_warn(master->dev,
> + "PPR 0x%x:0x%llx 0x%x: nobody cared, disabling
> PRI\n",
> + fault.pasid_valid ? fault.pasid : 0, fault.addr,
> + fault.prot);
> + arm_smmu_page_response(master->dev, &page_response);
> }
> }
>
> @@ -1529,6 +1557,11 @@ static irqreturn_t arm_smmu_priq_thread(int irq,
> void *dev)
> }
>
> if (queue_sync_prod(q) == -EOVERFLOW)
> + /*
> + * TODO: flush pending faults, since the SMMU might
> have
> + * auto-responded to the Last request of a pending
> + * group
> + */
> dev_err(smmu->dev, "PRIQ overflow detected --
> requests lost\n");
> } while (!queue_empty(q));
>
> @@ -1577,7 +1610,8 @@ static int arm_smmu_flush_queues(void *cookie,
> struct device *dev)
> master = dev->iommu_fwspec->iommu_priv;
> if (master->ste.can_stall)
> arm_smmu_flush_queue(smmu, &smmu->evtq.q,
> "evtq");
> - /* TODO: add support for PRI */
> + else if (master->can_fault)
> + arm_smmu_flush_queue(smmu, &smmu->priq.q,
> "priq");
> return 0;
> }
>
> @@ -2301,6 +2335,59 @@ arm_smmu_iova_to_phys(struct iommu_domain
> *domain, dma_addr_t iova)
> return ops->iova_to_phys(ops, iova);
> }
>
> +static int arm_smmu_enable_pri(struct arm_smmu_master_data *master) {
> + int ret, pos;
> + struct pci_dev *pdev;
> + /*
> + * TODO: find a good inflight PPR number. We should divide the PRI
> queue
> + * by the number of PRI-capable devices, but it's impossible to know
> + * about current and future (hotplugged) devices. So we're at risk of
> + * dropping PPRs (and leaking pending requests in the FQ).
> + */
> + size_t max_inflight_pprs = 16;
> + struct arm_smmu_device *smmu = master->smmu;
> +
> + if (!(smmu->features & ARM_SMMU_FEAT_PRI) ||
> !dev_is_pci(master->dev))
> + return -ENOSYS;
> +
> + pdev = to_pci_dev(master->dev);
> +
> + ret = pci_reset_pri(pdev);
> + if (ret)
> + return ret;
> +
> + ret = pci_enable_pri(pdev, max_inflight_pprs);
> + if (ret) {
> + dev_err(master->dev, "cannot enable PRI: %d\n", ret);
> + return ret;
> + }
> +
> + master->can_fault = true;
> + master->ste.prg_resp_needs_ssid =
> pci_prg_resp_requires_prefix(pdev);
Any reason why this is not cleared in arm_smmu_disable_pri ?
> +
> + dev_dbg(master->dev, "enabled PRI\n");
> +
> + return 0;
> +}
> +
> +static void arm_smmu_disable_pri(struct arm_smmu_master_data
> *master) {
> + struct pci_dev *pdev;
> +
> + if (!dev_is_pci(master->dev))
> + return;
> +
> + pdev = to_pci_dev(master->dev);
> +
> + if (!pdev->pri_enabled)
> + return;
> +
> + pci_disable_pri(pdev);
> + dev_dbg(master->dev, "disabled PRI\n");
> + master->can_fault = false;
> +}
> +
> static int arm_smmu_sva_init(struct device *dev, struct iommu_sva_param
> *param) {
> int ret;
> @@ -2314,11 +2401,15 @@ static int arm_smmu_sva_init(struct device
> *dev, struct iommu_sva_param *param)
> return -EINVAL;
>
> if (param->features & IOMMU_SVA_FEAT_IOPF) {
> - if (!master->can_fault)
> - return -EINVAL;
> + arm_smmu_enable_pri(master);
> + if (!master->can_fault) {
> + ret = -ENODEV;
> + goto err_disable_pri;
> + }
> +
> ret = iopf_queue_add_device(master->smmu->iopf_queue,
> dev);
> if (ret)
> - return ret;
> + goto err_disable_pri;
> }
>
> if (!param->max_pasid)
> @@ -2329,11 +2420,17 @@ static int arm_smmu_sva_init(struct device
> *dev, struct iommu_sva_param *param)
> param->max_pasid = min(param->max_pasid, (1U << master-
> >ssid_bits) - 1);
>
> return 0;
> +
> +err_disable_pri:
> + arm_smmu_disable_pri(master);
> +
> + return ret;
> }
>
> static void arm_smmu_sva_shutdown(struct device *dev,
> struct iommu_sva_param *param)
> {
> + arm_smmu_disable_pri(dev->iommu_fwspec->iommu_priv);
> iopf_queue_remove_device(dev);
> }
>
> @@ -2671,6 +2768,7 @@ static void arm_smmu_remove_device(struct
> device *dev)
> iommu_group_remove_device(dev);
> arm_smmu_remove_master(smmu, master);
> iommu_device_unlink(&smmu->iommu, dev);
> + arm_smmu_disable_pri(master);
> arm_smmu_disable_ats(master);
> kfree(master);
> iommu_fwspec_free(dev);
> --
> 2.17.0
^ permalink raw reply
* Re: [PATCH v2 13/40] vfio: Add support for Shared Virtual Addressing
From: Jean-Philippe Brucker @ 2018-05-25 9:47 UTC (permalink / raw)
To: Xu Zaibo, linux-arm-kernel@lists.infradead.org,
linux-pci@vger.kernel.org, linux-acpi@vger.kernel.org,
devicetree@vger.kernel.org, iommu@lists.linux-foundation.org,
kvm@vger.kernel.org, linux-mm@kvack.org
Cc: bharatku@xilinx.com, ashok.raj@intel.com, dwmw2@infradead.org,
ilias.apalodimas@linaro.org, Will Deacon, okaya@codeaurora.org,
rgummal@xilinx.com, rfranz@cavium.com, liguozhu@hisilicon.com,
christian.koenig@amd.com
In-Reply-To: <5B077765.30703@huawei.com>
On 25/05/18 03:39, Xu Zaibo wrote:
> Hi,
>
> On 2018/5/24 23:04, Jean-Philippe Brucker wrote:
>> On 24/05/18 13:35, Xu Zaibo wrote:
>>>> Right, sva_init() must be called once for any device that intends to use
>>>> bind(). For the second process though, group->sva_enabled will be true
>>>> so we won't call sva_init() again, only bind().
>>> Well, while I create mediated devices based on one parent device to support multiple
>>> processes(a new process will create a new 'vfio_group' for the corresponding mediated device,
>>> and 'sva_enabled' cannot work any more), in fact, *sva_init and *sva_shutdown are basically
>>> working on parent device, so, as a result, I just only need sva initiation and shutdown on the
>>> parent device only once. So I change the two as following:
>>>
>>> @@ -551,8 +565,18 @@ int iommu_sva_device_init(struct device *dev, unsigned long features,
>>> if (features & ~IOMMU_SVA_FEAT_IOPF)
>>> return -EINVAL;
>>>
>>> + /* If already exists, do nothing */
>>> + mutex_lock(&dev->iommu_param->lock);
>>> + if (dev->iommu_param->sva_param) {
>>> + mutex_unlock(&dev->iommu_param->lock);
>>> + return 0;
>>> + }
>>> + mutex_unlock(&dev->iommu_param->lock);
>>>
>>> if (features & IOMMU_SVA_FEAT_IOPF) {
>>> ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf,
>>>
>>>
>>> @@ -621,6 +646,14 @@ int iommu_sva_device_shutdown(struct device *dev)
>>> if (!domain)
>>> return -ENODEV;
>>>
>>> + /* If any other process is working on the device, shut down does nothing. */
>>> + mutex_lock(&dev->iommu_param->lock);
>>> + if (!list_empty(&dev->iommu_param->sva_param->mm_list)) {
>>> + mutex_unlock(&dev->iommu_param->lock);
>>> + return 0;
>>> + }
>>> + mutex_unlock(&dev->iommu_param->lock);
>> I don't think iommu-sva.c is the best place for this, it's probably
>> better to implement an intermediate layer (the mediating driver), that
>> calls iommu_sva_device_init() and iommu_sva_device_shutdown() once. Then
>> vfio-pci would still call these functions itself, but for mdev the
>> mediating driver keeps a refcount of groups, and calls device_shutdown()
>> only when freeing the last mdev.
>>
>> A device driver (non mdev in this example) expects to be able to free
>> all its resources after sva_device_shutdown() returns. Imagine the
>> mm_list isn't empty (mm_exit() is running late), and instead of waiting
>> in unbind_dev_all() below, we return 0 immediately. Then the calling
>> driver frees its resources, and the mm_exit callback along with private
>> data passed to bind() disappear. If a mm_exit() is still running in
>> parallel, then it will try to access freed data and corrupt memory. So
>> in this function if mm_list isn't empty, the only thing we can do is wait.
>>
> I still don't understand why we should 'unbind_dev_all', is it possible
> to do a 'unbind_dev_pasid'?
Not in sva_device_shutdown(), it needs to clean up everything. For
example you want to physically unplug the device, or assign it to a VM.
To prevent any leak sva_device_shutdown() needs to remove all bonds. In
theory there shouldn't be any, since either the driver did unbind_dev(),
or all process exited. This is a safety net.
> Then we can do other things instead of waiting that user may not like. :)
They may not like it, but it's for their own good :) At the moment we're
waiting that:
* All exit_mm() callback for this device have finished. If we don't wait
then the caller will free the private data passed to bind and the
mm_exit() callback while they are still being used.
* All page requests targeting this device are dealt with. If we don't
wait then some requests, that are lingering in the IOMMU PRI queue,
may hit the next contexts bound to this device, possibly in a
different VM. It may not be too risky (though probably exploitable in
some way), but is incredibly messy.
All of this is bounded in time, and normally should be over pretty fast
unless the device driver's exit_mm() does something strange. If the
driver did the right thing, there shouldn't be any wait here (although
there may be one in unbind_dev() for the same reasons - prevent use
after free).
Thanks,
Jean
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox