* [PATCH] [PATCH] Revert 'vfio: Delete container_q'
@ 2022-10-08 9:50 chenxiang
2022-10-10 18:27 ` Jason Gunthorpe
0 siblings, 1 reply; 5+ messages in thread
From: chenxiang @ 2022-10-08 9:50 UTC (permalink / raw)
To: jgg, alex.williamson
Cc: cohuck, kvm, baolu.lu, kevin.tian, linuxarm, Xiang Chen
From: Xiang Chen <chenxiang66@hisilicon.com>
We find a issue on ARM64 platform with HNS3 VF SRIOV enabled (VFIO
passthrough in qemu):
kill the qemu thread, then echo 0 > sriov_numvfs to disable sriov
immediately, sometimes we will see following warnings:
[ 284.102752] ------------[ cut here ]------------
[ 284.113278] iommu driver failed to attach the default/blocking domain
[ 284.113294] WARNING: CPU: 14 PID: 12393 at drivers/iommu/iommu.c:1959
__iommu_group_set_core_domain+0x44/0x54
[ 284.130298] Modules linked in: hisi_zip kp_ktools(O) hisi_qm uacce vfio_iommu_type1
vfio_pci vfio_pci_core vfio_virqfd vfio hns3_cae(O) ip6table_filter ip6_tables
iptable_filter arm_spe_pmu hns_roce_hw_v2 hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu
hisi_uncore_l3c_pmu spi_dw_mmio hisi_uncore_pmu fuse hclge crct10dif_ce sbsa_gwdt
hisi_sas_v3_hw hns3 hnae3 hisi_sas_main xhci_pci hisi_dma xhci_pci_renesas libsas
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: uacce]
[ 284.172994] CPU: 14 PID: 12393 Comm: qemu-system-aar Kdump: loaded Tainted: G
O 5.19.0-rc4+ #1
[ 284.183380] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS 2280-V2 CS
V5.B221.01 12/09/2021
[ 284.193515] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 284.200721] pc : __iommu_group_set_core_domain+0x44/0x54
[ 284.206286] lr : __iommu_group_set_core_domain+0x44/0x54
[ 284.211843] sp : ffff80001d99bbc0
[ 284.215406] x29: ffff80001d99bbc0 x28: ffff002097cb6000 x27: ffff0020aa704c98
[ 284.222780] x26: ffff0020aa704c80 x25: ffff0020868c70b0 x24: ffff0020aa705600
[ 284.230152] x23: ffff00208405c3e0 x22: ffff0020868c7020 x21: ffff80001d99bc58
[ 284.237523] x20: ffff002086c4b858 x19: ffff002086c4b800 x18: 0000000000000030
[ 284.244889] x17: 0000000000000004 x16: 0000000000000000 x15: ffffffffffffffff
[ 284.252253] x14: 0000000000000000 x13: 6e69616d6f642067 x12: 6e696b636f6c622f
[ 284.259617] x11: fffffffffff26c30 x10: fffffffffff26be8 x9 : ffffba21e09dc1f0
[ 284.266979] x8 : ffff202f8f7c0000 x7 : ffff202f8fa80000 x6 : 0000000000006d20
[ 284.274340] x5 : ffff0027dfbe69b0 x4 : 0000000000000000 x3 : 0000000000000027
[ 284.281697] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff002097cb6000
[ 284.289051] Call trace:
[ 284.291732] __iommu_group_set_core_domain+0x44/0x54
[ 284.296924] iommu_detach_group+0x34/0x50
[ 284.301161] vfio_iommu_type1_detach_group+0xe4/0x610 [vfio_iommu_type1]
[ 284.308085] __vfio_group_unset_container+0x4c/0x1b0 [vfio]
[ 284.313882] vfio_group_fops_release+0x54/0xac [vfio]
[ 284.319155] __fput+0x78/0x220
[ 284.322434] ____fput+0x1c/0x30
[ 284.325790] task_work_run+0x88/0xc0
[ 284.329573] do_exit+0x310/0x970
[ 284.333004] do_group_exit+0x40/0xac
[ 284.336771] __wake_up_parent+0x0/0x3c
[ 284.340706] invoke_syscall+0x50/0x120
[ 284.344634] el0_svc_common.constprop.0+0x188/0x190
[ 284.349683] do_el0_svc+0x38/0xc4
[ 284.353169] el0_svc+0x2c/0xb4
[ 284.356390] el0t_64_sync_handler+0x1ac/0x1b0
[ 284.360905] el0t_64_sync+0x19c/0x1a0
[ 284.364729] ---[ end trace 0000000000000000 ]---
We find it is caused by commit (dc15f82f5329 ("vfio: Delete container_q")).
If killing the qemu thread, the function call relationship is as follows:
kill qemu -> vfio_instance_finalize -> syscall VFIO_GROUP_UNSET_CONTAINER ->
vfio_group_unset_container -> vfio_iommu_type1_detach_group ->
__iommu_group_set_core_domain -> arm_smmu_attach_dev (it refers to iommu_fwspec and arm_smmu_master)
If echo 0 > sriov_numvfs, the function call relationship is as follows:
echo 0 > sriov_numvfs -> pci_disable_sriov -> device_del -> iommu_bus_notifier (BUS_NOTIFY_REMOVED_DEVICE) -> iommu_release_device ->
arm_smmu_release_device (it will kfree arm_smmu_master including iommu_fwspec)
After removing container_q, arm_smmu_release_dev() caused by disabling
sriov may occur before arm_smmuv3_attach_dev() called by echo 0 > sriov_numvfs,
and arm_smmu_attach_dev() may refer to freed iommu_fwspec, so it causes
above warnings.
Revert the patch to avoid the issue.
Fixes: dc15f82f5329 ("vfio: Delete container_q")
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
---
drivers/vfio/vfio_main.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 7cb56c3..890fdf9 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -73,6 +73,7 @@ struct vfio_group {
struct mutex device_lock;
struct list_head vfio_next;
struct list_head container_next;
+ wait_queue_head_t container_q;
enum vfio_group_type type;
unsigned int dev_counter;
struct rw_semaphore group_rwsem;
@@ -367,6 +368,7 @@ static struct vfio_group *vfio_group_alloc(struct iommu_group *iommu_group,
init_rwsem(&group->group_rwsem);
INIT_LIST_HEAD(&group->device_list);
mutex_init(&group->device_lock);
+ init_waitqueue_head(&group->container_q);
group->iommu_group = iommu_group;
/* put in vfio_group_release() */
iommu_group_ref_get(iommu_group);
@@ -699,6 +701,9 @@ void vfio_unregister_group_dev(struct vfio_device *device)
group->dev_counter--;
mutex_unlock(&group->device_lock);
+ if (list_empty(&group->device_list))
+ wait_event(group->container_q, !group->container);
+
if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
iommu_group_remove_device(device->dev);
@@ -946,6 +951,7 @@ static void __vfio_group_unset_container(struct vfio_group *group)
iommu_group_release_dma_owner(group->iommu_group);
group->container = NULL;
+ wake_up(&group->container_q);
group->container_users = 0;
list_del(&group->container_next);
--
2.8.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] [PATCH] Revert 'vfio: Delete container_q'
2022-10-08 9:50 [PATCH] [PATCH] Revert 'vfio: Delete container_q' chenxiang
@ 2022-10-10 18:27 ` Jason Gunthorpe
2022-10-11 2:30 ` chenxiang (M)
0 siblings, 1 reply; 5+ messages in thread
From: Jason Gunthorpe @ 2022-10-10 18:27 UTC (permalink / raw)
To: chenxiang; +Cc: alex.williamson, cohuck, kvm, baolu.lu, kevin.tian, linuxarm
On Sat, Oct 08, 2022 at 05:50:31PM +0800, chenxiang wrote:
> From: Xiang Chen <chenxiang66@hisilicon.com>
>
> We find a issue on ARM64 platform with HNS3 VF SRIOV enabled (VFIO
> passthrough in qemu):
> kill the qemu thread, then echo 0 > sriov_numvfs to disable sriov
> immediately, sometimes we will see following warnings:
I suspect this is fixed in vfio-next now, in a different way. Please check
> After removing container_q, arm_smmu_release_dev() caused by disabling
> sriov may occur before arm_smmuv3_attach_dev() called by echo 0 > sriov_numvfs,
> and arm_smmu_attach_dev() may refer to freed iommu_fwspec, so it causes
> above warnings.
Which is the same effective issue s390 hit already.
It is interesting that container_q was solving this, that seems to be
a inadverent side effect nobody noticed.
Jason
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] [PATCH] Revert 'vfio: Delete container_q'
2022-10-10 18:27 ` Jason Gunthorpe
@ 2022-10-11 2:30 ` chenxiang (M)
2022-10-11 2:49 ` Tian, Kevin
0 siblings, 1 reply; 5+ messages in thread
From: chenxiang (M) @ 2022-10-11 2:30 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: alex.williamson, cohuck, kvm, baolu.lu, kevin.tian, linuxarm
Hi Jason,
在 2022/10/11 2:27, Jason Gunthorpe 写道:
> On Sat, Oct 08, 2022 at 05:50:31PM +0800, chenxiang wrote:
>> From: Xiang Chen <chenxiang66@hisilicon.com>
>>
>> We find a issue on ARM64 platform with HNS3 VF SRIOV enabled (VFIO
>> passthrough in qemu):
>> kill the qemu thread, then echo 0 > sriov_numvfs to disable sriov
>> immediately, sometimes we will see following warnings:
> I suspect this is fixed in vfio-next now, in a different way. Please check
Can you point out which patches fix it?
I need to merge back those patches to our version, then have a test.
>
>> After removing container_q, arm_smmu_release_dev() caused by disabling
>> sriov may occur before arm_smmuv3_attach_dev() called by echo 0 > sriov_numvfs,
>> and arm_smmu_attach_dev() may refer to freed iommu_fwspec, so it causes
>> above warnings.
> Which is the same effective issue s390 hit already.
>
> It is interesting that container_q was solving this, that seems to be
> a inadverent side effect nobody noticed.
>
> Jason
> .
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* RE: [PATCH] [PATCH] Revert 'vfio: Delete container_q'
2022-10-11 2:30 ` chenxiang (M)
@ 2022-10-11 2:49 ` Tian, Kevin
2022-10-14 1:18 ` chenxiang (M)
0 siblings, 1 reply; 5+ messages in thread
From: Tian, Kevin @ 2022-10-11 2:49 UTC (permalink / raw)
To: chenxiang (M), Jason Gunthorpe
Cc: alex.williamson@redhat.com, cohuck@redhat.com,
kvm@vger.kernel.org, baolu.lu@linux.intel.com,
linuxarm@huawei.com
> From: chenxiang (M) <chenxiang66@hisilicon.com>
> Sent: Tuesday, October 11, 2022 10:31 AM
>
> Hi Jason,
>
>
> 在 2022/10/11 2:27, Jason Gunthorpe 写道:
> > On Sat, Oct 08, 2022 at 05:50:31PM +0800, chenxiang wrote:
> >> From: Xiang Chen <chenxiang66@hisilicon.com>
> >>
> >> We find a issue on ARM64 platform with HNS3 VF SRIOV enabled (VFIO
> >> passthrough in qemu):
> >> kill the qemu thread, then echo 0 > sriov_numvfs to disable sriov
> >> immediately, sometimes we will see following warnings:
> > I suspect this is fixed in vfio-next now, in a different way. Please check
>
> Can you point out which patches fix it?
> I need to merge back those patches to our version, then have a test.
>
commit ca5f21b2574903a7430fcb3590e534d92b2fa816
Author: Jason Gunthorpe <jgg@nvidia.com>
Date: Thu Sep 22 21:06:10 2022 -0300
vfio: Follow a strict lifetime for struct iommu_group
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] [PATCH] Revert 'vfio: Delete container_q'
2022-10-11 2:49 ` Tian, Kevin
@ 2022-10-14 1:18 ` chenxiang (M)
0 siblings, 0 replies; 5+ messages in thread
From: chenxiang (M) @ 2022-10-14 1:18 UTC (permalink / raw)
To: Tian, Kevin, Jason Gunthorpe
Cc: alex.williamson@redhat.com, cohuck@redhat.com,
kvm@vger.kernel.org, baolu.lu@linux.intel.com,
linuxarm@huawei.com
Hi,
在 2022/10/11 10:49, Tian, Kevin 写道:
>> From: chenxiang (M) <chenxiang66@hisilicon.com>
>> Sent: Tuesday, October 11, 2022 10:31 AM
>>
>> Hi Jason,
>>
>>
>> 在 2022/10/11 2:27, Jason Gunthorpe 写道:
>>> On Sat, Oct 08, 2022 at 05:50:31PM +0800, chenxiang wrote:
>>>> From: Xiang Chen <chenxiang66@hisilicon.com>
>>>>
>>>> We find a issue on ARM64 platform with HNS3 VF SRIOV enabled (VFIO
>>>> passthrough in qemu):
>>>> kill the qemu thread, then echo 0 > sriov_numvfs to disable sriov
>>>> immediately, sometimes we will see following warnings:
>>> I suspect this is fixed in vfio-next now, in a different way. Please check
>> Can you point out which patches fix it?
>> I need to merge back those patches to our version, then have a test.
>>
> commit ca5f21b2574903a7430fcb3590e534d92b2fa816
> Author: Jason Gunthorpe <jgg@nvidia.com>
> Date: Thu Sep 22 21:06:10 2022 -0300
>
> vfio: Follow a strict lifetime for struct iommu_group
I merge back the patch and have a test, and it solves the issue. Thanks!
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2022-10-14 1:18 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-10-08 9:50 [PATCH] [PATCH] Revert 'vfio: Delete container_q' chenxiang
2022-10-10 18:27 ` Jason Gunthorpe
2022-10-11 2:30 ` chenxiang (M)
2022-10-11 2:49 ` Tian, Kevin
2022-10-14 1:18 ` chenxiang (M)
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox