public inbox for linux-s390@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
       [not found]   ` <20220927140541.6f727b01.alex.williamson@redhat.com>
@ 2022-10-04 15:19     ` Christian Borntraeger
  2022-10-04 15:40       ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Borntraeger @ 2022-10-04 15:19 UTC (permalink / raw)
  To: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman
  Cc: Jason Gunthorpe, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski, linux-s390

Am 27.09.22 um 22:05 schrieb Alex Williamson:
> On Mon, 26 Sep 2022 13:03:56 -0400
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
>> On 9/22/22 8:06 PM, Jason Gunthorpe wrote:
>>> The iommu_group comes from the struct device that a driver has been bound
>>> to and then created a struct vfio_device against. To keep the iommu layer
>>> sane we want to have a simple rule that only an attached driver should be
>>> using the iommu API. Particularly only an attached driver should hold
>>> ownership.
>>>
>>> In VFIO's case since it uses the group APIs and it shares between
>>> different drivers it is a bit more complicated, but the principle still
>>> holds.
>>>
>>> Solve this by waiting for all users of the vfio_group to stop before
>>> allowing vfio_unregister_group_dev() to complete. This is done with a new
>>> completion to know when the users go away and an additional refcount to
>>> keep track of how many device drivers are sharing the vfio group. The last
>>> driver to be unregistered will clean up the group.
>>>
>>> This solves crashes in the S390 iommu driver that come because VFIO ends
>>> up racing releasing ownership (which attaches the default iommu_domain to
>>> the device) with the removal of that same device from the iommu
>>> driver. This is a side case that iommu drivers should not have to cope
>>> with.
>>>
>>>     iommu driver failed to attach the default/blocking domain
>>>     WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
>>>     Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
>>>     CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
>>>     Hardware name: IBM 3931 A01 782 (LPAR)
>>>     Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
>>>                R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>>>     Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
>>>                00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
>>>                00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
>>>                00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
>>>     Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
>>>                            000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
>>>                           #000000095bb10d24: af000000            mc      0,0
>>>                           >000000095bb10d28: b904002a            lgr     %r2,%r10
>>>                            000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
>>>                            000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
>>>                            000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
>>>                            000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
>>>     Call Trace:
>>>      [<000000095bb10d28>] iommu_detach_group+0x70/0x80
>>>     ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
>>>      [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
>>>      [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
>>>      [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
>>>     pci 0004:00:00.0: Removing from iommu group 4
>>>      [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
>>>      [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
>>>      [<000000095be6c072>] system_call+0x82/0xb0
>>>     Last Breaking-Event-Address:
>>>      [<000000095be4bf80>] __warn_printk+0x60/0x68
>>>
>>> It indicates that domain->ops->attach_dev() failed because the driver has
>>> already passed the point of destructing the device.
>>>
>>> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
>>> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
>>> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>> ---
>>>   drivers/vfio/vfio.h      |  8 +++++
>>>   drivers/vfio/vfio_main.c | 68 ++++++++++++++++++++++++++--------------
>>>   2 files changed, 53 insertions(+), 23 deletions(-)
>>>
>>> v2
>>>   - Rebase on the vfio struct device series and the container.c series
>>>   - Drop patches 1 & 2, we need to have working error unwind, so another
>>>     test is not a problem
>>>   - Fold iommu_group_remove_device() into vfio_device_remove_group() so
>>>     that it forms a strict pairing with the two allocation functions.
>>>   - Drop the iommu patch from the series, it needs more work and discussion
>>> v1 https://lore.kernel.org/r/0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com
>>>
>>> This could probably use another quick sanity test due to all the rebasing,
>>> Alex if you are happy let's wait for Matthew.
>>>    
>>
>> I have been re-running the same series of tests on this version (on top of vfio-next) and this still resolves the reported issue.  Thanks Jason!
> 
> Thanks all.  Applied to vfio next branch for v6.1.  Thanks,

So now I have bisected this to a regression in our KVM CI for vfio-ap. Our testcase MultipleMdevAPMatrixTestCase hangs forever.
I see  virtnodedevd spinning 100% and "mdevctl stop --uuid=d70d7685-a1b5-47a1-bdea-336925e0a95d" seems to wait for something:

[  186.815543] task:mdevctl         state:D stack:    0 pid: 1639 ppid:  1604 flags:0x00000001
[  186.815546] Call Trace:
[  186.815547]  [<0000002baf277386>] __schedule+0x296/0x650
[  186.815549]  [<0000002baf2777a2>] schedule+0x62/0x108
[  186.815551]  [<0000002baf27db20>] schedule_timeout+0xc0/0x108
[  186.815553]  [<0000002baf278166>] __wait_for_common+0xc6/0x250
[  186.815556]  [<000003ff800c263a>] vfio_device_remove_group.isra.0+0xb2/0x118 [vfio]
[  186.815561]  [<000003ff805caadc>] vfio_ap_mdev_remove+0x2c/0x198 [vfio_ap]
[  186.815565]  [<0000002baef1d4de>] device_release_driver_internal+0x1c6/0x288
[  186.815570]  [<0000002baef1b27c>] bus_remove_device+0x10c/0x198
[  186.815572]  [<0000002baef14b54>] device_del+0x19c/0x3e0
[  186.815575]  [<000003ff800d9e3a>] mdev_device_remove+0xb2/0x108 [mdev]
[  186.815579]  [<000003ff800da096>] remove_store+0x7e/0x90 [mdev]
[  186.815581]  [<0000002baea53c30>] kernfs_fop_write_iter+0x138/0x210
[  186.815586]  [<0000002bae98e310>] vfs_write+0x1a0/0x2f0
[  186.815588]  [<0000002bae98e6d8>] ksys_write+0x70/0x100
[  186.815590]  [<0000002baf26fe2c>] __do_syscall+0x1d4/0x200
[  186.815593]  [<0000002baf27eb42>] system_call+0x82/0xb0


Any quick idea?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 15:19     ` [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group Christian Borntraeger
@ 2022-10-04 15:40       ` Jason Gunthorpe
  2022-10-04 15:44         ` Christian Borntraeger
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2022-10-04 15:40 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman, Cornelia Huck, kvm, Qian Cai,
	Joerg Roedel, Marek Szyprowski, linux-s390

On Tue, Oct 04, 2022 at 05:19:07PM +0200, Christian Borntraeger wrote:
> Am 27.09.22 um 22:05 schrieb Alex Williamson:
> > On Mon, 26 Sep 2022 13:03:56 -0400
> > Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> > 
> > > On 9/22/22 8:06 PM, Jason Gunthorpe wrote:
> > > > The iommu_group comes from the struct device that a driver has been bound
> > > > to and then created a struct vfio_device against. To keep the iommu layer
> > > > sane we want to have a simple rule that only an attached driver should be
> > > > using the iommu API. Particularly only an attached driver should hold
> > > > ownership.
> > > > 
> > > > In VFIO's case since it uses the group APIs and it shares between
> > > > different drivers it is a bit more complicated, but the principle still
> > > > holds.
> > > > 
> > > > Solve this by waiting for all users of the vfio_group to stop before
> > > > allowing vfio_unregister_group_dev() to complete. This is done with a new
> > > > completion to know when the users go away and an additional refcount to
> > > > keep track of how many device drivers are sharing the vfio group. The last
> > > > driver to be unregistered will clean up the group.
> > > > 
> > > > This solves crashes in the S390 iommu driver that come because VFIO ends
> > > > up racing releasing ownership (which attaches the default iommu_domain to
> > > > the device) with the removal of that same device from the iommu
> > > > driver. This is a side case that iommu drivers should not have to cope
> > > > with.
> > > > 
> > > >     iommu driver failed to attach the default/blocking domain
> > > >     WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
> > > >     Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
> > > >     CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
> > > >     Hardware name: IBM 3931 A01 782 (LPAR)
> > > >     Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
> > > >                R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> > > >     Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
> > > >                00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
> > > >                00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
> > > >                00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
> > > >     Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
> > > >                            000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
> > > >                           #000000095bb10d24: af000000            mc      0,0
> > > >                           >000000095bb10d28: b904002a            lgr     %r2,%r10
> > > >                            000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
> > > >                            000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
> > > >                            000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
> > > >                            000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
> > > >     Call Trace:
> > > >      [<000000095bb10d28>] iommu_detach_group+0x70/0x80
> > > >     ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
> > > >      [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
> > > >      [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
> > > >      [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
> > > >     pci 0004:00:00.0: Removing from iommu group 4
> > > >      [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
> > > >      [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
> > > >      [<000000095be6c072>] system_call+0x82/0xb0
> > > >     Last Breaking-Event-Address:
> > > >      [<000000095be4bf80>] __warn_printk+0x60/0x68
> > > > 
> > > > It indicates that domain->ops->attach_dev() failed because the driver has
> > > > already passed the point of destructing the device.
> > > > 
> > > > Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
> > > > Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
> > > > Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
> > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > ---
> > > >   drivers/vfio/vfio.h      |  8 +++++
> > > >   drivers/vfio/vfio_main.c | 68 ++++++++++++++++++++++++++--------------
> > > >   2 files changed, 53 insertions(+), 23 deletions(-)
> > > > 
> > > > v2
> > > >   - Rebase on the vfio struct device series and the container.c series
> > > >   - Drop patches 1 & 2, we need to have working error unwind, so another
> > > >     test is not a problem
> > > >   - Fold iommu_group_remove_device() into vfio_device_remove_group() so
> > > >     that it forms a strict pairing with the two allocation functions.
> > > >   - Drop the iommu patch from the series, it needs more work and discussion
> > > > v1 https://lore.kernel.org/r/0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com
> > > > 
> > > > This could probably use another quick sanity test due to all the rebasing,
> > > > Alex if you are happy let's wait for Matthew.
> > > 
> > > I have been re-running the same series of tests on this version (on top of vfio-next) and this still resolves the reported issue.  Thanks Jason!
> > 
> > Thanks all.  Applied to vfio next branch for v6.1.  Thanks,
> 
> So now I have bisected this to a regression in our KVM CI for vfio-ap. Our testcase MultipleMdevAPMatrixTestCase hangs forever.
> I see  virtnodedevd spinning 100% and "mdevctl stop --uuid=d70d7685-a1b5-47a1-bdea-336925e0a95d" seems to wait for something:
> 
> [  186.815543] task:mdevctl         state:D stack:    0 pid: 1639 ppid:  1604 flags:0x00000001
> [  186.815546] Call Trace:
> [  186.815547]  [<0000002baf277386>] __schedule+0x296/0x650
> [  186.815549]  [<0000002baf2777a2>] schedule+0x62/0x108
> [  186.815551]  [<0000002baf27db20>] schedule_timeout+0xc0/0x108
> [  186.815553]  [<0000002baf278166>] __wait_for_common+0xc6/0x250
> [  186.815556]  [<000003ff800c263a>] vfio_device_remove_group.isra.0+0xb2/0x118 [vfio]
> [  186.815561]  [<000003ff805caadc>] vfio_ap_mdev_remove+0x2c/0x198 [vfio_ap]
> [  186.815565]  [<0000002baef1d4de>] device_release_driver_internal+0x1c6/0x288
> [  186.815570]  [<0000002baef1b27c>] bus_remove_device+0x10c/0x198
> [  186.815572]  [<0000002baef14b54>] device_del+0x19c/0x3e0
> [  186.815575]  [<000003ff800d9e3a>] mdev_device_remove+0xb2/0x108 [mdev]
> [  186.815579]  [<000003ff800da096>] remove_store+0x7e/0x90 [mdev]
> [  186.815581]  [<0000002baea53c30>] kernfs_fop_write_iter+0x138/0x210
> [  186.815586]  [<0000002bae98e310>] vfs_write+0x1a0/0x2f0
> [  186.815588]  [<0000002bae98e6d8>] ksys_write+0x70/0x100
> [  186.815590]  [<0000002baf26fe2c>] __do_syscall+0x1d4/0x200
> [  186.815593]  [<0000002baf27eb42>] system_call+0x82/0xb0

Does some userspace have the group FD open when it stucks like this,
eg what does fuser say?

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 15:40       ` Jason Gunthorpe
@ 2022-10-04 15:44         ` Christian Borntraeger
  2022-10-04 16:28           ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Borntraeger @ 2022-10-04 15:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman, Cornelia Huck, kvm, Qian Cai,
	Joerg Roedel, Marek Szyprowski, linux-s390



Am 04.10.22 um 17:40 schrieb Jason Gunthorpe:
> On Tue, Oct 04, 2022 at 05:19:07PM +0200, Christian Borntraeger wrote:
>> Am 27.09.22 um 22:05 schrieb Alex Williamson:
>>> On Mon, 26 Sep 2022 13:03:56 -0400
>>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>>
>>>> On 9/22/22 8:06 PM, Jason Gunthorpe wrote:
>>>>> The iommu_group comes from the struct device that a driver has been bound
>>>>> to and then created a struct vfio_device against. To keep the iommu layer
>>>>> sane we want to have a simple rule that only an attached driver should be
>>>>> using the iommu API. Particularly only an attached driver should hold
>>>>> ownership.
>>>>>
>>>>> In VFIO's case since it uses the group APIs and it shares between
>>>>> different drivers it is a bit more complicated, but the principle still
>>>>> holds.
>>>>>
>>>>> Solve this by waiting for all users of the vfio_group to stop before
>>>>> allowing vfio_unregister_group_dev() to complete. This is done with a new
>>>>> completion to know when the users go away and an additional refcount to
>>>>> keep track of how many device drivers are sharing the vfio group. The last
>>>>> driver to be unregistered will clean up the group.
>>>>>
>>>>> This solves crashes in the S390 iommu driver that come because VFIO ends
>>>>> up racing releasing ownership (which attaches the default iommu_domain to
>>>>> the device) with the removal of that same device from the iommu
>>>>> driver. This is a side case that iommu drivers should not have to cope
>>>>> with.
>>>>>
>>>>>      iommu driver failed to attach the default/blocking domain
>>>>>      WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
>>>>>      Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
>>>>>      CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
>>>>>      Hardware name: IBM 3931 A01 782 (LPAR)
>>>>>      Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
>>>>>                 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>>>>>      Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
>>>>>                 00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
>>>>>                 00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
>>>>>                 00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
>>>>>      Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
>>>>>                             000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
>>>>>                            #000000095bb10d24: af000000            mc      0,0
>>>>>                            >000000095bb10d28: b904002a            lgr     %r2,%r10
>>>>>                             000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
>>>>>                             000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
>>>>>                             000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
>>>>>                             000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
>>>>>      Call Trace:
>>>>>       [<000000095bb10d28>] iommu_detach_group+0x70/0x80
>>>>>      ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
>>>>>       [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
>>>>>       [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
>>>>>       [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
>>>>>      pci 0004:00:00.0: Removing from iommu group 4
>>>>>       [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
>>>>>       [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
>>>>>       [<000000095be6c072>] system_call+0x82/0xb0
>>>>>      Last Breaking-Event-Address:
>>>>>       [<000000095be4bf80>] __warn_printk+0x60/0x68
>>>>>
>>>>> It indicates that domain->ops->attach_dev() failed because the driver has
>>>>> already passed the point of destructing the device.
>>>>>
>>>>> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
>>>>> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
>>>>> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
>>>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>>>> ---
>>>>>    drivers/vfio/vfio.h      |  8 +++++
>>>>>    drivers/vfio/vfio_main.c | 68 ++++++++++++++++++++++++++--------------
>>>>>    2 files changed, 53 insertions(+), 23 deletions(-)
>>>>>
>>>>> v2
>>>>>    - Rebase on the vfio struct device series and the container.c series
>>>>>    - Drop patches 1 & 2, we need to have working error unwind, so another
>>>>>      test is not a problem
>>>>>    - Fold iommu_group_remove_device() into vfio_device_remove_group() so
>>>>>      that it forms a strict pairing with the two allocation functions.
>>>>>    - Drop the iommu patch from the series, it needs more work and discussion
>>>>> v1 https://lore.kernel.org/r/0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com
>>>>>
>>>>> This could probably use another quick sanity test due to all the rebasing,
>>>>> Alex if you are happy let's wait for Matthew.
>>>>
>>>> I have been re-running the same series of tests on this version (on top of vfio-next) and this still resolves the reported issue.  Thanks Jason!
>>>
>>> Thanks all.  Applied to vfio next branch for v6.1.  Thanks,
>>
>> So now I have bisected this to a regression in our KVM CI for vfio-ap. Our testcase MultipleMdevAPMatrixTestCase hangs forever.
>> I see  virtnodedevd spinning 100% and "mdevctl stop --uuid=d70d7685-a1b5-47a1-bdea-336925e0a95d" seems to wait for something:
>>
>> [  186.815543] task:mdevctl         state:D stack:    0 pid: 1639 ppid:  1604 flags:0x00000001
>> [  186.815546] Call Trace:
>> [  186.815547]  [<0000002baf277386>] __schedule+0x296/0x650
>> [  186.815549]  [<0000002baf2777a2>] schedule+0x62/0x108
>> [  186.815551]  [<0000002baf27db20>] schedule_timeout+0xc0/0x108
>> [  186.815553]  [<0000002baf278166>] __wait_for_common+0xc6/0x250
>> [  186.815556]  [<000003ff800c263a>] vfio_device_remove_group.isra.0+0xb2/0x118 [vfio]
>> [  186.815561]  [<000003ff805caadc>] vfio_ap_mdev_remove+0x2c/0x198 [vfio_ap]
>> [  186.815565]  [<0000002baef1d4de>] device_release_driver_internal+0x1c6/0x288
>> [  186.815570]  [<0000002baef1b27c>] bus_remove_device+0x10c/0x198
>> [  186.815572]  [<0000002baef14b54>] device_del+0x19c/0x3e0
>> [  186.815575]  [<000003ff800d9e3a>] mdev_device_remove+0xb2/0x108 [mdev]
>> [  186.815579]  [<000003ff800da096>] remove_store+0x7e/0x90 [mdev]
>> [  186.815581]  [<0000002baea53c30>] kernfs_fop_write_iter+0x138/0x210
>> [  186.815586]  [<0000002bae98e310>] vfs_write+0x1a0/0x2f0
>> [  186.815588]  [<0000002bae98e6d8>] ksys_write+0x70/0x100
>> [  186.815590]  [<0000002baf26fe2c>] __do_syscall+0x1d4/0x200
>> [  186.815593]  [<0000002baf27eb42>] system_call+0x82/0xb0
> 
> Does some userspace have the group FD open when it stucks like this,
> eg what does fuser say?

/proc/<virtnodedevd>/fd
51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
65272 0 lr-x------. 1 root root 64  4. Okt 17:42 21 -> anon_inode:inotify
65273 0 lr-x------. 1 root root 64  4. Okt 17:42 23 -> 'pipe:[30158]'
65274 0 lr-x------. 1 root root 64  4. Okt 17:42 25 -> 'pipe:[30159]'
51481 0 lrwx------. 1 root root 64  4. Okt 17:16 3 -> 'socket:[43590]'
65255 0 lrwx------. 1 root root 64  4. Okt 17:42 4 -> 'socket:[43591]'
65256 0 lrwx------. 1 root root 64  4. Okt 17:42 5 -> 'socket:[30947]'
65257 0 lrwx------. 1 root root 64  4. Okt 17:42 6 -> 'socket:[51483]'
65258 0 l-wx------. 1 root root 64  4. Okt 17:42 7 -> /run/virtnodedevd.pid
65259 0 lr-x------. 1 root root 64  4. Okt 17:42 8 -> 'pipe:[51484]'
65260 0 l-wx------. 1 root root 64  4. Okt 17:42 9 -> 'pipe:[51484]'

/proc/mdevctl/fd
59494 0 dr-x------. 2 root root  0  4. Okt 17:16 .
59493 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
59495 0 lrwx------. 1 root root 64  4. Okt 17:16 0 -> /dev/null
59496 0 l-wx------. 1 root root 64  4. Okt 17:16 1 -> 'pipe:[30158]'
59497 0 l-wx------. 1 root root 64  4. Okt 17:16 2 -> 'pipe:[30159]'
59498 0 l-wx------. 1 root root 64  4. Okt 17:16 3 -> /sys/devices/vfio_ap/matrix/d70d7685-a1b5-47a1-bdea-336925e0a95d/remove



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 15:44         ` Christian Borntraeger
@ 2022-10-04 16:28           ` Jason Gunthorpe
  2022-10-04 17:15             ` Christian Borntraeger
  2022-10-04 17:36             ` Christian Borntraeger
  0 siblings, 2 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2022-10-04 16:28 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman, Cornelia Huck, kvm, Qian Cai,
	Joerg Roedel, Marek Szyprowski, linux-s390

On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:

> > Does some userspace have the group FD open when it stucks like this,
> > eg what does fuser say?
> 
> /proc/<virtnodedevd>/fd
> 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
> 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
> 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
> 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
> 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
> 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
> 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
> 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
> 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
> 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
> 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
> 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
> 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
> 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
> 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
> 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'

Seems like a userspace bug to keep the group FD open after the /dev/
file has been deleted :|

What do you think about this?

commit a54a852b1484b1605917a8f4d80691db333b25ed
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Tue Oct 4 13:14:37 2022 -0300

    vfio: Make the group FD disassociate from the iommu_group
    
    Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
    the pointer is NULL the vfio_group users promise not to touch the
    iommu_group. This allows a driver to be hot unplugged while userspace is
    keeping the group FD open.
    
    SPAPR mode is excluded from this behavior because of how it wrongly hacks
    part of its iommu interface through KVM. Due to this we loose control over
    what it is doing and cannot revoke the iommu_group usage in the IOMMU
    layer via vfio_group_detach_container().
    
    Thus, for SPAPR the group FDs must still be closed before a device can be
    hot unplugged.
    
    This fixes a userspace regression where we learned that virtnodedevd
    leaves a group FD open even though the /dev/ node for it has been deleted
    and all the drivers for it unplugged.
    
    Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
    Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 59a28251bb0b97..badc9d828cac20 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1313,7 +1313,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 		}
 
 		/* Ensure the FD is a vfio group FD.*/
-		if (!vfio_file_iommu_group(file)) {
+		if (!vfio_file_is_group(file)) {
 			fput(file);
 			ret = -EINVAL;
 			break;
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 4d2de02f2ced6e..4e10a281420e66 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -59,6 +59,7 @@ struct vfio_group {
 	struct mutex			group_lock;
 	struct kvm			*kvm;
 	struct file			*opened_file;
+	bool				preserve_iommu_group;
 	struct swait_queue_head		opened_file_wait;
 	struct blocking_notifier_head	notifier;
 };
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 9b1e5fd5f7b73c..e725cf38886c09 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -133,6 +133,10 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
 {
 	struct vfio_group *group;
 
+	/*
+	 * group->iommu_group from the vfio.group_list cannot be NULL
+	 * under the vfio.group_lock.
+	 */
 	list_for_each_entry(group, &vfio.group_list, vfio_next) {
 		if (group->iommu_group == iommu_group) {
 			refcount_inc(&group->drivers);
@@ -159,7 +163,7 @@ static void vfio_group_release(struct device *dev)
 
 	mutex_destroy(&group->device_lock);
 	mutex_destroy(&group->group_lock);
-	iommu_group_put(group->iommu_group);
+	WARN_ON(group->iommu_group);
 	ida_free(&vfio.group_ida, MINOR(group->dev.devt));
 	kfree(group);
 }
@@ -248,6 +252,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
 static void vfio_device_remove_group(struct vfio_device *device)
 {
 	struct vfio_group *group = device->group;
+	struct iommu_group *iommu_group;
 
 	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
 		iommu_group_remove_device(device->dev);
@@ -266,12 +271,25 @@ static void vfio_device_remove_group(struct vfio_device *device)
 	cdev_device_del(&group->cdev, &group->dev);
 
 	/*
-	 * Before we allow the last driver in the group to be unplugged the
-	 * group must be sanitized so nothing else is or can reference it. This
-	 * is because the group->iommu_group pointer should only be used so long
-	 * as a device driver is attached to a device in the group.
+	 * Revoke all users of group->iommu_group. At this point we know there
+	 * are no devices active because we are unplugging the last one. Setting
+	 * iommu_group to NULL blocks all new users.
 	 */
-	while (group->opened_file) {
+	WARN_ON(group->notifier.head);
+	if (group->container)
+		vfio_group_detach_container(group);
+	iommu_group = group->iommu_group;
+	group->iommu_group = NULL;
+
+	/*
+	 * Normally we can set the iommu_group to NULL above and that will
+	 * prevent any users from touching it. However, the SPAPR kvm path takes
+	 * a reference to the iommu_group and keeps using it in arch code out
+	 * side our control. So if this path is triggred we have no choice but
+	 * to wait for the group FD to be closed to be sure everyone has stopped
+	 * touching the group.
+	 */
+	while (group->preserve_iommu_group && group->opened_file) {
 		mutex_unlock(&vfio.group_lock);
 		swait_event_idle_exclusive(group->opened_file_wait,
 					   !group->opened_file);
@@ -288,8 +306,8 @@ static void vfio_device_remove_group(struct vfio_device *device)
 	WARN_ON(!list_empty(&group->device_list));
 	WARN_ON(group->container || group->container_users);
 	WARN_ON(group->notifier.head);
-	group->iommu_group = NULL;
 
+	iommu_group_put(iommu_group);
 	put_device(&group->dev);
 }
 
@@ -531,6 +549,10 @@ static int __vfio_register_dev(struct vfio_device *device,
 
 	existing_device = vfio_group_get_device(group, device->dev);
 	if (existing_device) {
+		/*
+		 * group->iommu_group is non-NULL because we hold the drivers
+		 * refcount.
+		 */
 		dev_WARN(device->dev, "Device already exists on group %d\n",
 			 iommu_group_id(group->iommu_group));
 		vfio_device_put_registration(existing_device);
@@ -702,6 +724,11 @@ static int vfio_group_ioctl_set_container(struct vfio_group *group,
 		ret = -EINVAL;
 		goto out_unlock;
 	}
+	if (!group->iommu_group) {
+		ret = -ENODEV;
+		goto out_unlock;
+	}
+
 	container = vfio_container_from_file(f.file);
 	ret = -EINVAL;
 	if (container) {
@@ -862,6 +889,11 @@ static int vfio_group_ioctl_get_status(struct vfio_group *group,
 	status.flags = 0;
 
 	mutex_lock(&group->group_lock);
+	if (!group->iommu_group) {
+		mutex_unlock(&group->group_lock);
+		return -ENODEV;
+	}
+
 	if (group->container)
 		status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET |
 				VFIO_GROUP_FLAGS_VIABLE;
@@ -938,13 +970,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
 	filep->private_data = NULL;
 
 	mutex_lock(&group->group_lock);
-	/*
-	 * Device FDs hold a group file reference, therefore the group release
-	 * is only called when there are no open devices.
-	 */
-	WARN_ON(group->notifier.head);
-	if (group->container)
-		vfio_group_detach_container(group);
 	group->opened_file = NULL;
 	mutex_unlock(&group->group_lock);
 	swake_up_one(&group->opened_file_wait);
@@ -1553,17 +1578,34 @@ static const struct file_operations vfio_device_fops = {
  * @file: VFIO group file
  *
  * The returned iommu_group is valid as long as a ref is held on the file.
+ * This function is deprecated, only the SPAPR path in kvm should call it.
  */
 struct iommu_group *vfio_file_iommu_group(struct file *file)
 {
 	struct vfio_group *group = file->private_data;
+	struct iommu_group *iommu_group = NULL;
+
+	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
+		return NULL;
 
 	if (file->f_op != &vfio_group_fops)
 		return NULL;
-	return group->iommu_group;
+
+	mutex_lock(&group->group_lock);
+	if (group->iommu_group) {
+		iommu_group = group->iommu_group;
+		group->preserve_iommu_group = true;
+	}
+	mutex_unlock(&group->group_lock);
+	return iommu_group;
 }
 EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
 
+bool vfio_file_is_group(struct file *file)
+{
+	return (file->f_op == &vfio_group_fops;
+}
+
 /**
  * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
  *        is always CPU cache coherent
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 73bcb92179a224..bd9faaab85de18 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -199,6 +199,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
  * External user API
  */
 struct iommu_group *vfio_file_iommu_group(struct file *file);
+bool vfio_file_is_group(struct file *file);
 bool vfio_file_enforced_coherent(struct file *file);
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
 bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index ce1b01d02c5197..6ecd3aca047375 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -77,6 +77,23 @@ static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
 	return ret;
 }
 
+static bool kvm_vfio_file_is_group(struct file *file)
+{
+	bool (*fn)(struct file *file);
+	bool ret;
+
+	fn = symbol_get(vfio_file_is_group);
+	if (!fn)
+		return false;
+
+	ret = fn(file);
+
+	symbol_put(vfio_file_is_group);
+
+	return ret;
+}
+
+
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
 					     struct kvm_vfio_group *kvg)
@@ -136,7 +153,7 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
 		return -EBADF;
 
 	/* Ensure the FD is a vfio group FD.*/
-	if (!kvm_vfio_file_iommu_group(filp)) {
+	if (!kvm_vfio_file_is_group(filp)) {
 		ret = -EINVAL;
 		goto err_fput;
 	}

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 16:28           ` Jason Gunthorpe
@ 2022-10-04 17:15             ` Christian Borntraeger
  2022-10-04 17:22               ` Jason Gunthorpe
  2022-10-04 17:36             ` Christian Borntraeger
  1 sibling, 1 reply; 19+ messages in thread
From: Christian Borntraeger @ 2022-10-04 17:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman, Cornelia Huck, kvm, Qian Cai,
	Joerg Roedel, Marek Szyprowski, linux-s390


Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
> On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:
> 
>>> Does some userspace have the group FD open when it stucks like this,
>>> eg what does fuser say?
>>
>> /proc/<virtnodedevd>/fd
>> 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
>> 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
>> 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
>> 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
>> 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
>> 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
>> 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
>> 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
>> 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
>> 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
>> 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
>> 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
>> 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
>> 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
>> 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
>> 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
> 
> Seems like a userspace bug to keep the group FD open after the /dev/
> file has been deleted :|
> 
> What do you think about this?

On top of which tree is this?

> 
> commit a54a852b1484b1605917a8f4d80691db333b25ed
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Date:   Tue Oct 4 13:14:37 2022 -0300
> 
>      vfio: Make the group FD disassociate from the iommu_group
>      
>      Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>      the pointer is NULL the vfio_group users promise not to touch the
>      iommu_group. This allows a driver to be hot unplugged while userspace is
>      keeping the group FD open.
>      
>      SPAPR mode is excluded from this behavior because of how it wrongly hacks
>      part of its iommu interface through KVM. Due to this we loose control over
>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>      layer via vfio_group_detach_container().
>      
>      Thus, for SPAPR the group FDs must still be closed before a device can be
>      hot unplugged.
>      
>      This fixes a userspace regression where we learned that virtnodedevd
>      leaves a group FD open even though the /dev/ node for it has been deleted
>      and all the drivers for it unplugged.
>      
>      Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 59a28251bb0b97..badc9d828cac20 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1313,7 +1313,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>   		}
>   
>   		/* Ensure the FD is a vfio group FD.*/
> -		if (!vfio_file_iommu_group(file)) {
> +		if (!vfio_file_is_group(file)) {
>   			fput(file);
>   			ret = -EINVAL;
>   			break;
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 4d2de02f2ced6e..4e10a281420e66 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -59,6 +59,7 @@ struct vfio_group {
>   	struct mutex			group_lock;
>   	struct kvm			*kvm;
>   	struct file			*opened_file;
> +	bool				preserve_iommu_group;
>   	struct swait_queue_head		opened_file_wait;
>   	struct blocking_notifier_head	notifier;
>   };
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 9b1e5fd5f7b73c..e725cf38886c09 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -133,6 +133,10 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
>   {
>   	struct vfio_group *group;
>   
> +	/*
> +	 * group->iommu_group from the vfio.group_list cannot be NULL
> +	 * under the vfio.group_lock.
> +	 */
>   	list_for_each_entry(group, &vfio.group_list, vfio_next) {
>   		if (group->iommu_group == iommu_group) {
>   			refcount_inc(&group->drivers);
> @@ -159,7 +163,7 @@ static void vfio_group_release(struct device *dev)
>   
>   	mutex_destroy(&group->device_lock);
>   	mutex_destroy(&group->group_lock);
> -	iommu_group_put(group->iommu_group);
> +	WARN_ON(group->iommu_group);
>   	ida_free(&vfio.group_ida, MINOR(group->dev.devt));
>   	kfree(group);
>   }
> @@ -248,6 +252,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
>   static void vfio_device_remove_group(struct vfio_device *device)
>   {
>   	struct vfio_group *group = device->group;
> +	struct iommu_group *iommu_group;
>   
>   	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
>   		iommu_group_remove_device(device->dev);
> @@ -266,12 +271,25 @@ static void vfio_device_remove_group(struct vfio_device *device)
>   	cdev_device_del(&group->cdev, &group->dev);
>   
>   	/*
> -	 * Before we allow the last driver in the group to be unplugged the
> -	 * group must be sanitized so nothing else is or can reference it. This
> -	 * is because the group->iommu_group pointer should only be used so long
> -	 * as a device driver is attached to a device in the group.
> +	 * Revoke all users of group->iommu_group. At this point we know there
> +	 * are no devices active because we are unplugging the last one. Setting
> +	 * iommu_group to NULL blocks all new users.
>   	 */
> -	while (group->opened_file) {
> +	WARN_ON(group->notifier.head);
> +	if (group->container)
> +		vfio_group_detach_container(group);
> +	iommu_group = group->iommu_group;
> +	group->iommu_group = NULL;
> +
> +	/*
> +	 * Normally we can set the iommu_group to NULL above and that will
> +	 * prevent any users from touching it. However, the SPAPR kvm path takes
> +	 * a reference to the iommu_group and keeps using it in arch code out
> +	 * side our control. So if this path is triggred we have no choice but
> +	 * to wait for the group FD to be closed to be sure everyone has stopped
> +	 * touching the group.
> +	 */
> +	while (group->preserve_iommu_group && group->opened_file) {
>   		mutex_unlock(&vfio.group_lock);
>   		swait_event_idle_exclusive(group->opened_file_wait,
>   					   !group->opened_file);
> @@ -288,8 +306,8 @@ static void vfio_device_remove_group(struct vfio_device *device)
>   	WARN_ON(!list_empty(&group->device_list));
>   	WARN_ON(group->container || group->container_users);
>   	WARN_ON(group->notifier.head);
> -	group->iommu_group = NULL;
>   
> +	iommu_group_put(iommu_group);
>   	put_device(&group->dev);
>   }
>   
> @@ -531,6 +549,10 @@ static int __vfio_register_dev(struct vfio_device *device,
>   
>   	existing_device = vfio_group_get_device(group, device->dev);
>   	if (existing_device) {
> +		/*
> +		 * group->iommu_group is non-NULL because we hold the drivers
> +		 * refcount.
> +		 */
>   		dev_WARN(device->dev, "Device already exists on group %d\n",
>   			 iommu_group_id(group->iommu_group));
>   		vfio_device_put_registration(existing_device);
> @@ -702,6 +724,11 @@ static int vfio_group_ioctl_set_container(struct vfio_group *group,
>   		ret = -EINVAL;
>   		goto out_unlock;
>   	}
> +	if (!group->iommu_group) {
> +		ret = -ENODEV;
> +		goto out_unlock;
> +	}
> +
>   	container = vfio_container_from_file(f.file);
>   	ret = -EINVAL;
>   	if (container) {
> @@ -862,6 +889,11 @@ static int vfio_group_ioctl_get_status(struct vfio_group *group,
>   	status.flags = 0;
>   
>   	mutex_lock(&group->group_lock);
> +	if (!group->iommu_group) {
> +		mutex_unlock(&group->group_lock);
> +		return -ENODEV;
> +	}
> +
>   	if (group->container)
>   		status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET |
>   				VFIO_GROUP_FLAGS_VIABLE;
> @@ -938,13 +970,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>   	filep->private_data = NULL;
>   
>   	mutex_lock(&group->group_lock);
> -	/*
> -	 * Device FDs hold a group file reference, therefore the group release
> -	 * is only called when there are no open devices.
> -	 */
> -	WARN_ON(group->notifier.head);
> -	if (group->container)
> -		vfio_group_detach_container(group);
>   	group->opened_file = NULL;
>   	mutex_unlock(&group->group_lock);
>   	swake_up_one(&group->opened_file_wait);
> @@ -1553,17 +1578,34 @@ static const struct file_operations vfio_device_fops = {
>    * @file: VFIO group file
>    *
>    * The returned iommu_group is valid as long as a ref is held on the file.
> + * This function is deprecated, only the SPAPR path in kvm should call it.
>    */
>   struct iommu_group *vfio_file_iommu_group(struct file *file)
>   {
>   	struct vfio_group *group = file->private_data;
> +	struct iommu_group *iommu_group = NULL;
> +
> +	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
> +		return NULL;
>   
>   	if (file->f_op != &vfio_group_fops)
>   		return NULL;
> -	return group->iommu_group;
> +
> +	mutex_lock(&group->group_lock);
> +	if (group->iommu_group) {
> +		iommu_group = group->iommu_group;
> +		group->preserve_iommu_group = true;
> +	}
> +	mutex_unlock(&group->group_lock);
> +	return iommu_group;
>   }
>   EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
>   
> +bool vfio_file_is_group(struct file *file)
> +{
> +	return (file->f_op == &vfio_group_fops;
> +}
> +
>   /**
>    * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
>    *        is always CPU cache coherent
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 73bcb92179a224..bd9faaab85de18 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -199,6 +199,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>    * External user API
>    */
>   struct iommu_group *vfio_file_iommu_group(struct file *file);
> +bool vfio_file_is_group(struct file *file);
>   bool vfio_file_enforced_coherent(struct file *file);
>   void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
>   bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index ce1b01d02c5197..6ecd3aca047375 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -77,6 +77,23 @@ static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>   	return ret;
>   }
>   
> +static bool kvm_vfio_file_is_group(struct file *file)
> +{
> +	bool (*fn)(struct file *file);
> +	bool ret;
> +
> +	fn = symbol_get(vfio_file_is_group);
> +	if (!fn)
> +		return false;
> +
> +	ret = fn(file);
> +
> +	symbol_put(vfio_file_is_group);
> +
> +	return ret;
> +}
> +
> +
>   #ifdef CONFIG_SPAPR_TCE_IOMMU
>   static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
>   					     struct kvm_vfio_group *kvg)
> @@ -136,7 +153,7 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
>   		return -EBADF;
>   
>   	/* Ensure the FD is a vfio group FD.*/
> -	if (!kvm_vfio_file_iommu_group(filp)) {
> +	if (!kvm_vfio_file_is_group(filp)) {
>   		ret = -EINVAL;
>   		goto err_fput;
>   	}

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 17:15             ` Christian Borntraeger
@ 2022-10-04 17:22               ` Jason Gunthorpe
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2022-10-04 17:22 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman, Cornelia Huck, kvm, Qian Cai,
	Joerg Roedel, Marek Szyprowski, linux-s390

On Tue, Oct 04, 2022 at 07:15:51PM +0200, Christian Borntraeger wrote:
> 
> Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
> > On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:
> > 
> > > > Does some userspace have the group FD open when it stucks like this,
> > > > eg what does fuser say?
> > > 
> > > /proc/<virtnodedevd>/fd
> > > 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
> > > 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
> > > 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
> > > 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
> > > 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
> > > 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
> > > 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
> > > 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
> > > 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
> > > 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
> > > 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
> > > 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
> > > 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
> > > 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
> > > 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
> > > 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
> > 
> > Seems like a userspace bug to keep the group FD open after the /dev/
> > file has been deleted :|
> > 
> > What do you think about this?
> 
> On top of which tree is this?

It should apply on vfio-next

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 16:28           ` Jason Gunthorpe
  2022-10-04 17:15             ` Christian Borntraeger
@ 2022-10-04 17:36             ` Christian Borntraeger
  2022-10-04 17:48               ` Christian Borntraeger
  2022-10-04 18:22               ` Matthew Rosato
  1 sibling, 2 replies; 19+ messages in thread
From: Christian Borntraeger @ 2022-10-04 17:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman, Cornelia Huck, kvm, Qian Cai,
	Joerg Roedel, Marek Szyprowski, linux-s390



Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
> On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:
> 
>>> Does some userspace have the group FD open when it stucks like this,
>>> eg what does fuser say?
>>
>> /proc/<virtnodedevd>/fd
>> 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
>> 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
>> 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
>> 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
>> 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
>> 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
>> 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
>> 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
>> 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
>> 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
>> 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
>> 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
>> 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
>> 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
>> 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
>> 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
> 
> Seems like a userspace bug to keep the group FD open after the /dev/
> file has been deleted :|
> 
> What do you think about this?
> 
> commit a54a852b1484b1605917a8f4d80691db333b25ed
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Date:   Tue Oct 4 13:14:37 2022 -0300
> 
>      vfio: Make the group FD disassociate from the iommu_group
>      
>      Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>      the pointer is NULL the vfio_group users promise not to touch the
>      iommu_group. This allows a driver to be hot unplugged while userspace is
>      keeping the group FD open.
>      
>      SPAPR mode is excluded from this behavior because of how it wrongly hacks
>      part of its iommu interface through KVM. Due to this we loose control over
>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>      layer via vfio_group_detach_container().
>      
>      Thus, for SPAPR the group FDs must still be closed before a device can be
>      hot unplugged.
>      
>      This fixes a userspace regression where we learned that virtnodedevd
>      leaves a group FD open even though the /dev/ node for it has been deleted
>      and all the drivers for it unplugged.
>      
>      Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Almost :-)

drivers/vfio/vfio_main.c: In function 'vfio_file_is_group':
drivers/vfio/vfio_main.c:1606:47: error: expected ')' before ';' token
  1606 |         return (file->f_op == &vfio_group_fops;
       |                ~                              ^
       |                                               )
drivers/vfio/vfio_main.c:1606:48: error: expected ';' before '}' token
  1606 |         return (file->f_op == &vfio_group_fops;
       |                                                ^
       |                                                ;
  1607 | }
       | ~


With that fixed I get:

ERROR: modpost: "vfio_file_is_group" [drivers/vfio/pci/vfio-pci-core.ko] undefined!

With that worked around (m -> y)


Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>

At least the vfio-ap part

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 17:36             ` Christian Borntraeger
@ 2022-10-04 17:48               ` Christian Borntraeger
  2022-10-04 18:22               ` Matthew Rosato
  1 sibling, 0 replies; 19+ messages in thread
From: Christian Borntraeger @ 2022-10-04 17:48 UTC (permalink / raw)
  To: Matthew Rosato, Eric Farman
  Cc: Alex Williamson, Tony Krowiak, Jason J . Herne, Marc Hartmayer,
	Cornelia Huck, kvm, Qian Cai, Joerg Roedel, Marek Szyprowski,
	linux-s390, Jason Gunthorpe



Am 04.10.22 um 19:36 schrieb Christian Borntraeger:
> 
> 
> Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
>> On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:
>>
>>>> Does some userspace have the group FD open when it stucks like this,
>>>> eg what does fuser say?
>>>
>>> /proc/<virtnodedevd>/fd
>>> 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
>>> 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
>>> 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
>>> 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
>>> 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
>>> 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
>>> 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
>>> 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
>>> 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
>>> 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
>>> 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
>>> 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
>>> 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
>>> 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
>>> 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
>>> 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
>>
>> Seems like a userspace bug to keep the group FD open after the /dev/
>> file has been deleted :|
>>
>> What do you think about this?
>>
>> commit a54a852b1484b1605917a8f4d80691db333b25ed
>> Author: Jason Gunthorpe <jgg@ziepe.ca>
>> Date:   Tue Oct 4 13:14:37 2022 -0300
>>
>>      vfio: Make the group FD disassociate from the iommu_group
>>      Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>>      the pointer is NULL the vfio_group users promise not to touch the
>>      iommu_group. This allows a driver to be hot unplugged while userspace is
>>      keeping the group FD open.
>>      SPAPR mode is excluded from this behavior because of how it wrongly hacks
>>      part of its iommu interface through KVM. Due to this we loose control over
>>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>>      layer via vfio_group_detach_container().
>>      Thus, for SPAPR the group FDs must still be closed before a device can be
>>      hot unplugged.
>>      This fixes a userspace regression where we learned that virtnodedevd
>>      leaves a group FD open even though the /dev/ node for it has been deleted
>>      and all the drivers for it unplugged.
>>      Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Almost :-)
> 
> drivers/vfio/vfio_main.c: In function 'vfio_file_is_group':
> drivers/vfio/vfio_main.c:1606:47: error: expected ')' before ';' token
>   1606 |         return (file->f_op == &vfio_group_fops;
>        |                ~                              ^
>        |                                               )
> drivers/vfio/vfio_main.c:1606:48: error: expected ';' before '}' token
>   1606 |         return (file->f_op == &vfio_group_fops;
>        |                                                ^
>        |                                                ;
>   1607 | }
>        | ~
> 
> 
> With that fixed I get:
> 
> ERROR: modpost: "vfio_file_is_group" [drivers/vfio/pci/vfio-pci-core.ko] undefined!
> 
> With that worked around (m -> y)
> 
> 
> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> 
> At least the vfio-ap part

Matthew, Eric, can you verify this on top of vfio-next with vfio-ccw and vfio pci?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 17:36             ` Christian Borntraeger
  2022-10-04 17:48               ` Christian Borntraeger
@ 2022-10-04 18:22               ` Matthew Rosato
  2022-10-04 18:56                 ` Eric Farman
  2022-10-05 13:46                 ` Matthew Rosato
  1 sibling, 2 replies; 19+ messages in thread
From: Matthew Rosato @ 2022-10-04 18:22 UTC (permalink / raw)
  To: Christian Borntraeger, Jason Gunthorpe
  Cc: Alex Williamson, Tony Krowiak, Jason J . Herne, Marc Hartmayer,
	Eric Farman, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski, linux-s390

On 10/4/22 1:36 PM, Christian Borntraeger wrote:
> 
> 
> Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
>> On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:
>>
>>>> Does some userspace have the group FD open when it stucks like this,
>>>> eg what does fuser say?
>>>
>>> /proc/<virtnodedevd>/fd
>>> 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
>>> 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
>>> 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
>>> 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
>>> 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
>>> 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
>>> 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
>>> 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
>>> 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
>>> 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
>>> 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
>>> 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
>>> 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
>>> 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
>>> 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
>>> 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
>>
>> Seems like a userspace bug to keep the group FD open after the /dev/
>> file has been deleted :|
>>
>> What do you think about this?
>>
>> commit a54a852b1484b1605917a8f4d80691db333b25ed
>> Author: Jason Gunthorpe <jgg@ziepe.ca>
>> Date:   Tue Oct 4 13:14:37 2022 -0300
>>
>>      vfio: Make the group FD disassociate from the iommu_group
>>           Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>>      the pointer is NULL the vfio_group users promise not to touch the
>>      iommu_group. This allows a driver to be hot unplugged while userspace is
>>      keeping the group FD open.
>>           SPAPR mode is excluded from this behavior because of how it wrongly hacks
>>      part of its iommu interface through KVM. Due to this we loose control over
>>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>>      layer via vfio_group_detach_container().
>>           Thus, for SPAPR the group FDs must still be closed before a device can be
>>      hot unplugged.
>>           This fixes a userspace regression where we learned that virtnodedevd
>>      leaves a group FD open even though the /dev/ node for it has been deleted
>>      and all the drivers for it unplugged.
>>           Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Almost :-)
> 
> drivers/vfio/vfio_main.c: In function 'vfio_file_is_group':
> drivers/vfio/vfio_main.c:1606:47: error: expected ')' before ';' token
>  1606 |         return (file->f_op == &vfio_group_fops;
>       |                ~                              ^
>       |                                               )
> drivers/vfio/vfio_main.c:1606:48: error: expected ';' before '}' token
>  1606 |         return (file->f_op == &vfio_group_fops;
>       |                                                ^
>       |                                                ;
>  1607 | }
>       | ~
> 
> 
> With that fixed I get:
> 
> ERROR: modpost: "vfio_file_is_group" [drivers/vfio/pci/vfio-pci-core.ko] undefined!
> 
> With that worked around (m -> y)


Looks like this can be solved with EXPORT_SYMBOL_GPL(vfio_file_is_group);

Also:

arch/s390/kvm/../../../virt/kvm/vfio.c:64:28: warning: ‘kvm_vfio_file_iommu_group’ defined but not used [-Wunused-function]
   64 | static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
      |                            ^~~~~~~~~~~~~~~~~~~~~~~~~

kvm_vfio_file_iommu_group looks like it is now SPAPR-only

> 
> 
> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> 
> At least the vfio-ap part

Nope, with this s390 vfio-pci at least breaks:

[  132.943389] kernel BUG at lib/list_debug.c:53!
[  132.943406] monitor event: 0040 ilc:2 [#1] SMP 
[  132.943410] Modules linked in: vfio_pci kvm vfio_pci_core irqbypass vfio_virqfd vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defr
ag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ism smc ib_uverbs ib_core uvdevice s390_trng tape_3590 tape tape_class eadm_sch vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha mlx5_core aes_s390 des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sh
a256_s390 nvme_core sha1_s390 sha_common zfcp scsi_transport_fc pkey zcrypt rng_core autofs4 [last unloaded: vfio_pci]
[  132.943457] CPU: 12 PID: 4991 Comm: nose2 Tainted: G        W          6.0.0-rc4 #40
[  132.943460] Hardware name: IBM 3931 A01 782 (LPAR)
[  132.943462] Krnl PSW : 0704c00180000000 00000000cbc90568 (__list_del_entry_valid+0xd8/0xf0)
[  132.943469]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[  132.943474] Krnl GPRS: 8000000000000001 0000000900000027 000000000000004e 00000000ccc1ffe0
[  132.943477]            00000000fffeffff 00000009fc290000 0000000000000000 0000000000000080
[  132.943480]            00000000acc86438 0000000000000000 00000000acc86420 00000000a1492800
[  132.943483]            00000000922a0000 000003ffb9dce260 00000000cbc90564 0000038004a6b9f8
[  132.943489] Krnl Code: 00000000cbc90558: c0200045eff3        larl    %r2,00000000cc54e53e
[  132.943489]            00000000cbc9055e: c0e50022c7d9        brasl   %r14,00000000cc0e9510
[  132.943489]           #00000000cbc90564: af000000            mc      0,0
[  132.943489]           >00000000cbc90568: b9040032            lgr     %r3,%r2
[  132.943489]            00000000cbc9056c: c0200045efd4        larl    %r2,00000000cc54e514
[  132.943489]            00000000cbc90572: c0e50022c7cf        brasl   %r14,00000000cc0e9510
[  132.943489]            00000000cbc90578: af000000            mc      0,0
[  132.943489]            00000000cbc9057c: 0707                bcr     0,%r7
[  132.943510] Call Trace:
[  132.943512]  [<00000000cbc90568>] __list_del_entry_valid+0xd8/0xf0 
[  132.943515] ([<00000000cbc90564>] __list_del_entry_valid+0xd4/0xf0)
[  132.943518]  [<000003ff8011a1b8>] vfio_group_detach_container+0x88/0x170 [vfio] 
[  132.943524]  [<000003ff801176c0>] vfio_device_remove_group.isra.0+0xb0/0x1e0 [vfio] 
[  132.943529]  [<000003ff804f9e54>] vfio_pci_core_unregister_device+0x34/0x80 [vfio_pci_core] 
[  132.943535]  [<000003ff804ae1c4>] vfio_pci_remove+0x2c/0x40 [vfio_pci] 
[  132.943539]  [<00000000cbd58c3c>] pci_device_remove+0x3c/0x98 
[  132.943542]  [<00000000cbdbdbce>] device_release_driver_internal+0x1c6/0x288 
[  132.943545]  [<00000000cbd4e284>] pci_stop_bus_device+0x94/0xc0 
[  132.943549]  [<00000000cbd4e570>] pci_stop_and_remove_bus_device_locked+0x30/0x48 
[  132.943552]  [<00000000cb55d980>] zpci_bus_remove_device+0x68/0xa8 
[  132.943555]  [<00000000cb556e82>] zpci_deconfigure_device+0x3a/0xe0 
[  132.943558]  [<00000000cbd65d04>] power_write_file+0x7c/0x130 
[  132.943561]  [<00000000cb8fbc90>] kernfs_fop_write_iter+0x138/0x210 
[  132.943565]  [<00000000cb837344>] vfs_write+0x194/0x2e0 "
[  132.943568]  [<00000000cb8376fa>] ksys_write+0x6a/0xf8 
[  132.943571]  [<00000000cc0f918c>] __do_syscall+0x1d4/0x200 
[  132.943575]  [<00000000cc107e42>] system_call+0x82/0xb0 
[  132.943577] Last Breaking-Event-Address:
[  132.943579]  [<00000000cc0e955c>] _printk+0x4c/0x58
[  132.943585] Kernel panic - not syncing: Fatal exception: panic_on_oops

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 18:22               ` Matthew Rosato
@ 2022-10-04 18:56                 ` Eric Farman
  2022-10-05 13:46                 ` Matthew Rosato
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Farman @ 2022-10-04 18:56 UTC (permalink / raw)
  To: Matthew Rosato, Christian Borntraeger, Jason Gunthorpe
  Cc: Alex Williamson, Tony Krowiak, Jason J . Herne, Marc Hartmayer,
	Cornelia Huck, kvm, Qian Cai, Joerg Roedel, Marek Szyprowski,
	linux-s390

On Tue, 2022-10-04 at 14:22 -0400, Matthew Rosato wrote:
> On 10/4/22 1:36 PM, Christian Borntraeger wrote:
> > 
> > 
> > Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
> > > On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger
> > > wrote:
> > > 
> > > > > Does some userspace have the group FD open when it stucks
> > > > > like this,
> > > > > eg what does fuser say?
> > > > 
> > > > /proc/<virtnodedevd>/fd
> > > > 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
> > > > 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
> > > > 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
> > > > 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 ->
> > > > 'socket:[51479]'
> > > > 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 ->
> > > > 'anon_inode:[eventfd]'
> > > > 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 ->
> > > > 'socket:[51485]'
> > > > 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 ->
> > > > 'socket:[51487]'
> > > > 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 ->
> > > > 'socket:[51486]'
> > > > 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 ->
> > > > 'anon_inode:[eventfd]'
> > > > 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 ->
> > > > 'socket:[60421]'
> > > > 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 ->
> > > > 'anon_inode:[eventfd]'
> > > > 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 ->
> > > > 'socket:[28008]'
> > > > 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 ->
> > > > /run/libvirt/nodedev/driver.pid
> > > > 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 ->
> > > > 'socket:[28818]'
> > > > 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 ->
> > > > 'socket:[51479]'
> > > > 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 ->
> > > > '/dev/vfio/3 (deleted)'
> > > 
> > > Seems like a userspace bug to keep the group FD open after the
> > > /dev/
> > > file has been deleted :|
> > > 
> > > What do you think about this?
> > > 
> > > commit a54a852b1484b1605917a8f4d80691db333b25ed
> > > Author: Jason Gunthorpe <jgg@ziepe.ca>
> > > Date:   Tue Oct 4 13:14:37 2022 -0300
> > > 
> > >      vfio: Make the group FD disassociate from the iommu_group
> > >           Allow the vfio_group struct to exist with a NULL
> > > iommu_group pointer. When
> > >      the pointer is NULL the vfio_group users promise not to
> > > touch the
> > >      iommu_group. This allows a driver to be hot unplugged while
> > > userspace is
> > >      keeping the group FD open.
> > >           SPAPR mode is excluded from this behavior because of
> > > how it wrongly hacks
> > >      part of its iommu interface through KVM. Due to this we
> > > loose control over
> > >      what it is doing and cannot revoke the iommu_group usage in
> > > the IOMMU
> > >      layer via vfio_group_detach_container().
> > >           Thus, for SPAPR the group FDs must still be closed
> > > before a device can be
> > >      hot unplugged.
> > >           This fixes a userspace regression where we learned that
> > > virtnodedevd
> > >      leaves a group FD open even though the /dev/ node for it has
> > > been deleted
> > >      and all the drivers for it unplugged.
> > >           Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime
> > > for struct iommu_group")
> > >      Reported-by: Christian Borntraeger
> > > <borntraeger@linux.ibm.com>
> > >      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > 
> > Almost :-)
> > 
> > drivers/vfio/vfio_main.c: In function 'vfio_file_is_group':
> > drivers/vfio/vfio_main.c:1606:47: error: expected ')' before ';'
> > token
> >  1606 |         return (file->f_op == &vfio_group_fops;
> >       |                ~                              ^
> >       |                                               )
> > drivers/vfio/vfio_main.c:1606:48: error: expected ';' before '}'
> > token
> >  1606 |         return (file->f_op == &vfio_group_fops;
> >       |                                                ^
> >       |                                                ;
> >  1607 | }
> >       | ~
> > 
> > 
> > With that fixed I get:
> > 
> > ERROR: modpost: "vfio_file_is_group" [drivers/vfio/pci/vfio-pci-
> > core.ko] undefined!
> > 
> > With that worked around (m -> y)
> 
> 
> Looks like this can be solved with
> EXPORT_SYMBOL_GPL(vfio_file_is_group);
> 
> Also:
> 
> arch/s390/kvm/../../../virt/kvm/vfio.c:64:28: warning:
> ‘kvm_vfio_file_iommu_group’ defined but not used [-Wunused-function]
>    64 | static struct iommu_group *kvm_vfio_file_iommu_group(struct
> file *file)
>       |                            ^~~~~~~~~~~~~~~~~~~~~~~~~
> 
> kvm_vfio_file_iommu_group looks like it is now SPAPR-only
> 
> > 
> > 
> > Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> > 
> > At least the vfio-ap part
> 

I can reproduce the problem with vfio-ccw, also blamed to this patch.
With the changes described above, things work as they did before.

Tested-by: Eric Farman <farman@linux.ibm.com> # vfio-ccw, vfio-ap

I can try a v2 when the below gets addressed.

> Nope, with this s390 vfio-pci at least breaks:
> 
> [  132.943389] kernel BUG at lib/list_debug.c:53!
> [  132.943406] monitor event: 0040 ilc:2 [#1] SMP 
> [  132.943410] Modules linked in: vfio_pci kvm vfio_pci_core
> irqbypass vfio_virqfd vhost_vsock vmw_vsock_virtio_transport_common
> vsock vhost vhost_iotlb nft_fib_inet nft_fib_ipv4 nft_fib_ipv6
> nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject
> nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defr
> ag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ism smc ib_uverbs
> ib_core uvdevice s390_trng tape_3590 tape tape_class eadm_sch
> vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs
> ghash_s390 prng chacha_s390 libchacha mlx5_core aes_s390 des_s390
> libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sh
> a256_s390 nvme_core sha1_s390 sha_common zfcp scsi_transport_fc pkey
> zcrypt rng_core autofs4 [last unloaded: vfio_pci]
> [  132.943457] CPU: 12 PID: 4991 Comm: nose2 Tainted: G       
> W          6.0.0-rc4 #40
> [  132.943460] Hardware name: IBM 3931 A01 782 (LPAR)
> [  132.943462] Krnl PSW : 0704c00180000000 00000000cbc90568
> (__list_del_entry_valid+0xd8/0xf0)
> [  132.943469]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3
> CC:0 PM:0 RI:0 EA:3
> [  132.943474] Krnl GPRS: 8000000000000001 0000000900000027
> 000000000000004e 00000000ccc1ffe0
> [  132.943477]            00000000fffeffff 00000009fc290000
> 0000000000000000 0000000000000080
> [  132.943480]            00000000acc86438 0000000000000000
> 00000000acc86420 00000000a1492800
> [  132.943483]            00000000922a0000 000003ffb9dce260
> 00000000cbc90564 0000038004a6b9f8
> [  132.943489] Krnl Code: 00000000cbc90558: c0200045eff3       
> larl    %r2,00000000cc54e53e
> [  132.943489]            00000000cbc9055e: c0e50022c7d9       
> brasl   %r14,00000000cc0e9510
> [  132.943489]           #00000000cbc90564: af000000           
> mc      0,0
> [  132.943489]           >00000000cbc90568: b9040032           
> lgr     %r3,%r2
> [  132.943489]            00000000cbc9056c: c0200045efd4       
> larl    %r2,00000000cc54e514
> [  132.943489]            00000000cbc90572: c0e50022c7cf       
> brasl   %r14,00000000cc0e9510
> [  132.943489]            00000000cbc90578: af000000           
> mc      0,0
> [  132.943489]            00000000cbc9057c: 0707               
> bcr     0,%r7
> [  132.943510] Call Trace:
> [  132.943512]  [<00000000cbc90568>] __list_del_entry_valid+0xd8/0xf0
> [  132.943515] ([<00000000cbc90564>]
> __list_del_entry_valid+0xd4/0xf0)
> [  132.943518]  [<000003ff8011a1b8>]
> vfio_group_detach_container+0x88/0x170 [vfio] 
> [  132.943524]  [<000003ff801176c0>]
> vfio_device_remove_group.isra.0+0xb0/0x1e0 [vfio] 
> [  132.943529]  [<000003ff804f9e54>]
> vfio_pci_core_unregister_device+0x34/0x80 [vfio_pci_core] 
> [  132.943535]  [<000003ff804ae1c4>] vfio_pci_remove+0x2c/0x40
> [vfio_pci] 
> [  132.943539]  [<00000000cbd58c3c>] pci_device_remove+0x3c/0x98 
> [  132.943542]  [<00000000cbdbdbce>]
> device_release_driver_internal+0x1c6/0x288 
> [  132.943545]  [<00000000cbd4e284>] pci_stop_bus_device+0x94/0xc0 
> [  132.943549]  [<00000000cbd4e570>]
> pci_stop_and_remove_bus_device_locked+0x30/0x48 
> [  132.943552]  [<00000000cb55d980>] zpci_bus_remove_device+0x68/0xa8
> [  132.943555]  [<00000000cb556e82>]
> zpci_deconfigure_device+0x3a/0xe0 
> [  132.943558]  [<00000000cbd65d04>] power_write_file+0x7c/0x130 
> [  132.943561]  [<00000000cb8fbc90>]
> kernfs_fop_write_iter+0x138/0x210 
> [  132.943565]  [<00000000cb837344>] vfs_write+0x194/0x2e0 "
> [  132.943568]  [<00000000cb8376fa>] ksys_write+0x6a/0xf8 
> [  132.943571]  [<00000000cc0f918c>] __do_syscall+0x1d4/0x200 
> [  132.943575]  [<00000000cc107e42>] system_call+0x82/0xb0 
> [  132.943577] Last Breaking-Event-Address:
> [  132.943579]  [<00000000cc0e955c>] _printk+0x4c/0x58
> [  132.943585] Kernel panic - not syncing: Fatal exception:
> panic_on_oops


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 18:22               ` Matthew Rosato
  2022-10-04 18:56                 ` Eric Farman
@ 2022-10-05 13:46                 ` Matthew Rosato
  2022-10-05 13:57                   ` Jason Gunthorpe
  1 sibling, 1 reply; 19+ messages in thread
From: Matthew Rosato @ 2022-10-05 13:46 UTC (permalink / raw)
  To: Christian Borntraeger, Jason Gunthorpe
  Cc: Alex Williamson, Tony Krowiak, Jason J . Herne, Marc Hartmayer,
	Eric Farman, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski, linux-s390

On 10/4/22 2:22 PM, Matthew Rosato wrote:
> On 10/4/22 1:36 PM, Christian Borntraeger wrote:
>>
>>
>> Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
>>> On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:
>>>
>>>>> Does some userspace have the group FD open when it stucks like this,
>>>>> eg what does fuser say?
>>>>
>>>> /proc/<virtnodedevd>/fd
>>>> 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
>>>> 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
>>>> 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
>>>> 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
>>>> 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
>>>> 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
>>>> 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
>>>> 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
>>>> 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
>>>> 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
>>>> 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
>>>> 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
>>>> 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
>>>> 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
>>>> 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
>>>> 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
>>>
>>> Seems like a userspace bug to keep the group FD open after the /dev/
>>> file has been deleted :|
>>>
>>> What do you think about this?
>>>
>>> commit a54a852b1484b1605917a8f4d80691db333b25ed
>>> Author: Jason Gunthorpe <jgg@ziepe.ca>
>>> Date:   Tue Oct 4 13:14:37 2022 -0300
>>>
>>>      vfio: Make the group FD disassociate from the iommu_group
>>>           Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>>>      the pointer is NULL the vfio_group users promise not to touch the
>>>      iommu_group. This allows a driver to be hot unplugged while userspace is
>>>      keeping the group FD open.
>>>           SPAPR mode is excluded from this behavior because of how it wrongly hacks
>>>      part of its iommu interface through KVM. Due to this we loose control over
>>>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>>>      layer via vfio_group_detach_container().
>>>           Thus, for SPAPR the group FDs must still be closed before a device can be
>>>      hot unplugged.
>>>           This fixes a userspace regression where we learned that virtnodedevd
>>>      leaves a group FD open even though the /dev/ node for it has been deleted
>>>      and all the drivers for it unplugged.
>>>           Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>>>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>>>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>
>> Almost :-)
>>
>> drivers/vfio/vfio_main.c: In function 'vfio_file_is_group':
>> drivers/vfio/vfio_main.c:1606:47: error: expected ')' before ';' token
>>  1606 |         return (file->f_op == &vfio_group_fops;
>>       |                ~                              ^
>>       |                                               )
>> drivers/vfio/vfio_main.c:1606:48: error: expected ';' before '}' token
>>  1606 |         return (file->f_op == &vfio_group_fops;
>>       |                                                ^
>>       |                                                ;
>>  1607 | }
>>       | ~
>>
>>
>> With that fixed I get:
>>
>> ERROR: modpost: "vfio_file_is_group" [drivers/vfio/pci/vfio-pci-core.ko] undefined!
>>
>> With that worked around (m -> y)
> 
> 
> Looks like this can be solved with EXPORT_SYMBOL_GPL(vfio_file_is_group);
> 
> Also:
> 
> arch/s390/kvm/../../../virt/kvm/vfio.c:64:28: warning: ‘kvm_vfio_file_iommu_group’ defined but not used [-Wunused-function]
>    64 | static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>       |                            ^~~~~~~~~~~~~~~~~~~~~~~~~
> 
> kvm_vfio_file_iommu_group looks like it is now SPAPR-only
> 
>>
>>
>> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>>
>> At least the vfio-ap part
> 
> Nope, with this s390 vfio-pci at least breaks:
> 
> [  132.943389] kernel BUG at lib/list_debug.c:53!
> [  132.943406] monitor event: 0040 ilc:2 [#1] SMP 
> [  132.943410] Modules linked in: vfio_pci kvm vfio_pci_core irqbypass vfio_virqfd vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defr
> ag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ism smc ib_uverbs ib_core uvdevice s390_trng tape_3590 tape tape_class eadm_sch vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha mlx5_core aes_s390 des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sh
> a256_s390 nvme_core sha1_s390 sha_common zfcp scsi_transport_fc pkey zcrypt rng_core autofs4 [last unloaded: vfio_pci]
> [  132.943457] CPU: 12 PID: 4991 Comm: nose2 Tainted: G        W          6.0.0-rc4 #40
> [  132.943460] Hardware name: IBM 3931 A01 782 (LPAR)
> [  132.943462] Krnl PSW : 0704c00180000000 00000000cbc90568 (__list_del_entry_valid+0xd8/0xf0)
> [  132.943469]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> [  132.943474] Krnl GPRS: 8000000000000001 0000000900000027 000000000000004e 00000000ccc1ffe0
> [  132.943477]            00000000fffeffff 00000009fc290000 0000000000000000 0000000000000080
> [  132.943480]            00000000acc86438 0000000000000000 00000000acc86420 00000000a1492800
> [  132.943483]            00000000922a0000 000003ffb9dce260 00000000cbc90564 0000038004a6b9f8
> [  132.943489] Krnl Code: 00000000cbc90558: c0200045eff3        larl    %r2,00000000cc54e53e
> [  132.943489]            00000000cbc9055e: c0e50022c7d9        brasl   %r14,00000000cc0e9510
> [  132.943489]           #00000000cbc90564: af000000            mc      0,0
> [  132.943489]           >00000000cbc90568: b9040032            lgr     %r3,%r2
> [  132.943489]            00000000cbc9056c: c0200045efd4        larl    %r2,00000000cc54e514
> [  132.943489]            00000000cbc90572: c0e50022c7cf        brasl   %r14,00000000cc0e9510
> [  132.943489]            00000000cbc90578: af000000            mc      0,0
> [  132.943489]            00000000cbc9057c: 0707                bcr     0,%r7
> [  132.943510] Call Trace:
> [  132.943512]  [<00000000cbc90568>] __list_del_entry_valid+0xd8/0xf0 
> [  132.943515] ([<00000000cbc90564>] __list_del_entry_valid+0xd4/0xf0)
> [  132.943518]  [<000003ff8011a1b8>] vfio_group_detach_container+0x88/0x170 [vfio] 
> [  132.943524]  [<000003ff801176c0>] vfio_device_remove_group.isra.0+0xb0/0x1e0 [vfio] 
> [  132.943529]  [<000003ff804f9e54>] vfio_pci_core_unregister_device+0x34/0x80 [vfio_pci_core] 
> [  132.943535]  [<000003ff804ae1c4>] vfio_pci_remove+0x2c/0x40 [vfio_pci] 
> [  132.943539]  [<00000000cbd58c3c>] pci_device_remove+0x3c/0x98 
> [  132.943542]  [<00000000cbdbdbce>] device_release_driver_internal+0x1c6/0x288 
> [  132.943545]  [<00000000cbd4e284>] pci_stop_bus_device+0x94/0xc0 
> [  132.943549]  [<00000000cbd4e570>] pci_stop_and_remove_bus_device_locked+0x30/0x48 
> [  132.943552]  [<00000000cb55d980>] zpci_bus_remove_device+0x68/0xa8 
> [  132.943555]  [<00000000cb556e82>] zpci_deconfigure_device+0x3a/0xe0 
> [  132.943558]  [<00000000cbd65d04>] power_write_file+0x7c/0x130 
> [  132.943561]  [<00000000cb8fbc90>] kernfs_fop_write_iter+0x138/0x210 
> [  132.943565]  [<00000000cb837344>] vfs_write+0x194/0x2e0 "
> [  132.943568]  [<00000000cb8376fa>] ksys_write+0x6a/0xf8 
> [  132.943571]  [<00000000cc0f918c>] __do_syscall+0x1d4/0x200 
> [  132.943575]  [<00000000cc107e42>] system_call+0x82/0xb0 
> [  132.943577] Last Breaking-Event-Address:
> [  132.943579]  [<00000000cc0e955c>] _printk+0x4c/0x58
> [  132.943585] Kernel panic - not syncing: Fatal exception: panic_on_oops

(again, with the follow-up applied) Besides the panic above I just noticed there is also this warning that immediately precedes and is perhaps more useful.  Re: what triggers the WARN, both group->owner and group->owner_cnt are already 0:

[  375.262923] WARNING: CPU: 8 PID: 5182 at drivers/iommu/iommu.c:3211 iommu_group_release_dma_owner+0x38/0x90
[  375.262932] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf
_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs ism smc ib_core uvdevice s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 mlx5_core libdes sha3_512_s390 sha3_256_s390
 nvme sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
[  375.262969] CPU: 8 PID: 5182 Comm: nose2 Not tainted 6.0.0-rc4 #50
[  375.262971] Hardware name: IBM 3931 A01 782 (LPAR)
[  375.262972] Krnl PSW : 0704c00180000000 00000001d1cd8b34 (iommu_group_release_dma_owner+0x3c/0x90)
[  375.262976]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[  375.262979] Krnl GPRS: 8000000000000001 0000000000000000 00000000844dd058 00000000ba60c200
[  375.262981]            00000000fffeffff 00000009fc290000 0000000000000000 0000000000000080
[  375.262984]            00000000b1a6d840 00000000b1a6d858 00000000844dd058 00000000844dd000
[  375.262986]            00000000ba60c200 000003ff962ce260 00000001d1cd8b26 0000038003d0f9d8
[  375.262994] Krnl Code: 00000001d1cd8b26: e310b0c00012        lt      %r1,192(%r11)
[  375.262994]            00000001d1cd8b2c: a774000c            brc     7,00000001d1cd8b44
[  375.262994]           #00000001d1cd8b30: af000000            mc      0,0
[  375.262994]           >00000001d1cd8b34: b904002a            lgr     %r2,%r10
[  375.262994]            00000001d1cd8b38: ebaff0a00004        lmg     %r10,%r15,160(%r15)
[  375.262994]            00000001d1cd8b3e: c0f4001aa84d        brcl    15,00000001d202dbd8
[  375.262994]            00000001d1cd8b44: e310b0c80002        ltg     %r1,200(%r11)
[  375.262994]            00000001d1cd8b4a: a784fff3            brc     8,00000001d1cd8b30
[  375.263048] Call Trace:
[  375.263051]  [<00000001d1cd8b34>] iommu_group_release_dma_owner+0x3c/0x90 
[  375.263058]  [<000003ff801431c8>] vfio_group_detach_container+0x98/0x1a0 [vfio] 
[  375.263067]  [<000003ff801406c0>] vfio_device_remove_group.isra.0+0xb0/0x1e0 [vfio] 
[  375.263071]  [<000003ff80540e54>] vfio_pci_core_unregister_device+0x34/0x80 [vfio_pci_core] 
[  375.263079]  [<000003ff804f31c4>] vfio_pci_remove+0x2c/0x40 [vfio_pci] 
[  375.263082]  [<00000001d1c84c3c>] pci_device_remove+0x3c/0x98 
[  375.263085]  [<00000001d1ce9bce>] device_release_driver_internal+0x1c6/0x288 
[  375.263090]  [<00000001d1c7a284>] pci_stop_bus_device+0x94/0xc0 
[  375.263093]  [<00000001d1c7a570>] pci_stop_and_remove_bus_device_locked+0x30/0x48 
[  375.263096]  [<00000001d1489980>] zpci_bus_remove_device+0x68/0xa8 
[  375.263100]  [<00000001d1482e82>] zpci_deconfigure_device+0x3a/0xe0 
[  375.263104]  [<00000001d1c91d04>] power_write_file+0x7c/0x130 
[  375.263108]  [<00000001d1827c90>] kernfs_fop_write_iter+0x138/0x210 
[  375.263114]  [<00000001d1763344>] vfs_write+0x194/0x2e0 
[  375.263119]  [<00000001d17636fa>] ksys_write+0x6a/0xf8 
[  375.263121]  [<00000001d202518c>] __do_syscall+0x1d4/0x200 
[  375.263127]  [<00000001d2033e42>] system_call+0x82/0xb0 
[  375.263132] Last Breaking-Event-Address:
[  375.263132]  [<00000001d202f394>] mutex_lock+0x1c/0x28

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 13:46                 ` Matthew Rosato
@ 2022-10-05 13:57                   ` Jason Gunthorpe
  2022-10-05 14:00                     ` Christian Borntraeger
                                       ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2022-10-05 13:57 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: Christian Borntraeger, Alex Williamson, Tony Krowiak,
	Jason J . Herne, Marc Hartmayer, Eric Farman, Cornelia Huck, kvm,
	Qian Cai, Joerg Roedel, Marek Szyprowski, linux-s390

On Wed, Oct 05, 2022 at 09:46:45AM -0400, Matthew Rosato wrote:

 
> (again, with the follow-up applied) Besides the panic above I just
> noticed there is also this warning that immediately precedes and is
> perhaps more useful.  Re: what triggers the WARN, both group->owner
> and group->owner_cnt are already 0

And this is after the 2nd try that fixes the locking?

This shows that vfio_group_detach_container() is called twice (which
was my guess), hoever this looks to be impossible as both calls are
protected by 'if (group->container)' and the function NULL's
group->container and it is all under the proper lock.

My guess was that missing locking caused the two cases to race and
trigger WARN, but the locking should fix that.

So I'm at a loss, can you investigate a bit?

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 13:57                   ` Jason Gunthorpe
@ 2022-10-05 14:00                     ` Christian Borntraeger
  2022-10-05 14:01                     ` Jason Gunthorpe
  2022-10-05 14:01                     ` Matthew Rosato
  2 siblings, 0 replies; 19+ messages in thread
From: Christian Borntraeger @ 2022-10-05 14:00 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Rosato
  Cc: Alex Williamson, Tony Krowiak, Jason J . Herne, Marc Hartmayer,
	Eric Farman, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski, linux-s390



Am 05.10.22 um 15:57 schrieb Jason Gunthorpe:
> On Wed, Oct 05, 2022 at 09:46:45AM -0400, Matthew Rosato wrote:
> 
>   
>> (again, with the follow-up applied) Besides the panic above I just
>> noticed there is also this warning that immediately precedes and is
>> perhaps more useful.  Re: what triggers the WARN, both group->owner
>> and group->owner_cnt are already 0
> 
> And this is after the 2nd try that fixes the locking?
> 
> This shows that vfio_group_detach_container() is called twice (which
> was my guess), hoever this looks to be impossible as both calls are
> protected by 'if (group->container)' and the function NULL's
> group->container and it is all under the proper lock.
> 
> My guess was that missing locking caused the two cases to race and
> trigger WARN, but the locking should fix that.
> 
> So I'm at a loss, can you investigate a bit?

So where is your 2nd version (and what was the first). I only saw one fix.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 13:57                   ` Jason Gunthorpe
  2022-10-05 14:00                     ` Christian Borntraeger
@ 2022-10-05 14:01                     ` Jason Gunthorpe
  2022-10-05 14:19                       ` Christian Borntraeger
  2022-10-05 14:21                       ` Matthew Rosato
  2022-10-05 14:01                     ` Matthew Rosato
  2 siblings, 2 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2022-10-05 14:01 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: Christian Borntraeger, Alex Williamson, Tony Krowiak,
	Jason J . Herne, Marc Hartmayer, Eric Farman, Cornelia Huck, kvm,
	Qian Cai, Joerg Roedel, Marek Szyprowski, linux-s390

On Wed, Oct 05, 2022 at 10:57:28AM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 05, 2022 at 09:46:45AM -0400, Matthew Rosato wrote:
> 
>  
> > (again, with the follow-up applied) Besides the panic above I just
> > noticed there is also this warning that immediately precedes and is
> > perhaps more useful.  Re: what triggers the WARN, both group->owner
> > and group->owner_cnt are already 0
> 
> And this is after the 2nd try that fixes the locking?
> 
> This shows that vfio_group_detach_container() is called twice (which
> was my guess), hoever this looks to be impossible as both calls are
> protected by 'if (group->container)' and the function NULL's
> group->container and it is all under the proper lock.
> 
> My guess was that missing locking caused the two cases to race and
> trigger WARN, but the locking should fix that.
> 
> So I'm at a loss, can you investigate a bit?

Huh, perhaps I'm loosing my mind, but I'm sure I sent this out, but it
is not in the archive. This v2 fixes the missing locking and the rest
of the remarks.

commit f8b993620af72fa5f15bd4c1515868013c1c173d
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Tue Oct 4 13:14:37 2022 -0300

    vfio: Make the group FD disassociate from the iommu_group
    
    Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
    the pointer is NULL the vfio_group users promise not to touch the
    iommu_group. This allows a driver to be hot unplugged while userspace is
    keeping the group FD open.
    
    SPAPR mode is excluded from this behavior because of how it wrongly hacks
    part of its iommu interface through KVM. Due to this we loose control over
    what it is doing and cannot revoke the iommu_group usage in the IOMMU
    layer via vfio_group_detach_container().
    
    Thus, for SPAPR the group FDs must still be closed before a device can be
    hot unplugged.
    
    This fixes a userspace regression where we learned that virtnodedevd
    leaves a group FD open even though the /dev/ node for it has been deleted
    and all the drivers for it unplugged.
    
    Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
    Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 59a28251bb0b97..badc9d828cac20 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1313,7 +1313,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 		}
 
 		/* Ensure the FD is a vfio group FD.*/
-		if (!vfio_file_iommu_group(file)) {
+		if (!vfio_file_is_group(file)) {
 			fput(file);
 			ret = -EINVAL;
 			break;
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 4d2de02f2ced6e..4e10a281420e66 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -59,6 +59,7 @@ struct vfio_group {
 	struct mutex			group_lock;
 	struct kvm			*kvm;
 	struct file			*opened_file;
+	bool				preserve_iommu_group;
 	struct swait_queue_head		opened_file_wait;
 	struct blocking_notifier_head	notifier;
 };
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 9b1e5fd5f7b73c..13d22bd84afc47 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -133,6 +133,10 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
 {
 	struct vfio_group *group;
 
+	/*
+	 * group->iommu_group from the vfio.group_list cannot be NULL
+	 * under the vfio.group_lock.
+	 */
 	list_for_each_entry(group, &vfio.group_list, vfio_next) {
 		if (group->iommu_group == iommu_group) {
 			refcount_inc(&group->drivers);
@@ -159,7 +163,7 @@ static void vfio_group_release(struct device *dev)
 
 	mutex_destroy(&group->device_lock);
 	mutex_destroy(&group->group_lock);
-	iommu_group_put(group->iommu_group);
+	WARN_ON(group->iommu_group);
 	ida_free(&vfio.group_ida, MINOR(group->dev.devt));
 	kfree(group);
 }
@@ -248,6 +252,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
 static void vfio_device_remove_group(struct vfio_device *device)
 {
 	struct vfio_group *group = device->group;
+	struct iommu_group *iommu_group;
 
 	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
 		iommu_group_remove_device(device->dev);
@@ -265,13 +270,36 @@ static void vfio_device_remove_group(struct vfio_device *device)
 	 */
 	cdev_device_del(&group->cdev, &group->dev);
 
+	mutex_lock(&group->group_lock);
+	/*
+	 * These data structures all have paired operations that can only be
+	 * undone when the caller holds a live reference on the device. Since
+	 * all pairs must be undone these WARN_ON's indicate some caller did not
+	 * properly hold the group reference.l.
+	 */
+	WARN_ON(!list_empty(&group->device_list));
+	WARN_ON(group->notifier.head);
+
+	/*
+	 * Revoke all users of group->iommu_group. At this point we know there
+	 * are no devices active because we are unplugging the last one. Setting
+	 * iommu_group to NULL blocks all new users.
+	 */
+	if (group->container)
+		vfio_group_detach_container(group);
+	iommu_group = group->iommu_group;
+	group->iommu_group = NULL;
+	mutex_unlock(&group->group_lock);
+
 	/*
-	 * Before we allow the last driver in the group to be unplugged the
-	 * group must be sanitized so nothing else is or can reference it. This
-	 * is because the group->iommu_group pointer should only be used so long
-	 * as a device driver is attached to a device in the group.
+	 * Normally we can set the iommu_group to NULL above and that will
+	 * prevent any users from touching it. However, the SPAPR kvm path takes
+	 * a reference to the iommu_group and keeps using it in arch code out
+	 * side our control. So if this path is triggred we have no choice but
+	 * to wait for the group FD to be closed to be sure everyone has stopped
+	 * touching the group.
 	 */
-	while (group->opened_file) {
+	while (group->preserve_iommu_group && group->opened_file) {
 		mutex_unlock(&vfio.group_lock);
 		swait_event_idle_exclusive(group->opened_file_wait,
 					   !group->opened_file);
@@ -279,17 +307,7 @@ static void vfio_device_remove_group(struct vfio_device *device)
 	}
 	mutex_unlock(&vfio.group_lock);
 
-	/*
-	 * These data structures all have paired operations that can only be
-	 * undone when the caller holds a live reference on the group. Since all
-	 * pairs must be undone these WARN_ON's indicate some caller did not
-	 * properly hold the group reference.
-	 */
-	WARN_ON(!list_empty(&group->device_list));
-	WARN_ON(group->container || group->container_users);
-	WARN_ON(group->notifier.head);
-	group->iommu_group = NULL;
-
+	iommu_group_put(iommu_group);
 	put_device(&group->dev);
 }
 
@@ -531,6 +549,10 @@ static int __vfio_register_dev(struct vfio_device *device,
 
 	existing_device = vfio_group_get_device(group, device->dev);
 	if (existing_device) {
+		/*
+		 * group->iommu_group is non-NULL because we hold the drivers
+		 * refcount.
+		 */
 		dev_WARN(device->dev, "Device already exists on group %d\n",
 			 iommu_group_id(group->iommu_group));
 		vfio_device_put_registration(existing_device);
@@ -702,6 +724,11 @@ static int vfio_group_ioctl_set_container(struct vfio_group *group,
 		ret = -EINVAL;
 		goto out_unlock;
 	}
+	if (!group->iommu_group) {
+		ret = -ENODEV;
+		goto out_unlock;
+	}
+
 	container = vfio_container_from_file(f.file);
 	ret = -EINVAL;
 	if (container) {
@@ -862,6 +889,11 @@ static int vfio_group_ioctl_get_status(struct vfio_group *group,
 	status.flags = 0;
 
 	mutex_lock(&group->group_lock);
+	if (!group->iommu_group) {
+		mutex_unlock(&group->group_lock);
+		return -ENODEV;
+	}
+
 	if (group->container)
 		status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET |
 				VFIO_GROUP_FLAGS_VIABLE;
@@ -938,13 +970,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
 	filep->private_data = NULL;
 
 	mutex_lock(&group->group_lock);
-	/*
-	 * Device FDs hold a group file reference, therefore the group release
-	 * is only called when there are no open devices.
-	 */
-	WARN_ON(group->notifier.head);
-	if (group->container)
-		vfio_group_detach_container(group);
 	group->opened_file = NULL;
 	mutex_unlock(&group->group_lock);
 	swake_up_one(&group->opened_file_wait);
@@ -1553,17 +1578,41 @@ static const struct file_operations vfio_device_fops = {
  * @file: VFIO group file
  *
  * The returned iommu_group is valid as long as a ref is held on the file.
+ * This function is deprecated, only the SPAPR path in kvm should call it.
  */
 struct iommu_group *vfio_file_iommu_group(struct file *file)
 {
 	struct vfio_group *group = file->private_data;
+	struct iommu_group *iommu_group = NULL;
+
+	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
+		return NULL;
 
 	if (file->f_op != &vfio_group_fops)
 		return NULL;
-	return group->iommu_group;
+
+	mutex_lock(&vfio.group_lock);
+	mutex_lock(&group->group_lock);
+	if (group->iommu_group) {
+		iommu_group = group->iommu_group;
+		group->preserve_iommu_group = true;
+	}
+	mutex_unlock(&group->group_lock);
+	mutex_unlock(&vfio.group_lock);
+	return iommu_group;
 }
 EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
 
+/**
+ * vfio_file_is_group - True if the file is usable with VFIO aPIS
+ * @file: VFIO group file
+ */
+bool vfio_file_is_group(struct file *file)
+{
+	return file->f_op == &vfio_group_fops;
+}
+EXPORT_SYMBOL_GPL(vfio_file_is_group);
+
 /**
  * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
  *        is always CPU cache coherent
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 73bcb92179a224..bd9faaab85de18 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -199,6 +199,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
  * External user API
  */
 struct iommu_group *vfio_file_iommu_group(struct file *file);
+bool vfio_file_is_group(struct file *file);
 bool vfio_file_enforced_coherent(struct file *file);
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
 bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index ce1b01d02c5197..54aec3b0559c70 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -61,6 +61,23 @@ static bool kvm_vfio_file_enforced_coherent(struct file *file)
 	return ret;
 }
 
+static bool kvm_vfio_file_is_group(struct file *file)
+{
+	bool (*fn)(struct file *file);
+	bool ret;
+
+	fn = symbol_get(vfio_file_is_group);
+	if (!fn)
+		return false;
+
+	ret = fn(file);
+
+	symbol_put(vfio_file_is_group);
+
+	return ret;
+}
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
 static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
 {
 	struct iommu_group *(*fn)(struct file *file);
@@ -77,7 +94,6 @@ static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
 	return ret;
 }
 
-#ifdef CONFIG_SPAPR_TCE_IOMMU
 static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
 					     struct kvm_vfio_group *kvg)
 {
@@ -136,7 +152,7 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
 		return -EBADF;
 
 	/* Ensure the FD is a vfio group FD.*/
-	if (!kvm_vfio_file_iommu_group(filp)) {
+	if (!kvm_vfio_file_is_group(filp)) {
 		ret = -EINVAL;
 		goto err_fput;
 	}

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 13:57                   ` Jason Gunthorpe
  2022-10-05 14:00                     ` Christian Borntraeger
  2022-10-05 14:01                     ` Jason Gunthorpe
@ 2022-10-05 14:01                     ` Matthew Rosato
  2 siblings, 0 replies; 19+ messages in thread
From: Matthew Rosato @ 2022-10-05 14:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian Borntraeger, Alex Williamson, Tony Krowiak,
	Jason J . Herne, Marc Hartmayer, Eric Farman, Cornelia Huck, kvm,
	Qian Cai, Joerg Roedel, Marek Szyprowski, linux-s390

On 10/5/22 9:57 AM, Jason Gunthorpe wrote:
> On Wed, Oct 05, 2022 at 09:46:45AM -0400, Matthew Rosato wrote:
> 
>  
>> (again, with the follow-up applied) Besides the panic above I just
>> noticed there is also this warning that immediately precedes and is
>> perhaps more useful.  Re: what triggers the WARN, both group->owner
>> and group->owner_cnt are already 0
> 
> And this is after the 2nd try that fixes the locking?

Maybe I missed that email?  I've only seen one version, happy to try another.  Please resend?

> 
> This shows that vfio_group_detach_container() is called twice (which
> was my guess), hoever this looks to be impossible as both calls are
> protected by 'if (group->container)' and the function NULL's
> group->container and it is all under the proper lock.
> 
> My guess was that missing locking caused the two cases to race and
> trigger WARN, but the locking should fix that.
> 
> So I'm at a loss, can you investigate a bit?
> 
> Jason


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 14:01                     ` Jason Gunthorpe
@ 2022-10-05 14:19                       ` Christian Borntraeger
  2022-10-06 11:55                         ` Christian Borntraeger
  2022-10-05 14:21                       ` Matthew Rosato
  1 sibling, 1 reply; 19+ messages in thread
From: Christian Borntraeger @ 2022-10-05 14:19 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Rosato
  Cc: Alex Williamson, Tony Krowiak, Jason J . Herne, Marc Hartmayer,
	Eric Farman, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski, linux-s390



Am 05.10.22 um 16:01 schrieb Jason Gunthorpe:
[..]
> commit f8b993620af72fa5f15bd4c1515868013c1c173d
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Date:   Tue Oct 4 13:14:37 2022 -0300
> 
>      vfio: Make the group FD disassociate from the iommu_group
>      
>      Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>      the pointer is NULL the vfio_group users promise not to touch the
>      iommu_group. This allows a driver to be hot unplugged while userspace is
>      keeping the group FD open.
>      
>      SPAPR mode is excluded from this behavior because of how it wrongly hacks
>      part of its iommu interface through KVM. Due to this we loose control over
>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>      layer via vfio_group_detach_container().
>      
>      Thus, for SPAPR the group FDs must still be closed before a device can be
>      hot unplugged.
>      
>      This fixes a userspace regression where we learned that virtnodedevd
>      leaves a group FD open even though the /dev/ node for it has been deleted
>      and all the drivers for it unplugged.
>      
>      Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Looks better now (I also did a quick check with vfio-pci on s390)

Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>

> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 59a28251bb0b97..badc9d828cac20 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1313,7 +1313,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>   		}
>   
>   		/* Ensure the FD is a vfio group FD.*/
> -		if (!vfio_file_iommu_group(file)) {
> +		if (!vfio_file_is_group(file)) {
>   			fput(file);
>   			ret = -EINVAL;
>   			break;
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 4d2de02f2ced6e..4e10a281420e66 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -59,6 +59,7 @@ struct vfio_group {
>   	struct mutex			group_lock;
>   	struct kvm			*kvm;
>   	struct file			*opened_file;
> +	bool				preserve_iommu_group;
>   	struct swait_queue_head		opened_file_wait;
>   	struct blocking_notifier_head	notifier;
>   };
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 9b1e5fd5f7b73c..13d22bd84afc47 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -133,6 +133,10 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
>   {
>   	struct vfio_group *group;
>   
> +	/*
> +	 * group->iommu_group from the vfio.group_list cannot be NULL
> +	 * under the vfio.group_lock.
> +	 */
>   	list_for_each_entry(group, &vfio.group_list, vfio_next) {
>   		if (group->iommu_group == iommu_group) {
>   			refcount_inc(&group->drivers);
> @@ -159,7 +163,7 @@ static void vfio_group_release(struct device *dev)
>   
>   	mutex_destroy(&group->device_lock);
>   	mutex_destroy(&group->group_lock);
> -	iommu_group_put(group->iommu_group);
> +	WARN_ON(group->iommu_group);
>   	ida_free(&vfio.group_ida, MINOR(group->dev.devt));
>   	kfree(group);
>   }
> @@ -248,6 +252,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
>   static void vfio_device_remove_group(struct vfio_device *device)
>   {
>   	struct vfio_group *group = device->group;
> +	struct iommu_group *iommu_group;
>   
>   	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
>   		iommu_group_remove_device(device->dev);
> @@ -265,13 +270,36 @@ static void vfio_device_remove_group(struct vfio_device *device)
>   	 */
>   	cdev_device_del(&group->cdev, &group->dev);
>   
> +	mutex_lock(&group->group_lock);
> +	/*
> +	 * These data structures all have paired operations that can only be
> +	 * undone when the caller holds a live reference on the device. Since
> +	 * all pairs must be undone these WARN_ON's indicate some caller did not
> +	 * properly hold the group reference.l.
> +	 */
> +	WARN_ON(!list_empty(&group->device_list));
> +	WARN_ON(group->notifier.head);
> +
> +	/*
> +	 * Revoke all users of group->iommu_group. At this point we know there
> +	 * are no devices active because we are unplugging the last one. Setting
> +	 * iommu_group to NULL blocks all new users.
> +	 */
> +	if (group->container)
> +		vfio_group_detach_container(group);
> +	iommu_group = group->iommu_group;
> +	group->iommu_group = NULL;
> +	mutex_unlock(&group->group_lock);
> +
>   	/*
> -	 * Before we allow the last driver in the group to be unplugged the
> -	 * group must be sanitized so nothing else is or can reference it. This
> -	 * is because the group->iommu_group pointer should only be used so long
> -	 * as a device driver is attached to a device in the group.
> +	 * Normally we can set the iommu_group to NULL above and that will
> +	 * prevent any users from touching it. However, the SPAPR kvm path takes
> +	 * a reference to the iommu_group and keeps using it in arch code out
> +	 * side our control. So if this path is triggred we have no choice but
> +	 * to wait for the group FD to be closed to be sure everyone has stopped
> +	 * touching the group.
>   	 */
> -	while (group->opened_file) {
> +	while (group->preserve_iommu_group && group->opened_file) {
>   		mutex_unlock(&vfio.group_lock);
>   		swait_event_idle_exclusive(group->opened_file_wait,
>   					   !group->opened_file);
> @@ -279,17 +307,7 @@ static void vfio_device_remove_group(struct vfio_device *device)
>   	}
>   	mutex_unlock(&vfio.group_lock);
>   
> -	/*
> -	 * These data structures all have paired operations that can only be
> -	 * undone when the caller holds a live reference on the group. Since all
> -	 * pairs must be undone these WARN_ON's indicate some caller did not
> -	 * properly hold the group reference.
> -	 */
> -	WARN_ON(!list_empty(&group->device_list));
> -	WARN_ON(group->container || group->container_users);
> -	WARN_ON(group->notifier.head);
> -	group->iommu_group = NULL;
> -
> +	iommu_group_put(iommu_group);
>   	put_device(&group->dev);
>   }
>   
> @@ -531,6 +549,10 @@ static int __vfio_register_dev(struct vfio_device *device,
>   
>   	existing_device = vfio_group_get_device(group, device->dev);
>   	if (existing_device) {
> +		/*
> +		 * group->iommu_group is non-NULL because we hold the drivers
> +		 * refcount.
> +		 */
>   		dev_WARN(device->dev, "Device already exists on group %d\n",
>   			 iommu_group_id(group->iommu_group));
>   		vfio_device_put_registration(existing_device);
> @@ -702,6 +724,11 @@ static int vfio_group_ioctl_set_container(struct vfio_group *group,
>   		ret = -EINVAL;
>   		goto out_unlock;
>   	}
> +	if (!group->iommu_group) {
> +		ret = -ENODEV;
> +		goto out_unlock;
> +	}
> +
>   	container = vfio_container_from_file(f.file);
>   	ret = -EINVAL;
>   	if (container) {
> @@ -862,6 +889,11 @@ static int vfio_group_ioctl_get_status(struct vfio_group *group,
>   	status.flags = 0;
>   
>   	mutex_lock(&group->group_lock);
> +	if (!group->iommu_group) {
> +		mutex_unlock(&group->group_lock);
> +		return -ENODEV;
> +	}
> +
>   	if (group->container)
>   		status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET |
>   				VFIO_GROUP_FLAGS_VIABLE;
> @@ -938,13 +970,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>   	filep->private_data = NULL;
>   
>   	mutex_lock(&group->group_lock);
> -	/*
> -	 * Device FDs hold a group file reference, therefore the group release
> -	 * is only called when there are no open devices.
> -	 */
> -	WARN_ON(group->notifier.head);
> -	if (group->container)
> -		vfio_group_detach_container(group);
>   	group->opened_file = NULL;
>   	mutex_unlock(&group->group_lock);
>   	swake_up_one(&group->opened_file_wait);
> @@ -1553,17 +1578,41 @@ static const struct file_operations vfio_device_fops = {
>    * @file: VFIO group file
>    *
>    * The returned iommu_group is valid as long as a ref is held on the file.
> + * This function is deprecated, only the SPAPR path in kvm should call it.
>    */
>   struct iommu_group *vfio_file_iommu_group(struct file *file)
>   {
>   	struct vfio_group *group = file->private_data;
> +	struct iommu_group *iommu_group = NULL;
> +
> +	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
> +		return NULL;
>   
>   	if (file->f_op != &vfio_group_fops)
>   		return NULL;
> -	return group->iommu_group;
> +
> +	mutex_lock(&vfio.group_lock);
> +	mutex_lock(&group->group_lock);
> +	if (group->iommu_group) {
> +		iommu_group = group->iommu_group;
> +		group->preserve_iommu_group = true;
> +	}
> +	mutex_unlock(&group->group_lock);
> +	mutex_unlock(&vfio.group_lock);
> +	return iommu_group;
>   }
>   EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
>   
> +/**
> + * vfio_file_is_group - True if the file is usable with VFIO aPIS
> + * @file: VFIO group file
> + */
> +bool vfio_file_is_group(struct file *file)
> +{
> +	return file->f_op == &vfio_group_fops;
> +}
> +EXPORT_SYMBOL_GPL(vfio_file_is_group);
> +
>   /**
>    * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
>    *        is always CPU cache coherent
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 73bcb92179a224..bd9faaab85de18 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -199,6 +199,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>    * External user API
>    */
>   struct iommu_group *vfio_file_iommu_group(struct file *file);
> +bool vfio_file_is_group(struct file *file);
>   bool vfio_file_enforced_coherent(struct file *file);
>   void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
>   bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index ce1b01d02c5197..54aec3b0559c70 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -61,6 +61,23 @@ static bool kvm_vfio_file_enforced_coherent(struct file *file)
>   	return ret;
>   }
>   
> +static bool kvm_vfio_file_is_group(struct file *file)
> +{
> +	bool (*fn)(struct file *file);
> +	bool ret;
> +
> +	fn = symbol_get(vfio_file_is_group);
> +	if (!fn)
> +		return false;
> +
> +	ret = fn(file);
> +
> +	symbol_put(vfio_file_is_group);
> +
> +	return ret;
> +}
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>   static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>   {
>   	struct iommu_group *(*fn)(struct file *file);
> @@ -77,7 +94,6 @@ static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>   	return ret;
>   }
>   
> -#ifdef CONFIG_SPAPR_TCE_IOMMU
>   static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
>   					     struct kvm_vfio_group *kvg)
>   {
> @@ -136,7 +152,7 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
>   		return -EBADF;
>   
>   	/* Ensure the FD is a vfio group FD.*/
> -	if (!kvm_vfio_file_iommu_group(filp)) {
> +	if (!kvm_vfio_file_is_group(filp)) {
>   		ret = -EINVAL;
>   		goto err_fput;
>   	}

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 14:01                     ` Jason Gunthorpe
  2022-10-05 14:19                       ` Christian Borntraeger
@ 2022-10-05 14:21                       ` Matthew Rosato
  2022-10-05 15:40                         ` Matthew Rosato
  1 sibling, 1 reply; 19+ messages in thread
From: Matthew Rosato @ 2022-10-05 14:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian Borntraeger, Alex Williamson, Tony Krowiak,
	Jason J . Herne, Marc Hartmayer, Eric Farman, Cornelia Huck, kvm,
	Qian Cai, Joerg Roedel, Marek Szyprowski, linux-s390

On 10/5/22 10:01 AM, Jason Gunthorpe wrote:
> On Wed, Oct 05, 2022 at 10:57:28AM -0300, Jason Gunthorpe wrote:
>> On Wed, Oct 05, 2022 at 09:46:45AM -0400, Matthew Rosato wrote:
>>
>>  
>>> (again, with the follow-up applied) Besides the panic above I just
>>> noticed there is also this warning that immediately precedes and is
>>> perhaps more useful.  Re: what triggers the WARN, both group->owner
>>> and group->owner_cnt are already 0
>>
>> And this is after the 2nd try that fixes the locking?
>>
>> This shows that vfio_group_detach_container() is called twice (which
>> was my guess), hoever this looks to be impossible as both calls are
>> protected by 'if (group->container)' and the function NULL's
>> group->container and it is all under the proper lock.
>>
>> My guess was that missing locking caused the two cases to race and
>> trigger WARN, but the locking should fix that.
>>
>> So I'm at a loss, can you investigate a bit?
> 
> Huh, perhaps I'm loosing my mind, but I'm sure I sent this out, but it
> is not in the archive. This v2 fixes the missing locking and the rest
> of the remarks.

Ah, here we go.  OK, initial testing with vfio-pci on this version and I note that

1) the warning/crash is gone
2) the iommu group ID no longer increments

I next will take it through the longer series of tests that would crash before 'vfio: Follow a strict lifetime for struct iommu_group' but this looks good so far.

> 
> commit f8b993620af72fa5f15bd4c1515868013c1c173d
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Date:   Tue Oct 4 13:14:37 2022 -0300
> 
>     vfio: Make the group FD disassociate from the iommu_group
>     
>     Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>     the pointer is NULL the vfio_group users promise not to touch the
>     iommu_group. This allows a driver to be hot unplugged while userspace is
>     keeping the group FD open.
>     
>     SPAPR mode is excluded from this behavior because of how it wrongly hacks
>     part of its iommu interface through KVM. Due to this we loose control over
>     what it is doing and cannot revoke the iommu_group usage in the IOMMU
>     layer via vfio_group_detach_container().
>     
>     Thus, for SPAPR the group FDs must still be closed before a device can be
>     hot unplugged.
>     
>     This fixes a userspace regression where we learned that virtnodedevd
>     leaves a group FD open even though the /dev/ node for it has been deleted
>     and all the drivers for it unplugged.
>     
>     Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>     Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>     Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 59a28251bb0b97..badc9d828cac20 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1313,7 +1313,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  		}
>  
>  		/* Ensure the FD is a vfio group FD.*/
> -		if (!vfio_file_iommu_group(file)) {
> +		if (!vfio_file_is_group(file)) {
>  			fput(file);
>  			ret = -EINVAL;
>  			break;
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 4d2de02f2ced6e..4e10a281420e66 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -59,6 +59,7 @@ struct vfio_group {
>  	struct mutex			group_lock;
>  	struct kvm			*kvm;
>  	struct file			*opened_file;
> +	bool				preserve_iommu_group;
>  	struct swait_queue_head		opened_file_wait;
>  	struct blocking_notifier_head	notifier;
>  };
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 9b1e5fd5f7b73c..13d22bd84afc47 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -133,6 +133,10 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
>  {
>  	struct vfio_group *group;
>  
> +	/*
> +	 * group->iommu_group from the vfio.group_list cannot be NULL
> +	 * under the vfio.group_lock.
> +	 */
>  	list_for_each_entry(group, &vfio.group_list, vfio_next) {
>  		if (group->iommu_group == iommu_group) {
>  			refcount_inc(&group->drivers);
> @@ -159,7 +163,7 @@ static void vfio_group_release(struct device *dev)
>  
>  	mutex_destroy(&group->device_lock);
>  	mutex_destroy(&group->group_lock);
> -	iommu_group_put(group->iommu_group);
> +	WARN_ON(group->iommu_group);
>  	ida_free(&vfio.group_ida, MINOR(group->dev.devt));
>  	kfree(group);
>  }
> @@ -248,6 +252,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
>  static void vfio_device_remove_group(struct vfio_device *device)
>  {
>  	struct vfio_group *group = device->group;
> +	struct iommu_group *iommu_group;
>  
>  	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
>  		iommu_group_remove_device(device->dev);
> @@ -265,13 +270,36 @@ static void vfio_device_remove_group(struct vfio_device *device)
>  	 */
>  	cdev_device_del(&group->cdev, &group->dev);
>  
> +	mutex_lock(&group->group_lock);
> +	/*
> +	 * These data structures all have paired operations that can only be
> +	 * undone when the caller holds a live reference on the device. Since
> +	 * all pairs must be undone these WARN_ON's indicate some caller did not
> +	 * properly hold the group reference.l.
> +	 */
> +	WARN_ON(!list_empty(&group->device_list));
> +	WARN_ON(group->notifier.head);
> +
> +	/*
> +	 * Revoke all users of group->iommu_group. At this point we know there
> +	 * are no devices active because we are unplugging the last one. Setting
> +	 * iommu_group to NULL blocks all new users.
> +	 */
> +	if (group->container)
> +		vfio_group_detach_container(group);
> +	iommu_group = group->iommu_group;
> +	group->iommu_group = NULL;
> +	mutex_unlock(&group->group_lock);
> +
>  	/*
> -	 * Before we allow the last driver in the group to be unplugged the
> -	 * group must be sanitized so nothing else is or can reference it. This
> -	 * is because the group->iommu_group pointer should only be used so long
> -	 * as a device driver is attached to a device in the group.
> +	 * Normally we can set the iommu_group to NULL above and that will
> +	 * prevent any users from touching it. However, the SPAPR kvm path takes
> +	 * a reference to the iommu_group and keeps using it in arch code out
> +	 * side our control. So if this path is triggred we have no choice but
> +	 * to wait for the group FD to be closed to be sure everyone has stopped
> +	 * touching the group.
>  	 */
> -	while (group->opened_file) {
> +	while (group->preserve_iommu_group && group->opened_file) {
>  		mutex_unlock(&vfio.group_lock);
>  		swait_event_idle_exclusive(group->opened_file_wait,
>  					   !group->opened_file);
> @@ -279,17 +307,7 @@ static void vfio_device_remove_group(struct vfio_device *device)
>  	}
>  	mutex_unlock(&vfio.group_lock);
>  
> -	/*
> -	 * These data structures all have paired operations that can only be
> -	 * undone when the caller holds a live reference on the group. Since all
> -	 * pairs must be undone these WARN_ON's indicate some caller did not
> -	 * properly hold the group reference.
> -	 */
> -	WARN_ON(!list_empty(&group->device_list));
> -	WARN_ON(group->container || group->container_users);
> -	WARN_ON(group->notifier.head);
> -	group->iommu_group = NULL;
> -
> +	iommu_group_put(iommu_group);
>  	put_device(&group->dev);
>  }
>  
> @@ -531,6 +549,10 @@ static int __vfio_register_dev(struct vfio_device *device,
>  
>  	existing_device = vfio_group_get_device(group, device->dev);
>  	if (existing_device) {
> +		/*
> +		 * group->iommu_group is non-NULL because we hold the drivers
> +		 * refcount.
> +		 */
>  		dev_WARN(device->dev, "Device already exists on group %d\n",
>  			 iommu_group_id(group->iommu_group));
>  		vfio_device_put_registration(existing_device);
> @@ -702,6 +724,11 @@ static int vfio_group_ioctl_set_container(struct vfio_group *group,
>  		ret = -EINVAL;
>  		goto out_unlock;
>  	}
> +	if (!group->iommu_group) {
> +		ret = -ENODEV;
> +		goto out_unlock;
> +	}
> +
>  	container = vfio_container_from_file(f.file);
>  	ret = -EINVAL;
>  	if (container) {
> @@ -862,6 +889,11 @@ static int vfio_group_ioctl_get_status(struct vfio_group *group,
>  	status.flags = 0;
>  
>  	mutex_lock(&group->group_lock);
> +	if (!group->iommu_group) {
> +		mutex_unlock(&group->group_lock);
> +		return -ENODEV;
> +	}
> +
>  	if (group->container)
>  		status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET |
>  				VFIO_GROUP_FLAGS_VIABLE;
> @@ -938,13 +970,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>  	filep->private_data = NULL;
>  
>  	mutex_lock(&group->group_lock);
> -	/*
> -	 * Device FDs hold a group file reference, therefore the group release
> -	 * is only called when there are no open devices.
> -	 */
> -	WARN_ON(group->notifier.head);
> -	if (group->container)
> -		vfio_group_detach_container(group);
>  	group->opened_file = NULL;
>  	mutex_unlock(&group->group_lock);
>  	swake_up_one(&group->opened_file_wait);
> @@ -1553,17 +1578,41 @@ static const struct file_operations vfio_device_fops = {
>   * @file: VFIO group file
>   *
>   * The returned iommu_group is valid as long as a ref is held on the file.
> + * This function is deprecated, only the SPAPR path in kvm should call it.
>   */
>  struct iommu_group *vfio_file_iommu_group(struct file *file)
>  {
>  	struct vfio_group *group = file->private_data;
> +	struct iommu_group *iommu_group = NULL;
> +
> +	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
> +		return NULL;
>  
>  	if (file->f_op != &vfio_group_fops)
>  		return NULL;
> -	return group->iommu_group;
> +
> +	mutex_lock(&vfio.group_lock);
> +	mutex_lock(&group->group_lock);
> +	if (group->iommu_group) {
> +		iommu_group = group->iommu_group;
> +		group->preserve_iommu_group = true;
> +	}
> +	mutex_unlock(&group->group_lock);
> +	mutex_unlock(&vfio.group_lock);
> +	return iommu_group;
>  }
>  EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
>  
> +/**
> + * vfio_file_is_group - True if the file is usable with VFIO aPIS
> + * @file: VFIO group file
> + */
> +bool vfio_file_is_group(struct file *file)
> +{
> +	return file->f_op == &vfio_group_fops;
> +}
> +EXPORT_SYMBOL_GPL(vfio_file_is_group);
> +
>  /**
>   * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
>   *        is always CPU cache coherent
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 73bcb92179a224..bd9faaab85de18 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -199,6 +199,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>   * External user API
>   */
>  struct iommu_group *vfio_file_iommu_group(struct file *file);
> +bool vfio_file_is_group(struct file *file);
>  bool vfio_file_enforced_coherent(struct file *file);
>  void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
>  bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index ce1b01d02c5197..54aec3b0559c70 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -61,6 +61,23 @@ static bool kvm_vfio_file_enforced_coherent(struct file *file)
>  	return ret;
>  }
>  
> +static bool kvm_vfio_file_is_group(struct file *file)
> +{
> +	bool (*fn)(struct file *file);
> +	bool ret;
> +
> +	fn = symbol_get(vfio_file_is_group);
> +	if (!fn)
> +		return false;
> +
> +	ret = fn(file);
> +
> +	symbol_put(vfio_file_is_group);
> +
> +	return ret;
> +}
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>  static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>  {
>  	struct iommu_group *(*fn)(struct file *file);
> @@ -77,7 +94,6 @@ static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>  	return ret;
>  }
>  
> -#ifdef CONFIG_SPAPR_TCE_IOMMU
>  static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
>  					     struct kvm_vfio_group *kvg)
>  {
> @@ -136,7 +152,7 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
>  		return -EBADF;
>  
>  	/* Ensure the FD is a vfio group FD.*/
> -	if (!kvm_vfio_file_iommu_group(filp)) {
> +	if (!kvm_vfio_file_is_group(filp)) {
>  		ret = -EINVAL;
>  		goto err_fput;
>  	}


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 14:21                       ` Matthew Rosato
@ 2022-10-05 15:40                         ` Matthew Rosato
  0 siblings, 0 replies; 19+ messages in thread
From: Matthew Rosato @ 2022-10-05 15:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian Borntraeger, Alex Williamson, Tony Krowiak,
	Jason J . Herne, Marc Hartmayer, Eric Farman, Cornelia Huck, kvm,
	Qian Cai, Joerg Roedel, Marek Szyprowski, linux-s390

On 10/5/22 10:21 AM, Matthew Rosato wrote:
> On 10/5/22 10:01 AM, Jason Gunthorpe wrote:
>> On Wed, Oct 05, 2022 at 10:57:28AM -0300, Jason Gunthorpe wrote:
>>> On Wed, Oct 05, 2022 at 09:46:45AM -0400, Matthew Rosato wrote:
>>>
>>>  
>>>> (again, with the follow-up applied) Besides the panic above I just
>>>> noticed there is also this warning that immediately precedes and is
>>>> perhaps more useful.  Re: what triggers the WARN, both group->owner
>>>> and group->owner_cnt are already 0
>>>
>>> And this is after the 2nd try that fixes the locking?
>>>
>>> This shows that vfio_group_detach_container() is called twice (which
>>> was my guess), hoever this looks to be impossible as both calls are
>>> protected by 'if (group->container)' and the function NULL's
>>> group->container and it is all under the proper lock.
>>>
>>> My guess was that missing locking caused the two cases to race and
>>> trigger WARN, but the locking should fix that.
>>>
>>> So I'm at a loss, can you investigate a bit?
>>
>> Huh, perhaps I'm loosing my mind, but I'm sure I sent this out, but it
>> is not in the archive. This v2 fixes the missing locking and the rest
>> of the remarks.
> 
> Ah, here we go.  OK, initial testing with vfio-pci on this version and I note that
> 
> 1) the warning/crash is gone
> 2) the iommu group ID no longer increments
> 
> I next will take it through the longer series of tests that would crash before 'vfio: Follow a strict lifetime for struct iommu_group' but this looks good so far.
> 

OK, this also looks good - thanks!  Besides the vfio-pci testing on s390 I also ran some brief tests against both vfio-ccw and vfio-ap.

Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 14:19                       ` Christian Borntraeger
@ 2022-10-06 11:55                         ` Christian Borntraeger
  0 siblings, 0 replies; 19+ messages in thread
From: Christian Borntraeger @ 2022-10-06 11:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tony Krowiak, Jason J . Herne, Marc Hartmayer, Eric Farman,
	Cornelia Huck, kvm, Qian Cai, Joerg Roedel, Marek Szyprowski,
	linux-s390, Matthew Rosato, Jason Gunthorpe

> Am 05.10.22 um 16:01 schrieb Jason Gunthorpe:
> [..]
>> commit f8b993620af72fa5f15bd4c1515868013c1c173d
>> Author: Jason Gunthorpe <jgg@ziepe.ca>
>> Date:   Tue Oct 4 13:14:37 2022 -0300
>>
>>      vfio: Make the group FD disassociate from the iommu_group
>>      Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>>      the pointer is NULL the vfio_group users promise not to touch the
>>      iommu_group. This allows a driver to be hot unplugged while userspace is
>>      keeping the group FD open.
>>      SPAPR mode is excluded from this behavior because of how it wrongly hacks
>>      part of its iommu interface through KVM. Due to this we loose control over
>>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>>      layer via vfio_group_detach_container().
>>      Thus, for SPAPR the group FDs must still be closed before a device can be
>>      hot unplugged.
>>      This fixes a userspace regression where we learned that virtnodedevd
>>      leaves a group FD open even though the /dev/ node for it has been deleted
>>      and all the drivers for it unplugged.
>>      Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Looks better now (I also did a quick check with vfio-pci on s390)
> 
> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>

Alex I think it would be good to schedule this for vfio-next as well. Do you want Jason
to send this patch as a proper patch outline of this mail thread?

> 
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 59a28251bb0b97..badc9d828cac20 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -1313,7 +1313,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>>           }
>>           /* Ensure the FD is a vfio group FD.*/
>> -        if (!vfio_file_iommu_group(file)) {
>> +        if (!vfio_file_is_group(file)) {
>>               fput(file);
>>               ret = -EINVAL;
>>               break;
>> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
>> index 4d2de02f2ced6e..4e10a281420e66 100644
>> --- a/drivers/vfio/vfio.h
>> +++ b/drivers/vfio/vfio.h
>> @@ -59,6 +59,7 @@ struct vfio_group {
>>       struct mutex            group_lock;
>>       struct kvm            *kvm;
>>       struct file            *opened_file;
>> +    bool                preserve_iommu_group;
>>       struct swait_queue_head        opened_file_wait;
>>       struct blocking_notifier_head    notifier;
>>   };
>> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
>> index 9b1e5fd5f7b73c..13d22bd84afc47 100644
>> --- a/drivers/vfio/vfio_main.c
>> +++ b/drivers/vfio/vfio_main.c
>> @@ -133,6 +133,10 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
>>   {
>>       struct vfio_group *group;
>> +    /*
>> +     * group->iommu_group from the vfio.group_list cannot be NULL
>> +     * under the vfio.group_lock.
>> +     */
>>       list_for_each_entry(group, &vfio.group_list, vfio_next) {
>>           if (group->iommu_group == iommu_group) {
>>               refcount_inc(&group->drivers);
>> @@ -159,7 +163,7 @@ static void vfio_group_release(struct device *dev)
>>       mutex_destroy(&group->device_lock);
>>       mutex_destroy(&group->group_lock);
>> -    iommu_group_put(group->iommu_group);
>> +    WARN_ON(group->iommu_group);
>>       ida_free(&vfio.group_ida, MINOR(group->dev.devt));
>>       kfree(group);
>>   }
>> @@ -248,6 +252,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
>>   static void vfio_device_remove_group(struct vfio_device *device)
>>   {
>>       struct vfio_group *group = device->group;
>> +    struct iommu_group *iommu_group;
>>       if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
>>           iommu_group_remove_device(device->dev);
>> @@ -265,13 +270,36 @@ static void vfio_device_remove_group(struct vfio_device *device)
>>        */
>>       cdev_device_del(&group->cdev, &group->dev);
>> +    mutex_lock(&group->group_lock);
>> +    /*
>> +     * These data structures all have paired operations that can only be
>> +     * undone when the caller holds a live reference on the device. Since
>> +     * all pairs must be undone these WARN_ON's indicate some caller did not
>> +     * properly hold the group reference.l.
>> +     */
>> +    WARN_ON(!list_empty(&group->device_list));
>> +    WARN_ON(group->notifier.head);
>> +
>> +    /*
>> +     * Revoke all users of group->iommu_group. At this point we know there
>> +     * are no devices active because we are unplugging the last one. Setting
>> +     * iommu_group to NULL blocks all new users.
>> +     */
>> +    if (group->container)
>> +        vfio_group_detach_container(group);
>> +    iommu_group = group->iommu_group;
>> +    group->iommu_group = NULL;
>> +    mutex_unlock(&group->group_lock);
>> +
>>       /*
>> -     * Before we allow the last driver in the group to be unplugged the
>> -     * group must be sanitized so nothing else is or can reference it. This
>> -     * is because the group->iommu_group pointer should only be used so long
>> -     * as a device driver is attached to a device in the group.
>> +     * Normally we can set the iommu_group to NULL above and that will
>> +     * prevent any users from touching it. However, the SPAPR kvm path takes
>> +     * a reference to the iommu_group and keeps using it in arch code out
>> +     * side our control. So if this path is triggred we have no choice but
>> +     * to wait for the group FD to be closed to be sure everyone has stopped
>> +     * touching the group.
>>        */
>> -    while (group->opened_file) {
>> +    while (group->preserve_iommu_group && group->opened_file) {
>>           mutex_unlock(&vfio.group_lock);
>>           swait_event_idle_exclusive(group->opened_file_wait,
>>                          !group->opened_file);
>> @@ -279,17 +307,7 @@ static void vfio_device_remove_group(struct vfio_device *device)
>>       }
>>       mutex_unlock(&vfio.group_lock);
>> -    /*
>> -     * These data structures all have paired operations that can only be
>> -     * undone when the caller holds a live reference on the group. Since all
>> -     * pairs must be undone these WARN_ON's indicate some caller did not
>> -     * properly hold the group reference.
>> -     */
>> -    WARN_ON(!list_empty(&group->device_list));
>> -    WARN_ON(group->container || group->container_users);
>> -    WARN_ON(group->notifier.head);
>> -    group->iommu_group = NULL;
>> -
>> +    iommu_group_put(iommu_group);
>>       put_device(&group->dev);
>>   }
>> @@ -531,6 +549,10 @@ static int __vfio_register_dev(struct vfio_device *device,
>>       existing_device = vfio_group_get_device(group, device->dev);
>>       if (existing_device) {
>> +        /*
>> +         * group->iommu_group is non-NULL because we hold the drivers
>> +         * refcount.
>> +         */
>>           dev_WARN(device->dev, "Device already exists on group %d\n",
>>                iommu_group_id(group->iommu_group));
>>           vfio_device_put_registration(existing_device);
>> @@ -702,6 +724,11 @@ static int vfio_group_ioctl_set_container(struct vfio_group *group,
>>           ret = -EINVAL;
>>           goto out_unlock;
>>       }
>> +    if (!group->iommu_group) {
>> +        ret = -ENODEV;
>> +        goto out_unlock;
>> +    }
>> +
>>       container = vfio_container_from_file(f.file);
>>       ret = -EINVAL;
>>       if (container) {
>> @@ -862,6 +889,11 @@ static int vfio_group_ioctl_get_status(struct vfio_group *group,
>>       status.flags = 0;
>>       mutex_lock(&group->group_lock);
>> +    if (!group->iommu_group) {
>> +        mutex_unlock(&group->group_lock);
>> +        return -ENODEV;
>> +    }
>> +
>>       if (group->container)
>>           status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET |
>>                   VFIO_GROUP_FLAGS_VIABLE;
>> @@ -938,13 +970,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>>       filep->private_data = NULL;
>>       mutex_lock(&group->group_lock);
>> -    /*
>> -     * Device FDs hold a group file reference, therefore the group release
>> -     * is only called when there are no open devices.
>> -     */
>> -    WARN_ON(group->notifier.head);
>> -    if (group->container)
>> -        vfio_group_detach_container(group);
>>       group->opened_file = NULL;
>>       mutex_unlock(&group->group_lock);
>>       swake_up_one(&group->opened_file_wait);
>> @@ -1553,17 +1578,41 @@ static const struct file_operations vfio_device_fops = {
>>    * @file: VFIO group file
>>    *
>>    * The returned iommu_group is valid as long as a ref is held on the file.
>> + * This function is deprecated, only the SPAPR path in kvm should call it.
>>    */
>>   struct iommu_group *vfio_file_iommu_group(struct file *file)
>>   {
>>       struct vfio_group *group = file->private_data;
>> +    struct iommu_group *iommu_group = NULL;
>> +
>> +    if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
>> +        return NULL;
>>       if (file->f_op != &vfio_group_fops)
>>           return NULL;
>> -    return group->iommu_group;
>> +
>> +    mutex_lock(&vfio.group_lock);
>> +    mutex_lock(&group->group_lock);
>> +    if (group->iommu_group) {
>> +        iommu_group = group->iommu_group;
>> +        group->preserve_iommu_group = true;
>> +    }
>> +    mutex_unlock(&group->group_lock);
>> +    mutex_unlock(&vfio.group_lock);
>> +    return iommu_group;
>>   }
>>   EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
>> +/**
>> + * vfio_file_is_group - True if the file is usable with VFIO aPIS
>> + * @file: VFIO group file
>> + */
>> +bool vfio_file_is_group(struct file *file)
>> +{
>> +    return file->f_op == &vfio_group_fops;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_file_is_group);
>> +
>>   /**
>>    * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
>>    *        is always CPU cache coherent
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index 73bcb92179a224..bd9faaab85de18 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -199,6 +199,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>>    * External user API
>>    */
>>   struct iommu_group *vfio_file_iommu_group(struct file *file);
>> +bool vfio_file_is_group(struct file *file);
>>   bool vfio_file_enforced_coherent(struct file *file);
>>   void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
>>   bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index ce1b01d02c5197..54aec3b0559c70 100644
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -61,6 +61,23 @@ static bool kvm_vfio_file_enforced_coherent(struct file *file)
>>       return ret;
>>   }
>> +static bool kvm_vfio_file_is_group(struct file *file)
>> +{
>> +    bool (*fn)(struct file *file);
>> +    bool ret;
>> +
>> +    fn = symbol_get(vfio_file_is_group);
>> +    if (!fn)
>> +        return false;
>> +
>> +    ret = fn(file);
>> +
>> +    symbol_put(vfio_file_is_group);
>> +
>> +    return ret;
>> +}
>> +
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>   static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>>   {
>>       struct iommu_group *(*fn)(struct file *file);
>> @@ -77,7 +94,6 @@ static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>>       return ret;
>>   }
>> -#ifdef CONFIG_SPAPR_TCE_IOMMU
>>   static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
>>                            struct kvm_vfio_group *kvg)
>>   {
>> @@ -136,7 +152,7 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
>>           return -EBADF;
>>       /* Ensure the FD is a vfio group FD.*/
>> -    if (!kvm_vfio_file_iommu_group(filp)) {
>> +    if (!kvm_vfio_file_is_group(filp)) {
>>           ret = -EINVAL;
>>           goto err_fput;
>>       }

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-10-06 11:56 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <0-v2-a3c5f4429e2a+55-iommu_group_lifetime_jgg@nvidia.com>
     [not found] ` <4cb6e49e-554e-57b3-e2d3-bc911d99083f@linux.ibm.com>
     [not found]   ` <20220927140541.6f727b01.alex.williamson@redhat.com>
2022-10-04 15:19     ` [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group Christian Borntraeger
2022-10-04 15:40       ` Jason Gunthorpe
2022-10-04 15:44         ` Christian Borntraeger
2022-10-04 16:28           ` Jason Gunthorpe
2022-10-04 17:15             ` Christian Borntraeger
2022-10-04 17:22               ` Jason Gunthorpe
2022-10-04 17:36             ` Christian Borntraeger
2022-10-04 17:48               ` Christian Borntraeger
2022-10-04 18:22               ` Matthew Rosato
2022-10-04 18:56                 ` Eric Farman
2022-10-05 13:46                 ` Matthew Rosato
2022-10-05 13:57                   ` Jason Gunthorpe
2022-10-05 14:00                     ` Christian Borntraeger
2022-10-05 14:01                     ` Jason Gunthorpe
2022-10-05 14:19                       ` Christian Borntraeger
2022-10-06 11:55                         ` Christian Borntraeger
2022-10-05 14:21                       ` Matthew Rosato
2022-10-05 15:40                         ` Matthew Rosato
2022-10-05 14:01                     ` Matthew Rosato

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox