Re: [RFC] UAF in acrn_irqfd_assign() and vfio_virqfd_enable()

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Al Viro <viro@zeniv.linux.org.uk>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Fei Li <fei1.li@intel.com>,
	kvm@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC] UAF in acrn_irqfd_assign() and vfio_virqfd_enable()
Date: Wed, 12 Jun 2024 03:16:53 +0100	[thread overview]
Message-ID: <20240612021653.GF1629371@ZenIV> (raw)
In-Reply-To: <20240611170438.508a2612.alex.williamson@redhat.com>

On Tue, Jun 11, 2024 at 05:04:38PM -0600, Alex Williamson wrote:

> > OK, so the memory safety in there depends upon
> > 	* external exclusion wrt vfio_virqfd_disable() on caller-specific
> > locks (vfio_pci_core_device::ioeventfds_lock for vfio_pci_rdwr.c,
> > vfio_pci_core_device::igate for the rest?  What about the path via
> > vfio_pci_core_disable()?)
> 
> This is only called when the device is closed, therefore there's no
> userspace access to generate a race.

Umm...  Let's see if I got confused in RTFS:

	1.  Calls of vfio_pci_core_disable() come from assorted ->close_device()
instances and failure exits in ->open_device() ones.
	2.  ->open_device() is called by vfio_df_device_first_open() from
vfio_df_open().  That's done under device->dev_set->lock.
	3.  ->close_device() is called by vfio_df_device_last_close() from
vfio_df_close(), under the same lock.
	4.  vfio_df_open() comes from vfio_df_ioctl_bind_iommufd() or from
vfio_df_group_open().  vfio_df_close() is done by the failure exits in those
two, as well as by vfio_df_unbind_iommufd() and vfio_df_group_close().
	5.  vfio_df_bind_iommufd() handles VFIO_DEVICE_BIND_IOMMUFD in
vfio_device_fops->unlocked_ioctl(); only works for !df->group and only
once, unless I'm misreading vfio_df_open().  No other ioctls are allowed
until that's done and vfio_df_unbind_iommufd() is done in ->release(),
in case of !df->group.  vfio_df_close() is done there, in case we had
a successful BIND_IOMMUFD done at some point.  Multiple files can be opened
for the same device; once one of them have done BIND_IOMMUFD, BIND_IOMMUFD
on any of them will fail until the first caller gets closed.  Once that
happens, others can get BIND_IOMMUFD; until then ioctls don't work for
them at all (IOW, BIND_IOMMUFD grants the ability to do ioctls only for
the opened file it had been done on).
	6.  vfio_df_group_open() and vfio_df_close() is about the other
way to get files with such ->f_op - VFIO_GROUP_GET_DEVICE_FD handling
in vfio_group_fops.unlocked_ioctl().  That gets an anon-inode file
with vfio_device_fops and shoves it into descriptor table.  Presumably
vfio_device_get_from_name() always returns a device with non-NULL
device->group (it would better, or the things would get really confused).
vfio_df_group_open() is done fist, with vfio_df_group_close() on failure
exit *and* in ->release() of those suckers (again, assuming we do get
non-NULL ->group).

	OK, that seems to be enough - anything done in ->ioctl() would
be completed between vfio_df_open() and vfio_df_close(), so we do have
the exclusion.	Is the above correct?

FWIW, this was a confusing bit:
        vfio_pci_core_disable(vdev);

        mutex_lock(&vdev->igate);
        if (vdev->err_trigger) {
                eventfd_ctx_put(vdev->err_trigger);
                vdev->err_trigger = NULL;
        }
        if (vdev->req_trigger) {
                eventfd_ctx_put(vdev->req_trigger);
                vdev->req_trigger = NULL;
        }
        mutex_unlock(&vdev->igate);
in vfio_pci_core_close_device().  Since we need an exclusion on ->igate
there for something, and since it's one of the locks used to serialize
vfio_virqfd_enable()...

> > 	* no EPOLLHUP on eventfd while the file is pinned.  That's what
> >         /*
> >          * Do not drop the file until the irqfd is fully initialized,
> >          * otherwise we might race against the EPOLLHUP.
> >          */
> > in there (that "irqfd" is a typo for "kirqfd", right?) refers to.
> 
> Sorry, I'm not fully grasping your comment.  "irqfd" is not a typo
> here, "kirqfd" seems to be a Xen thing.  I believe the comment is
> referring to holding a reference to the fd until everything is in place
> to cleanup correctly if the user process is killed mid-setup.  Thanks,

*blink*

s/kirqfd/virqfd/, sorry.  In the comment earlier in the same function:
         * virqfds can be released by closing the eventfd or directly
	   ^^^^^^^
         * through ioctl.  These are both done through a workqueue, so
         * we update the pointer to the virqfd under lock to avoid
				    ^^^ ^^^^^^
         * pushing multiple jobs to release the same virqfd.
					    ^^^ ^^^^ ^^^^^^
In the second comment you have
         * Do not drop the file until the irqfd is fully initialized,
				      ^^^ ^^^^^
         * otherwise we might race against the EPOLLHUP.

	If that refers to the same object, the comment makes sense - once
you've called vfs_poll(), EPOLLHUP wakeup would have your new instance of
struct virqfd freed, so accessing it (e.g. in
                        schedule_work(&virqfd->inject);
) is only safe because EPOLLHUP won't come until eventfd_release(), which
will not happen as long as you don't drop the file reference that sits in
irqfd.file.  That's the reason why struct fd instance can't be released
until you are done with setting the struct virqfd instance up.

	If that reading is what you intended, "irqfd" in the second
comment ought to be "virqfd", to be consistent with the reference to the
same thing in the earlier comment.  If that's not what you meant... what
is that comment really about?  Killing the user process mid-setup won't
actually do anything until your thread is out of whatever syscall it
had been in (ioctl(2), usually); the dangerous scenario would be having
another thread close the same descriptor after you've done fdput().

	The thing you need to avoid is having all references to
eventfd file dropped.  For that the reference in descriptor table must
be gone.  Sure, killing the process might do that - once all threads
get to exit_files() and drop their references to descriptor table.
Then put_files_struct() from the last of them will call close_files()
and drop all file references you had in the table.

	But that can happen only when all threads have gotten through
the signal delivery.  Including the one that was in the middle of
vfio_virqfd_enable().  And _that_ won't happen until return from that
function.

	So having the caller killed mid-setup is not an issue.	Another
thread sharing the same descriptor table and calling close(fd) (or
dup2(something, fd), or anything else that would close that descriptor)
would be.  _That_ is what is prevented by the fdget()/fdput() pair -
between those we are guaranteed that file reference will stay pinned.
If descriptor table is shared, fdget() will clone the reference and store
it in irqfd.file, so the file remains pinned no matter what happens
to descriptor table, until fdput() drops the reference.  If the table
is _not_ shared, the reference in it won't go away until we are done,
so we can borrow that into irqfd.file (and do nothing on fdput()).

	Anyway, I believe that what you have there is actually safe.
Analysis could be less convoluted, but then I might've missed simpler
reasons why everything works.

	It really needs comments in there - as it is, two drivers have
copied that scheme without bothering with exclusion (commit f8941e6c4c71
"xen: privcmd: Add support for irqfd" last year and commit aa3b483ff1d7
"virt: acrn: Introduce irqfd" three years ago) with, AFAICT, real UAF
in each ;-/

next prev parent reply	other threads:[~2024-06-12  2:16 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-07  1:56 [PATCHES][RFC] rework of struct fd handling Al Viro
2024-06-07  1:59 ` [PATCH 01/19] powerpc: fix a file leak in kvm_vcpu_ioctl_enable_cap() Al Viro
2024-06-07  1:59   ` [PATCH 02/19] lirc: rc_dev_get_from_fd(): fix file leak Al Viro
2024-06-07 15:17     ` Christian Brauner
2024-06-07  1:59   ` [PATCH 03/19] introduce fd_file(), convert all accessors to it Al Viro
2024-06-07  1:59   ` [PATCH 04/19] struct fd: representation change Al Viro
2024-06-07  5:55     ` Amir Goldstein
2024-06-07  1:59   ` [PATCH 05/19] add struct fd constructors, get rid of __to_fd() Al Viro
2024-06-07  1:59   ` [PATCH 06/19] net/socket.c: switch to CLASS(fd) Al Viro
2024-06-07  1:59   ` [PATCH 07/19] introduce struct fderr, convert overlayfs uses to that Al Viro
2024-06-07  1:59   ` [PATCH 08/19] fdget_raw() users: switch to CLASS(fd_raw, ...) Al Viro
2024-06-07 15:20     ` Christian Brauner
2024-06-07  1:59   ` [PATCH 09/19] css_set_fork(): " Al Viro
2024-06-07 15:21     ` Christian Brauner
2024-06-07  1:59   ` [PATCH 10/19] introduce "fd_pos" class Al Viro
2024-06-07 15:21     ` Christian Brauner
2024-06-07  1:59   ` [PATCH 11/19] switch simple users of fdget() to CLASS(fd, ...) Al Viro
2024-06-07 15:26     ` Christian Brauner
2024-06-07 16:10       ` Al Viro
2024-06-07 16:11         ` Al Viro
2024-06-07 21:08         ` Al Viro
2024-06-10  2:44           ` [RFC] potential UAF in kvm_spapr_tce_attach_iommu_group() (was Re: [PATCH 11/19] switch simple users of fdget() to CLASS(fd, ...)) Al Viro
2024-06-12 16:36             ` Linus Torvalds
2024-06-13 10:56               ` Michael Ellerman
2024-06-10  5:12           ` [RFC] UAF in acrn_irqfd_assign() and vfio_virqfd_enable() Al Viro
2024-06-10 17:03             ` Al Viro
2024-06-10 20:09             ` Alex Williamson
2024-06-10 20:53               ` Al Viro
2024-06-11 23:04                 ` Alex Williamson
2024-06-12  2:16                   ` Al Viro [this message]
2024-06-07  1:59   ` [PATCH 12/19] bpf: switch to CLASS(fd, ...) Al Viro
2024-06-07 15:27     ` Christian Brauner
2024-06-07  1:59   ` [PATCH 13/19] convert vmsplice() " Al Viro
2024-06-07  1:59   ` [PATCH 14/19] finit_module(): convert " Al Viro
2024-06-07  1:59   ` [PATCH 15/19] timerfd: switch " Al Viro
2024-06-07  1:59   ` [PATCH 16/19] do_mq_notify(): " Al Viro
2024-06-07  1:59   ` [PATCH 17/19] simplify xfs_find_handle() a bit Al Viro
2024-06-07  1:59   ` [PATCH 18/19] convert kernel/events/core.c Al Viro
2024-06-07  1:59   ` [PATCH 19/19] deal with the last remaing boolean uses of fd_file() Al Viro
2024-06-07 15:16   ` [PATCH 01/19] powerpc: fix a file leak in kvm_vcpu_ioctl_enable_cap() Christian Brauner
2024-06-07 15:30 ` [PATCHES][RFC] rework of struct fd handling Christian Brauner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240612021653.GF1629371@ZenIV \
    --to=viro@zeniv.linux.org.uk \
    --cc=alex.williamson@redhat.com \
    --cc=fei1.li@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).