* aio: fix a user triggered use after free
@ 2016-10-29 7:44 Christoph Hellwig
2016-10-29 7:44 ` [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes) Christoph Hellwig
0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2016-10-29 7:44 UTC (permalink / raw)
To: torvalds, viro
Cc: jack, dmonakhov, jmoyer, linux-fsdevel, linux-aio, linux-kernel
Hi Linus,
the patch below fixes a use after free triggered if one threads closes
a file descriptor after another just started an aio operation.
The patch is originally from Jan with some updates from me - I tried
to get in through Al but I haven't managed to get him to respond yet,
but I'd really like to get into the tree, -stable and the distros ASAP.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-29 7:44 aio: fix a user triggered use after free Christoph Hellwig
@ 2016-10-29 7:44 ` Christoph Hellwig
2016-10-29 12:24 ` Al Viro
0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2016-10-29 7:44 UTC (permalink / raw)
To: torvalds, viro
Cc: jack, dmonakhov, jmoyer, linux-fsdevel, linux-aio, linux-kernel,
stable
From: Jan Kara <jack@suse.cz>
Currently we dropped freeze protection of aio writes just after IO was
submitted. Thus aio write could be in flight while the filesystem was
frozen and that could result in unexpected situation like aio completion
wanting to convert extent type on frozen filesystem. Testcase from
Dmitry triggering this is like:
for ((i=0;i<60;i++));do fsfreeze -f /mnt ;sleep 1;fsfreeze -u /mnt;done &
fio --bs=4k --ioengine=libaio --iodepth=128 --size=1g --direct=1 \
--runtime=60 --filename=/mnt/file --name=rand-write --rw=randwrite
Fix the problem by dropping freeze protection only once IO is completed
in aio_complete().
[hch: The above was the changelog of the original patch from Jan.
It turns out that it fixes something even more important - a use
after free of the file structucture given that the direct I/O
code calls fput and potentially drops the last reference to it in
aio_complete. Together with two racing threads and a zero sized
I/O this seems easily exploitable]
Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
[hch: switch to use __sb_writers_acquired and file_inode(file),
updated changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: stable@vger.kernel.org
---
fs/aio.c | 28 +++++++++++++++++++++++++---
include/linux/fs.h | 1 +
2 files changed, 26 insertions(+), 3 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 1157e13..bf315cd 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1078,6 +1078,17 @@ static void aio_complete(struct kiocb *kiocb, long res, long res2)
unsigned tail, pos, head;
unsigned long flags;
+ if (kiocb->ki_flags & IOCB_WRITE) {
+ struct file *file = kiocb->ki_filp;
+
+ /*
+ * Tell lockdep we inherited freeze protection from submission
+ * thread.
+ */
+ __sb_writers_acquired(file_inode(file)->i_sb, SB_FREEZE_WRITE);
+ file_end_write(file);
+ }
+
/*
* Special case handling for sync iocbs:
* - events go directly into the iocb for fast handling
@@ -1460,13 +1471,24 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
return ret;
}
- if (rw == WRITE)
+ if (rw == WRITE) {
file_start_write(file);
+ req->ki_flags |= IOCB_WRITE;
+ }
+
+ if (rw == WRITE) {
+ /*
+ * We release freeze protection in aio_complete(). Fool
+ * lockdep by telling it the lock got released so that
+ * it doesn't complain about held lock when we return
+ * to userspace.
+ */
+ __sb_writers_release(file_inode(file)->i_sb,
+ SB_FREEZE_WRITE);
+ }
ret = iter_op(req, &iter);
- if (rw == WRITE)
- file_end_write(file);
kfree(iovec);
break;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 16d2b6e..db600e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -321,6 +321,7 @@ struct writeback_control;
#define IOCB_HIPRI (1 << 3)
#define IOCB_DSYNC (1 << 4)
#define IOCB_SYNC (1 << 5)
+#define IOCB_WRITE (1 << 6)
struct kiocb {
struct file *ki_filp;
--
2.1.4
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-29 7:44 ` [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes) Christoph Hellwig
@ 2016-10-29 12:24 ` Al Viro
2016-10-29 15:20 ` Christoph Hellwig
0 siblings, 1 reply; 16+ messages in thread
From: Al Viro @ 2016-10-29 12:24 UTC (permalink / raw)
To: Christoph Hellwig
Cc: torvalds, jack, dmonakhov, jmoyer, linux-fsdevel, linux-aio,
linux-kernel, stable
On Sat, Oct 29, 2016 at 09:44:29AM +0200, Christoph Hellwig wrote:
> - if (rw == WRITE)
> + if (rw == WRITE) {
> file_start_write(file);
> + req->ki_flags |= IOCB_WRITE;
> + }
> + if (rw == WRITE) {
> + /*
> + * We release freeze protection in aio_complete(). Fool
> + * lockdep by telling it the lock got released so that
> + * it doesn't complain about held lock when we return
> + * to userspace.
> + */
> + __sb_writers_release(file_inode(file)->i_sb,
> + SB_FREEZE_WRITE);
> + }
How about taking this chunk (i.e. telling lockdep that we are not holding this
thing) past the iter_op() call, where file_end_write() used to be?
As it is, you risk hiding the lock dependencies the current mainline would've
caught. Other than that I see no problems with the patch...
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-29 12:24 ` Al Viro
@ 2016-10-29 15:20 ` Christoph Hellwig
2016-10-29 16:12 ` Al Viro
2016-10-29 17:47 ` Linus Torvalds
0 siblings, 2 replies; 16+ messages in thread
From: Christoph Hellwig @ 2016-10-29 15:20 UTC (permalink / raw)
To: Al Viro
Cc: Christoph Hellwig, torvalds, jack, dmonakhov, jmoyer,
linux-fsdevel, linux-aio, linux-kernel, stable
On Sat, Oct 29, 2016 at 01:24:51PM +0100, Al Viro wrote:
> How about taking this chunk (i.e. telling lockdep that we are not holding this
> thing) past the iter_op() call, where file_end_write() used to be?
We can't as that would not fix the use after free (at least for the lockdep
case - otherwise the call is a no-op). Once iter_op returns aio_complete
might have dropped our reference to the file, and another thread might
have closed the fd so that the fput from aio_complete was the last one.
This is something that xfstests/323 can reproduce under the right conditions.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-29 15:20 ` Christoph Hellwig
@ 2016-10-29 16:12 ` Al Viro
2016-10-29 16:29 ` Al Viro
2016-10-30 6:32 ` Christoph Hellwig
2016-10-29 17:47 ` Linus Torvalds
1 sibling, 2 replies; 16+ messages in thread
From: Al Viro @ 2016-10-29 16:12 UTC (permalink / raw)
To: Christoph Hellwig
Cc: torvalds, jack, dmonakhov, jmoyer, linux-fsdevel, linux-aio,
linux-kernel, stable
On Sat, Oct 29, 2016 at 05:20:17PM +0200, Christoph Hellwig wrote:
> On Sat, Oct 29, 2016 at 01:24:51PM +0100, Al Viro wrote:
> > How about taking this chunk (i.e. telling lockdep that we are not holding this
> > thing) past the iter_op() call, where file_end_write() used to be?
>
> We can't as that would not fix the use after free (at least for the lockdep
> case - otherwise the call is a no-op). Once iter_op returns aio_complete
> might have dropped our reference to the file, and another thread might
> have closed the fd so that the fput from aio_complete was the last one.
>
> This is something that xfstests/323 can reproduce under the right conditions.
Eww... IOW, as soon as we'd submitted an async iocb, nobody can so much as
look at struct file *or* iocb, right? Or underlying inode, or any fs-private
data structures attached to it...
I certainly agree that it's a bug, but IMO you are just papering over it.
Just look at e.g.
written = mapping->a_ops->direct_IO(iocb, &data);
/*
* Finally, try again to invalidate clean pages which might have been
* cached by non-direct readahead, or faulted in by get_user_pages()
* if the source of the write was an mmap'ed region of the file
* we're writing. Either one is a pretty crazy thing to do,
* so we don't support it 100%. If this invalidation
* fails, tough, the write still worked...
*/
if (mapping->nrpages) {
invalidate_inode_pages2_range(mapping,
pos >> PAGE_SHIFT, end);
}
in generic_file_direct_write() - who said that mapping doesn't point into
freed memory by that point? Or that
inode_lock(inode);
ret = generic_write_checks(iocb, from);
if (ret > 0)
ret = __generic_file_write_iter(iocb, from);
inode_unlock(inode);
in generic_file_write_iter() won't blow up on inode_unlock() of a freed inode?
Or that anything of that sort won't happen in other ->write_iter instances,
for that matter?
NAK, with apologies for not having looked at that earlier. The bug is real,
all right, but this is not a solution - both incomplete and far too brittle.
Why do we play that kind of insane games, anyway? Why not e.g. refcount
the (async) iocb and have kiocb_free() drop the reference, with io_submit_one()
holding one across the call of aio_run_iocb()? Cacheline bouncing issues?
Anything more subtle?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-29 16:12 ` Al Viro
@ 2016-10-29 16:29 ` Al Viro
2016-10-30 6:32 ` Christoph Hellwig
1 sibling, 0 replies; 16+ messages in thread
From: Al Viro @ 2016-10-29 16:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: torvalds, jack, dmonakhov, jmoyer, linux-fsdevel, linux-aio,
linux-kernel, stable
On Sat, Oct 29, 2016 at 05:12:30PM +0100, Al Viro wrote:
> NAK, with apologies for not having looked at that earlier. The bug is real,
> all right, but this is not a solution - both incomplete and far too brittle.
>
> Why do we play that kind of insane games, anyway? Why not e.g. refcount
> the (async) iocb and have kiocb_free() drop the reference, with io_submit_one()
> holding one across the call of aio_run_iocb()? Cacheline bouncing issues?
> Anything more subtle?
PS: I'm not saying that refcounting kiocb is the best solution - grabbing an
extra reference to struct file might be better (we have just dirtied that
cacheline, so the fact that struct file is shared more than kiocb shouldn't
matter much), but I really think that "file and everything attached to it
might disappear as soon as you get async IO started" is insanely brittle -
e.g. xfs_rw_iunlock(ip, iolock) in xfs_file_dio_aio_write() is also unsafe
<checks -next - yup, still there>
If struct file might be gone, so might struct inode and everything behind
it. Which means that we either are not allowed to hold any locks across
__blockdev_direct_IO(), or need have end_io() callback taking care of
dropping those (and adjust the callers accordingly). It might be not
impossible, but... *ouch*
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-29 15:20 ` Christoph Hellwig
2016-10-29 16:12 ` Al Viro
@ 2016-10-29 17:47 ` Linus Torvalds
2016-10-29 18:52 ` Al Viro
2016-10-30 6:29 ` Christoph Hellwig
1 sibling, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2016-10-29 17:47 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Al Viro, Jan Kara, Dmitry Monakhov, Jeff Moyer, linux-fsdevel,
linux-aio, Linux Kernel Mailing List, stable
[-- Attachment #1: Type: text/plain, Size: 900 bytes --]
On Sat, Oct 29, 2016 at 8:20 AM, Christoph Hellwig <hch@lst.de> wrote:
>
> We can't as that would not fix the use after free (at least for the lockdep
> case - otherwise the call is a no-op). Once iter_op returns aio_complete
> might have dropped our reference to the file, and another thread might
> have closed the fd so that the fput from aio_complete was the last one.
I don't concpetually mind the patch per se, but the repeated
if (rw == WRITE) {
..
}
if (rw == WRITE) {
..
}
is just insane and makes the code less legible than it should be.
Also, honestly, make it use a helper: "aio_file_start_write()" and
"aio_file_end_write()" that has the comments and the lockdep games.
Because that patch is just too effing ugly.
Does something like the attached work for you guys?
Linus
[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 2113 bytes --]
fs/aio.c | 33 +++++++++++++++++++++++++++++----
include/linux/fs.h | 1 +
2 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 1157e13a36d6..3f66331ef90c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1066,6 +1066,27 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
return ret;
}
+/*
+ * We do file_start_write/file_end_write() to make sure
+ * we have filesystem freeze protection over the whole
+ * AIO write sequence, but to make sure that lockdep does
+ * not complain about the held lock when we return to
+ * userspace, we tell it that we release and reaquire the
+ * lock.
+ */
+static void aio_file_start_write(struct file *file)
+{
+ file_start_write(file);
+ __sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE);
+}
+
+static void aio_file_end_write(struct file *file)
+{
+ __sb_writers_acquired(file_inode(file)->i_sb, SB_FREEZE_WRITE);
+ file_end_write(file);
+}
+
+
/* aio_complete
* Called when the io request on the given iocb is complete.
*/
@@ -1078,6 +1099,9 @@ static void aio_complete(struct kiocb *kiocb, long res, long res2)
unsigned tail, pos, head;
unsigned long flags;
+ if (kiocb->ki_flags & IOCB_WRITE)
+ aio_file_end_write(kiocb->ki_filp);
+
/*
* Special case handling for sync iocbs:
* - events go directly into the iocb for fast handling
@@ -1460,13 +1484,14 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
return ret;
}
- if (rw == WRITE)
- file_start_write(file);
+ if (rw == WRITE) {
+ /* aio_complete() will end the write */
+ req->ki_flags |= IOCB_WRITE;
+ aio_file_start_write(file);
+ }
ret = iter_op(req, &iter);
- if (rw == WRITE)
- file_end_write(file);
kfree(iovec);
break;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 16d2b6e874d6..db600e9bb1b4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -321,6 +321,7 @@ struct writeback_control;
#define IOCB_HIPRI (1 << 3)
#define IOCB_DSYNC (1 << 4)
#define IOCB_SYNC (1 << 5)
+#define IOCB_WRITE (1 << 6)
struct kiocb {
struct file *ki_filp;
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-29 17:47 ` Linus Torvalds
@ 2016-10-29 18:52 ` Al Viro
2016-10-29 19:07 ` Linus Torvalds
2016-10-30 6:29 ` Christoph Hellwig
1 sibling, 1 reply; 16+ messages in thread
From: Al Viro @ 2016-10-29 18:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jan Kara, Dmitry Monakhov, Jeff Moyer,
linux-fsdevel, linux-aio, Linux Kernel Mailing List, stable
On Sat, Oct 29, 2016 at 10:47:58AM -0700, Linus Torvalds wrote:
> Also, honestly, make it use a helper: "aio_file_start_write()" and
> "aio_file_end_write()" that has the comments and the lockdep games.
>
> Because that patch is just too effing ugly.
>
> Does something like the attached work for you guys?
No. The use-after-free problem is real, nasty and only papered over by
that patch.
What happens is that io_submit_one()
* allocates aio_kiocb
* does fget() and stuffs the struct file * into kiocb
* in case of early problems we call kiocb_free(), freeing kiocb and
doing fput() on file, then bugger off.
* otherwise, eventually we get to passing that iocb to
->read_iter()/->write_iter().
* if that has resulted in anything other than -EIOCBQUEUED, we
call aio_complete(), which calls kiocb_free(), freeing kiocb and doing fput()
on file.
* if ->{read,write}_iter() returns -EIOCBQUEUED, we expect
aio_complete() to be called asynchronously.
And that call can happen as soon as we return from __blockdev_direct_IO()
(even earlier, actually). As soon as that happens, the reference to
struct file we'd acquired in io_submit_one() is dropped. If descriptor
table had been shared, another thread might have already closed that sucker,
and fput() from aio_complete() would free struct file.
That's what this patch is papering over. Because if we hit that scenario
and struct file *does* get closed asynchronously just as our ->write_iter()
is returning from __blockdev_direct_IO(), we are fucked. Not only struct
file might be freed - struct inode might've been gone too. And a bunch
of ->write_iter/->read_iter instances do access struct inode after the call
of __blockdev_direct_IO(). file_write_end() is just the tip of the iceberg -
see examples I've posted today for the things we *can't* move around.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-29 18:52 ` Al Viro
@ 2016-10-29 19:07 ` Linus Torvalds
2016-10-29 19:17 ` Al Viro
0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2016-10-29 19:07 UTC (permalink / raw)
To: Al Viro
Cc: Christoph Hellwig, Jan Kara, Dmitry Monakhov, Jeff Moyer,
linux-fsdevel, linux-aio, Linux Kernel Mailing List, stable
On Sat, Oct 29, 2016 at 11:52 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> And that call can happen as soon as we return from __blockdev_direct_IO()
> (even earlier, actually). As soon as that happens, the reference to
> struct file we'd acquired in io_submit_one() is dropped. If descriptor
> table had been shared, another thread might have already closed that sucker,
> and fput() from aio_complete() would free struct file.
But that's the point. We don't *do* anything like that any more. We
now always do the final access from aio_complete(). So it doesn't
matter if that is called asynchronously (very early) or not.
That's the whole point of the patch. Exactly to do everything either
*before* we even submit it (at which point no completion can happen),
or doing it in aio_complete() which is guaranteed to be after the
submission. No races, no use-after-free.
What am I missing?
Linus
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-29 19:07 ` Linus Torvalds
@ 2016-10-29 19:17 ` Al Viro
2016-10-29 20:09 ` Linus Torvalds
0 siblings, 1 reply; 16+ messages in thread
From: Al Viro @ 2016-10-29 19:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jan Kara, Dmitry Monakhov, Jeff Moyer,
linux-fsdevel, linux-aio, Linux Kernel Mailing List, stable
On Sat, Oct 29, 2016 at 12:07:07PM -0700, Linus Torvalds wrote:
> On Sat, Oct 29, 2016 at 11:52 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > And that call can happen as soon as we return from __blockdev_direct_IO()
> > (even earlier, actually). As soon as that happens, the reference to
> > struct file we'd acquired in io_submit_one() is dropped. If descriptor
> > table had been shared, another thread might have already closed that sucker,
> > and fput() from aio_complete() would free struct file.
>
> But that's the point. We don't *do* anything like that any more. We
> now always do the final access from aio_complete(). So it doesn't
> matter if that is called asynchronously (very early) or not.
>
> That's the whole point of the patch. Exactly to do everything either
> *before* we even submit it (at which point no completion can happen),
> or doing it in aio_complete() which is guaranteed to be after the
> submission. No races, no use-after-free.
>
> What am I missing?
if (ret > 0)
ret = __generic_file_write_iter(iocb, from);
inode_unlock(inode);
in generic_file_write_iter() (inode might be gone here)
written = mapping->a_ops->direct_IO(iocb, &data);
/*
* Finally, try again to invalidate clean pages which might have been
* cached by non-direct readahead, or faulted in by get_user_pages()
* if the source of the write was an mmap'ed region of the file
* we're writing. Either one is a pretty crazy thing to do,
* so we don't support it 100%. If this invalidation
* fails, tough, the write still worked...
*/
if (mapping->nrpages) {
invalidate_inode_pages2_range(mapping,
pos >> PAGE_SHIFT, end);
in generic_file_direct_write() (mapping points into inode, which might be
freed)
ret = __blockdev_direct_IO(iocb, inode, target->bt_bdev, &data,
xfs_get_blocks_direct, NULL, NULL, 0);
if (ret >= 0) {
iocb->ki_pos += ret;
iov_iter_advance(to, ret);
}
xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
in xfs_file_dio_aio_read() (ip points to xfs-private part of inode).
And that - just from a quick look. We *do* access inode between the call
of __blockdev_direct_IO() and return from ->write_iter(). What's more,
as soon as aio_complete() has happened, what's to stop close from another
thread + umount + rmmod unmapping that ->write_iter() completely?
AFAICS, the possibility of dropping the last reference to struct file
before ->write_iter() has returned is fundamentally broken. I might be
missing something subtle here, but...
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-29 19:17 ` Al Viro
@ 2016-10-29 20:09 ` Linus Torvalds
2016-10-30 9:44 ` Jan Kara
0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2016-10-29 20:09 UTC (permalink / raw)
To: Al Viro
Cc: Christoph Hellwig, Jan Kara, Dmitry Monakhov, Jeff Moyer,
linux-fsdevel, linux-aio, Linux Kernel Mailing List, stable
On Sat, Oct 29, 2016 at 12:17 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> AFAICS, the possibility of dropping the last reference to struct file
> before ->write_iter() has returned is fundamentally broken. I might be
> missing something subtle here, but...
Ok, let's add a get_file(); fput(); around that whole iter call sequence.
And that's a separate issue from "we should hold the fs freezer lock
around the whole operation". So I think we need both.
Does that make everybody happy?
Linus
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-29 17:47 ` Linus Torvalds
2016-10-29 18:52 ` Al Viro
@ 2016-10-30 6:29 ` Christoph Hellwig
1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2016-10-30 6:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Al Viro, Jan Kara, Dmitry Monakhov, Jeff Moyer,
linux-fsdevel, linux-aio, Linux Kernel Mailing List, stable
On Sat, Oct 29, 2016 at 10:47:58AM -0700, Linus Torvalds wrote:
> On Sat, Oct 29, 2016 at 8:20 AM, Christoph Hellwig <hch@lst.de> wrote:
> >
> > We can't as that would not fix the use after free (at least for the lockdep
> > case - otherwise the call is a no-op). Once iter_op returns aio_complete
> > might have dropped our reference to the file, and another thread might
> > have closed the fd so that the fput from aio_complete was the last one.
>
> I don't concpetually mind the patch per se, but the repeated
>
> if (rw == WRITE) {
> ..
> }
>
> if (rw == WRITE) {
> ..
> }
>
> is just insane and makes the code less legible than it should be.
I agree. And I actually prepared a major refactor of the code to
get rid of it, but didn't dare to post it this late in the cycle.
Patch below, although this recently rebased version is not actually tested,
so be aware:
---
commit 8bc9e04f941645a610ca46804a95170bdd854450
Author: Christoph Hellwig <hch@lst.de>
Date: Sun Oct 30 07:27:16 2016 +0100
fs: refactor aio_run_iocb
Signed-off-by: Christoph Hellwig <hch@lst.de>
diff --git a/fs/aio.c b/fs/aio.c
index bf315cd..bc5541e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1403,13 +1403,16 @@ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
return -EINVAL;
}
-typedef ssize_t (rw_iter_op)(struct kiocb *, struct iov_iter *);
-
-static int aio_setup_vectored_rw(int rw, char __user *buf, size_t len,
- struct iovec **iovec,
- bool compat,
- struct iov_iter *iter)
+static int aio_setup_rw(int rw, char __user *buf, size_t len,
+ struct iovec **iovec, bool vectored, bool compat,
+ struct iov_iter *iter)
{
+ if (!vectored) {
+ ssize_t ret = import_single_range(rw, buf, len, *iovec, iter);
+ iovec = NULL;
+ return ret;
+ }
+
#ifdef CONFIG_COMPAT
if (compat)
return compat_import_iovec(rw,
@@ -1420,110 +1423,100 @@ static int aio_setup_vectored_rw(int rw, char __user *buf, size_t len,
len, UIO_FASTIOV, iovec, iter);
}
-/*
- * aio_run_iocb:
- * Performs the initial checks and io submission.
- */
-static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
- char __user *buf, size_t len, bool compat)
+static ssize_t aio_read(struct kiocb *req, char __user *buf, size_t len,
+ bool vectored, bool compat)
{
struct file *file = req->ki_filp;
- ssize_t ret;
- int rw;
- fmode_t mode;
- rw_iter_op *iter_op;
struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
struct iov_iter iter;
+ ssize_t ret;
- switch (opcode) {
- case IOCB_CMD_PREAD:
- case IOCB_CMD_PREADV:
- mode = FMODE_READ;
- rw = READ;
- iter_op = file->f_op->read_iter;
- goto rw_common;
+ if (unlikely(!(file->f_mode & FMODE_READ)))
+ return -EBADF;
+ if (unlikely(!file->f_op->read_iter))
+ return -EINVAL;
- case IOCB_CMD_PWRITE:
- case IOCB_CMD_PWRITEV:
- mode = FMODE_WRITE;
- rw = WRITE;
- iter_op = file->f_op->write_iter;
- goto rw_common;
-rw_common:
- if (unlikely(!(file->f_mode & mode)))
- return -EBADF;
-
- if (!iter_op)
- return -EINVAL;
+ ret = aio_setup_rw(READ, buf, len, &iovec, vectored, compat, &iter);
+ if (ret)
+ goto out_free_iovec;
- if (opcode == IOCB_CMD_PREADV || opcode == IOCB_CMD_PWRITEV)
- ret = aio_setup_vectored_rw(rw, buf, len,
- &iovec, compat, &iter);
- else {
- ret = import_single_range(rw, buf, len, iovec, &iter);
- iovec = NULL;
- }
- if (!ret)
- ret = rw_verify_area(rw, file, &req->ki_pos,
- iov_iter_count(&iter));
- if (ret < 0) {
- kfree(iovec);
- return ret;
- }
+ ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter));
+ if (ret < 0)
+ goto out_free_iovec;
- if (rw == WRITE) {
- file_start_write(file);
- req->ki_flags |= IOCB_WRITE;
- }
+ ret = file->f_op->read_iter(req, &iter);
+out_free_iovec:
+ kfree(iovec);
+ return ret;
+}
- if (rw == WRITE) {
- /*
- * We release freeze protection in aio_complete(). Fool
- * lockdep by telling it the lock got released so that
- * it doesn't complain about held lock when we return
- * to userspace.
- */
- __sb_writers_release(file_inode(file)->i_sb,
- SB_FREEZE_WRITE);
- }
+static ssize_t aio_write(struct kiocb *req, char __user *buf, size_t len,
+ bool vectored, bool compat)
+{
+ struct file *file = req->ki_filp;
+ struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+ struct iov_iter iter;
+ ssize_t ret;
- ret = iter_op(req, &iter);
+ if (unlikely(!(file->f_mode & FMODE_WRITE)))
+ return -EBADF;
+ if (unlikely(!file->f_op->write_iter))
+ return -EINVAL;
- kfree(iovec);
- break;
+ ret = aio_setup_rw(WRITE, buf, len, &iovec, vectored, compat, &iter);
+ if (ret)
+ goto out_free_iovec;
- case IOCB_CMD_FDSYNC:
- if (!file->f_op->aio_fsync)
- return -EINVAL;
+ ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter));
+ if (ret < 0)
+ goto out_free_iovec;
- ret = file->f_op->aio_fsync(req, 1);
- break;
+ file_start_write(file);
+ req->ki_flags |= IOCB_WRITE;
+ ret = file->f_op->write_iter(req, &iter);
- case IOCB_CMD_FSYNC:
- if (!file->f_op->aio_fsync)
- return -EINVAL;
+ /*
+ * We release freeze protection in aio_complete(). Fool
+ * lockdep by telling it the lock got released so that
+ * it doesn't complain about held lock when we return
+ * to userspace.
+ */
+ __sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE);
- ret = file->f_op->aio_fsync(req, 0);
- break;
+out_free_iovec:
+ kfree(iovec);
+ return ret;
+}
+static ssize_t aio_fsync(struct kiocb *req, bool datasync)
+{
+ struct file *file = req->ki_filp;
+
+ if (!file->f_op->aio_fsync)
+ return -EINVAL;
+ return file->f_op->aio_fsync(req, datasync);
+}
+
+static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
+ char __user *buf, size_t len, bool compat)
+{
+ switch (opcode) {
+ case IOCB_CMD_PREAD:
+ return aio_read(req, buf, len, false, compat);
+ case IOCB_CMD_PREADV:
+ return aio_read(req, buf, len, true, compat);
+ case IOCB_CMD_PWRITE:
+ return aio_write(req, buf, len, false, compat);
+ case IOCB_CMD_PWRITEV:
+ return aio_write(req, buf, len, true, compat);
+ case IOCB_CMD_FDSYNC:
+ return aio_fsync(req, true);
+ case IOCB_CMD_FSYNC:
+ return aio_fsync(req, false);
default:
pr_debug("EINVAL: no operation provided\n");
return -EINVAL;
}
-
- if (ret != -EIOCBQUEUED) {
- /*
- * There's no easy way to restart the syscall since other AIO's
- * may be already running. Just fail this IO with EINTR.
- */
- if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
- ret == -ERESTARTNOHAND ||
- ret == -ERESTART_RESTARTBLOCK))
- ret = -EINTR;
- aio_complete(req, ret, 0);
- }
-
- return 0;
}
static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
@@ -1591,8 +1584,17 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
(char __user *)(unsigned long)iocb->aio_buf,
iocb->aio_nbytes,
compat);
- if (ret)
- goto out_put_req;
+ if (ret != -EIOCBQUEUED) {
+ /*
+ * There's no easy way to restart the syscall since other AIO's
+ * may be already running. Just fail this IO with EINTR.
+ */
+ if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+ ret == -ERESTARTNOHAND ||
+ ret == -ERESTART_RESTARTBLOCK))
+ ret = -EINTR;
+ aio_complete(&req->common, ret, 0);
+ }
return 0;
out_put_req:
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-29 16:12 ` Al Viro
2016-10-29 16:29 ` Al Viro
@ 2016-10-30 6:32 ` Christoph Hellwig
1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2016-10-30 6:32 UTC (permalink / raw)
To: Al Viro
Cc: Christoph Hellwig, torvalds, jack, dmonakhov, jmoyer,
linux-fsdevel, linux-aio, linux-kernel, stable
On Sat, Oct 29, 2016 at 05:12:30PM +0100, Al Viro wrote:
> Eww... IOW, as soon as we'd submitted an async iocb, nobody can so much as
> look at struct file *or* iocb, right? Or underlying inode, or any fs-private
> data structures attached to it...
Yeah.
> I certainly agree that it's a bug, but IMO you are just papering over it.
> Just look at e.g.
> written = mapping->a_ops->direct_IO(iocb, &data);
>
> /*
> * Finally, try again to invalidate clean pages which might have been
> * cached by non-direct readahead, or faulted in by get_user_pages()
> * if the source of the write was an mmap'ed region of the file
> * we're writing. Either one is a pretty crazy thing to do,
> * so we don't support it 100%. If this invalidation
> * fails, tough, the write still worked...
> */
> if (mapping->nrpages) {
> invalidate_inode_pages2_range(mapping,
> pos >> PAGE_SHIFT, end);
> }
> in generic_file_direct_write() - who said that mapping doesn't point into
> freed memory by that point?
True, somewhat unlike due to inode caching, but for sure possible
and something that needs to be deal with.
> Why do we play that kind of insane games, anyway? Why not e.g. refcount
> the (async) iocb and have kiocb_free() drop the reference, with io_submit_one()
> holding one across the call of aio_run_iocb()? Cacheline bouncing issues?
> Anything more subtle?
There shouldn't really be a need to refcount the iocb itself - nothing
worth looking at. The one we consider was struct file, and I didn't
like it because of the cacheline bouncing if we decrement if from both
the submitter and completion context. But it starts to sounds like
the sane version more and more.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-29 20:09 ` Linus Torvalds
@ 2016-10-30 9:44 ` Jan Kara
2016-10-30 10:52 ` Jan Kara
0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2016-10-30 9:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Al Viro, Christoph Hellwig, Jan Kara, Dmitry Monakhov, Jeff Moyer,
linux-fsdevel, linux-aio, Linux Kernel Mailing List, stable
On Sat 29-10-16 13:09:25, Linus Torvalds wrote:
> On Sat, Oct 29, 2016 at 12:17 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > AFAICS, the possibility of dropping the last reference to struct file
> > before ->write_iter() has returned is fundamentally broken. I might be
> > missing something subtle here, but...
>
> Ok, let's add a get_file(); fput(); around that whole iter call sequence.
>
> And that's a separate issue from "we should hold the fs freezer lock
> around the whole operation". So I think we need both.
Yup, these are two separate issues. I'm fine with adding get_file(); fput()
around the iter call sequence. Then, when we have struct file available for
the whole time ->write_iter runs, I'd prefer to keep the call to fool
lockdep where original file_end_write() call was - that gives us proper
lockdep coverage for all the code behind iter_op(). The downside is we
cannot keep helpers as elegant as you suggested in your patch but I believe
it's bearable and worth the additional lockdep coverage. I'll send patches
shortly...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-30 9:44 ` Jan Kara
@ 2016-10-30 10:52 ` Jan Kara
2016-10-30 15:58 ` Christoph Hellwig
0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2016-10-30 10:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Al Viro, Christoph Hellwig, Jan Kara, Dmitry Monakhov, Jeff Moyer,
linux-fsdevel, linux-aio, Linux Kernel Mailing List, stable
On Sun 30-10-16 10:44:37, Jan Kara wrote:
> On Sat 29-10-16 13:09:25, Linus Torvalds wrote:
> > On Sat, Oct 29, 2016 at 12:17 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > AFAICS, the possibility of dropping the last reference to struct file
> > > before ->write_iter() has returned is fundamentally broken. I might be
> > > missing something subtle here, but...
> >
> > Ok, let's add a get_file(); fput(); around that whole iter call sequence.
> >
> > And that's a separate issue from "we should hold the fs freezer lock
> > around the whole operation". So I think we need both.
>
> Yup, these are two separate issues. I'm fine with adding get_file(); fput()
> around the iter call sequence. Then, when we have struct file available for
> the whole time ->write_iter runs, I'd prefer to keep the call to fool
> lockdep where original file_end_write() call was - that gives us proper
> lockdep coverage for all the code behind iter_op(). The downside is we
> cannot keep helpers as elegant as you suggested in your patch but I believe
> it's bearable and worth the additional lockdep coverage. I'll send patches
> shortly...
Hum, the additional refcount patch oopses on me when running generic/323,
I'll have to board to my flight to US shortly so I won't be able to send it
soon - maybe when I'm transferring in Denver ;).
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes)
2016-10-30 10:52 ` Jan Kara
@ 2016-10-30 15:58 ` Christoph Hellwig
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2016-10-30 15:58 UTC (permalink / raw)
To: Jan Kara
Cc: Linus Torvalds, Al Viro, Christoph Hellwig, Dmitry Monakhov,
Jeff Moyer, linux-fsdevel, linux-aio, Linux Kernel Mailing List,
stable
On Sun, Oct 30, 2016 at 11:52:31AM +0100, Jan Kara wrote:
> Hum, the additional refcount patch oopses on me when running generic/323,
> I'll have to board to my flight to US shortly so I won't be able to send it
> soon - maybe when I'm transferring in Denver ;).
I've got a version that works and the rest of the patches rebased on top
of it.
And I've also got internet on my flight, so I'll be able to post it as
soon as testing finishes :)
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
---end quoted text---
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2016-10-30 15:58 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-10-29 7:44 aio: fix a user triggered use after free Christoph Hellwig
2016-10-29 7:44 ` [PATCH] aio: fix a user triggered use after free (and fix freeze protection of aio writes) Christoph Hellwig
2016-10-29 12:24 ` Al Viro
2016-10-29 15:20 ` Christoph Hellwig
2016-10-29 16:12 ` Al Viro
2016-10-29 16:29 ` Al Viro
2016-10-30 6:32 ` Christoph Hellwig
2016-10-29 17:47 ` Linus Torvalds
2016-10-29 18:52 ` Al Viro
2016-10-29 19:07 ` Linus Torvalds
2016-10-29 19:17 ` Al Viro
2016-10-29 20:09 ` Linus Torvalds
2016-10-30 9:44 ` Jan Kara
2016-10-30 10:52 ` Jan Kara
2016-10-30 15:58 ` Christoph Hellwig
2016-10-30 6:29 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).