Re: [PATCH 3/3] io_uring: add support for getdents

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Hao Xu <hao.xu@linux.dev>
To: Christian Brauner <brauner@kernel.org>
Cc: io-uring@vger.kernel.org, Jens Axboe <axboe@kernel.dk>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Pavel Begunkov <asml.silence@gmail.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Stefan Roesch <shr@fb.com>, Clay Harris <bugs@claycon.org>,
	Dave Chinner <david@fromorbit.com>,
	linux-fsdevel@vger.kernel.org, Wanpeng Li <wanpengli@tencent.com>
Subject: Re: [PATCH 3/3] io_uring: add support for getdents
Date: Thu, 13 Jul 2023 12:35:07 +0800	[thread overview]
Message-ID: <bb2aa872-c3fb-93f0-c0da-3a897f39347d@linux.dev> (raw)
In-Reply-To: <20230712-alltag-abberufen-67a615152bee@brauner>

On 7/12/23 23:27, Christian Brauner wrote:
> On Tue, Jul 11, 2023 at 07:40:27PM +0800, Hao Xu wrote:
>> From: Hao Xu <howeyxu@tencent.com>
>>
>> This add support for getdents64 to io_uring, acting exactly like the
>> syscall: the directory is iterated from it's current's position as
>> stored in the file struct, and the file's position is updated exactly as
>> if getdents64 had been called.
>>
>> For filesystems that support NOWAIT in iterate_shared(), try to use it
>> first; if a user already knows the filesystem they use do not support
>> nowait they can force async through IOSQE_ASYNC in the sqe flags,
>> avoiding the need to bounce back through a useless EAGAIN return.
>>
>> Co-developed-by: Dominique Martinet <asmadeus@codewreck.org>
>> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>> ---
>>   include/uapi/linux/io_uring.h |  7 ++++
>>   io_uring/fs.c                 | 60 +++++++++++++++++++++++++++++++++++
>>   io_uring/fs.h                 |  3 ++
>>   io_uring/opdef.c              |  8 +++++
>>   4 files changed, 78 insertions(+)
>>
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 08720c7bd92f..6c0d521135a6 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -65,6 +65,7 @@ struct io_uring_sqe {
>>   		__u32		xattr_flags;
>>   		__u32		msg_ring_flags;
>>   		__u32		uring_cmd_flags;
>> +		__u32		getdents_flags;
>>   	};
>>   	__u64	user_data;	/* data to be passed back at completion time */
>>   	/* pack this to avoid bogus arm OABI complaints */
>> @@ -235,6 +236,7 @@ enum io_uring_op {
>>   	IORING_OP_URING_CMD,
>>   	IORING_OP_SEND_ZC,
>>   	IORING_OP_SENDMSG_ZC,
>> +	IORING_OP_GETDENTS,
>>   
>>   	/* this goes last, obviously */
>>   	IORING_OP_LAST,
>> @@ -273,6 +275,11 @@ enum io_uring_op {
>>    */
>>   #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
>>   
>> +/*
>> + * sqe->getdents_flags
>> + */
>> +#define IORING_GETDENTS_REWIND	(1U << 0)
>> +
>>   /*
>>    * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
>>    * command flags for POLL_ADD are stored in sqe->len.
>> diff --git a/io_uring/fs.c b/io_uring/fs.c
>> index f6a69a549fd4..77f00577e09c 100644
>> --- a/io_uring/fs.c
>> +++ b/io_uring/fs.c
>> @@ -47,6 +47,13 @@ struct io_link {
>>   	int				flags;
>>   };
>>   
>> +struct io_getdents {
>> +	struct file			*file;
>> +	struct linux_dirent64 __user	*dirent;
>> +	unsigned int			count;
>> +	int				flags;
>> +};
>> +
>>   int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>   {
>>   	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
>> @@ -291,3 +298,56 @@ void io_link_cleanup(struct io_kiocb *req)
>>   	putname(sl->oldpath);
>>   	putname(sl->newpath);
>>   }
>> +
>> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>> +{
>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>> +
>> +	if (READ_ONCE(sqe->off) != 0)
>> +		return -EINVAL;
>> +
>> +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
>> +	gd->count = READ_ONCE(sqe->len);
>> +
>> +	return 0;
>> +}
>> +
>> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>> +	struct file *file;
>> +	unsigned long getdents_flags = 0;
>> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>> +	bool should_lock = false;
>> +	int ret;
>> +
>> +	if (force_nonblock) {
>> +		if (!(req->file->f_mode & FMODE_NOWAIT))
>> +			return -EAGAIN;
>> +
>> +		getdents_flags = DIR_CONTEXT_F_NOWAIT;
> 
> I mentioned this on the other patch but it seems really pointless to
> have that extra flag. I would really like to hear a good reason for
> this.
> 
>> +	}
>> +
>> +	file = req->file;
>> +	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
>> +		if (file_count(file) > 1)
> 
> Assume we have a regular non-threaded process that just opens an fd to a
> file. The process registers an async readdir request via that fd for the
> file with io_uring and goes to do other stuff while waiting for the
> result.
> 
> Some time later, io_uring gets to io_getdents() and the task is still
> single threaded and the file hasn't been shared in the meantime. So
> io_getdents() doesn't take the lock and starts the readdir() call.
> 
> Concurrently, the process that registered the io_uring request was free
> to do other stuff and issued a synchronous readdir() system call which
> calls fdget_pos(). Since the fdtable still isn't shared it doesn't
> increment f_count and doesn't acquire the mutex. Now there's another
> concurrent readdir() going on.
> 
> (Similar thing can happen if the process creates a thread for example.)
> 
> Two readdir() requests now proceed concurrently which is not intended.
> Now to verify that this race can't happen with io_uring:
> 
> * regular fds:
>    It seems that io_uring calls fget() on each regular file descriptor
>    when an async request is registered. So that means that io_uring
>    always hold its own explicit reference here.
>    So as long as the original task is alive or another thread is alive
>    f_count is guaranteed to be > 1 and so the mutex would always be
>    acquired.
> 
>    If the registering process dies right before io_uring gets to the
>    io_getdents() request no other process can steal the fd anymore and in
>    that case the readdir call would not lock. But that's fine.
> 
> * fixed fds:
>    I don't know the reference counting rules here. io_uring would need to
>    ensure that it's impossible for two async readdir requests via a fixed
>    fd to race because f_count is == 1.
> 
>    Iiuc, if a process registers a file it opened as a fixed file and
>    immediately closes the fd afterwards - without anyone else holding a
>    reference to that file - and only uses the fixed fd going forward, the
>    f_count of that file in io_uring's fixed file table is always 1.
> 
>    So one could issue any number of concurrent readdir requests with no
>    mutual exclusion. So for fixed files there definitely is a race, no?

Hi Christian,
The ref logic for fixed file is that it does fdget() when registering
the file, and fdput() when unregistering it. So the ref in between is
always > 1. The fixed file feature is to reduce frequent fdget/fdput,
but it does call them at the register/unregister time.


> 
> All of that could ofc be simplified if we could just always acquire the
> mutex in fdget_pos() and other places and drop that file_count(file) > 1
> optimization everywhere. But I have no idea if the optimization for not
> acquiring the mutex if f_count == 1 is worth it?
> 
> I hope I didn't confuse myself here.
> 
> Jens, do yo have any input here?
> 
>> +			should_lock = true;
>> +	}
>> +	if (should_lock) {
>> +		if (!force_nonblock)
>> +			mutex_lock(&file->f_pos_lock);
>> +		else if (!mutex_trylock(&file->f_pos_lock))
>> +			return -EAGAIN;
>> +	}
> 
> Open-coding this seems extremely brittle with an invitation for subtle
> bugs.

Could you elaborate on this, I'm not sure that I understand it quite
well. Sorry for my poor English.

Thanks,
Hao

> 
>> +
>> +	ret = vfs_getdents(file, gd->dirent, gd->count, getdents_flags);
>> +	if (should_lock)
>> +		mutex_unlock(&file->f_pos_lock);
>> +
>> +	if (ret == -EAGAIN && force_nonblock)
>> +		return -EAGAIN;
>> +
>> +	io_req_set_res(req, ret, 0);
>> +	return 0;
>> +}
>> +
>> diff --git a/io_uring/fs.h b/io_uring/fs.h
>> index 0bb5efe3d6bb..f83a6f3a678d 100644
>> --- a/io_uring/fs.h
>> +++ b/io_uring/fs.h
>> @@ -18,3 +18,6 @@ int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags);
>>   int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
>>   int io_linkat(struct io_kiocb *req, unsigned int issue_flags);
>>   void io_link_cleanup(struct io_kiocb *req);
>> +
>> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
>> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags);
>> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
>> index 3b9c6489b8b6..1bae6b2a8d0b 100644
>> --- a/io_uring/opdef.c
>> +++ b/io_uring/opdef.c
>> @@ -428,6 +428,11 @@ const struct io_issue_def io_issue_defs[] = {
>>   		.prep			= io_eopnotsupp_prep,
>>   #endif
>>   	},
>> +	[IORING_OP_GETDENTS] = {
>> +		.needs_file		= 1,
>> +		.prep			= io_getdents_prep,
>> +		.issue			= io_getdents,
>> +	},
>>   };
>>   
>>   
>> @@ -648,6 +653,9 @@ const struct io_cold_def io_cold_defs[] = {
>>   		.fail			= io_sendrecv_fail,
>>   #endif
>>   	},
>> +	[IORING_OP_GETDENTS] = {
>> +		.name			= "GETDENTS",
>> +	},
>>   };
>>   
>>   const char *io_uring_get_opcode(u8 opcode)
>> -- 
>> 2.25.1
>>

next prev parent reply	other threads:[~2023-07-13  4:35 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-11 11:40 [PATCH v3 0/3] io_uring getdents Hao Xu
2023-07-11 11:40 ` [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
2023-07-11 13:02   ` Ammar Faizi
2023-07-12  8:03     ` Hao Xu
2023-07-12 13:55       ` Ammar Faizi
2023-07-13  4:17         ` Hao Xu
2023-07-11 23:41   ` Dave Chinner
2023-07-11 23:50     ` Jens Axboe
2023-07-12 11:14       ` Christian Brauner
2023-07-11 11:40 ` [PATCH 2/3] vfs_getdents/struct dir_context: add flags field Hao Xu
2023-07-12 11:31   ` Christian Brauner
2023-07-12 16:02     ` Dominique Martinet
2023-07-13  4:12       ` Hao Xu
2023-07-11 11:40 ` [PATCH 3/3] io_uring: add support for getdents Hao Xu
2023-07-11 12:15   ` Dominique Martinet
2023-07-12  7:53     ` Hao Xu
2023-07-12 16:10       ` Dominique Martinet
2023-07-13  4:05         ` Hao Xu
2023-07-13  4:40           ` Hao Xu
2023-07-13  4:50           ` Dominique Martinet
2023-07-12  8:01     ` Hao Xu
2023-07-12 15:27   ` Christian Brauner
2023-07-13  4:35     ` Hao Xu [this message]
2023-07-13  7:10       ` Christian Brauner
2023-07-13  9:06         ` Hao Xu
2023-07-13 15:14           ` Christian Brauner
2023-07-16 11:57             ` Hao Xu
2023-07-18  6:55               ` Hao Xu
2023-07-11 23:47 ` [PATCH v3 0/3] io_uring getdents Dave Chinner
2023-07-11 23:51   ` Jens Axboe
2023-07-12  0:53     ` Dominique Martinet
2023-07-12  0:56       ` Jens Axboe
2023-07-12  3:16         ` Hao Xu
2023-07-12  3:12       ` Hao Xu
2023-07-12  3:19   ` Hao Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bb2aa872-c3fb-93f0-c0da-3a897f39347d@linux.dev \
    --to=hao.xu@linux.dev \
    --cc=asmadeus@codewreck.org \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=brauner@kernel.org \
    --cc=bugs@claycon.org \
    --cc=david@fromorbit.com \
    --cc=io-uring@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=shr@fb.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=wanpengli@tencent.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.