linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: Al Viro <viro@zeniv.linux.org.uk>, Jann Horn <jannh@google.com>
Cc: linux-aio@kvack.org, linux-block@vger.kernel.org,
	Linux API <linux-api@vger.kernel.org>,
	hch@lst.de, jmoyer@redhat.com, avi@scylladb.com,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 13/18] io_uring: add file set registration
Date: Tue, 5 Feb 2019 17:27:29 -0700	[thread overview]
Message-ID: <0d2e5085-32ff-e86e-d628-6000071fd132@kernel.dk> (raw)
In-Reply-To: <40b27e78-9ee8-1395-feb3-a73aac87c9a7@kernel.dk>

On 2/5/19 12:08 PM, Jens Axboe wrote:
> On 2/5/19 10:57 AM, Jens Axboe wrote:
>> On 2/4/19 7:19 PM, Jens Axboe wrote:
>>> On 2/3/19 7:56 PM, Al Viro wrote:
>>>> On Wed, Jan 30, 2019 at 02:29:05AM +0100, Jann Horn wrote:
>>>>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>> We normally have to fget/fput for each IO we do on a file. Even with
>>>>>> the batching we do, the cost of the atomic inc/dec of the file usage
>>>>>> count adds up.
>>>>>>
>>>>>> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
>>>>>> for the io_uring_register(2) system call. The arguments passed in must
>>>>>> be an array of __s32 holding file descriptors, and nr_args should hold
>>>>>> the number of file descriptors the application wishes to pin for the
>>>>>> duration of the io_uring context (or until IORING_UNREGISTER_FILES is
>>>>>> called).
>>>>>>
>>>>>> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
>>>>>> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
>>>>>> to the index in the array passed in to IORING_REGISTER_FILES.
>>>>>>
>>>>>> Files are automatically unregistered when the io_uring context is
>>>>>> torn down. An application need only unregister if it wishes to
>>>>>> register a new set of fds.
>>>>>
>>>>> Crazy idea:
>>>>>
>>>>> Taking a step back, at a high level, basically this patch creates sort
>>>>> of the same difference that you get when you compare the following
>>>>> scenarios for normal multithreaded I/O in userspace:
>>>>
>>>>> This kinda makes me wonder whether this is really something that
>>>>> should be implemented specifically for the io_uring API, or whether it
>>>>> would make sense to somehow handle part of this in the generic VFS
>>>>> code and give the user the ability to prepare a new files_struct that
>>>>> can then be transferred to the worker thread, or something like
>>>>> that... I'm not sure whether there's a particularly clean way to do
>>>>> that though.
>>>>
>>>> Using files_struct for that opens a can of worms you really don't
>>>> want to touch.
>>>>
>>>> Consider the following scenario with any variant of this interface:
>>>> 	* create io_uring fd.
>>>> 	* send an SCM_RIGHTS with that fd to AF_UNIX socket.
>>>> 	* add the descriptor of that AF_UNIX socket to your fd
>>>> 	* close AF_UNIX fd, close io_uring fd.
>>>> Voila - you've got a shiny leak.  No ->release() is called for
>>>> anyone (and you really don't want to do that on ->flush(), because
>>>> otherwise a library helper doing e.g. system("/bin/date") will tear
>>>> down all the io_uring in your process).  The socket is held by
>>>> the reference you've stashed into io_uring (whichever way you do
>>>> that).  io_uring is held by the reference you've stashed into
>>>> SCM_RIGHTS datagram in queue of the socket.
>>>>
>>>> No matter what, you need net/unix/garbage.c to be aware of that stuff.
>>>> And getting files_struct lifetime mixed into that would be beyond
>>>> any reason.
>>>>
>>>> The only reason for doing that as a descriptor table would be
>>>> avoiding the cost of fget() in whatever uses it, right?  Since
>>>
>>> Right, the only purpose of this patch is to avoid doing fget/fput for
>>> each IO.
>>>
>>>> those are *not* the normal syscalls (and fdget() really should not
>>>> be used anywhere other than the very top of syscall's call chain -
>>>> that's another reason why tossing file_struct around like that
>>>> is insane) and since the benefit is all due to the fact that it's
>>>> *NOT* shared, *NOT* modified in parallel, etc., allowing us to
>>>> treat file references as stable... why the hell use the descriptor
>>>> tables at all?
>>>
>>> This one is not a regular system call, since we don't do fget, then IO,
>>> then fput. We hang on to it. But for the non-registered case, it's very
>>> much just like a regular read/write system call, where we fget to do IO
>>> on it, then fput when we are done.
>>>
>>>> All you need is an array of struct file *, explicitly populated.
>>>> With net/unix/garbage.c aware of such beasts.  Guess what?  We
>>>> do have such an object already.  The one net/unix/garbage.c is
>>>> working with.  SCM_RIGHTS datagrams, that is.
>>>>
>>>> IOW, can't we give those io_uring descriptors associated struct
>>>> unix_sock?  No socket descriptors, no struct socket (probably),
>>>> just the AF_UNIX-specific part thereof.  Then teach
>>>> unix_inflight()/unix_notinflight() about getting unix_sock out
>>>> of these guys (incidentally, both would seem to benefit from
>>>> _not_ touching unix_gc_lock in case when there's no unix_sock
>>>> attached to file we are dealing with - I might be missing
>>>> something very subtle about barriers there, but it doesn't
>>>> look likely).
>>>
>>> That might be workable, though I'm not sure we currently have helpers to
>>> just explicitly create a unix_sock by itself. Not familiar with the
>>> networking bits at all, I'll take a look.
>>>
>>>> And make that (i.e. registering the descriptors) mandatory.
>>>
>>> I don't want to make it mandatory, that's very inflexible for managing
>>> tons of files. The registration is useful for specific cases where we
>>> have high frequency of operations on a set of files. Besides, it'd make
>>> the use of the API cumbersome as well for the basic case of just wanting
>>> to do async IO.
>>>
>>>> Hell, combine that with creating io_uring fd, if we really
>>>> care about the syscall count.  Benefits:
>>>
>>> We don't care about syscall count for setup as much. If you're doing
>>> registration of a file set, you're expected to do a LOT of IO to those
>>> files. Hence having an extra one for setup is not a concern. My concern
>>> is just making it mandatory to do registration, I don't think that's a
>>> workable alternative.
>>>
>>>> 	* no file_struct refcount wanking
>>>> 	* no fget()/fput() (conditional, at that) from kernel
>>>> threads
>>>> 	* no CLOEXEC-dependent anything; just the teardown
>>>> on the final fput(), whichever way it comes.
>>>> 	* no fun with duelling garbage collectors.
>>>
>>> The fget/fput from a kernel thread can be solved by just hanging on to
>>> the struct file * when we punt the IO. Right now we don't, which is a
>>> little silly, that should be changed.
>>>
>>> Getting rid of the files_struct{} is doable.
>>
>> OK, I've reworked the initial parts to wire up the io_uring fd to the
>> AF_UNIX garbage collection. As I made it to the file registration part,
>> I wanted to wire up that too. But I don't think there's a need for that
>> - if we have the io_uring fd appropriately protected, we'll be dropping
>> our struct file ** array index when the io_uring fd is released. That
>> should be adequate, we don't need the garbage collection to be aware of
>> those individually.
>>
>> The only part I had to drop for now is the sq thread polling, as that
>> depends on us carrying the files_struct. I'm going to fold that in
>> shortly, but just make it be dependent on having registered files. That
>> avoids needing to fget/fput for that case, and needing registered files
>> for the sq side submission/polling is not a usability issue like it
>> would be for the "normal" use cases.
> 
> Proof is in the pudding, here's the main commit introducing io_uring
> and now wiring it up to the AF_UNIX garbage collection:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=158e6f42b67d0abe9ee84886b96ca8c4b3d3dfd5
> 
> How does that look? Outside of the inflight hookup, we simply retain
> the file * for punting to the workqueue. This means that buffered
> retry does NOT need to do fget/fput, so we don't need a files_struct
> for that anymore.
> 
> In terms of the SQPOLL patch that's further down the series, it doesn't
> allow that mode of operation without having fixed files enabled. That
> eliminates the need for fget/fput from a kernel thread, and hence the
> need to carry a files_struct around for that as well.

This should be better, passes some basic testing, too:

http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=01a93aa784319a02ccfa6523371b93401c9e0073

Verified that we're grabbing the right refs, and don't hold any
ourselves. For the file registration, forbid registration of the
io_uring fd, as that is pointless and will introduce a loop regardless
of fd passing.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

  reply	other threads:[~2019-02-06  0:27 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
2019-01-29 19:26 ` [PATCH 01/18] fs: add an iopoll method to struct file_operations Jens Axboe
2019-01-29 19:26 ` [PATCH 02/18] block: wire up block device iopoll method Jens Axboe
2019-01-29 19:26 ` [PATCH 03/18] block: add bio_set_polled() helper Jens Axboe
2019-01-29 19:26 ` [PATCH 04/18] iomap: wire up the iopoll method Jens Axboe
2019-01-29 19:26 ` [PATCH 05/18] Add io_uring IO interface Jens Axboe
2019-01-29 19:26 ` [PATCH 06/18] io_uring: add fsync support Jens Axboe
2019-01-29 19:26 ` [PATCH 07/18] io_uring: support for IO polling Jens Axboe
2019-01-29 20:47   ` Jann Horn
2019-01-29 20:56     ` Jens Axboe
2019-01-29 21:10       ` Jann Horn
2019-01-29 21:33         ` Jens Axboe
2019-01-29 19:26 ` [PATCH 08/18] fs: add fget_many() and fput_many() Jens Axboe
2019-01-29 19:26 ` [PATCH 09/18] io_uring: use fget/fput_many() for file references Jens Axboe
2019-01-29 23:31   ` Jann Horn
2019-01-29 23:44     ` Jens Axboe
2019-01-30 15:33       ` Jens Axboe
2019-01-29 19:26 ` [PATCH 10/18] io_uring: batch io_kiocb allocation Jens Axboe
2019-01-29 19:26 ` [PATCH 11/18] block: implement bio helper to add iter bvec pages to bio Jens Axboe
2019-01-29 19:26 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
2019-01-29 22:44   ` Jann Horn
2019-01-29 22:56     ` Jens Axboe
2019-01-29 23:03       ` Jann Horn
2019-01-29 23:06         ` Jens Axboe
2019-01-29 23:08           ` Jann Horn
2019-01-29 23:14             ` Jens Axboe
2019-01-29 23:42               ` Jann Horn
2019-01-29 23:51                 ` Jens Axboe
2019-01-29 19:26 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
2019-01-30  1:29   ` Jann Horn
2019-01-30 15:35     ` Jens Axboe
2019-02-04  2:56     ` Al Viro
2019-02-05  2:19       ` Jens Axboe
2019-02-05 17:57         ` Jens Axboe
2019-02-05 19:08           ` Jens Axboe
2019-02-06  0:27             ` Jens Axboe [this message]
2019-02-06  1:01               ` Al Viro
2019-02-06 17:56                 ` Jens Axboe
2019-02-07  4:05                   ` Al Viro
2019-02-07 16:14                     ` Jens Axboe
2019-02-07 16:30                       ` Al Viro
2019-02-07 16:35                         ` Jens Axboe
2019-02-07 16:51                         ` Al Viro
2019-02-06  0:56             ` Al Viro
2019-02-06 13:41               ` Jens Axboe
2019-02-07  4:00                 ` Al Viro
2019-02-07  9:22                   ` Miklos Szeredi
2019-02-07 13:31                     ` Al Viro
2019-02-07 14:20                       ` Miklos Szeredi
2019-02-07 15:20                         ` Al Viro
2019-02-07 15:27                           ` Miklos Szeredi
2019-02-07 16:26                             ` Al Viro
2019-02-07 19:08                               ` Miklos Szeredi
2019-02-07 18:45                   ` Jens Axboe
2019-02-07 18:58                     ` Jens Axboe
2019-02-11 15:55                     ` Jonathan Corbet
2019-02-11 17:35                       ` Al Viro
2019-02-11 20:33                         ` Jonathan Corbet
2019-01-29 19:26 ` [PATCH 14/18] io_uring: add submission polling Jens Axboe
2019-01-29 19:26 ` [PATCH 15/18] io_uring: add io_kiocb ref count Jens Axboe
2019-01-29 19:27 ` [PATCH 16/18] io_uring: add support for IORING_OP_POLL Jens Axboe
2019-01-29 19:27 ` [PATCH 17/18] io_uring: allow workqueue item to handle multiple buffered requests Jens Axboe
2019-01-29 19:27 ` [PATCH 18/18] io_uring: add io_uring_event cache hit information Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2019-02-07 19:55 [PATCHSET v12] io_uring IO interface Jens Axboe
2019-02-07 19:55 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
2019-02-08 12:17   ` Alan Jenkins
2019-02-08 12:57     ` Jens Axboe
2019-02-08 14:02       ` Alan Jenkins
2019-02-08 15:13         ` Jens Axboe
2019-02-12 12:29           ` Alan Jenkins
2019-02-12 15:17             ` Jens Axboe
2019-02-12 17:21               ` Alan Jenkins
2019-02-12 17:33                 ` Jens Axboe
2019-02-12 20:23                   ` Alan Jenkins
2019-02-12 21:10                     ` Jens Axboe
2019-02-01 15:23 [PATCHSET v11] io_uring IO interface Jens Axboe
2019-02-01 15:24 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
2019-01-30 21:55 [PATCHSET v10] io_uring IO interface Jens Axboe
2019-01-30 21:55 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
2019-01-28 21:35 [PATCHSET v8] io_uring IO interface Jens Axboe
2019-01-28 21:35 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
2019-01-29 16:36   ` Jann Horn
2019-01-29 18:13     ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0d2e5085-32ff-e86e-d628-6000071fd132@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=avi@scylladb.com \
    --cc=hch@lst.de \
    --cc=jannh@google.com \
    --cc=jmoyer@redhat.com \
    --cc=linux-aio@kvack.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).