linux-fsdevel.vger.kernel.org archive mirror
From: Jens Axboe <axboe@kernel.dk>
To: Avi Kivity <avi@scylladb.com>, Jan Kara <jack@suse.cz>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>,
	jack@suse.com, hch@infradead.org, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-btrfs@vger.kernel.org,
	linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 0/8 v2] Non-blocking AIO
Date: Mon, 6 Mar 2017 10:06:43 -0700
Message-ID: <7aabb6b4-df8d-8554-fbe3-90504887fb8e@kernel.dk>
In-Reply-To: <56ae3a64-5e27-d7d4-5ab5-f5f68eef8b78@scylladb.com>

On 03/06/2017 09:59 AM, Avi Kivity wrote:
> 
> 
> On 03/06/2017 06:08 PM, Jens Axboe wrote:
>> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>>
>>>>>>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>>> we're saving anything by pushing it there.
>>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>>> this information to the workqueue.
>>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>>> start blocking at some point. We can't expose a direct connection to
>>>>>> queue work like that, and let any user potentially create millions of
>>>>>> pending work items (and IOs).
>>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>>> that was supplied to io_setup syscall; it should have reserved any
>>>>> resources needed.
>>>> Doesn't matter what limit you apply, my point still stands - at some
>>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>>> the caller having flagged support for that change of behavior would
>>>> be problematic.
>>> Doesn't it already return EAGAIN (or some other error) if you exceed
>>> maxevents?
>> It's a setup thing. We check these limits when someone creates an IO
>> context, and carve out the specified entries from our global pool. Then
>> we free those "resources" when the io context is freed.
>>
>> Right now I can set up an IO context with 1000 entries on it, yet that
>> number has NO bearing on when io_submit() would potentially block or
>> return EAGAIN.
>>
>> We can have a huge gap between the intent signaled by io context setup and
>> the reality imposed by what actually happens on the IO submission side.
> 
> Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return 
> EAGAIN?
> 
> Just tested it, and maxevents is not respected for this:
> 
> io_setup(1, [0x7fc64537f000])           = 0
> io_submit(0x7fc64537f000, 10, [
>     {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>     {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>     {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>     {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>     {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>     {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>     {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>     {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>     {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>     {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}
> ]) = 10
> 
> which is unexpected, to me.

ioctx_alloc()
{
        [...]

        /*
         * We keep track of the number of available ringbuffer slots, to prevent
         * overflow (reqs_available), and we also use percpu counters for this.
         *
         * So since up to half the slots might be on other cpu's percpu counters
         * and unavailable, double nr_events so userspace sees what they
         * expected: additionally, we move req_batch slots to/from percpu
         * counters at a time, so make sure that isn't 0:
         */
        nr_events = max(nr_events, num_possible_cpus() * 4);
        nr_events *= 2;
}
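
To put numbers on the two statements above, here is a small userspace
sketch of the same arithmetic. It mirrors only the quoted lines; the
elided parts of ioctx_alloc() may adjust the value further, so read the
result as an illustration rather than the exact ring size. The function
name scaled_nr_events() and its possible_cpus parameter are assumptions
standing in for num_possible_cpus().

```c
#include <stdio.h>

/*
 * Userspace mirror of the two quoted statements: raise the request to at
 * least four slots per possible CPU, then double it.
 */
static unsigned int scaled_nr_events(unsigned int nr_events,
				     unsigned int possible_cpus)
{
	unsigned int floor = possible_cpus * 4;

	if (nr_events < floor)
		nr_events = floor;
	return nr_events * 2;
}

/* Print the scaled value for one request size and CPU count. */
static void show_scaled(unsigned int nr_events, unsigned int possible_cpus)
{
	printf("io_setup(%u) with %u possible CPUs -> %u slots\n",
	       nr_events, possible_cpus,
	       scaled_nr_events(nr_events, possible_cpus));
}
```

With 4 possible CPUs, scaled_nr_events(1, 4) is 32, so an io_setup(1)
context has room for a 10-entry io_submit(). A request of 1000 on the
same machine becomes 2000.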


-- 
Jens Axboe


Thread overview: 39+ messages
2017-02-28 23:36 [PATCH 0/8 v2] Non-blocking AIO Goldwyn Rodrigues
2017-02-28 23:36 ` [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT Goldwyn Rodrigues
2017-03-01 15:36   ` Christoph Hellwig
2017-03-01 15:56     ` Christoph Hellwig
2017-03-01 16:57       ` Goldwyn Rodrigues
2017-03-01 22:44         ` Christoph Hellwig
2017-02-28 23:36 ` [PATCH 2/8] nowait aio: Return if cannot get hold of i_rwsem Goldwyn Rodrigues
2017-03-01 15:37   ` Christoph Hellwig
2017-02-28 23:36 ` [PATCH 3/8] nowait aio: return if direct write will trigger writeback Goldwyn Rodrigues
2017-03-01  3:46   ` Matthew Wilcox
2017-03-01 15:38     ` Christoph Hellwig
2017-03-02 10:38       ` Jan Kara
2017-03-02 14:12         ` Matthew Wilcox
2017-03-02 15:22           ` Jan Kara
2017-02-28 23:36 ` [PATCH 4/8] nowait aio: Introduce IOMAP_NOWAIT Goldwyn Rodrigues
2017-02-28 23:36 ` [PATCH 5/8] nowait aio: return on congested block device Goldwyn Rodrigues
2017-03-08  7:03   ` Sagi Grimberg
2017-03-08 15:00     ` Goldwyn Rodrigues
2017-03-08 15:28       ` Jan Kara
2017-03-08 15:51         ` Christoph Hellwig
2017-03-08 16:17       ` Jens Axboe
2017-03-09  2:18         ` Goldwyn Rodrigues
2017-02-28 23:36 ` [PATCH 6/8] nowait aio: ext4 Goldwyn Rodrigues
2017-02-28 23:36 ` [PATCH 7/8] nowait aio: xfs Goldwyn Rodrigues
2017-03-01 15:40   ` Christoph Hellwig
2017-02-28 23:36 ` [PATCH 8/8] nowait aio: btrfs Goldwyn Rodrigues
2017-03-05 14:56 ` [PATCH 0/8 v2] Non-blocking AIO Avi Kivity
2017-03-06  8:25   ` Jan Kara
2017-03-06  8:40     ` Avi Kivity
2017-03-06 15:19     ` Jens Axboe
2017-03-06 15:29       ` Avi Kivity
2017-03-06 15:38         ` Jens Axboe
2017-03-06 15:59           ` Avi Kivity
2017-03-06 16:08             ` Jens Axboe
2017-03-06 16:59               ` Avi Kivity
2017-03-06 17:06                 ` Jens Axboe [this message]
2017-03-06 18:17                   ` Avi Kivity
2017-03-06 18:27                     ` Jens Axboe
2017-03-06 18:50                       ` Avi Kivity
