From: Jens Axboe <axboe@kernel.dk>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org,
	linux-block@vger.kernel.org, linux-arch@vger.kernel.org,
	hch@lst.de, jmoyer@redhat.com, avi@scylladb.com
Subject: Re: [PATCH 12/15] io_uring: add support for pre-mapped user IO buffers
Date: Wed, 16 Jan 2019 16:17:01 -0700
Message-ID: <29622208-d155-4f76-78d5-e7dd54ee807b@kernel.dk>
In-Reply-To: <20190116230920.GT4205@dastard>

On 1/16/19 4:09 PM, Dave Chinner wrote:
> On Wed, Jan 16, 2019 at 03:21:21PM -0700, Jens Axboe wrote:
>> On 1/16/19 3:09 PM, Dave Chinner wrote:
>>> On Wed, Jan 16, 2019 at 02:20:53PM -0700, Jens Axboe wrote:
>>>> On 1/16/19 1:53 PM, Dave Chinner wrote:
>>>> I'd be fine with that restriction, especially since it can get relaxed
>>>> down the line. Do we have an appropriate API for this?  And why isn't
>>>> get_user_pages_longterm() that exact API already?
>>>
>>> get_user_pages_longterm() is the right thing to use to ensure DAX
>>> doesn't trip over this - it's effectively just get_user_pages()
>>> with a "if (vma_is_fsdax(vma))" check in it to abort and return
>>> -EOPNOTSUPP. IOWs, this is safe on DAX but it's not safe on anything
>>> else. :/
>>>
>>> Unfortunately, disallowing userspace GUP pins on non-DAX file backed
>>> pages will break existing "mostly just work" userspace apps all over
>>> the place. And so right now there are discussions ongoing about how
>>> to make gup references avoid the writeback races and be able to be
>>> seen/tracked by other kernel infrastructure (see the long, long
>>> thread "[PATCH 0/2] put_user_page*(): start converting the call
>>> sites" on -fsdevel). Progress is slow, but I think we're starting to
>>> close in on a workable solution.
>>>
>>> FWIW, this doesn't solve the "long term user pin will block
>>> filesystem operations until unpin" problem, that's what moving to
>>> using revocable file layout leases is intended to solve. Patches
>>> were posted some time ago to add a user API for this, but
>>> we've got to solve the other problems first....
>>>
>>>> Would seem that most
>>>> (all?) callers of this API are currently broken then.
>>>
>>> Yup, there's a long, long history of machines using userspace RDMA
>>> panicking because filesystems have detected or tripped over invalid
>>> page cache state during writeback attempts. This is not a new
>>> problem....
>>
>> Thanks for your detailed answer, Dave! I didn't see it before I sent
>> out the previous email. FWIW, I've updated the patch:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=0c8f2299f8069af6b2fa8f99a10d81646d1237a7
>>
>> Checks for file backed memory, fails the registration with EOPNOTSUPP
>> if the check fails.
> 
> Doesn't it need to call put_pages() on all the pages picked up by
> get_user_pages_longterm() when it returns -EOPNOTSUPP? They haven't
> been mapped into the imu->bvec array yet, so AFAICT there's nothing
> to release the page references on teardown here.

Oops, yes good point. The usual error handling won't work for this, need
to put them.
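
A minimal userspace sketch of that error path, not the patch itself. It
assumes a toy model in which each "page" is only a reference count plus a
file-backed flag. Every name here (model_page, model_pin_pages,
model_put_pages, model_register_buffer) is hypothetical and stands in for
the kernel's page pinning and release, so treat it as an illustration of
the unwind ordering only:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: a refcount and a backing flag. */
struct model_page {
	int refcount;
	bool file_backed;
};

/* Model of pinning: take one reference on each page. */
static void model_pin_pages(struct model_page **pages, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		pages[i]->refcount++;
}

/* Model of the unwind: drop the reference taken by model_pin_pages(). */
static void model_put_pages(struct model_page **pages, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		pages[i]->refcount--;
}

/*
 * Model of buffer registration. Returns 0 and leaves the pages pinned on
 * success. Returns -EOPNOTSUPP if any page is file backed, after dropping
 * every reference taken here.
 */
static int model_register_buffer(struct model_page **pages, size_t nr)
{
	model_pin_pages(pages, nr);

	for (size_t i = 0; i < nr; i++) {
		if (pages[i]->file_backed) {
			/* Nothing else owns these pins yet, so release them here. */
			model_put_pages(pages, nr);
			return -EOPNOTSUPP;
		}
	}
	return 0;
}
```

The point of the model is that the pages are not yet recorded anywhere a
later teardown would find them, so the failing function has to drop the
references itself before returning.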

> Also, not a vma expert here, but the vma array contents may only be
> valid while the mmap_sem is held - I think vmas can come and go
> after it has been dropped and so accessing vmas to check
> vma->vm_file after the mmap_sem has been dropped may be open to
> read-after-free races.

I did fix that one right after sending out the email:

http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=d2b44723d5bceeb9966c858255a03596ed62929c
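
The linked commit is not reproduced here. As a rough userspace sketch of
the ordering Dave describes, assume a toy model where a boolean stands in
for mmap_sem and a small array stands in for the vma list. All names
(model_mm, model_vma, model_lock, model_unlock, model_vma_has_file,
model_range_is_file_backed) are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vma: only records whether it has a file. */
struct model_vma {
	bool has_file;
};

struct model_mm {
	bool locked;              /* stands in for mmap_sem */
	struct model_vma *vmas;
	size_t nr_vmas;
};

static void model_lock(struct model_mm *mm)
{
	assert(!mm->locked);
	mm->locked = true;
}

static void model_unlock(struct model_mm *mm)
{
	assert(mm->locked);
	mm->locked = false;
}

/* Only valid with the lock held, like dereferencing a vma pointer. */
static bool model_vma_has_file(const struct model_mm *mm, size_t i)
{
	assert(mm->locked);
	return mm->vmas[i].has_file;
}

/* Decide while locked; carry only the result out of the locked region. */
static bool model_range_is_file_backed(struct model_mm *mm)
{
	bool file_backed = false;

	model_lock(mm);
	for (size_t i = 0; i < mm->nr_vmas; i++) {
		if (model_vma_has_file(mm, i)) {
			file_backed = true;
			break;
		}
	}
	model_unlock(mm);

	/* Only the plain boolean leaves the locked region, never a vma. */
	return file_backed;
}
```

The assert in model_vma_has_file() makes the rule explicit: any access to
a vma after the unlock would trip it, which is the class of race the
comment above is about.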

I'll fix the missing put_pages() on error and update it.

>> That should handle the issue on the io_uring side at least, and it's a
>> restriction that can always be relaxed/lifted, when appropriate solutions
>> to file backed buffers exist.
> 
> Modulo the issue above, that works for me.

Great!

-- 
Jens Axboe
