* Re: [LSF/MM/BPF TOPIC] block drivers in user space
       [not found] ` <986caf55-65d1-0755-383b-73834ec04967@suse.de>
@ 2022-03-27 16:35   ` Ming Lei
  2022-03-28  5:47     ` Kanchan Joshi
                       ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Ming Lei @ 2022-03-27 16:35 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Gabriel Krisman Bertazi, lsf-pc, linux-block, Xiaoguang Wang,
	linux-mm

On Tue, Feb 22, 2022 at 07:57:27AM +0100, Hannes Reinecke wrote:
> On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
> > I'd like to discuss an interface to implement user space block devices,
> > while avoiding local network NBD solutions.  There has been reiterated
> > interest in the topic, both from researchers [1] and from the community,
> > including a proposed session in LSFMM2018 [2] (though I don't think it
> > happened).
> > 
> > I've been working on top of the Google iblock implementation to find
> > something upstreamable and would like to present my design and gather
> > feedback on some points, in particular zero-copy and overall user space
> > interface.
> > 
> > The design I'm tending towards uses special fds opened by the driver to
> > transfer data to/from the block driver, preferably through direct
> > splicing as much as possible, to keep data only in kernel space.  This
> > is because, in my use case, the driver usually only manipulates
> > metadata, while data is forwarded directly through the network, or
> > similar. It would be neat if we can leverage the existing
> > splice/copy_file_range syscalls such that we don't ever need to bring
> > disk data to user space, if we can avoid it.  I've also experimented
> > with regular pipes, but I found no way around keeping a lot of pipes
> > opened, one for each possible command 'slot'.
> > 
> > [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> > [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> > 
> Actually, I'd rather have something like an 'inverse io_uring', where an
> application creates a memory region separated into several 'rings' for
> submission and completion.
> Then the kernel could write/map the incoming data onto the rings, and
> application can read from there.
> Maybe it'll be worthwhile to look at virtio here.

IMO an 'inverse io_uring' isn't needed; the normal io_uring SQE/CQE model
already covers this case: the userspace part can submit SQEs beforehand
to get a notification for each incoming io request from the kernel driver,
and once an io request is queued to the driver, the driver can post a CQE
for the previously submitted SQE. The recently posted IORING_OP_URING_CMD
patch [1] is perfect for this purpose.
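
To make this concrete, here is a minimal sketch of that "pre-arm SQEs,
consume CQEs" loop from the userspace side; UBD_IO_FETCH_REQ, struct
ubd_io_desc and the way the command is packed into the SQE are just
placeholders for illustration, not the posted IORING_OP_URING_CMD ABI or
the real ubdsrv code:

    /* Hypothetical sketch: pre-arm one IORING_OP_URING_CMD SQE per tag and
     * treat each CQE as "one block request arrived for this tag". */
    #include <liburing.h>
    #include <linux/types.h>

    #define UBD_QD              64
    #define UBD_IO_FETCH_REQ    0x01    /* made-up uring_cmd opcode */

    struct ubd_io_desc {                /* made-up per-tag io descriptor */
        __u32   op;                     /* READ or WRITE */
        __u32   len;
        __u64   sector;
        void    *buf;                   /* per-tag data buffer */
    };

    static struct ubd_io_desc descs[UBD_QD];

    /* handle one io against the target (loop/null/net); stubbed out here */
    static void ubd_handle_io(struct ubd_io_desc *desc)
    {
        (void)desc;
    }

    static void ubd_queue_fetch(struct io_uring *ring, int ubd_fd, unsigned tag)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_rw(IORING_OP_URING_CMD, sqe, ubd_fd, NULL, 0, 0);
        sqe->user_data = tag;
        /* UBD_IO_FETCH_REQ and &descs[tag] would be packed into the
         * extended SQE command area here; that layout is an assumption */
    }

    int ubd_queue_loop(struct io_uring *ring, int ubd_fd)
    {
        unsigned tag;

        /* submit all FETCH commands before any io request arrives */
        for (tag = 0; tag < UBD_QD; tag++)
            ubd_queue_fetch(ring, ubd_fd, tag);
        io_uring_submit(ring);

        for (;;) {
            struct io_uring_cqe *cqe;

            if (io_uring_wait_cqe(ring, &cqe))
                return -1;
            tag = cqe->user_data;
            io_uring_cqe_seen(ring, cqe);

            /* the driver queued one request for this tag: handle it,
             * then re-arm the tag with another FETCH command */
            ubd_handle_io(&descs[tag]);
            ubd_queue_fetch(ring, ubd_fd, tag);
            io_uring_submit(ring);
        }
    }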

I have recently written one such userspace block driver: [2] is the
kernel part, a blk-mq driver (the ubd driver), and the userspace part is
ubdsrv [3]. Both parts look quite simple, but they are still at a very
early stage; so far only the ubd-loop and ubd-null targets are implemented
in [3]. Not only is the io command communication channel done via
IORING_OP_URING_CMD, the IO handling for ubd-loop is implemented via
plain io_uring too.

It is basically working: for ubd-loop, I see no regression in 'xfstests -g auto'
on the ubd block device compared with the same xfstests run on the underlying
disk, and my simple performance test in a VM shows results that are no worse
than the kernel loop driver with dio, and even much better in some test
situations.

Wrt. this userspace block driver work, I am most interested in the following
sub-topics:

1) zero copy
- the ubd driver [2] needs one data copy: for a WRITE request, pages in
  the io request are copied to the userspace buffer before ubdsrv handles
  the WRITE IO; for a READ request, the reverse copy is done after ubdsrv
  has handled the READ request

- I tried to implement zero copy via remap_pfn_range() to avoid this
  data copy, but it looks like it can't work for the ubd driver, since
  pages in the remapped vm area can't be retrieved by get_user_pages_*(),
  which is called in the direct io code path

- Xiaoguang Wang recently posted an RFC patch [4] to support zero copy on
  tcmu, adding vm_insert_page(s)_mkspecial() for that purpose, but it has
  the same limitation as remap_pfn_range; Xiaoguang also mentioned that
  vm_insert_pages may work, but anonymous pages cannot be remapped by
  vm_insert_pages

- the requirement here is to remap either anonymous pages or page cache
  pages into userspace vm, with the mapping/unmapping done at runtime for
  each IO (see the sketch after this list). Is this requirement reasonable?
  If yes, is there any easy way to implement it in the kernel?

2) batched queueing of io_uring CQEs

- for the ubd driver, batching is very performance-sensitive per my
  observation; if we can queue IORING_OP_URING_CMD CQEs in batches,
  ubd_queue_rqs() can be written on top of the batched CQEs, so the whole
  batch takes only one io_uring_enter()

- I haven't dug into the io_uring code for such an interface yet, but it
  doesn't look like one exists

3) requirements on a userspace block driver
- exact requirements from the user's viewpoint

4) applying eBPF in the userspace block driver
- this is an open topic; I don't have a specific or exact idea yet

- is there a chance to apply eBPF to map ubd io directly into its target
  handling, avoiding the data copy and the remapping cost of zero copy?
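
To make sub-topic 1) a bit more concrete, below is a rough sketch of the
kind of remapping that would be needed, using vm_insert_pages() from a
driver mmap path. ubd_remap_rq(), the on-stack page array and the way the
vma is handed in are assumptions for illustration, not ubd code; and as
discussed above, this is exactly where anonymous pages are refused, while
the remap_pfn_range()/_mkspecial() variants break get_user_pages_*() in
the dio path:

    /* Hypothetical sketch: remap the pages backing one request into the
     * userspace server's vma with vm_insert_pages(). */
    #include <linux/mm.h>
    #include <linux/blk-mq.h>
    #include <linux/bio.h>

    #define UBD_RQ_MAX_PAGES    32      /* made-up per-request page limit */

    static int ubd_remap_rq(struct vm_area_struct *vma, struct request *rq)
    {
        struct page *pages[UBD_RQ_MAX_PAGES];
        unsigned long nr = 0;
        struct req_iterator iter;
        struct bio_vec bv;

        /* collect the pages backing this request's bios */
        rq_for_each_segment(bv, rq, iter) {
            if (nr >= UBD_RQ_MAX_PAGES)
                return -E2BIG;
            pages[nr++] = bv.bv_page;
        }

        /*
         * Page cache pages can be inserted this way, but anonymous pages
         * are rejected - the limitation mentioned in sub-topic 1).
         */
        return vm_insert_pages(vma, vma->vm_start, pages, &nr);
    }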

I am happy to join the discussion virtually at LSF/MM if that is possible.

[1] https://lore.kernel.org/linux-block/20220308152105.309618-1-joshi.k@samsung.com/#r
[2] https://github.com/ming1/linux/tree/v5.17-ubd-dev
[3] https://github.com/ming1/ubdsrv
[4] https://lore.kernel.org/linux-block/abbe51c4-873f-e96e-d421-85906689a55a@gmail.com/#r

Thanks,
Ming




* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-27 16:35   ` [LSF/MM/BPF TOPIC] block drivers in user space Ming Lei
@ 2022-03-28  5:47     ` Kanchan Joshi
  2022-03-28  5:48     ` Hannes Reinecke
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Kanchan Joshi @ 2022-03-28  5:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, Gabriel Krisman Bertazi, lsf-pc, linux-block,
	Xiaoguang Wang, linux-mm


On Mon, Mar 28, 2022 at 12:35:33AM +0800, Ming Lei wrote:
>On Tue, Feb 22, 2022 at 07:57:27AM +0100, Hannes Reinecke wrote:
>> On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
>> > I'd like to discuss an interface to implement user space block devices,
>> > while avoiding local network NBD solutions.  There has been reiterated
>> > interest in the topic, both from researchers [1] and from the community,
>> > including a proposed session in LSFMM2018 [2] (though I don't think it
>> > happened).
>> >
>> > I've been working on top of the Google iblock implementation to find
>> > something upstreamable and would like to present my design and gather
>> > feedback on some points, in particular zero-copy and overall user space
>> > interface.
>> >
>> > The design I'm tending towards uses special fds opened by the driver to
>> > transfer data to/from the block driver, preferably through direct
>> > splicing as much as possible, to keep data only in kernel space.  This
>> > is because, in my use case, the driver usually only manipulates
>> > metadata, while data is forwarded directly through the network, or
>> > similar. It would be neat if we can leverage the existing
>> > splice/copy_file_range syscalls such that we don't ever need to bring
>> > disk data to user space, if we can avoid it.  I've also experimented
>> > with regular pipes, but I found no way around keeping a lot of pipes
>> > opened, one for each possible command 'slot'.
>> >
>> > [1] https://dl.acm.org/doi/10.1145/3456727.3463768
>> > [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
>> >
>> Actually, I'd rather have something like an 'inverse io_uring', where an
>> application creates a memory region separated into several 'rings' for
>> submission and completion.
>> Then the kernel could write/map the incoming data onto the rings, and
>> application can read from there.
>> Maybe it'll be worthwhile to look at virtio here.
>
>IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
>does cover this case, the userspace part can submit SQEs beforehand
>for getting notification of each incoming io request from kernel driver,
>then after one io request is queued to the driver, the driver can
>queue a CQE for the previous submitted SQE. Recent posted patch of
>IORING_OP_URING_CMD[1] is perfect for such purpose.
I had added that as one of the potential use cases to discuss for
uring-cmd:
https://lore.kernel.org/linux-block/20220228092511.458285-1-joshi.k@samsung.com/
And your email already brings a lot of clarity to this.

>I have written one such userspace block driver recently, and [2] is the
>kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
>Both the two parts look quite simple, but still in very early stage, so
>far only ubd-loop and ubd-null targets are implemented in [3]. Not only
>the io command communication channel is done via IORING_OP_URING_CMD, but
>also IO handling for ubd-loop is implemented via plain io_uring too.
>
>It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
>on the ubd block device compared with same xfstests on underlying disk, and
>my simple performance test on VM shows the result isn't worse than kernel loop
>driver with dio, or even much better on some test situations.
Added this to my to-be-read list. Thanks for sharing.








* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-27 16:35   ` [LSF/MM/BPF TOPIC] block drivers in user space Ming Lei
  2022-03-28  5:47     ` Kanchan Joshi
@ 2022-03-28  5:48     ` Hannes Reinecke
  2022-03-28 20:20     ` Gabriel Krisman Bertazi
  2022-04-08  6:52     ` Xiaoguang Wang
  3 siblings, 0 replies; 12+ messages in thread
From: Hannes Reinecke @ 2022-03-28  5:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Gabriel Krisman Bertazi, lsf-pc, linux-block, Xiaoguang Wang,
	linux-mm

On 3/27/22 18:35, Ming Lei wrote:
> On Tue, Feb 22, 2022 at 07:57:27AM +0100, Hannes Reinecke wrote:
>> On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
>>> I'd like to discuss an interface to implement user space block devices,
>>> while avoiding local network NBD solutions.  There has been reiterated
>>> interest in the topic, both from researchers [1] and from the community,
>>> including a proposed session in LSFMM2018 [2] (though I don't think it
>>> happened).
>>>
>>> I've been working on top of the Google iblock implementation to find
>>> something upstreamable and would like to present my design and gather
>>> feedback on some points, in particular zero-copy and overall user space
>>> interface.
>>>
>>> The design I'm tending towards uses special fds opened by the driver to
>>> transfer data to/from the block driver, preferably through direct
>>> splicing as much as possible, to keep data only in kernel space.  This
>>> is because, in my use case, the driver usually only manipulates
>>> metadata, while data is forwarded directly through the network, or
>>> similar. It would be neat if we can leverage the existing
>>> splice/copy_file_range syscalls such that we don't ever need to bring
>>> disk data to user space, if we can avoid it.  I've also experimented
>>> with regular pipes, but I found no way around keeping a lot of pipes
>>> opened, one for each possible command 'slot'.
>>>
>>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
>>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
>>>
>> Actually, I'd rather have something like an 'inverse io_uring', where an
>> application creates a memory region separated into several 'rings' for
>> submission and completion.
>> Then the kernel could write/map the incoming data onto the rings, and
>> application can read from there.
>> Maybe it'll be worthwhile to look at virtio here.
> 
> IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
> does cover this case, the userspace part can submit SQEs beforehand
> for getting notification of each incoming io request from kernel driver,
> then after one io request is queued to the driver, the driver can
> queue a CQE for the previous submitted SQE. Recent posted patch of
> IORING_OP_URING_CMD[1] is perfect for such purpose.
> 

Ah, cool idea.

> I have written one such userspace block driver recently, and [2] is the
> kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
> Both the two parts look quite simple, but still in very early stage, so
> far only ubd-loop and ubd-null targets are implemented in [3]. Not only
> the io command communication channel is done via IORING_OP_URING_CMD, but
> also IO handling for ubd-loop is implemented via plain io_uring too.
> 
> It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
> on the ubd block device compared with same xfstests on underlying disk, and
> my simple performance test on VM shows the result isn't worse than kernel loop
> driver with dio, or even much better on some test situations.
> 
Neat. I'll have a look.

Thanks for doing that!

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-27 16:35   ` [LSF/MM/BPF TOPIC] block drivers in user space Ming Lei
  2022-03-28  5:47     ` Kanchan Joshi
  2022-03-28  5:48     ` Hannes Reinecke
@ 2022-03-28 20:20     ` Gabriel Krisman Bertazi
  2022-03-29  0:30       ` Ming Lei
  2022-04-08  6:52     ` Xiaoguang Wang
  3 siblings, 1 reply; 12+ messages in thread
From: Gabriel Krisman Bertazi @ 2022-03-28 20:20 UTC (permalink / raw)
  To: Ming Lei; +Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

Ming Lei <ming.lei@redhat.com> writes:

> IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
> does cover this case, the userspace part can submit SQEs beforehand
> for getting notification of each incoming io request from kernel driver,
> then after one io request is queued to the driver, the driver can
> queue a CQE for the previous submitted SQE. Recent posted patch of
> IORING_OP_URING_CMD[1] is perfect for such purpose.
>
> I have written one such userspace block driver recently, and [2] is the
> kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
> Both the two parts look quite simple, but still in very early stage, so
> far only ubd-loop and ubd-null targets are implemented in [3]. Not only
> the io command communication channel is done via IORING_OP_URING_CMD, but
> also IO handling for ubd-loop is implemented via plain io_uring too.
>
> It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
> on the ubd block device compared with same xfstests on underlying disk, and
> my simple performance test on VM shows the result isn't worse than kernel loop
> driver with dio, or even much better on some test situations.

Thanks for sharing.  This is a very interesting implementation that
seems to cover the original use case quite well.  I'm giving it a try and
will report back.

> Wrt. this userspace block driver things, I am more interested in the following
> sub-topics:
>
> 1) zero copy
> - the ubd driver[2] needs one data copy: for WRITE request, copy pages
>   in io request to userspace buffer before handling the WRITE IO by ubdsrv;
>   for READ request, the reverse copy is done after READ request is
>   handled by ubdsrv
>
> - I tried to apply zero copy via remap_pfn_range() for avoiding this
>   data copy, but looks it can't work for ubd driver, since pages in the
>   remapped vm area can't be retrieved by get_user_pages_*() which is called in
>   direct io code path
>
> - recently Xiaoguang Wang posted one RFC patch[4] for support zero copy on
>   tcmu, and vm_insert_page(s)_mkspecial() is added for such purpose, but
>   it has same limit of remap_pfn_range; Also Xiaoguang mentioned that
>   vm_insert_pages may work, but anonymous pages can not be remapped by
>   vm_insert_pages.
>
> - here the requirement is to remap either anonymous pages or page cache
>   pages into userspace vm, and the mapping/unmapping can be done for
>   each IO runtime. Is this requirement reasonable? If yes, is there any
>   easy way to implement it in kernel?

I've run into the same issue with my fd implementation and haven't been
able to work around it.

> 4) apply eBPF in userspace block driver
> - it is one open topic, still not have specific or exact idea yet,
>
> - is there chance to apply ebpf for mapping ubd io into its target handling
> for avoiding data copy and remapping cost for zero copy?

I was thinking of something like this, or having a way for the server to
only operate on the fds and do splice/sendfile.  But I don't know if it
would be useful for many use cases.  We also want to be able to send the
data to userspace, for instance, for userspace networking.

-- 
Gabriel Krisman Bertazi



* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-28 20:20     ` Gabriel Krisman Bertazi
@ 2022-03-29  0:30       ` Ming Lei
  2022-03-29 17:20         ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2022-03-29  0:30 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

On Mon, Mar 28, 2022 at 04:20:03PM -0400, Gabriel Krisman Bertazi wrote:
> Ming Lei <ming.lei@redhat.com> writes:
> 
> > IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
> > does cover this case, the userspace part can submit SQEs beforehand
> > for getting notification of each incoming io request from kernel driver,
> > then after one io request is queued to the driver, the driver can
> > queue a CQE for the previous submitted SQE. Recent posted patch of
> > IORING_OP_URING_CMD[1] is perfect for such purpose.
> >
> > I have written one such userspace block driver recently, and [2] is the
> > kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
> > Both the two parts look quite simple, but still in very early stage, so
> > far only ubd-loop and ubd-null targets are implemented in [3]. Not only
> > the io command communication channel is done via IORING_OP_URING_CMD, but
> > also IO handling for ubd-loop is implemented via plain io_uring too.
> >
> > It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
> > on the ubd block device compared with same xfstests on underlying disk, and
> > my simple performance test on VM shows the result isn't worse than kernel loop
> > driver with dio, or even much better on some test situations.
> 
> Thanks for sharing.  This is a very interesting implementation that
> seems to cover quite well the original use case.  I'm giving it a try and
> will report back.
> 
> > Wrt. this userspace block driver things, I am more interested in the following
> > sub-topics:
> >
> > 1) zero copy
> > - the ubd driver[2] needs one data copy: for WRITE request, copy pages
> >   in io request to userspace buffer before handling the WRITE IO by ubdsrv;
> >   for READ request, the reverse copy is done after READ request is
> >   handled by ubdsrv
> >
> > - I tried to apply zero copy via remap_pfn_range() for avoiding this
> >   data copy, but looks it can't work for ubd driver, since pages in the
> >   remapped vm area can't be retrieved by get_user_pages_*() which is called in
> >   direct io code path
> >
> > - recently Xiaoguang Wang posted one RFC patch[4] for support zero copy on
> >   tcmu, and vm_insert_page(s)_mkspecial() is added for such purpose, but
> >   it has same limit of remap_pfn_range; Also Xiaoguang mentioned that
> >   vm_insert_pages may work, but anonymous pages can not be remapped by
> >   vm_insert_pages.
> >
> > - here the requirement is to remap either anonymous pages or page cache
> >   pages into userspace vm, and the mapping/unmapping can be done for
> >   each IO runtime. Is this requirement reasonable? If yes, is there any
> >   easy way to implement it in kernel?
> 
> I've run into the same issue with my fd implementation and haven't been
> able to workaround it.
> 
> > 4) apply eBPF in userspace block driver
> > - it is one open topic, still not have specific or exact idea yet,
> >
> > - is there chance to apply ebpf for mapping ubd io into its target handling
> > for avoiding data copy and remapping cost for zero copy?
> 
> I was thinking of something like this, or having a way for the server to
> only operate on the fds and do splice/sendfile.  But, I don't know if it
> would be useful for many use cases.  We also want to be able to send the
> data to userspace, for instance, for userspace networking.

I understand the big point is how to pass the io data to the ubd driver's
request/bio pages. But splice/sendfile just transfers data between two FDs,
so how can the block request/bio's pages get filled with the expected data?
Can you explain in a bit more detail?

If the block layer is bypassed, the device won't be exposed as a block disk
to userspace.


thanks,
Ming




* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-29  0:30       ` Ming Lei
@ 2022-03-29 17:20         ` Gabriel Krisman Bertazi
  2022-03-30  1:55           ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Gabriel Krisman Bertazi @ 2022-03-29 17:20 UTC (permalink / raw)
  To: Ming Lei; +Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

Ming Lei <ming.lei@redhat.com> writes:

>> I was thinking of something like this, or having a way for the server to
>> only operate on the fds and do splice/sendfile.  But, I don't know if it
>> would be useful for many use cases.  We also want to be able to send the
>> data to userspace, for instance, for userspace networking.
>
> I understand the big point is that how to pass the io data to ubd driver's
> request/bio pages. But splice/sendfile just transfers data between two FDs,
> then how can the block request/bio's pages get filled with expected data?
> Can you explain a bit in detail?

Hi Ming,

My idea was to split the control and data planes into different file
descriptors.

A queue has an fd that is mapped to a shared memory area where the
request descriptors live.  Submission/completion are done by reading/writing
the index of the request in the shared memory area.

For the data plane, each request descriptor in the queue has an
associated file descriptor to be used for data transfer, which is
preallocated at queue creation time.  I'm mapping the bio linearly, from
offset 0, on these descriptors on .queue_rq().  Userspace operates on
these data file descriptors with regular RW syscalls, direct splice to
another fd or pipe, or mmap it to move data around. The data is
available on that fd until IO is completed through the queue fd.  After
an operation is completed, the fds are reused for the next IO on that
queue position.
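
To illustrate the split, a minimal sketch of the server side follows; the
shared-memory layout, the poll()-based wakeup and all the names here are
simplified placeholders for illustration, not the real interface:

    /* Hypothetical sketch of the split control/data plane, server side. */
    #include <poll.h>
    #include <stdint.h>
    #include <unistd.h>

    #define UBD_QD  64

    struct ubd_desc {                   /* placeholder request descriptor */
        uint16_t op;                    /* 0 = READ, 1 = WRITE */
        uint32_t len;
        uint64_t sector;
    };

    struct ubd_shm {                    /* mmap()ed from the queue fd */
        uint32_t sub_head;              /* bumped by the driver per new request */
        uint32_t comp_head;             /* bumped by the server per completion */
        uint32_t sub_ring[UBD_QD];      /* indices of ready descriptors */
        uint32_t comp_ring[UBD_QD];     /* indices of completed descriptors */
        struct ubd_desc descs[UBD_QD];
    };

    struct ubd_queue {
        int ctrl_fd;                    /* queue fd backing the shared area */
        struct ubd_shm *shm;
        int data_fds[UBD_QD];           /* preallocated, one per queue slot */
        uint32_t seen;                  /* submissions consumed so far */
    };

    /* wait for new requests, move data through the per-slot data fd, then
     * post the completion index back into shared memory (memory barriers
     * omitted for brevity) */
    static int ubd_queue_poll(struct ubd_queue *q, char *buf, size_t bufsz)
    {
        struct pollfd pfd = { .fd = q->ctrl_fd, .events = POLLIN };

        if (poll(&pfd, 1, -1) < 0)
            return -1;

        while (q->seen != q->shm->sub_head) {
            uint32_t idx = q->shm->sub_ring[q->seen++ % UBD_QD];
            struct ubd_desc *d = &q->shm->descs[idx];
            int dfd = q->data_fds[idx];

            if (d->len > bufsz)
                return -1;
            if (d->op == 1)             /* WRITE: bio mapped from offset 0 */
                pread(dfd, buf, d->len, 0);
            else                        /* READ: produce data, hand it back */
                pwrite(dfd, buf, d->len, 0);

            q->shm->comp_ring[q->shm->comp_head % UBD_QD] = idx;
            q->shm->comp_head++;
        }
        return 0;
    }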

Hannes has pointed out the issues with fd limits. :)

> If block layer is bypassed, it won't be exposed as block disk to userspace.

I implemented it as a blk-mq driver, but it still only supports one
queue.

-- 
Gabriel Krisman Bertazi



* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-29 17:20         ` Gabriel Krisman Bertazi
@ 2022-03-30  1:55           ` Ming Lei
  2022-03-30 18:22             ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2022-03-30  1:55 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

On Tue, Mar 29, 2022 at 01:20:57PM -0400, Gabriel Krisman Bertazi wrote:
> Ming Lei <ming.lei@redhat.com> writes:
> 
> >> I was thinking of something like this, or having a way for the server to
> >> only operate on the fds and do splice/sendfile.  But, I don't know if it
> >> would be useful for many use cases.  We also want to be able to send the
> >> data to userspace, for instance, for userspace networking.
> >
> > I understand the big point is that how to pass the io data to ubd driver's
> > request/bio pages. But splice/sendfile just transfers data between two FDs,
> > then how can the block request/bio's pages get filled with expected data?
> > Can you explain a bit in detail?
> 
> Hi Ming,
> 
> My idea was to split the control and dataplanes in different file
> descriptors.
> 
> A queue has a fd that is mapped to a shared memory area where the
> request descriptors are.  Submission/completion are done by read/writing
> the index of the request on the shared memory area.
> 
> For the data plane, each request descriptor in the queue has an
> associated file descriptor to be used for data transfer, which is
> preallocated at queue creation time.  I'm mapping the bio linearly, from
> offset 0, on these descriptors on .queue_rq().  Userspace operates on
> these data file descriptors with regular RW syscalls, direct splice to
> another fd or pipe, or mmap it to move data around. The data is
> available on that fd until IO is completed through the queue fd.  After
> an operation is completed, the fds are reused for the next IO on that
> queue position.
> 
> Hannes has pointed out the issues with fd limits. :)

OK, thanks for the detailed explanation!

Also, you could switch to mapping each request queue/disk to a single FD,
with every request mapped to one fixed extent of that 'file' via rq->tag;
since we have a max-sectors limit for each request, the fd limits can be
avoided.
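
For example (a tiny sketch; UBD_MAX_IO_BYTES and the per-queue fd are
assumptions matching the idea above, not an existing interface):

    /* Hypothetical sketch: one fd per queue, each tag owns a fixed extent. */
    #include <unistd.h>
    #include <sys/types.h>

    #define UBD_MAX_IO_BYTES    (512 * 1024)    /* assumed max-sectors limit */

    /* read the payload of the request with the given tag from the queue fd */
    static ssize_t ubd_read_req_data(int queue_fd, unsigned int tag,
                                     void *buf, size_t len)
    {
        off_t off = (off_t)tag * UBD_MAX_IO_BYTES;

        return pread(queue_fd, buf, len, off);
    }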

But I am wondering whether this approach is friendly to the userspace-side
implementation, since no data buffer is visible to userspace, only FDs.


thanks,
Ming




* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-30  1:55           ` Ming Lei
@ 2022-03-30 18:22             ` Gabriel Krisman Bertazi
  2022-03-31  1:38               ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Gabriel Krisman Bertazi @ 2022-03-30 18:22 UTC (permalink / raw)
  To: Ming Lei; +Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

Ming Lei <ming.lei@redhat.com> writes:

> On Tue, Mar 29, 2022 at 01:20:57PM -0400, Gabriel Krisman Bertazi wrote:
>> Ming Lei <ming.lei@redhat.com> writes:
>> 
>> >> I was thinking of something like this, or having a way for the server to
>> >> only operate on the fds and do splice/sendfile.  But, I don't know if it
>> >> would be useful for many use cases.  We also want to be able to send the
>> >> data to userspace, for instance, for userspace networking.
>> >
>> > I understand the big point is that how to pass the io data to ubd driver's
>> > request/bio pages. But splice/sendfile just transfers data between two FDs,
>> > then how can the block request/bio's pages get filled with expected data?
>> > Can you explain a bit in detail?
>> 
>> Hi Ming,
>> 
>> My idea was to split the control and dataplanes in different file
>> descriptors.
>> 
>> A queue has a fd that is mapped to a shared memory area where the
>> request descriptors are.  Submission/completion are done by read/writing
>> the index of the request on the shared memory area.
>> 
>> For the data plane, each request descriptor in the queue has an
>> associated file descriptor to be used for data transfer, which is
>> preallocated at queue creation time.  I'm mapping the bio linearly, from
>> offset 0, on these descriptors on .queue_rq().  Userspace operates on
>> these data file descriptors with regular RW syscalls, direct splice to
>> another fd or pipe, or mmap it to move data around. The data is
>> available on that fd until IO is completed through the queue fd.  After
>> an operation is completed, the fds are reused for the next IO on that
>> queue position.
>> 
>> Hannes has pointed out the issues with fd limits. :)
>
> OK, thanks for the detailed explanation!
>
> Also you may switch to map each request queue/disk into a FD, and every
> request is mapped to one fixed extent of the 'file' via rq->tag since we
> have max sectors limit for each request, then fd limits can be avoided.
>
> But I am wondering if this way is friendly to userspace side implementation,
> since there isn't buffer, only FDs visible to userspace.

The advantage would be not mapping the request data into userspace when we
can avoid it, since it would then be possible to just forward the data
inside the kernel.  But my latest understanding is that most use cases
will want to manipulate the data directly anyway, maybe to checksum it, or
even to send it through userspace networking.  It is no longer clear to me
that we'd benefit from not always mapping the requests to userspace.

I've been looking at your implementation and I really like how simple it
is. I think it's the most promising approach for this feature I've
reviewed so far.  I'd like to send you a few patches for bugs I found
when testing it and keep working on making it upstreamable.  How can I
send you those patches?  Is it fine to just email you or should I also
cc linux-block, even though this is still out-of-tree code?

-- 
Gabriel Krisman Bertazi



* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-30 18:22             ` Gabriel Krisman Bertazi
@ 2022-03-31  1:38               ` Ming Lei
  2022-03-31  3:49                 ` Bart Van Assche
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2022-03-31  1:38 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

On Wed, Mar 30, 2022 at 02:22:20PM -0400, Gabriel Krisman Bertazi wrote:
> Ming Lei <ming.lei@redhat.com> writes:
> 
> > On Tue, Mar 29, 2022 at 01:20:57PM -0400, Gabriel Krisman Bertazi wrote:
> >> Ming Lei <ming.lei@redhat.com> writes:
> >> 
> >> >> I was thinking of something like this, or having a way for the server to
> >> >> only operate on the fds and do splice/sendfile.  But, I don't know if it
> >> >> would be useful for many use cases.  We also want to be able to send the
> >> >> data to userspace, for instance, for userspace networking.
> >> >
> >> > I understand the big point is that how to pass the io data to ubd driver's
> >> > request/bio pages. But splice/sendfile just transfers data between two FDs,
> >> > then how can the block request/bio's pages get filled with expected data?
> >> > Can you explain a bit in detail?
> >> 
> >> Hi Ming,
> >> 
> >> My idea was to split the control and dataplanes in different file
> >> descriptors.
> >> 
> >> A queue has a fd that is mapped to a shared memory area where the
> >> request descriptors are.  Submission/completion are done by read/writing
> >> the index of the request on the shared memory area.
> >> 
> >> For the data plane, each request descriptor in the queue has an
> >> associated file descriptor to be used for data transfer, which is
> >> preallocated at queue creation time.  I'm mapping the bio linearly, from
> >> offset 0, on these descriptors on .queue_rq().  Userspace operates on
> >> these data file descriptors with regular RW syscalls, direct splice to
> >> another fd or pipe, or mmap it to move data around. The data is
> >> available on that fd until IO is completed through the queue fd.  After
> >> an operation is completed, the fds are reused for the next IO on that
> >> queue position.
> >> 
> >> Hannes has pointed out the issues with fd limits. :)
> >
> > OK, thanks for the detailed explanation!
> >
> > Also you may switch to map each request queue/disk into a FD, and every
> > request is mapped to one fixed extent of the 'file' via rq->tag since we
> > have max sectors limit for each request, then fd limits can be avoided.
> >
> > But I am wondering if this way is friendly to userspace side implementation,
> > since there isn't buffer, only FDs visible to userspace.
> 
> The advantages would be not mapping the request data in userspace if we
> could avoid it, since it would be possible to just forward the data
> inside the kernel.  But my latest understanding is that most use cases
> will want to directly manipulate the data anyway, maybe to checksum, or
> even for sending through userspace networking.  It is not clear to me
> anymore that we'd benefit from not always mapping the requests to
> userspace.

Yeah, I think it is more flexible and usable to allow userspace to operate
on the data directly as a generic solution: for example, implementing a
disk that reads/writes a qcow2 image, or that reads from / writes to the
network after parsing some protocol, or whatever.

> I've been looking at your implementation and I really like how simple it
> is. I think it's the most promising approach for this feature I've
> reviewed so far.  I'd like to send you a few patches for bugs I found
> when testing it and keep working on making it upstreamable.  How can I
> send you those patches?  Is it fine to just email you or should I also
> cc linux-block, even though this is yet out-of-tree code?

The topic has been discussed for quite a while now, and it looks like
people are still interested in it, so I prefer to send the patches out on
linux-block if no one objects. Then we can discuss further when reviewing
the patches.

Thanks,
Ming




* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-31  1:38               ` Ming Lei
@ 2022-03-31  3:49                 ` Bart Van Assche
  0 siblings, 0 replies; 12+ messages in thread
From: Bart Van Assche @ 2022-03-31  3:49 UTC (permalink / raw)
  To: Ming Lei, Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

On 3/30/22 18:38, Ming Lei wrote:
> The topic has been discussed for a bit long, and looks people are still
> interested in it, so I prefer to send out patches on linux-block if no
> one objects. Then we can still discuss further when reviewing patches.

I'm in favor of the above proposal :-)

Bart.



* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-27 16:35   ` [LSF/MM/BPF TOPIC] block drivers in user space Ming Lei
                       ` (2 preceding siblings ...)
  2022-03-28 20:20     ` Gabriel Krisman Bertazi
@ 2022-04-08  6:52     ` Xiaoguang Wang
  2022-04-08  7:44       ` Ming Lei
  3 siblings, 1 reply; 12+ messages in thread
From: Xiaoguang Wang @ 2022-04-08  6:52 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: Gabriel Krisman Bertazi, lsf-pc, linux-block, linux-mm

hi,

> On Tue, Feb 22, 2022 at 07:57:27AM +0100, Hannes Reinecke wrote:
>> On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
>>> I'd like to discuss an interface to implement user space block devices,
>>> while avoiding local network NBD solutions.  There has been reiterated
>>> interest in the topic, both from researchers [1] and from the community,
>>> including a proposed session in LSFMM2018 [2] (though I don't think it
>>> happened).
>>>
>>> I've been working on top of the Google iblock implementation to find
>>> something upstreamable and would like to present my design and gather
>>> feedback on some points, in particular zero-copy and overall user space
>>> interface.
>>>
>>> The design I'm tending towards uses special fds opened by the driver to
>>> transfer data to/from the block driver, preferably through direct
>>> splicing as much as possible, to keep data only in kernel space.  This
>>> is because, in my use case, the driver usually only manipulates
>>> metadata, while data is forwarded directly through the network, or
>>> similar. It would be neat if we can leverage the existing
>>> splice/copy_file_range syscalls such that we don't ever need to bring
>>> disk data to user space, if we can avoid it.  I've also experimented
>>> with regular pipes, but I found no way around keeping a lot of pipes
>>> opened, one for each possible command 'slot'.
>>>
>>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
>>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
>>>
>> Actually, I'd rather have something like an 'inverse io_uring', where an
>> application creates a memory region separated into several 'rings' for
>> submission and completion.
>> Then the kernel could write/map the incoming data onto the rings, and
>> application can read from there.
>> Maybe it'll be worthwhile to look at virtio here.
> IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
> does cover this case, the userspace part can submit SQEs beforehand
> for getting notification of each incoming io request from kernel driver,
> then after one io request is queued to the driver, the driver can
> queue a CQE for the previous submitted SQE. Recent posted patch of
> IORING_OP_URING_CMD[1] is perfect for such purpose.
>
> I have written one such userspace block driver recently, and [2] is the
> kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
> Both the two parts look quite simple, but still in very early stage, so
> far only ubd-loop and ubd-null targets are implemented in [3]. Not only
> the io command communication channel is done via IORING_OP_URING_CMD, but
> also IO handling for ubd-loop is implemented via plain io_uring too.
>
> It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
> on the ubd block device compared with same xfstests on underlying disk, and
> my simple performance test on VM shows the result isn't worse than kernel loop
> driver with dio, or even much better on some test situations.
I have also spent time studying your code; the idea is really good, thanks
for this great work. Though we're using tcmu, we really just need a simple
block device based on block semantics. Tcmu is based on the SCSI protocol,
which is somewhat complicated and hurts small io request performance. So if
you like, we're willing to participate in this project, and we may use it
in our internal business. Thanks.

Another little question: why do you use the raw io_uring interface rather
than liburing? Are there any special reasons?

Regards,
Xiaoguang Wang
>
> Wrt. this userspace block driver things, I am more interested in the following
> sub-topics:
>
> 1) zero copy
> - the ubd driver[2] needs one data copy: for WRITE request, copy pages
>   in io request to userspace buffer before handling the WRITE IO by ubdsrv;
>   for READ request, the reverse copy is done after READ request is
>   handled by ubdsrv
>
> - I tried to apply zero copy via remap_pfn_range() for avoiding this
>   data copy, but looks it can't work for ubd driver, since pages in the
>   remapped vm area can't be retrieved by get_user_pages_*() which is called in
>   direct io code path
>
> - recently Xiaoguang Wang posted one RFC patch[4] for support zero copy on
>   tcmu, and vm_insert_page(s)_mkspecial() is added for such purpose, but
>   it has same limit of remap_pfn_range; Also Xiaoguang mentioned that
>   vm_insert_pages may work, but anonymous pages can not be remapped by
>   vm_insert_pages.
>
> - here the requirement is to remap either anonymous pages or page cache
>   pages into userspace vm, and the mapping/unmapping can be done for
>   each IO runtime. Is this requirement reasonable? If yes, is there any
>   easy way to implement it in kernel?
>
> 2) batching queueing io_uring CQEs
>
> - for ubd driver, batching is very sensitive to performance per my
>   observation, if we can run batch queueing IORING_OP_URING_CMD CQEs,
>   ubd_queue_rqs() can be written to the batching CQEs, then the whole batch
>   only takes one io_uring_enter().
>
> - not digging into io_uring code for this interface yet, but looks not
>   see such interface
>
> 3) requirement on userspace block driver
> - exact requirements from user viewpoint
>
> 4) apply eBPF in userspace block driver
> - it is one open topic, still not have specific or exact idea yet,
>
> - is there chance to apply ebpf for mapping ubd io into its target handling
> for avoiding data copy and remapping cost for zero copy?
>
> I am happy to join the virtual discussion on lsf/mm if there is and it
> is possible.
>
> [1] https://lore.kernel.org/linux-block/20220308152105.309618-1-joshi.k@samsung.com/#r
> [2] https://github.com/ming1/linux/tree/v5.17-ubd-dev
> [3] https://github.com/ming1/ubdsrv
> [4] https://lore.kernel.org/linux-block/abbe51c4-873f-e96e-d421-85906689a55a@gmail.com/#r
>
> Thanks,
> Ming




* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-04-08  6:52     ` Xiaoguang Wang
@ 2022-04-08  7:44       ` Ming Lei
  0 siblings, 0 replies; 12+ messages in thread
From: Ming Lei @ 2022-04-08  7:44 UTC (permalink / raw)
  To: Xiaoguang Wang
  Cc: Hannes Reinecke, Gabriel Krisman Bertazi, lsf-pc, linux-block,
	linux-mm

On Fri, Apr 08, 2022 at 02:52:35PM +0800, Xiaoguang Wang wrote:
> hi,
> 
> > On Tue, Feb 22, 2022 at 07:57:27AM +0100, Hannes Reinecke wrote:
> >> On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
> >>> I'd like to discuss an interface to implement user space block devices,
> >>> while avoiding local network NBD solutions.  There has been reiterated
> >>> interest in the topic, both from researchers [1] and from the community,
> >>> including a proposed session in LSFMM2018 [2] (though I don't think it
> >>> happened).
> >>>
> >>> I've been working on top of the Google iblock implementation to find
> >>> something upstreamable and would like to present my design and gather
> >>> feedback on some points, in particular zero-copy and overall user space
> >>> interface.
> >>>
> >>> The design I'm tending towards uses special fds opened by the driver to
> >>> transfer data to/from the block driver, preferably through direct
> >>> splicing as much as possible, to keep data only in kernel space.  This
> >>> is because, in my use case, the driver usually only manipulates
> >>> metadata, while data is forwarded directly through the network, or
> >>> similar. It would be neat if we can leverage the existing
> >>> splice/copy_file_range syscalls such that we don't ever need to bring
> >>> disk data to user space, if we can avoid it.  I've also experimented
> >>> with regular pipes, but I found no way around keeping a lot of pipes
> >>> opened, one for each possible command 'slot'.
> >>>
> >>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> >>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> >>>
> >> Actually, I'd rather have something like an 'inverse io_uring', where an
> >> application creates a memory region separated into several 'rings' for
> >> submission and completion.
> >> Then the kernel could write/map the incoming data onto the rings, and
> >> application can read from there.
> >> Maybe it'll be worthwhile to look at virtio here.
> > IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
> > does cover this case, the userspace part can submit SQEs beforehand
> > for getting notification of each incoming io request from kernel driver,
> > then after one io request is queued to the driver, the driver can
> > queue a CQE for the previous submitted SQE. Recent posted patch of
> > IORING_OP_URING_CMD[1] is perfect for such purpose.
> >
> > I have written one such userspace block driver recently, and [2] is the
> > kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
> > Both the two parts look quite simple, but still in very early stage, so
> > far only ubd-loop and ubd-null targets are implemented in [3]. Not only
> > the io command communication channel is done via IORING_OP_URING_CMD, but
> > also IO handling for ubd-loop is implemented via plain io_uring too.
> >
> > It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
> > on the ubd block device compared with same xfstests on underlying disk, and
> > my simple performance test on VM shows the result isn't worse than kernel loop
> > driver with dio, or even much better on some test situations.
> I also have spent time to learn your codes, its idea is really good, thanks for this
> great work. Though we're using tcmu, indeed we just needs a simple block device
> based on block semantics. Tcmu is based on scsi protocol, which is somewhat
> complicated and influences small io request performance. So if you like, we're
> willing to participate this project, and may use it in our internal business, thanks.

That is great, and welcome to participate! Glad to see there is a real
potential user of the userspace block device.

I believe there are lots of things to do in this area, but so far:

1) consolidate the interface between the ubd driver and ubdsrv, since this
part is kabi

2) consolidate the design of ubdsrv (the userspace part), so that we can
support different backings or targets easily; one idea is to handle all io
requests via io_uring

3) consolidate the design of ubdsrv to provide a stable interface for
supporting higher-level languages (Python, Rust, ...); inevitably at least
one new, more complicated target/backing should be developed in the
meantime, such as qcow2 or another real/popular device

I plan to post formal driver patches after the io_uring command interface
patchset is merged, but maybe we can post them sooner for early review.

And the driver side should be kept as simple and as efficient as possible.
It just focuses on forwarding io requests to userspace and handling the
data copy (or zero copy), and the ubd driver won't store any state of the
backing/target. Also, actual performance is really sensitive to batched
handling: recently I switched to task_work_add() to improve batching, and
it is easy to observe a performance boost. Another related part is how to
implement zero copy, which is an issue for tcmu and other projects too.
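
As a rough illustration of the task_work_add() batching just mentioned
(the struct ubd_io fields, ubd_forward_to_srv() and how the server task is
recorded are assumptions, not the actual driver code):

    /* Hypothetical sketch: defer forwarding to the ubdsrv task with
     * task_work_add(), so consecutive requests are handled in one batch
     * when that task returns to userspace. */
    #include <linux/kernel.h>
    #include <linux/blk-mq.h>
    #include <linux/task_work.h>
    #include <linux/sched.h>

    struct ubd_io {
        struct request *rq;
        struct task_struct *srv_task;   /* set at queue init (assumption) */
        struct callback_head work;
    };

    static void ubd_forward_to_srv(struct ubd_io *io)
    {
        /* made-up helper: would complete the pre-armed uring_cmd here so
         * that ubdsrv sees this io; left empty in this sketch */
    }

    static void ubd_rq_task_work_fn(struct callback_head *work)
    {
        struct ubd_io *io = container_of(work, struct ubd_io, work);

        /* runs in the server task's context, right before it returns to
         * userspace, so all works queued meanwhile are handled together */
        ubd_forward_to_srv(io);
    }

    static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
                                     const struct blk_mq_queue_data *bd)
    {
        struct ubd_io *io = blk_mq_rq_to_pdu(bd->rq);

        io->rq = bd->rq;
        init_task_work(&io->work, ubd_rq_task_work_fn);

        /* TWA_SIGNAL wakes the server task; back-to-back ->queue_rq()
         * calls pile work onto the same task and get batched */
        if (task_work_add(io->srv_task, &io->work, TWA_SIGNAL))
            return BLK_STS_IOERR;

        return BLK_STS_OK;
    }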

> 
> Another little question, why you use raw io_uring interface rather than liburing?
> Are there any special reasons?

It was just for building ubdsrv easily without any dependency; it will
definitely switch to liburing. The change should be quite simple, since
the related glue code is kept in one source file and the current interface
is similar to liburing's.


Thanks,
Ming



