[LSF/MM/BFP ATTEND][LSF/MM/BFP TOPIC] fuse uring communication

* [LSF/MM/BFP ATTEND][LSF/MM/BFP TOPIC] fuse uring communication
@ 2023-02-05  0:59 Bernd Schubert
  2023-02-10 10:45 ` Miklos Szeredi
  0 siblings, 1 reply; 3+ messages in thread
From: Bernd Schubert @ 2023-02-05  0:59 UTC (permalink / raw)
  To: lsf-pc@lists.linux-foundation.org
  Cc: Ming Lei, Amir Goldstein, Miklos Szeredi,
	linux-fsdevel@vger.kernel.org

Hello,

I have been working for some time on fuse uring based communication
that is NUMA aware and core-affine.

In the current /dev/fuse based IO model, requests are queued on lists
that are neither core-affine nor NUMA aware. Every request needs a
round trip between userspace and kernel.
When we benchmarked our atomic-open patches (also still WIP), initially
confusing findings came up [1], which could be tracked down to multiple
threads reading from /dev/fuse. After switching to a single thread that
reads from /dev/fuse we got consistent and expected results.
Later we also figured out that adding a polling spin in
fuse_dev_do_read() before going into a waitq sleep when no request is
available greatly improved metadata benchmark performance [2].
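
As a rough userspace illustration of the spin-before-sleep idea (this is
not the actual fuse_dev_do_read() change; SPIN_BUDGET and
spin_for_request() are made-up names, and the budget value is arbitrary):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spin budget; not a value taken from the patches. */
#define SPIN_BUDGET 1000

/*
 * Poll the pending-request counter a bounded number of times.
 * Returns true if a request showed up while spinning, false if the
 * caller should fall back to a blocking wait (in the kernel that
 * would be the waitq sleep).
 */
static bool spin_for_request(atomic_int *pending)
{
	for (int i = 0; i < SPIN_BUDGET; i++) {
		if (atomic_load_explicit(pending, memory_order_acquire) > 0)
			return true;
	}
	return false;
}
```

The point is only that a short bounded poll can avoid the cost of a
sleep and wakeup when the next request arrives quickly.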

That made us think about the current communication and look into a
ring based queuing model. Around that time IORING_OP_URING_CMD was added
to uring, and the new userspace block device driver (ublk) uses that
command to send requests from kernel to userspace.
I started to look at how ublk works and to adapt a similar model to
fuse. The state as of today is that it basically works, but I'm still
fixing issues found by xfstests. Benchmarks and patch cleanup for
submission follow next.

https://github.com/bsbernd/linux/tree/fuse-uring
https://github.com/bsbernd/libfuse/tree/uring
(these branches will _not_ be used for upstream submission, these are 
purely for base development)

A fuse design documentation update will also be added in the first RFC;
the basic details are as follows:

- Initial mount setup goes over /dev/fuse
- fuse.ko queues FUSE_INIT in the existing /dev/fuse (background) queue
- User space sets up the ring and all queues with a new ioctl
- fuse.ko sets up the ring and allocates request queues/request memory 
per queue/request
- Userspace mmaps these buffers and assigns them per queue/request
- Data are sent through these mmapped buffers; there is no kmap involved
(a difference to ublk)
- Similar to ublk, user space first submits SQEs as
FUSE_URING_REQ_FETCH, then later as FUSE_URING_REQ_COMMIT_AND_FETCH -
commit results of the current request and fetch the next one.
- FUSE_URING_REQ_FETCH also takes the FUSE_INIT request; later these
lists are not checked anymore, as nothing is supposed to be on them
- The ring currently only handles fuse pending and background
requests (with credits assigned)
- Forget still requires libfuse to read /dev/fuse (handling will be
added to the ring later)
- In the WIP state request interrupts are not supported (yet)
- Userspace still needs to send fuse notifications to /dev/fuse; this
needs to be handled by the ring as well (or maybe a separate ring)
- My goal was to keep compatibility with existing fuse file systems;
apart from the still missing interrupt handling, that should work so far.
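
The FETCH / COMMIT_AND_FETCH cycle from the list above can be pictured
as a small per-entry state machine. This is only a sketch under my own
assumptions: the enum, struct and function names are hypothetical and
do not come from the branches; only the two command names mirror
FUSE_URING_REQ_FETCH and FUSE_URING_REQ_COMMIT_AND_FETCH.

```c
#include <stdbool.h>

/* Stand-ins for FUSE_URING_REQ_FETCH / FUSE_URING_REQ_COMMIT_AND_FETCH. */
enum ring_cmd { REQ_FETCH, REQ_COMMIT_AND_FETCH };

/* Hypothetical life cycle of one ring entry. */
enum entry_state { ENT_IDLE, ENT_WAITING, ENT_IN_USERSPACE };

struct ring_entry {
	enum entry_state state;
	unsigned int committed;	/* results committed so far */
};

/* Daemon submits an SQE for this entry; false means protocol misuse. */
static bool entry_submit(struct ring_entry *e, enum ring_cmd cmd)
{
	switch (cmd) {
	case REQ_FETCH:			/* first submission only */
		if (e->state != ENT_IDLE)
			return false;
		break;
	case REQ_COMMIT_AND_FETCH:	/* needs a request to commit */
		if (e->state != ENT_IN_USERSPACE)
			return false;
		e->committed++;
		break;
	default:
		return false;
	}
	e->state = ENT_WAITING;		/* SQE outstanding, waiting for work */
	return true;
}

/* Kernel side: a fuse request completes the outstanding SQE. */
static bool entry_deliver(struct ring_entry *e)
{
	if (e->state != ENT_WAITING)
		return false;
	e->state = ENT_IN_USERSPACE;
	return true;
}
```

After the initial fetch, each request costs one submission, since the
commit of the previous result and the fetch of the next request travel
in the same SQE.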

There are certainly some questionable design decisions and longer 
discussion threads might come up in the next weeks/months. Debating and 
resolving some of these in person might be very helpful.

Ming is also working on zero-copy for ublk [3] and I'm going to look
into that next. Splice and zero-copy are not yet supported in my
uring branch.

Thanks,
Bernd

[1] 
https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@gmail.com/

[2] 
https://lore.kernel.org/lkml/6ba14287-336d-cdcd-0d39-680f288ca776@ddn.com/

[3] 
https://patchwork.kernel.org/project/linux-block/cover/20221103085004.1029763-1-ming.lei@redhat.com/

* Re: [LSF/MM/BFP ATTEND][LSF/MM/BFP TOPIC] fuse uring communication
  2023-02-05  0:59 [LSF/MM/BFP ATTEND][LSF/MM/BFP TOPIC] fuse uring communication Bernd Schubert
@ 2023-02-10 10:45 ` Miklos Szeredi
  2023-02-10 11:46   ` Bernd Schubert
  0 siblings, 1 reply; 3+ messages in thread
From: Miklos Szeredi @ 2023-02-10 10:45 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: lsf-pc@lists.linux-foundation.org, Ming Lei, Amir Goldstein,
	linux-fsdevel@vger.kernel.org

On Sun, 5 Feb 2023 at 02:00, Bernd Schubert <bschubert@ddn.com> wrote:
>
> Hello,
>
> I have been working for some time on fuse uring based communication
> that is NUMA aware and core-affine.

I might have mentioned this earlier, but one of the bigger issues with
NUMA that I found was that having a single process with multiple
threads serving queues of different NUMA nodes incurs a performance
hit each time a server thread gets to run. This is due to having to
update mm->cpu_bitmap, which indicates which CPUs the current
process is running on.  This bitmap is shared by the address space,
hence constantly updating it from different nodes means having to move
it from one node to the other.

My workaround was to use separate processes (address space is not
shared) but use shared memory for common structures.  This complicates
things quite a bit, so it would be nice to find some other way of
fixing this issue.  For example it occurs to me that making this
bitmap use different cachelines for CPUs that are on different nodes
might actually help fix the issue.
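
To make the cacheline idea concrete, here is a userspace sketch of a
bitmap that keeps each node's bits on its own cache line. It is not the
kernel's layout of mm->cpu_bitmap; the constants, the struct names and
the assumption that CPUs are numbered contiguously per node are all
hypothetical, and the updates are not atomic.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHELINE	64	/* assumed cache line size in bytes */
#define MAX_NODES	4	/* hypothetical node count */
#define CPUS_PER_NODE	64	/* hypothetical: one 64-bit word per node */

/* One word per node, padded out to a full cache line. */
struct node_mask {
	alignas(CACHELINE) uint64_t bits;
};

struct cpu_bitmap {
	struct node_mask node[MAX_NODES];
};

/* Set the bit for a CPU; touches only the owning node's cache line. */
static void bitmap_set_cpu(struct cpu_bitmap *bm, unsigned int cpu)
{
	bm->node[cpu / CPUS_PER_NODE].bits |=
		UINT64_C(1) << (cpu % CPUS_PER_NODE);
}

static bool bitmap_test_cpu(const struct cpu_bitmap *bm, unsigned int cpu)
{
	return bm->node[cpu / CPUS_PER_NODE].bits &
		(UINT64_C(1) << (cpu % CPUS_PER_NODE));
}
```

With this layout a thread on node 1 only dirties node 1's line, so the
line does not have to bounce between nodes on every update.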

> In the current /dev/fuse based IO model, requests are queued on lists
> that are neither core-affine nor NUMA aware. Every request needs a
> round trip between userspace and kernel.
> When we benchmarked our atomic-open patches (also still WIP), initially
> confusing findings came up [1], which could be tracked down to multiple
> threads reading from /dev/fuse. After switching to a single thread that
> reads from /dev/fuse we got consistent and expected results.
> Later we also figured out that adding a polling spin in
> fuse_dev_do_read() before going into a waitq sleep when no request is
> available greatly improved metadata benchmark performance [2].
>
> That made us think about the current communication and look into a
> ring based queuing model. Around that time IORING_OP_URING_CMD was added
> to uring, and the new userspace block device driver (ublk) uses that
> command to send requests from kernel to userspace.
> I started to look at how ublk works and to adapt a similar model to
> fuse. The state as of today is that it basically works, but I'm still
> fixing issues found by xfstests. Benchmarks and patch cleanup for
> submission follow next.
>
> https://github.com/bsbernd/linux/tree/fuse-uring
> https://github.com/bsbernd/libfuse/tree/uring
> (these branches will _not_ be used for upstream submission, these are
> purely for base development)
>
>
> A fuse design documentation update will also be added in the first RFC;
> the basic details are as follows:
>
> - Initial mount setup goes over /dev/fuse
> - fuse.ko queues FUSE_INIT in the existing /dev/fuse (background) queue
> - User space sets up the ring and all queues with a new ioctl
> - fuse.ko sets up the ring and allocates request queues/request memory
> per queue/request
> - Userspace mmaps these buffers and assigns them per queue/request
> - Data are sent through these mmapped buffers; there is no kmap involved
> (a difference to ublk)

How is the queue buffer filled?  Are requests packed or is the queue
divided into equal parts for each request?

How are replies sent?  Do they use the same buffer?

> - Similar to ublk, user space first submits SQEs as
> FUSE_URING_REQ_FETCH, then later as FUSE_URING_REQ_COMMIT_AND_FETCH -
> commit results of the current request and fetch the next one.
> - FUSE_URING_REQ_FETCH also takes the FUSE_INIT request; later these
> lists are not checked anymore, as nothing is supposed to be on them

Which list?  If the FUSE_INIT is handled on /dev/fuse why handle it on
the uring?

> - The ring currently only handles fuse pending and background
> requests (with credits assigned)
> - Forget still requires libfuse to read /dev/fuse (handling will be
> added to the ring later)
> - In the WIP state request interrupts are not supported (yet)
> - Userspace still needs to send fuse notifications to /dev/fuse; this
> needs to be handled by the ring as well (or maybe a separate ring)
> - My goal was to keep compatibility with existing fuse file systems;
> apart from the still missing interrupt handling, that should work so far.

Interrupts and notifications are used by very few fs.  So if it's
easier, then we could leave one thread to handle legacy /dev/fuse
requests for anything that's not performance sensitive.

Thanks,
Miklos

* Re: [LSF/MM/BFP ATTEND][LSF/MM/BFP TOPIC] fuse uring communication
  2023-02-10 10:45 ` Miklos Szeredi
@ 2023-02-10 11:46   ` Bernd Schubert
  0 siblings, 0 replies; 3+ messages in thread
From: Bernd Schubert @ 2023-02-10 11:46 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lsf-pc@lists.linux-foundation.org, Ming Lei, Amir Goldstein,
	linux-fsdevel@vger.kernel.org

On 2/10/23 11:45, Miklos Szeredi wrote:
> On Sun, 5 Feb 2023 at 02:00, Bernd Schubert <bschubert@ddn.com> wrote:
>>
>> Hello,
>>
>> I have been working for some time on fuse uring based communication
>> that is NUMA aware and core-affine.
> 
> I might have mentioned this earlier, but one of the bigger issues with
> NUMA that I found was that having a single process with multiple
> threads serving queues of different NUMA nodes incurs a performance
> hit each time a server thread gets to run. This is due to having to
> update mm->cpu_bitmap, which indicates which CPUs the current
> process is running on.  This bitmap is shared by the address space,
> hence constantly updating it from different nodes means having to move
> it from one node to the other.

For our current usage we have entirely restricted the fuse daemon to run
on one NUMA node only. Some years ago I tested clone_fd - we could
see for some workloads that a single fd ran into spin-lock contention,
and clone_fd solved that, but it didn't solve most of the other
performance issues.

> 
> My workaround was to use separate processes (address space is not
> shared) but use shared memory for common structures.  This complicates
> things quite a bit, so it would be nice to find some other way of
> fixing this issue.  For example it occurs to me that making this
> bitmap use different cachelines for CPUs that are on different nodes
> might actually help fix the issue.

With my uring approach you will get a ring thread per core and basically
no shared data structures - that should solve the issue? Well,
'fuse_connection' is still shared, but then has queues per ring
(actually you remind me that I need to make the queues cache line
aligned, on kernel and daemon side).

Well, struct fuse_conn holds 'struct fuse_ring', with a comment
/* XXX: Move to struct fuse_dev? */

There are some design decisions that are certainly debatable and I have 
marked some of these with such XXX comments.


> 
>> In the current /dev/fuse based IO model, requests are queued on lists
>> that are neither core-affine nor NUMA aware. Every request needs a
>> round trip between userspace and kernel.
>> When we benchmarked our atomic-open patches (also still WIP), initially
>> confusing findings came up [1], which could be tracked down to multiple
>> threads reading from /dev/fuse. After switching to a single thread that
>> reads from /dev/fuse we got consistent and expected results.
>> Later we also figured out that adding a polling spin in
>> fuse_dev_do_read() before going into a waitq sleep when no request is
>> available greatly improved metadata benchmark performance [2].
>>
>> That made us think about the current communication and look into a
>> ring based queuing model. Around that time IORING_OP_URING_CMD was added
>> to uring, and the new userspace block device driver (ublk) uses that
>> command to send requests from kernel to userspace.
>> I started to look at how ublk works and to adapt a similar model to
>> fuse. The state as of today is that it basically works, but I'm still
>> fixing issues found by xfstests. Benchmarks and patch cleanup for
>> submission follow next.
>>
>> https://github.com/bsbernd/linux/tree/fuse-uring
>> https://github.com/bsbernd/libfuse/tree/uring
>> (these branches will _not_ be used for upstream submission, these are
>> purely for base development)
>>
>>
>> A fuse design documentation update will also be added in the first RFC;
>> the basic details are as follows:
>>
>> - Initial mount setup goes over /dev/fuse
>> - fuse.ko queues FUSE_INIT in the existing /dev/fuse (background) queue
>> - User space sets up the ring and all queues with a new ioctl
>> - fuse.ko sets up the ring and allocates request queues/request memory
>> per queue/request
>> - Userspace mmaps these buffers and assigns them per queue/request
>> - Data are sent through these mmapped buffers; there is no kmap involved
>> (a difference to ublk)
> 
> How is the queue buffer filled?  Are requests packed or is the queue
> divided into equal parts for each request?

The latter, queues are divided into equal parts - which gives the ring
queue depth. I have further divided these with credits into pending and
background. My reasoning is that background is basically anything page
cache related and we do not want to introduce latencies due to a
queue filled with background writes and read-ahead.
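
Roughly, in code (a simplified sketch, not the layout used in the
branch; struct queue_layout, its fields and both helpers are
hypothetical names chosen for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one ring queue. */
struct queue_layout {
	size_t req_size;	/* bytes per request slot */
	unsigned int depth;	/* number of equal slots == queue depth */
	unsigned int bg_max;	/* credits reserved for background */
	unsigned int fg_used;	/* pending (foreground) credits in use */
	unsigned int bg_used;	/* background credits in use */
};

/* Equal-sized slots: slot idx starts at idx * req_size in the buffer. */
static size_t slot_offset(const struct queue_layout *q, unsigned int idx)
{
	return (size_t)idx * q->req_size;
}

/*
 * Take one credit from the matching pool. Background requests can
 * never use more than bg_max slots, so page cache traffic cannot
 * fill the whole queue and delay pending requests.
 */
static bool take_credit(struct queue_layout *q, bool background)
{
	if (background) {
		if (q->bg_used >= q->bg_max)
			return false;
		q->bg_used++;
	} else {
		if (q->fg_used >= q->depth - q->bg_max)
			return false;
		q->fg_used++;
	}
	return true;
}
```

For example, with depth 4 and bg_max 1 a second background request has
to wait even while three pending slots are still free.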

> 
> How are replies sent?  Do they use the same buffer?

Queues/requests use a shared memory buffer between kernel and daemon.

> 
>> - Similar to ublk, user space first submits SQEs as
>> FUSE_URING_REQ_FETCH, then later as FUSE_URING_REQ_COMMIT_AND_FETCH -
>> commit results of the current request and fetch the next one.
>> - FUSE_URING_REQ_FETCH also takes the FUSE_INIT request; later these
>> lists are not checked anymore, as nothing is supposed to be on them
> 
> Which list?  If the FUSE_INIT is handled on /dev/fuse why handle it on
> the uring?

struct fuse_iqueue::pending

Yeah, we could leave FUSE_INIT with /dev/fuse IO, but then using the
ring for that is not so much more complicated and FUSE_INIT is actually
a nice startup test of whether the ring basically works.

> 
>> - The ring currently only handles fuse pending and background
>> requests (with credits assigned)
>> - Forget still requires libfuse to read /dev/fuse (handling will be
>> added to the ring later)
>> - In the WIP state request interrupts are not supported (yet)
>> - Userspace still needs to send fuse notifications to /dev/fuse; this
>> needs to be handled by the ring as well (or maybe a separate ring)
>> - My goal was to keep compatibility with existing fuse file systems;
>> apart from the still missing interrupt handling, that should work so far.
> 
> Interrupts and notifications are used by very few fs.  So if it's
> easier, then we could leave one thread to handle legacy /dev/fuse
> requests for anything that's not performance sensitive.

Interrupts maybe, but our product that is currently in active
development has a DLM and will be a heavy user of notifications. So
easier, yes, but a mismatch with our needs.


I'm still in the process of fixing issues I had overlooked. I hope to
get that done today, so that I can work on clean patches for upstream,
which will also explain things in commit messages and an updated
Documentation/filesystems/fuse.rst. I really hope to have the first
patches ready next week.


Thanks,
Bernd


