From: Jingbo Xu <jefflexu@linux.alibaba.com>
To: Bernd Schubert <bernd.schubert@fastmail.fm>,
	Miklos Szeredi <miklos@szeredi.hu>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	zhangjiachen.jaycee@bytedance.com
Subject: Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
Date: Thu, 7 Mar 2024 10:16:49 +0800
Message-ID: <cb39ba49-eada-44b4-97fd-ea27ac8ba1f4@linux.alibaba.com>
In-Reply-To: <5343dc29-83cb-49b4-91ff-57bbd0eaa1df@fastmail.fm>

Hi Bernd,

On 3/6/24 11:45 PM, Bernd Schubert wrote:
> 
> 
> On 3/6/24 14:32, Jingbo Xu wrote:
>>
>>
>> On 3/5/24 10:26 PM, Miklos Szeredi wrote:
>>> On Mon, 26 Feb 2024 at 05:00, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>
>>>> Hi Miklos,
>>>>
>>>> On 1/26/24 2:29 PM, Jingbo Xu wrote:
>>>>>
>>>>>
>>>>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>>>>
>>>>>>
>>>>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>>>
>>>>>>>> From: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>>>>
>>>>>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
>>>>>>>> single request is increased.
>>>>>>>
>>>>>>> The only worry is about where this memory is getting accounted to.
>>>>>>> This needs to be thought through, since we are increasing the
>>>>>>> possible memory that an unprivileged user is allowed to pin.
>>>>>
>>>>> Apart from the request size, the maximum number of background requests,
>>>>> i.e. max_background (12 by default, and configurable by the fuse
>>>>> daemon), also bounds the amount of memory that an unprivileged user
>>>>> can pin.  But yes, increasing the maximum request size does raise that
>>>>> bound proportionally.
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> This optimizes write performance, especially when the optimal IO size
>>>>>>>> of the backend store at the fuse daemon side is greater than the
>>>>>>>> original maximum request size (i.e. 1MB with FUSE_MAX_MAX_PAGES of 256
>>>>>>>> and a 4096-byte PAGE_SIZE).
>>>>>>>>
>>>>>>>> Note that this only increases the upper limit of the maximum request
>>>>>>>> size; the actual maximum request size is still subject to the FUSE_INIT
>>>>>>>> negotiation with the fuse daemon.
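
For readers following the patch: the negotiation mentioned above happens
in the FUSE_INIT reply (IIRC in process_init_reply()).  A simplified
user-space paraphrase of the clamp, not the exact kernel code:

/* Simplified illustration of the FUSE_INIT clamp: the daemon advertises
 * max_pages, and the kernel caps it at the compile-time ceiling that
 * this patch raises. */
#define FUSE_MAX_MAX_PAGES 256u		/* current ceiling */

static unsigned int clamp_max_pages(unsigned int daemon_max_pages)
{
	unsigned int want = daemon_max_pages ? daemon_max_pages : 1;

	return want > FUSE_MAX_MAX_PAGES ? FUSE_MAX_MAX_PAGES : want;
}

So raising FUSE_MAX_MAX_PAGES only widens what a daemon is allowed to ask
for; daemons that keep advertising a small max_pages see no change.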
>>>>>>>>
>>>>>>>> Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>>>> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>>>>>> ---
>>>>>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>>>>>> Bytedance folks seem to have increased the maximum request size to 8M
>>>>>>>> and saw a ~20% performance boost.
>>>>>>>
>>>>>>> The 20% is against the 256 pages, I guess.
>>>>>>
>>>>>> Yeah I guess so.
>>>>>>
>>>>>>
>>>>>>> It would be interesting to
>>>>>>> see how the number of pages per request affects performance and
>>>>>>> why.
>>>>>>
>>>>>> To be honest, I'm not sure about the root cause of the performance
>>>>>> boost in Bytedance's case.
>>>>>>
>>>>>> In our internal use scenario, the optimal IO size of the backend
>>>>>> store at the fuse server side is e.g. 4MB, and thus the maximum
>>>>>> throughput cannot be achieved with the current 256 pages per request.
>>>>>> IOW the backend store, e.g. a distributed parallel filesystem, gets
>>>>>> optimal performance when the data is aligned at a 4MB boundary.  I can
>>>>>> ask my colleague who implements the fuse server to give more background
>>>>>> info and the exact performance statistics.
>>>>>
>>>>> Here are more details about our internal use case:
>>>>>
>>>>> We have a fuse server used in our internal cloud scenarios, whose
>>>>> backend store is a distributed filesystem.  That is, the fuse
>>>>> server actually acts as the client of the remote distributed
>>>>> filesystem.  The fuse server forwards the fuse requests to the remote
>>>>> backing store over the network, while the remote distributed filesystem
>>>>> handles the IO requests, e.g. moving the data from/to the persistent
>>>>> store.
>>>>>
>>>>> Here are the details of how the remote distributed filesystem processes
>>>>> the requested data against the persistent store.
>>>>>
>>>>> [1] The remote distributed filesystem uses erasure coding (EC) in, e.g.,
>>>>> an 8+3 mode, where each fixed-size chunk of user data is split and
>>>>> stored as 8 data blocks plus 3 extra parity blocks.  For example, with a
>>>>> 512KB block size, each 4MB of user data is split and stored as 8 data
>>>>> blocks (512KB each) together with 3 parity blocks (512KB each).
>>>>>
>>>>> It also utilizes striping to boost performance: in the 8+3 mode above
>>>>> there are 8 data disks and 3 parity disks, and each stripe consists of
>>>>> 8 data blocks and 3 parity blocks.
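
To make the geometry concrete, a small illustrative sketch; the constants
simply mirror the 8+3 / 4MB example above and are not taken from the real
implementation:

/* Illustrative 8+3 EC stripe geometry matching the example above. */
#define EC_DATA_BLOCKS		8
#define EC_PARITY_BLOCKS	3
#define STRIPE_SIZE		(4u << 20)			/* 4MB of user data */
#define EC_BLOCK_SIZE		(STRIPE_SIZE / EC_DATA_BLOCKS)	/* 512KB per block  */

/* One full stripe therefore stores EC_DATA_BLOCKS * EC_BLOCK_SIZE = 4MB
 * of user data and writes an extra EC_PARITY_BLOCKS * EC_BLOCK_SIZE
 * = 1.5MB of parity to disk. */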
>>>>>
>>>>> [2] To avoid data corruption on power loss, the remote distributed
>>>>> filesystem commits an O_SYNC write right away once a write (fuse)
>>>>> request is received.  Because of the EC scheme described above, when the
>>>>> fuse write request is not aligned on a 4MB (stripe size) boundary, say
>>>>> it is 1MB in size, the other 3MB is first read from the persistent
>>>>> store, then the 3 parity blocks are recomputed over the complete 4MB
>>>>> stripe, and finally the 8 data blocks and 3 parity blocks are written
>>>>> down.
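
(Spelling out the arithmetic of that read-modify-write path with the 4MB
stripe above, assuming the 1MB write falls within a single stripe: the
store reads roughly 3MB and writes 4MB of data plus 1.5MB of parity, i.e.
about 8.5MB of backend IO per 1MB of user data, whereas a full aligned
4MB write needs no read at all and amortizes the same 5.5MB of writes
over 4MB of user data.)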
>>>>>
>>>>>
>>>>> Thus the write amplification is non-negligible and becomes the
>>>>> performance bottleneck when the fuse request size is less than the
>>>>> stripe size.
>>>>>
>>>>> Here are some simple performance statistics with varying request size.
>>>>> With 4MB stripe size, there's ~3x bandwidth improvement when the maximum
>>>>> request size is increased from 256KB to 3.9MB, and another ~20%
>>>>> improvement when the request size is increased to 4MB from 3.9MB.
>>>
>>> I sort of understand the issue, although my guess is that this could
>>> be worked around in the client by coalescing writes.  This could be
>>> done by adding a small delay before sending a write request off to the
>>> network.
>>>
>>> Would that work in your case?
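
If I understand the suggestion, it would look roughly like the following
in the fuse server.  This is a purely hypothetical sketch, with all names
invented for illustration:

/* Hypothetical sketch of the suggested coalescing in the fuse server:
 * buffer contiguous WRITE payloads and flush once a full 4MB stripe has
 * accumulated. */
#include <stdint.h>
#include <string.h>

#define STRIPE_SIZE (4u << 20)

struct coalesce_buf {
	uint64_t offset;		/* file offset of buffered data */
	uint32_t len;			/* bytes currently buffered     */
	char     data[STRIPE_SIZE];
};

/* Stand-in for the O_SYNC write to the distributed backend store. */
void backend_write(uint64_t off, const void *buf, uint32_t len);

static void flush_buf(struct coalesce_buf *cb)
{
	if (cb->len) {
		backend_write(cb->offset, cb->data, cb->len);
		cb->len = 0;
	}
}

/* Called for each incoming fuse WRITE request. */
static void queue_write(struct coalesce_buf *cb, uint64_t off,
			const void *buf, uint32_t len)
{
	if (cb->len && off != cb->offset + cb->len)
		flush_buf(cb);			/* non-contiguous: flush */

	while (len) {
		uint32_t n;

		if (cb->len == 0)
			cb->offset = off;
		n = STRIPE_SIZE - cb->len;
		if (n > len)
			n = len;
		memcpy(cb->data + cb->len, buf, n);
		cb->len += n;
		buf = (const char *)buf + n;
		off += n;
		len -= n;
		if (cb->len == STRIPE_SIZE)
			flush_buf(cb);		/* full stripe: send it out */
	}
	/* A real version would also arm a short timer so that partially
	 * filled stripes are flushed after a small delay. */
}

The tricky parts are presumably the flush deadline (it adds latency under
O_SYNC-like semantics) and preserving the durability guarantee described
in [2], which is probably why it is not as simple as just enlarging the
request size.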
>>
>> It's possible but I'm not sure. I've asked my colleagues working on
>> the fuse server and the backend store, though they have not replied yet.
>> But I guess it's not as simple as directly increasing the maximum FUSE
>> request size, and more complexity would get involved.
>>
>> I can also understand the concern that this may increase the risk of
>> pinning more memory, and that a more generic usage scenario needs to be
>> considered.  I can make it a private patch for our internal product.
>>
>> Thanks for the suggestions and discussion.
> 
> It also gets kind of solved in my fuse-over-io-uring branch - as long as
> there are enough free ring entries. I'm going to add a flag there
> indicating that other CQEs might be follow-up requests. It's really time
> to post a new version.

Thanks for the information.  I've not read the fuse-over-io-uring branch
yet, but it sounds like it would be very helpful.  Would there be a flag
in the FUSE request indicating that it is one of the linked FUSE requests?
Is this feature, i.e. linked FUSE requests, enabled only when FUSE runs
over io-uring?

-- 
Thanks,
Jingbo


Thread overview: 14+ messages
2024-01-24  7:05 [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit Jingbo Xu
2024-01-24 12:23 ` Miklos Szeredi
2024-01-24 12:47   ` Jingbo Xu
2024-01-26  6:29     ` Jingbo Xu
2024-02-26  4:00       ` Jingbo Xu
2024-03-05 14:26         ` Miklos Szeredi
2024-03-06 13:32           ` Jingbo Xu
2024-03-06 15:45             ` Bernd Schubert
2024-03-07  2:16               ` Jingbo Xu [this message]
2024-03-07 22:06                 ` Bernd Schubert
2024-03-28 16:46                   ` Sweet Tea Dorminy
2024-03-28 22:08                     ` Bernd Schubert
