Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit

From: Antonio SJ Musumeci <trapexit@spawn.link>
To: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>,
	Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	zhangjiachen.jaycee@bytedance.com
Subject: Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
Date: Mon, 08 Apr 2024 14:26:53 +0000
Message-ID: <af555e3c-cd00-4bf4-b774-8099517bf559@spawn.link>
In-Reply-To: <b4d801a442c71d064a6b2212d8d6f661@dorminy.me>

On 4/8/24 01:32, Sweet Tea Dorminy wrote:
> 
> On 2024-01-26 01:29, Jingbo Xu wrote:
>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>
>>>
>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com>
>>>> wrote:
>>>>>
>>>>> From: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>
>>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
>>>>> single request is increased.
>>>>
>>>> The only worry is about where this memory is getting accounted to.
>>>> This needs to be thought through, since we are increasing the
>>>> possible memory that an unprivileged user is allowed to pin.
>>
>> Apart from the request size, the maximum number of background requests,
>> i.e. max_background (12 by default, and configurable by the fuse
>> daemon), also limits the size of the memory that an unprivileged user
>> can pin.  But yes, it indeed increases the number proportionally by
>> increasing the maximum request size.
>>
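The bound described above can be sketched as simple arithmetic. This is a rough model, not kernel accounting: it assumes 4 KiB pages and counts only the data pages of background requests. The 256-page figure and the default max_background of 12 come from the thread; the 1024 and 2048 page values are illustrative only and are not taken from the patch.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def max_pinned_bytes(max_pages, max_background=12, page_size=PAGE_SIZE):
    """Rough upper bound on request data pinned by background requests."""
    return max_pages * page_size * max_background


# 256 is the current per-request limit quoted in the thread;
# 1024 and 2048 are hypothetical larger limits used for comparison.
for pages in (256, 1024, 2048):
    mib = max_pinned_bytes(pages) / (1 << 20)
    print(f"{pages:5d} pages/request -> {mib:.0f} MiB with max_background=12")
```

Since the fuse daemon can raise max_background, the bound grows with both factors, which is why the per-request limit alone does not cap the pinned memory.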
>>
>>>
>>>> It would be interesting to
>>>> see how the number of pages per request affects performance and
>>>> why.
>>>
>>> To be honest, I'm not sure of the root cause of the performance boost in
>>> bytedance's case.
>>>
>>> In our internal use scenario, the optimal IO size of the backend
>>> store at the fuse server side is, e.g., 4MB, and thus the maximum
>>> throughput cannot be achieved with the current 256 pages per request. IOW
>>> the backend store, e.g. a distributed parallel filesystem, gets optimal
>>> performance when the data is aligned on a 4MB boundary.  I can ask my
>>> colleague who implements the fuse server to give more background info
>>> and the exact performance statistics.
>>
>> Here are more details about our internal use case:
>>
>> We have a fuse server used in our internal cloud scenarios, while the
>> backend store is actually a distributed filesystem.  That is, the fuse
>> server actually acts as a client of the remote distributed
>> filesystem.  The fuse server forwards the fuse requests to the remote
>> backing store over the network, while the remote distributed filesystem
>> handles the IO requests, e.g. processing the data to/from the persistent
>> store.
>>
>> Next are the details of how the remote distributed filesystem
>> processes the requested data against the persistent store.
>>
>> [1] The remote distributed filesystem uses EC (erasure coding), e.g. in
>> an 8+3 mode, where each fixed-size chunk of user data is split and stored
>> as 8 data blocks plus 3 extra parity blocks. For example, with a 512KB
>> block size, each 4MB of user data is split and stored as 8 (512KB)
>> data blocks with 3 (512KB) parity blocks.
>>
>> It also uses striping to boost performance: for example, there are
>> 8 data disks and 3 parity disks in the above 8+3 mode example, and
>> each stripe consists of 8 data blocks and 3 parity blocks.
>>
>> [2] To avoid data corruption on power off, the remote distributed
>> filesystem commits an O_SYNC write as soon as a write (fuse) request is
>> received.  Because of the EC scheme described above, when the write fuse
>> request is not aligned on a 4MB (the stripe size) boundary, say it's 1MB
>> in size, the other 3MB is read from the persistent store first, then the
>> 3 extra parity blocks are computed over the complete 4MB stripe, and
>> finally the 8 data blocks and 3 parity blocks are written down.
>>
>>
>> Thus the write amplification is non-negligible and is the performance
>> bottleneck when the fuse request size is less than the stripe size.
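The read-modify-write cost described in [2] can be modelled in a few lines. This is a simplified sketch under stated assumptions: writes start on a stripe boundary, a partial stripe is always completed by reading the missing data first, and every touched stripe is rewritten in full with its parity. The function name and the model itself are illustrative, not the actual backend's logic; the 4MB stripe and 8+3 layout come from the description above.

```python
STRIPE = 4 << 20   # 4 MiB stripe, as in the description above
K, M = 8, 3        # 8 data blocks + 3 parity blocks per stripe


def backend_io(write_size, stripe=STRIPE, k=K, m=M):
    """Return (bytes_read, bytes_written) on the persistent store for one
    write that starts on a stripe boundary (simplifying assumption)."""
    full, tail = divmod(write_size, stripe)
    stripes = full + (1 if tail else 0)
    # A partial trailing stripe must be completed from disk before parity
    # can be computed over the whole stripe.
    read = (stripe - tail) if tail else 0
    # Each touched stripe is written back as k data + m parity blocks.
    written = stripes * stripe * (k + m) // k
    return read, written


for size in (256 << 10, 1 << 20, 4 << 20):
    read, written = backend_io(size)
    ratio = (read + written) / size
    print(f"{size >> 10:5d} KiB request: read {read >> 10} KiB, "
          f"write {written >> 10} KiB, backend IO per user byte {ratio:.2f}")
```

In this model a 1 MiB request costs 3 MiB of reads plus 5.5 MiB of writes, while a stripe-sized 4 MiB request costs no reads and the same 5.5 MiB of writes.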
>>
>> Here are some simple performance statistics with varying request size.
>> With 4MB stripe size, there's ~3x bandwidth improvement when the maximum
>> request size is increased from 256KB to 3.9MB, and another ~20%
>> improvement when the request size is increased to 4MB from 3.9MB.
> 
> To add my own performance statistics in a microbenchmark:
> 
> Tested on both a small VM and large hardware, with a suitably large
> FUSE_MAX_MAX_PAGES, using a simple fuse filesystem whose write handlers
> did basically nothing but read the data buffers (memcmp() each 8 bytes
> of data provided against a variable), I ran fio with 128M blocksize,
> end_fsync=1, and the psync IO engine, across 4 parallel jobs. Throughput
> in MB/s for varying write_size was as follows.
> 
> write_size  machine1 (MB/s)  machine2 (MB/s)
> 32M         1071             6425
> 16M         1002             6445
> 8M           890             6443
> 4M           713             6342
> 2M           557             6290
> 1M           404             6201
> 512K         268             6041
> 256K         156             5782
> 
> Even on the fast machine, increasing the buffer size to 8M is worth 3.9%
> over keeping it at 1M, and is worth over 2x on the small VM. We are
> striving to improve the ingestion speed in particular, as we have seen
> that as a limiting factor on some machines, and there's a clear plateau
> reached around 8M. While most fuse servers would likely not benefit from
> this, and others would benefit from fuse passthrough instead, it does
> seem like a performance win.
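The ratios quoted above can be rechecked directly from the table. The snippet below only restates the posted numbers; the helper name is made up for this example.

```python
# Throughput in MB/s copied from the table: write_size -> (machine1, machine2)
THROUGHPUT = {
    "256K": (156, 5782),
    "512K": (268, 6041),
    "1M": (404, 6201),
    "2M": (557, 6290),
    "4M": (713, 6342),
    "8M": (890, 6443),
    "16M": (1002, 6445),
    "32M": (1071, 6425),
}


def gain(new, old, machine):
    """Throughput ratio between two write sizes (0 = machine1, 1 = machine2)."""
    return THROUGHPUT[new][machine] / THROUGHPUT[old][machine]


sizes = list(THROUGHPUT)
for machine, name in enumerate(("machine1", "machine2")):
    print(name)
    # Relative gain for each doubling of write_size.
    for old, new in zip(sizes, sizes[1:]):
        pct = 100 * (gain(new, old, machine) - 1)
        print(f"  {old:>4} -> {new:<4} {pct:+6.1f}%")
```

Going from 1M to 8M works out to roughly 2.2x on machine1 and about 3.9% on machine2, consistent with the figures in the text, and the per-doubling gains on machine2 fall to around zero past 8M.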
> 
> Perhaps, in analogy to soft and hard limits on pipe size,
> FUSE_MAX_MAX_PAGES could be increased and treated as the maximum
> possible hard limit for max_write; and the default hard limit could stay
> at 1M, thereby allowing folks to opt into the new behavior if they care
> about the performance more than the memory?
> 
> Sweet Tea

As I recall, the concern about increased message sizes is that it gives a
process the ability to allocate significant amounts of kernel
memory. Perhaps the limits could be expanded only if the server has
the CAP_SYS_ADMIN capability.
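
To make the suggestion concrete, here is a minimal sketch of the policy as a plain function. It is a model only, not kernel code: the function name, the constant names, and the 2048-page raised ceiling are all hypothetical. The 256-page default is the current limit mentioned earlier in the thread.

```python
DEFAULT_LIMIT_PAGES = 256   # current per-request limit quoted in the thread
RAISED_LIMIT_PAGES = 2048   # hypothetical raised ceiling, for illustration only


def negotiated_max_pages(requested, has_cap_sys_admin):
    """Clamp the server's requested max_pages to the ceiling it is allowed.

    Only a server holding the capability may exceed the default limit.
    """
    ceiling = RAISED_LIMIT_PAGES if has_cap_sys_admin else DEFAULT_LIMIT_PAGES
    return max(1, min(requested, ceiling))


print(negotiated_max_pages(1024, has_cap_sys_admin=False))
print(negotiated_max_pages(1024, has_cap_sys_admin=True))
```

With this shape, unprivileged servers keep today's behaviour, and the larger pinning potential is limited to servers that are already trusted.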
