From: Chuck Lever <chuck.lever@oracle.com>
To: Cedric Blancher <cedric.blancher@gmail.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: Increase RPCSVC_MAXPAYLOAD to 8M?
Date: Thu, 6 Feb 2025 09:25:05 -0500
Message-ID: <db677cf9-6979-4247-a195-5761c27ef2ab@oracle.com>
In-Reply-To: <CALXu0Uew5qUxvH7wum7xC1TBaP43tmrYAbU6iS6yuwJVF6rBrg@mail.gmail.com>

On 2/6/25 3:45 AM, Cedric Blancher wrote:
> On Wed, 29 Jan 2025 at 16:02, Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> On 1/29/25 2:32 AM, Cedric Blancher wrote:
>>> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>>>>
>>>> Good morning!
>>>>
>>>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
>>>> giving the NFSv4.1 session mechanism some headroom for negotiation.
>>>> For over a decade the default value is 1M (1*1024*1024u), which IMO
>>>> causes problems with anything faster than 2500baseT.
>>>
>>> The 1MB limit was defined when 10base5/10baseT was the norm, and
>>> 100baseT (100mbit) was "fast".
>>>
>>> Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
>>> 10000baseT is fast.
>>> Just the 1MB limit is now in the way of EVERYTHING, including "large
>>> send offload" and other acceleration features.
>>>
>>> So my suggestion is to increase the buffer to 4MB by default (2*2MB
>>> hugepages on x86), and allow a tuneable to select up to 16MB.
>>
>> TL;DR: This has been on the long-term to-do list for NFSD for quite some
>> time.
>>
>> We certainly want to support larger COMPOUNDs, but increasing
>> RPCSVC_MAXPAYLOAD is only the first step.
>>
>> The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
>> it has 259 entries. Quadrupling that would make the array itself
>> multiple pages in size, and there's one of these for each nfsd thread.
>>
>> We are working on replacing the use of page arrays with folios, which
>> would make this infrastructure significantly smaller and faster, but it
>> depends on folio support in all the kernel APIs that NFSD makes use of.
>> That situation continues to evolve.
>>
>> An equivalent issue exists in the Linux NFS client.
>>
>> Increasing this capability on the server without having a client that
>> can make use of it doesn't seem wise.
>>
>> You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and try
>> some measurements to help make the case (and analyze the operational
>> costs). I think we need some confidence that increasing the maximum
>> payload size will not unduly impact small I/O.
>>
>> Re: a tunable: I'm not sure why someone would want to tune this number
>> down from the maximum. You can control how much total memory the server
>> consumes by reducing the number of nfsd threads.
>>
>
> I want a tuneable for TESTING, i.e. lower default (for now), but allow
> people to grab a stock Linux kernel, increase tunable, and do testing.
> Not everyone is happy with doing the voodoo of self-build testing,
> even more so in the (dark) "Age Of SecureBoot", where a signed kernel
> is mandatory. Therefore: Tuneable.

That's appropriate for experimentation, but not a good long-term
solution that should go into the upstream source code.

A tuneable in the upstream source base means the upstream community and
distributors have to support it for a very long time, and these are hard
to get rid of once they become irrelevant.

We have to provide documentation. That documentation might contain
recommended values, and those change over time. They spread out over
the internet and the stale recommended values become a liability.

Admins and users frequently set tuneables incorrectly and that results
in bugs and support calls.

It increases the size of test matrices.

Adding only one of these might not result in a significant increase in
maintenance cost, but if we allow one tuneable, then we have to allow
all of them, and that becomes a living nightmare.

So, not as simple and low-cost as you might think to just "add a
tuneable" in upstream. And not a sensible choice when all you need is a
temporary adjustment for testing.

Do you have a reason why, after we agree on an increase, this should
be a setting that admins will need to lower the value from a default of,
say, 4MB or more? If so, then it makes sense to consider a tuneable (or
better, a self-tuning mechanism). For a temporary setting for the
purpose of experimentation, writing your own patch is the better and
less costly approach.
--
Chuck Lever
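
For anyone weighing the rq_pages[] cost quoted above, here is a rough
back-of-the-envelope sketch of how the array grows with the maximum
payload. It is only an estimate under stated assumptions: 4 KiB pages,
8-byte pointers, and three extra entries on top of payload/page-size,
chosen so that a 1 MiB payload gives the 259 entries mentioned in the
thread. The function names are made up for illustration and are not
kernel symbols.

```python
PAGE_SIZE = 4096      # assumption: 4 KiB pages
POINTER_SIZE = 8      # assumption: 64-bit kernel, 8-byte page pointers
EXTRA_ENTRIES = 3     # assumption: picked so that 1 MiB -> 259 entries


def ceil_div(a, b):
    """Integer ceiling division, avoids floating point."""
    return (a + b - 1) // b


def rq_pages_entries(max_payload):
    """Estimated number of page-pointer entries for a given max payload."""
    return ceil_div(max_payload, PAGE_SIZE) + EXTRA_ENTRIES


def rq_pages_bytes(max_payload):
    """Estimated size in bytes of the pointer array itself."""
    return rq_pages_entries(max_payload) * POINTER_SIZE


def pages_spanned(nbytes):
    """How many whole pages a buffer of nbytes occupies."""
    return ceil_div(nbytes, PAGE_SIZE)


if __name__ == "__main__":
    # Per-thread array cost for the payload sizes discussed in the thread.
    for mib in (1, 4, 8, 16):
        payload = mib * 1024 * 1024
        size = rq_pages_bytes(payload)
        print(f"{mib:2d} MiB payload: {rq_pages_entries(payload):5d} entries, "
              f"{size:6d} bytes, spans {pages_spanned(size)} page(s)")
```

Under these assumptions the array fits in one page at 1 MiB but needs
more than two pages at 4 MiB, and that cost is paid once per nfsd
thread.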