Linux NFS development
* Increase RPCSVC_MAXPAYLOAD to 8M?
@ 2025-01-22 10:07 Cedric Blancher
  2025-01-29  7:32 ` Cedric Blancher
  2025-04-07 11:34 ` Increase RPCSVC_MAXPAYLOAD to 8M, part DEUX Cedric Blancher
From: Cedric Blancher @ 2025-01-22 10:07 UTC
  To: Linux NFS Mailing List

Good morning!

IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
giving the NFSv4.1 session mechanism some headroom for negotiation.
For over a decade the default value has been 1M (1*1024*1024u), which
causes problems with anything faster than 2500baseT.
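
For reference, the current definition in include/linux/sunrpc/svc.h is
(paraphrased; the exact form may vary across kernel versions):

    #define RPCSVC_MAXPAYLOAD (1*1024*1024u)  /* the 1MB payload cap */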

Ced
-- 
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur


* Re: Increase RPCSVC_MAXPAYLOAD to 8M?
  2025-01-22 10:07 Increase RPCSVC_MAXPAYLOAD to 8M? Cedric Blancher
@ 2025-01-29  7:32 ` Cedric Blancher
  2025-01-29 15:02   ` Chuck Lever
  2025-04-07 11:34 ` Increase RPCSVC_MAXPAYLOAD to 8M, part DEUX Cedric Blancher
From: Cedric Blancher @ 2025-01-29  7:32 UTC
  To: Linux NFS Mailing List

On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>
> Good morning!
>
> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
> giving the NFSv4.1 session mechanism some headroom for negotiation.
> For over a decade the default value has been 1M (1*1024*1024u), which
> causes problems with anything faster than 2500baseT.

The 1MB limit was defined when 10base5/10baseT was the norm, and
100baseT (100mbit) was "fast".

Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
10000baseT is fast.
The 1MB limit now gets in the way of EVERYTHING, including "large
send offload" and other acceleration features.

So my suggestion is to increase the buffer to 4MB by default (2*2MB
hugepages on x86), and allow a tuneable to select up to 16MB.
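
A rough sketch of what I mean -- the names and values here are
illustrative, not a patch:

    /* default: 4MB, i.e. two 2MB hugepages on x86 */
    #define RPCSVC_MAXPAYLOAD_DEF (4*1024*1024u)
    /* ceiling that a tuneable could raise the limit to */
    #define RPCSVC_MAXPAYLOAD_MAX (16*1024*1024u)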

Ced
-- 
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur


* Re: Increase RPCSVC_MAXPAYLOAD to 8M?
  2025-01-29  7:32 ` Cedric Blancher
@ 2025-01-29 15:02   ` Chuck Lever
  2025-02-06  8:45     ` Cedric Blancher
From: Chuck Lever @ 2025-01-29 15:02 UTC
  To: Cedric Blancher; +Cc: Linux NFS Mailing List

On 1/29/25 2:32 AM, Cedric Blancher wrote:
> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>>
>> Good morning!
>>
>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
>> giving the NFSv4.1 session mechanism some headroom for negotiation.
>> For over a decade the default value has been 1M (1*1024*1024u), which
>> causes problems with anything faster than 2500baseT.
> 
> The 1MB limit was defined when 10base5/10baseT was the norm, and
> 100baseT (100mbit) was "fast".
> 
> Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
> 10000baseT is fast.
> The 1MB limit now gets in the way of EVERYTHING, including "large
> send offload" and other acceleration features.
> 
> So my suggestion is to increase the buffer to 4MB by default (2*2MB
> hugepages on x86), and allow a tuneable to select up to 16MB.

TL;DR: This has been on the long-term to-do list for NFSD for quite some
time.

We certainly want to support larger COMPOUNDs, but increasing
RPCSVC_MAXPAYLOAD is only the first step.

The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
it has 259 entries. Quadrupling that would make the array itself
multiple pages in size, and there's one of these for each nfsd thread.
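
For context, the 259 falls out of the current constants in
include/linux/sunrpc/svc.h, roughly (paraphrased; details vary by
kernel version):

    /* 256 payload pages, plus pages for the head and tail: 259 */
    #define RPCSVC_MAXPAGES \
            ((RPCSVC_MAXPAYLOAD + PAGE_SIZE - 1) / PAGE_SIZE + 2 + 1)

At a 4MB maximum payload that becomes more than 1000 page pointers,
roughly 8KB of array per thread on 64-bit.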

We are working on replacing the use of page arrays with folios, which
would make this infrastructure significantly smaller and faster, but it
depends on folio support in all the kernel APIs that NFSD makes use of.
That situation continues to evolve.

An equivalent issue exists in the Linux NFS client.

Increasing this capability on the server without having a client that
can make use of it doesn't seem wise.

You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and run
some measurements to help make the case (and analyze the operational
costs). I think we need some confidence that increasing the maximum
payload size will not unduly impact small I/O.
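
(The experiment itself is a one-line change in
include/linux/sunrpc/svc.h, something like:

    #define RPCSVC_MAXPAYLOAD (4*1024*1024u)    /* was 1*1024*1024u */

followed by a rebuild; RPCSVC_MAXPAGES and the rq_pages array size are
derived from it and scale automatically.)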

Re: a tunable: I'm not sure why someone would want to tune this number
down from the maximum. You can control how much total memory the server
consumes by reducing the number of nfsd threads.


-- 
Chuck Lever


* Re: Increase RPCSVC_MAXPAYLOAD to 8M?
  2025-01-29 15:02   ` Chuck Lever
@ 2025-02-06  8:45     ` Cedric Blancher
  2025-02-06 14:25       ` Chuck Lever
From: Cedric Blancher @ 2025-02-06  8:45 UTC
  To: Linux NFS Mailing List

On Wed, 29 Jan 2025 at 16:02, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On 1/29/25 2:32 AM, Cedric Blancher wrote:
> > On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> >>
> >> Good morning!
> >>
> >> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
> >> giving the NFSv4.1 session mechanism some headroom for negotiation.
> >> For over a decade the default value has been 1M (1*1024*1024u), which
> >> causes problems with anything faster than 2500baseT.
> >
> > The 1MB limit was defined when 10base5/10baseT was the norm, and
> > 100baseT (100mbit) was "fast".
> >
> > Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
> > 10000baseT is fast.
> > The 1MB limit now gets in the way of EVERYTHING, including "large
> > send offload" and other acceleration features.
> >
> > So my suggestion is to increase the buffer to 4MB by default (2*2MB
> > hugepages on x86), and allow a tuneable to select up to 16MB.
>
> TL;DR: This has been on the long-term to-do list for NFSD for quite some
> time.
>
> We certainly want to support larger COMPOUNDs, but increasing
> RPCSVC_MAXPAYLOAD is only the first step.
>
> The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
> it has 259 entries. Quadrupling that would make the array itself
> multiple pages in size, and there's one of these for each nfsd thread.
>
> We are working on replacing the use of page arrays with folios, which
> would make this infrastructure significantly smaller and faster, but it
> depends on folio support in all the kernel APIs that NFSD makes use of.
> That situation continues to evolve.
>
> An equivalent issue exists in the Linux NFS client.
>
> Increasing this capability on the server without having a client that
> can make use of it doesn't seem wise.
>
> You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and run
> some measurements to help make the case (and analyze the operational
> costs). I think we need some confidence that increasing the maximum
> payload size will not unduly impact small I/O.
>
> Re: a tunable: I'm not sure why someone would want to tune this number
> down from the maximum. You can control how much total memory the server
> consumes by reducing the number of nfsd threads.
>

I want a tuneable for TESTING, i.e. a lower default (for now), but let
people grab a stock Linux kernel, raise the tuneable, and do testing.
Not everyone is happy doing the voodoo of self-built test kernels,
even more so in the (dark) "Age Of SecureBoot", where a signed kernel
is mandatory. Therefore: Tuneable.

Ced
-- 
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur


* Re: Increase RPCSVC_MAXPAYLOAD to 8M?
  2025-02-06  8:45     ` Cedric Blancher
@ 2025-02-06 14:25       ` Chuck Lever
  2025-03-04  6:43         ` Cedric Blancher
From: Chuck Lever @ 2025-02-06 14:25 UTC
  To: Cedric Blancher; +Cc: Linux NFS Mailing List

On 2/6/25 3:45 AM, Cedric Blancher wrote:
> On Wed, 29 Jan 2025 at 16:02, Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> On 1/29/25 2:32 AM, Cedric Blancher wrote:
>>> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>>>>
>>>> Good morning!
>>>>
>>>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
>>>> giving the NFSv4.1 session mechanism some headroom for negotiation.
>>>> For over a decade the default value has been 1M (1*1024*1024u), which
>>>> causes problems with anything faster than 2500baseT.
>>>
>>> The 1MB limit was defined when 10base5/10baseT was the norm, and
>>> 100baseT (100mbit) was "fast".
>>>
>>> Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
>>> 10000baseT is fast.
>>> The 1MB limit now gets in the way of EVERYTHING, including "large
>>> send offload" and other acceleration features.
>>>
>>> So my suggestion is to increase the buffer to 4MB by default (2*2MB
>>> hugepages on x86), and allow a tuneable to select up to 16MB.
>>
>> TL;DR: This has been on the long-term to-do list for NFSD for quite some
>> time.
>>
>> We certainly want to support larger COMPOUNDs, but increasing
>> RPCSVC_MAXPAYLOAD is only the first step.
>>
>> The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
>> it has 259 entries. Quadrupling that would make the array itself
>> multiple pages in size, and there's one of these for each nfsd thread.
>>
>> We are working on replacing the use of page arrays with folios, which
>> would make this infrastructure significantly smaller and faster, but it
>> depends on folio support in all the kernel APIs that NFSD makes use of.
>> That situation continues to evolve.
>>
>> An equivalent issue exists in the Linux NFS client.
>>
>> Increasing this capability on the server without having a client that
>> can make use of it doesn't seem wise.
>>
>> You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and run
>> some measurements to help make the case (and analyze the operational
>> costs). I think we need some confidence that increasing the maximum
>> payload size will not unduly impact small I/O.
>>
>> Re: a tunable: I'm not sure why someone would want to tune this number
>> down from the maximum. You can control how much total memory the server
>> consumes by reducing the number of nfsd threads.
>>
> 
> I want a tuneable for TESTING, i.e. a lower default (for now), but let
> people grab a stock Linux kernel, raise the tuneable, and do testing.
> Not everyone is happy doing the voodoo of self-built test kernels,
> even more so in the (dark) "Age Of SecureBoot", where a signed kernel
> is mandatory. Therefore: Tuneable.

That's appropriate for experimentation, but not a good long-term
solution, and not something that should go into the upstream source
code.

A tuneable in the upstream source base means the upstream community and
distributors have to support it for a very long time, and these are hard
to get rid of once they become irrelevant.

We have to provide documentation. That documentation might contain
recommended values, and those change over time. They spread out over
the internet and the stale recommended values become a liability.

Admins and users frequently set tuneables incorrectly and that results
in bugs and support calls.

It increases the size of test matrices.

Adding only one of these might not result in a significant increase in
maintenance cost, but if we allow one tuneable, then we have to allow
all of them, and that becomes a living nightmare.

So, not as simple and low-cost as you might think to just "add a
tuneable" in upstream. And not a sensible choice when all you need is a
temporary adjustment for testing.

Do you have a reason why, after we agree on an increase, this should
be a setting that admins will need to lower from a default of, say,
4MB or more? If so, then it makes sense to consider a tuneable (or
better, a self-tuning mechanism). For a temporary setting for the
purpose of experimentation, writing your own patch is the better and
less costly approach.


-- 
Chuck Lever


* Re: Increase RPCSVC_MAXPAYLOAD to 8M?
  2025-02-06 14:25       ` Chuck Lever
@ 2025-03-04  6:43         ` Cedric Blancher
  2025-03-04 14:40           ` Chuck Lever
From: Cedric Blancher @ 2025-03-04  6:43 UTC
  To: Linux NFS Mailing List

On Thu, 6 Feb 2025 at 15:25, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On 2/6/25 3:45 AM, Cedric Blancher wrote:
> > On Wed, 29 Jan 2025 at 16:02, Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >> On 1/29/25 2:32 AM, Cedric Blancher wrote:
> >>> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> >>>>
> >>>> Good morning!
> >>>>
> >>>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
> >>>> giving the NFSv4.1 session mechanism some headroom for negotiation.
> >>>> For over a decade the default value has been 1M (1*1024*1024u), which
> >>>> causes problems with anything faster than 2500baseT.
> >>>
> >>> The 1MB limit was defined when 10base5/10baseT was the norm, and
> >>> 100baseT (100mbit) was "fast".
> >>>
> >>> Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
> >>> 10000baseT is fast.
> >>> The 1MB limit now gets in the way of EVERYTHING, including "large
> >>> send offload" and other acceleration features.
> >>>
> >>> So my suggestion is to increase the buffer to 4MB by default (2*2MB
> >>> hugepages on x86), and allow a tuneable to select up to 16MB.
> >>
> >> TL;DR: This has been on the long-term to-do list for NFSD for quite some
> >> time.
> >>
> >> We certainly want to support larger COMPOUNDs, but increasing
> >> RPCSVC_MAXPAYLOAD is only the first step.
> >>
> >> The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
> >> it has 259 entries. Quadrupling that would make the array itself
> >> multiple pages in size, and there's one of these for each nfsd thread.
> >>
> >> We are working on replacing the use of page arrays with folios, which
> >> would make this infrastructure significantly smaller and faster, but it
> >> depends on folio support in all the kernel APIs that NFSD makes use of.
> >> That situation continues to evolve.
> >>
> >> An equivalent issue exists in the Linux NFS client.
> >>
> >> Increasing this capability on the server without having a client that
> >> can make use of it doesn't seem wise.
> >>
> >> You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and run
> >> some measurements to help make the case (and analyze the operational
> >> costs). I think we need some confidence that increasing the maximum
> >> payload size will not unduly impact small I/O.
> >>
> >> Re: a tunable: I'm not sure why someone would want to tune this number
> >> down from the maximum. You can control how much total memory the server
> >> consumes by reducing the number of nfsd threads.
> >>
> >
> > I want a tuneable for TESTING, i.e. a lower default (for now), but let
> > people grab a stock Linux kernel, raise the tuneable, and do testing.
> > Not everyone is happy doing the voodoo of self-built test kernels,
> > even more so in the (dark) "Age Of SecureBoot", where a signed kernel
> > is mandatory. Therefore: Tuneable.
>
> That's appropriate for experimentation, but not a good long-term
> solution, and not something that should go into the upstream source
> code.

I disagree. In the age of "secureboot enforcement", where only
cryptographically signed kernels can be loaded on servers, how should
the data be collected?

>
> A tuneable in the upstream source base means the upstream community and
> distributors have to support it for a very long time, and these are hard
> to get rid of once they become irrelevant.

No, this tunable is very likely to stay: it defines the DEFAULT for the kernel.

>
> We have to provide documentation. That documentation might contain
> recommended values, and those change over time. They spread out over
> the internet and the stale recommended values become a liability.
>
> Admins and users frequently set tuneables incorrectly and that results
> in bugs and support calls.
>
> It increases the size of test matrices.
>
> Adding only one of these might not result in a significant increase in
> maintenance cost, but if we allow one tuneable, then we have to allow
> all of them, and that becomes a living nightmare.

That was never a problem for any of the UNIX System V derivatives,
which all have kernel tunables loaded from /etc/system. No one ever
complained, and Linux has the same concept with sysctl.

>
> So, not as simple and low-cost as you might think to just "add a
> tuneable" in upstream. And not a sensible choice when all you need is a
> temporary adjustment for testing.
>
> Do you have a reason why, after we agree on an increase, this should
> be a setting that admins will need to lower from a default of, say,
> 4MB or more? If so, then it makes sense to consider a tuneable (or
> better, a self-tuning mechanism). For a temporary setting for the
> purpose of experimentation, writing your own patch is the better and
> less costly approach.

Testing, profiling, and performance measurements are one reason;
another is that a 4MB default might be a problem for embedded machines
with only 16MB of RAM.

So yes, I think Linux either needs a tunable or should just GIVE UP
thinking about a bigger TCP buffer size. People can always use RDMA or
other platforms if they want decent transport performance.

Ced
-- 
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur


* Re: Increase RPCSVC_MAXPAYLOAD to 8M?
  2025-03-04  6:43         ` Cedric Blancher
@ 2025-03-04 14:40           ` Chuck Lever
From: Chuck Lever @ 2025-03-04 14:40 UTC
  To: Cedric Blancher; +Cc: Linux NFS Mailing List

On 3/4/25 1:43 AM, Cedric Blancher wrote:
> On Thu, 6 Feb 2025 at 15:25, Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> On 2/6/25 3:45 AM, Cedric Blancher wrote:
>>> On Wed, 29 Jan 2025 at 16:02, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>
>>>> On 1/29/25 2:32 AM, Cedric Blancher wrote:
>>>>> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>>>>>>
>>>>>> Good morning!
>>>>>>
>>>>>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
>>>>>> giving the NFSv4.1 session mechanism some headroom for negotiation.
>>>>>> For over a decade the default value has been 1M (1*1024*1024u), which
>>>>>> causes problems with anything faster than 2500baseT.
>>>>>
>>>>> The 1MB limit was defined when 10base5/10baseT was the norm, and
>>>>> 100baseT (100mbit) was "fast".
>>>>>
>>>>> Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
>>>>> 10000baseT is fast.
>>>>> The 1MB limit now gets in the way of EVERYTHING, including "large
>>>>> send offload" and other acceleration features.
>>>>>
>>>>> So my suggestion is to increase the buffer to 4MB by default (2*2MB
>>>>> hugepages on x86), and allow a tuneable to select up to 16MB.
>>>>
>>>> TL;DR: This has been on the long-term to-do list for NFSD for quite some
>>>> time.
>>>>
>>>> We certainly want to support larger COMPOUNDs, but increasing
>>>> RPCSVC_MAXPAYLOAD is only the first step.
>>>>
>>>> The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
>>>> it has 259 entries. Quadrupling that would make the array itself
>>>> multiple pages in size, and there's one of these for each nfsd thread.
>>>>
>>>> We are working on replacing the use of page arrays with folios, which
>>>> would make this infrastructure significantly smaller and faster, but it
>>>> depends on folio support in all the kernel APIs that NFSD makes use of.
>>>> That situation continues to evolve.
>>>>
>>>> An equivalent issue exists in the Linux NFS client.
>>>>
>>>> Increasing this capability on the server without having a client that
>>>> can make use of it doesn't seem wise.
>>>>
>>>> You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and run
>>>> some measurements to help make the case (and analyze the operational
>>>> costs). I think we need some confidence that increasing the maximum
>>>> payload size will not unduly impact small I/O.
>>>>
>>>> Re: a tunable: I'm not sure why someone would want to tune this number
>>>> down from the maximum. You can control how much total memory the server
>>>> consumes by reducing the number of nfsd threads.
>>>>
>>>
>>> I want a tuneable for TESTING, i.e. a lower default (for now), but let
>>> people grab a stock Linux kernel, raise the tuneable, and do testing.
>>> Not everyone is happy doing the voodoo of self-built test kernels,
>>> even more so in the (dark) "Age Of SecureBoot", where a signed kernel
>>> is mandatory. Therefore: Tuneable.
>>
>> That's appropriate for experimentation, but not a good long-term
>> solution, and not something that should go into the upstream source
>> code.
> 
> I disagree. In the age of "secureboot enforcement", where only
> cryptographically signed kernels can be loaded on servers, how should
> the data be collected?
> 
>>
>> A tuneable in the upstream source base means the upstream community and
>> distributors have to support it for a very long time, and these are hard
>> to get rid of once they become irrelevant.
> 
> No, this tunable is very likely to stay: it defines the DEFAULT for the kernel.
> 
>>
>> We have to provide documentation. That documentation might contain
>> recommended values, and those change over time. They spread out over
>> the internet and the stale recommended values become a liability.
>>
>> Admins and users frequently set tuneables incorrectly and that results
>> in bugs and support calls.
>>
>> It increases the size of test matrices.
>>
>> Adding only one of these might not result in a significant increase in
>> maintenance cost, but if we allow one tuneable, then we have to allow
>> all of them, and that becomes a living nightmare.
> 
> That was never a problem for any of the UNIX System V derivatives,
> which all have kernel tunables loaded from /etc/system. No one ever
> complained, and Linux has the same concept with sysctl.

I think you missed my point. It's not good design to add tuneables that
/aren't/ /generally/ /useful/ -- why does the rsize/wsize maximum need
to be changed for particular deployments?

The basic approach these days is:

 - After an experimentation period, will the tuneable still be useful?

 - Can the tuneable be abused?

 - Is there a mechanism or heuristic that can enable the system to
   discover a good setting automatically?

Only after we have strong technical answers for those questions does it
make sense to add a public and documented administrative setting.

Otherwise, a tuneable simply adds complexity to the implementation, to
our documentation, and to our testing requirements. We have to demonstrate,
first and foremost, that the tuneable adds sufficient value to warrant
the costs.


>> So, not as simple and low-cost as you might think to just "add a
>> tuneable" in upstream. And not a sensible choice when all you need is a
>> temporary adjustment for testing.
>>
>> Do you have a reason why, after we agree on an increase, this should
>> be a setting that admins will need to lower from a default of, say,
>> 4MB or more? If so, then it makes sense to consider a tuneable (or
>> better, a self-tuning mechanism). For a temporary setting for the
>> purpose of experimentation, writing your own patch is the better and
>> less costly approach.
> 
> Testing, profiling, and performance measurements are one reason;
> another is that a 4MB default might be a problem for embedded
> machines with only 16MB of RAM.

We can make the internal maximum scale down automatically with
available physical memory. That doesn't require an exposed global
setting.
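
A hand-wavy sketch of such a heuristic -- the function name and the
scaling factor are invented here, nothing like this exists today:

    /* Scale the payload cap with physical memory, clamped to the
     * current 1MB value as a floor and 16MB as a ceiling.
     */
    static unsigned long nfsd_auto_max_payload(void)
    {
            unsigned long ram = totalram_pages() << PAGE_SHIFT;

            return clamp(ram >> 10, 1UL << 20, 16UL << 20);
    }

A 16MB embedded box would then sit at the 1MB floor, while a 64GB
server would get the full 16MB.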


> So yes, I think Linux either needs a tunable or should just GIVE UP
> thinking about a bigger TCP buffer size.

Let me summarize my position:

I'm OK with the long-term goal of increasing NFSD's maximum rsize/wsize.

I'm not convinced it needs to have an exposed tuneable (after a period
of experimentation). We should consider ways of setting the maximum
payload size automatically, for example.

Even if we need an exposed tuneable, I'm not convinced this needs to be
a global setting. Perhaps per export or per connection makes sense.
Let's think about this before committing to changes to the public
administrative API.

There are other development priorities that might conflict with
increasing RPCSVC_MAXPAYLOAD. There is currently an effort, for
example, to replace NFSD's send and receive buffer page arrays with
folios. That will have a direct impact on how larger rsize and wsize
are implemented, and it may avoid the need to reduce that maximum on
small-memory systems.

Lastly, there have been some suggestions about how to add a temporary
tuneable that could be made available in hardened distro-built kernels
without the baggage of an unchangeable API contract (the NFSD netlink
protocol). That might give you the ability to adjust this value
without us having to support it forever.


-- 
Chuck Lever


* Increase RPCSVC_MAXPAYLOAD to 8M, part DEUX
  2025-01-22 10:07 Increase RPCSVC_MAXPAYLOAD to 8M? Cedric Blancher
  2025-01-29  7:32 ` Cedric Blancher
@ 2025-04-07 11:34 ` Cedric Blancher
  2025-04-07 13:58   ` Chuck Lever
From: Cedric Blancher @ 2025-04-07 11:34 UTC
  To: Linux NFS Mailing List

On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>
> Good morning!
>
> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
> giving the NFSv4.1 session mechanism some headroom for negotiation.
> For over a decade the default value has been 1M (1*1024*1024u), which
> causes problems with anything faster than 2500baseT.

Chuck pointed out that the new /sys/kernel/debug/ subdir could be used
to host "experimental" tunables.

Plan:
- Add a /sys/kernel/debug/nfsd/RPCSVC_MAXPAYLOAD tunable file.
- RPCSVC_MAXPAYLOAD defaults to 4M.
- On connection start, copy the value of
  /sys/kernel/debug/nfsd/RPCSVC_MAXPAYLOAD into a private variable so
  it cannot change during the connection lifetime, because Chuck is
  worried that svc_rqst::rq_pages might blow up in our face.
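
A minimal sketch of the debugfs side -- the names are illustrative,
and I assume the nfsd debugfs directory dentry is at hand:

    static u32 nfsd_max_payload = 4 * 1024 * 1024; /* proposed default */

    void nfsd_debugfs_init(struct dentry *nfsd_dir)
    {
            /* root-writable; the value is copied aside before use so
             * it cannot change underneath a live connection */
            debugfs_create_u32("RPCSVC_MAXPAYLOAD", 0644, nfsd_dir,
                               &nfsd_max_payload);
    }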

Would that be a plan?

Ced
-- 
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur


* Re: Increase RPCSVC_MAXPAYLOAD to 8M, part DEUX
  2025-04-07 11:34 ` Increase RPCSVC_MAXPAYLOAD to 8M, part DEUX Cedric Blancher
@ 2025-04-07 13:58   ` Chuck Lever
From: Chuck Lever @ 2025-04-07 13:58 UTC
  To: Cedric Blancher, Linux NFS Mailing List

On 4/7/25 7:34 AM, Cedric Blancher wrote:
> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>>
>> Good morning!
>>
>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
>> giving the NFSv4.1 session mechanism some headroom for negotiation.
>> For over a decade the default value has been 1M (1*1024*1024u), which
>> causes problems with anything faster than 2500baseT.
> 
> Chuck pointed out that the new /sys/kernel/debug/ subdir could be used
> to host "experimental" tunables.

Indeed. That hurdle will be addressed in v6.16. The patch that adds the
new debugfs directory for NFSD is now in nfsd-next.


> Plan:
> - Add a /sys/kernel/debug/nfsd/RPCSVC_MAXPAYLOAD tunable file.
> - RPCSVC_MAXPAYLOAD defaults to 4M.
> - On connection start, copy the value of
>   /sys/kernel/debug/nfsd/RPCSVC_MAXPAYLOAD into a private variable so
>   it cannot change during the connection lifetime, because Chuck is
>   worried that svc_rqst::rq_pages might blow up in our face.
> 
> Would that be a plan?

Right now I'm still concerned about the bulky size of the rq_pages
array. rq_vec and rq_bvec are a problem as well.

maxpayload determines the size of the rq_pages array in each nfsd
thread. It's not a per-connection thing. So, the size of the array
is picked up when the threads are created. You'd have to set the
number of threads to zero, change maxpayload, then set the number
of threads to a positive integer -- at that point each of the freshly
created threads would pick up the new maxpayload value (for example).

Making the rq_pages array size dynamic has a few issues. It's in the
middle of struct svc_rqst. We might move it to the end of that
structure, or we could make rq_pages a pointer to a separate memory
allocation. That would introduce some code churn.
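
(A very rough shape of the pointer variant, assuming a per-serv limit;
sv_max_payload and this helper are hypothetical:

    static int svc_rqst_alloc_pages(struct svc_rqst *rqstp,
                                    struct svc_serv *serv)
    {
            unsigned int pages = serv->sv_max_payload / PAGE_SIZE + 2 + 1;

            rqstp->rq_pages = kcalloc(pages + 1, sizeof(struct page *),
                                      GFP_KERNEL);
            return rqstp->rq_pages ? 0 : -ENOMEM;
    }

Each thread would call something like this once at creation time,
which is also when the maxpayload value in effect gets locked in.)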

Or, we could simply allocate the maximum size (4MB worth of pages is
more than 1000 array entries) for every thread, all the time. That's
simple, but wasteful. I am a little concerned about introducing that
overhead on every NFSD operation, even for small ones.

The RPCSVC_MAXPAYLOAD macro is currently an integer constant and is used
in several places for bounds-checking during request marshaling and
unmarshaling. It would have to be replaced with a global variable; that
variable would likely be stored in svc_serv, to go along with the
rq_pages array size allocated in each thread in the serv's thread pool.


-- 
Chuck Lever


Thread overview: 9+ messages
2025-01-22 10:07 Increase RPCSVC_MAXPAYLOAD to 8M? Cedric Blancher
2025-01-29  7:32 ` Cedric Blancher
2025-01-29 15:02   ` Chuck Lever
2025-02-06  8:45     ` Cedric Blancher
2025-02-06 14:25       ` Chuck Lever
2025-03-04  6:43         ` Cedric Blancher
2025-03-04 14:40           ` Chuck Lever
2025-04-07 11:34 ` Increase RPCSVC_MAXPAYLOAD to 8M, part DEUX Cedric Blancher
2025-04-07 13:58   ` Chuck Lever
