* Increase RPCSVC_MAXPAYLOAD to 8M?
@ 2025-01-22 10:07 Cedric Blancher
  2025-01-29  7:32 ` Cedric Blancher
  2025-04-07 11:34 ` Increase RPCSVC_MAXPAYLOAD to 8M, part DEUX Cedric Blancher
  0 siblings, 2 replies; 9+ messages in thread
From: Cedric Blancher @ 2025-01-22 10:07 UTC (permalink / raw)
To: Linux NFS Mailing List

Good morning!

IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
giving the NFSv4.1 session mechanism some headroom for negotiation.
For over a decade the default value is 1M (1*1024*1024u), which IMO
causes problems with anything faster than 2500baseT.

Ced
--
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Increase RPCSVC_MAXPAYLOAD to 8M?
  2025-01-22 10:07 Increase RPCSVC_MAXPAYLOAD to 8M? Cedric Blancher
@ 2025-01-29  7:32 ` Cedric Blancher
  2025-01-29 15:02   ` Chuck Lever
  2025-04-07 11:34 ` Increase RPCSVC_MAXPAYLOAD to 8M, part DEUX Cedric Blancher
  1 sibling, 1 reply; 9+ messages in thread
From: Cedric Blancher @ 2025-01-29 7:32 UTC (permalink / raw)
To: Linux NFS Mailing List

On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>
> Good morning!
>
> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
> giving the NFSv4.1 session mechanism some headroom for negotiation.
> For over a decade the default value is 1M (1*1024*1024u), which IMO
> causes problems with anything faster than 2500baseT.

The 1MB limit was defined when 10base5/10baseT was the norm, and
100baseT (100mbit) was "fast".

Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
10000baseT is fast.
Just the 1MB limit is now in the way of EVERYTHING, including "large
send offload" and other acceleration features.

So my suggestion is to increase the buffer to 4MB by default (2*2MB
hugepages on x86), and allow a tuneable to select up to 16MB.

Ced
--
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Increase RPCSVC_MAXPAYLOAD to 8M?
  2025-01-29  7:32 ` Cedric Blancher
@ 2025-01-29 15:02 ` Chuck Lever
  2025-02-06  8:45   ` Cedric Blancher
  0 siblings, 1 reply; 9+ messages in thread
From: Chuck Lever @ 2025-01-29 15:02 UTC (permalink / raw)
To: Cedric Blancher; +Cc: Linux NFS Mailing List

On 1/29/25 2:32 AM, Cedric Blancher wrote:
> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>>
>> Good morning!
>>
>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
>> giving the NFSv4.1 session mechanism some headroom for negotiation.
>> For over a decade the default value is 1M (1*1024*1024u), which IMO
>> causes problems with anything faster than 2500baseT.
>
> The 1MB limit was defined when 10base5/10baseT was the norm, and
> 100baseT (100mbit) was "fast".
>
> Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
> 10000baseT is fast.
> Just the 1MB limit is now in the way of EVERYTHING, including "large
> send offload" and other acceleration features.
>
> So my suggestion is to increase the buffer to 4MB by default (2*2MB
> hugepages on x86), and allow a tuneable to select up to 16MB.

TL;DR: This has been on the long-term to-do list for NFSD for quite some
time.

We certainly want to support larger COMPOUNDs, but increasing
RPCSVC_MAXPAYLOAD is only the first step.

The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
it has 259 entries. Quadrupling that would make the array itself
multiple pages in size, and there's one of these for each nfsd thread.

We are working on replacing the use of page arrays with folios, which
would make this infrastructure significantly smaller and faster, but it
depends on folio support in all the kernel APIs that NFSD makes use of.
That situation continues to evolve.

An equivalent issue exists in the Linux NFS client.

Increasing this capability on the server without having a client that
can make use of it doesn't seem wise.

You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and try
some measurements to help make the case (and analyze the operational
costs). I think we need some confidence that increasing the maximum
payload size will not unduly impact small I/O.

Re: a tunable: I'm not sure why someone would want to tune this number
down from the maximum. You can control how much total memory the server
consumes by reducing the number of nfsd threads.

--
Chuck Lever

^ permalink raw reply	[flat|nested] 9+ messages in thread
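The rq_pages[] sizing concern above can be sketched with back-of-envelope arithmetic. The 259-entry figure corresponds to a 1 MB payload plus a few extra pages of protocol overhead; the constants here (4 KiB pages, 8-byte pointers, 3 overhead pages) are illustrative assumptions chosen to reproduce that figure, not values read from the kernel source.

```c
#define SKETCH_PAGE_SIZE 4096u   /* assumed x86-64 page size */
#define SKETCH_PTR_SIZE  8u      /* assumed 64-bit pointer size */
#define SKETCH_OVERHEAD  3u      /* assumed extra pages for RPC/NFS headers */

/* Number of rq_pages[] entries needed for a given maximum payload:
 * one page pointer per payload page, plus a little header overhead.
 * 1 MB => 256 + 3 = 259 entries, matching the figure quoted above. */
static unsigned int rq_entries(unsigned int max_payload)
{
    return max_payload / SKETCH_PAGE_SIZE + SKETCH_OVERHEAD;
}

/* Size in bytes of the page-pointer array itself, per nfsd thread.
 * At 1 MB the array is ~2 KB and fits in a single page; at 4 MB it
 * is ~8 KB, i.e. the array alone already spans multiple pages, once
 * for every nfsd thread. */
static unsigned int rq_array_bytes(unsigned int max_payload)
{
    return rq_entries(max_payload) * SKETCH_PTR_SIZE;
}
```

Following this arithmetic, a 16 MB maximum would put the array at roughly 32 KB per thread, which is why the discussion turns toward folios rather than simply growing the array.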
* Re: Increase RPCSVC_MAXPAYLOAD to 8M?
  2025-01-29 15:02   ` Chuck Lever
@ 2025-02-06  8:45     ` Cedric Blancher
  2025-02-06 14:25       ` Chuck Lever
  0 siblings, 1 reply; 9+ messages in thread
From: Cedric Blancher @ 2025-02-06 8:45 UTC (permalink / raw)
To: Linux NFS Mailing List

On Wed, 29 Jan 2025 at 16:02, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On 1/29/25 2:32 AM, Cedric Blancher wrote:
> > On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> >>
> >> Good morning!
> >>
> >> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
> >> giving the NFSv4.1 session mechanism some headroom for negotiation.
> >> For over a decade the default value is 1M (1*1024*1024u), which IMO
> >> causes problems with anything faster than 2500baseT.
> >
> > The 1MB limit was defined when 10base5/10baseT was the norm, and
> > 100baseT (100mbit) was "fast".
> >
> > Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
> > 10000baseT is fast.
> > Just the 1MB limit is now in the way of EVERYTHING, including "large
> > send offload" and other acceleration features.
> >
> > So my suggestion is to increase the buffer to 4MB by default (2*2MB
> > hugepages on x86), and allow a tuneable to select up to 16MB.
>
> TL;DR: This has been on the long-term to-do list for NFSD for quite some
> time.
>
> We certainly want to support larger COMPOUNDs, but increasing
> RPCSVC_MAXPAYLOAD is only the first step.
>
> The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
> it has 259 entries. Quadrupling that would make the array itself
> multiple pages in size, and there's one of these for each nfsd thread.
>
> We are working on replacing the use of page arrays with folios, which
> would make this infrastructure significantly smaller and faster, but it
> depends on folio support in all the kernel APIs that NFSD makes use of.
> That situation continues to evolve.
>
> An equivalent issue exists in the Linux NFS client.
>
> Increasing this capability on the server without having a client that
> can make use of it doesn't seem wise.
>
> You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and try
> some measurements to help make the case (and analyze the operational
> costs). I think we need some confidence that increasing the maximum
> payload size will not unduly impact small I/O.
>
> Re: a tunable: I'm not sure why someone would want to tune this number
> down from the maximum. You can control how much total memory the server
> consumes by reducing the number of nfsd threads.
>

I want a tuneable for TESTING, i.e. lower default (for now), but allow
people to grab a stock Linux kernel, increase tunable, and do testing.
Not everyone is happy with doing the voodoo of self-build testing,
even more so in the (dark) "Age Of SecureBoot", where a signed kernel
is mandatory. Therefore: Tuneable.

Ced
--
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Increase RPCSVC_MAXPAYLOAD to 8M?
  2025-02-06  8:45 ` Cedric Blancher
@ 2025-02-06 14:25 ` Chuck Lever
  2025-03-04  6:43   ` Cedric Blancher
  0 siblings, 1 reply; 9+ messages in thread
From: Chuck Lever @ 2025-02-06 14:25 UTC (permalink / raw)
To: Cedric Blancher; +Cc: Linux NFS Mailing List

On 2/6/25 3:45 AM, Cedric Blancher wrote:
> On Wed, 29 Jan 2025 at 16:02, Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> On 1/29/25 2:32 AM, Cedric Blancher wrote:
>>> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>>>>
>>>> Good morning!
>>>>
>>>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
>>>> giving the NFSv4.1 session mechanism some headroom for negotiation.
>>>> For over a decade the default value is 1M (1*1024*1024u), which IMO
>>>> causes problems with anything faster than 2500baseT.
>>>
>>> The 1MB limit was defined when 10base5/10baseT was the norm, and
>>> 100baseT (100mbit) was "fast".
>>>
>>> Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
>>> 10000baseT is fast.
>>> Just the 1MB limit is now in the way of EVERYTHING, including "large
>>> send offload" and other acceleration features.
>>>
>>> So my suggestion is to increase the buffer to 4MB by default (2*2MB
>>> hugepages on x86), and allow a tuneable to select up to 16MB.
>>
>> TL;DR: This has been on the long-term to-do list for NFSD for quite some
>> time.
>>
>> We certainly want to support larger COMPOUNDs, but increasing
>> RPCSVC_MAXPAYLOAD is only the first step.
>>
>> The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
>> it has 259 entries. Quadrupling that would make the array itself
>> multiple pages in size, and there's one of these for each nfsd thread.
>>
>> We are working on replacing the use of page arrays with folios, which
>> would make this infrastructure significantly smaller and faster, but it
>> depends on folio support in all the kernel APIs that NFSD makes use of.
>> That situation continues to evolve.
>>
>> An equivalent issue exists in the Linux NFS client.
>>
>> Increasing this capability on the server without having a client that
>> can make use of it doesn't seem wise.
>>
>> You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and try
>> some measurements to help make the case (and analyze the operational
>> costs). I think we need some confidence that increasing the maximum
>> payload size will not unduly impact small I/O.
>>
>> Re: a tunable: I'm not sure why someone would want to tune this number
>> down from the maximum. You can control how much total memory the server
>> consumes by reducing the number of nfsd threads.
>>
>
> I want a tuneable for TESTING, i.e. lower default (for now), but allow
> people to grab a stock Linux kernel, increase tunable, and do testing.
> Not everyone is happy with doing the voodoo of self-build testing,
> even more so in the (dark) "Age Of SecureBoot", where a signed kernel
> is mandatory. Therefore: Tuneable.

That's appropriate for experimentation, but not a good long-term
solution that should go into the upstream source code.

A tuneable in the upstream source base means the upstream community and
distributors have to support it for a very long time, and these are hard
to get rid of once they become irrelevant.

We have to provide documentation. That documentation might contain
recommended values, and those change over time. They spread out over
the internet and the stale recommended values become a liability.

Admins and users frequently set tuneables incorrectly and that results
in bugs and support calls.

It increases the size of test matrices.

Adding only one of these might not result in a significant increase in
maintenance cost, but if we allow one tuneable, then we have to allow
all of them, and that becomes a living nightmare.

So, not as simple and low-cost as you might think to just "add a
tuneable" in upstream. And not a sensible choice when all you need is a
temporary adjustment for testing.

Do you have a reason why, after we agree on an increase, this should
be a setting that admins will need to lower the value from a default of,
say, 4MB or more? If so, then it makes sense to consider a tuneable (or
better, a self-tuning mechanism). For a temporary setting for the
purpose of experimentation, writing your own patch is the better and
less costly approach.

--
Chuck Lever

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Increase RPCSVC_MAXPAYLOAD to 8M?
  2025-02-06 14:25 ` Chuck Lever
@ 2025-03-04  6:43 ` Cedric Blancher
  2025-03-04 14:40   ` Chuck Lever
  0 siblings, 1 reply; 9+ messages in thread
From: Cedric Blancher @ 2025-03-04 6:43 UTC (permalink / raw)
To: Linux NFS Mailing List

On Thu, 6 Feb 2025 at 15:25, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On 2/6/25 3:45 AM, Cedric Blancher wrote:
> > On Wed, 29 Jan 2025 at 16:02, Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >> On 1/29/25 2:32 AM, Cedric Blancher wrote:
> >>> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
> >>>>
> >>>> Good morning!
> >>>>
> >>>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
> >>>> giving the NFSv4.1 session mechanism some headroom for negotiation.
> >>>> For over a decade the default value is 1M (1*1024*1024u), which IMO
> >>>> causes problems with anything faster than 2500baseT.
> >>>
> >>> The 1MB limit was defined when 10base5/10baseT was the norm, and
> >>> 100baseT (100mbit) was "fast".
> >>>
> >>> Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
> >>> 10000baseT is fast.
> >>> Just the 1MB limit is now in the way of EVERYTHING, including "large
> >>> send offload" and other acceleration features.
> >>>
> >>> So my suggestion is to increase the buffer to 4MB by default (2*2MB
> >>> hugepages on x86), and allow a tuneable to select up to 16MB.
> >>
> >> TL;DR: This has been on the long-term to-do list for NFSD for quite some
> >> time.
> >>
> >> We certainly want to support larger COMPOUNDs, but increasing
> >> RPCSVC_MAXPAYLOAD is only the first step.
> >>
> >> The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
> >> it has 259 entries. Quadrupling that would make the array itself
> >> multiple pages in size, and there's one of these for each nfsd thread.
> >>
> >> We are working on replacing the use of page arrays with folios, which
> >> would make this infrastructure significantly smaller and faster, but it
> >> depends on folio support in all the kernel APIs that NFSD makes use of.
> >> That situation continues to evolve.
> >>
> >> An equivalent issue exists in the Linux NFS client.
> >>
> >> Increasing this capability on the server without having a client that
> >> can make use of it doesn't seem wise.
> >>
> >> You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and try
> >> some measurements to help make the case (and analyze the operational
> >> costs). I think we need some confidence that increasing the maximum
> >> payload size will not unduly impact small I/O.
> >>
> >> Re: a tunable: I'm not sure why someone would want to tune this number
> >> down from the maximum. You can control how much total memory the server
> >> consumes by reducing the number of nfsd threads.
> >>
> >
> > I want a tuneable for TESTING, i.e. lower default (for now), but allow
> > people to grab a stock Linux kernel, increase tunable, and do testing.
> > Not everyone is happy with doing the voodoo of self-build testing,
> > even more so in the (dark) "Age Of SecureBoot", where a signed kernel
> > is mandatory. Therefore: Tuneable.
>
> That's appropriate for experimentation, but not a good long-term
> solution that should go into the upstream source code.

I disagree. How should - in the age of "secureboot enforcement", which
implies that only kernels with cryptographic signatures can be loaded
on servers - data be collected?

>
> A tuneable in the upstream source base means the upstream community and
> distributors have to support it for a very long time, and these are hard
> to get rid of once they become irrelevant.

No, this tunable is very likely to stay. It defines the DEFAULT for the
kernel.

>
> We have to provide documentation. That documentation might contain
> recommended values, and those change over time. They spread out over
> the internet and the stale recommended values become a liability.
>
> Admins and users frequently set tuneables incorrectly and that results
> in bugs and support calls.
>
> It increases the size of test matrices.
>
> Adding only one of these might not result in a significant increase in
> maintenance cost, but if we allow one tuneable, then we have to allow
> all of them, and that becomes a living nightmare.

That never ever was a problem for any of the UNIX System V
derivatives, which all have kernel tunables loaded from /etc/system.
No one ever complained, and Linux has the same concept with sysctl.

>
> So, not as simple and low-cost as you might think to just "add a
> tuneable" in upstream. And not a sensible choice when all you need is a
> temporary adjustment for testing.
>
> Do you have a reason why, after we agree on an increase, this should
> be a setting that admins will need to lower the value from a default of,
> say, 4MB or more? If so, then it makes sense to consider a tuneable (or
> better, a self-tuning mechanism). For a temporary setting for the
> purpose of experimentation, writing your own patch is the better and
> less costly approach.

Testing, profiling, performance measurements, and a 4M default might
be a problem for embedded machines with only 16MB.

So yes, I think Linux either needs a tunable, or just GIVE UP thinking
about a bigger TCP buffer size. People can always use RDMA or other
platforms if they want decent transport performance.

Ced
--
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Increase RPCSVC_MAXPAYLOAD to 8M?
  2025-03-04  6:43 ` Cedric Blancher
@ 2025-03-04 14:40 ` Chuck Lever
  0 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2025-03-04 14:40 UTC (permalink / raw)
To: Cedric Blancher; +Cc: Linux NFS Mailing List

On 3/4/25 1:43 AM, Cedric Blancher wrote:
> On Thu, 6 Feb 2025 at 15:25, Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> On 2/6/25 3:45 AM, Cedric Blancher wrote:
>>> On Wed, 29 Jan 2025 at 16:02, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>
>>>> On 1/29/25 2:32 AM, Cedric Blancher wrote:
>>>>> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>>>>>>
>>>>>> Good morning!
>>>>>>
>>>>>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
>>>>>> giving the NFSv4.1 session mechanism some headroom for negotiation.
>>>>>> For over a decade the default value is 1M (1*1024*1024u), which IMO
>>>>>> causes problems with anything faster than 2500baseT.
>>>>>
>>>>> The 1MB limit was defined when 10base5/10baseT was the norm, and
>>>>> 100baseT (100mbit) was "fast".
>>>>>
>>>>> Nowadays 1000baseT is the norm, 2500baseT is in premium *laptops*, and
>>>>> 10000baseT is fast.
>>>>> Just the 1MB limit is now in the way of EVERYTHING, including "large
>>>>> send offload" and other acceleration features.
>>>>>
>>>>> So my suggestion is to increase the buffer to 4MB by default (2*2MB
>>>>> hugepages on x86), and allow a tuneable to select up to 16MB.
>>>>
>>>> TL;DR: This has been on the long-term to-do list for NFSD for quite some
>>>> time.
>>>>
>>>> We certainly want to support larger COMPOUNDs, but increasing
>>>> RPCSVC_MAXPAYLOAD is only the first step.
>>>>
>>>> The biggest obstacle is the rq_pages[] array in struct svc_rqst. Today
>>>> it has 259 entries. Quadrupling that would make the array itself
>>>> multiple pages in size, and there's one of these for each nfsd thread.
>>>>
>>>> We are working on replacing the use of page arrays with folios, which
>>>> would make this infrastructure significantly smaller and faster, but it
>>>> depends on folio support in all the kernel APIs that NFSD makes use of.
>>>> That situation continues to evolve.
>>>>
>>>> An equivalent issue exists in the Linux NFS client.
>>>>
>>>> Increasing this capability on the server without having a client that
>>>> can make use of it doesn't seem wise.
>>>>
>>>> You can try increasing the value of RPCSVC_MAXPAYLOAD yourself and try
>>>> some measurements to help make the case (and analyze the operational
>>>> costs). I think we need some confidence that increasing the maximum
>>>> payload size will not unduly impact small I/O.
>>>>
>>>> Re: a tunable: I'm not sure why someone would want to tune this number
>>>> down from the maximum. You can control how much total memory the server
>>>> consumes by reducing the number of nfsd threads.
>>>>
>>>
>>> I want a tuneable for TESTING, i.e. lower default (for now), but allow
>>> people to grab a stock Linux kernel, increase tunable, and do testing.
>>> Not everyone is happy with doing the voodoo of self-build testing,
>>> even more so in the (dark) "Age Of SecureBoot", where a signed kernel
>>> is mandatory. Therefore: Tuneable.
>>
>> That's appropriate for experimentation, but not a good long-term
>> solution that should go into the upstream source code.
>
> I disagree. How should - in the age of "secureboot enforcement", which
> implies that only kernels with cryptographic signatures can be loaded
> on servers - data be collected?
>
>>
>> A tuneable in the upstream source base means the upstream community and
>> distributors have to support it for a very long time, and these are hard
>> to get rid of once they become irrelevant.
>
> No, this tunable is very likely to stay. It defines the DEFAULT for the kernel
>
>>
>> We have to provide documentation. That documentation might contain
>> recommended values, and those change over time. They spread out over
>> the internet and the stale recommended values become a liability.
>>
>> Admins and users frequently set tuneables incorrectly and that results
>> in bugs and support calls.
>>
>> It increases the size of test matrices.
>>
>> Adding only one of these might not result in a significant increase in
>> maintenance cost, but if we allow one tuneable, then we have to allow
>> all of them, and that becomes a living nightmare.
>
> That never ever was a problem for any of the UNIX System V
> derivatives, which all have kernel tunables loaded from /etc/system.
> No one ever complained, and Linux has the same concept with sysctl

I think you missed my point. It's not good design to add tuneables that
/aren't/ /generally/ /useful/ -- why does the rsize/wsize maximum need
to be changed for particular deployments?

The basic approach these days is:

- After an experimentation period, will the tuneable still be useful?
- Can the tuneable be abused?
- Is there a mechanism or heuristic that can enable the system to
  discover a good setting automatically?

Only after we have strong technical answers for those questions does it
make sense to add a public and documented administrative setting.
Otherwise, a tuneable simply adds complexity to our implementation, our
documentation, and our testing requirements. We have to demonstrate,
first and foremost, that the tuneable adds sufficient value to warrant
the costs.

>> So, not as simple and low-cost as you might think to just "add a
>> tuneable" in upstream. And not a sensible choice when all you need is a
>> temporary adjustment for testing.
>>
>> Do you have a reason why, after we agree on an increase, this should
>> be a setting that admins will need to lower the value from a default of,
>> say, 4MB or more? If so, then it makes sense to consider a tuneable (or
>> better, a self-tuning mechanism). For a temporary setting for the
>> purpose of experimentation, writing your own patch is the better and
>> less costly approach.
>
> Testing, profiling, performance measurements, and a 4M default might
> be a problem for embedded machines with only 16MB.

We can make the internal maximum scale down automatically with
available physical memory. That doesn't require an exposed global
setting.

> So yes, I think Linux either needs a tunable, or just GIVE UP thinking
> about a bigger TCP buffer size.

Let me summarize my position:

I'm OK with the long-term goal of increasing NFSD's maximum
rsize/wsize.

I'm not convinced it needs to have an exposed tuneable (after a period
of experimentation). We should consider ways of setting the maximum
payload size automatically, for example.

Even if we need an exposed tuneable, I'm not convinced this needs to be
a global setting. Perhaps per export or per connection makes sense.
Let's think about this before committing to changes to the public
administrative API.

There are other development priorities that might conflict with
increasing RPCSVC_MAXPAYLOAD. There is currently an effort, for
example, to replace NFSD's send and receive buffer page arrays with
folios. That will have a direct impact on how larger rsize and wsize is
implemented, and it may avoid the need for reducing that maximum on
small-memory systems.

Lastly, there have been some suggestions about how to add a temporary
tuneable that could be made available in hardened distro-built kernels
without the baggage of an unchangeable API contract (the NFSD netlink
protocol). That might give you the ability to adjust this value without
us having to support it forever.

--
Chuck Lever

^ permalink raw reply	[flat|nested] 9+ messages in thread
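The idea of scaling the internal maximum down with available physical memory can be sketched as a simple clamp. This is a hypothetical heuristic for illustration only: the 1/4096 RAM ratio, the 1 MB floor, and the 16 MB ceiling are all assumptions, not the kernel's actual policy, and `scaled_max_payload` is not a real kernel symbol.

```c
/* Hypothetical heuristic: pick a maximum payload between a 1 MB floor
 * and a 16 MB ceiling, budgeting roughly 1/4096th of physical RAM for
 * a per-thread buffer. All constants are illustrative assumptions. */
static unsigned long scaled_max_payload(unsigned long totalram_bytes)
{
    unsigned long floor  = 1ul << 20;          /* 1 MB  */
    unsigned long ceil   = 1ul << 24;          /* 16 MB */
    unsigned long target = totalram_bytes / 4096;

    if (target < floor)
        return floor;                          /* e.g. a 16 MB embedded box */
    if (target > ceil)
        return ceil;                           /* big-memory servers */

    /* Round down to a power of two so the value maps cleanly onto
     * whole pages: repeatedly clear the lowest set bit. */
    while (target & (target - 1))
        target &= target - 1;
    return target;
}
```

Under these assumptions a 16 MB embedded machine stays at the 1 MB floor, an 8 GB server gets 2 MB, and only large-memory machines reach the ceiling, which addresses the small-memory concern without a global knob.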
* Increase RPCSVC_MAXPAYLOAD to 8M, part DEUX
  2025-01-22 10:07 Increase RPCSVC_MAXPAYLOAD to 8M? Cedric Blancher
  2025-01-29  7:32 ` Cedric Blancher
@ 2025-04-07 11:34 ` Cedric Blancher
  2025-04-07 13:58   ` Chuck Lever
  1 sibling, 1 reply; 9+ messages in thread
From: Cedric Blancher @ 2025-04-07 11:34 UTC (permalink / raw)
To: Linux NFS Mailing List

On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>
> Good morning!
>
> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
> giving the NFSv4.1 session mechanism some headroom for negotiation.
> For over a decade the default value is 1M (1*1024*1024u), which IMO
> causes problems with anything faster than 2500baseT.

Chuck pointed out that the new /sys/kernel/debug/ subdir could be used
to host "experimental" tunables.

Plan:
Add a /sys/kernel/debug/nfsd/RPCSVC_MAXPAYLOAD tunable file,
RPCSVC_MAXPAYLOAD defaults to 4M, on connection start it copies the
value of /sys/kernel/debug/nfsd/RPCSVC_MAXPAYLOAD into a private
variable to make it unchangeable during connection lifetime, because
Chuck is worried that svc_rqst::rq_pages might blow up in our face.

Would that be a plan?

Ced
--
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Increase RPCSVC_MAXPAYLOAD to 8M, part DEUX
  2025-04-07 11:34 ` Increase RPCSVC_MAXPAYLOAD to 8M, part DEUX Cedric Blancher
@ 2025-04-07 13:58 ` Chuck Lever
  0 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2025-04-07 13:58 UTC (permalink / raw)
To: Cedric Blancher, Linux NFS Mailing List

On 4/7/25 7:34 AM, Cedric Blancher wrote:
> On Wed, 22 Jan 2025 at 11:07, Cedric Blancher <cedric.blancher@gmail.com> wrote:
>>
>> Good morning!
>>
>> IMO it might be good to increase RPCSVC_MAXPAYLOAD to at least 8M,
>> giving the NFSv4.1 session mechanism some headroom for negotiation.
>> For over a decade the default value is 1M (1*1024*1024u), which IMO
>> causes problems with anything faster than 2500baseT.
>
> Chuck pointed out that the new /sys/kernel/debug/ subdir could be used
> to host "experimental" tunables.

Indeed. That hurdle will be addressed in v6.16. The patch that adds the
new debugfs directory for NFSD is now in nfsd-next.

> Plan:
> Add a /sys/kernel/debug/nfsd/RPCSVC_MAXPAYLOAD tunable file,
> RPCSVC_MAXPAYLOAD defaults to 4M, on connection start it copies the
> value of /sys/kernel/debug/nfsd/RPCSVC_MAXPAYLOAD into a private
> variable to make it unchangeable during connection lifetime, because
> Chuck is worried that svc_rqst::rq_pages might blow up in our face.
>
> Would that be a plan?

Right now I'm still concerned about the bulky size of the rq_pages
array. rq_vec and rq_bvec are a problem as well.

maxpayload determines the size of the rq_pages array in each nfsd
thread. It's not a per-connection thing. So, the size of the array is
picked up when the threads are created. You'd have to set the number of
threads to zero, change maxpayload, then set the number of threads to a
positive integer -- at that point each of the freshly created threads
would pick up the new maxpayload value (for example).

Making the rq_pages array size dynamic has a few issues. It's in the
middle of struct svc_rqst. We might move it to the end of that
structure, or we could make rq_pages a pointer to a separate memory
allocation. That would introduce some code churn.

Or, we could simply allocate the maximum size (4MB worth of pages is
more than 1000 array entries) for every thread, all the time. That's
simple, but wasteful. I am a little concerned about introducing that
overhead on every NFSD operation, even for small ones.

The RPCSVC_MAXPAYLOAD macro is currently an integer constant and is
used in several places for bounds-checking during request marshaling
and unmarshaling. It would have to be replaced with a global variable;
that variable would likely be stored in svc_serv, to go along with the
rq_pages array size allocated in each thread in the serv's thread pool.

--
Chuck Lever

^ permalink raw reply	[flat|nested] 9+ messages in thread
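The "snapshot at thread-creation time" model described above can be sketched in a few lines: a writable setting (such as one backed by a debugfs file) is only picked up when the thread count goes from zero to a positive value, and stays frozen while threads exist. The names `svc_serv_sketch`, `set_thread_count`, and `max_payload_setting` are illustrative stand-ins, not the actual kernel symbols.

```c
#define SKETCH_PAGE_SIZE 4096u

/* Writable knob, e.g. the value behind a debugfs file. Illustrative. */
static unsigned int max_payload_setting = 4u << 20;   /* default 4 MB */

/* Per-service state: the value actually used to size rq_pages. */
struct svc_serv_sketch {
    unsigned int sv_nrthreads;
    unsigned int sv_max_payload;  /* snapshot, fixed while threads run */
};

/* Going from zero threads to a positive count picks up the current
 * setting; while threads exist, the snapshot stays frozen, so nothing
 * can change under a running thread's buffer layout. */
static void set_thread_count(struct svc_serv_sketch *serv, unsigned int n)
{
    if (serv->sv_nrthreads == 0 && n > 0)
        serv->sv_max_payload = max_payload_setting;
    serv->sv_nrthreads = n;
}

/* rq_pages[] entries each thread would allocate for this service;
 * the +3 header overhead is an assumption, not a kernel constant. */
static unsigned int rq_pages_entries(const struct svc_serv_sketch *serv)
{
    return serv->sv_max_payload / SKETCH_PAGE_SIZE + 3;
}
```

Operationally this matches the sequence described above: bring the thread count to zero, write the new maxpayload, then restart the threads, at which point the freshly created threads size their arrays from the new snapshot.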
end of thread, other threads:[~2025-04-07 13:58 UTC | newest]