* [PATCH] nfsd: don't fail OP_SETCLIENTID when there are lots of clients.

From: NeilBrown @ 2024-04-22  2:09 UTC
To: Chuck Lever, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: Petr Vorel, linux-nfs

The calculation of how many clients the nfs server can manage is only a
heuristic.  Triggering the laundromat to clean up old clients when we
have more than the heuristic limit is valid, but refusing to create new
clients is not.  Client creation should only fail if there really isn't
enough memory available.

This is not known to have caused a problem in production use, but
testing of lots of clients reports an error and it is not clear that
this error is justified.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfsd/nfs4state.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index daf83823ba48..8a40bb6a4a67 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -2223,10 +2223,9 @@ static struct nfs4_client *alloc_client(struct xdr_netobj name,
 	struct nfs4_client *clp;
 	int i;
 
-	if (atomic_read(&nn->nfs4_client_count) >= nn->nfs4_max_clients) {
+	if (atomic_read(&nn->nfs4_client_count) >= nn->nfs4_max_clients)
 		mod_delayed_work(laundry_wq, &nn->laundromat_work, 0);
-		return NULL;
-	}
+
 	clp = kmem_cache_zalloc(client_slab, GFP_KERNEL);
 	if (clp == NULL)
 		return NULL;
-- 
2.44.0
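[For context: the limit being tested here scales as roughly 1024 NFSv4 clients per 1GB of system RAM, per commit 4271c2c08875 cited later in the thread. The sketch below only illustrates that arithmetic — the function name and the 64-client floor are illustrative assumptions, not the kernel's exact code.]

```python
def approx_nfs4_max_clients(ram_bytes: int, clients_per_gb: int = 1024,
                            floor: int = 64) -> int:
    """Approximate nn->nfs4_max_clients: ~1024 clients per 1GB of RAM.

    Sketch of the heuristic from commit 4271c2c08875 ("NFSD: limit the
    number of v4 clients to 1024 per 1GB of system memory"); the
    64-client floor here is an illustrative assumption only.
    """
    return max(floor, ram_bytes * clients_per_gb // (1 << 30))

# A 1GB server is capped near 1024 clients, a 4GB server near 4096.
print(approx_nfs4_max_clients(1 << 30))   # 1024
print(approx_nfs4_max_clients(4 << 30))   # 4096
```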
* Re: [PATCH] nfsd: don't fail OP_SETCLIENTID when there are lots of clients.

From: Petr Vorel @ 2024-04-22  5:00 UTC
To: NeilBrown
Cc: Chuck Lever, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs

Hi all,

> The calculation of how many clients the nfs server can manage is only a
> heuristic.  Triggering the laundromat to clean up old clients when we
> have more than the heuristic limit is valid, but refusing to create new
> clients is not.  Client creation should only fail if there really isn't
> enough memory available.

> This is not known to have caused a problem in production use, but
> testing of lots of clients reports an error and it is not clear that
> this error is justified.

> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/nfs4state.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)

> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index daf83823ba48..8a40bb6a4a67 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -2223,10 +2223,9 @@ static struct nfs4_client *alloc_client(struct xdr_netobj name,
>  	struct nfs4_client *clp;
>  	int i;

> -	if (atomic_read(&nn->nfs4_client_count) >= nn->nfs4_max_clients) {
> +	if (atomic_read(&nn->nfs4_client_count) >= nn->nfs4_max_clients)
>  		mod_delayed_work(laundry_wq, &nn->laundromat_work, 0);
> -		return NULL;
> -	}

LGTM.

Reviewed-by: Petr Vorel <pvorel@suse.cz>

Kind regards,
Petr

> +
>  	clp = kmem_cache_zalloc(client_slab, GFP_KERNEL);
>  	if (clp == NULL)
>  		return NULL;
* Re: [PATCH] nfsd: don't fail OP_SETCLIENTID when there are lots of clients.

From: Chuck Lever @ 2024-04-22 13:34 UTC
To: NeilBrown
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Petr Vorel, linux-nfs

On Mon, Apr 22, 2024 at 12:09:19PM +1000, NeilBrown wrote:
> The calculation of how many clients the nfs server can manage is only a
> heuristic.  Triggering the laundromat to clean up old clients when we
> have more than the heuristic limit is valid, but refusing to create new
> clients is not.  Client creation should only fail if there really isn't
> enough memory available.
>
> This is not known to have caused a problem in production use, but
> testing of lots of clients reports an error and it is not clear that
> this error is justified.

It is justified, see 4271c2c08875 ("NFSD: limit the number of v4
clients to 1024 per 1GB of system memory").  In cases like these,
the recourse is to add more memory to the test system.

However, that commit claims that the client is told to retry; I
don't expect client creation to fail outright.  Can you describe the
failure mode you see?

Meanwhile, we need to have broader and more regular testing of NFSD
on memory-starved systems.  That's a long-term project.

> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/nfs4state.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index daf83823ba48..8a40bb6a4a67 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -2223,10 +2223,9 @@ static struct nfs4_client *alloc_client(struct xdr_netobj name,
>  	struct nfs4_client *clp;
>  	int i;
>
> -	if (atomic_read(&nn->nfs4_client_count) >= nn->nfs4_max_clients) {
> +	if (atomic_read(&nn->nfs4_client_count) >= nn->nfs4_max_clients)
>  		mod_delayed_work(laundry_wq, &nn->laundromat_work, 0);
> -		return NULL;
> -	}
> +
>  	clp = kmem_cache_zalloc(client_slab, GFP_KERNEL);
>  	if (clp == NULL)
>  		return NULL;
> --
> 2.44.0

-- 
Chuck Lever
* Re: [PATCH] nfsd: don't fail OP_SETCLIENTID when there are lots of clients.

From: NeilBrown @ 2024-04-22 23:33 UTC
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Petr Vorel, linux-nfs

On Mon, 22 Apr 2024, Chuck Lever wrote:
> On Mon, Apr 22, 2024 at 12:09:19PM +1000, NeilBrown wrote:
> > The calculation of how many clients the nfs server can manage is only a
> > heuristic.  Triggering the laundromat to clean up old clients when we
> > have more than the heuristic limit is valid, but refusing to create new
> > clients is not.  Client creation should only fail if there really isn't
> > enough memory available.
> >
> > This is not known to have caused a problem in production use, but
> > testing of lots of clients reports an error and it is not clear that
> > this error is justified.
>
> It is justified, see 4271c2c08875 ("NFSD: limit the number of v4
> clients to 1024 per 1GB of system memory").  In cases like these,
> the recourse is to add more memory to the test system.

Does each client really need 1MB?
Obviously we don't want all memory to be used by client state....

> However, that commit claims that the client is told to retry; I
> don't expect client creation to fail outright.  Can you describe the
> failure mode you see?

The failure mode is repeated client retries that never succeed.  I think
an outright failure would be preferable - it would make it more clear
that memory must be added.

The server has N active clients and M courtesy clients.
Triggering reclaim when N+M exceeds a limit and M>0 makes sense.
A hard failure (NFS4ERR_RESOURCE) when N exceeds a limit makes sense.
A soft failure (NFS4ERR_DELAY) while reclaim is running makes sense.

I don't think a retry while N exceeds the limit makes sense.

Do we have a count of active vs courtesy clients?

> Meanwhile, we need to have broader and more regular testing of NFSD
> on memory-starved systems.  That's a long-term project.

:-)

Thanks,
NeilBrown

> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/nfsd/nfs4state.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index daf83823ba48..8a40bb6a4a67 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -2223,10 +2223,9 @@ static struct nfs4_client *alloc_client(struct xdr_netobj name,
> >  	struct nfs4_client *clp;
> >  	int i;
> >
> > -	if (atomic_read(&nn->nfs4_client_count) >= nn->nfs4_max_clients) {
> > +	if (atomic_read(&nn->nfs4_client_count) >= nn->nfs4_max_clients)
> >  		mod_delayed_work(laundry_wq, &nn->laundromat_work, 0);
> > -		return NULL;
> > -	}
> > +
> >  	clp = kmem_cache_zalloc(client_slab, GFP_KERNEL);
> >  	if (clp == NULL)
> >  		return NULL;
> > --
> > 2.44.0
>
> --
> Chuck Lever
* Re: [PATCH] nfsd: don't fail OP_SETCLIENTID when there are lots of clients.

From: Chuck Lever III @ 2024-04-23 13:15 UTC
To: Neil Brown, Dai Ngo
Cc: Jeff Layton, Olga Kornievskaia, Tom Talpey, Petr Vorel, Linux NFS Mailing List

> On Apr 22, 2024, at 7:34 PM, NeilBrown <neilb@suse.de> wrote:
>
> On Mon, 22 Apr 2024, Chuck Lever wrote:
>>> On Mon, Apr 22, 2024 at 12:09:19PM +1000, NeilBrown wrote:
>>> The calculation of how many clients the nfs server can manage is only a
>>> heuristic.  Triggering the laundromat to clean up old clients when we
>>> have more than the heuristic limit is valid, but refusing to create new
>>> clients is not.  Client creation should only fail if there really isn't
>>> enough memory available.
>>>
>>> This is not known to have caused a problem in production use, but
>>> testing of lots of clients reports an error and it is not clear that
>>> this error is justified.
>>
>> It is justified, see 4271c2c08875 ("NFSD: limit the number of v4
>> clients to 1024 per 1GB of system memory").  In cases like these,
>> the recourse is to add more memory to the test system.
>
> Does each client really need 1MB?
> Obviously we don't want all memory to be used by client state....
>
>> However, that commit claims that the client is told to retry; I
>> don't expect client creation to fail outright.  Can you describe the
>> failure mode you see?
>
> The failure mode is repeated client retries that never succeed.  I think
> an outright failure would be preferable - it would make it more clear
> that memory must be added.
>
> The server has N active clients and M courtesy clients.
> Triggering reclaim when N+M exceeds a limit and M>0 makes sense.
> A hard failure (NFS4ERR_RESOURCE) when N exceeds a limit makes sense.
> A soft failure (NFS4ERR_DELAY) while reclaim is running makes sense.
>
> I don't think a retry while N exceeds the limit makes sense.

It’s not optimal, I agree.

NFSD has to limit the total number of active and courtesy
clients, because otherwise it would be subject to an easy
(d)DoS attack, which Dai demonstrated to me before I
accepted his patch.  A malicious actor or broken clients
can continue to create leases on the server until it stops
responding.

I think failing outright would accomplish the mitigation
as well as delaying does, but delaying once or twice
gives some slack that allows a mount attempt to succeed
eventually even when the server temporarily exceeds the
maximum client count.

Also IMO there could be a rate-limited pr_warn on the
server that fires to indicate when a low-memory situation
has been reached.

The problem with NFS4ERR_RESOURCE, however, is that
NFSv4.1 and newer do not have that status code.  All
versions of NFS have DELAY/JUKEBOX.

I recognize that you are tweaking only SETCLIENTID here,
but I think behavior should be consistent for all minor
versions of NFSv4.

> Do we have a count of active vs courtesy clients?

Dai can correct me if I’m wrong, but I believe NFSD
maintains a count of both.

But only the active leases really matter, because
courtesy clients can be dropped as memory becomes tight.
Dropping an active lease would be somewhat more
catastrophic.

—
Chuck Lever
* Re: [PATCH] nfsd: don't fail OP_SETCLIENTID when there are lots of clients.

From: Dai Ngo @ 2024-04-23 18:02 UTC
To: Chuck Lever III, Neil Brown
Cc: Jeff Layton, Olga Kornievskaia, Tom Talpey, Petr Vorel, Linux NFS Mailing List

On 4/23/24 6:15 AM, Chuck Lever III wrote:
>> On Apr 22, 2024, at 7:34 PM, NeilBrown <neilb@suse.de> wrote:
>>
>> On Mon, 22 Apr 2024, Chuck Lever wrote:
>>>> On Mon, Apr 22, 2024 at 12:09:19PM +1000, NeilBrown wrote:
>>>> The calculation of how many clients the nfs server can manage is only a
>>>> heuristic.  Triggering the laundromat to clean up old clients when we
>>>> have more than the heuristic limit is valid, but refusing to create new
>>>> clients is not.  Client creation should only fail if there really isn't
>>>> enough memory available.
>>>>
>>>> This is not known to have caused a problem in production use, but
>>>> testing of lots of clients reports an error and it is not clear that
>>>> this error is justified.
>>> It is justified, see 4271c2c08875 ("NFSD: limit the number of v4
>>> clients to 1024 per 1GB of system memory").  In cases like these,
>>> the recourse is to add more memory to the test system.
>> Does each client really need 1MB?
>> Obviously we don't want all memory to be used by client state....
>>
>>> However, that commit claims that the client is told to retry; I
>>> don't expect client creation to fail outright.  Can you describe the
>>> failure mode you see?
>> The failure mode is repeated client retries that never succeed.  I think
>> an outright failure would be preferable - it would make it more clear
>> that memory must be added.
>>
>> The server has N active clients and M courtesy clients.
>> Triggering reclaim when N+M exceeds a limit and M>0 makes sense.
>> A hard failure (NFS4ERR_RESOURCE) when N exceeds a limit makes sense.
>> A soft failure (NFS4ERR_DELAY) while reclaim is running makes sense.
>>
>> I don't think a retry while N exceeds the limit makes sense.
> It’s not optimal, I agree.
>
> NFSD has to limit the total number of active and courtesy
> clients, because otherwise it would be subject to an easy
> (d)DoS attack, which Dai demonstrated to me before I
> accepted his patch.  A malicious actor or broken clients
> can continue to create leases on the server until it stops
> responding.
>
> I think failing outright would accomplish the mitigation
> as well as delaying does, but delaying once or twice
> gives some slack that allows a mount attempt to succeed
> eventually even when the server temporarily exceeds the
> maximum client count.
>
> Also IMO there could be a rate-limited pr_warn on the
> server that fires to indicate when a low-memory situation
> has been reached.
>
> The problem with NFS4ERR_RESOURCE, however, is that
> NFSv4.1 and newer do not have that status code.  All
> versions of NFS have DELAY/JUKEBOX.
>
> I recognize that you are tweaking only SETCLIENTID here,
> but I think behavior should be consistent for all minor
> versions of NFSv4.
>
>> Do we have a count of active vs courtesy clients?
> Dai can correct me if I’m wrong, but I believe NFSD
> maintains a count of both.

NFSD maintains both counts for active clients, nfs4_client_count,
and courtesy clients, nfsd_courtesy_clients.  However the 'real'
active client count is (nfs4_client_count - nfsd_courtesy_clients).

> But only the active leases really matter, because
> courtesy clients can be dropped as memory becomes tight.

Yes, when the NFSD shrinker is activated it calls
courtesy_client_reaper to remove courtesy clients.

-Dai

> Dropping an active lease would be somewhat more
> catastrophic.
>
> —
> Chuck Lever
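[Dai's bookkeeping above amounts to simple arithmetic; the sketch below just restates it. The counter names are taken from the discussion; the function itself is illustrative, not kernel code.]

```python
def real_active_clients(nfs4_client_count: int,
                        nfsd_courtesy_clients: int) -> int:
    """Per Dai: nfs4_client_count tracks all clients (active + courtesy),
    so the 'real' active count is the total minus the courtesy clients."""
    return nfs4_client_count - nfsd_courtesy_clients

# e.g. 1200 tracked clients of which 300 are courtesy => 900 active leases
print(real_active_clients(1200, 300))  # 900
```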
* Re: [PATCH] nfsd: don't fail OP_SETCLIENTID when there are lots of clients.

From: NeilBrown @ 2024-04-25  0:08 UTC
To: Chuck Lever III
Cc: Dai Ngo, Jeff Layton, Olga Kornievskaia, Tom Talpey, Petr Vorel, Linux NFS Mailing List

On Tue, 23 Apr 2024, Chuck Lever III wrote:
>
> > On Apr 22, 2024, at 7:34 PM, NeilBrown <neilb@suse.de> wrote:
> >
> > On Mon, 22 Apr 2024, Chuck Lever wrote:
> >>> On Mon, Apr 22, 2024 at 12:09:19PM +1000, NeilBrown wrote:
> >>> The calculation of how many clients the nfs server can manage is only a
> >>> heuristic.  Triggering the laundromat to clean up old clients when we
> >>> have more than the heuristic limit is valid, but refusing to create new
> >>> clients is not.  Client creation should only fail if there really isn't
> >>> enough memory available.
> >>>
> >>> This is not known to have caused a problem in production use, but
> >>> testing of lots of clients reports an error and it is not clear that
> >>> this error is justified.
> >>
> >> It is justified, see 4271c2c08875 ("NFSD: limit the number of v4
> >> clients to 1024 per 1GB of system memory").  In cases like these,
> >> the recourse is to add more memory to the test system.
> >
> > Does each client really need 1MB?
> > Obviously we don't want all memory to be used by client state....
> >
> >> However, that commit claims that the client is told to retry; I
> >> don't expect client creation to fail outright.  Can you describe the
> >> failure mode you see?
> >
> > The failure mode is repeated client retries that never succeed.  I think
> > an outright failure would be preferable - it would make it more clear
> > that memory must be added.
> >
> > The server has N active clients and M courtesy clients.
> > Triggering reclaim when N+M exceeds a limit and M>0 makes sense.
> > A hard failure (NFS4ERR_RESOURCE) when N exceeds a limit makes sense.
> > A soft failure (NFS4ERR_DELAY) while reclaim is running makes sense.
> >
> > I don't think a retry while N exceeds the limit makes sense.
>
> It’s not optimal, I agree.
>
> NFSD has to limit the total number of active and courtesy
> clients, because otherwise it would be subject to an easy
> (d)DoS attack, which Dai demonstrated to me before I
> accepted his patch.  A malicious actor or broken clients
> can continue to create leases on the server until it stops
> responding.
>
> I think failing outright would accomplish the mitigation
> as well as delaying does, but delaying once or twice
> gives some slack that allows a mount attempt to succeed
> eventually even when the server temporarily exceeds the
> maximum client count.

I doubt that the set of active clients is so dynamic that it is worth
waiting in case some client goes away soon.  If we hit the limit then we
probably already have more clients than we can reasonably handle and it
is time to indicate failure.

> Also IMO there could be a rate-limited pr_warn on the
> server that fires to indicate when a low-memory situation
> has been reached.

Yes, server side warnings would be a good idea.

> The problem with NFS4ERR_RESOURCE, however, is that
> NFSv4.1 and newer do not have that status code.  All
> versions of NFS have DELAY/JUKEBOX.

I didn't realise that.  Lots of code in nfs4xdr.c returns
nfserr_resource.  For v4.1 it appears to get translated to
nfserr_rep_too_big_to_cache or nfserr_rep_too_big, which might not
always make sense.

> I recognize that you are tweaking only SETCLIENTID here,
> but I think behavior should be consistent for all minor
> versions of NFSv4.

I really want to change EXCHANGEID too.

Maybe we should use NFS4ERR_SERVERFAULT.  It seems to be a catch-all
for "some fatal error".  The server has failed to allocate required
resources.

Thanks,
NeilBrown

> > Do we have a count of active vs courtesy clients?
>
> Dai can correct me if I’m wrong, but I believe NFSD
> maintains a count of both.
>
> But only the active leases really matter, because
> courtesy clients can be dropped as memory becomes tight.
> Dropping an active lease would be somewhat more
> catastrophic.
>
> —
> Chuck Lever
* Re: [PATCH] nfsd: don't fail OP_SETCLIENTID when there are lots of clients.

From: Chuck Lever @ 2024-04-25 13:26 UTC
To: NeilBrown
Cc: Dai Ngo, Jeff Layton, Olga Kornievskaia, Tom Talpey, Petr Vorel, Linux NFS Mailing List

On Thu, Apr 25, 2024 at 10:08:35AM +1000, NeilBrown wrote:
> On Tue, 23 Apr 2024, Chuck Lever III wrote:
> >
> > > On Apr 22, 2024, at 7:34 PM, NeilBrown <neilb@suse.de> wrote:
> > >
> > > On Mon, 22 Apr 2024, Chuck Lever wrote:
> > >>> On Mon, Apr 22, 2024 at 12:09:19PM +1000, NeilBrown wrote:
> > >>> The calculation of how many clients the nfs server can manage is only a
> > >>> heuristic.  Triggering the laundromat to clean up old clients when we
> > >>> have more than the heuristic limit is valid, but refusing to create new
> > >>> clients is not.  Client creation should only fail if there really isn't
> > >>> enough memory available.
> > >>>
> > >>> This is not known to have caused a problem in production use, but
> > >>> testing of lots of clients reports an error and it is not clear that
> > >>> this error is justified.
> > >>
> > >> It is justified, see 4271c2c08875 ("NFSD: limit the number of v4
> > >> clients to 1024 per 1GB of system memory").  In cases like these,
> > >> the recourse is to add more memory to the test system.
> > >
> > > Does each client really need 1MB?
> > > Obviously we don't want all memory to be used by client state....
> > >
> > >> However, that commit claims that the client is told to retry; I
> > >> don't expect client creation to fail outright.  Can you describe the
> > >> failure mode you see?
> > >
> > > The failure mode is repeated client retries that never succeed.  I think
> > > an outright failure would be preferable - it would make it more clear
> > > that memory must be added.
> > >
> > > The server has N active clients and M courtesy clients.
> > > Triggering reclaim when N+M exceeds a limit and M>0 makes sense.
> > > A hard failure (NFS4ERR_RESOURCE) when N exceeds a limit makes sense.
> > > A soft failure (NFS4ERR_DELAY) while reclaim is running makes sense.
> > >
> > > I don't think a retry while N exceeds the limit makes sense.
> >
> > It’s not optimal, I agree.
> >
> > NFSD has to limit the total number of active and courtesy
> > clients, because otherwise it would be subject to an easy
> > (d)DoS attack, which Dai demonstrated to me before I
> > accepted his patch.  A malicious actor or broken clients
> > can continue to create leases on the server until it stops
> > responding.
> >
> > I think failing outright would accomplish the mitigation
> > as well as delaying does, but delaying once or twice
> > gives some slack that allows a mount attempt to succeed
> > eventually even when the server temporarily exceeds the
> > maximum client count.
>
> I doubt that the set of active clients is so dynamic that it is worth
> waiting in case some client goes away soon.  If we hit the limit then we
> probably already have more clients than we can reasonably handle and it
> is time to indicate failure.
>
> > Also IMO there could be a rate-limited pr_warn on the
> > server that fires to indicate when a low-memory situation
> > has been reached.
>
> Yes, server side warnings would be a good idea.
>
> > The problem with NFS4ERR_RESOURCE, however, is that
> > NFSv4.1 and newer do not have that status code.  All
> > versions of NFS have DELAY/JUKEBOX.
>
> I didn't realise that.  Lots of code in nfs4xdr.c returns
> nfserr_resource.  For v4.1 it appears to get translated to
> nfserr_rep_too_big_to_cache or nfserr_rep_too_big, which might not
> always make sense.

Yes.  It's confusing, but that's how NFSv4.1 support was grafted
into NFSD's XDR layer.

> > I recognize that you are tweaking only SETCLIENTID here,
> > but I think behavior should be consistent for all minor
> > versions of NFSv4.
>
> I really want to change EXCHANGEID too.

IIRC, CREATE_SESSION can also fail based on the number of clients
and the server's physical memory configuration, so it needs some
attention as well.

> Maybe we should use NFS4ERR_SERVERFAULT.  It seems to be a catch-all
> for "some fatal error".  The server has failed to allocate required
> resources.

We need to be aware of how clients might respond to whichever
status codes are chosen.

SETCLIENTID and SETCLIENTID_CONFIRM are permitted to return
NFS4ERR_RESOURCE, and these are implemented separately from their
NFSv4.1 equivalents.  So perhaps they can return something saner
than SERVERFAULT.

-- 
Chuck Lever
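[The status-code constraints debated above can be summarized: NFSv4.0's SETCLIENTID may return NFS4ERR_RESOURCE, NFSv4.1+ lacks that code entirely, and DELAY exists in every version. The sketch below is one plausible mapping consistent with the discussion — an illustration only, not what NFSD implements.]

```python
def over_limit_status(minor_version: int, reclaim_running: bool) -> str:
    """One plausible per-minor-version status choice when the client
    limit is hit.  Illustrative: the thread debates RESOURCE vs DELAY vs
    SERVERFAULT; NFS4ERR_RESOURCE exists only in NFSv4.0, while
    DELAY/JUKEBOX exists in every version.
    """
    if reclaim_running:
        return "NFS4ERR_DELAY"        # soft failure while the laundromat runs
    if minor_version == 0:
        return "NFS4ERR_RESOURCE"     # hard failure, available in v4.0 only
    return "NFS4ERR_SERVERFAULT"      # v4.1+: no RESOURCE code available

print(over_limit_status(0, False))  # NFS4ERR_RESOURCE
print(over_limit_status(1, True))   # NFS4ERR_DELAY
```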
* Re: [PATCH] nfsd: don't fail OP_SETCLIENTID when there are lots of clients.

From: Petr Vorel @ 2024-04-23 15:12 UTC
To: Chuck Lever
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs

> On Mon, Apr 22, 2024 at 12:09:19PM +1000, NeilBrown wrote:
> > The calculation of how many clients the nfs server can manage is only a
> > heuristic.  Triggering the laundromat to clean up old clients when we
> > have more than the heuristic limit is valid, but refusing to create new
> > clients is not.  Client creation should only fail if there really isn't
> > enough memory available.

> > This is not known to have caused a problem in production use, but
> > testing of lots of clients reports an error and it is not clear that
> > this error is justified.

> It is justified, see 4271c2c08875 ("NFSD: limit the number of v4
> clients to 1024 per 1GB of system memory").  In cases like these,
> the recourse is to add more memory to the test system.

FYI the system is using 1468 MB + 2048 MB swap

$ free -m
               total        used        free      shared  buff/cache   available
Mem:            1468         347         589           4         686        1121
Swap:           2048           0        2048

Indeed increasing the memory to 3430 MB makes the test happy.  It's of
course up to you to decide whether this is just an unrealistic /
artificial problem which does not influence users, and thus whether the
v2 Neil sent is worth merging.

Kind regards,
Petr

> However, that commit claims that the client is told to retry; I
> don't expect client creation to fail outright.  Can you describe the
> failure mode you see?

> Meanwhile, we need to have broader and more regular testing of NFSD
> on memory-starved systems.  That's a long-term project.

> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/nfsd/nfs4state.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)

> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index daf83823ba48..8a40bb6a4a67 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -2223,10 +2223,9 @@ static struct nfs4_client *alloc_client(struct xdr_netobj name,
> >  	struct nfs4_client *clp;
> >  	int i;

> > -	if (atomic_read(&nn->nfs4_client_count) >= nn->nfs4_max_clients) {
> > +	if (atomic_read(&nn->nfs4_client_count) >= nn->nfs4_max_clients)
> >  		mod_delayed_work(laundry_wq, &nn->laundromat_work, 0);
> > -		return NULL;
> > -	}
> > +
> >  	clp = kmem_cache_zalloc(client_slab, GFP_KERNEL);
> >  	if (clp == NULL)
> >  		return NULL;
> > --
> > 2.44.0
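[Petr's numbers line up with the 1024-clients-per-1GB heuristic: at that rate the cap works out to roughly one client per MB of RAM, so 1468 MB allows about 1468 clients while 3430 MB allows about 3430. A back-of-the-envelope check — illustrative only, not the kernel's exact rounding:]

```python
CLIENTS_PER_GB = 1024  # from commit 4271c2c08875

def approx_client_cap(ram_mb: int) -> int:
    # 1024 clients per 1024 MB of RAM works out to ~1 client per MB.
    return ram_mb * CLIENTS_PER_GB // 1024

print(approx_client_cap(1468))  # 1468
print(approx_client_cap(3430))  # 3430
```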
* Re: [PATCH] nfsd: don't fail OP_SETCLIENTID when there are lots of clients.

From: Chuck Lever @ 2024-04-23 15:26 UTC
To: Petr Vorel
Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs

On Tue, Apr 23, 2024 at 05:12:56PM +0200, Petr Vorel wrote:
> > On Mon, Apr 22, 2024 at 12:09:19PM +1000, NeilBrown wrote:
> > > The calculation of how many clients the nfs server can manage is only a
> > > heuristic.  Triggering the laundromat to clean up old clients when we
> > > have more than the heuristic limit is valid, but refusing to create new
> > > clients is not.  Client creation should only fail if there really isn't
> > > enough memory available.
>
> > > This is not known to have caused a problem in production use, but
> > > testing of lots of clients reports an error and it is not clear that
> > > this error is justified.
>
> > It is justified, see 4271c2c08875 ("NFSD: limit the number of v4
> > clients to 1024 per 1GB of system memory").  In cases like these,
> > the recourse is to add more memory to the test system.
>
> FYI the system is using 1468 MB + 2048 MB swap
>
> $ free -m
>                total        used        free      shared  buff/cache   available
> Mem:            1468         347         589           4         686        1121
> Swap:           2048           0        2048
>
> Indeed increasing the memory to 3430 MB makes the test happy.  It's of
> course up to you to decide whether this is just an unrealistic /
> artificial problem which does not influence users, and thus whether the
> v2 Neil sent is worth merging.

IMO, if you want to handle a large client cohort, NFSD will need
to have adequate memory available.  In production scenarios, I
think it is not realistic to expect a 1.5GB server to handle more
than a few dozen NFSv4 clients, given the amount of lease, session,
and open/lock state that can be in flight.

However, in testing scenarios, it's reasonable and even necessary
to experiment with low-memory servers.

I don't disagree that failing the mount attempt outright is a good
thing to do.  But to make that fly, we need to figure out how to
make NFSv4.1+ behave that way too.

-- 
Chuck Lever