* [PATCH 0/2] NLM: fix use after free in lockd @ 2008-01-28 19:29 Jeff Layton 2008-01-28 19:29 ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts Jeff Layton 0 siblings, 1 reply; 13+ messages in thread From: Jeff Layton @ 2008-01-28 19:29 UTC (permalink / raw) To: bfields, neilb; +Cc: linux-nfs This first patch in this set replaces the patch that I had originally proposed for fixing the use after free problem. The earlier patch seemed to work in some situations, but when the timing was a bit different it was still possible for lockd to go down before the RPC callback occurred. The fix is to just tear down the rpc client altogether. The second patch in the set prevents a problem whereby lockd could stay up indefinitely looping through nlmsvc_retry_blocked. This set depends on the patchset I proposed earlier to convert lockd to use the kthread API. Comments and/or suggestions appreciated. Signed-off-by: Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts 2008-01-28 19:29 [PATCH 0/2] NLM: fix use after free in lockd Jeff Layton @ 2008-01-28 19:29 ` Jeff Layton 2008-01-28 19:29 ` [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down Jeff Layton 2008-01-28 23:18 ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts J. Bruce Fields 0 siblings, 2 replies; 13+ messages in thread From: Jeff Layton @ 2008-01-28 19:29 UTC (permalink / raw) To: bfields, neilb; +Cc: linux-nfs It's possible for a RPC to outlive the lockd daemon that created it, so we need to make sure that all RPC's are killed when lockd is coming down. When nlm_shutdown_hosts is called, kill off all RPC tasks associated with the host. Since we need to wait until they have all gone away, we might as well just shut down the RPC client altogether. Signed-off-by: Jeff Layton <jlayton@redhat.com> --- fs/lockd/host.c | 7 ++++++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/fs/lockd/host.c b/fs/lockd/host.c index ebec009..ca6b16f 100644 --- a/fs/lockd/host.c +++ b/fs/lockd/host.c @@ -379,8 +379,13 @@ nlm_shutdown_hosts(void) /* First, make all hosts eligible for gc */ dprintk("lockd: nuking all hosts...\n"); for (chain = nlm_hosts; chain < nlm_hosts + NLM_HOST_NRHASH; ++chain) { - hlist_for_each_entry(host, pos, chain, h_hash) + hlist_for_each_entry(host, pos, chain, h_hash) { host->h_expires = jiffies - 1; + if (host->h_rpcclnt) { + rpc_shutdown_client(host->h_rpcclnt); + host->h_rpcclnt = NULL; + } + } } /* Then, perform a garbage collection pass */ -- 1.5.3.7 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down 2008-01-28 19:29 ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts Jeff Layton @ 2008-01-28 19:29 ` Jeff Layton 2008-01-29 0:05 ` J. Bruce Fields 2008-01-28 23:18 ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts J. Bruce Fields 1 sibling, 1 reply; 13+ messages in thread From: Jeff Layton @ 2008-01-28 19:29 UTC (permalink / raw) To: bfields, neilb; +Cc: linux-nfs It's possible to end up continually looping in nlmsvc_retry_blocked() even when the lockd kthread has been requested to come down. I've been able to reproduce this fairly easily by having an RPC grant callback queued up to an RPC client that's not bound. If the other host is unresponsive, then the block never gets taken off the list, and lockd continually respawns new RPC's. If lockd is going down anyway, we don't want to keep retrying the blocks, so allow it to break out of this loop. Additionally, in that situation kthread_stop will have already woken lockd, so when nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so that it doesn't end up calling svc_recv and waiting for a long time. This patch prevents this continual looping and allows lockd to eventually come down, but this can quite some time since lockd won't check the condition in the nlmsvc_retry_blocked while loop until nlmsvc_grant_blocked returns. That won't happen until all of the portmap requests time out. Signed-off-by: Jeff Layton <jlayton@redhat.com> --- fs/lockd/svc.c | 4 ++++ fs/lockd/svclock.c | 3 ++- 2 files changed, 6 insertions(+), 1 deletions(-) diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c index 5752e1b..a0e5318 100644 --- a/fs/lockd/svc.c +++ b/fs/lockd/svc.c @@ -166,6 +166,10 @@ lockd(void *vrqstp) } else if (time_before(grace_period_expire, jiffies)) clear_grace_period(); + /* nlmsvc_retry_blocked can block, so check for kthread_stop */ + if (kthread_should_stop()) + break; + /* * Find a socket with data available and call its * recvfrom routine. diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c index 2f4d8fa..d8324b6 100644 --- a/fs/lockd/svclock.c +++ b/fs/lockd/svclock.c @@ -29,6 +29,7 @@ #include <linux/sunrpc/svc.h> #include <linux/lockd/nlm.h> #include <linux/lockd/lockd.h> +#include <linux/kthread.h> #define NLMDBG_FACILITY NLMDBG_SVCLOCK @@ -867,7 +868,7 @@ nlmsvc_retry_blocked(void) unsigned long timeout = MAX_SCHEDULE_TIMEOUT; struct nlm_block *block; - while (!list_empty(&nlm_blocked)) { + while (!list_empty(&nlm_blocked) && !kthread_should_stop()) { block = list_entry(nlm_blocked.next, struct nlm_block, b_list); if (block->b_when == NLM_NEVER) -- 1.5.3.7 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down 2008-01-28 19:29 ` [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down Jeff Layton @ 2008-01-29 0:05 ` J. Bruce Fields 2008-01-29 1:12 ` Jeff Layton 0 siblings, 1 reply; 13+ messages in thread From: J. Bruce Fields @ 2008-01-29 0:05 UTC (permalink / raw) To: Jeff Layton; +Cc: neilb, linux-nfs On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote: > It's possible to end up continually looping in nlmsvc_retry_blocked() > even when the lockd kthread has been requested to come down. I've been > able to reproduce this fairly easily by having an RPC grant callback > queued up to an RPC client that's not bound. If the other host is > unresponsive, then the block never gets taken off the list, and lockd > continually respawns new RPC's. > > If lockd is going down anyway, we don't want to keep retrying the > blocks, so allow it to break out of this loop. Additionally, in that > situation kthread_stop will have already woken lockd, so when > nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so > that it doesn't end up calling svc_recv and waiting for a long time. Stupid question: how is this different from the kthread_stop() coming just after this kthread_should_stop() call but before we call svc_recv()? What keeps us from waiting in svc_recv() indefinitely after kthread_stop()? > This patch prevents this continual looping and allows lockd to > eventually come down, but this can quite some time since lockd > won't check the condition in the nlmsvc_retry_blocked while loop > until nlmsvc_grant_blocked returns. That won't happen until > all of the portmap requests time out. So, shutdown problems aside, does that mean an unresponsive client can cause lock to stop serving lock requests in normal operation? --b. > > Signed-off-by: Jeff Layton <jlayton@redhat.com> > --- > fs/lockd/svc.c | 4 ++++ > fs/lockd/svclock.c | 3 ++- > 2 files changed, 6 insertions(+), 1 deletions(-) > > diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c > index 5752e1b..a0e5318 100644 > --- a/fs/lockd/svc.c > +++ b/fs/lockd/svc.c > @@ -166,6 +166,10 @@ lockd(void *vrqstp) > } else if (time_before(grace_period_expire, jiffies)) > clear_grace_period(); > > + /* nlmsvc_retry_blocked can block, so check for kthread_stop */ > + if (kthread_should_stop()) > + break; > + > /* > * Find a socket with data available and call its > * recvfrom routine. > diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c > index 2f4d8fa..d8324b6 100644 > --- a/fs/lockd/svclock.c > +++ b/fs/lockd/svclock.c > @@ -29,6 +29,7 @@ > #include <linux/sunrpc/svc.h> > #include <linux/lockd/nlm.h> > #include <linux/lockd/lockd.h> > +#include <linux/kthread.h> > > #define NLMDBG_FACILITY NLMDBG_SVCLOCK > > @@ -867,7 +868,7 @@ nlmsvc_retry_blocked(void) > unsigned long timeout = MAX_SCHEDULE_TIMEOUT; > struct nlm_block *block; > > - while (!list_empty(&nlm_blocked)) { > + while (!list_empty(&nlm_blocked) && !kthread_should_stop()) { > block = list_entry(nlm_blocked.next, struct nlm_block, b_list); > > if (block->b_when == NLM_NEVER) > -- > 1.5.3.7 > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down 2008-01-29 0:05 ` J. Bruce Fields @ 2008-01-29 1:12 ` Jeff Layton [not found] ` <20080128201210.376097e6-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org> 0 siblings, 1 reply; 13+ messages in thread From: Jeff Layton @ 2008-01-29 1:12 UTC (permalink / raw) To: J. Bruce Fields; +Cc: neilb, linux-nfs On Mon, 28 Jan 2008 19:05:17 -0500 "J. Bruce Fields" <bfields@fieldses.org> wrote: > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote: > > It's possible to end up continually looping in > > nlmsvc_retry_blocked() even when the lockd kthread has been > > requested to come down. I've been able to reproduce this fairly > > easily by having an RPC grant callback queued up to an RPC client > > that's not bound. If the other host is unresponsive, then the block > > never gets taken off the list, and lockd continually respawns new > > RPC's. > > > > If lockd is going down anyway, we don't want to keep retrying the > > blocks, so allow it to break out of this loop. Additionally, in that > > situation kthread_stop will have already woken lockd, so when > > nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so > > that it doesn't end up calling svc_recv and waiting for a long time. > > Stupid question: how is this different from the kthread_stop() coming > just after this kthread_should_stop() call but before we call > svc_recv()? What keeps us from waiting in svc_recv() indefinitely > after kthread_stop()? > Nothing at all, unfortunately :-( Since we've taken signalling out of the lockd_down codepath, this gets a lot trickier than it used to be. I'm starting to wonder if we should go back to sending a signal on lockd_down. > > This patch prevents this continual looping and allows lockd to > > eventually come down, but this can quite some time since lockd > > won't check the condition in the nlmsvc_retry_blocked while loop > > until nlmsvc_grant_blocked returns. That won't happen until > > all of the portmap requests time out. > > So, shutdown problems aside, does that mean an unresponsive client can > cause lock to stop serving lock requests in normal operation? > Possibly. I think that in general what happens is that blocks eventually get reinserted into the list with an expire time that is well after the current jiffies and we eventually hit this condition: if (time_after(block->b_when,jiffies)) { timeout = block->b_when - jiffies; break; } ...but I was seeing lockd loop in this loop over several iterations without returning back up to the main lockd loop. I think what was happening was that we were inserting the block with a later jiffies in nlmsvc_grant_blocked, but the rpcbind attempts would end up taking longer than that timeout. So when we returned from nlmsvc_grant_blocked() that condition would no longer fire, and we'd end up retrying the grant callback again. As to why we don't see this more often, I'm not sure. I probably need to experiment a bit more with it to make sure I'm understanding this correctly. Let me do that and get back to you... Thanks, -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 13+ messages in thread
[parent not found: <20080128201210.376097e6-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]
* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down [not found] ` <20080128201210.376097e6-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org> @ 2008-01-29 1:23 ` J. Bruce Fields 2008-01-29 2:29 ` Jeff Layton 0 siblings, 1 reply; 13+ messages in thread From: J. Bruce Fields @ 2008-01-29 1:23 UTC (permalink / raw) To: Jeff Layton; +Cc: neilb, linux-nfs On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote: > On Mon, 28 Jan 2008 19:05:17 -0500 > "J. Bruce Fields" <bfields@fieldses.org> wrote: > > > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote: > > > It's possible to end up continually looping in > > > nlmsvc_retry_blocked() even when the lockd kthread has been > > > requested to come down. I've been able to reproduce this fairly > > > easily by having an RPC grant callback queued up to an RPC client > > > that's not bound. If the other host is unresponsive, then the block > > > never gets taken off the list, and lockd continually respawns new > > > RPC's. > > > > > > If lockd is going down anyway, we don't want to keep retrying the > > > blocks, so allow it to break out of this loop. Additionally, in that > > > situation kthread_stop will have already woken lockd, so when > > > nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so > > > that it doesn't end up calling svc_recv and waiting for a long time. > > > > Stupid question: how is this different from the kthread_stop() coming > > just after this kthread_should_stop() call but before we call > > svc_recv()? What keeps us from waiting in svc_recv() indefinitely > > after kthread_stop()? > > > > Nothing at all, unfortunately :-( > > Since we've taken signalling out of the lockd_down codepath, this gets > a lot trickier than it used to be. I'm starting to wonder if we should > go back to sending a signal on lockd_down. OK, thanks. So do I keep these patches, or not? This sounds like a regression (if perhaps a very minor one--I'm not quite clear). Help! --b. > > > > This patch prevents this continual looping and allows lockd to > > > eventually come down, but this can quite some time since lockd > > > won't check the condition in the nlmsvc_retry_blocked while loop > > > until nlmsvc_grant_blocked returns. That won't happen until > > > all of the portmap requests time out. > > > > So, shutdown problems aside, does that mean an unresponsive client can > > cause lock to stop serving lock requests in normal operation? > > > > Possibly. > > I think that in general what happens is that blocks eventually get > reinserted into the list with an expire time that is well after the > current jiffies and we eventually hit this condition: > > if (time_after(block->b_when,jiffies)) { > timeout = block->b_when - jiffies; > break; > } > > ...but I was seeing lockd loop in this loop over several iterations > without returning back up to the main lockd loop. I think what was > happening was that we were inserting the block with a later jiffies in > nlmsvc_grant_blocked, but the rpcbind attempts would end up taking > longer than that timeout. So when we returned from > nlmsvc_grant_blocked() that condition would no longer fire, and we'd end up retrying the grant callback again. > > As to why we don't see this more often, I'm not sure. I probably need > to experiment a bit more with it to make sure I'm understanding this > correctly. > > Let me do that and get back to you... ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down 2008-01-29 1:23 ` J. Bruce Fields @ 2008-01-29 2:29 ` Jeff Layton [not found] ` <20080128212942.77035005-PC62bkCOHzGdMjc06nkz3ljfA9RmPOcC@public.gmane.org> 0 siblings, 1 reply; 13+ messages in thread From: Jeff Layton @ 2008-01-29 2:29 UTC (permalink / raw) To: J. Bruce Fields; +Cc: neilb, linux-nfs On Mon, 28 Jan 2008 20:23:51 -0500 "J. Bruce Fields" <bfields@fieldses.org> wrote: > On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote: > > On Mon, 28 Jan 2008 19:05:17 -0500 > > "J. Bruce Fields" <bfields@fieldses.org> wrote: > > > > > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote: > > > > It's possible to end up continually looping in > > > > nlmsvc_retry_blocked() even when the lockd kthread has been > > > > requested to come down. I've been able to reproduce this fairly > > > > easily by having an RPC grant callback queued up to an RPC client > > > > that's not bound. If the other host is unresponsive, then the block > > > > never gets taken off the list, and lockd continually respawns new > > > > RPC's. > > > > > > > > If lockd is going down anyway, we don't want to keep retrying the > > > > blocks, so allow it to break out of this loop. Additionally, in that > > > > situation kthread_stop will have already woken lockd, so when > > > > nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so > > > > that it doesn't end up calling svc_recv and waiting for a long time. > > > > > > Stupid question: how is this different from the kthread_stop() coming > > > just after this kthread_should_stop() call but before we call > > > svc_recv()? What keeps us from waiting in svc_recv() indefinitely > > > after kthread_stop()? > > > > > > > Nothing at all, unfortunately :-( > > > > Since we've taken signalling out of the lockd_down codepath, this gets > > a lot trickier than it used to be. I'm starting to wonder if we should > > go back to sending a signal on lockd_down. > > OK, thanks. So do I keep these patches, or not? This sounds like a > regression (if perhaps a very minor one--I'm not quite clear). Help! > Well, if we reintroduce signalling in lockd_down then this particular problem goes away. It may be reasonable to add that back into the mix for now. As to the more general question of whether to keep these patches... Most of them are pretty innocuous, but the patch to convert to kthread API is the biggest change. 2.6.25 might be a bit aggressive for that. It wouldn't hurt to let the conversion to kthread API brew until 2.6.26. lockd still does have a use after free problem though, so patch 1/2 here might be worth considering even without changing things over to kthreads. I've not tested it on a kernel without the kthread patches, however. I can do that if you think that's the right approach. > > > > > > This patch prevents this continual looping and allows lockd to > > > > eventually come down, but this can quite some time since lockd > > > > won't check the condition in the nlmsvc_retry_blocked while loop > > > > until nlmsvc_grant_blocked returns. That won't happen until > > > > all of the portmap requests time out. > > > > > > So, shutdown problems aside, does that mean an unresponsive client can > > > cause lock to stop serving lock requests in normal operation? > > > > > > > Possibly. > > > > I think that in general what happens is that blocks eventually get > > reinserted into the list with an expire time that is well after the > > current jiffies and we eventually hit this condition: > > > > if (time_after(block->b_when,jiffies)) { > > timeout = block->b_when - jiffies; > > break; > > } > > > > ...but I was seeing lockd loop in this loop over several iterations > > without returning back up to the main lockd loop. I think what was > > happening was that we were inserting the block with a later jiffies in > > nlmsvc_grant_blocked, but the rpcbind attempts would end up taking > > longer than that timeout. So when we returned from > > nlmsvc_grant_blocked() that condition would no longer fire, and we'd end up retrying the grant callback again. > > > > As to why we don't see this more often, I'm not sure. I probably need > > to experiment a bit more with it to make sure I'm understanding this > > correctly. > > > > Let me do that and get back to you... -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 13+ messages in thread
[parent not found: <20080128212942.77035005-PC62bkCOHzGdMjc06nkz3ljfA9RmPOcC@public.gmane.org>]
* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down [not found] ` <20080128212942.77035005-PC62bkCOHzGdMjc06nkz3ljfA9RmPOcC@public.gmane.org> @ 2008-01-29 3:02 ` J. Bruce Fields 2008-01-29 11:23 ` Jeff Layton 0 siblings, 1 reply; 13+ messages in thread From: J. Bruce Fields @ 2008-01-29 3:02 UTC (permalink / raw) To: Jeff Layton; +Cc: neilb, linux-nfs On Mon, Jan 28, 2008 at 09:29:42PM -0500, Jeff Layton wrote: > On Mon, 28 Jan 2008 20:23:51 -0500 > "J. Bruce Fields" <bfields@fieldses.org> wrote: > > > On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote: > > > On Mon, 28 Jan 2008 19:05:17 -0500 > > > "J. Bruce Fields" <bfields@fieldses.org> wrote: > > > > > > > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote: > > > > > It's possible to end up continually looping in > > > > > nlmsvc_retry_blocked() even when the lockd kthread has been > > > > > requested to come down. I've been able to reproduce this fairly > > > > > easily by having an RPC grant callback queued up to an RPC client > > > > > that's not bound. If the other host is unresponsive, then the block > > > > > never gets taken off the list, and lockd continually respawns new > > > > > RPC's. > > > > > > > > > > If lockd is going down anyway, we don't want to keep retrying the > > > > > blocks, so allow it to break out of this loop. Additionally, in that > > > > > situation kthread_stop will have already woken lockd, so when > > > > > nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so > > > > > that it doesn't end up calling svc_recv and waiting for a long time. > > > > > > > > Stupid question: how is this different from the kthread_stop() coming > > > > just after this kthread_should_stop() call but before we call > > > > svc_recv()? What keeps us from waiting in svc_recv() indefinitely > > > > after kthread_stop()? > > > > > > > > > > Nothing at all, unfortunately :-( > > > > > > Since we've taken signalling out of the lockd_down codepath, this gets > > > a lot trickier than it used to be. I'm starting to wonder if we should > > > go back to sending a signal on lockd_down. > > > > OK, thanks. So do I keep these patches, or not? This sounds like a > > regression (if perhaps a very minor one--I'm not quite clear). Help! > > > > Well, if we reintroduce signalling in lockd_down then this particular > problem goes away. It may be reasonable to add that back into the mix > for now. > > As to the more general question of whether to keep these patches... > > Most of them are pretty innocuous, but the patch to convert to kthread > API is the biggest change. 2.6.25 might be a bit aggressive for that. It > wouldn't hurt to let the conversion to kthread API brew until 2.6.26. > > lockd still does have a use after free problem though, so patch 1/2 > here might be worth considering even without changing things over to > kthreads. I've not tested it on a kernel without the kthread patches, > however. I can do that if you think that's the right approach. Pfft, I was hoping you'd tell me what to do. But, yes, it sounds like dropping the kthread conversion and sending in that one bugfix is probably wisest. (But I assume we should keep the first of those three patches, "SUNRPC: spin svc_rqst initialization to its own function".) --b. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down 2008-01-29 3:02 ` J. Bruce Fields @ 2008-01-29 11:23 ` Jeff Layton [not found] ` <20080129062312.33ba5a54-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org> 0 siblings, 1 reply; 13+ messages in thread From: Jeff Layton @ 2008-01-29 11:23 UTC (permalink / raw) To: J. Bruce Fields; +Cc: neilb, linux-nfs On Mon, 28 Jan 2008 22:02:49 -0500 "J. Bruce Fields" <bfields@fieldses.org> wrote: > On Mon, Jan 28, 2008 at 09:29:42PM -0500, Jeff Layton wrote: > > On Mon, 28 Jan 2008 20:23:51 -0500 > > "J. Bruce Fields" <bfields@fieldses.org> wrote: > > > > > On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote: > > > > On Mon, 28 Jan 2008 19:05:17 -0500 > > > > "J. Bruce Fields" <bfields@fieldses.org> wrote: > > > > > > > > > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote: > > > > > > It's possible to end up continually looping in > > > > > > nlmsvc_retry_blocked() even when the lockd kthread has been > > > > > > requested to come down. I've been able to reproduce this > > > > > > fairly easily by having an RPC grant callback queued up to > > > > > > an RPC client that's not bound. If the other host is > > > > > > unresponsive, then the block never gets taken off the list, > > > > > > and lockd continually respawns new RPC's. > > > > > > > > > > > > If lockd is going down anyway, we don't want to keep > > > > > > retrying the blocks, so allow it to break out of this loop. > > > > > > Additionally, in that situation kthread_stop will have > > > > > > already woken lockd, so when nlmsvc_retry_blocked returns, > > > > > > have lockd check for kthread_stop() so that it doesn't end > > > > > > up calling svc_recv and waiting for a long time. > > > > > > > > > > Stupid question: how is this different from the > > > > > kthread_stop() coming just after this kthread_should_stop() > > > > > call but before we call svc_recv()? What keeps us from > > > > > waiting in svc_recv() indefinitely after kthread_stop()? > > > > > > > > > > > > > Nothing at all, unfortunately :-( > > > > > > > > Since we've taken signalling out of the lockd_down codepath, > > > > this gets a lot trickier than it used to be. I'm starting to > > > > wonder if we should go back to sending a signal on lockd_down. > > > > > > OK, thanks. So do I keep these patches, or not? This sounds > > > like a regression (if perhaps a very minor one--I'm not quite > > > clear). Help! > > > > > > > Well, if we reintroduce signalling in lockd_down then this > > particular problem goes away. It may be reasonable to add that back > > into the mix for now. > > > > As to the more general question of whether to keep these patches... > > > > Most of them are pretty innocuous, but the patch to convert to > > kthread API is the biggest change. 2.6.25 might be a bit aggressive > > for that. It wouldn't hurt to let the conversion to kthread API > > brew until 2.6.26. > > > > lockd still does have a use after free problem though, so patch 1/2 > > here might be worth considering even without changing things over to > > kthreads. I've not tested it on a kernel without the kthread > > patches, however. I can do that if you think that's the right > > approach. > > Pfft, I was hoping you'd tell me what to do. > > But, yes, it sounds like dropping the kthread conversion and sending > in that one bugfix is probably wisest. (But I assume we should keep > the first of those three patches, "SUNRPC: spin svc_rqst > initialization to its own function".) > Yes. The first patch fixes a possible NULL pointer dereference so I think we want that one here. This patch: Subject: [PATCH 2/4] SUNRPC: export svc_sock_update_bufs ...should be safe too, though it's not strictly needed until we convert lockd to the kthread API. So at this point we have 2 questions: 1) is there a way to make sure we don't block in svc_recv() without resorting to signals and can we reasonably mix signaling + kthread_stop? There may be a chicken and egg problem involved here though I haven't thought it through yet... 2) can an unresponsive client make lockd loop continuously in nlmsvc_retry_blocked() and prevent it from serving new lock requests? I'll try to get answers to both of these... -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 13+ messages in thread
[parent not found: <20080129062312.33ba5a54-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]
* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down [not found] ` <20080129062312.33ba5a54-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org> @ 2008-01-29 12:32 ` Jeff Layton [not found] ` <20080129073241.7c5c2ef1-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org> 0 siblings, 1 reply; 13+ messages in thread From: Jeff Layton @ 2008-01-29 12:32 UTC (permalink / raw) To: J. Bruce Fields; +Cc: neilb, linux-nfs On Tue, 29 Jan 2008 06:23:12 -0500 Jeff Layton <jlayton@redhat.com> wrote: > On Mon, 28 Jan 2008 22:02:49 -0500 > "J. Bruce Fields" <bfields@fieldses.org> wrote: > > > On Mon, Jan 28, 2008 at 09:29:42PM -0500, Jeff Layton wrote: > > > On Mon, 28 Jan 2008 20:23:51 -0500 > > > "J. Bruce Fields" <bfields@fieldses.org> wrote: > > > > > > > On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote: > > > > > On Mon, 28 Jan 2008 19:05:17 -0500 > > > > > "J. Bruce Fields" <bfields@fieldses.org> wrote: > > > > > > > > > > > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote: > > > > > > > It's possible to end up continually looping in > > > > > > > nlmsvc_retry_blocked() even when the lockd kthread has > > > > > > > been requested to come down. I've been able to reproduce > > > > > > > this fairly easily by having an RPC grant callback queued > > > > > > > up to an RPC client that's not bound. If the other host is > > > > > > > unresponsive, then the block never gets taken off the > > > > > > > list, and lockd continually respawns new RPC's. > > > > > > > > > > > > > > If lockd is going down anyway, we don't want to keep > > > > > > > retrying the blocks, so allow it to break out of this > > > > > > > loop. Additionally, in that situation kthread_stop will > > > > > > > have already woken lockd, so when nlmsvc_retry_blocked > > > > > > > returns, have lockd check for kthread_stop() so that it > > > > > > > doesn't end up calling svc_recv and waiting for a long > > > > > > > time. > > > > > > > > > > > > Stupid question: how is this different from the > > > > > > kthread_stop() coming just after this kthread_should_stop() > > > > > > call but before we call svc_recv()? What keeps us from > > > > > > waiting in svc_recv() indefinitely after kthread_stop()? > > > > > > > > > > > > > > > > Nothing at all, unfortunately :-( > > > > > > > > > > Since we've taken signalling out of the lockd_down codepath, > > > > > this gets a lot trickier than it used to be. I'm starting to > > > > > wonder if we should go back to sending a signal on lockd_down. > > > > > > > > OK, thanks. So do I keep these patches, or not? This sounds > > > > like a regression (if perhaps a very minor one--I'm not quite > > > > clear). Help! > > > > > > > > > > Well, if we reintroduce signalling in lockd_down then this > > > particular problem goes away. It may be reasonable to add that > > > back into the mix for now. > > > > > > As to the more general question of whether to keep these > > > patches... > > > > > > Most of them are pretty innocuous, but the patch to convert to > > > kthread API is the biggest change. 2.6.25 might be a bit > > > aggressive for that. It wouldn't hurt to let the conversion to > > > kthread API brew until 2.6.26. > > > > > > lockd still does have a use after free problem though, so patch > > > 1/2 here might be worth considering even without changing things > > > over to kthreads. I've not tested it on a kernel without the > > > kthread patches, however. I can do that if you think that's the > > > right approach. > > > > Pfft, I was hoping you'd tell me what to do. > > > > But, yes, it sounds like dropping the kthread conversion and sending > > in that one bugfix is probably wisest. (But I assume we should keep > > the first of those three patches, "SUNRPC: spin svc_rqst > > initialization to its own function".) > > > > Yes. The first patch fixes a possible NULL pointer dereference so I > think we want that one here. This patch: > > Subject: [PATCH 2/4] SUNRPC: export svc_sock_update_bufs > > ...should be safe too, though it's not strictly needed until we > convert lockd to the kthread API. > > So at this point we have 2 questions: > > 1) is there a way to make sure we don't block in svc_recv() without > resorting to signals and can we reasonably mix signaling + > kthread_stop? There may be a chicken and egg problem involved here > though I haven't thought it through yet... > Still looking at this, though I'll note that CIFS has some kthreads that deal with sockets. When shutting those down, we generally send a signal to them in addition to the kthread_stop call. I'll look at how they do that and plan to try out something similar here for lockd. > 2) can an unresponsive client make lockd loop continuously in > nlmsvc_retry_blocked() and prevent it from serving new lock requests? > Yes. This is definitely the case. I can trivially hang lockd with the right set of steps. Let me know if you want them, though you can probably figure them out. I'm not sure what the right way to fix this is. Maybe this timeout in nlmsvc_grant_blocked should be longer? Something longer than the amount of time that it takes to do an rpc_ping on an unbound RPC client: /* Schedule next grant callback in 30 seconds */ nlmsvc_insert_block(block, 30 * HZ); The more I look at lockd and its problems, I'm starting to think that we might be better served by using two threads. Have one that just handles svc_recv/svc_process and another thread that handles nlmsvc_retry_blocked and maybe nlmsvc_invalidate_all when things are signalled. This second thread could even be a workqueue and the main lockd thread could queue tasks to it as needed. Finally, there's one more question that I forgot: 3) does this patch fix the use after free bug independently of the kthread patches: Subject: Re: [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts ...the answer is yes. It does seem to. Bruce, would you like me to roll up a new patchset to make it clear what I'd like to see for 2.6.25? -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 13+ messages in thread
[parent not found: <20080129073241.7c5c2ef1-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]
* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down [not found] ` <20080129073241.7c5c2ef1-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org> @ 2008-01-29 14:41 ` J. Bruce Fields 0 siblings, 0 replies; 13+ messages in thread From: J. Bruce Fields @ 2008-01-29 14:41 UTC (permalink / raw) To: Jeff Layton; +Cc: neilb, linux-nfs On Tue, Jan 29, 2008 at 07:32:41AM -0500, Jeff Layton wrote: > Bruce, would you like me to roll up a new patchset to make it clear > what I'd like to see for 2.6.25? That would be fantastic. Thanks for all your work on this. --b. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts 2008-01-28 19:29 ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts Jeff Layton 2008-01-28 19:29 ` [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down Jeff Layton @ 2008-01-28 23:18 ` J. Bruce Fields 2008-01-29 0:04 ` Jeff Layton 1 sibling, 1 reply; 13+ messages in thread From: J. Bruce Fields @ 2008-01-28 23:18 UTC (permalink / raw) To: Jeff Layton; +Cc: neilb, linux-nfs On Mon, Jan 28, 2008 at 02:29:10PM -0500, Jeff Layton wrote: > It's possible for a RPC to outlive the lockd daemon that created it, so > we need to make sure that all RPC's are killed when lockd is coming > down. When nlm_shutdown_hosts is called, kill off all RPC tasks > associated with the host. Since we need to wait until they have all gone > away, we might as well just shut down the RPC client altogether. > > Signed-off-by: Jeff Layton <jlayton@redhat.com> > --- > fs/lockd/host.c | 7 ++++++- > 1 files changed, 6 insertions(+), 1 deletions(-) > > diff --git a/fs/lockd/host.c b/fs/lockd/host.c > index ebec009..ca6b16f 100644 > --- a/fs/lockd/host.c > +++ b/fs/lockd/host.c > @@ -379,8 +379,13 @@ nlm_shutdown_hosts(void) > /* First, make all hosts eligible for gc */ > dprintk("lockd: nuking all hosts...\n"); > for (chain = nlm_hosts; chain < nlm_hosts + NLM_HOST_NRHASH; ++chain) { > - hlist_for_each_entry(host, pos, chain, h_hash) > + hlist_for_each_entry(host, pos, chain, h_hash) { > host->h_expires = jiffies - 1; > + if (host->h_rpcclnt) { So the only difference from the previous patch is this test and the following assignment? OK. --b. > + rpc_shutdown_client(host->h_rpcclnt); > + host->h_rpcclnt = NULL; > + } > + } > } > > /* Then, perform a garbage collection pass */ > -- > 1.5.3.7 > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts 2008-01-28 23:18 ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts J. Bruce Fields @ 2008-01-29 0:04 ` Jeff Layton 0 siblings, 0 replies; 13+ messages in thread From: Jeff Layton @ 2008-01-29 0:04 UTC (permalink / raw) To: J. Bruce Fields; +Cc: neilb, linux-nfs On Mon, 28 Jan 2008 18:18:26 -0500 "J. Bruce Fields" <bfields@fieldses.org> wrote: > On Mon, Jan 28, 2008 at 02:29:10PM -0500, Jeff Layton wrote: > > It's possible for a RPC to outlive the lockd daemon that created > > it, so we need to make sure that all RPC's are killed when lockd is > > coming down. When nlm_shutdown_hosts is called, kill off all RPC > > tasks associated with the host. Since we need to wait until they > > have all gone away, we might as well just shut down the RPC client > > altogether. > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com> > > --- > > fs/lockd/host.c | 7 ++++++- > > 1 files changed, 6 insertions(+), 1 deletions(-) > > > > diff --git a/fs/lockd/host.c b/fs/lockd/host.c > > index ebec009..ca6b16f 100644 > > --- a/fs/lockd/host.c > > +++ b/fs/lockd/host.c > > @@ -379,8 +379,13 @@ nlm_shutdown_hosts(void) > > /* First, make all hosts eligible for gc */ > > dprintk("lockd: nuking all hosts...\n"); > > for (chain = nlm_hosts; chain < nlm_hosts + > > NLM_HOST_NRHASH; ++chain) { > > - hlist_for_each_entry(host, pos, chain, h_hash) > > + hlist_for_each_entry(host, pos, chain, h_hash) { > > host->h_expires = jiffies - 1; > > + if (host->h_rpcclnt) { > > So the only difference from the previous patch is this test and the > following assignment? OK. > > --b. > No, the old patch used rpc_killall_tasks and this one uses rpc_shutdown_client. I don't believe rpc_killall_tasks gives any guarantee about when the tasks actually complete, and we need to know that they're gone before we allow lockd to exit. > > + > > rpc_shutdown_client(host->h_rpcclnt); > > + host->h_rpcclnt = NULL; > > + } > > + } > > } > > > > /* Then, perform a garbage collection pass */ > > -- > > 1.5.3.7 > > -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2008-01-29 14:41 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-28 19:29 [PATCH 0/2] NLM: fix use after free in lockd Jeff Layton
2008-01-28 19:29 ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts Jeff Layton
2008-01-28 19:29 ` [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down Jeff Layton
2008-01-29 0:05 ` J. Bruce Fields
2008-01-29 1:12 ` Jeff Layton
[not found] ` <20080128201210.376097e6-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-01-29 1:23 ` J. Bruce Fields
2008-01-29 2:29 ` Jeff Layton
[not found] ` <20080128212942.77035005-PC62bkCOHzGdMjc06nkz3ljfA9RmPOcC@public.gmane.org>
2008-01-29 3:02 ` J. Bruce Fields
2008-01-29 11:23 ` Jeff Layton
[not found] ` <20080129062312.33ba5a54-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-01-29 12:32 ` Jeff Layton
[not found] ` <20080129073241.7c5c2ef1-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-01-29 14:41 ` J. Bruce Fields
2008-01-28 23:18 ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts J. Bruce Fields
2008-01-29 0:04 ` Jeff Layton
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox