[PATCH 0/2] NLM: fix use after free in lockd

Linux NFS development
 help / color / mirror / Atom feed

* [PATCH 0/2] NLM: fix use after free in lockd
@ 2008-01-28 19:29 Jeff Layton
  2008-01-28 19:29 ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts Jeff Layton
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2008-01-28 19:29 UTC (permalink / raw)
  To: bfields, neilb; +Cc: linux-nfs

This first patch in this set replaces the patch that I had originally
proposed for fixing the use after free problem. The earlier patch seemed
to work in some situations, but when the timing was a bit different it
was still possible for lockd to go down before the RPC callback
occurred. The fix is to just tear down the rpc client altogether.

The second patch in the set prevents a problem whereby lockd could stay
up indefinitely looping through nlmsvc_retry_blocked. 

This set depends on the patchset I proposed earlier to convert lockd to
use the kthread API.

Comments and/or suggestions appreciated.

Signed-off-by: Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts
  2008-01-28 19:29 [PATCH 0/2] NLM: fix use after free in lockd Jeff Layton
@ 2008-01-28 19:29 ` Jeff Layton
  2008-01-28 19:29   ` [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down Jeff Layton
  2008-01-28 23:18   ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts J. Bruce Fields
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff Layton @ 2008-01-28 19:29 UTC (permalink / raw)
  To: bfields, neilb; +Cc: linux-nfs

It's possible for a RPC to outlive the lockd daemon that created it, so
we need to make sure that all RPC's are killed when lockd is coming
down. When nlm_shutdown_hosts is called, kill off all RPC tasks
associated with the host. Since we need to wait until they have all gone
away, we might as well just shut down the RPC client altogether.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/lockd/host.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/lockd/host.c b/fs/lockd/host.c
index ebec009..ca6b16f 100644
--- a/fs/lockd/host.c
+++ b/fs/lockd/host.c
@@ -379,8 +379,13 @@ nlm_shutdown_hosts(void)
 	/* First, make all hosts eligible for gc */
 	dprintk("lockd: nuking all hosts...\n");
 	for (chain = nlm_hosts; chain < nlm_hosts + NLM_HOST_NRHASH; ++chain) {
-		hlist_for_each_entry(host, pos, chain, h_hash)
+		hlist_for_each_entry(host, pos, chain, h_hash) {
 			host->h_expires = jiffies - 1;
+			if (host->h_rpcclnt) {
+				rpc_shutdown_client(host->h_rpcclnt);
+				host->h_rpcclnt = NULL;
+			}
+		}
 	}
 
 	/* Then, perform a garbage collection pass */
-- 
1.5.3.7


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down
  2008-01-28 19:29 ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts Jeff Layton
@ 2008-01-28 19:29   ` Jeff Layton
  2008-01-29  0:05     ` J. Bruce Fields
  2008-01-28 23:18   ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts J. Bruce Fields
  1 sibling, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2008-01-28 19:29 UTC (permalink / raw)
  To: bfields, neilb; +Cc: linux-nfs

It's possible to end up continually looping in nlmsvc_retry_blocked()
even when the lockd kthread has been requested to come down. I've been
able to reproduce this fairly easily by having an RPC grant callback
queued up to an RPC client that's not bound. If the other host is
unresponsive, then the block never gets taken off the list, and lockd
continually respawns new RPC's.

If lockd is going down anyway, we don't want to keep retrying the
blocks, so allow it to break out of this loop. Additionally, in that
situation kthread_stop will have already woken lockd, so when
nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so
that it doesn't end up calling svc_recv and waiting for a long time.

This patch prevents this continual looping and allows lockd to
eventually come down, but this can quite some time since lockd
won't check the condition in the nlmsvc_retry_blocked while loop
until nlmsvc_grant_blocked returns. That won't happen until
all of the portmap requests time out.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/lockd/svc.c     |    4 ++++
 fs/lockd/svclock.c |    3 ++-
 2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 5752e1b..a0e5318 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -166,6 +166,10 @@ lockd(void *vrqstp)
 		} else if (time_before(grace_period_expire, jiffies))
 			clear_grace_period();

+		/* nlmsvc_retry_blocked can block, so check for kthread_stop */
+		if (kthread_should_stop())
+			break;
+
 		/*
 		 * Find a socket with data available and call its
 		 * recvfrom routine.
diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index 2f4d8fa..d8324b6 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -29,6 +29,7 @@
 #include <linux/sunrpc/svc.h>
 #include <linux/lockd/nlm.h>
 #include <linux/lockd/lockd.h>
+#include <linux/kthread.h>

 #define NLMDBG_FACILITY		NLMDBG_SVCLOCK

@@ -867,7 +868,7 @@ nlmsvc_retry_blocked(void)
 	unsigned long	timeout = MAX_SCHEDULE_TIMEOUT;
 	struct nlm_block *block;

-	while (!list_empty(&nlm_blocked)) {
+	while (!list_empty(&nlm_blocked) && !kthread_should_stop()) {
 		block = list_entry(nlm_blocked.next, struct nlm_block, b_list);

 		if (block->b_when == NLM_NEVER)
-- 
1.5.3.7

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down
  2008-01-28 19:29   ` [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down Jeff Layton
@ 2008-01-29  0:05     ` J. Bruce Fields
  2008-01-29  1:12       ` Jeff Layton
  0 siblings, 1 reply; 13+ messages in thread
From: J. Bruce Fields @ 2008-01-29  0:05 UTC (permalink / raw)
  To: Jeff Layton; +Cc: neilb, linux-nfs

On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote:
> It's possible to end up continually looping in nlmsvc_retry_blocked()
> even when the lockd kthread has been requested to come down. I've been
> able to reproduce this fairly easily by having an RPC grant callback
> queued up to an RPC client that's not bound. If the other host is
> unresponsive, then the block never gets taken off the list, and lockd
> continually respawns new RPC's.
> 
> If lockd is going down anyway, we don't want to keep retrying the
> blocks, so allow it to break out of this loop. Additionally, in that
> situation kthread_stop will have already woken lockd, so when
> nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so
> that it doesn't end up calling svc_recv and waiting for a long time.

Stupid question: how is this different from the kthread_stop() coming
just after this kthread_should_stop() call but before we call
svc_recv()?  What keeps us from waiting in svc_recv() indefinitely after
kthread_stop()?

> This patch prevents this continual looping and allows lockd to
> eventually come down, but this can quite some time since lockd
> won't check the condition in the nlmsvc_retry_blocked while loop
> until nlmsvc_grant_blocked returns. That won't happen until
> all of the portmap requests time out.

So, shutdown problems aside, does that mean an unresponsive client can
cause lock to stop serving lock requests in normal operation?

--b.

> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/lockd/svc.c     |    4 ++++
>  fs/lockd/svclock.c |    3 ++-
>  2 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
> index 5752e1b..a0e5318 100644
> --- a/fs/lockd/svc.c
> +++ b/fs/lockd/svc.c
> @@ -166,6 +166,10 @@ lockd(void *vrqstp)
>  		} else if (time_before(grace_period_expire, jiffies))
>  			clear_grace_period();
>  
> +		/* nlmsvc_retry_blocked can block, so check for kthread_stop */
> +		if (kthread_should_stop())
> +			break;
> +
>  		/*
>  		 * Find a socket with data available and call its
>  		 * recvfrom routine.
> diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> index 2f4d8fa..d8324b6 100644
> --- a/fs/lockd/svclock.c
> +++ b/fs/lockd/svclock.c
> @@ -29,6 +29,7 @@
>  #include <linux/sunrpc/svc.h>
>  #include <linux/lockd/nlm.h>
>  #include <linux/lockd/lockd.h>
> +#include <linux/kthread.h>
>  
>  #define NLMDBG_FACILITY		NLMDBG_SVCLOCK
>  
> @@ -867,7 +868,7 @@ nlmsvc_retry_blocked(void)
>  	unsigned long	timeout = MAX_SCHEDULE_TIMEOUT;
>  	struct nlm_block *block;
>  
> -	while (!list_empty(&nlm_blocked)) {
> +	while (!list_empty(&nlm_blocked) && !kthread_should_stop()) {
>  		block = list_entry(nlm_blocked.next, struct nlm_block, b_list);
>  
>  		if (block->b_when == NLM_NEVER)
> -- 
> 1.5.3.7
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down
  2008-01-29  0:05     ` J. Bruce Fields
@ 2008-01-29  1:12       ` Jeff Layton
       [not found]         ` <20080128201210.376097e6-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2008-01-29  1:12 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: neilb, linux-nfs

On Mon, 28 Jan 2008 19:05:17 -0500
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote:
> > It's possible to end up continually looping in
> > nlmsvc_retry_blocked() even when the lockd kthread has been
> > requested to come down. I've been able to reproduce this fairly
> > easily by having an RPC grant callback queued up to an RPC client
> > that's not bound. If the other host is unresponsive, then the block
> > never gets taken off the list, and lockd continually respawns new
> > RPC's.
> > 
> > If lockd is going down anyway, we don't want to keep retrying the
> > blocks, so allow it to break out of this loop. Additionally, in that
> > situation kthread_stop will have already woken lockd, so when
> > nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so
> > that it doesn't end up calling svc_recv and waiting for a long time.
> 
> Stupid question: how is this different from the kthread_stop() coming
> just after this kthread_should_stop() call but before we call
> svc_recv()?  What keeps us from waiting in svc_recv() indefinitely
> after kthread_stop()?
> 

Nothing at all, unfortunately :-(

Since we've taken signalling out of the lockd_down codepath, this gets
a lot trickier than it used to be. I'm starting to wonder if we should
go back to sending a signal on lockd_down.

> > This patch prevents this continual looping and allows lockd to
> > eventually come down, but this can quite some time since lockd
> > won't check the condition in the nlmsvc_retry_blocked while loop
> > until nlmsvc_grant_blocked returns. That won't happen until
> > all of the portmap requests time out.
> 
> So, shutdown problems aside, does that mean an unresponsive client can
> cause lock to stop serving lock requests in normal operation?
> 

Possibly.

I think that in general what happens is that blocks eventually get
reinserted into the list with an expire time that is well after the
current jiffies and we eventually hit this condition:

                if (time_after(block->b_when,jiffies)) {
                        timeout = block->b_when - jiffies;
                        break;
                }

...but I was seeing lockd loop in this loop over several iterations
without returning back up to the main lockd loop. I think what was
happening was that we were inserting the block with a later jiffies in
nlmsvc_grant_blocked, but the rpcbind attempts would end up taking
longer than that timeout. So when we returned from
nlmsvc_grant_blocked() that condition would no longer fire, and we'd end up retrying the grant callback again.

As to why we don't see this more often, I'm not sure. I probably need
to experiment a bit more with it to make sure I'm understanding this
correctly.

Let me do that and get back to you...

Thanks,
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

[parent not found: <20080128201210.376097e6-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]

* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down
       [not found]         ` <20080128201210.376097e6-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-01-29  1:23           ` J. Bruce Fields
  2008-01-29  2:29             ` Jeff Layton
  0 siblings, 1 reply; 13+ messages in thread
From: J. Bruce Fields @ 2008-01-29  1:23 UTC (permalink / raw)
  To: Jeff Layton; +Cc: neilb, linux-nfs

On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote:
> On Mon, 28 Jan 2008 19:05:17 -0500
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
> 
> > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote:
> > > It's possible to end up continually looping in
> > > nlmsvc_retry_blocked() even when the lockd kthread has been
> > > requested to come down. I've been able to reproduce this fairly
> > > easily by having an RPC grant callback queued up to an RPC client
> > > that's not bound. If the other host is unresponsive, then the block
> > > never gets taken off the list, and lockd continually respawns new
> > > RPC's.
> > > 
> > > If lockd is going down anyway, we don't want to keep retrying the
> > > blocks, so allow it to break out of this loop. Additionally, in that
> > > situation kthread_stop will have already woken lockd, so when
> > > nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so
> > > that it doesn't end up calling svc_recv and waiting for a long time.
> > 
> > Stupid question: how is this different from the kthread_stop() coming
> > just after this kthread_should_stop() call but before we call
> > svc_recv()?  What keeps us from waiting in svc_recv() indefinitely
> > after kthread_stop()?
> > 
> 
> Nothing at all, unfortunately :-(
> 
> Since we've taken signalling out of the lockd_down codepath, this gets
> a lot trickier than it used to be. I'm starting to wonder if we should
> go back to sending a signal on lockd_down.

OK, thanks.  So do I keep these patches, or not?  This sounds like a
regression (if perhaps a very minor one--I'm not quite clear).  Help!

--b.

> 
> > > This patch prevents this continual looping and allows lockd to
> > > eventually come down, but this can quite some time since lockd
> > > won't check the condition in the nlmsvc_retry_blocked while loop
> > > until nlmsvc_grant_blocked returns. That won't happen until
> > > all of the portmap requests time out.
> > 
> > So, shutdown problems aside, does that mean an unresponsive client can
> > cause lock to stop serving lock requests in normal operation?
> > 
> 
> Possibly.
> 
> I think that in general what happens is that blocks eventually get
> reinserted into the list with an expire time that is well after the
> current jiffies and we eventually hit this condition:
> 
>                 if (time_after(block->b_when,jiffies)) {
>                         timeout = block->b_when - jiffies;
>                         break;
>                 }
> 
> ...but I was seeing lockd loop in this loop over several iterations
> without returning back up to the main lockd loop. I think what was
> happening was that we were inserting the block with a later jiffies in
> nlmsvc_grant_blocked, but the rpcbind attempts would end up taking
> longer than that timeout. So when we returned from
> nlmsvc_grant_blocked() that condition would no longer fire, and we'd end up retrying the grant callback again.
> 
> As to why we don't see this more often, I'm not sure. I probably need
> to experiment a bit more with it to make sure I'm understanding this
> correctly.
> 
> Let me do that and get back to you...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down
  2008-01-29  1:23           ` J. Bruce Fields
@ 2008-01-29  2:29             ` Jeff Layton
       [not found]               ` <20080128212942.77035005-PC62bkCOHzGdMjc06nkz3ljfA9RmPOcC@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2008-01-29  2:29 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: neilb, linux-nfs

On Mon, 28 Jan 2008 20:23:51 -0500
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote:
> > On Mon, 28 Jan 2008 19:05:17 -0500
> > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > 
> > > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote:
> > > > It's possible to end up continually looping in
> > > > nlmsvc_retry_blocked() even when the lockd kthread has been
> > > > requested to come down. I've been able to reproduce this fairly
> > > > easily by having an RPC grant callback queued up to an RPC client
> > > > that's not bound. If the other host is unresponsive, then the block
> > > > never gets taken off the list, and lockd continually respawns new
> > > > RPC's.
> > > > 
> > > > If lockd is going down anyway, we don't want to keep retrying the
> > > > blocks, so allow it to break out of this loop. Additionally, in that
> > > > situation kthread_stop will have already woken lockd, so when
> > > > nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so
> > > > that it doesn't end up calling svc_recv and waiting for a long time.
> > > 
> > > Stupid question: how is this different from the kthread_stop() coming
> > > just after this kthread_should_stop() call but before we call
> > > svc_recv()?  What keeps us from waiting in svc_recv() indefinitely
> > > after kthread_stop()?
> > > 
> > 
> > Nothing at all, unfortunately :-(
> > 
> > Since we've taken signalling out of the lockd_down codepath, this gets
> > a lot trickier than it used to be. I'm starting to wonder if we should
> > go back to sending a signal on lockd_down.
> 
> OK, thanks.  So do I keep these patches, or not?  This sounds like a
> regression (if perhaps a very minor one--I'm not quite clear).  Help!
> 

Well, if we reintroduce signalling in lockd_down then this particular
problem goes away. It may be reasonable to add that back into the mix
for now.

As to the more general question of whether to keep these patches...

Most of them are pretty innocuous, but the patch to convert to kthread
API is the biggest change. 2.6.25 might be a bit aggressive for that. It
wouldn't hurt to let the conversion to kthread API brew until 2.6.26.

lockd still does have a use after free problem though, so patch 1/2
here might be worth considering even without changing things over to
kthreads. I've not tested it on a kernel without the kthread patches,
however. I can do that if you think that's the right approach.

> > 
> > > > This patch prevents this continual looping and allows lockd to
> > > > eventually come down, but this can quite some time since lockd
> > > > won't check the condition in the nlmsvc_retry_blocked while loop
> > > > until nlmsvc_grant_blocked returns. That won't happen until
> > > > all of the portmap requests time out.
> > > 
> > > So, shutdown problems aside, does that mean an unresponsive client can
> > > cause lock to stop serving lock requests in normal operation?
> > > 
> > 
> > Possibly.
> > 
> > I think that in general what happens is that blocks eventually get
> > reinserted into the list with an expire time that is well after the
> > current jiffies and we eventually hit this condition:
> > 
> >                 if (time_after(block->b_when,jiffies)) {
> >                         timeout = block->b_when - jiffies;
> >                         break;
> >                 }
> > 
> > ...but I was seeing lockd loop in this loop over several iterations
> > without returning back up to the main lockd loop. I think what was
> > happening was that we were inserting the block with a later jiffies in
> > nlmsvc_grant_blocked, but the rpcbind attempts would end up taking
> > longer than that timeout. So when we returned from
> > nlmsvc_grant_blocked() that condition would no longer fire, and we'd end up retrying the grant callback again.
> > 
> > As to why we don't see this more often, I'm not sure. I probably need
> > to experiment a bit more with it to make sure I'm understanding this
> > correctly.
> > 
> > Let me do that and get back to you...


-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

[parent not found: <20080128212942.77035005-PC62bkCOHzGdMjc06nkz3ljfA9RmPOcC@public.gmane.org>]

* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down
       [not found]               ` <20080128212942.77035005-PC62bkCOHzGdMjc06nkz3ljfA9RmPOcC@public.gmane.org>
@ 2008-01-29  3:02                 ` J. Bruce Fields
  2008-01-29 11:23                   ` Jeff Layton
  0 siblings, 1 reply; 13+ messages in thread
From: J. Bruce Fields @ 2008-01-29  3:02 UTC (permalink / raw)
  To: Jeff Layton; +Cc: neilb, linux-nfs

On Mon, Jan 28, 2008 at 09:29:42PM -0500, Jeff Layton wrote:
> On Mon, 28 Jan 2008 20:23:51 -0500
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
> 
> > On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote:
> > > On Mon, 28 Jan 2008 19:05:17 -0500
> > > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > > 
> > > > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote:
> > > > > It's possible to end up continually looping in
> > > > > nlmsvc_retry_blocked() even when the lockd kthread has been
> > > > > requested to come down. I've been able to reproduce this fairly
> > > > > easily by having an RPC grant callback queued up to an RPC client
> > > > > that's not bound. If the other host is unresponsive, then the block
> > > > > never gets taken off the list, and lockd continually respawns new
> > > > > RPC's.
> > > > > 
> > > > > If lockd is going down anyway, we don't want to keep retrying the
> > > > > blocks, so allow it to break out of this loop. Additionally, in that
> > > > > situation kthread_stop will have already woken lockd, so when
> > > > > nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so
> > > > > that it doesn't end up calling svc_recv and waiting for a long time.
> > > > 
> > > > Stupid question: how is this different from the kthread_stop() coming
> > > > just after this kthread_should_stop() call but before we call
> > > > svc_recv()?  What keeps us from waiting in svc_recv() indefinitely
> > > > after kthread_stop()?
> > > > 
> > > 
> > > Nothing at all, unfortunately :-(
> > > 
> > > Since we've taken signalling out of the lockd_down codepath, this gets
> > > a lot trickier than it used to be. I'm starting to wonder if we should
> > > go back to sending a signal on lockd_down.
> > 
> > OK, thanks.  So do I keep these patches, or not?  This sounds like a
> > regression (if perhaps a very minor one--I'm not quite clear).  Help!
> > 
> 
> Well, if we reintroduce signalling in lockd_down then this particular
> problem goes away. It may be reasonable to add that back into the mix
> for now.
> 
> As to the more general question of whether to keep these patches...
> 
> Most of them are pretty innocuous, but the patch to convert to kthread
> API is the biggest change. 2.6.25 might be a bit aggressive for that. It
> wouldn't hurt to let the conversion to kthread API brew until 2.6.26.
> 
> lockd still does have a use after free problem though, so patch 1/2
> here might be worth considering even without changing things over to
> kthreads. I've not tested it on a kernel without the kthread patches,
> however. I can do that if you think that's the right approach.

Pfft, I was hoping you'd tell me what to do.

But, yes, it sounds like dropping the kthread conversion and sending in
that one bugfix is probably wisest.  (But I assume we should keep the
first of those three patches, "SUNRPC: spin svc_rqst initialization to
its own function".)

--b.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down
  2008-01-29  3:02                 ` J. Bruce Fields
@ 2008-01-29 11:23                   ` Jeff Layton
       [not found]                     ` <20080129062312.33ba5a54-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2008-01-29 11:23 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: neilb, linux-nfs

On Mon, 28 Jan 2008 22:02:49 -0500
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Mon, Jan 28, 2008 at 09:29:42PM -0500, Jeff Layton wrote:
> > On Mon, 28 Jan 2008 20:23:51 -0500
> > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > 
> > > On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote:
> > > > On Mon, 28 Jan 2008 19:05:17 -0500
> > > > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > > > 
> > > > > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote:
> > > > > > It's possible to end up continually looping in
> > > > > > nlmsvc_retry_blocked() even when the lockd kthread has been
> > > > > > requested to come down. I've been able to reproduce this
> > > > > > fairly easily by having an RPC grant callback queued up to
> > > > > > an RPC client that's not bound. If the other host is
> > > > > > unresponsive, then the block never gets taken off the list,
> > > > > > and lockd continually respawns new RPC's.
> > > > > > 
> > > > > > If lockd is going down anyway, we don't want to keep
> > > > > > retrying the blocks, so allow it to break out of this loop.
> > > > > > Additionally, in that situation kthread_stop will have
> > > > > > already woken lockd, so when nlmsvc_retry_blocked returns,
> > > > > > have lockd check for kthread_stop() so that it doesn't end
> > > > > > up calling svc_recv and waiting for a long time.
> > > > > 
> > > > > Stupid question: how is this different from the
> > > > > kthread_stop() coming just after this kthread_should_stop()
> > > > > call but before we call svc_recv()?  What keeps us from
> > > > > waiting in svc_recv() indefinitely after kthread_stop()?
> > > > > 
> > > > 
> > > > Nothing at all, unfortunately :-(
> > > > 
> > > > Since we've taken signalling out of the lockd_down codepath,
> > > > this gets a lot trickier than it used to be. I'm starting to
> > > > wonder if we should go back to sending a signal on lockd_down.
> > > 
> > > OK, thanks.  So do I keep these patches, or not?  This sounds
> > > like a regression (if perhaps a very minor one--I'm not quite
> > > clear).  Help!
> > > 
> > 
> > Well, if we reintroduce signalling in lockd_down then this
> > particular problem goes away. It may be reasonable to add that back
> > into the mix for now.
> > 
> > As to the more general question of whether to keep these patches...
> > 
> > Most of them are pretty innocuous, but the patch to convert to
> > kthread API is the biggest change. 2.6.25 might be a bit aggressive
> > for that. It wouldn't hurt to let the conversion to kthread API
> > brew until 2.6.26.
> > 
> > lockd still does have a use after free problem though, so patch 1/2
> > here might be worth considering even without changing things over to
> > kthreads. I've not tested it on a kernel without the kthread
> > patches, however. I can do that if you think that's the right
> > approach.
> 
> Pfft, I was hoping you'd tell me what to do.
> 
> But, yes, it sounds like dropping the kthread conversion and sending
> in that one bugfix is probably wisest.  (But I assume we should keep
> the first of those three patches, "SUNRPC: spin svc_rqst
> initialization to its own function".)
> 

Yes. The first patch fixes a possible NULL pointer dereference so I
think we want that one here. This patch:

    Subject: [PATCH 2/4] SUNRPC: export svc_sock_update_bufs

...should be safe too, though it's not strictly needed until we convert
lockd to the kthread API.

So at this point we have 2 questions:

1) is there a way to make sure we don't block in svc_recv() without
resorting to signals and can we reasonably mix signaling +
kthread_stop? There may be a chicken and egg problem involved here
though I haven't thought it through yet...

2) can an unresponsive client make lockd loop continuously in
nlmsvc_retry_blocked() and prevent it from serving new lock requests?

I'll try to get answers to both of these...

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

[parent not found: <20080129062312.33ba5a54-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]

* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down
       [not found]                     ` <20080129062312.33ba5a54-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-01-29 12:32                       ` Jeff Layton
       [not found]                         ` <20080129073241.7c5c2ef1-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2008-01-29 12:32 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: neilb, linux-nfs

On Tue, 29 Jan 2008 06:23:12 -0500
Jeff Layton <jlayton@redhat.com> wrote:

> On Mon, 28 Jan 2008 22:02:49 -0500
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
> 
> > On Mon, Jan 28, 2008 at 09:29:42PM -0500, Jeff Layton wrote:
> > > On Mon, 28 Jan 2008 20:23:51 -0500
> > > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > > 
> > > > On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote:
> > > > > On Mon, 28 Jan 2008 19:05:17 -0500
> > > > > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > > > > 
> > > > > > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote:
> > > > > > > It's possible to end up continually looping in
> > > > > > > nlmsvc_retry_blocked() even when the lockd kthread has
> > > > > > > been requested to come down. I've been able to reproduce
> > > > > > > this fairly easily by having an RPC grant callback queued
> > > > > > > up to an RPC client that's not bound. If the other host is
> > > > > > > unresponsive, then the block never gets taken off the
> > > > > > > list, and lockd continually respawns new RPC's.
> > > > > > > 
> > > > > > > If lockd is going down anyway, we don't want to keep
> > > > > > > retrying the blocks, so allow it to break out of this
> > > > > > > loop. Additionally, in that situation kthread_stop will
> > > > > > > have already woken lockd, so when nlmsvc_retry_blocked
> > > > > > > returns, have lockd check for kthread_stop() so that it
> > > > > > > doesn't end up calling svc_recv and waiting for a long
> > > > > > > time.
> > > > > > 
> > > > > > Stupid question: how is this different from the
> > > > > > kthread_stop() coming just after this kthread_should_stop()
> > > > > > call but before we call svc_recv()?  What keeps us from
> > > > > > waiting in svc_recv() indefinitely after kthread_stop()?
> > > > > > 
> > > > > 
> > > > > Nothing at all, unfortunately :-(
> > > > > 
> > > > > Since we've taken signalling out of the lockd_down codepath,
> > > > > this gets a lot trickier than it used to be. I'm starting to
> > > > > wonder if we should go back to sending a signal on lockd_down.
> > > > 
> > > > OK, thanks.  So do I keep these patches, or not?  This sounds
> > > > like a regression (if perhaps a very minor one--I'm not quite
> > > > clear).  Help!
> > > > 
> > > 
> > > Well, if we reintroduce signalling in lockd_down then this
> > > particular problem goes away. It may be reasonable to add that
> > > back into the mix for now.
> > > 
> > > As to the more general question of whether to keep these
> > > patches...
> > > 
> > > Most of them are pretty innocuous, but the patch to convert to
> > > kthread API is the biggest change. 2.6.25 might be a bit
> > > aggressive for that. It wouldn't hurt to let the conversion to
> > > kthread API brew until 2.6.26.
> > > 
> > > lockd still does have a use after free problem though, so patch
> > > 1/2 here might be worth considering even without changing things
> > > over to kthreads. I've not tested it on a kernel without the
> > > kthread patches, however. I can do that if you think that's the
> > > right approach.
> > 
> > Pfft, I was hoping you'd tell me what to do.
> > 
> > But, yes, it sounds like dropping the kthread conversion and sending
> > in that one bugfix is probably wisest.  (But I assume we should keep
> > the first of those three patches, "SUNRPC: spin svc_rqst
> > initialization to its own function".)
> > 
> 
> Yes. The first patch fixes a possible NULL pointer dereference so I
> think we want that one here. This patch:
> 
>     Subject: [PATCH 2/4] SUNRPC: export svc_sock_update_bufs
> 
> ...should be safe too, though it's not strictly needed until we
> convert lockd to the kthread API.
> 
> So at this point we have 2 questions:
> 
> 1) is there a way to make sure we don't block in svc_recv() without
> resorting to signals and can we reasonably mix signaling +
> kthread_stop? There may be a chicken and egg problem involved here
> though I haven't thought it through yet...
> 

Still looking at this, though I'll note that CIFS has some kthreads
that deal with sockets. When shutting those down, we generally send
a signal to them in addition to the kthread_stop call. I'll look at how
they do that and plan to try out something similar here for lockd.

> 2) can an unresponsive client make lockd loop continuously in
> nlmsvc_retry_blocked() and prevent it from serving new lock requests?
> 

Yes. This is definitely the case. I can trivially hang lockd with the
right set of steps. Let me know if you want them, though you can
probably figure them out.

I'm not sure what the right way to fix this is. Maybe this timeout in
nlmsvc_grant_blocked should be longer? Something longer than the amount
of time that it takes to do an rpc_ping on an unbound RPC client:

        /* Schedule next grant callback in 30 seconds */
        nlmsvc_insert_block(block, 30 * HZ);

The more I look at lockd and its problems, I'm starting to think that
we might be better served by using two threads. Have one that just
handles svc_recv/svc_process and another thread that handles
nlmsvc_retry_blocked and maybe nlmsvc_invalidate_all when things are
signalled. This second thread could even be a workqueue and the main
lockd thread could queue tasks to it as needed.

Finally, there's one more question that I forgot:

3) does this patch fix the use after free bug independently of the
kthread patches:

Subject: Re: [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts

...the answer is yes. It does seem to.

Bruce, would you like me to roll up a new patchset to make it clear
what I'd like to see for 2.6.25?

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

[parent not found: <20080129073241.7c5c2ef1-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>]

* Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down
       [not found]                         ` <20080129073241.7c5c2ef1-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-01-29 14:41                           ` J. Bruce Fields
  0 siblings, 0 replies; 13+ messages in thread
From: J. Bruce Fields @ 2008-01-29 14:41 UTC (permalink / raw)
  To: Jeff Layton; +Cc: neilb, linux-nfs

On Tue, Jan 29, 2008 at 07:32:41AM -0500, Jeff Layton wrote:
> Bruce, would you like me to roll up a new patchset to make it clear
> what I'd like to see for 2.6.25?

That would be fantastic.  Thanks for all your work on this.

--b.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts
  2008-01-28 19:29 ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts Jeff Layton
  2008-01-28 19:29   ` [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down Jeff Layton
@ 2008-01-28 23:18   ` J. Bruce Fields
  2008-01-29  0:04     ` Jeff Layton
  1 sibling, 1 reply; 13+ messages in thread
From: J. Bruce Fields @ 2008-01-28 23:18 UTC (permalink / raw)
  To: Jeff Layton; +Cc: neilb, linux-nfs

On Mon, Jan 28, 2008 at 02:29:10PM -0500, Jeff Layton wrote:
> It's possible for a RPC to outlive the lockd daemon that created it, so
> we need to make sure that all RPC's are killed when lockd is coming
> down. When nlm_shutdown_hosts is called, kill off all RPC tasks
> associated with the host. Since we need to wait until they have all gone
> away, we might as well just shut down the RPC client altogether.
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/lockd/host.c |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/lockd/host.c b/fs/lockd/host.c
> index ebec009..ca6b16f 100644
> --- a/fs/lockd/host.c
> +++ b/fs/lockd/host.c
> @@ -379,8 +379,13 @@ nlm_shutdown_hosts(void)
>  	/* First, make all hosts eligible for gc */
>  	dprintk("lockd: nuking all hosts...\n");
>  	for (chain = nlm_hosts; chain < nlm_hosts + NLM_HOST_NRHASH; ++chain) {
> -		hlist_for_each_entry(host, pos, chain, h_hash)
> +		hlist_for_each_entry(host, pos, chain, h_hash) {
>  			host->h_expires = jiffies - 1;
> +			if (host->h_rpcclnt) {

So the only difference from the previous patch is this test and the
following assignment?  OK.

--b.

> +				rpc_shutdown_client(host->h_rpcclnt);
> +				host->h_rpcclnt = NULL;
> +			}
> +		}
>  	}
>  
>  	/* Then, perform a garbage collection pass */
> -- 
> 1.5.3.7
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts
  2008-01-28 23:18   ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts J. Bruce Fields
@ 2008-01-29  0:04     ` Jeff Layton
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff Layton @ 2008-01-29  0:04 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: neilb, linux-nfs

On Mon, 28 Jan 2008 18:18:26 -0500
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Mon, Jan 28, 2008 at 02:29:10PM -0500, Jeff Layton wrote:
> > It's possible for a RPC to outlive the lockd daemon that created
> > it, so we need to make sure that all RPC's are killed when lockd is
> > coming down. When nlm_shutdown_hosts is called, kill off all RPC
> > tasks associated with the host. Since we need to wait until they
> > have all gone away, we might as well just shut down the RPC client
> > altogether.
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> >  fs/lockd/host.c |    7 ++++++-
> >  1 files changed, 6 insertions(+), 1 deletions(-)
> > 
> > diff --git a/fs/lockd/host.c b/fs/lockd/host.c
> > index ebec009..ca6b16f 100644
> > --- a/fs/lockd/host.c
> > +++ b/fs/lockd/host.c
> > @@ -379,8 +379,13 @@ nlm_shutdown_hosts(void)
> >  	/* First, make all hosts eligible for gc */
> >  	dprintk("lockd: nuking all hosts...\n");
> >  	for (chain = nlm_hosts; chain < nlm_hosts +
> > NLM_HOST_NRHASH; ++chain) {
> > -		hlist_for_each_entry(host, pos, chain, h_hash)
> > +		hlist_for_each_entry(host, pos, chain, h_hash) {
> >  			host->h_expires = jiffies - 1;
> > +			if (host->h_rpcclnt) {
> 
> So the only difference from the previous patch is this test and the
> following assignment?  OK.
> 
> --b.
> 

No, the old patch used rpc_killall_tasks and this one uses
rpc_shutdown_client. I don't believe rpc_killall_tasks gives any
guarantee about when the tasks actually complete, and we need to know
that they're gone before we allow lockd to exit.

> > +
> > rpc_shutdown_client(host->h_rpcclnt);
> > +				host->h_rpcclnt = NULL;
> > +			}
> > +		}
> >  	}
> >  
> >  	/* Then, perform a garbage collection pass */
> > -- 
> > 1.5.3.7
> > 


-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2008-01-29 14:41 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-28 19:29 [PATCH 0/2] NLM: fix use after free in lockd Jeff Layton
2008-01-28 19:29 ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts Jeff Layton
2008-01-28 19:29   ` [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down Jeff Layton
2008-01-29  0:05     ` J. Bruce Fields
2008-01-29  1:12       ` Jeff Layton
     [not found]         ` <20080128201210.376097e6-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-01-29  1:23           ` J. Bruce Fields
2008-01-29  2:29             ` Jeff Layton
     [not found]               ` <20080128212942.77035005-PC62bkCOHzGdMjc06nkz3ljfA9RmPOcC@public.gmane.org>
2008-01-29  3:02                 ` J. Bruce Fields
2008-01-29 11:23                   ` Jeff Layton
     [not found]                     ` <20080129062312.33ba5a54-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-01-29 12:32                       ` Jeff Layton
     [not found]                         ` <20080129073241.7c5c2ef1-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-01-29 14:41                           ` J. Bruce Fields
2008-01-28 23:18   ` [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts J. Bruce Fields
2008-01-29  0:04     ` Jeff Layton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox