* rpc.gssd still spammed in 2.6.35
From: Brian De Wolf @ 2010-10-28  0:24 UTC
  To: linux-nfs@vger.kernel.org

Greetings,

I recently started testing a build of 2.6.35 to hopefully relieve some
issues we have on our login boxes.  Specifically, I was after this
commit:
http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=commit;h=126e216a8730532dfb685205309275f87e3d133e

The issue we've run into is that a user loses their credentials but
has a process looping on a read/write of their Kerberized NFSv4 home
directory without checking the return value.  Not only does this spam
the logs, it also prevents rpc.gssd from handling anyone else's logins,
effectively taking down the service for anyone not already connected.
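
For what it's worth, the offending process doesn't need to be anything
exotic.  Something as dumb as the loop below is enough to keep the
requests coming once the credential cache is gone (illustrative only --
the path is made up, and in our case it was an existing user job rather
than anything we wrote):

/* loop_read.c: re-read a file on a Kerberized NFS mount forever and
 * ignore every error, the way a careless background job might.
 * "/nfshome/user/data" is a placeholder path. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];

        for (;;) {
                int fd = open("/nfshome/user/data", O_RDONLY);

                if (fd >= 0) {
                        read(fd, buf, sizeof(buf));     /* result ignored */
                        close(fd);
                }
                /* no error check, no backoff: once the creds are gone,
                 * every pass can trigger another request to rpc.gssd */
        }
}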

I was hoping this commit would protect rpc.gssd from any potential
flooding of requests, but it all depends on how the user loses their
credentials. If their credentials have expired or their caches become
corrupt, rpc.gssd returns EKEYEXPIRED and the kernel rate limits the
requests to rpc.gssd via negative caching.

If the user's credential cache gets destroyed, however, rpc.gssd
returns EACCES, and the user process can cause the kernel to hammer
rpc.gssd. The kicker here is that pam_krb5 destroys credentials on
logout by default, so if someone's using screen or long background
processes in their home directory, it's a ticking time bomb waiting to
destroy rpc.gssd.
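
My (possibly shaky) mental model of the two cases, as a simplified
userspace sketch -- none of these names are the real kernel ones, and
the actual logic lives in net/sunrpc/auth_gss/auth_gss.c:

/* negcache.c: toy model of credential negative caching.  A cred
 * marked negative makes lookups fail immediately for NEG_TIMEOUT
 * seconds instead of generating another request to rpc.gssd. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define NEG_TIMEOUT 5                   /* seconds, roughly what I observe */

struct toy_cred {
        bool negative;                  /* stands in for RPCAUTH_CRED_NEGATIVE */
        time_t upcall_stamp;            /* stands in for gc_upcall_timestamp */
};

/* true => fail the request right away instead of upcalling to rpc.gssd */
static bool cred_is_negative(const struct toy_cred *c)
{
        return c->negative && time(NULL) < c->upcall_stamp + NEG_TIMEOUT;
}

/* record the result that rpc.gssd passed down for this cred */
static void handle_downcall(struct toy_cred *c, int err)
{
        switch (err) {
        case 0:
                c->negative = false;    /* context established */
                break;
        case EKEYEXPIRED:
                c->negative = true;     /* the only error negative cached today */
                break;
        default:
                break;                  /* e.g. EACCES: next attempt upcalls again */
        }
        c->upcall_stamp = time(NULL);
}

int main(void)
{
        struct toy_cred expired = { 0 }, destroyed = { 0 };

        handle_downcall(&expired, EKEYEXPIRED); /* expired ticket */
        handle_downcall(&destroyed, EACCES);    /* kdestroy'd cache */

        printf("expired creds, immediate retry reaches rpc.gssd?   %s\n",
               cred_is_negative(&expired) ? "no" : "yes");
        printf("destroyed creds, immediate retry reaches rpc.gssd? %s\n",
               cred_is_negative(&destroyed) ? "no" : "yes");
        return 0;
}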

That's assuming a benign user, as well.  A malicious user could easily
kdestroy, wait for their credentials to expire from the cache in the
kernel, and start tying up rpc.gssd with failed requests.


With this in mind, I initially patched the kernel to negative cache
entries with EACCES errors, in addition to EKEYEXPIRED errors.  But the
more that I thought about it, the more it seemed appropriate to subject
all possible errors to negative caching.  The underlying question is,
is there any possible error from rpc.gssd where it would be appropriate
to allow a process to cause another request to rpc.gssd immediately?
If there isn't, negative caching all errors seems reasonable.

Here's a simple proof-of-concept patch that negative caches every
failed request.  With it applied, I have yet to produce a scenario
where rpc.gssd becomes unresponsive.

Let me know what you think.  I'd love to see a fix for this behavior
enter the kernel at some point, as it's been rather disruptive on our
login boxes lately.


diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index 3835ce3..38bdf90 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -362,7 +362,7 @@ gss_handle_downcall_result(struct gss_cred *gss_cred, struct gss_upcall_msg *gss
                clear_bit(RPCAUTH_CRED_NEGATIVE, &gss_cred->gc_base.cr_flags);
                gss_cred_set_ctx(&gss_cred->gc_base, gss_msg->ctx);
                break;
-       case -EKEYEXPIRED:
+       default:
                set_bit(RPCAUTH_CRED_NEGATIVE, &gss_cred->gc_base.cr_flags);
        }
        gss_cred->gc_upcall_timestamp = jiffies;


* Re: rpc.gssd still spammed in 2.6.35
From: Trond Myklebust @ 2010-10-28 14:00 UTC
  To: Brian De Wolf; +Cc: linux-nfs@vger.kernel.org

On Wed, 2010-10-27 at 17:24 -0700, Brian De Wolf wrote:
> Greetings,
> 
> I recently started testing a build of 2.6.35 to hopefully relieve some
> issues we have on our login boxes.  Specifically, I was after this
> commit:
> http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=commit;h=126e216a8730532dfb685205309275f87e3d133e
> 
> The issue we've run into is that a user loses their credentials but
> has a process looping on a read/write of their Kerberized NFSv4 home
> directory without checking the return value.  Not only does this spam
> the logs, it also prevents rpc.gssd from handling anyone else's logins,
> effectively taking down the service for anyone not already connected.
> 
> I was hoping this commit would protect rpc.gssd from any potential
> flooding of requests, but it all depends on how the user loses their
> credentials. If their credentials have expired or their caches become
> corrupt, rpc.gssd returns EKEYEXPIRED and the kernel rate limits the
> requests to rpc.gssd via negative caching.
> 
> If the user's credential cache gets destroyed, however, rpc.gssd
> returns EACCES, and the user process can cause the kernel to hammer
> rpc.gssd. The kicker here is that pam_krb5 destroys credentials on
> logout by default, so if someone's using screen or long background
> processes in their home directory, it's a ticking time bomb waiting to
> destroy rpc.gssd.
> 
> That's assuming a benign user, as well.  A malicious user could easily
> kdestroy, wait for their credentials to expire from the cache in the
> kernel, and start tying up rpc.gssd with failed requests.
> 
> 
> With this in mind, I initially patched the kernel to negative cache
> entries with EACCES errors, in addition to EKEYEXPIRED errors.  But the
> more that I thought about it, the more it seemed appropriate to subject
> all possible errors to negative caching.  The underlying question is,
> is there any possible error from rpc.gssd where it would be appropriate
> to allow a process to cause another request to rpc.gssd immediately?
> If there isn't, negative caching all errors seems reasonable.
> 
> Here's a simple proof-of-concept patch that negative caches every
> failed request.  With it applied, I have yet to produce a scenario
> where rpc.gssd becomes unresponsive.
> 
> Let me know what you think.  I'd love to see a fix for this behavior
> enter the kernel at some point, as it's been rather disruptive on our
> login boxes lately.
> 
> 
> diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
> index 3835ce3..38bdf90 100644
> --- a/net/sunrpc/auth_gss/auth_gss.c
> +++ b/net/sunrpc/auth_gss/auth_gss.c
> @@ -362,7 +362,7 @@ gss_handle_downcall_result(struct gss_cred *gss_cred, struct gss_upcall_msg *gss
>                 clear_bit(RPCAUTH_CRED_NEGATIVE, &gss_cred->gc_base.cr_flags);
>                 gss_cred_set_ctx(&gss_cred->gc_base, gss_msg->ctx);
>                 break;
> -       case -EKEYEXPIRED:
> +       default:
>                 set_bit(RPCAUTH_CRED_NEGATIVE, &gss_cred->gc_base.cr_flags);
>         }
>         gss_cred->gc_upcall_timestamp = jiffies;

What about the rpc_pipefs errors, EAGAIN, EPIPE and ETIMEDOUT? Why
should they result in the cred being marked as negative?

rpc.gssd itself will only pass down 3 errors: 0, EKEYEXPIRED and EACCES.

Trond


* Re: rpc.gssd still spammed in 2.6.35
From: Brian De Wolf @ 2010-10-28 23:15 UTC
  To: Trond Myklebust; +Cc: linux-nfs@vger.kernel.org

On Thu, 28 Oct 2010 07:00:19 -0700
Trond Myklebust <Trond.Myklebust@netapp.com> wrote:

> What about the rpc_pipefs errors, EAGAIN, EPIPE and ETIMEDOUT? Why
> should they result in the cred being marked as negative?
> 

I have a limited grasp of the exact mechanics, but my general
reasoning is this:

If a given credential request causes an error to be returned, be it
from rpc_pipefs or rpc.gssd, there are two possible reasons for the
failure:

1) rpc.gssd is missing or unresponsive.  In that case it doesn't matter
whether you retry immediately or wait 5 seconds; it's still going to
fail.

2) Something about the request causes either rpc_pipefs or rpc.gssd to
produce an error while other requests still process normally.  In that
case we should prioritize the requests that can succeed by negative
caching the ones that fail.  Otherwise the failing requests can flood
rpc.gssd and prevent the ones that would succeed from ever being
attempted (which is exactly what has been happening in my environment);
a rough simulation of the difference is below.
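
To make that concrete, here's a toy model of the difference (entirely
made-up numbers; it just counts how many requests for one broken
credential actually reach rpc.gssd in a second, with and without
negative caching):

/* flood.c: one process retries a failing credential as fast as it can;
 * count the upcalls that reach rpc.gssd in one simulated second. */
#include <stdbool.h>
#include <stdio.h>

#define WINDOW_MS      1000     /* length of the simulation */
#define RETRY_MS       1        /* the broken process retries every 1 ms */
#define NEG_WINDOW_MS  5000     /* negative cache timeout, ~5 seconds */

static int upcalls_reaching_gssd(bool negative_caching)
{
        int upcalls = 0;
        long negative_until = 0;

        for (long now = 0; now < WINDOW_MS; now += RETRY_MS) {
                if (negative_caching && now < negative_until)
                        continue;       /* fails fast in the kernel, gssd stays idle */
                upcalls++;              /* this attempt ties up rpc.gssd */
                if (negative_caching)
                        negative_until = now + NEG_WINDOW_MS;
        }
        return upcalls;
}

int main(void)
{
        printf("without negative caching: %d failed upcalls/sec\n",
               upcalls_reaching_gssd(false));
        printf("with negative caching:    %d failed upcalls/sec\n",
               upcalls_reaching_gssd(true));
        return 0;
}

Every request rpc.gssd spends on that credential is time it is not
spending on a login that would actually succeed.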


The only problem I can see with it is that if a request fails and the
keys become available within 5 seconds, the user just has to wait it
out.  I don't think I can usually "kinit" with my password in under 5
seconds, but I could see an automated system being interfered with.  I
haven't experimented with it, but I suspect a sub-second negative cache
timeout would still protect rpc.gssd from flooding while not causing
extra disruption to users (even a 500 ms window caps a tight retry loop
at two upcalls per second per credential, rather than however many the
process can generate).

I'd really just like to see some sort of rate-limiting on the failures
heading into rpc.gssd so that it can continue processing valid requests.

> rpc.gssd itself will only pass down 3 errors: 0, EKEYEXPIRED and EACCES.
> 

Is this set in stone?  My fear is that if rpc.gssd is ever improved to
return more error codes, or can somehow be coerced into returning some
other unexpected error, rpc.gssd could again be taken out of service by
flooding it with requests that bypass the negative caching.
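
If sticking to the errors rpc.gssd itself passes down is the preferred
route, then something as small as this (untested, just to spell out
what I mean) would at least cover the kdestroy case:

-       case -EKEYEXPIRED:
+       case -EKEYEXPIRED:
+       case -EACCES:
                set_bit(RPCAUTH_CRED_NEGATIVE, &gss_cred->gc_base.cr_flags);

though it would still leave the door open if an unexpected errno ever
showed up.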


Sorry if I'm out of touch with the internals or what's best for the
kernel.  I'm just a sysadmin dabbling in the kernel, trying to fix some
problems I've been running into...

