From: "Robert Milkowski" <rmilkowski@gmail.com>
To: <linux-nfs@vger.kernel.org>, <trond.myklebust@hammerspace.com>,
"'Chuck Lever'" <chuck.lever@oracle.com>,
"'Schumaker, Anna'" <Anna.Schumaker@netapp.com>
Subject: RE: NFS v4 + kerberos: 4 minute window of slowness
Date: Thu, 14 May 2020 21:39:09 +0100 [thread overview]
Message-ID: <109301d62a2f$b88f0ae0$29ad20a0$@gmail.com> (raw)
In-Reply-To: <20ce01d62303$70571e80$51055b80$@gmail.com>
+Trond, Chuck, Anna
Ping...
> -----Original Message-----
> From: Robert Milkowski <rmilkowski@gmail.com>
> Sent: 05 May 2020 18:35
> To: linux-nfs@vger.kernel.org
> Subject: NFS v4 + kerberos: 4 minute window of slowness
>
> Hi,
>
> Currently, during the last 4 minutes before a kernel gss context
> expires, all writes to NFSv4 are synchronous and all dirty pages
> associated with the file being written are destaged.
> This continues for the full 4 minutes until the context expires, at
> which point it gets refreshed and everything goes back to normal.
>
> By default, rpc.gssd sets the context timeout to match the Kerberos
> service ticket lifetime, but this can be lowered with the -t option.
> In fact, many sites set it to a lower value, for example 30 minutes.
> This means that every 30 minutes, the last 4 minutes result in severely
> slower writes (assuming they are buffered - no O_DSYNC, etc.).
>
> In the extreme case, when one sets the timeout to 5 minutes, the
> slowness is observed during 4 out of every 5 minutes.
>
>
> I understand the idea behind this mechanism - I guess it tries to avoid
> the situation where a gss context can't be refreshed (due to an error,
> a disabled account, etc.) and expires suddenly, leaving the nfs client
> unable to destage all the buffered writes. The way it is currently
> implemented, though, is rather crude.
> In my opinion, instead of making everything slow for the whole 4
> minutes, it should first try to refresh the context immediately. If
> that succeeds, things go back to normal; if the context can't be
> refreshed, it should continue with the previous one and fall back to
> the current behaviour. I implemented a naïve quick fix which does
> exactly that (attached at the end of this email).
>
>
> How to reproduce:
>
>
> $ uname -r
> 5.7.0-rc4+
>
> $ grep -- -t /etc/sysconfig/nfs
> RPCGSSDARGS="-t 300"
>
> I'm setting it to 5 minutes so I can quickly see the behaviour without
> having to wait for too long.
>
>
> Now, let's generate a small write every 10s to a file on an
> nfsv4,sec=krb5 filesystem and record how long each write takes.
> Since these are buffered writes, they should be very quick most of the
> time.
>
> $ while [ 1 ]; do strace -qq -tT -v -e trace=write /bin/echo aa >f1; rm
> f1; sleep 10; done
>
> 15:22:41 write(1, "aa\n", 3) = 3 <0.000108>
> 15:22:51 write(1, "aa\n", 3) = 3 <0.000113>
> 15:23:01 write(1, "aa\n", 3) = 3 <0.000111>
> 15:23:11 write(1, "aa\n", 3) = 3 <0.000112>
> 15:23:21 write(1, "aa\n", 3) = 3 <0.001510>  <<<<<< becomes slow
> 15:23:31 write(1, "aa\n", 3) = 3 <0.001622>
> 15:23:41 write(1, "aa\n", 3) = 3 <0.001553>
> 15:23:51 write(1, "aa\n", 3) = 3 <0.001495>
> ...
> 15:27:01 write(1, "aa\n", 3) = 3 <0.001528>
> 15:27:12 write(1, "aa\n", 3) = 3 <0.001553>
> 15:27:22 write(1, "aa\n", 3) = 3 <0.000104>  <<<<<< becomes fast again
> 15:27:32 write(1, "aa\n", 3) = 3 <0.000125>
> 15:27:42 write(1, "aa\n", 3) = 3 <0.000129>
> 15:27:52 write(1, "aa\n", 3) = 3 <0.000113>
> 15:28:02 write(1, "aa\n", 3) = 3 <0.000112>
> 15:28:12 write(1, "aa\n", 3) = 3 <0.000112>
> 15:28:22 write(1, "aa\n", 3) = 3 <0.001510> <<<<<< slow
> ...
> 15:32:02 write(1, "aa\n", 3) = 3 <0.001501>
> 15:32:12 write(1, "aa\n", 3) = 3 <0.001440>
> 15:32:22 write(1, "aa\n", 3) = 3 <0.000136> <<<<<< fast
> 15:32:32 write(1, "aa\n", 3) = 3 <0.000109>
> 15:32:42 write(1, "aa\n", 3) = 3 <0.000110>
> 15:32:52 write(1, "aa\n", 3) = 3 <0.000112>
> 15:33:02 write(1, "aa\n", 3) = 3 <0.000103>
> 15:33:12 write(1, "aa\n", 3) = 3 <0.000112>
> 15:33:22 write(1, "aa\n", 3) = 3 <0.001405> <<<<<< slow
> 15:33:32 write(1, "aa\n", 3) = 3 <0.001393>
> 15:33:42 write(1, "aa\n", 3) = 3 <0.001479>
> ...
>
>
>
> So we have 4-minute-long windows of slowness followed by 1-minute
> windows when writes are fast.
>
> 15:23:21 - 15:27:22 slow
> 15:27:22 - 15:28:22 fast
> 15:28:22 - 15:32:22 slow
> 15:32:22 - 15:33:22 fast
>
>
>
> After some tracing with systemtap and looking at the source code, I
> found where the issue comes from.
> The nfs_file_write() function ends up calling nfs_ctx_key_to_expire()
> on each write, which in turn calls gss_key_timeout(), which has a
> hard-coded value of 240s (GSS_KEY_EXPIRE_TIMEO/gss_key_expire_timeo).
>
>
> nfs_file_write()
> ...
>         result = nfs_key_timeout_notify(file, inode);
>         if (result)
>                 return result;
> ...
>         if (nfs_need_check_write(file, inode)) {
>                 int err = nfs_wb_all(inode);
> ...
>
>
> /*
>  * Avoid buffered writes when an open context credential's key would
>  * expire soon.
>  *
>  * Returns -EACCES if the key will expire within RPC_KEY_EXPIRE_FAIL.
>  *
>  * Return 0 and set a credential flag which triggers the inode to flush
>  * and performs NFS_FILE_SYNC writes if the key will expire within
>  * RPC_KEY_EXPIRE_TIMEO.
>  */
> int
> nfs_key_timeout_notify(struct file *filp, struct inode *inode)
> {
>         struct nfs_open_context *ctx = nfs_file_open_context(filp);
>
>         if (nfs_ctx_key_to_expire(ctx, inode) &&
>             !ctx->ll_cred)
>                 /* Already expired! */
>                 return -EACCES;
>         return 0;
> }
>
>
> nfs_need_check_write()
> ...
>         if (nfs_ctx_key_to_expire(ctx, inode))
>                 return 1;
>         return 0;
>
>
>
> nfs_write_end()
> ...
>         if (nfs_ctx_key_to_expire(ctx, mapping->host)) {
>                 status = nfs_wb_all(mapping->host);
> ...
>
>
>
> /*
>  * Test if the open context credential key is marked to expire soon.
>  */
> bool nfs_ctx_key_to_expire(struct nfs_open_context *ctx,
>                            struct inode *inode)
> {
>         struct rpc_auth *auth = NFS_SERVER(inode)->client->cl_auth;
>         struct rpc_cred *cred = ctx->ll_cred;
>         struct auth_cred acred = {
>                 .cred = ctx->cred,
>         };
>
>         if (cred && !cred->cr_ops->crmatch(&acred, cred, 0)) {
>                 put_rpccred(cred);
>                 ctx->ll_cred = NULL;
>                 cred = NULL;
>         }
>         if (!cred)
>                 cred = auth->au_ops->lookup_cred(auth, &acred, 0);
>         if (!cred || IS_ERR(cred))
>                 return true;
>         ctx->ll_cred = cred;
>         return !!(cred->cr_ops->crkey_timeout &&
>                   cred->cr_ops->crkey_timeout(cred));
> }
>
>
>
> net/sunrpc/auth_gss/auth_gss.c:
>         .crkey_timeout = gss_key_timeout,
>
>
> /*
>  * Returns -EACCES if GSS context is NULL or will expire within the
>  * timeout (seconds)
>  */
> static int
> gss_key_timeout(struct rpc_cred *rc)
> {
>         struct gss_cred *gss_cred = container_of(rc, struct gss_cred,
>                                                  gc_base);
>         struct gss_cl_ctx *ctx;
>         unsigned long timeout = jiffies + (gss_key_expire_timeo * HZ);
>         int ret = 0;
>
>         rcu_read_lock();
>         ctx = rcu_dereference(gss_cred->gc_ctx);
>         if (!ctx || time_after(timeout, ctx->gc_expiry))
>                 ret = -EACCES;
>         rcu_read_unlock();
>
>         return ret;
> }
>
>
>
>
> #define GSS_KEY_EXPIRE_TIMEO 240
> static unsigned int gss_key_expire_timeo = GSS_KEY_EXPIRE_TIMEO;
>
>
>
>
>
> A naïve attempt at a fix:
>
>
> diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
> index 25fbd8d9de74..864661bdfdf3 100644
> --- a/net/sunrpc/auth_gss/auth_gss.c
> +++ b/net/sunrpc/auth_gss/auth_gss.c
> @@ -1477,6 +1477,8 @@ gss_key_timeout(struct rpc_cred *rc)
>
>          rcu_read_lock();
>          ctx = rcu_dereference(gss_cred->gc_ctx);
> +        if (ctx && time_after(timeout + (60 * HZ), ctx->gc_expiry))
> +                clear_bit(RPCAUTH_CRED_UPTODATE, &rc->cr_flags);
>          if (!ctx || time_after(timeout, ctx->gc_expiry))
>                  ret = -EACCES;
>          rcu_read_unlock();
>
>
>
>
> With the above patch, if there is a write within 300s before a context
> is due to expire (use RPCGSSDARGS="-t 400" or any value larger than 300
> to test), the client will now try to refresh the context, and if that
> succeeds writes stay fast (assuming the -t option is >300s and/or the
> krb ticket is valid for more than 300s).
>
> What I haven't tested, nor analysed in the code, is what happens if the
> refresh now fails - but after a quick glance at gss_refresh() it does
> seem it would continue using the previous cred...
>
> Is this the correct approach to fix the issue, or can you suggest some
> other approach?
>
>
>
> --
> Robert Milkowski
Thread overview: 4+ messages
2020-05-05 17:34 NFS v4 + kerberos: 4 minute window of slowness Robert Milkowski
2020-05-14 20:39 ` Robert Milkowski [this message]
2020-05-20 10:48 ` Robert Milkowski
2021-10-22 18:22 ` Dan Mezhiborsky