From: Greg Banks <gnb@sgi.com>
To: Neil Brown <neilb@suse.de>
Cc: Linux NFS Mailing List <nfs@lists.sourceforge.net>,
Linux Filesystem Mailing List <linux-fsdevel@vger.kernel.org>
Subject: [PATCH,RESEND] make knfsd interact cleanly with HSMs
Date: Fri, 5 May 2006 21:52:54 +1000
Message-ID: <20060505115254.GA13916@sgi.com>
G'day,
The NFSv3 protocol specifies an error, NFS3ERR_JUKEBOX, which a server
should return when an I/O operation will take a very long time.
This causes a different pattern of retries in clients, and avoids
a number of serious problems associated with I/Os which take longer
than an RPC timeout. The Linux knfsd server has code to generate the
jukebox error and many NFS clients are known to have working code to
handle it.
One scenario in which a server should emit the JUKEBOX error is when
file data which the client is attempting to access is managed by
an HSM (Hierarchical Storage Manager) and is not present on the disk
and needs to be brought in from tape. Due to the nature of tapes this
operation can take minutes rather than the milliseconds normally seen
for local file data.
Currently the Linux knfsd handles this situation poorly. A READ NFS
call will cause the nfsd thread handling it to block until the file
is available, without sending a reply to the NFS client. After a
few seconds the client retries, and this second READ call causes
another nfsd to block behind the first one. A few seconds later and
the client's retries have blocked *all* the nfsd threads, and all NFS
service from the server stops until the original file arrives on disk.
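To make the arithmetic of that lockup concrete, here is a toy model rather
than anything taken from knfsd. It assumes a fixed client retransmit interval
(real clients typically back off) and that every retransmitted READ blocks one
more nfsd thread; the function and parameter names are purely illustrative.

```c
#include <assert.h>

/*
 * Toy model only: number of nfsd threads blocked `elapsed` seconds after the
 * first READ of an offline file. Assumes the client retransmits every
 * `retry_interval` seconds and each retransmission ties up one more thread
 * until the HSM recall finishes after `recall_time` seconds.
 */
static int blocked_threads(int nthreads, int retry_interval,
                           int recall_time, int elapsed)
{
    int sent;

    assert(nthreads > 0 && retry_interval > 0);
    assert(recall_time >= 0 && elapsed >= 0);

    if (elapsed >= recall_time)
        return 0;                        /* file is on disk, calls complete */

    sent = 1 + elapsed / retry_interval; /* original call plus retries */
    return sent < nthreads ? sent : nthreads;
}
```

With 8 nfsd threads, a 5 second retransmit interval and a 300 second tape
recall, this model has every thread blocked 35 seconds in, and it stays that
way for the remaining minutes of the recall.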
WRITEs and SETATTRs which truncate the file are marginally better, in
that the knfsd dupcache will catch the retries and drop them without
blocking an nfsd (the dupcache *will* catch the retries because the
cache entry remains in RC_INPROG state and is not reused until the
first call finishes). However, the first call still blocks, so given
WRITEs to enough offline files the server can still be locked up.
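The dupcache behaviour described above can be sketched as follows. This is a
simplified stand-in, not the knfsd code: the table is keyed on XID alone, and
every toy_ name is hypothetical. It only illustrates why a retry that arrives
while the first call is still in progress is dropped instead of being
executed again.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the reply cache states and verdicts. */
enum toy_state   { TOY_UNUSED, TOY_INPROG, TOY_DONE };
enum toy_verdict { TOY_DOIT, TOY_DROPIT, TOY_REPLY };

struct toy_entry {
    uint32_t xid;
    enum toy_state state;
};

#define TOY_SLOTS 4
static struct toy_entry toy_cache[TOY_SLOTS];

/* Decide what to do with an incoming non-idempotent call. */
static enum toy_verdict toy_lookup(uint32_t xid)
{
    int i, free_slot = -1;

    for (i = 0; i < TOY_SLOTS; i++) {
        if (toy_cache[i].state != TOY_UNUSED && toy_cache[i].xid == xid) {
            /* A retry: drop it if the first call is still running. */
            return toy_cache[i].state == TOY_INPROG ? TOY_DROPIT : TOY_REPLY;
        }
        if (toy_cache[i].state == TOY_UNUSED && free_slot < 0)
            free_slot = i;
    }

    /* First sighting: remember it as in progress, if there is room. */
    if (free_slot >= 0) {
        toy_cache[free_slot].xid = xid;
        toy_cache[free_slot].state = TOY_INPROG;
    }
    return TOY_DOIT;
}

/* Mark the call finished so later retries get the cached reply. */
static void toy_finish(uint32_t xid)
{
    int i;

    for (i = 0; i < TOY_SLOTS; i++)
        if (toy_cache[i].state == TOY_INPROG && toy_cache[i].xid == xid)
            toy_cache[i].state = TOY_DONE;
}
```

In this sketch an in-progress entry is never recycled, which mirrors the point
made above: the retry is caught for as long as the first call is blocked, but
the thread running that first call is still stuck.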
There are also client-side implications, depending on the client
implementation. For example, on a Linux client an RPC retry loop uses
an RPC request slot, so reads from enough separate offline files can
lock up a mountpoint.
This patch seeks to remedy the interaction between knfsd and HSMs by
providing mechanisms to allow knfsd to tell an underlying filesystem
(which supports HSMs) not to block for reads, writes and truncates
of offline files. It's a port of a Linux 2.4 patch used in SGI's
ProPack distro since 2004 and in SLES9 since SP2. The patch:
* provides a new ATTR_NO_BLOCK flag which the kernel can
use to tell a filesystem's inode_ops->setattr() operation not
to block when truncating an offline file. XFS already obeys
this flag (inside an #ifdef).
* changes knfsd to provide ATTR_NO_BLOCK when it does the VFS
calls to implement the SETATTR NFS call.
* changes knfsd to supply the O_NONBLOCK flag in the temporary
struct file it uses for VFS reads and writes, in order to ask
the filesystem not to block when reading or writing an offline
file. XFS already obeys this new semantic for O_NONBLOCK
(and in SLES9 so does JFS).
* adds code to translate the -EAGAIN the filesystem returns when
it would have blocked, to the -ETIMEDOUT that knfsd expects.
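As a rough userspace sketch of how these pieces fit together for a truncating
SETATTR: ATTR_NO_BLOCK and its value come from the patch, and the numeric NFS
status values are those in RFC 1813. Everything prefixed sketch_ or SKETCH_ is
hypothetical and stands in for the filesystem and for knfsd's errno mapping;
in particular the real mapping table handles many more codes than this one.

```c
#include <assert.h>
#include <errno.h>

/* Value from the include/linux/fs.h hunk of the patch. */
#define ATTR_NO_BLOCK 32768

/* NFSv3 status values per RFC 1813; the SKETCH_ names are illustrative. */
#define SKETCH_NFS3_OK          0
#define SKETCH_NFS3ERR_IO       5
#define SKETCH_NFS3ERR_JUKEBOX  10008

/* Hypothetical filesystem side: refuse to wait for a recall if asked. */
static int sketch_fs_truncate(int file_is_offline, unsigned int ia_valid)
{
    if (file_is_offline && (ia_valid & ATTR_NO_BLOCK))
        return -EAGAIN;
    return 0;   /* online, or the caller accepted blocking for the recall */
}

/* The translation the patch adds: "would block" becomes "timed out". */
static int sketch_hsm_errno(int err)
{
    return err == -EAGAIN ? -ETIMEDOUT : err;
}

/* Toy errno-to-NFS mapping; unknown codes collapse to a generic I/O error. */
static int sketch_nfserrno(int err)
{
    switch (err) {
    case 0:
        return SKETCH_NFS3_OK;
    case -ETIMEDOUT:
        return SKETCH_NFS3ERR_JUKEBOX;
    default:
        return SKETCH_NFS3ERR_IO;
    }
}

/* knfsd side: only NFSv3 and later can carry the jukebox error. */
static int sketch_setattr(int nfs_vers, int file_is_offline)
{
    unsigned int ia_valid = 0;
    int err;

    assert(nfs_vers >= 2);
    if (nfs_vers >= 3)
        ia_valid |= ATTR_NO_BLOCK;

    err = sketch_fs_truncate(file_is_offline, ia_valid);
    return sketch_nfserrno(sketch_hsm_errno(err));
}
```

Here a v3 truncate of an offline file yields the jukebox status straight away,
while a v2 call never sets the flag and so falls back to the old blocking
behaviour (modelled as success once the recall completes).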
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
(SLES9 patch Acked-by: okir@suse.de)
---
This is a resend of
http://marc.theaimsgroup.com/?l=linux-nfs&m=111087383132762&w=2
fs/nfsd/vfs.c | 33 +++++++++++++++++++++++++++++++--
include/linux/fs.h | 1 +
2 files changed, 32 insertions(+), 2 deletions(-)
Index: linux/fs/nfsd/vfs.c
===================================================================
--- linux.orig/fs/nfsd/vfs.c 2006-05-05 19:49:38.434101243 +1000
+++ linux/fs/nfsd/vfs.c 2006-05-05 21:00:22.568841897 +1000
@@ -327,6 +327,16 @@ nfsd_setattr(struct svc_rqst *rqstp, str
goto out_nfserr;
}
DQUOT_INIT(inode);
+
+
+ /*
+ * Tell a Hierarchical Storage Manager (e.g. via DMAPI) to
+ * return EAGAIN when an action would take minutes instead of
+ * milliseconds so that NFS can reply to the client with
+ * NFSERR_JUKEBOX instead of blocking an nfsd thread.
+ */
+ if (rqstp->rq_vers >= 3)
+ iap->ia_valid |= ATTR_NO_BLOCK;
}
imode = inode->i_mode;
@@ -349,6 +359,9 @@ nfsd_setattr(struct svc_rqst *rqstp, str
if (!check_guard || guardtime == inode->i_ctime.tv_sec) {
fh_lock(fhp);
err = notify_change(dentry, iap);
+ /* to get NFSERR_JUKEBOX on the wire, need -ETIMEDOUT */
+ if (err == -EAGAIN)
+ err = -ETIMEDOUT;
err = nfserrno(err);
fh_unlock(fhp);
}
@@ -834,6 +847,10 @@ nfsd_vfs_read(struct svc_rqst *rqstp, st
if (ra && ra->p_set)
file->f_ra = ra->p_ra;
+ /* Support HSMs -- see comment in nfsd_setattr() */
+ if (rqstp->rq_vers >= 3)
+ file->f_flags |= O_NONBLOCK;
+
if (file->f_op->sendfile) {
svc_pushback_unused_pages(rqstp);
err = file->f_op->sendfile(file, &offset, *count,
@@ -859,8 +876,12 @@ nfsd_vfs_read(struct svc_rqst *rqstp, st
*count = err;
err = 0;
fsnotify_access(file->f_dentry);
- } else
+ } else {
+ /* to get NFSERR_JUKEBOX on the wire, need -ETIMEDOUT */
+ if (err == -EAGAIN)
+ err = -ETIMEDOUT;
err = nfserrno(err);
+ }
out:
return err;
}
@@ -918,6 +939,10 @@ nfsd_vfs_write(struct svc_rqst *rqstp, s
if (stable && !EX_WGATHER(exp))
file->f_flags |= O_SYNC;
+ /* Support HSMs -- see comment in nfsd_setattr() */
+ if (rqstp->rq_vers >= 3)
+ file->f_flags |= O_NONBLOCK;
+
/* Write the data. */
oldfs = get_fs(); set_fs(KERNEL_DS);
err = vfs_writev(file, (struct iovec __user *)vec, vlen, &offset);
@@ -970,8 +995,12 @@ nfsd_vfs_write(struct svc_rqst *rqstp, s
dprintk("nfsd: write complete err=%d\n", err);
if (err >= 0)
err = 0;
- else
+ else {
+ /* to get NFSERR_JUKEBOX on the wire, need -ETIMEDOUT */
+ if (err == -EAGAIN)
+ err = -ETIMEDOUT;
err = nfserrno(err);
+ }
out:
return err;
}
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h 2006-05-05 19:49:40.587127963 +1000
+++ linux/include/linux/fs.h 2006-05-05 21:00:22.569818326 +1000
@@ -273,6 +273,7 @@ typedef void (dio_iodone_t)(struct kiocb
#define ATTR_KILL_SUID 2048
#define ATTR_KILL_SGID 4096
#define ATTR_FILE 8192
+#define ATTR_NO_BLOCK 32768 /* Return EAGAIN and don't block on long truncates */
/*
* This is the Inode Attributes structure, used for notify_change(). It
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.