From mboxrd@z Thu Jan 1 00:00:00 1970
From: Daniel Phillips
Subject: [RFC] Change filesystem mount without disconnecting clients
Date: Tue, 21 Nov 2006 19:19:00 -0800
Message-ID: <4563C1A4.5060608@google.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Cc: Robert Nelson
Received: from sc8-sf-mx1-b.sourceforge.net ([10.3.1.91] helo=mail.sourceforge.net)
	by sc8-sf-list2-new.sourceforge.net with esmtp (Exim 4.43)
	id 1Gmidj-0003X8-7Z for nfs@lists.sourceforge.net;
	Tue, 21 Nov 2006 19:19:11 -0800
Received: from smtp-out.google.com ([216.239.45.12])
	by mail.sourceforge.net with esmtps (TLSv1:AES256-SHA:256) (Exim 4.44)
	id 1Gmidk-0000G8-6o for nfs@lists.sourceforge.net;
	Tue, 21 Nov 2006 19:19:12 -0800
Received: from mail2.smo.corp.google.com (mail2.smo.corp.google.com [172.29.48.30])
	by smtp-out.google.com with ESMTP id kAM3J6XS001765
	for ; Tue, 21 Nov 2006 19:19:06 -0800
To: nfs@lists.sourceforge.net
List-Id: "Discussion of NFS under Linux development, interoperability, and testing."
Sender: nfs-bounces@lists.sourceforge.net
Errors-To: nfs-bounces@lists.sourceforge.net

Hi all,

This patch provides an interface to let us quietly change the block
device on which a filesystem is mounted, without disrupting client TCP
connections.

Why would anybody want to do such a strange thing? Answer: remote block
device replication. Each replication cycle results in a new virtual
block device containing a new, consistent state of the filesystem. We
want clients to see the changed filesystem transparently, without
remounting, as if somebody had just gone in and directly operated on
the local filesystem, adding files, deleting files, changing file
contents, renaming and so on.

This should all just work, even if clients have files open and are in
the middle of operating on them. This can cause some file operations to
error out, but will not crash the client or server.
Operations on unchanged files should work as expected, in spite of the
underlying block device having been changed.

Note: to avoid stale file handles we do need to take some care with the
fsid, which is not within the scope of this patch (we just specify a
known fsid in the exports file for the time being).

The interface works as follows:

   write anything to /proc/fs/nfsd/suspend
      -> flush nfsd export cache and suspend nfs transaction processing

   read anything from /proc/fs/nfsd/suspend
      -> resume nfs transaction processing

The suspend is accomplished by taking a write lock on the export cache's
hash_sem, which by fortuitous circumstance encloses all nfs transaction
processing. We then flush the export cache, driving the underlying
filesystem mount count down to one, in which state it can be unmounted.
Holding the hash_sem prevents mountd from reloading the export cache.
To resume, we just release the write lock.

This is used something like:

   echo foo >/proc/fs/nfsd/suspend
   umount /mnt/someexport
   mount /dev/somenewdev /mnt/someexport
   cat /proc/fs/nfsd/suspend

This works pretty well, but does have the deficiency of suspending all
nfsd activity, even for exports on a filesystem we are not touching. So
a finer granularity lock would be nice, but first we are just interested
in correctness.

This interface is not supposed to be a keeper and we are not proposing
this feature for merging by any means. We are interested in opinions on
whether the approach is correct. For example, could the purge fail to
drive the filesystem mount count to one? Is there any way past our
locking to accidentally attempt to reload the export cache while we are
still fiddling with the filesystem? We certainly do not claim to be
competent knfsd hackers at the moment, having looked at the code pretty
much for the first time a week or two ago. We may well have missed
something basic.
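For anyone who wants to script the sequence above rather than type it, here is a minimal userspace sketch in C. It is not part of the patch. The function names (nfsd_suspend, nfsd_resume, swap_under_suspend) are made up for illustration, the control file path is passed in by the caller, and the swap step is whatever shell command the caller supplies, run through system(). The one design point it tries to show is that the resume read is issued even when the swap command fails, so a failed umount or mount does not leave nfsd suspended.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: write any bytes to the control file. Per the
 * description above, a write suspends NFS transaction processing. */
static int nfsd_suspend(const char *ctl_path)
{
	assert(ctl_path != NULL);
	int fd = open(ctl_path, O_WRONLY);
	if (fd < 0)
		return -1;
	ssize_t n = write(fd, "1\n", 2);
	close(fd);
	return n == 2 ? 0 : -1;
}

/* Hypothetical helper: read from the control file. Per the description
 * above, a read resumes NFS transaction processing. */
static int nfsd_resume(const char *ctl_path)
{
	assert(ctl_path != NULL);
	char buf[1];
	int fd = open(ctl_path, O_RDONLY);
	if (fd < 0)
		return -1;
	ssize_t n = read(fd, buf, sizeof(buf));
	close(fd);
	return n < 0 ? -1 : 0;
}

/* Suspend, run the caller's swap command, then resume whether or not
 * the command succeeded. Returns -1 if suspend or resume failed,
 * otherwise the status reported by system() (0 on success). */
static int swap_under_suspend(const char *ctl_path, const char *swap_cmd)
{
	assert(swap_cmd != NULL);
	if (nfsd_suspend(ctl_path) != 0)
		return -1;
	int status = system(swap_cmd);
	if (nfsd_resume(ctl_path) != 0)
		return -1;
	return status;
}
```

With the patch applied, a call might look like swap_under_suspend("/proc/fs/nfsd/suspend", "umount /mnt/someexport && mount /dev/somenewdev /mnt/someexport"), using the same example paths as above.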
The code that goes with this to do remote block device replication will
be released pretty soon as an open source project, most likely in the
next week or two. For today I will just claim that it works well and it
does something that some people may find quite useful: it allows remote
users to access a read-only copy of a filesystem, served from a local
disk that is replicated from a read-write volume some place far away.

Signed-off-by: Robert Nelson
Signed-off-by: Daniel Phillips

diff -urp 2.6.18.3.clean/fs/nfsd/export.c 2.6.18.3/fs/nfsd/export.c
--- 2.6.18.3.clean/fs/nfsd/export.c	2006-11-18 19:28:22.000000000 -0800
+++ 2.6.18.3/fs/nfsd/export.c	2006-11-21 17:03:02.000000000 -0800
@@ -735,7 +735,7 @@ exp_readlock(void)
 	down_read(&hash_sem);
 }
 
-static inline void
+void
 exp_writelock(void)
 {
 	down_write(&hash_sem);
@@ -747,7 +747,7 @@ exp_readunlock(void)
 	up_read(&hash_sem);
 }
 
-static inline void
+void
 exp_writeunlock(void)
 {
 	up_write(&hash_sem);
@@ -1290,6 +1290,17 @@ exp_verify_string(char *cp, int max)
 }
 
 /*
+ * Flush exports table without calling RW semaphore.
+ * The caller is required to lock and unlock the export table.
+ */
+void
+export_purge(void)
+{
+	cache_purge(&svc_expkey_cache);
+	cache_purge(&svc_export_cache);
+}
+
+/*
  * Initialize the exports module.
  */
 void
diff -urp 2.6.18.3.clean/fs/nfsd/nfsctl.c 2.6.18.3/fs/nfsd/nfsctl.c
--- 2.6.18.3.clean/fs/nfsd/nfsctl.c	2006-11-18 19:28:22.000000000 -0800
+++ 2.6.18.3/fs/nfsd/nfsctl.c	2006-11-21 16:44:50.000000000 -0800
@@ -38,7 +38,7 @@ unsigned int nfsd_versbits = ~0;
 
 
 /*
- *	We have a single directory with 9 nodes in it.
+ *	We have a single directory with several nodes in it.
  */
 enum {
 	NFSD_Root = 1,
@@ -53,6 +53,7 @@ enum {
 	NFSD_Fh,
 	NFSD_Threads,
 	NFSD_Versions,
+	NFSD_Suspend,
 	/*
 	 * The below MUST come last.  Otherwise we leave a hole in nfsd_files[]
 	 * with !CONFIG_NFSD_V4 and simple_fill_super() goes oops
@@ -139,6 +140,26 @@ static const struct file_operations tran
 	.release = simple_transaction_release,
 };
 
+static ssize_t nfsctl_suspend_write(struct file *file, const char __user *buf, size_t size, loff_t *pos)
+{
+	printk("Suspending NFS transactions!\n");
+	exp_writelock();
+	export_purge();
+	return size;
+}
+
+static ssize_t nfsctl_suspend_read(struct file *file, char __user *buf, size_t size, loff_t *pos)
+{
+	printk("Resuming NFS transactions!\n");
+	exp_writeunlock();
+	return 0;
+}
+
+static struct file_operations suspend_ops = {
+	.write = nfsctl_suspend_write,
+	.read = nfsctl_suspend_read,
+};
+
 extern struct seq_operations nfs_exports_op;
 static int exports_open(struct inode *inode, struct file *file)
 {
@@ -484,6 +505,7 @@ static int nfsd_fill_super(struct super_
 		[NFSD_Fh] = {"filehandle", &transaction_ops, S_IWUSR|S_IRUSR},
 		[NFSD_Threads] = {"threads", &transaction_ops, S_IWUSR|S_IRUSR},
 		[NFSD_Versions] = {"versions", &transaction_ops, S_IWUSR|S_IRUSR},
+		[NFSD_Suspend] = {"suspend", &suspend_ops, S_IWUSR|S_IRUSR},
 #ifdef CONFIG_NFSD_V4
 		[NFSD_Leasetime] = {"nfsv4leasetime", &transaction_ops, S_IWUSR|S_IRUSR},
 		[NFSD_RecoveryDir] = {"nfsv4recoverydir", &transaction_ops, S_IWUSR|S_IRUSR},
diff -urp 2.6.18.3.clean/include/linux/nfsd/export.h 2.6.18.3/include/linux/nfsd/export.h
--- 2.6.18.3.clean/include/linux/nfsd/export.h	2006-11-18 19:28:22.000000000 -0800
+++ 2.6.18.3/include/linux/nfsd/export.h	2006-11-21 17:01:55.000000000 -0800
@@ -84,6 +84,9 @@ struct svc_expkey {
 void nfsd_export_init(void);
 void nfsd_export_shutdown(void);
 void nfsd_export_flush(void);
+void export_purge(void);
+void exp_writelock(void);
+void exp_writeunlock(void);
 void exp_readlock(void);
 void exp_readunlock(void);
 struct svc_export * exp_get_by_name(struct auth_domain *clp,