* RE: [PATCH/RFC] Lustre VFS patch, version 2
From: Peter J. Braam @ 2004-06-02 23:15 UTC
To: linux-kernel
Cc: hch, axboe, lmb, kevcorry, arjanv, iro, trond.myklebust, anton,
lustre-devel
[-- Attachment #1: Type: text/plain, Size: 7683 bytes --]
Hello!
The feedback on the Lustre patches was of very high quality; thanks a
lot for studying them carefully. Things are simpler now.
Oleg Drokin and I discussed the emails extensively and here is our reply.
We have attached another collection of patches, addressing many of the
concerns.
We felt it was perhaps easier to keep this all in one long email.
People requested to see the code that uses the patch. We have uploaded that
to:
ftp://ftp.clusterfs.com:/pub/lustre/lkml/lustre-client_and_mds.tgz
The client file system is the primary user of the kernel patch, in the
llite directory. The MDS server is a sample user of do_kern_mount. As
requested I have removed many other things from the tar ball to make
review simple (so this won't compile or run).
1. Export symbols concerns by Christoph Hellwig:
Indeed we can do without __iget, kernel_text_address, reparent_to_init
and exit_files.
We actually need do_kern_mount and truncate_complete_page.
do_kern_mount is used because we use a file system namespace in servers
in the kernel without exporting it to user space (mds/handler.c). The
server file systems are ext3 file systems but we replace VFS locking
with DLM locks, and it would take considerable work to export that as a
file system.
truncate_complete_page is used to remove pages in the middle of a file
mapping, when lock revocations happen (llite/file.c
ll_extent_lock_callback, calling ll_pgcache_remove_extent).
2. lustre_version.patch concerns by Christoph Hellwig:
This one can easily be removed, but the kernel version alone does not
necessarily represent anything useful. Many people patch their
kernels, even applying parts of newer kernels, while leaving the
kernel version at its old value (distributions immediately come to
mind). So we still need something to identify the version of the
necessary bits, e.g. the version of the intent API.
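As an illustration only (not taken from the Lustre tree), a consumer of
lustre_version.h could gate itself on the patch level roughly as
follows. LUSTRE_KERNEL_VERSION and its value 36 come from the attached
lustre_version.patch; MOCK_REQUIRED_PATCH_VERSION and
mock_patch_version_ok() are hypothetical names invented for this sketch.

```c
/* Sketch only: a compile-time and run-time check against the patch level. */
#include <assert.h>

/* In a patched tree this would come from <linux/lustre_version.h>. */
#ifndef LUSTRE_KERNEL_VERSION
#define LUSTRE_KERNEL_VERSION 36
#endif

/* Hypothetical: the lowest patch level this module was written against. */
#define MOCK_REQUIRED_PATCH_VERSION 35

#if LUSTRE_KERNEL_VERSION < MOCK_REQUIRED_PATCH_VERSION
#error "kernel patch level is too old for this module"
#endif

/* Run-time form of the same comparison; returns 1 if 'have' is new enough. */
static int mock_patch_version_ok(int have, int need)
{
    return have >= need;
}
```

The point of the sketch is that the check keys on the patch level, not
on the kernel release string, which may not reflect backported changes.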
3. Introduction of lock-less version of d_rehash (__d_rehash) by
Christoph Hellwig:
In some places Lustre needs to do several things to dentries with the
dcache lock already held, e.g. traverse the alias dentries of an inode
to find one with the same name and parent as the one we already have. Lustre
can invalidate busy dentries, which we put on a list. If these are
looked up again, concurrently, we find them on this list and re-use
them, to avoid having several identical aliases in an inode. See
llite/{dcache.c,namei.c} ll_revalidate and the lock callback function
ll_mdc_blocking_ast which calls ll_unhash_aliases. We use d_move to
manipulate dentries associated with raw inodes and names in ext3.
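The locked/lock-less split discussed in this item can be sketched in
user-space C. This is an illustration under stated assumptions, not
kernel code: every mock_* name is hypothetical, and the "lock" is a
plain flag that asserts on re-entry to mimic the fact that a spinlock
such as dcache_lock cannot be taken twice by the same caller.

```c
/* User-space illustration only; all mock_* names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_dentry {
    const char *name;
    bool hashed;
};

/* Stand-in for dcache_lock: non-recursive, so re-taking it is a bug. */
static int mock_dcache_lock_held;

static void mock_dcache_lock(void)
{
    assert(!mock_dcache_lock_held);
    mock_dcache_lock_held = 1;
}

static void mock_dcache_unlock(void)
{
    assert(mock_dcache_lock_held);
    mock_dcache_lock_held = 0;
}

/* Lock-less variant (the role of __d_rehash): caller holds the lock. */
static void mock_d_rehash_locked(struct mock_dentry *d)
{
    assert(mock_dcache_lock_held);
    d->hashed = true;
}

/* Locking wrapper (the role of d_rehash): takes the lock itself. */
static void mock_d_rehash(struct mock_dentry *d)
{
    mock_dcache_lock();
    mock_d_rehash_locked(d);
    mock_dcache_unlock();
}

/*
 * Walks a set of aliases under a single lock acquisition and rehashes
 * the unhashed ones. Calling mock_d_rehash() here instead would try to
 * take the lock a second time. Returns the number of entries rehashed.
 */
static size_t mock_rehash_aliases(struct mock_dentry *aliases, size_t n)
{
    size_t rehashed = 0;

    mock_dcache_lock();
    for (size_t i = 0; i < n; i++) {
        if (!aliases[i].hashed) {
            mock_d_rehash_locked(&aliases[i]);
            rehashed++;
        }
    }
    mock_dcache_unlock();
    return rehashed;
}
```

A caller that only touches one dentry keeps using the locking wrapper;
only code that must do several things atomically needs the lock-less
variant.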
4. Concern about VFS intent API changes to the kernel exported API, by
Christoph Hellwig:
With a slight modification it is possible to reduce the changes to just
changes in the name of the intent structure itself and some of its
fields.
This renaming was requested by Linus, but we can change the names back
easily if needed; that would avoid any API change. If there are other
users, please let us know what to do.
All the functions can easily be split into ones that expect a valid
intent (with a suffix in the name like _it) and those that are part of
the old API, which would just initialise the intent to something
sensible and then call the corresponding intent-expecting function. No
harm should be done to external filesystems this way. We have modified
the vfs intent API patch to achieve this.
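The wrapper scheme described in this item can be sketched in user-space
C. This is a simplified illustration, not the attached patch:
lookup_intent, intent_init, INTENT_MAGIC and the IT_* opcodes follow
the attached vfs-intent_api patch, while mock_nameidata,
mock_path_lookup(), mock_path_lookup_it() and the MOCK_LOOKUP_* flag
values are hypothetical stand-ins.

```c
/* User-space sketch of the "_it" wrapper pattern; not kernel code. */
#include <assert.h>
#include <string.h>

#define INTENT_MAGIC 0x19620323
#define IT_OPEN      (1)
#define IT_CREAT     (1 << 1)
#define IT_LOOKUP    (1 << 4)

/* Hypothetical stand-ins for LOOKUP_OPEN / LOOKUP_CREATE. */
#define MOCK_LOOKUP_OPEN   0x0100
#define MOCK_LOOKUP_CREATE 0x0200

struct lookup_intent {
    int it_magic;
    int it_op;
    int it_flags;
    int it_create_mode;
};

/* Hypothetical, reduced version of struct nameidata. */
struct mock_nameidata {
    unsigned int flags;
    struct lookup_intent intent;
};

static void intent_init(struct lookup_intent *it, int op)
{
    memset(it, 0, sizeof(*it));
    it->it_magic = INTENT_MAGIC;
    it->it_op = op;
}

/* Derive a sensible default intent from the lookup flags. */
static int it_mode_from_lookup_flags(unsigned int flags)
{
    int mode = IT_LOOKUP;

    if (flags & MOCK_LOOKUP_OPEN)
        mode = IT_OPEN;
    if (flags & MOCK_LOOKUP_CREATE)
        mode |= IT_CREAT;
    return mode;
}

/* Intent-expecting variant: the caller has already filled nd->intent. */
static int mock_path_lookup_it(const char *name, unsigned int flags,
                               struct mock_nameidata *nd)
{
    (void)name; /* the real function would walk the path here */
    assert(nd->intent.it_magic == INTENT_MAGIC);
    nd->flags = flags;
    return 0;
}

/* Old-API entry point: initialise the intent, then delegate. */
static int mock_path_lookup(const char *name, unsigned int flags,
                            struct mock_nameidata *nd)
{
    intent_init(&nd->intent, it_mode_from_lookup_flags(flags));
    return mock_path_lookup_it(name, flags, nd);
}
```

Existing callers keep calling the old name and never see the intent;
only intent-aware callers need the _it variant.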
5. Some objections from Trond Myklebust about open flags in exec, cwd
revalidation, and revalidate_counter patch:
We have fixed the exec open flags issue (our error). The
revalidate_counter patch was also dropped since we can do this inside
Lustre as well. CWD revalidation can be converted to FS_REVAL_DOT in
fs flags instead, but we still need part of that patch, the
LOOKUP_LAST/LOOKUP_NOT_LAST part. Lustre needs to know when we have
reached the last component in the path, since that is when the intent
needs to be looked at. (It seems we cannot use LOOKUP_CONTINUE for this
reliably.)
6. from Trond Myklebust:
> The vfs-intent_lustre-vanilla-2.6.patch + the "intent_release()"
> code. What if you end up crossing a mountpoint? How do you then know
> to which superblock/filesystem the private field belongs if there are
> more than one user of this mechanism?
Basically an intent only makes sense for the last component. Our code
checks that, and if we are looking up a component before the last,
then a dummy IT_LOOKUP intent is created on the stack and we work with
that. Perhaps the same is true for other filesystems that would like
to use this mechanism.
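The rule above (caller's intent for the last component, a dummy
IT_LOOKUP intent for every earlier one) can be sketched in user-space
C. This is an illustration only: mock_walk(), mock_fs_lookup() and
struct mock_trace are hypothetical, and the path splitting is far
simpler than link_path_walk().

```c
/* User-space sketch; mock_* names are hypothetical, not kernel APIs. */
#include <assert.h>
#include <string.h>

#define INTENT_MAGIC 0x19620323
#define IT_OPEN      (1)
#define IT_LOOKUP    (1 << 4)

struct lookup_intent {
    int it_magic;
    int it_op;
};

static void intent_init(struct lookup_intent *it, int op)
{
    memset(it, 0, sizeof(*it));
    it->it_magic = INTENT_MAGIC;
    it->it_op = op;
}

#define MOCK_MAX_COMPONENTS 16

/* Records which intent opcode the filesystem saw for each component. */
struct mock_trace {
    int ops[MOCK_MAX_COMPONENTS];
    int count;
};

/* Stand-in for a filesystem lookup method that inspects the intent. */
static void mock_fs_lookup(const char *name, size_t len,
                           const struct lookup_intent *it,
                           struct mock_trace *t)
{
    (void)name;
    (void)len;
    assert(it->it_magic == INTENT_MAGIC);
    if (t->count < MOCK_MAX_COMPONENTS)
        t->ops[t->count++] = it->it_op;
}

/* Walk a '/'-separated path, passing the real intent only at the end. */
static void mock_walk(const char *path, const struct lookup_intent *it,
                      struct mock_trace *t)
{
    const char *p = path;

    t->count = 0;
    while (*p) {
        while (*p == '/')
            p++;
        if (!*p)
            break;

        size_t len = strcspn(p, "/");
        const char *next = p + len;

        while (*next == '/')
            next++;

        if (*next) {
            /* Not the last component: use a dummy intent on the stack. */
            struct lookup_intent dummy;

            intent_init(&dummy, IT_LOOKUP);
            mock_fs_lookup(p, len, &dummy, t);
        } else {
            /* Last component: this is where the caller's intent applies. */
            mock_fs_lookup(p, len, it, t);
        }
        p = next;
    }
}
```

Because the dummy intent lives on the stack of the walk, no private
state from the caller's intent is ever handed to a filesystem that only
serves an intermediate component.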
7. raw operations concerns by various people:
We have now implemented an alternative approach to this, which takes
place when the parent lookup is done, using intents. For setattr
we managed to remove the raw operations altogether (praying that we
haven't forgotten some awful problem we solved that led to the
introduction of setattr_raw in the first place).
The correctly filled intent is recognised by the filesystem's lookup or
revalidate method. After the parent is looked up, based on the intent
the correct "raw" server call is executed, within the file
system. Then a special flag is set in the intent. The caller of the
parent lookup checks for the flag and, if it is set, the function
returns immediately with the exit code supplied in the intent, without
instantiating child dentries.
This needs some minor changes to VFS, though. There are at
least two approaches.
One is to not introduce any new methods and just rely on the fs'
methods to do everything. For this to work the filesystem needs to know
the remaining path to be traversed (we can fill nd->last with the
remaining path before calling into the fs). In the root directory of the
mount, we need to call a revalidate (if supported by the fs) on the
mountpoint to intercept the intent, after we have crossed the
mountpoint. We have this approach implemented in the attached patch.
Does it look better than the raw operations?
Much simpler for us is to add an additional inode operation, a
"process_intent" method, that would be called when a LOOKUP_PARENT sort
of lookup was requested and we are about to leave link_path_walk()
with the nameidata structure filled and everything ready. Then the same
flag in the intent will be set and everything else is as in the previous
approach.
We believe both methods are less intrusive than the raw methods, but
slightly more delicate.
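The control flow common to both approaches (the filesystem executes the
operation during the parent lookup, marks the intent, and the caller
returns early) can be sketched in user-space C. Everything here is a
hypothetical illustration: MOCK_IT_DONE, the it_done_flags and
it_status fields, and the mock_* functions do not appear in the
attached patches, which may name and structure these differently.

```c
/* User-space sketch of the intent short-circuit; names are hypothetical. */
#include <assert.h>
#include <stddef.h>

#define IT_LOOKUP (1 << 4)
#define IT_UNLINK (1 << 5)

/* Hypothetical flag meaning "the fs already executed the operation". */
#define MOCK_IT_DONE 0x1

struct mock_intent {
    int it_op;
    int it_done_flags; /* hypothetical field */
    int it_status;     /* hypothetical field: exit code from the fs */
};

/* Counts how often the generic path had to set up a child dentry. */
static int mock_children_instantiated;

/*
 * Stand-in for the filesystem's parent lookup/revalidate. If the intent
 * describes an operation the fs wants to run itself, it performs the
 * "raw" server call (simulated by server_rc) and marks the intent.
 */
static void mock_fs_parent_lookup(struct mock_intent *it, int server_rc)
{
    assert(it != NULL);
    if (it->it_op == IT_UNLINK) {
        it->it_status = server_rc;
        it->it_done_flags |= MOCK_IT_DONE;
    }
}

/* Stand-in for the VFS caller of the parent lookup. */
static int mock_vfs_op(struct mock_intent *it, int server_rc)
{
    mock_fs_parent_lookup(it, server_rc);

    /* The fs handled it: return its exit code, skip the child dentry. */
    if (it->it_done_flags & MOCK_IT_DONE)
        return it->it_status;

    /* Otherwise fall through to the ordinary path. */
    mock_children_instantiated++;
    return 0;
}
```

In this sketch a filesystem that never sets the flag sees no change in
behaviour, which is the property that makes the scheme less intrusive
than separate raw methods.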
8. Mountpoint-crossing issues during rename (and link) noticed by
Arjan van de Ven:
Well, indeed this can happen if the source or destination is a
mountpoint on the client but not on the server. This needs to be
addressed by the individual filesystems that choose to implement those
raw methods.
9. dev_readonly patch concerns by Jens Axboe:
We already clarified why we need it in this exact way. But there were
some valid suggestions to use other means like dm-flakey device mapper
module, so we decided to write a failure simulator DM.
10. "Have these patches undergone any significant test?" by Anton Blanchard:
There are two important questions I think:
- Do the patches cause damage?
Probably not anymore. SUSE has done testing and it appears the
original patch I attached didn't break things (after one fix was
made).
- Is Lustre stable?
On 2.4 Lustre is quite stable. On 2.6 we have done testing but,
for example, never on more than 40 nodes. We don't consider it
rock solid on 2.6, although it does pass POSIX and just about every
other benchmark without failures.
Since the patches were modified for this discussion there are of
course some new issues which Oleg Drokin is now ironing out.
Our test results are visible at https://buffalo.lustre.org
Well, how close are we now to this being acceptable?
- Peter J. Braam & Oleg Drokin -
[-- Attachment #2: export-vanilla-2.6.patch --]
[-- Type: application/octet-stream, Size: 3228 bytes --]
fs/jbd/journal.c | 1 +
fs/super.c | 2 ++
include/linux/fs.h | 1 +
include/linux/mm.h | 3 +++
mm/truncate.c | 4 +++-
5 files changed, 10 insertions(+), 1 deletion(-)
Index: linux-2.6.6/fs/jbd/journal.c
===================================================================
--- linux-2.6.6.orig/fs/jbd/journal.c 2004-05-26 20:25:49.000000000 +0300
+++ linux-2.6.6/fs/jbd/journal.c 2004-05-27 21:08:52.686693408 +0300
@@ -71,6 +71,7 @@
EXPORT_SYMBOL(journal_errno);
EXPORT_SYMBOL(journal_ack_err);
EXPORT_SYMBOL(journal_clear_err);
+EXPORT_SYMBOL(log_start_commit);
EXPORT_SYMBOL(log_wait_commit);
EXPORT_SYMBOL(journal_start_commit);
EXPORT_SYMBOL(journal_wipe);
Index: linux-2.6.6/fs/super.c
===================================================================
--- linux-2.6.6.orig/fs/super.c 2004-05-26 20:25:43.000000000 +0300
+++ linux-2.6.6/fs/super.c 2004-05-27 21:08:52.718688544 +0300
@@ -788,6 +788,8 @@
return (struct vfsmount *)sb;
}
+EXPORT_SYMBOL(do_kern_mount);
+
struct vfsmount *kern_mount(struct file_system_type *type)
{
return do_kern_mount(type->name, 0, type->name, NULL);
Index: linux-2.6.6/include/linux/mm.h
===================================================================
--- linux-2.6.6.orig/include/linux/mm.h 2004-05-26 20:26:11.000000000 +0300
+++ linux-2.6.6/include/linux/mm.h 2004-05-27 21:08:52.735685960 +0300
@@ -589,6 +589,9 @@
return 0;
}
+/* truncate.c */
+extern void truncate_complete_page(struct address_space *mapping,struct page *);
+
/* filemap.c */
extern unsigned long page_unuse(struct page *);
extern void truncate_inode_pages(struct address_space *, loff_t);
Index: linux-2.6.6/include/linux/fs.h
===================================================================
--- linux-2.6.6.orig/include/linux/fs.h 2004-05-27 21:08:45.986711960 +0300
+++ linux-2.6.6/include/linux/fs.h 2004-05-27 21:08:52.738685504 +0300
@@ -1137,6 +1137,7 @@
extern int unregister_filesystem(struct file_system_type *);
extern struct vfsmount *kern_mount(struct file_system_type *);
extern int may_umount(struct vfsmount *);
+struct vfsmount *do_kern_mount(const char *type, int flags, const char *name, void *data);
extern long do_mount(char *, char *, char *, unsigned long, void *);
extern int vfs_statfs(struct super_block *, struct kstatfs *);
Index: linux-2.6.6/mm/truncate.c
===================================================================
--- linux-2.6.6.orig/mm/truncate.c 2004-05-26 20:26:14.000000000 +0300
+++ linux-2.6.6/mm/truncate.c 2004-05-27 21:08:52.750683680 +0300
@@ -42,7 +42,7 @@
* its lock, b) when a concurrent invalidate_inode_pages got there first and
* c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
*/
-static void
+void
truncate_complete_page(struct address_space *mapping, struct page *page)
{
if (page->mapping != mapping)
@@ -58,6 +58,8 @@
page_cache_release(page); /* pagecache ref */
}
+EXPORT_SYMBOL(truncate_complete_page);
+
/*
* This is for invalidate_inode_pages(). That function can be called at
* any time, and is not supposed to throw away dirty pages. But pages can
[-- Attachment #3: header_guards-vanilla-2.6.patch --]
[-- Type: application/octet-stream, Size: 1481 bytes --]
%diffstat
blockgroup_lock.h | 4 +++-
percpu_counter.h | 4 ++++
2 files changed, 7 insertions(+), 1 deletion(-)
%patch
Index: linux-2.6.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.6.orig/include/linux/percpu_counter.h 2004-04-04 11:37:23.000000000 +0800
+++ linux-2.6.6/include/linux/percpu_counter.h 2004-05-22 16:08:16.000000000 +0800
@@ -3,6 +3,8 @@
*
* WARNING: these things are HUGE. 4 kbytes per counter on 32-way P4.
*/
+#ifndef _LINUX_PERCPU_COUNTER_H
+#define _LINUX_PERCPU_COUNTER_H
#include <linux/config.h>
#include <linux/spinlock.h>
@@ -101,3 +103,5 @@ static inline void percpu_counter_dec(st
{
percpu_counter_mod(fbc, -1);
}
+
+#endif /* _LINUX_PERCPU_COUNTER_H */
Index: linux-2.6.6/include/linux/blockgroup_lock.h
===================================================================
--- linux-2.6.6.orig/include/linux/blockgroup_lock.h 2004-04-04 11:36:26.000000000 +0800
+++ linux-2.6.6/include/linux/blockgroup_lock.h 2004-05-22 16:08:45.000000000 +0800
@@ -3,6 +3,8 @@
*
* Simple hashed spinlocking.
*/
+#ifndef _LINUX_BLOCKGROUP_LOCK_H
+#define _LINUX_BLOCKGROUP_LOCK_H
#include <linux/config.h>
#include <linux/spinlock.h>
@@ -55,4 +57,4 @@ static inline void bgl_lock_init(struct
#define sb_bgl_lock(sb, block_group) \
(&(sb)->s_blockgroup_lock.locks[(block_group) & (NR_BG_LOCKS-1)].lock)
-
+#endif
[-- Attachment #4: lustre_version.patch --]
[-- Type: application/octet-stream, Size: 482 bytes --]
Version 36: don't dput dentry after error (b=2350), zero page->private (3119)
Version 35: pass intent to real_lookup after revalidate failure (b=3285)
Version 34: fix ext3 iopen assertion failure (b=2517, b=2399)
include/linux/lustre_version.h | 1 +
1 files changed, 1 insertion(+)
--- /dev/null Fri Aug 30 17:31:37 2002
+++ linux-2.4.18-18.8.0-l12-braam/include/linux/lustre_version.h Thu Feb 13 07:58:33 2003
@@ -0,0 +1 @@
+#define LUSTRE_KERNEL_VERSION 36
_
[-- Attachment #5: vanilla-2.6.6 --]
[-- Type: application/octet-stream, Size: 374 bytes --]
lustre_version.patch
vfs_intent-flags_rename-vanilla-2.6.patch
vfs-dcache_locking-vanilla-2.6.patch
vfs-dcache_lustre_invalid-vanilla-2.6.patch
vfs-intent_api-vanilla-2.6.patch
vfs-raw_ops-vanilla-2.6.patch
export-vanilla-2.6.patch
header_guards-vanilla-2.6.patch
vfs-intent_lustre-vanilla-2.6.patch
vfs-do_truncate.patch
vfs-lookup_last-vanilla-2.6.patch
[-- Attachment #6: vfs_intent-flags_rename-vanilla-2.6.patch --]
[-- Type: application/octet-stream, Size: 8145 bytes --]
%diffstat
fs/cifs/dir.c | 14 +++++++-------
fs/exec.c | 4 ++--
fs/namei.c | 4 ++--
fs/nfs/dir.c | 14 +++++++-------
fs/nfs/nfs4proc.c | 6 +++---
include/linux/namei.h | 15 +++++++--------
6 files changed, 28 insertions(+), 29 deletions(-)
%patch
Index: linux-2.6.6/fs/exec.c
===================================================================
--- linux-2.6.6.orig/fs/exec.c 2004-05-22 00:46:19.000000000 +0800
+++ linux-2.6.6/fs/exec.c 2004-05-22 01:36:12.000000000 +0800
@@ -122,7 +122,7 @@ asmlinkage long sys_uselib(const char __
struct nameidata nd;
int error;
- nd.intent.open.flags = FMODE_READ;
+ nd.intent.it_flags = FMODE_READ;
error = __user_walk(library, LOOKUP_FOLLOW|LOOKUP_OPEN, &nd);
if (error)
goto out;
@@ -483,7 +483,7 @@ struct file *open_exec(const char *name)
int err;
struct file *file;
- nd.intent.open.flags = FMODE_READ;
+ nd.intent.it_flags = FMODE_READ;
err = path_lookup(name, LOOKUP_FOLLOW|LOOKUP_OPEN, &nd);
file = ERR_PTR(err);
Index: linux-2.6.6/fs/namei.c
===================================================================
--- linux-2.6.6.orig/fs/namei.c 2004-05-22 00:46:19.000000000 +0800
+++ linux-2.6.6/fs/namei.c 2004-05-22 01:36:46.000000000 +0800
@@ -1266,8 +1266,8 @@ int open_namei(const char * pathname, in
acc_mode |= MAY_APPEND;
/* Fill in the open() intent data */
- nd->intent.open.flags = flag;
- nd->intent.open.create_mode = mode;
+ nd->intent.it_flags = flag;
+ nd->intent.it_create_mode = mode;
/*
* The simplest case - just a plain lookup.
Index: linux-2.6.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.6.orig/fs/nfs/dir.c 2004-04-04 11:37:06.000000000 +0800
+++ linux-2.6.6/fs/nfs/dir.c 2004-05-22 01:58:56.000000000 +0800
@@ -705,7 +705,7 @@ int nfs_is_exclusive_create(struct inode
return 0;
if (!nd || (nd->flags & LOOKUP_CONTINUE) || !(nd->flags & LOOKUP_CREATE))
return 0;
- return (nd->intent.open.flags & O_EXCL) != 0;
+ return (nd->intent.it_flags & O_EXCL) != 0;
}
static struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
@@ -778,7 +778,7 @@ static int is_atomic_open(struct inode *
if (nd->flags & LOOKUP_DIRECTORY)
return 0;
/* Are we trying to write to a read only partition? */
- if (IS_RDONLY(dir) && (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE)))
+ if (IS_RDONLY(dir) && (nd->intent.it_flags & (O_CREAT|O_TRUNC|FMODE_WRITE)))
return 0;
return 1;
}
@@ -799,7 +799,7 @@ static struct dentry *nfs_atomic_lookup(
dentry->d_op = NFS_PROTO(dir)->dentry_ops;
/* Let vfs_create() deal with O_EXCL */
- if (nd->intent.open.flags & O_EXCL)
+ if (nd->intent.it_flags & O_EXCL)
goto no_entry;
/* Open the file on the server */
@@ -807,7 +807,7 @@ static struct dentry *nfs_atomic_lookup(
/* Revalidate parent directory attribute cache */
nfs_revalidate_inode(NFS_SERVER(dir), dir);
- if (nd->intent.open.flags & O_CREAT) {
+ if (nd->intent.it_flags & O_CREAT) {
nfs_begin_data_update(dir);
inode = nfs4_atomic_open(dir, dentry, nd);
nfs_end_data_update(dir);
@@ -823,7 +823,7 @@ static struct dentry *nfs_atomic_lookup(
break;
/* This turned out not to be a regular file */
case -ELOOP:
- if (!(nd->intent.open.flags & O_NOFOLLOW))
+ if (!(nd->intent.it_flags & O_NOFOLLOW))
goto no_open;
/* case -EISDIR: */
/* case -EINVAL: */
@@ -857,7 +857,7 @@ static int nfs_open_revalidate(struct de
dir = parent->d_inode;
if (!is_atomic_open(dir, nd))
goto no_open;
- openflags = nd->intent.open.flags;
+ openflags = nd->intent.it_flags;
if (openflags & O_CREAT) {
/* If this is a negative dentry, just drop it */
if (!inode)
@@ -1022,7 +1022,7 @@ static int nfs_create(struct inode *dir,
attr.ia_valid = ATTR_MODE;
if (nd && (nd->flags & LOOKUP_CREATE))
- open_flags = nd->intent.open.flags;
+ open_flags = nd->intent.it_flags;
/*
* The 0 argument passed into the create function should one day
Index: linux-2.6.6/fs/nfs/nfs4proc.c
===================================================================
--- linux-2.6.6.orig/fs/nfs/nfs4proc.c 2004-05-22 00:46:19.000000000 +0800
+++ linux-2.6.6/fs/nfs/nfs4proc.c 2004-05-22 01:59:41.000000000 +0800
@@ -475,17 +475,17 @@ nfs4_atomic_open(struct inode *dir, stru
struct nfs4_state *state;
if (nd->flags & LOOKUP_CREATE) {
- attr.ia_mode = nd->intent.open.create_mode;
+ attr.ia_mode = nd->intent.it_create_mode;
attr.ia_valid = ATTR_MODE;
if (!IS_POSIXACL(dir))
attr.ia_mode &= ~current->fs->umask;
} else {
attr.ia_valid = 0;
- BUG_ON(nd->intent.open.flags & O_CREAT);
+ BUG_ON(nd->intent.it_flags & O_CREAT);
}
cred = rpcauth_lookupcred(NFS_SERVER(dir)->client->cl_auth, 0);
- state = nfs4_do_open(dir, &dentry->d_name, nd->intent.open.flags, &attr, cred);
+ state = nfs4_do_open(dir, &dentry->d_name, nd->intent.it_flags, &attr, cred);
put_rpccred(cred);
if (IS_ERR(state))
return (struct inode *)state;
Index: linux-2.6.6/fs/cifs/dir.c
===================================================================
--- linux-2.6.6.orig/fs/cifs/dir.c 2004-05-22 00:46:19.000000000 +0800
+++ linux-2.6.6/fs/cifs/dir.c 2004-05-22 02:00:12.000000000 +0800
@@ -146,22 +146,22 @@ cifs_create(struct inode *inode, struct
if(nd) {
cFYI(1,("In create for inode %p dentry->inode %p nd flags = 0x%x for %s",inode, direntry->d_inode, nd->flags,full_path));
- if ((nd->intent.open.flags & O_ACCMODE) == O_RDONLY)
+ if ((nd->intent.it_flags & O_ACCMODE) == O_RDONLY)
desiredAccess = GENERIC_READ;
- else if ((nd->intent.open.flags & O_ACCMODE) == O_WRONLY)
+ else if ((nd->intent.it_flags & O_ACCMODE) == O_WRONLY)
desiredAccess = GENERIC_WRITE;
- else if ((nd->intent.open.flags & O_ACCMODE) == O_RDWR) {
+ else if ((nd->intent.it_flags & O_ACCMODE) == O_RDWR) {
/* GENERIC_ALL is too much permission to request */
/* can cause unnecessary access denied on create */
/* desiredAccess = GENERIC_ALL; */
desiredAccess = GENERIC_READ | GENERIC_WRITE;
}
- if((nd->intent.open.flags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL))
+ if((nd->intent.it_flags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL))
disposition = FILE_CREATE;
- else if((nd->intent.open.flags & (O_CREAT | O_TRUNC)) == (O_CREAT | O_TRUNC))
+ else if((nd->intent.it_flags & (O_CREAT | O_TRUNC)) == (O_CREAT | O_TRUNC))
disposition = FILE_OVERWRITE_IF;
- else if((nd->intent.open.flags & O_CREAT) == O_CREAT)
+ else if((nd->intent.it_flags & O_CREAT) == O_CREAT)
disposition = FILE_OPEN_IF;
else {
cFYI(1,("Create flag not set in create function"));
@@ -311,7 +311,7 @@ cifs_lookup(struct inode *parent_dir_ino
parent_dir_inode, direntry->d_name.name, direntry));
if(nd) { /* BB removeme */
- cFYI(1,("In lookup nd flags 0x%x open intent flags 0x%x",nd->flags,nd->intent.open.flags));
+ cFYI(1,("In lookup nd flags 0x%x open intent flags 0x%x",nd->flags,nd->intent.it_flags));
} /* BB removeme BB */
/* BB Add check of incoming data - e.g. frame not longer than maximum SMB - let server check the namelen BB */
Index: linux-2.6.6/include/linux/namei.h
===================================================================
--- linux-2.6.6.orig/include/linux/namei.h 2004-04-04 11:36:55.000000000 +0800
+++ linux-2.6.6/include/linux/namei.h 2004-05-22 01:46:25.000000000 +0800
@@ -5,9 +5,12 @@
struct vfsmount;
-struct open_intent {
- int flags;
- int create_mode;
+#define INTENT_MAGIC 0x19620323
+struct lookup_intent {
+ int it_magic;
+ int it_op;
+ int it_flags;
+ int it_create_mode;
};
struct nameidata {
@@ -16,11 +19,7 @@ struct nameidata {
struct qstr last;
unsigned int flags;
int last_type;
-
- /* Intent data */
- union {
- struct open_intent open;
- } intent;
+ struct lookup_intent intent;
};
/*
[-- Attachment #7: vfs-dcache_locking-vanilla-2.6.patch --]
[-- Type: application/octet-stream, Size: 2685 bytes --]
%diffstat
fs/dcache.c | 22 ++++++++++++++++++----
include/linux/dcache.h | 2 ++
2 files changed, 20 insertions(+), 4 deletions(-)
%patch
Index: linux-2.6.6/fs/dcache.c
===================================================================
--- linux-2.6.6.orig/fs/dcache.c 2004-05-22 00:46:19.000000000 +0800
+++ linux-2.6.6/fs/dcache.c 2004-05-22 02:11:17.000000000 +0800
@@ -1115,13 +1115,20 @@ void d_delete(struct dentry * dentry)
* Adds a dentry to the hash according to its name.
*/
-void d_rehash(struct dentry * entry)
+void __d_rehash(struct dentry * entry)
{
struct hlist_head *list = d_hash(entry->d_parent, entry->d_name.hash);
- spin_lock(&dcache_lock);
entry->d_vfs_flags &= ~DCACHE_UNHASHED;
entry->d_bucket = list;
hlist_add_head_rcu(&entry->d_hash, list);
+}
+
+EXPORT_SYMBOL(__d_rehash);
+
+void d_rehash(struct dentry * entry)
+{
+ spin_lock(&dcache_lock);
+ __d_rehash(entry);
spin_unlock(&dcache_lock);
}
@@ -1185,12 +1192,11 @@ static inline void switch_names(struct d
* dcache entries should not be moved in this way.
*/
-void d_move(struct dentry * dentry, struct dentry * target)
+void __d_move(struct dentry * dentry, struct dentry * target)
{
if (!dentry->d_inode)
printk(KERN_WARNING "VFS: moving negative dcache entry\n");
- spin_lock(&dcache_lock);
write_seqlock(&rename_lock);
/*
* XXXX: do we really need to take target->d_lock?
@@ -1243,6 +1249,14 @@ already_unhashed:
spin_unlock(&target->d_lock);
spin_unlock(&dentry->d_lock);
write_sequnlock(&rename_lock);
+}
+
+EXPORT_SYMBOL(__d_move);
+
+void d_move(struct dentry *dentry, struct dentry *target)
+{
+ spin_lock(&dcache_lock);
+ __d_move(dentry, target);
spin_unlock(&dcache_lock);
}
Index: linux-2.6.6/include/linux/dcache.h
===================================================================
--- linux-2.6.6.orig/include/linux/dcache.h 2004-05-22 00:46:20.000000000 +0800
+++ linux-2.6.6/include/linux/dcache.h 2004-05-22 02:10:01.000000000 +0800
@@ -224,6 +224,7 @@ extern int have_submounts(struct dentry
* This adds the entry to the hash queues.
*/
extern void d_rehash(struct dentry *);
+extern void __d_rehash(struct dentry *);
/**
* d_add - add dentry to hash queues
@@ -242,6 +243,7 @@ static inline void d_add(struct dentry *
/* used for rename() and baskets */
extern void d_move(struct dentry *, struct dentry *);
+extern void __d_move(struct dentry *, struct dentry *);
/* appendix may either be NULL or be used for transname suffixes */
extern struct dentry * d_lookup(struct dentry *, struct qstr *);
[-- Attachment #8: vfs-dcache_lustre_invalid-vanilla-2.6.patch --]
[-- Type: application/octet-stream, Size: 1252 bytes --]
%diffstat
fs/dcache.c | 7 +++++++
include/linux/dcache.h | 1 +
2 files changed, 8 insertions(+)
%patch
Index: linux-2.6.6/fs/dcache.c
===================================================================
--- linux-2.6.6.orig/fs/dcache.c 2004-05-22 02:11:17.000000000 +0800
+++ linux-2.6.6/fs/dcache.c 2004-05-22 02:14:46.000000000 +0800
@@ -217,6 +217,13 @@ int d_invalidate(struct dentry * dentry)
spin_unlock(&dcache_lock);
return 0;
}
+
+ /* network invalidation by Lustre */
+ if (dentry->d_flags & DCACHE_LUSTRE_INVALID) {
+ spin_unlock(&dcache_lock);
+ return 0;
+ }
+
/*
* Check whether to do a partial shrink_dcache
* to get rid of unused child entries.
Index: linux-2.6.6/include/linux/dcache.h
===================================================================
--- linux-2.6.6.orig/include/linux/dcache.h 2004-05-22 02:10:01.000000000 +0800
+++ linux-2.6.6/include/linux/dcache.h 2004-05-22 02:15:17.000000000 +0800
@@ -153,6 +153,7 @@ d_iput: no no yes
#define DCACHE_REFERENCED 0x0008 /* Recently used, don't discard. */
#define DCACHE_UNHASHED 0x0010
+#define DCACHE_LUSTRE_INVALID 0x0020 /* invalidated by Lustre */
extern spinlock_t dcache_lock;
[-- Attachment #9: vfs-do_truncate.patch --]
[-- Type: application/octet-stream, Size: 3284 bytes --]
fs/exec.c | 2 +-
fs/namei.c | 2 +-
fs/open.c | 8 +++++---
include/linux/fs.h | 3 ++-
4 files changed, 9 insertions(+), 6 deletions(-)
Index: linux-2.6.6/fs/namei.c
===================================================================
--- linux-2.6.6.orig/fs/namei.c 2004-05-30 23:17:06.267030976 +0300
+++ linux-2.6.6/fs/namei.c 2004-05-30 23:23:15.642877312 +0300
@@ -1270,7 +1270,7 @@
if (!error) {
DQUOT_INIT(inode);
- error = do_truncate(dentry, 0);
+ error = do_truncate(dentry, 0, 1);
}
put_write_access(inode);
if (error)
Index: linux-2.6.6/fs/open.c
===================================================================
--- linux-2.6.6.orig/fs/open.c 2004-05-30 20:05:26.857206992 +0300
+++ linux-2.6.6/fs/open.c 2004-05-30 23:24:38.908219056 +0300
@@ -189,7 +189,7 @@
return error;
}
-int do_truncate(struct dentry *dentry, loff_t length)
+int do_truncate(struct dentry *dentry, loff_t length, int called_from_open)
{
int err;
struct iattr newattrs;
@@ -202,6 +202,8 @@
newattrs.ia_valid = ATTR_SIZE | ATTR_CTIME;
down(&dentry->d_inode->i_sem);
down_write(&dentry->d_inode->i_alloc_sem);
+ if (called_from_open)
+ newattrs.ia_valid |= ATTR_FROM_OPEN;
err = notify_change(dentry, &newattrs);
up_write(&dentry->d_inode->i_alloc_sem);
up(&dentry->d_inode->i_sem);
@@ -259,7 +261,7 @@
error = locks_verify_truncate(inode, NULL, length);
if (!error) {
DQUOT_INIT(inode);
- error = do_truncate(nd.dentry, length);
+ error = do_truncate(nd.dentry, length, 0);
}
put_write_access(inode);
@@ -311,7 +313,7 @@
error = locks_verify_truncate(inode, file, length);
if (!error)
- error = do_truncate(dentry, length);
+ error = do_truncate(dentry, length, 0);
out_putf:
fput(file);
out:
Index: linux-2.6.6/fs/exec.c
===================================================================
--- linux-2.6.6.orig/fs/exec.c 2004-05-30 20:05:26.862206232 +0300
+++ linux-2.6.6/fs/exec.c 2004-05-30 23:23:15.648876400 +0300
@@ -1395,7 +1395,7 @@
goto close_fail;
if (!file->f_op->write)
goto close_fail;
- if (do_truncate(file->f_dentry, 0) != 0)
+ if (do_truncate(file->f_dentry, 0, 0) != 0)
goto close_fail;
retval = binfmt->core_dump(signr, regs, file);
Index: linux-2.6.6/include/linux/fs.h
===================================================================
--- linux-2.6.6.orig/include/linux/fs.h 2004-05-30 23:20:11.979798344 +0300
+++ linux-2.6.6/include/linux/fs.h 2004-05-30 23:25:29.167578472 +0300
@@ -249,6 +249,7 @@
#define ATTR_ATTR_FLAG 1024
#define ATTR_KILL_SUID 2048
#define ATTR_KILL_SGID 4096
+#define ATTR_FROM_OPEN 8192 /* called from open path, ie O_TRUNC */
/*
* This is the Inode Attributes structure, used for notify_change(). It
@@ -1189,7 +1190,7 @@
/* fs/open.c */
-extern int do_truncate(struct dentry *, loff_t start);
+extern int do_truncate(struct dentry *, loff_t start, int called_from_open);
extern struct file *filp_open(const char *, int, int);
extern struct file * dentry_open(struct dentry *, struct vfsmount *, int);
extern struct file * dentry_open_it(struct dentry *, struct vfsmount *, int, struct lookup_intent *);
[-- Attachment #10: vfs-intent_api-vanilla-2.6.patch --]
[-- Type: application/octet-stream, Size: 16232 bytes --]
fs/exec.c | 10 ++++---
fs/namei.c | 69 ++++++++++++++++++++++++++++++++++++++++++++------
fs/namespace.c | 1
fs/open.c | 28 +++++++++++++++-----
fs/stat.c | 10 +++++--
fs/xattr.c | 12 +++++---
include/linux/fs.h | 2 +
include/linux/namei.h | 27 +++++++++++++++++++
8 files changed, 134 insertions(+), 25 deletions(-)
Index: linux-2.6.6/include/linux/namei.h
===================================================================
--- linux-2.6.6.orig/include/linux/namei.h 2004-05-30 19:46:50.238958768 +0300
+++ linux-2.6.6/include/linux/namei.h 2004-05-30 20:05:26.849208208 +0300
@@ -2,17 +2,36 @@
#define _LINUX_NAMEI_H
#include <linux/linkage.h>
+#include <linux/string.h>
struct vfsmount;
+/* intent opcodes */
+#define IT_OPEN (1)
+#define IT_CREAT (1<<1)
+#define IT_READDIR (1<<2)
+#define IT_GETATTR (1<<3)
+#define IT_LOOKUP (1<<4)
+#define IT_UNLINK (1<<5)
+#define IT_TRUNC (1<<6)
+#define IT_GETXATTR (1<<7)
+
#define INTENT_MAGIC 0x19620323
struct lookup_intent {
int it_magic;
int it_op;
+ void (*it_op_release)(struct lookup_intent *);
int it_flags;
int it_create_mode;
};
+static inline void intent_init(struct lookup_intent *it, int op)
+{
+ memset(it, 0, sizeof(*it));
+ it->it_magic = INTENT_MAGIC;
+ it->it_op = op;
+}
+
struct nameidata {
struct dentry *dentry;
struct vfsmount *mnt;
@@ -48,14 +67,22 @@
#define LOOKUP_ACCESS (0x0400)
extern int FASTCALL(__user_walk(const char __user *, unsigned, struct nameidata *));
+extern int FASTCALL(__user_walk_it(const char __user *, unsigned, struct nameidata *));
#define user_path_walk(name,nd) \
__user_walk(name, LOOKUP_FOLLOW, nd)
+#define user_path_walk_it(name,nd) \
+ __user_walk_it(name, LOOKUP_FOLLOW, nd)
#define user_path_walk_link(name,nd) \
__user_walk(name, 0, nd)
+#define user_path_walk_link_it(name,nd) \
+ __user_walk_it(name, 0, nd)
extern int FASTCALL(path_lookup(const char *, unsigned, struct nameidata *));
+extern int FASTCALL(path_lookup_it(const char *, unsigned, struct nameidata *));
extern int FASTCALL(path_walk(const char *, struct nameidata *));
+extern int FASTCALL(path_walk_it(const char *, struct nameidata *));
extern int FASTCALL(link_path_walk(const char *, struct nameidata *));
extern void path_release(struct nameidata *);
+extern void intent_release(struct lookup_intent *);
extern struct dentry * lookup_one_len(const char *, struct dentry *, int);
extern struct dentry * lookup_hash(struct qstr *, struct dentry *);
Index: linux-2.6.6/include/linux/fs.h
===================================================================
--- linux-2.6.6.orig/include/linux/fs.h 2004-05-26 20:26:11.000000000 +0300
+++ linux-2.6.6/include/linux/fs.h 2004-05-30 20:05:26.852207752 +0300
@@ -576,6 +576,7 @@
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
+ struct lookup_intent *f_it;
};
extern spinlock_t files_lock;
#define file_list_lock() spin_lock(&files_lock);
@@ -1190,6 +1191,7 @@
extern int do_truncate(struct dentry *, loff_t start);
extern struct file *filp_open(const char *, int, int);
extern struct file * dentry_open(struct dentry *, struct vfsmount *, int);
+extern struct file * dentry_open_it(struct dentry *, struct vfsmount *, int, struct lookup_intent *);
extern int filp_close(struct file *, fl_owner_t id);
extern char * getname(const char __user *);
Index: linux-2.6.6/fs/namei.c
===================================================================
--- linux-2.6.6.orig/fs/namei.c 2004-05-30 19:46:50.185966824 +0300
+++ linux-2.6.6/fs/namei.c 2004-05-30 20:05:26.855207296 +0300
@@ -272,8 +272,19 @@
return 0;
}
+void intent_release(struct lookup_intent *it)
+{
+ if (!it)
+ return;
+ if (it->it_magic != INTENT_MAGIC)
+ return;
+ if (it->it_op_release)
+ it->it_op_release(it);
+}
+
void path_release(struct nameidata *nd)
{
+ intent_release(&nd->intent);
dput(nd->dentry);
mntput(nd->mnt);
}
@@ -774,8 +785,14 @@
return err;
}
+int fastcall path_walk_it(const char * name, struct nameidata *nd)
+{
+ current->total_link_count = 0;
+ return link_path_walk(name, nd);
+}
int fastcall path_walk(const char * name, struct nameidata *nd)
{
+ intent_init(&nd->intent, IT_LOOKUP);
current->total_link_count = 0;
return link_path_walk(name, nd);
}
@@ -784,7 +801,7 @@
/* returns 1 if everything is done */
static int __emul_lookup_dentry(const char *name, struct nameidata *nd)
{
- if (path_walk(name, nd))
+ if (path_walk_it(name, nd))
return 0; /* something went wrong... */
if (!nd->dentry->d_inode || S_ISDIR(nd->dentry->d_inode->i_mode)) {
@@ -861,7 +878,18 @@
return 1;
}
-int fastcall path_lookup(const char *name, unsigned int flags, struct nameidata *nd)
+static inline int it_mode_from_lookup_flags(int flags)
+{
+ int mode = IT_LOOKUP;
+
+ if (flags & LOOKUP_OPEN)
+ mode = IT_OPEN;
+ if (flags & LOOKUP_CREATE)
+ mode |= IT_CREAT;
+ return mode;
+}
+
+int fastcall path_lookup_it(const char *name, unsigned int flags, struct nameidata *nd)
{
int retval;
@@ -896,6 +924,12 @@
return retval;
}
+int fastcall path_lookup(const char *name, unsigned int flags, struct nameidata *nd)
+{
+ intent_init(&nd->intent, it_mode_from_lookup_flags(flags));
+ return path_lookup_it(name, flags, nd);
+}
+
/*
* Restricted form of lookup. Doesn't follow links, single-component only,
* needs parent already locked. Doesn't follow mounts.
@@ -946,7 +980,7 @@
}
/* SMP-safe */
-struct dentry * lookup_one_len(const char * name, struct dentry * base, int len)
+struct dentry * lookup_one_len_it(const char * name, struct dentry * base, int len, struct nameidata *nd)
{
unsigned long hash;
struct qstr this;
@@ -966,11 +1000,16 @@
}
this.hash = end_name_hash(hash);
- return lookup_hash(&this, base);
+ return __lookup_hash(&this, base, nd);
access:
return ERR_PTR(-EACCES);
}
+struct dentry * lookup_one_len(const char * name, struct dentry * base, int len)
+{
+ return lookup_one_len_it(name, base, len, NULL);
+}
+
/*
* namei()
*
@@ -982,18 +1021,24 @@
* that namei follows links, while lnamei does not.
* SMP-safe
*/
-int fastcall __user_walk(const char __user *name, unsigned flags, struct nameidata *nd)
+int fastcall __user_walk_it(const char __user *name, unsigned flags, struct nameidata *nd)
{
char *tmp = getname(name);
int err = PTR_ERR(tmp);
if (!IS_ERR(tmp)) {
- err = path_lookup(tmp, flags, nd);
+ err = path_lookup_it(tmp, flags, nd);
putname(tmp);
}
return err;
}
+int fastcall __user_walk(const char __user *name, unsigned flags, struct nameidata *nd)
+{
+ intent_init(&nd->intent, it_mode_from_lookup_flags(flags));
+ return __user_walk_it(name, flags, nd);
+}
+
/*
* It's inline, so penalty for filesystems that don't use sticky bit is
* minimal.
@@ -1273,7 +1318,7 @@
* The simplest case - just a plain lookup.
*/
if (!(flag & O_CREAT)) {
- error = path_lookup(pathname, lookup_flags(flag)|LOOKUP_OPEN, nd);
+ error = path_lookup_it(pathname, lookup_flags(flag), nd);
if (error)
return error;
goto ok;
@@ -1282,7 +1327,8 @@
/*
* Create - we need to know the parent.
*/
- error = path_lookup(pathname, LOOKUP_PARENT|LOOKUP_OPEN|LOOKUP_CREATE, nd);
+ nd->intent.it_op |= IT_CREAT;
+ error = path_lookup_it(pathname, LOOKUP_PARENT, nd);
if (error)
return error;
@@ -2165,6 +2211,7 @@
__vfs_follow_link(struct nameidata *nd, const char *link)
{
int res = 0;
+ struct lookup_intent it = nd->intent;
char *name;
if (IS_ERR(link))
goto fail;
@@ -2175,6 +2222,9 @@
/* weird __emul_prefix() stuff did it */
goto out;
}
+ intent_init(&nd->intent, it.it_op);
+ nd->intent.it_flags = it.it_flags;
+ nd->intent.it_create_mode = it.it_create_mode;
res = link_path_walk(link, nd);
out:
if (current->link_count || res || nd->last_type!=LAST_NORM)
@@ -2249,6 +2299,7 @@
return res;
}
+
int page_symlink(struct inode *inode, const char *symname, int len)
{
struct address_space *mapping = inode->i_mapping;
@@ -2309,8 +2360,10 @@
EXPORT_SYMBOL(page_symlink);
EXPORT_SYMBOL(page_symlink_inode_operations);
EXPORT_SYMBOL(path_lookup);
+EXPORT_SYMBOL(path_lookup_it);
EXPORT_SYMBOL(path_release);
EXPORT_SYMBOL(path_walk);
+EXPORT_SYMBOL(path_walk_it);
EXPORT_SYMBOL(permission);
EXPORT_SYMBOL(unlock_rename);
EXPORT_SYMBOL(vfs_create);
Index: linux-2.6.6/fs/open.c
===================================================================
--- linux-2.6.6.orig/fs/open.c 2004-05-26 20:25:43.000000000 +0300
+++ linux-2.6.6/fs/open.c 2004-05-30 20:05:26.857206992 +0300
@@ -214,11 +214,12 @@
struct inode * inode;
int error;
+ intent_init(&nd.intent, IT_GETATTR);
error = -EINVAL;
if (length < 0) /* sorry, but loff_t says... */
goto out;
- error = user_path_walk(path, &nd);
+ error = user_path_walk_it(path, &nd);
if (error)
goto out;
inode = nd.dentry->d_inode;
@@ -473,6 +474,7 @@
kernel_cap_t old_cap;
int res;
+ intent_init(&nd.intent, IT_GETATTR);
if (mode & ~S_IRWXO) /* where's F_OK, X_OK, W_OK, R_OK? */
return -EINVAL;
@@ -496,7 +498,7 @@
else
current->cap_effective = current->cap_permitted;
- res = __user_walk(filename, LOOKUP_FOLLOW|LOOKUP_ACCESS, &nd);
+ res = __user_walk_it(filename, LOOKUP_FOLLOW|LOOKUP_ACCESS, &nd);
if (!res) {
res = permission(nd.dentry->d_inode, mode, &nd);
/* SuS v2 requires we report a read only fs too */
@@ -518,7 +520,8 @@
struct nameidata nd;
int error;
- error = __user_walk(filename, LOOKUP_FOLLOW|LOOKUP_DIRECTORY, &nd);
+ intent_init(&nd.intent, IT_GETATTR);
+ error = __user_walk_it(filename, LOOKUP_FOLLOW|LOOKUP_DIRECTORY, &nd);
if (error)
goto out;
@@ -569,7 +572,8 @@
struct nameidata nd;
int error;
- error = __user_walk(filename, LOOKUP_FOLLOW | LOOKUP_DIRECTORY | LOOKUP_NOALT, &nd);
+ intent_init(&nd.intent, IT_GETATTR);
+ error = __user_walk_it(filename, LOOKUP_FOLLOW | LOOKUP_DIRECTORY | LOOKUP_NOALT, &nd);
if (error)
goto out;
@@ -752,6 +756,7 @@
{
int namei_flags, error;
struct nameidata nd;
+ intent_init(&nd.intent, IT_OPEN);
namei_flags = flags;
if ((namei_flags+1) & O_ACCMODE)
@@ -761,14 +766,14 @@
error = open_namei(filename, namei_flags, mode, &nd);
if (!error)
- return dentry_open(nd.dentry, nd.mnt, flags);
+ return dentry_open_it(nd.dentry, nd.mnt, flags, &nd.intent);
return ERR_PTR(error);
}
EXPORT_SYMBOL(filp_open);
-struct file *dentry_open(struct dentry *dentry, struct vfsmount *mnt, int flags)
+struct file *dentry_open_it(struct dentry *dentry, struct vfsmount *mnt, int flags, struct lookup_intent *it)
{
struct file * f;
struct inode *inode;
@@ -780,6 +785,7 @@
goto cleanup_dentry;
f->f_flags = flags;
f->f_mode = (flags+1) & O_ACCMODE;
+ f->f_it = it;
inode = dentry->d_inode;
if (f->f_mode & FMODE_WRITE) {
error = get_write_access(inode);
@@ -799,6 +805,7 @@
error = f->f_op->open(inode,f);
if (error)
goto cleanup_all;
+ intent_release(it);
}
f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
@@ -823,11 +830,20 @@
cleanup_file:
put_filp(f);
cleanup_dentry:
+ intent_release(it);
dput(dentry);
mntput(mnt);
return ERR_PTR(error);
}
+struct file *dentry_open(struct dentry *dentry, struct vfsmount *mnt, int flags)
+{
+ struct lookup_intent it;
+ intent_init(&it, IT_LOOKUP);
+
+ return dentry_open_it(dentry, mnt, flags, &it);
+}
+
EXPORT_SYMBOL(dentry_open);
/*
Index: linux-2.6.6/fs/stat.c
===================================================================
--- linux-2.6.6.orig/fs/stat.c 2004-05-26 20:25:43.000000000 +0300
+++ linux-2.6.6/fs/stat.c 2004-05-30 23:46:08.545164440 +0300
@@ -58,15 +58,15 @@
}
return 0;
}
-
EXPORT_SYMBOL(vfs_getattr);
int vfs_stat(char __user *name, struct kstat *stat)
{
struct nameidata nd;
int error;
+ intent_init(&nd.intent, IT_GETATTR);
- error = user_path_walk(name, &nd);
+ error = user_path_walk_it(name, &nd);
if (!error) {
error = vfs_getattr(nd.mnt, nd.dentry, stat);
path_release(&nd);
@@ -80,8 +80,9 @@
{
struct nameidata nd;
int error;
+ intent_init(&nd.intent, IT_GETATTR);
- error = user_path_walk_link(name, &nd);
+ error = user_path_walk_link_it(name, &nd);
if (!error) {
error = vfs_getattr(nd.mnt, nd.dentry, stat);
path_release(&nd);
@@ -95,9 +96,12 @@
{
struct file *f = fget(fd);
int error = -EBADF;
+ struct nameidata nd;
+ intent_init(&nd.intent, IT_GETATTR);
if (f) {
error = vfs_getattr(f->f_vfsmnt, f->f_dentry, stat);
+ intent_release(&nd.intent);
fput(f);
}
return error;
Index: linux-2.6.6/fs/namespace.c
===================================================================
--- linux-2.6.6.orig/fs/namespace.c 2004-05-26 20:25:43.000000000 +0300
+++ linux-2.6.6/fs/namespace.c 2004-05-30 20:05:26.860206536 +0300
@@ -115,6 +115,7 @@
static void detach_mnt(struct vfsmount *mnt, struct nameidata *old_nd)
{
+ memset(old_nd, 0, sizeof(*old_nd));
old_nd->dentry = mnt->mnt_mountpoint;
old_nd->mnt = mnt->mnt_parent;
mnt->mnt_parent = mnt;
Index: linux-2.6.6/fs/exec.c
===================================================================
--- linux-2.6.6.orig/fs/exec.c 2004-05-30 19:46:50.182967280 +0300
+++ linux-2.6.6/fs/exec.c 2004-05-30 20:05:26.862206232 +0300
@@ -122,8 +122,9 @@
struct nameidata nd;
int error;
+ intent_init(&nd.intent, IT_OPEN);
nd.intent.it_flags = FMODE_READ;
- error = __user_walk(library, LOOKUP_FOLLOW|LOOKUP_OPEN, &nd);
+ error = user_path_walk_it(library, &nd);
if (error)
goto out;
@@ -135,7 +136,7 @@
if (error)
goto exit;
- file = dentry_open(nd.dentry, nd.mnt, O_RDONLY);
+ file = dentry_open_it(nd.dentry, nd.mnt, O_RDONLY, &nd.intent);
error = PTR_ERR(file);
if (IS_ERR(file))
goto out;
@@ -483,8 +484,9 @@
int err;
struct file *file;
+ intent_init(&nd.intent, IT_OPEN);
nd.intent.it_flags = FMODE_READ;
- err = path_lookup(name, LOOKUP_FOLLOW|LOOKUP_OPEN, &nd);
+ err = path_lookup_it(name, LOOKUP_FOLLOW, &nd);
file = ERR_PTR(err);
if (!err) {
@@ -497,7 +499,7 @@
err = -EACCES;
file = ERR_PTR(err);
if (!err) {
- file = dentry_open(nd.dentry, nd.mnt, O_RDONLY);
+ file = dentry_open_it(nd.dentry, nd.mnt, O_RDONLY, &nd.intent);
if (!IS_ERR(file)) {
err = deny_write_access(file);
if (err) {
Index: linux-2.6.6/fs/xattr.c
===================================================================
--- linux-2.6.6.orig/fs/xattr.c 2004-05-26 20:25:43.000000000 +0300
+++ linux-2.6.6/fs/xattr.c 2004-05-30 20:05:26.863206080 +0300
@@ -161,7 +161,8 @@
struct nameidata nd;
ssize_t error;
- error = user_path_walk(path, &nd);
+ intent_init(&nd.intent, IT_GETXATTR);
+ error = user_path_walk_it(path, &nd);
if (error)
return error;
error = getxattr(nd.dentry, name, value, size);
@@ -176,7 +177,8 @@
struct nameidata nd;
ssize_t error;
- error = user_path_walk_link(path, &nd);
+ intent_init(&nd.intent, IT_GETXATTR);
+ error = user_path_walk_link_it(path, &nd);
if (error)
return error;
error = getxattr(nd.dentry, name, value, size);
@@ -242,7 +244,8 @@
struct nameidata nd;
ssize_t error;
- error = user_path_walk(path, &nd);
+ intent_init(&nd.intent, IT_GETXATTR);
+ error = user_path_walk_it(path, &nd);
if (error)
return error;
error = listxattr(nd.dentry, list, size);
@@ -256,7 +259,8 @@
struct nameidata nd;
ssize_t error;
- error = user_path_walk_link(path, &nd);
+ intent_init(&nd.intent, IT_GETXATTR);
+ error = user_path_walk_link_it(path, &nd);
if (error)
return error;
error = listxattr(nd.dentry, list, size);
[-- Attachment #11: vfs-intent_lustre-vanilla-2.6.patch --]
[-- Type: application/octet-stream, Size: 940 bytes --]
namei.h | 11 +++++++++++
1 files changed, 11 insertions(+)
Index: linux-2.6.6/include/linux/namei.h
===================================================================
--- linux-2.6.6.orig/include/linux/namei.h 2004-05-31 11:55:11.399239832 +0300
+++ linux-2.6.6/include/linux/namei.h 2004-05-31 11:56:45.338958824 +0300
@@ -22,6 +22,14 @@
#define IT_MKNOD (1<<12)
#define IT_SYMLINK (1<<13)
+struct lustre_intent_data {
+ int it_disposition;
+ int it_status;
+ __u64 it_lock_handle;
+ void *it_data;
+ int it_lock_mode;
+};
+
#define INTENT_MAGIC 0x19620323
#define IT_STATUS_RAW (1<<10) /* Setting this in it_flags on exit from lookup
means everything was done already and return
@@ -38,6 +46,9 @@
char *link; /* For symlink */
struct nameidata *source_nd; /* For link/rename */
} it_create;
+ union {
+ struct lustre_intent_data *lustre;
+ } d;
};
[-- Attachment #12: vfs-lookup_last-vanilla-2.6.patch --]
[-- Type: application/octet-stream, Size: 1803 bytes --]
fs/namei.c | 8 ++++++++
include/linux/namei.h | 3 +++
2 files changed, 11 insertions(+)
Index: linux-2.6.6/fs/namei.c
===================================================================
--- linux-2.6.6.orig/fs/namei.c 2004-05-27 21:24:45.151896688 +0300
+++ linux-2.6.6/fs/namei.c 2004-05-27 22:48:34.155371952 +0300
@@ -677,7 +677,9 @@
if (inode->i_op->follow_link) {
mntget(next.mnt);
+ nd->flags |= LOOKUP_LINK_NOTLAST;
err = do_follow_link(next.dentry, nd);
+ nd->flags &= ~LOOKUP_LINK_NOTLAST;
dput(next.dentry);
mntput(next.mnt);
if (err)
@@ -723,7 +725,9 @@
if (err < 0)
break;
}
+ nd->flags |= LOOKUP_LAST;
err = do_lookup(nd, &this, &next);
+ nd->flags &= ~LOOKUP_LAST;
if (err)
break;
follow_mount(&next.mnt, &next.dentry);
@@ -1344,7 +1348,9 @@
dir = nd->dentry;
nd->flags &= ~LOOKUP_PARENT;
down(&dir->d_inode->i_sem);
+ nd->flags |= LOOKUP_LAST;
dentry = __lookup_hash(&nd->last, nd->dentry, nd);
+ nd->flags &= ~LOOKUP_LAST;
do_last:
error = PTR_ERR(dentry);
@@ -1449,7 +1455,9 @@
}
dir = nd->dentry;
down(&dir->d_inode->i_sem);
+ nd->flags |= LOOKUP_LAST;
dentry = __lookup_hash(&nd->last, nd->dentry, nd);
+ nd->flags &= ~LOOKUP_LAST;
putname(nd->last.name);
goto do_last;
}
Index: linux-2.6.6/include/linux/namei.h
===================================================================
--- linux-2.6.6.orig/include/linux/namei.h 2004-05-27 21:24:45.078907784 +0300
+++ linux-2.6.6/include/linux/namei.h 2004-05-27 22:47:58.870736032 +0300
@@ -70,6 +70,9 @@
#define LOOKUP_CONTINUE 4
#define LOOKUP_PARENT 16
#define LOOKUP_NOALT 32
+#define LOOKUP_LAST 64
+#define LOOKUP_LINK_NOTLAST 128
+
/*
* Intent data
*/
[-- Attachment #13: vfs-raw_ops-vanilla-2.6.patch --]
[-- Type: application/octet-stream, Size: 6446 bytes --]
fs/namei.c | 73 ++++++++++++++++++++++++++++++++++++++++++++------
include/linux/namei.h | 17 +++++++++++
2 files changed, 82 insertions(+), 8 deletions(-)
Index: linux-2.6.6/fs/namei.c
===================================================================
--- linux-2.6.6.orig/fs/namei.c 2004-06-02 17:01:51.115405512 +0300
+++ linux-2.6.6/fs/namei.c 2004-06-02 17:05:18.898817632 +0300
@@ -560,12 +560,14 @@
return 0;
need_lookup:
+ nd->last = *name;
dentry = real_lookup(nd->dentry, name, nd);
if (IS_ERR(dentry))
goto fail;
goto done;
need_revalidate:
+ nd->last = *name;
if (dentry->d_op->d_revalidate(dentry, nd))
goto done;
if (d_invalidate(dentry))
@@ -606,6 +608,7 @@
unsigned long hash;
struct qstr this;
unsigned int c;
+ int span_mount = 0;
err = exec_permission_lite(inode, nd);
if (err == -EAGAIN) {
@@ -665,7 +668,8 @@
if (err)
break;
/* Check mountpoints.. */
- follow_mount(&next.mnt, &next.dentry);
+ if (follow_mount(&next.mnt, &next.dentry))
+ span_mount = 1;
err = -ENOENT;
inode = next.dentry->d_inode;
@@ -693,6 +697,12 @@
dput(nd->dentry);
nd->mnt = next.mnt;
nd->dentry = next.dentry;
+ if (span_mount && next.dentry->d_op &&
+ next.dentry->d_op->d_revalidate) {
+ nd->last = this;
+ next.dentry->d_op->d_revalidate(next.dentry, nd);
+ span_mount = 0;
+ }
}
err = -ENOTDIR;
if (!inode->i_op->lookup)
@@ -1523,9 +1533,18 @@
if (IS_ERR(tmp))
return PTR_ERR(tmp);
- error = path_lookup(tmp, LOOKUP_PARENT, &nd);
+ intent_init(&nd.intent, IT_MKNOD);
+ nd.intent.it_create_mode = mode;
+ nd.intent.it_create.dev = dev;
+
+ error = path_lookup_it(tmp, LOOKUP_PARENT, &nd);
if (error)
goto out;
+ if (nd.intent.it_flags & IT_STATUS_RAW) {
+ error = nd.intent.it_create.raw_status;
+ goto out2;
+ }
+
dentry = lookup_create(&nd, 0);
error = PTR_ERR(dentry);
@@ -1552,6 +1571,7 @@
dput(dentry);
}
up(&nd.dentry->d_inode->i_sem);
+out2:
path_release(&nd);
out:
putname(tmp);
@@ -1594,9 +1614,15 @@
struct dentry *dentry;
struct nameidata nd;
- error = path_lookup(tmp, LOOKUP_PARENT, &nd);
+ intent_init(&nd.intent, IT_MKDIR);
+ nd.intent.it_create_mode = mode;
+ error = path_lookup_it(tmp, LOOKUP_PARENT, &nd);
if (error)
goto out;
+ if (nd.intent.it_flags & IT_STATUS_RAW) {
+ error = nd.intent.it_create.raw_status;
+ goto out2;
+ }
dentry = lookup_create(&nd, 1);
error = PTR_ERR(dentry);
if (!IS_ERR(dentry)) {
@@ -1606,6 +1632,7 @@
dput(dentry);
}
up(&nd.dentry->d_inode->i_sem);
+out2:
path_release(&nd);
out:
putname(tmp);
@@ -1691,9 +1718,14 @@
if(IS_ERR(name))
return PTR_ERR(name);
- error = path_lookup(name, LOOKUP_PARENT, &nd);
+ intent_init(&nd.intent, IT_RMDIR);
+ error = path_lookup_it(name, LOOKUP_PARENT, &nd);
if (error)
goto exit;
+ if (nd.intent.it_flags & IT_STATUS_RAW) {
+ error = nd.intent.it_create.raw_status;
+ goto exit1;
+ }
switch(nd.last_type) {
case LAST_DOTDOT:
@@ -1769,9 +1801,15 @@
if(IS_ERR(name))
return PTR_ERR(name);
- error = path_lookup(name, LOOKUP_PARENT, &nd);
+ intent_init(&nd.intent, IT_UNLINK);
+ error = path_lookup_it(name, LOOKUP_PARENT, &nd);
if (error)
goto exit;
+ if (nd.intent.it_flags & IT_STATUS_RAW) {
+ error = nd.intent.it_create.raw_status;
+ goto exit1;
+ }
+
error = -EISDIR;
if (nd.last_type != LAST_NORM)
goto exit1;
@@ -1843,9 +1881,15 @@
struct dentry *dentry;
struct nameidata nd;
- error = path_lookup(to, LOOKUP_PARENT, &nd);
+ intent_init(&nd.intent, IT_SYMLINK);
+ nd.intent.it_create.link = from;
+ error = path_lookup_it(to, LOOKUP_PARENT, &nd);
if (error)
goto out;
+ if (nd.intent.it_flags & IT_STATUS_RAW) {
+ error = nd.intent.it_create.raw_status;
+ goto out2;
+ }
dentry = lookup_create(&nd, 0);
error = PTR_ERR(dentry);
if (!IS_ERR(dentry)) {
@@ -1853,6 +1897,7 @@
dput(dentry);
}
up(&nd.dentry->d_inode->i_sem);
+out2:
path_release(&nd);
out:
putname(to);
@@ -1924,9 +1969,15 @@
error = __user_walk(oldname, 0, &old_nd);
if (error)
goto exit;
- error = path_lookup(to, LOOKUP_PARENT, &nd);
+ intent_init(&nd.intent, IT_LINK);
+ nd.intent.it_create.source_nd = &old_nd;
+ error = path_lookup_it(to, LOOKUP_PARENT, &nd);
if (error)
goto out;
+ if (nd.intent.it_flags & IT_STATUS_RAW) {
+ error = nd.intent.it_create.raw_status;
+ goto out_release;
+ }
error = -EXDEV;
if (old_nd.mnt != nd.mnt)
goto out_release;
@@ -2107,9 +2158,15 @@
if (error)
goto exit;
- error = path_lookup(newname, LOOKUP_PARENT, &newnd);
+ intent_init(&newnd.intent, IT_RENAME);
+ newnd.intent.it_create.source_nd = &oldnd;
+ error = path_lookup_it(newname, LOOKUP_PARENT, &newnd);
if (error)
goto exit1;
+ if (newnd.intent.it_flags & IT_STATUS_RAW) {
+ error = newnd.intent.it_create.raw_status;
+ goto exit2;
+ }
error = -EXDEV;
if (oldnd.mnt != newnd.mnt)
Index: linux-2.6.6/include/linux/namei.h
===================================================================
--- linux-2.6.6.orig/include/linux/namei.h 2004-06-02 17:01:51.091409160 +0300
+++ linux-2.6.6/include/linux/namei.h 2004-06-02 17:01:54.912828216 +0300
@@ -15,16 +15,33 @@
#define IT_UNLINK (1<<5)
#define IT_TRUNC (1<<6)
#define IT_GETXATTR (1<<7)
+#define IT_RMDIR (1<<8)
+#define IT_LINK (1<<9)
+#define IT_RENAME (1<<10)
+#define IT_MKDIR (1<<11)
+#define IT_MKNOD (1<<12)
+#define IT_SYMLINK (1<<13)
#define INTENT_MAGIC 0x19620323
+#define IT_STATUS_RAW (1<<10) /* Setting this in it_flags on exit from lookup
+ means everything was done already and return
+ value from lookup is in fact status of
+ already performed operation */
struct lookup_intent {
int it_magic;
int it_op;
void (*it_op_release)(struct lookup_intent *);
int it_flags;
int it_create_mode;
+ union {
+ int raw_status; /* return value from raw method */
+ unsigned dev; /* For mknod */
+ char *link; /* For symlink */
+ struct nameidata *source_nd; /* For link/rename */
+ } it_create;
};
+
static inline void intent_init(struct lookup_intent *it, int op)
{
memset(it, 0, sizeof(*it));
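
For readers following the intent plumbing in the patches above, here is a minimal user-space sketch of two pieces: the mapping from lookup flags to an intent op, and the magic-guarded release. It is an illustration, not kernel code. The numeric values of IT_OPEN, IT_CREAT, IT_LOOKUP, LOOKUP_OPEN and LOOKUP_CREATE are stand-ins chosen for the sketch, the tail of intent_init() (setting it_magic and it_op) is an assumption because the hunk above is cut off after the memset, and count_release()/release_count are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INTENT_MAGIC  0x19620323
/* Stand-in values for this sketch only; not taken from the patch. */
#define IT_OPEN       (1 << 0)
#define IT_CREAT      (1 << 1)
#define IT_LOOKUP     (1 << 4)
#define LOOKUP_OPEN   0x0100
#define LOOKUP_CREATE 0x0200

/* Same fields as the patched struct, minus the it_create union. */
struct lookup_intent {
    int it_magic;
    int it_op;
    void (*it_op_release)(struct lookup_intent *);
    int it_flags;
    int it_create_mode;
};

/* Zero the intent, then stamp it. The two assignments are assumed. */
void intent_init(struct lookup_intent *it, int op)
{
    memset(it, 0, sizeof(*it));
    it->it_magic = INTENT_MAGIC;
    it->it_op = op;
}

/* Mirrors the patch: only a correctly stamped intent is released. */
void intent_release(struct lookup_intent *it)
{
    if (!it)
        return;
    if (it->it_magic != INTENT_MAGIC)
        return;
    if (it->it_op_release)
        it->it_op_release(it);
}

/* Mirrors the patch: derive the intent op from the lookup flags. */
int it_mode_from_lookup_flags(int flags)
{
    int mode = IT_LOOKUP;

    if (flags & LOOKUP_OPEN)
        mode = IT_OPEN;
    if (flags & LOOKUP_CREATE)
        mode |= IT_CREAT;
    return mode;
}

/* Hypothetical release callback that just counts invocations. */
int release_count;

void count_release(struct lookup_intent *it)
{
    (void)it;
    release_count++;
}
```

The magic check is what lets path_release() call intent_release() unconditionally: a nameidata whose intent was never initialised (or was zeroed, as in detach_mnt) is simply skipped.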
* Re: [PATCH/RFC] Lustre VFS patch, version 2
2004-06-02 23:15 [PATCH/RFC] Lustre VFS patch, version 2 Peter J. Braam
@ 2004-06-03 13:59 ` Christoph Hellwig
2004-06-03 14:19 ` Lars Marowsky-Bree
2004-06-03 14:27 ` Christoph Hellwig
2004-06-04 16:55 ` Anton Blanchard
2 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2004-06-03 13:59 UTC (permalink / raw)
To: Peter J. Braam
Cc: linux-kernel, hch, axboe, lmb, kevcorry, arjanv, iro,
trond.myklebust, anton, lustre-devel
On Wed, Jun 02, 2004 at 05:15:27PM -0600, Peter J. Braam wrote:
> People requested to see the code that uses the patch. We have uploaded that
> to:
>
> ftp://ftp.clusterfs.com:/pub/lustre/lkml/lustre-client_and_mds.tgz
>
> The client file system is the primary user of the kernel patch, in the
> llite directory. The MDS server is a sample user of do_kern_mount. As
> requested I have removed many other things from the tar ball to make
> review simple (so this won't compile or run).
> We actually need do_kern_mount and truncate_complete_page. Do kern
> mount is used because we use a file system namespace in servers in the
> kernel without exporting it to user space (mds/handler.c). The server
> file systems are ext3 file systems but we replace VFS locking with DLM
> locks, and it would take considerable work to export that as a file
> system.
Yikes. I'd rather not see something like this going in, and better work
on properly integrating the MDS code with the filesystem. There's also
lots of duplication or almost duplication of VFS functionality in that
directory and the fsfilter horrors. I'd suggest you get that cleaned up
and we'll try to merge it into 2.7, okay?
> Truncate_complete_page is used to remove pages in the middle of a file
> mapping, when lock revocations happen (llite/file.c
> ll_extent_lock_callback, calling ll_pgcache_remove_extent) .
Most of ll_pgcache_remove_extent probably wants to be a proper VFS
function. Again, only interesting if the rest of lustre gets merged.
>
> 2. lustre_version.patch concerns by Christoph Hellwig:
>
> This one can easily be removed, but kernel version alone does not
> necessarily represent anything useful. There are tons of people
> patching their kernel with patches, even applying parts of newer
> kernel and still leaving kernel version at its old value
> (distributions immediately come to mind). So we still need something
> to identify version of necessary bits. E.g. version of intent API.
Well, bad luck for you. It's not like there's much interest in merging
any of these patches into the tree without the actual users anyway..
> 3. Introduction of lock-less version of d_rehash (__d_rehash) by
> Christoph Hellwig:
>
> In some places lustre needs to do several things to dentry's with
> dcache lock held already, e.g. traverse alias dentries in inode to
> find one with same name and parent as the one we have already. Lustre
> can invalidate busy dentries, which we put on a list. If these are
> looked up again, concurrently, we find them on this list and re-use
> them, to avoid having several identical aliases in an inode. See
> llite/{dcache.c,namei.c} ll_revalidate and the lock callback function
> ll_mdc_blocking_ast which calls ll_unhash_aliases. We use d_move to
> manipulate dentries associated with raw inodes and names in ext3.
I've only taken a short look at the dcache operations you're doing and
it looks a little fishy and very sensitive to small changes in internal
dcache semantics. You're also missing e.g. the LSM callbacks it seems.
Have you talked to Al about that code?
> 4. vfs intent API changes kernel exported concern API by Christoph
> Hellwig:
>
> With slight modification it is possible to reduce the changes to just
> changes in the name of intent structure itself and some of its
> fields.
>
> This renaming was requested by Linus, but we can change names back
> easily if needed, that would avoid any api change. Are there other
> users, please let us know what to do?
Again, you're changing a filesystem API. We have a bunch of in-tree users
that can be modular, so it's likely there are out-of-tree users, too.
The new semantics might be much nicer, but it's 2.7 material.
> 7. raw operations concerns by various people:
>
> We have now implemented an alternative approach to this, that is
> taking place when parent lookup is done, using intents. For setattr
> we managed to remove the raw operations altogether (praying that we
> haven't forgotten some awful problem we solved that led to the
> introduction of setattr_raw in the first place).
>
> The correctly filled intent is recognised by filesystem's lookup or
> revalidate method. After the parent is looked up, based on the intent
> the correct "raw" server call is executed, within the file
> system. Then a special flag is set in intent, the caller of parent
> lookup checks for the flag and if it is set, the functions returns
> immediately with supplied (in intent)exit code, without instantiating
> child dentries.
>
> This needs some minor changes to VFS, though. There are at
> least two approaches.
>
> One is to not introduce any new methods and just rely on the fs's methods
> to do everything; for this to work the filesystem needs to know the
> remaining path to be traversed (we can fill nd->last with remaining
> path before calling into fs). In the root directory of the mount, we
> need to call a revalidate (if supported by fs) on mountpoint to
> intercept the intent, after we crossed mountpoint. We have this
> approach implemented in that attached patch. Does it look better than
> the raw operations?
I'm not sure whether overloading ->d_revalidate or a new method for
that is preferable.
> Much simpler for us is to add additional inode operation
> "process_intent" method that would be called when LOOKUP_PARENT sort
> of lookup was requested and we are about to leave link_path_walk()
> with nameidata structure filled and everything ready. Then the same
> flag in intent will be set and everything else as in previous
> approach.
Yupp, that sounds better.
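
To make the flag-based flow discussed above concrete, here is a minimal user-space sketch of it. It is an illustration under stated assumptions, not code from the patch: struct lookup_intent is trimmed to the fields used, the it_magic/it_op assignments in intent_init() are assumed, and process_intent_t, parent_lookup(), sys_mkdir_model(), cluster_process_intent() and generic_path_calls are hypothetical names. The path walk itself is elided.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define INTENT_MAGIC  0x19620323
#define IT_MKDIR      (1 << 11)
#define IT_STATUS_RAW (1 << 10)   /* set in it_flags, as in the patch */

/* Trimmed to the fields this sketch needs. */
struct lookup_intent {
    int it_magic;
    int it_op;
    int it_flags;
    int it_create_mode;
    union {
        int raw_status;           /* result of the "raw" operation */
    } it_create;
};

struct nameidata {
    struct lookup_intent intent;
};

/* Hypothetical per-filesystem hook, run once the parent is looked up. */
typedef int (*process_intent_t)(struct nameidata *nd);

void intent_init(struct lookup_intent *it, int op)
{
    memset(it, 0, sizeof(*it));
    it->it_magic = INTENT_MAGIC;  /* assumed */
    it->it_op = op;               /* assumed */
}

/* Stand-in for a LOOKUP_PARENT walk; the real traversal is elided. */
int parent_lookup(struct nameidata *nd, process_intent_t hook)
{
    return hook ? hook(nd) : 0;
}

/* Counts how often the ordinary (non-raw) path would have run. */
int generic_path_calls;

/* Caller side, following the sys_mkdir hunk in the raw_ops patch. */
int sys_mkdir_model(int mode, process_intent_t hook)
{
    struct nameidata nd;
    int error;

    intent_init(&nd.intent, IT_MKDIR);
    nd.intent.it_create_mode = mode;
    error = parent_lookup(&nd, hook);
    if (error)
        return error;
    /* The filesystem already did the work: just report its status. */
    if (nd.intent.it_flags & IT_STATUS_RAW)
        return nd.intent.it_create.raw_status;
    /* Otherwise lookup_create() + vfs_mkdir() would run here. */
    generic_path_calls++;
    return 0;
}

/* A filesystem that executes mkdir remotely and reports -EEXIST. */
int cluster_process_intent(struct nameidata *nd)
{
    if (nd->intent.it_op != IT_MKDIR)
        return 0;
    nd->intent.it_create.raw_status = -EEXIST;
    nd->intent.it_flags |= IT_STATUS_RAW;
    return 0;
}
```

With a NULL hook, sys_mkdir_model() falls through to the generic path, which is how a local filesystem would behave; with cluster_process_intent() the caller returns the status supplied in the intent and never instantiates a child dentry.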
> Well, how close are we now to this being acceptable?
As already mentioned above they're completely uninteresting without
actually getting the user in tree _and_ maintained there (unlike e.g.
intermezzo or coda that are creeping along). I think based on those
patch we should be able to properly integrate intermezzo once 2.7 opens.
* Re: [PATCH/RFC] Lustre VFS patch, version 2
2004-06-03 13:59 ` Christoph Hellwig
@ 2004-06-03 14:19 ` Lars Marowsky-Bree
2004-06-03 14:26 ` Christoph Hellwig
` (4 more replies)
0 siblings, 5 replies; 13+ messages in thread
From: Lars Marowsky-Bree @ 2004-06-03 14:19 UTC (permalink / raw)
To: Christoph Hellwig, Peter J. Braam, linux-kernel, axboe, kevcorry,
arjanv, iro, trond.myklebust, anton, lustre-devel
On 2004-06-03T14:59:52,
Christoph Hellwig <hch@infradead.org> said:
> > Well, how close are we now to this being acceptable?
> As already mentioned above they're completely uninteresting without
> actually getting the user in tree _and_ maintained there (unlike e.g.
> intermezzo or coda that are creeping along). I think based on those
> patch we should be able to properly integrate intermezzo once 2.7 opens.
This is something I've got to disagree with.
First, Inter-mezzo is reasonably dead, from what I can see. As is Coda.
You'll notice that the developers behind them have sort-of moved on to
Lustre ;-)
The hooks (once cleaned up, no disagreement here, the technical feedback
so far has been very valuable and continues to be) are useful and in
effect needed not just for Lustre, but in principle for all cluster
filesystems, such as (Open)GFS and others, even potentially NFS4 et al.
The logic that _all_ modules and functionality need to be "in the tree"
right from the start for hooks to be useful is flawed, I'm afraid. Pure
horror that a proprietary cluster file system might also profit from it
is not, exactly, a sound technical argument. (I can assure you I don't
care at all for the proprietary cluster-fs.)
Lustre alone would be, roughly, ~10MB more sources, just in the kernel.
I don't think you want to merge that right now, as desirable as it is
on the other hand to be able to use it with a mainstream kernel. I think
this is why kbuild allows external modules to be built; with that logic
it would follow that this should be disabled too.
There certainly is an interest in merging these (cleaned up) extensions
and allowing powerful cluster filesystems to exist on Linux.
Another example of this is the cache invalidation hook which we went
through a few weeks ago too. Back then you complained about not having
an Open Source user (because it was requested by IBM GPFS), and so
GFS/OpenGFS chimed in - now it is the lack of an _in-tree_ Open Source
user...
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering \ ever tried. ever failed. no matter.
SUSE Labs | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ -- Samuel Beckett
* Re: [PATCH/RFC] Lustre VFS patch, version 2
2004-06-03 14:19 ` Lars Marowsky-Bree
@ 2004-06-03 14:26 ` Christoph Hellwig
2004-06-03 14:33 ` Christoph Hellwig
` (3 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2004-06-03 14:26 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: Christoph Hellwig, Peter J. Braam, linux-kernel, axboe, kevcorry,
arjanv, iro, trond.myklebust, anton, lustre-devel
On Thu, Jun 03, 2004 at 04:19:22PM +0200, Lars Marowsky-Bree wrote:
> On 2004-06-03T14:59:52,
> Christoph Hellwig <hch@infradead.org> said:
>
> > > Well, how close are we now to this being acceptable?
> > As already mentioned above they're completely uninteresting without
> > actually getting the user in tree _and_ maintained there (unlike e.g.
> > intermezzo or coda that are creeping along). I think based on those
> > patch we should be able to properly integrate intermezzo once 2.7 opens.
>
> This is something I've got to disagree with.
>
> First, Inter-mezzo is reasonably dead, from what I can see. As is Coda.
> You'll notice that the developers behind them have sort-of moved on to
> Lustre ;-)
Arggg, sorry. Typo there. It should have of course read
"I think based on those patches we should be able to properly integrate
LUSTRE once 2.7 opens"
.oO(/me looks for a brown paperbag to hide)
> The logic that _all_ modules and functionality need to be "in the tree"
> right from the start for hooks to be useful is flawed, I'm afraid. Pure
> horror that a proprietary cluster file system might also profit from it
> is not, exactly, a sound technical argument. (I can assure you I don't
> care at all for the proprietary cluster-fs.)
It's more about maintenance overhead. Maintaining features without the
user directly at hand isn't going anywhere. Especially when messing around
deeply in the VFS. By your argument we should also throw in all the
mosix and openssi hooks because they could possibly be useful, no? ;-)
> Another example of this is the cache invalidation hook which we went
> through a few weeks ago too. Back then you complained about not having
> an Open Source user (because it was requested by IBM GPFS), and so
> GFS/OpenGFS chimed in - now it is the lack of an _in-tree_ Open Source
> user...
I was always arguing mostly against the lack of an in-tree user. Lack of
something that we could merge even in the future is even worse.
* Re: [PATCH/RFC] Lustre VFS patch, version 2
2004-06-03 14:19 ` Lars Marowsky-Bree
2004-06-03 14:26 ` Christoph Hellwig
@ 2004-06-03 14:33 ` Christoph Hellwig
2004-06-03 14:49 ` Trond Myklebust
` (2 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2004-06-03 14:33 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: Christoph Hellwig, Peter J. Braam, linux-kernel, axboe, kevcorry,
arjanv, trond.myklebust, anton
On Thu, Jun 03, 2004 at 04:19:22PM +0200, Lars Marowsky-Bree wrote:
> The logic that _all_ modules and functionality need to be "in the tree"
> right from the start for hooks to be useful is flawed, I'm afraid.
And btw, I didn't say from the beginning. I just want a commitment from
the lustre folks that they're merging it so we can work out the rough edges
together. There's not much of a problem doing the merge spread over a few
kernel releases.
> Lustre alone would be, roughly, ~10MB more sources, just in the kernel.
I think for mainline mostly the client, aka the llite directory, would
be interesting, so a linux box can simply join the lustre cluster. The
metadata server and, even worse, the object storage box mods would require
tons of work to get anywhere near a mergeable shape and are less interesting
anyway.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH/RFC] Lustre VFS patch, version 2
2004-06-03 14:19 ` Lars Marowsky-Bree
2004-06-03 14:26 ` Christoph Hellwig
2004-06-03 14:33 ` Christoph Hellwig
@ 2004-06-03 14:49 ` Trond Myklebust
2004-06-03 18:10 ` Jan Harkes
2004-06-04 5:03 ` Daniel Phillips
4 siblings, 0 replies; 13+ messages in thread
From: Trond Myklebust @ 2004-06-03 14:49 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: Christoph Hellwig, Peter J. Braam, linux-kernel, axboe, kevcorry,
arjanv, iro, anton, lustre-devel
On Thu, 03/06/2004 at 07:19, Lars Marowsky-Bree wrote:
> The hooks (once cleaned up, no disagreement here, the technical feedback
> so far has been very valuable and continues to be) are useful and in
> effect needed not just for Lustre, but in principle for all cluster
> filesystems, such as (Open)GFS and others, even potentially NFS4 et al.
>
> The logic that _all_ modules and functionality need to be "in the tree"
> right from the start for hooks to be useful is flawed, I'm afraid. Pure
> horror that a proprietary cluster file system might also profit from it
> is not, exactly, a sound technical argument. (I can assure you I don't
> care at all for the proprietary cluster-fs.)
Whereas I agree that NFSv4 could use some of this (I'm mainly interested
in the intent_release() stuff in order to fix up an existing race), I
also agree with Christoph on the principle that having in-tree users
right from the start should be the norm rather than the exception.
Otherwise, exactly what is the plan for how to determine when an
interface is obsolete? Are we going to rely on all the out-of-tree
vendors to collectively step up and say "by the way - we're not using
this anymore."?
Cheers,
Trond
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH/RFC] Lustre VFS patch, version 2
2004-06-03 14:19 ` Lars Marowsky-Bree
` (2 preceding siblings ...)
2004-06-03 14:49 ` Trond Myklebust
@ 2004-06-03 18:10 ` Jan Harkes
2004-06-04 5:03 ` Daniel Phillips
4 siblings, 0 replies; 13+ messages in thread
From: Jan Harkes @ 2004-06-03 18:10 UTC (permalink / raw)
To: Lars Marowsky-Bree; +Cc: linux-kernel
On Thu, Jun 03, 2004 at 04:19:22PM +0200, Lars Marowsky-Bree wrote:
> First, Inter-mezzo is reasonably dead, from what I can see. As is Coda.
> You'll notice that the developers behind them have sort-of moved on to
> Lustre ;-)
Actually, Coda is not dead, there is still quite a bit of activity. It
just seems slow on the kernel side because we actually have kernel
modules for various operating systems: FreeBSD, NetBSD, Windows 9x,
Windows NT/2000/XP, Solaris, and recently MacOS/Darwin. As a result we
are quite conservative about any significant changes to the
kernel-userspace interface.
Jan
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH/RFC] Lustre VFS patch, version 2
2004-06-03 14:19 ` Lars Marowsky-Bree
` (3 preceding siblings ...)
2004-06-03 18:10 ` Jan Harkes
@ 2004-06-04 5:03 ` Daniel Phillips
4 siblings, 0 replies; 13+ messages in thread
From: Daniel Phillips @ 2004-06-04 5:03 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: Christoph Hellwig, Peter J. Braam, linux-kernel, axboe, kevcorry,
arjanv, viro, trond.myklebust, anton, lustre-devel
On Thursday 03 June 2004 10:19, Lars Marowsky-Bree wrote:
> The hooks (once cleaned up, no disagreement here, the technical feedback
> so far has been very valuable and continues to be) are useful and in
> effect needed not just for Lustre, but in principle for all cluster
> filesystems, such as (Open)GFS and others, even potentially NFS4 et al.
GFS is now down to needing two trivial patches:
1) export sync_inodes_sb
2) provide a filesystem hook for flock
Since GFS functions well without any of the current batch of proposed vfs
hooks, the word "needed" is not appropriate. Maybe there is something in
here that could benefit GFS, most probably in the intents department, but we
certainly do want to try it first before pronouncing on that. The raw_ops
seem to be entirely irrelevant to GFS, which is peer-to-peer, so does not
delegate anything to a server. I don't think we have a use for lookup_last.
There are quite possibly some helpful ideas in the dcache tweaks but the devil
is in the details: again we need to try it.
Such things as:
+#define DCACHE_LUSTRE_INVALID 0x0020 /* invalidated by Lustre */
clearly fail the "needed not just for Lustre" test.
Looking into my crystal ball, I see many further revisions of this patch set.
Unfortunately, in the latest revision we lost the patch-by-patch discussion,
which seems to have been replaced by a list of issues sorted by complainant.
That's interesting, but it's no substitute.
Regards,
Daniel
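
The DCACHE_LUSTRE_INVALID line quoted above adds one bit to the dentry flags word. What follows is a minimal userspace sketch, assuming a simplified model and not the code from the posted patch: `struct dentry_flags_model`, `mark_invalid()` and `dentry_usable()` are hypothetical names, and only the flag value is taken from the quoted hunk. It shows why a check on such a bit has no effect on dentries whose filesystem never sets it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DCACHE_LUSTRE_INVALID 0x0020   /* value taken from the quoted hunk */

/* Hypothetical model of a dentry: only the flags word is represented. */
struct dentry_flags_model {
    unsigned int d_flags;
};

/* A filesystem that opts in marks a dentry as invalidated. */
static void mark_invalid(struct dentry_flags_model *d)
{
    assert(d != NULL);
    d->d_flags |= DCACHE_LUSTRE_INVALID;
}

/* Generic-path check: for a dentry that never had the bit set this
 * always returns true, so filesystems that ignore the flag see no
 * change in behaviour. */
static bool dentry_usable(const struct dentry_flags_model *d)
{
    assert(d != NULL);
    return (d->d_flags & DCACHE_LUSTRE_INVALID) == 0;
}
```

The sketch only models the bit test; whether a filesystem-specific bit belongs in a shared flags word is the design question raised in the message.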
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH/RFC] Lustre VFS patch, version 2
2004-06-02 23:15 [PATCH/RFC] Lustre VFS patch, version 2 Peter J. Braam
2004-06-03 13:59 ` Christoph Hellwig
@ 2004-06-03 14:27 ` Christoph Hellwig
2004-06-04 16:55 ` Anton Blanchard
2 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2004-06-03 14:27 UTC (permalink / raw)
To: Peter J. Braam; +Cc: linux-kernel
Btw, could you please stop cross-posting to closed lists?
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH/RFC] Lustre VFS patch, version 2
2004-06-02 23:15 [PATCH/RFC] Lustre VFS patch, version 2 Peter J. Braam
2004-06-03 13:59 ` Christoph Hellwig
2004-06-03 14:27 ` Christoph Hellwig
@ 2004-06-04 16:55 ` Anton Blanchard
2004-06-07 18:02 ` Dipankar Sarma
2 siblings, 1 reply; 13+ messages in thread
From: Anton Blanchard @ 2004-06-04 16:55 UTC (permalink / raw)
To: Peter J. Braam
Cc: linux-kernel, hch, axboe, lmb, kevcorry, arjanv, iro,
trond.myklebust, lustre-devel
> 10. "Have these patches undergone any significant test?" by Anton Blanchard:
>
> There are two important questions I think:
> - Do the patches cause damage?
> Probably not anymore. SUSE has done testing and it appears the
> original patch I attached didn't break things (after one fix was
> made).
IBM did a lot of the work on that issue and it took the better part of a
week to find, fix and verify.
Anton
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH/RFC] Lustre VFS patch, version 2
2004-06-04 16:55 ` Anton Blanchard
@ 2004-06-07 18:02 ` Dipankar Sarma
0 siblings, 0 replies; 13+ messages in thread
From: Dipankar Sarma @ 2004-06-07 18:02 UTC (permalink / raw)
To: Anton Blanchard
Cc: Peter J. Braam, linux-kernel, hch, axboe, lmb, kevcorry, arjanv,
iro, trond.myklebust, lustre-devel
On Sat, Jun 05, 2004 at 02:55:48AM +1000, Anton Blanchard wrote:
>
> > 10. "Have these patches undergone any significant test?" by Anton Blanchard:
> >
> > There are two important questions I think:
> > - Do the patches cause damage?
> > Probably not anymore. SUSE has done testing and it appears the
> > original patch I attached didn't break things (after one fix was
> > made).
>
> IBM did a lot of the work on that issue and it took the better part of a
> week to find, fix and verify.
AFAIK, Maneesh asked about revalidate_special() returning negative dentries
and no checking of it in path lookup(), but got no reply from the Lustre
folks. It is clearly broken. Maneesh has found more breakage from the Lustre
VFS patches now. It would be helpful if they at least commented on
fixes for those.
Thanks
Dipankar
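
The missing check described above can be pictured with a small userspace model. This is a minimal sketch under stated assumptions, not the kernel code or the Lustre patch: `struct dentry_model`, `revalidate_model()` and `lookup_step()` are hypothetical names, and the only fact relied on is that a negative dentry is one with no inode attached.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * needed for the illustration are modelled. */
struct inode_model {
    unsigned long ino;
};

struct dentry_model {
    const char *name;
    struct inode_model *inode;   /* NULL means a negative dentry */
};

/* Hypothetical revalidation hook: it may hand back a negative dentry,
 * which is the case raised in the message above. */
static struct dentry_model *revalidate_model(struct dentry_model *d)
{
    return d;   /* returned unchanged, possibly with inode == NULL */
}

/* One step of a path walk. The caller must not assume that the dentry
 * coming back from revalidation is positive. */
static int lookup_step(struct dentry_model *d, struct inode_model **out)
{
    struct dentry_model *res;

    assert(out != NULL);
    res = revalidate_model(d);
    if (res == NULL)
        return -ENOENT;
    if (res->inode == NULL)      /* negative dentry: the name does not exist */
        return -ENOENT;

    *out = res->inode;
    return 0;
}
```

Without the `res->inode == NULL` test, the walk would continue with a dentry that has no inode behind it, which is the kind of breakage the message points at.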
^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: [PATCH/RFC] Lustre VFS patch, version 2
@ 2004-06-03 15:53 Peter J. Braam
2004-06-06 17:00 ` Christoph Hellwig
0 siblings, 1 reply; 13+ messages in thread
From: Peter J. Braam @ 2004-06-03 15:53 UTC (permalink / raw)
To: linux-kernel, torvalds, akpm
Cc: Christoph Hellwig, axboe, kevcorry, arjanv, viro, anton,
Trond Myklebust, Lars Marowsky-Bree
Hi,
Of course I am totally happy to include or not include the Lustre client
with it. However, that does lead to a sizeable amount of (completely
modular) code, as it depends on the networking, lock manager, logical
volume driver and metadata and object storage clients and the management
framework. It's 2M.
I'd like to also acknowledge that we should remove the small
incompatibility in the names of intents, to preserve api compatibility,
and add an inode method for intent execution. Yes, the LUSTRE_INVALID
flag was discussed on irc with Al Viro: he said that probably I really
needed _something_, he said it's hairy, so it was coded to not affect
anyone that doesn't use that flag.
I have not worked on Coda for 5 years, and have nothing to say about it.
I have recently withdrawn InterMezzo to be helpful to the kernel
community. Of course I would offer the same for Lustre. But as I have
said before, this time there are a lot of resources to maintain this.
Perhaps it is useful to explain that vendors (Novell, Dell, HP and
others) have urged me to enquire if the hooks could go into 2.6. All of
them have really major Lustre customers, running top10 super computing
clusters with Lustre. Having the hooks avoids having to patch vendor
kernels, which breaks support arrangements. As for our position, it's
in fact easier to wait and just collect clever insights from time to
time.
I represent them here. I understand and would respect the wait until
2.7 argument, but I think it is workable to get them into 2.6. Is it
really a big deal to go through these small patches a few more times to
judge if they are safe, and to include them? I think it would help
people who care and support Linux financially. I only hear Christoph
arguing against it, are there other insights?
Again many thanks for spending time to study the patches, it has already
helped Lustre get better.
- Peter -
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH/RFC] Lustre VFS patch, version 2
2004-06-03 15:53 Peter J. Braam
@ 2004-06-06 17:00 ` Christoph Hellwig
0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2004-06-06 17:00 UTC (permalink / raw)
To: Peter J. Braam; +Cc: linux-kernel
On Thu, Jun 03, 2004 at 11:53:43AM -0400, Peter J. Braam wrote:
> Perhaps it is useful to explain that vendors (Novell, Dell, HP and
> others) have urged me to enquire if the hooks could go into 2.6. All of
> them have really major Lustre customers, running top10 super computing
> clusters with Lustre. Having the hooks avoids having to patch vendor
> kernels, which breaks support arrangements. As for our position, it's
> in fact easier to wait and just collect clever insights from time to
> time.
>
> I represent them here. I understand and would respect the wait until
> 2.7 argument, but I think it is workable to get them into 2.6. Is it
> really a big deal to go through these small patches a few more times to
> judge if they are safe, and to include them? I think it would help
> people who care and support Linux financially. I only hear Christoph
> arguing against it, are there other insights?
Trond also clearly spoke against it and Anton didn't seem to be impressed
by the code quality of your patches either ;-) Only lmb, who certainly
has a vested interest by being responsible for clustering at one of the
above-mentioned vendors, has spoken for it. Given that SLES9 will already
have lustre, life should already be much simpler for you. If clusterfs is
actually interested in maintaining lustre as part of the linux kernel I'm
the last one to object, but without that you place the burden of maintaining
all the hooks that are very specific to your filesystem on us.
p.s. where's lustre's current cvs tree? I'd like to actually build a module
vs the hooks that you posted and grovel in the cvs history a little.
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2004-06-07 18:06 UTC | newest]
Thread overview: 13+ messages
2004-06-02 23:15 [PATCH/RFC] Lustre VFS patch, version 2 Peter J. Braam
2004-06-03 13:59 ` Christoph Hellwig
2004-06-03 14:19 ` Lars Marowsky-Bree
2004-06-03 14:26 ` Christoph Hellwig
2004-06-03 14:33 ` Christoph Hellwig
2004-06-03 14:49 ` Trond Myklebust
2004-06-03 18:10 ` Jan Harkes
2004-06-04 5:03 ` Daniel Phillips
2004-06-03 14:27 ` Christoph Hellwig
2004-06-04 16:55 ` Anton Blanchard
2004-06-07 18:02 ` Dipankar Sarma
-- strict thread matches above, loose matches on Subject: below --
2004-06-03 15:53 Peter J. Braam
2004-06-06 17:00 ` Christoph Hellwig