* [PATCH 00/26] Mount writer count and read-only bind mounts
@ 2007-06-22 20:03 Dave Hansen
2007-06-22 20:03 ` [PATCH 01/26] document nlink function Dave Hansen
` (26 more replies)
0 siblings, 27 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
The largest change from last time is in patch 25/26. It has
been reworked to use per-cpu lock instead of a per-node one.
See the patch for more details.
I would also appreciate any close review of patch 08/27. It
touches a lot of code, and I'm a bit worried that a buglet or
two could have snuck in there.
Note that there are a few stragglers left using IS_RDONLY()
in reiser4 and in some other low-level filesystem code. The
reiser4 asserts should be just fine left the way they are,
plus reiser4 needs quite a bit of work before merging anyway.
The ntfs usage appears to be internal, and not related to
user activity.
This patch survives running ltp as well as a home-grown set
of filesystem operations here:
http://sr71.net/~dave/linux/robind-test.sh
---
Why do we need r/o bind mounts?
This feature allows a read-only view into a read-write filesystem.
In the process of doing that, it also provides infrastructure for
keeping track of the number of writers to any given mount.
This has a number of uses. It allows chroots to have parts of
filesystems writable. It will be useful for containers in the future
because users may have root inside a container, but should not
be allowed to write to somefilesystems. This also replaces
patches that vserver has had out of the tree for several years.
It allows security enhancement by making sure that parts of
your filesystem read-only (such as when you don't trust your
FTP server), when you don't want to have entire new filesystems
mounted, or when you want atime selectively updated.
I've been using the following script to test that the feature is
working as desired. It takes a directory and makes a regular
bind and a r/o bind mount of it. It then performs some normal
filesystem operations on the three directories, including ones
that are expected to fail, like creating a file on the r/o
mount.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 01/26] document nlink function
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:36 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 02/26] ext3: remove extra IS_RDONLY() check Dave Hansen
` (25 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
These should have been documented from the beginning. Fix it.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/include/linux/fs.h | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff -puN include/linux/fs.h~document-nlink-funcs include/linux/fs.h
--- lxc/include/linux/fs.h~document-nlink-funcs 2007-06-21 23:23:10.000000000 -0700
+++ lxc-dave/include/linux/fs.h 2007-06-21 23:23:10.000000000 -0700
@@ -1198,6 +1198,14 @@ static inline void mark_inode_dirty_sync
__mark_inode_dirty(inode, I_DIRTY_SYNC);
}
+/**
+ * inc_nlink - directly increment an inode's link count
+ * @inode: inode
+ *
+ * This is a low-level filesystem helper to replace any
+ * direct filesystem manipulation of i_nlink. Currently,
+ * it is only here for parity with dec_nlink().
+ */
static inline void inc_nlink(struct inode *inode)
{
inode->i_nlink++;
@@ -1209,11 +1217,30 @@ static inline void inode_inc_link_count(
mark_inode_dirty(inode);
}
+/**
+ * drop_nlink - directly drop an inode's link count
+ * @inode: inode
+ *
+ * This is a low-level filesystem helper to replace any
+ * direct filesystem manipulation of i_nlink. In cases
+ * where we are attempting to track writes to the
+ * filesystem, a decrement to zero means an imminent
+ * write when the file is truncated and actually unlinked
+ * on the filesystem.
+ */
static inline void drop_nlink(struct inode *inode)
{
inode->i_nlink--;
}
+/**
+ * clear_nlink - directly zero an inode's link count
+ * @inode: inode
+ *
+ * This is a low-level filesystem helper to replace any
+ * direct filesystem manipulation of i_nlink. See
+ * drop_nlink() for why we care about i_nlink hitting zero.
+ */
static inline void clear_nlink(struct inode *inode)
{
inode->i_nlink = 0;
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 02/26] ext3: remove extra IS_RDONLY() check
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
2007-06-22 20:03 ` [PATCH 01/26] document nlink function Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:36 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 03/26] ext4: " Dave Hansen
` (24 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
ext3_change_inode_journal_flag() is only called from one
location: ext3_ioctl(EXT3_IOC_SETFLAGS). That ioctl
case already has a IS_RDONLY() call in it so this one
is superfluous.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/ext3/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff -puN fs/ext3/inode.c~ext3-extra-ro-check fs/ext3/inode.c
--- lxc/fs/ext3/inode.c~ext3-extra-ro-check 2007-06-21 23:23:11.000000000 -0700
+++ lxc-dave/fs/ext3/inode.c 2007-06-21 23:23:12.000000000 -0700
@@ -3182,7 +3182,7 @@ int ext3_change_inode_journal_flag(struc
*/
journal = EXT3_JOURNAL(inode);
- if (is_journal_aborted(journal) || IS_RDONLY(inode))
+ if (is_journal_aborted(journal))
return -EROFS;
journal_lock_updates(journal);
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 03/26] ext4: remove extra IS_RDONLY() check
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
2007-06-22 20:03 ` [PATCH 01/26] document nlink function Dave Hansen
2007-06-22 20:03 ` [PATCH 02/26] ext3: remove extra IS_RDONLY() check Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:37 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 04/26] filesystem helpers for custom 'struct file's Dave Hansen
` (23 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
ext4_change_inode_journal_flag() is only called from one
location: ext4_ioctl(EXT3_IOC_SETFLAGS). That ioctl
case already has a IS_RDONLY() call in it so this one
is superfluous.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/ext4/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff -puN fs/ext4/inode.c~ext4-extra-ro-check fs/ext4/inode.c
--- lxc/fs/ext4/inode.c~ext4-extra-ro-check 2007-06-21 23:23:12.000000000 -0700
+++ lxc-dave/fs/ext4/inode.c 2007-06-21 23:23:12.000000000 -0700
@@ -3196,7 +3196,7 @@ int ext4_change_inode_journal_flag(struc
*/
journal = EXT4_JOURNAL(inode);
- if (is_journal_aborted(journal) || IS_RDONLY(inode))
+ if (is_journal_aborted(journal))
return -EROFS;
jbd2_journal_lock_updates(journal);
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 04/26] filesystem helpers for custom 'struct file's
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (2 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 03/26] ext4: " Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:38 ` Christoph Hellwig
2007-06-23 16:52 ` Andrew Morton
2007-06-22 20:03 ` [PATCH 05/26] r/o bind mounts: stub functions Dave Hansen
` (22 subsequent siblings)
26 siblings, 2 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Christoph H. says this stands on its own and can go in before the
rest of the r/o bind mount set.
---
Some filesystems forego the vfs and may_open() and create their
own 'struct file's.
This patch creates a couple of helper functions which can be
used by these filesystems, and will provide a unified place
which the r/o bind mount code may patch.
Also, rename two existing, static-scope init_file() to less
generic names.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/configfs/dir.c | 5 +++--
lxc-dave/fs/file_table.c | 34 ++++++++++++++++++++++++++++++++++
lxc-dave/fs/hugetlbfs/inode.c | 22 +++++++++-------------
lxc-dave/fs/sysfs/dir.c | 4 ++--
lxc-dave/include/linux/file.h | 9 +++++++++
lxc-dave/ipc/shm.c | 13 +++++--------
lxc-dave/mm/shmem.c | 7 ++-----
lxc-dave/mm/tiny-shmem.c | 19 +++++++------------
lxc-dave/net/socket.c | 19 ++++++++++---------
9 files changed, 81 insertions(+), 51 deletions(-)
diff -puN fs/configfs/dir.c~01-24-filesystem-helpers-for-custom-struct-file-s fs/configfs/dir.c
--- lxc/fs/configfs/dir.c~01-24-filesystem-helpers-for-custom-struct-file-s 2007-06-21 23:23:13.000000000 -0700
+++ lxc-dave/fs/configfs/dir.c 2007-06-21 23:23:13.000000000 -0700
@@ -142,7 +142,7 @@ static int init_dir(struct inode * inode
return 0;
}
-static int init_file(struct inode * inode)
+static int configfs_init_file(struct inode * inode)
{
inode->i_size = PAGE_SIZE;
inode->i_fop = &configfs_file_operations;
@@ -283,7 +283,8 @@ static int configfs_attach_attr(struct c
dentry->d_fsdata = configfs_get(sd);
sd->s_dentry = dentry;
- error = configfs_create(dentry, (attr->ca_mode & S_IALLUGO) | S_IFREG, init_file);
+ error = configfs_create(dentry, (attr->ca_mode & S_IALLUGO) | S_IFREG,
+ configfs_init_file);
if (error) {
configfs_put(sd);
return error;
diff -puN fs/file_table.c~01-24-filesystem-helpers-for-custom-struct-file-s fs/file_table.c
--- lxc/fs/file_table.c~01-24-filesystem-helpers-for-custom-struct-file-s 2007-06-21 23:23:13.000000000 -0700
+++ lxc-dave/fs/file_table.c 2007-06-21 23:23:13.000000000 -0700
@@ -139,6 +139,40 @@ fail:
EXPORT_SYMBOL(get_empty_filp);
+struct file *alloc_file(struct vfsmount *mnt, struct dentry *dentry,
+ mode_t mode, const struct file_operations *fop)
+{
+ struct file *file;
+ struct path;
+
+ file = get_empty_filp();
+ if (!file)
+ return NULL;
+
+ init_file(file, mnt, dentry, mode, fop);
+ return file;
+}
+EXPORT_SYMBOL(alloc_file);
+
+/*
+ * Note: This is a crappy interface. It is here to make
+ * merging with the existing users of get_empty_filp()
+ * who have complex failure logic easier. All users
+ * of this should be moving to alloc_file().
+ */
+int init_file(struct file *file, struct vfsmount *mnt, struct dentry *dentry,
+ mode_t mode, const struct file_operations *fop)
+{
+ int error = 0;
+ file->f_path.dentry = dentry;
+ file->f_path.mnt = mntget(mnt);
+ file->f_mapping = dentry->d_inode->i_mapping;
+ file->f_mode = mode;
+ file->f_op = fop;
+ return error;
+}
+EXPORT_SYMBOL(init_file);
+
void fastcall fput(struct file *file)
{
if (atomic_dec_and_test(&file->f_count))
diff -puN fs/hugetlbfs/inode.c~01-24-filesystem-helpers-for-custom-struct-file-s fs/hugetlbfs/inode.c
--- lxc/fs/hugetlbfs/inode.c~01-24-filesystem-helpers-for-custom-struct-file-s 2007-06-21 23:23:13.000000000 -0700
+++ lxc-dave/fs/hugetlbfs/inode.c 2007-06-21 23:23:13.000000000 -0700
@@ -759,16 +759,11 @@ struct file *hugetlb_zero_setup(size_t s
if (!dentry)
goto out_shm_unlock;
- error = -ENFILE;
- file = get_empty_filp();
- if (!file)
- goto out_dentry;
-
error = -ENOSPC;
inode = hugetlbfs_get_inode(root->d_sb, current->fsuid,
current->fsgid, S_IFREG | S_IRWXUGO, 0);
if (!inode)
- goto out_file;
+ goto out_dentry;
error = -ENOMEM;
if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT))
@@ -777,17 +772,18 @@ struct file *hugetlb_zero_setup(size_t s
d_instantiate(dentry, inode);
inode->i_size = size;
inode->i_nlink = 0;
- file->f_path.mnt = mntget(hugetlbfs_vfsmount);
- file->f_path.dentry = dentry;
- file->f_mapping = inode->i_mapping;
- file->f_op = &hugetlbfs_file_operations;
- file->f_mode = FMODE_WRITE | FMODE_READ;
+
+ error = -ENFILE;
+ file = alloc_file(hugetlbfs_vfsmount, dentry,
+ FMODE_WRITE | FMODE_READ,
+ &hugetlbfs_file_operations);
+ if (!file)
+ goto out_inode;
+
return file;
out_inode:
iput(inode);
-out_file:
- put_filp(file);
out_dentry:
dput(dentry);
out_shm_unlock:
diff -puN fs/sysfs/dir.c~01-24-filesystem-helpers-for-custom-struct-file-s fs/sysfs/dir.c
--- lxc/fs/sysfs/dir.c~01-24-filesystem-helpers-for-custom-struct-file-s 2007-06-21 23:23:13.000000000 -0700
+++ lxc-dave/fs/sysfs/dir.c 2007-06-21 23:23:13.000000000 -0700
@@ -134,7 +134,7 @@ static int init_dir(struct inode * inode
return 0;
}
-static int init_file(struct inode * inode)
+static int sysfs_init_file(struct inode * inode)
{
inode->i_size = PAGE_SIZE;
inode->i_fop = &sysfs_file_operations;
@@ -234,7 +234,7 @@ static int sysfs_attach_attr(struct sysf
attr = &bin_attr->attr;
} else {
attr = sd->s_element;
- init = init_file;
+ init = sysfs_init_file;
}
dentry->d_fsdata = sysfs_get(sd);
diff -puN include/linux/file.h~01-24-filesystem-helpers-for-custom-struct-file-s include/linux/file.h
--- lxc/include/linux/file.h~01-24-filesystem-helpers-for-custom-struct-file-s 2007-06-21 23:23:13.000000000 -0700
+++ lxc-dave/include/linux/file.h 2007-06-21 23:23:13.000000000 -0700
@@ -62,6 +62,15 @@ extern struct kmem_cache *filp_cachep;
extern void FASTCALL(__fput(struct file *));
extern void FASTCALL(fput(struct file *));
+struct file_operations;
+struct vfsmount;
+struct dentry;
+extern int init_file(struct file *, struct vfsmount *mnt,
+ struct dentry *dentry, mode_t mode,
+ const struct file_operations *fop);
+extern struct file *alloc_file(struct vfsmount *, struct dentry *dentry,
+ mode_t mode, const struct file_operations *fop);
+
static inline void fput_light(struct file *file, int fput_needed)
{
if (unlikely(fput_needed))
diff -puN ipc/shm.c~01-24-filesystem-helpers-for-custom-struct-file-s ipc/shm.c
--- lxc/ipc/shm.c~01-24-filesystem-helpers-for-custom-struct-file-s 2007-06-21 23:23:13.000000000 -0700
+++ lxc-dave/ipc/shm.c 2007-06-21 23:23:13.000000000 -0700
@@ -899,7 +899,7 @@ long do_shmat(int shmid, char __user *sh
goto out_unlock;
path.dentry = dget(shp->shm_file->f_path.dentry);
- path.mnt = mntget(shp->shm_file->f_path.mnt);
+ path.mnt = shp->shm_file->f_path.mnt;
shp->shm_nattch++;
size = i_size_read(path.dentry->d_inode);
shm_unlock(shp);
@@ -907,18 +907,16 @@ long do_shmat(int shmid, char __user *sh
err = -ENOMEM;
sfd = kzalloc(sizeof(*sfd), GFP_KERNEL);
if (!sfd)
- goto out_put_path;
+ goto out_put_dentry;
err = -ENOMEM;
- file = get_empty_filp();
+
+ file = alloc_file(path.mnt, path.dentry, f_mode, &shm_file_operations);
if (!file)
goto out_free;
- file->f_op = &shm_file_operations;
file->private_data = sfd;
- file->f_path = path;
file->f_mapping = shp->shm_file->f_mapping;
- file->f_mode = f_mode;
sfd->id = shp->id;
sfd->ns = get_ipc_ns(ns);
sfd->file = shp->shm_file;
@@ -969,9 +967,8 @@ out_unlock:
out_free:
kfree(sfd);
-out_put_path:
+out_put_dentry:
dput(path.dentry);
- mntput(path.mnt);
goto out_nattch;
}
diff -puN mm/shmem.c~01-24-filesystem-helpers-for-custom-struct-file-s mm/shmem.c
--- lxc/mm/shmem.c~01-24-filesystem-helpers-for-custom-struct-file-s 2007-06-21 23:23:13.000000000 -0700
+++ lxc-dave/mm/shmem.c 2007-06-21 23:23:13.000000000 -0700
@@ -2567,11 +2567,8 @@ struct file *shmem_file_setup(char *name
d_instantiate(dentry, inode);
inode->i_size = size;
inode->i_nlink = 0; /* It is unlinked */
- file->f_path.mnt = mntget(shm_mnt);
- file->f_path.dentry = dentry;
- file->f_mapping = inode->i_mapping;
- file->f_op = &shmem_file_operations;
- file->f_mode = FMODE_WRITE | FMODE_READ;
+ init_file(file, shm_mnt, dentry, FMODE_WRITE | FMODE_READ,
+ &shmem_file_operations);
return file;
close_file:
diff -puN mm/tiny-shmem.c~01-24-filesystem-helpers-for-custom-struct-file-s mm/tiny-shmem.c
--- lxc/mm/tiny-shmem.c~01-24-filesystem-helpers-for-custom-struct-file-s 2007-06-21 23:23:13.000000000 -0700
+++ lxc-dave/mm/tiny-shmem.c 2007-06-21 23:23:13.000000000 -0700
@@ -66,24 +66,19 @@ struct file *shmem_file_setup(char *name
if (!dentry)
goto put_memory;
- error = -ENFILE;
- file = get_empty_filp();
- if (!file)
- goto put_dentry;
-
error = -ENOSPC;
inode = ramfs_get_inode(root->d_sb, S_IFREG | S_IRWXUGO, 0);
if (!inode)
- goto close_file;
+ goto put_dentry;
d_instantiate(dentry, inode);
- inode->i_nlink = 0; /* It is unlinked */
+ error = -ENFILE;
+ file = alloc_file(shm_mnt, dentry, FMODE_WRITE | FMODE_READ,
+ &ramfs_file_operations);
+ if (!file)
+ goto put_dentry;
- file->f_path.mnt = mntget(shm_mnt);
- file->f_path.dentry = dentry;
- file->f_mapping = inode->i_mapping;
- file->f_op = &ramfs_file_operations;
- file->f_mode = FMODE_WRITE | FMODE_READ;
+ inode->i_nlink = 0; /* It is unlinked */
/* notify everyone as to the change of file size */
error = do_truncate(dentry, size, 0, file);
diff -puN net/socket.c~01-24-filesystem-helpers-for-custom-struct-file-s net/socket.c
--- lxc/net/socket.c~01-24-filesystem-helpers-for-custom-struct-file-s 2007-06-21 23:23:13.000000000 -0700
+++ lxc-dave/net/socket.c 2007-06-21 23:23:13.000000000 -0700
@@ -355,31 +355,32 @@ static int sock_alloc_fd(struct file **f
static int sock_attach_fd(struct socket *sock, struct file *file)
{
+ struct dentry *dentry;
struct qstr this;
char name[32];
this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
this.name = name;
this.hash = 0;
+ struct path;
- file->f_path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this);
- if (unlikely(!file->f_path.dentry))
+ dentry = d_alloc(sock_mnt->mnt_sb->s_root, &this);
+ if (unlikely(!dentry))
return -ENOMEM;
- file->f_path.dentry->d_op = &sockfs_dentry_operations;
+ dentry->d_op = &sockfs_dentry_operations;
/*
* We dont want to push this dentry into global dentry hash table.
* We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
* This permits a working /proc/$pid/fd/XXX on sockets
*/
- file->f_path.dentry->d_flags &= ~DCACHE_UNHASHED;
- d_instantiate(file->f_path.dentry, SOCK_INODE(sock));
- file->f_path.mnt = mntget(sock_mnt);
- file->f_mapping = file->f_path.dentry->d_inode->i_mapping;
-
+ dentry->d_flags &= ~DCACHE_UNHASHED;
+ d_instantiate(dentry, SOCK_INODE(sock));
+ init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,
+ &socket_file_ops);
+ SOCK_INODE(sock)->i_fop = &socket_file_ops;
sock->file = file;
file->f_op = SOCK_INODE(sock)->i_fop = &socket_file_ops;
- file->f_mode = FMODE_READ | FMODE_WRITE;
file->f_flags = O_RDWR;
file->f_pos = 0;
file->private_data = sock;
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 05/26] r/o bind mounts: stub functions
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (3 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 04/26] filesystem helpers for custom 'struct file's Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:39 ` Christoph Hellwig
2007-06-23 16:52 ` Andrew Morton
2007-06-22 20:03 ` [PATCH 06/26] elevate write count open()'d files Dave Hansen
` (21 subsequent siblings)
26 siblings, 2 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
This patch adds two function mnt_want_write() and
mnt_drop_write(). These are used like a lock pair around
and fs operations that might cause a write to the filesystem.
Before these can become useful, we must first cover each
place in the VFS where writes are performed with a
want/drop pair. When that is complete, we can actually
introduce code that will safely check the counts before
allowing r/w<->r/o transitions to occur.
Note that we put the linux/fs.h #include as far down in
mount.h as possible. This is to keep the mount.h->fs.h->
sched.h->mount.h include dependency from biting us.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/namespace.c | 13 +++++++++++++
lxc-dave/include/linux/mount.h | 13 +++++++++++++
2 files changed, 26 insertions(+)
diff -puN fs/namespace.c~03-24-add-vfsmount-writer-count fs/namespace.c
--- lxc/fs/namespace.c~03-24-add-vfsmount-writer-count 2007-06-21 23:23:14.000000000 -0700
+++ lxc-dave/fs/namespace.c 2007-06-21 23:23:14.000000000 -0700
@@ -76,6 +76,19 @@ struct vfsmount *alloc_vfsmnt(const char
return mnt;
}
+int mnt_want_write(struct vfsmount *mnt)
+{
+ if (__mnt_is_readonly(mnt))
+ return -EROFS;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mnt_want_write);
+
+void mnt_drop_write(struct vfsmount *mnt)
+{
+}
+EXPORT_SYMBOL_GPL(mnt_drop_write);
+
int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
{
mnt->mnt_sb = sb;
diff -puN include/linux/mount.h~03-24-add-vfsmount-writer-count include/linux/mount.h
--- lxc/include/linux/mount.h~03-24-add-vfsmount-writer-count 2007-06-21 23:23:14.000000000 -0700
+++ lxc-dave/include/linux/mount.h 2007-06-21 23:23:14.000000000 -0700
@@ -70,6 +70,19 @@ static inline struct vfsmount *mntget(st
return mnt;
}
+#include <linux/fs.h>
+
+/*
+ * This shouldn't be used directly ouside of the VFS,
+ * use mnt_want/drop_write() instead.
+ */
+static inline int __mnt_is_readonly(struct vfsmount *mnt)
+{
+ return (mnt->mnt_sb->s_flags & MS_RDONLY);
+}
+
+extern int mnt_want_write(struct vfsmount *mnt);
+extern void mnt_drop_write(struct vfsmount *mnt);
extern void mntput_no_expire(struct vfsmount *mnt);
extern void mnt_pin(struct vfsmount *mnt);
extern void mnt_unpin(struct vfsmount *mnt);
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 06/26] elevate write count open()'d files
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (4 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 05/26] r/o bind mounts: stub functions Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:40 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 07/26] r/o bind mounts: elevate write count for some ioctls Dave Hansen
` (20 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
This is the first really tricky patch in the series. It
elevates the writer count on a mount each time a
non-special file is opened for write.
This is not completely apparent in the patch because the
two if() conditions in may_open() above the
mnt_want_write() call are, combined, equivalent to
special_file().
There is also an elevated count around the vfs_create()
call in open_namei(). The count needs to be kept elevated
all the way into the may_open() call. Otherwise, when the
write is dropped, a ro->rw transisition could occur. This
would lead to having rw access on the newly created file,
while the vfsmount is ro. That is bad.
Some filesystems forego the use of normal vfs calls to create
struct files. Make sure that these users elevate the mnt writer
count because they will get __fput(), and we need to make
sure they're balanced.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/file_table.c | 9 ++++++++-
lxc-dave/fs/namei.c | 20 ++++++++++++++++----
lxc-dave/ipc/mqueue.c | 3 +++
3 files changed, 27 insertions(+), 5 deletions(-)
diff -puN fs/file_table.c~14-24-tricky-elevate-write-count-files-are-open-ed fs/file_table.c
--- lxc/fs/file_table.c~14-24-tricky-elevate-write-count-files-are-open-ed 2007-06-22 10:00:02.000000000 -0700
+++ lxc-dave/fs/file_table.c 2007-06-22 10:00:02.000000000 -0700
@@ -169,6 +169,10 @@ int init_file(struct file *file, struct
file->f_mapping = dentry->d_inode->i_mapping;
file->f_mode = mode;
file->f_op = fop;
+ if (mode & FMODE_WRITE) {
+ error = mnt_want_write(mnt);
+ WARN_ON(error);
+ }
return error;
}
EXPORT_SYMBOL(init_file);
@@ -206,8 +210,11 @@ void fastcall __fput(struct file *file)
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
cdev_put(inode->i_cdev);
fops_put(file->f_op);
- if (file->f_mode & FMODE_WRITE)
+ if (file->f_mode & FMODE_WRITE) {
put_write_access(inode);
+ if (!special_file(inode->i_mode))
+ mnt_drop_write(mnt);
+ }
put_pid(file->f_owner.pid);
file_kill(file);
file->f_path.dentry = NULL;
diff -puN fs/namei.c~14-24-tricky-elevate-write-count-files-are-open-ed fs/namei.c
--- lxc/fs/namei.c~14-24-tricky-elevate-write-count-files-are-open-ed 2007-06-22 10:00:02.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-06-22 10:00:02.000000000 -0700
@@ -1544,8 +1544,15 @@ int may_open(struct nameidata *nd, int a
return -EACCES;
flag &= ~O_TRUNC;
- } else if (IS_RDONLY(inode) && (flag & FMODE_WRITE))
- return -EROFS;
+ } else if (flag & FMODE_WRITE) {
+ /*
+ * effectively: !special_file()
+ * balanced by __fput()
+ */
+ error = mnt_want_write(nd->mnt);
+ if (error)
+ return error;
+ }
/*
* An append-only file must be opened in append mode for writing.
*/
@@ -1684,14 +1691,17 @@ do_last:
}
if (IS_ERR(nd->intent.open.file)) {
- mutex_unlock(&dir->d_inode->i_mutex);
error = PTR_ERR(nd->intent.open.file);
- goto exit_dput;
+ goto exit_mutex_unlock;
}
/* Negative dentry, just create the file */
if (!path.dentry->d_inode) {
+ error = mnt_want_write(nd->mnt);
+ if (error)
+ goto exit_mutex_unlock;
error = open_namei_create(nd, &path, flag, mode);
+ mnt_drop_write(nd->mnt);
if (error)
goto exit;
return 0;
@@ -1729,6 +1739,8 @@ ok:
goto exit;
return 0;
+exit_mutex_unlock:
+ mutex_unlock(&dir->d_inode->i_mutex);
exit_dput:
dput_path(&path, nd);
exit:
diff -puN ipc/mqueue.c~14-24-tricky-elevate-write-count-files-are-open-ed ipc/mqueue.c
--- lxc/ipc/mqueue.c~14-24-tricky-elevate-write-count-files-are-open-ed 2007-06-22 10:00:02.000000000 -0700
+++ lxc-dave/ipc/mqueue.c 2007-06-22 10:00:02.000000000 -0700
@@ -687,6 +687,9 @@ asmlinkage long sys_mq_open(const char _
goto out;
filp = do_open(dentry, oflag);
} else {
+ error = mnt_want_write(mqueue_mnt);
+ if (error)
+ goto out;
filp = do_create(mqueue_mnt->mnt_root, dentry,
oflag, mode, u_attr);
}
diff -puN fs/ext2/ioctl.c~14-24-tricky-elevate-write-count-files-are-open-ed fs/ext2/ioctl.c
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 07/26] r/o bind mounts: elevate write count for some ioctls
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (5 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 06/26] elevate write count open()'d files Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:42 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 08/26] elevate writer count for chown and friends Dave Hansen
` (19 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Some ioctl()s can cause writes to the filesystem. Take
these, and make them use mnt_want/drop_write() instead.
We need to pass the filp one layer deeper in XFS, but
somebody _just_ pulled it out in February because nobody
was using it, so I don't feel guilty for adding it back.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/ext2/ioctl.c | 46 +++++++++-----
lxc-dave/fs/ext3/ioctl.c | 100 +++++++++++++++++++++-----------
lxc-dave/fs/ext4/ioctl.c | 105 +++++++++++++++++++++-------------
lxc-dave/fs/fat/file.c | 10 +--
lxc-dave/fs/hfsplus/ioctl.c | 39 +++++++-----
lxc-dave/fs/jfs/ioctl.c | 33 ++++++----
lxc-dave/fs/ocfs2/ioctl.c | 11 +--
lxc-dave/fs/reiserfs/ioctl.c | 53 +++++++++++------
lxc-dave/fs/xfs/linux-2.6/xfs_ioctl.c | 15 +++-
lxc-dave/fs/xfs/linux-2.6/xfs_iops.c | 7 --
lxc-dave/fs/xfs/linux-2.6/xfs_lrw.c | 9 ++
11 files changed, 272 insertions(+), 156 deletions(-)
diff -puN fs/ext2/ioctl.c~05-24-ioctl-mnt-takers fs/ext2/ioctl.c
--- lxc/fs/ext2/ioctl.c~05-24-ioctl-mnt-takers 2007-06-22 10:05:25.000000000 -0700
+++ lxc-dave/fs/ext2/ioctl.c 2007-06-22 10:58:25.000000000 -0700
@@ -12,6 +12,7 @@
#include <linux/time.h>
#include <linux/sched.h>
#include <linux/compat.h>
+#include <linux/mount.h>
#include <linux/smp_lock.h>
#include <asm/current.h>
#include <asm/uaccess.h>
@@ -22,6 +23,7 @@ int ext2_ioctl (struct inode * inode, st
{
struct ext2_inode_info *ei = EXT2_I(inode);
unsigned int flags;
+ int ret;
ext2_debug ("cmd = %u, arg = %lu\n", cmd, arg);
@@ -32,14 +34,19 @@ int ext2_ioctl (struct inode * inode, st
case EXT2_IOC_SETFLAGS: {
unsigned int oldflags;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
- return -EACCES;
+ ret = mnt_want_write(filp->f_vfsmnt);
+ if (ret)
+ return ret;
+
+ if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER)) {
+ ret = -EACCES;
+ goto setflags_out;
+ }
- if (get_user(flags, (int __user *) arg))
- return -EFAULT;
+ if (get_user(flags, (int __user *) arg)) {
+ ret = -EFAULT;
+ goto setflags_out;
+ }
if (!S_ISDIR(inode->i_mode))
flags &= ~EXT2_DIRSYNC_FL;
@@ -56,7 +63,8 @@ int ext2_ioctl (struct inode * inode, st
if ((flags ^ oldflags) & (EXT2_APPEND_FL | EXT2_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ ret = -EPERM;
+ goto setflags_out;
}
}
@@ -68,20 +76,26 @@ int ext2_ioctl (struct inode * inode, st
ext2_set_inode_flags(inode);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- return 0;
+setflags_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return ret;
}
case EXT2_IOC_GETVERSION:
return put_user(inode->i_generation, (int __user *) arg);
case EXT2_IOC_SETVERSION:
if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
- if (get_user(inode->i_generation, (int __user *) arg))
- return -EFAULT;
- inode->i_ctime = CURRENT_TIME_SEC;
- mark_inode_dirty(inode);
- return 0;
+ ret = mnt_want_write(filp->f_vfsmnt);
+ if (ret)
+ return ret;
+ if (get_user(inode->i_generation, (int __user *) arg)) {
+ ret = -EFAULT;
+ } else {
+ inode->i_ctime = CURRENT_TIME_SEC;
+ mark_inode_dirty(inode);
+ }
+ mnt_drop_write(filp->f_vfsmnt);
+ return ret;
default:
return -ENOTTY;
}
diff -puN fs/ext3/ioctl.c~05-24-ioctl-mnt-takers fs/ext3/ioctl.c
--- lxc/fs/ext3/ioctl.c~05-24-ioctl-mnt-takers 2007-06-22 10:05:25.000000000 -0700
+++ lxc-dave/fs/ext3/ioctl.c 2007-06-22 10:59:02.000000000 -0700
@@ -12,6 +12,7 @@
#include <linux/capability.h>
#include <linux/ext3_fs.h>
#include <linux/ext3_jbd.h>
+#include <linux/mount.h>
#include <linux/time.h>
#include <linux/compat.h>
#include <linux/smp_lock.h>
@@ -37,14 +38,19 @@ int ext3_ioctl (struct inode * inode, st
unsigned int oldflags;
unsigned int jflag;
- if (IS_RDONLY(inode))
- return -EROFS;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
- if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
- return -EACCES;
+ if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER)) {
+ err = -EACCES;
+ goto flags_out;
+ }
- if (get_user(flags, (int __user *) arg))
- return -EFAULT;
+ if (get_user(flags, (int __user *) arg)) {
+ err = -EFAULT;
+ goto flags_out;
+ }
if (!S_ISDIR(inode->i_mode))
flags &= ~EXT3_DIRSYNC_FL;
@@ -64,7 +70,8 @@ int ext3_ioctl (struct inode * inode, st
if ((flags ^ oldflags) & (EXT3_APPEND_FL | EXT3_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ err = -EPERM;
+ goto flags_out;
}
}
@@ -75,7 +82,8 @@ int ext3_ioctl (struct inode * inode, st
if ((jflag ^ oldflags) & (EXT3_JOURNAL_DATA_FL)) {
if (!capable(CAP_SYS_RESOURCE)) {
mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ err = -EPERM;
+ goto flags_out;
}
}
@@ -83,7 +91,8 @@ int ext3_ioctl (struct inode * inode, st
handle = ext3_journal_start(inode, 1);
if (IS_ERR(handle)) {
mutex_unlock(&inode->i_mutex);
- return PTR_ERR(handle);
+ err = PTR_ERR(handle);
+ goto flags_out;
}
if (IS_SYNC(inode))
handle->h_sync = 1;
@@ -109,6 +118,8 @@ flags_err:
if ((jflag ^ oldflags) & (EXT3_JOURNAL_DATA_FL))
err = ext3_change_inode_journal_flag(inode, jflag);
mutex_unlock(&inode->i_mutex);
+flags_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
case EXT3_IOC_GETVERSION:
@@ -123,14 +134,18 @@ flags_err:
if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
- if (get_user(generation, (int __user *) arg))
- return -EFAULT;
-
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+ if (get_user(generation, (int __user *) arg)) {
+ err = -EFAULT;
+ goto setversion_out;
+ }
handle = ext3_journal_start(inode, 1);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
+ if (IS_ERR(handle)) {
+ err = PTR_ERR(handle);
+ goto setversion_out;
+ }
err = ext3_reserve_inode_write(handle, inode, &iloc);
if (err == 0) {
inode->i_ctime = CURRENT_TIME_SEC;
@@ -138,6 +153,8 @@ flags_err:
err = ext3_mark_iloc_dirty(handle, inode, &iloc);
}
ext3_journal_stop(handle);
+setversion_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
#ifdef CONFIG_JBD_DEBUG
@@ -173,18 +190,24 @@ flags_err:
}
return -ENOTTY;
case EXT3_IOC_SETRSVSZ: {
+ int err;
if (!test_opt(inode->i_sb, RESERVATION) ||!S_ISREG(inode->i_mode))
return -ENOTTY;
- if (IS_RDONLY(inode))
- return -EROFS;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
- if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
- return -EACCES;
+ if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER)) {
+ err = -EACCES;
+ goto setrsvsz_out;
+ }
- if (get_user(rsv_window_size, (int __user *)arg))
- return -EFAULT;
+ if (get_user(rsv_window_size, (int __user *)arg)) {
+ err = -EFAULT;
+ goto setrsvsz_out;
+ }
if (rsv_window_size > EXT3_MAX_RESERVE_BLOCKS)
rsv_window_size = EXT3_MAX_RESERVE_BLOCKS;
@@ -202,7 +225,9 @@ flags_err:
rsv->rsv_goal_size = rsv_window_size;
}
mutex_unlock(&ei->truncate_mutex);
- return 0;
+setrsvsz_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return err;
}
case EXT3_IOC_GROUP_EXTEND: {
ext3_fsblk_t n_blocks_count;
@@ -212,17 +237,20 @@ flags_err:
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if (get_user(n_blocks_count, (__u32 __user *)arg))
- return -EFAULT;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+ if (get_user(n_blocks_count, (__u32 __user *)arg)) {
+ err = -EFAULT;
+ goto group_extend_out;
+ }
err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
journal_lock_updates(EXT3_SB(sb)->s_journal);
journal_flush(EXT3_SB(sb)->s_journal);
journal_unlock_updates(EXT3_SB(sb)->s_journal);
-
+group_extend_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
case EXT3_IOC_GROUP_ADD: {
@@ -233,18 +261,22 @@ flags_err:
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
if (copy_from_user(&input, (struct ext3_new_group_input __user *)arg,
- sizeof(input)))
- return -EFAULT;
+ sizeof(input))) {
+ err = -EFAULT;
+ goto group_add_out;
+ }
err = ext3_group_add(sb, &input);
journal_lock_updates(EXT3_SB(sb)->s_journal);
journal_flush(EXT3_SB(sb)->s_journal);
journal_unlock_updates(EXT3_SB(sb)->s_journal);
-
+group_add_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
diff -puN fs/ext4/ioctl.c~05-24-ioctl-mnt-takers fs/ext4/ioctl.c
--- lxc/fs/ext4/ioctl.c~05-24-ioctl-mnt-takers 2007-06-22 10:05:25.000000000 -0700
+++ lxc-dave/fs/ext4/ioctl.c 2007-06-22 11:00:03.000000000 -0700
@@ -12,6 +12,7 @@
#include <linux/capability.h>
#include <linux/ext4_fs.h>
#include <linux/ext4_jbd2.h>
+#include <linux/mount.h>
#include <linux/time.h>
#include <linux/compat.h>
#include <linux/smp_lock.h>
@@ -37,15 +38,19 @@ int ext4_ioctl (struct inode * inode, st
unsigned int oldflags;
unsigned int jflag;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
- return -EACCES;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
- if (get_user(flags, (int __user *) arg))
- return -EFAULT;
+ if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER)) {
+ err = -EACCES;
+ goto flags_out;
+ }
+ if (get_user(flags, (int __user *) arg)) {
+ err = -EFAULT;
+ goto flags_out;
+ }
if (!S_ISDIR(inode->i_mode))
flags &= ~EXT4_DIRSYNC_FL;
@@ -64,7 +69,8 @@ int ext4_ioctl (struct inode * inode, st
if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ err = -EPERM;
+ goto flags_out;
}
}
@@ -75,7 +81,8 @@ int ext4_ioctl (struct inode * inode, st
if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL)) {
if (!capable(CAP_SYS_RESOURCE)) {
mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ err = -EPERM;
+ goto flags_out;
}
}
@@ -83,7 +90,8 @@ int ext4_ioctl (struct inode * inode, st
handle = ext4_journal_start(inode, 1);
if (IS_ERR(handle)) {
mutex_unlock(&inode->i_mutex);
- return PTR_ERR(handle);
+ err = PTR_ERR(handle);
+ goto flags_out;
}
if (IS_SYNC(inode))
handle->h_sync = 1;
@@ -103,12 +111,14 @@ flags_err:
ext4_journal_stop(handle);
if (err) {
mutex_unlock(&inode->i_mutex);
- return err;
+ goto flags_out;
}
if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL))
err = ext4_change_inode_journal_flag(inode, jflag);
mutex_unlock(&inode->i_mutex);
+flags_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
case EXT4_IOC_GETVERSION:
@@ -123,14 +133,18 @@ flags_err:
if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
- if (get_user(generation, (int __user *) arg))
- return -EFAULT;
-
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+ if (get_user(generation, (int __user *) arg)) {
+ err = -EFAULT;
+ goto setversion_out;
+ }
handle = ext4_journal_start(inode, 1);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
+ if (IS_ERR(handle)) {
+ err = PTR_ERR(handle);
+ goto setversion_out;
+ }
err = ext4_reserve_inode_write(handle, inode, &iloc);
if (err == 0) {
inode->i_ctime = CURRENT_TIME_SEC;
@@ -138,6 +152,8 @@ flags_err:
err = ext4_mark_iloc_dirty(handle, inode, &iloc);
}
ext4_journal_stop(handle);
+setversion_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
#ifdef CONFIG_JBD_DEBUG
@@ -173,19 +189,23 @@ flags_err:
}
return -ENOTTY;
case EXT4_IOC_SETRSVSZ: {
+ int err;
if (!test_opt(inode->i_sb, RESERVATION) ||!S_ISREG(inode->i_mode))
return -ENOTTY;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
- return -EACCES;
-
- if (get_user(rsv_window_size, (int __user *)arg))
- return -EFAULT;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+ if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER)) {
+ err = -EACCES;
+ goto setrsvsz_out;
+ }
+ if (get_user(rsv_window_size, (int __user *)arg)) {
+ err = -EFAULT;
+ goto setrsvsz_out;
+ }
if (rsv_window_size > EXT4_MAX_RESERVE_BLOCKS)
rsv_window_size = EXT4_MAX_RESERVE_BLOCKS;
@@ -202,7 +222,9 @@ flags_err:
rsv->rsv_goal_size = rsv_window_size;
}
mutex_unlock(&ei->truncate_mutex);
- return 0;
+setrsvsz_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return err;
}
case EXT4_IOC_GROUP_EXTEND: {
ext4_fsblk_t n_blocks_count;
@@ -212,17 +234,21 @@ flags_err:
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if (get_user(n_blocks_count, (__u32 __user *)arg))
- return -EFAULT;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+ if (get_user(n_blocks_count, (__u32 __user *)arg)) {
+ err = -EFAULT;
+ goto group_extend_out;
+ }
err = ext4_group_extend(sb, EXT4_SB(sb)->s_es, n_blocks_count);
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
jbd2_journal_flush(EXT4_SB(sb)->s_journal);
jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+group_extend_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
case EXT4_IOC_GROUP_ADD: {
@@ -233,18 +259,21 @@ flags_err:
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
if (copy_from_user(&input, (struct ext4_new_group_input __user *)arg,
- sizeof(input)))
- return -EFAULT;
-
+ sizeof(input))) {
+ err = -EFAULT;
+ goto group_add_out;
+ }
err = ext4_group_add(sb, &input);
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
jbd2_journal_flush(EXT4_SB(sb)->s_journal);
jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
-
+group_add_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
diff -puN fs/fat/file.c~05-24-ioctl-mnt-takers fs/fat/file.c
--- lxc/fs/fat/file.c~05-24-ioctl-mnt-takers 2007-06-22 10:05:25.000000000 -0700
+++ lxc-dave/fs/fat/file.c 2007-06-22 10:05:25.000000000 -0700
@@ -8,6 +8,7 @@
#include <linux/capability.h>
#include <linux/module.h>
+#include <linux/mount.h>
#include <linux/time.h>
#include <linux/msdos_fs.h>
#include <linux/smp_lock.h>
@@ -46,10 +47,9 @@ int fat_generic_ioctl(struct inode *inod
mutex_lock(&inode->i_mutex);
- if (IS_RDONLY(inode)) {
- err = -EROFS;
- goto up;
- }
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ goto up_no_drop_write;
/*
* ATTR_VOLUME and ATTR_DIR cannot be changed; this also
@@ -106,6 +106,8 @@ int fat_generic_ioctl(struct inode *inod
MSDOS_I(inode)->i_attrs = attr & ATTR_UNUSED;
mark_inode_dirty(inode);
up:
+ mnt_drop_write(filp->f_vfsmnt);
+ up_no_drop_write:
mutex_unlock(&inode->i_mutex);
return err;
}
diff -puN fs/hfsplus/ioctl.c~05-24-ioctl-mnt-takers fs/hfsplus/ioctl.c
--- lxc/fs/hfsplus/ioctl.c~05-24-ioctl-mnt-takers 2007-06-22 10:05:25.000000000 -0700
+++ lxc-dave/fs/hfsplus/ioctl.c 2007-06-22 11:02:23.000000000 -0700
@@ -35,25 +35,32 @@ int hfsplus_ioctl(struct inode *inode, s
flags |= FS_NODUMP_FL; /* EXT2_NODUMP_FL */
return put_user(flags, (int __user *)arg);
case HFSPLUS_IOC_EXT2_SETFLAGS: {
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
- return -EACCES;
-
- if (get_user(flags, (int __user *)arg))
- return -EFAULT;
-
+ int err = 0;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+
+ if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER)) {
+ err = -EACCES;
+ goto setflags_out;
+ }
+ if (get_user(flags, (int __user *)arg)) {
+ err = -EFAULT;
+ goto setflags_out;
+ }
if (flags & (FS_IMMUTABLE_FL|FS_APPEND_FL) ||
HFSPLUS_I(inode).rootflags & (HFSPLUS_FLG_IMMUTABLE|HFSPLUS_FLG_APPEND)) {
- if (!capable(CAP_LINUX_IMMUTABLE))
- return -EPERM;
+ if (!capable(CAP_LINUX_IMMUTABLE)) {
+ err = -EPERM;
+ goto setflags_out;
+ }
}
/* don't silently ignore unsupported ext2 flags */
- if (flags & ~(FS_IMMUTABLE_FL|FS_APPEND_FL|FS_NODUMP_FL))
- return -EOPNOTSUPP;
-
+ if (flags & ~(FS_IMMUTABLE_FL|FS_APPEND_FL|FS_NODUMP_FL)) {
+ err = -EOPNOTSUPP;
+ goto setflags_out;
+ }
if (flags & FS_IMMUTABLE_FL) { /* EXT2_IMMUTABLE_FL */
inode->i_flags |= S_IMMUTABLE;
HFSPLUS_I(inode).rootflags |= HFSPLUS_FLG_IMMUTABLE;
@@ -75,7 +82,9 @@ int hfsplus_ioctl(struct inode *inode, s
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- return 0;
+setflags_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return err;
}
default:
return -ENOTTY;
diff -puN fs/jfs/ioctl.c~05-24-ioctl-mnt-takers fs/jfs/ioctl.c
--- lxc/fs/jfs/ioctl.c~05-24-ioctl-mnt-takers 2007-06-22 10:05:25.000000000 -0700
+++ lxc-dave/fs/jfs/ioctl.c 2007-06-22 11:02:32.000000000 -0700
@@ -8,6 +8,7 @@
#include <linux/fs.h>
#include <linux/ctype.h>
#include <linux/capability.h>
+#include <linux/mount.h>
#include <linux/time.h>
#include <linux/sched.h>
#include <asm/current.h>
@@ -64,16 +65,20 @@ int jfs_ioctl(struct inode * inode, stru
return put_user(flags, (int __user *) arg);
case JFS_IOC_SETFLAGS: {
unsigned int oldflags;
+ int err;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
- return -EACCES;
-
- if (get_user(flags, (int __user *) arg))
- return -EFAULT;
-
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+
+ if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER)) {
+ err = -EACCES;
+ goto setflags_out;
+ }
+ if (get_user(flags, (int __user *) arg)) {
+ err = -EFAULT;
+ goto setflags_out;
+ }
flags = jfs_map_ext2(flags, 1);
if (!S_ISDIR(inode->i_mode))
flags &= ~JFS_DIRSYNC_FL;
@@ -87,8 +92,10 @@ int jfs_ioctl(struct inode * inode, stru
if ((oldflags & JFS_IMMUTABLE_FL) ||
((flags ^ oldflags) &
(JFS_APPEND_FL | JFS_IMMUTABLE_FL))) {
- if (!capable(CAP_LINUX_IMMUTABLE))
- return -EPERM;
+ if (!capable(CAP_LINUX_IMMUTABLE)) {
+ err = -EPERM;
+ goto setflags_out;
+ }
}
flags = flags & JFS_FL_USER_MODIFIABLE;
@@ -98,7 +105,9 @@ int jfs_ioctl(struct inode * inode, stru
jfs_set_inode_flags(inode);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- return 0;
+setflags_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return err;
}
default:
return -ENOTTY;
diff -puN fs/ocfs2/ioctl.c~05-24-ioctl-mnt-takers fs/ocfs2/ioctl.c
--- lxc/fs/ocfs2/ioctl.c~05-24-ioctl-mnt-takers 2007-06-22 10:05:25.000000000 -0700
+++ lxc-dave/fs/ocfs2/ioctl.c 2007-06-22 10:05:25.000000000 -0700
@@ -56,10 +56,6 @@ static int ocfs2_set_inode_attr(struct i
goto bail;
}
- status = -EROFS;
- if (IS_RDONLY(inode))
- goto bail_unlock;
-
status = -EACCES;
if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
goto bail_unlock;
@@ -127,8 +123,13 @@ int ocfs2_ioctl(struct inode * inode, st
if (get_user(flags, (int __user *) arg))
return -EFAULT;
- return ocfs2_set_inode_attr(inode, flags,
+ status = mnt_want_write(filp->f_vfsmnt);
+ if (status)
+ return status;
+ status = ocfs2_set_inode_attr(inode, flags,
OCFS2_FL_MODIFIABLE);
+ mnt_drop_write(filp->f_vfsmnt);
+ return status;
default:
return -ENOTTY;
}
diff -puN fs/reiserfs/ioctl.c~05-24-ioctl-mnt-takers fs/reiserfs/ioctl.c
--- lxc/fs/reiserfs/ioctl.c~05-24-ioctl-mnt-takers 2007-06-22 10:05:25.000000000 -0700
+++ lxc-dave/fs/reiserfs/ioctl.c 2007-06-22 11:04:42.000000000 -0700
@@ -4,6 +4,7 @@
#include <linux/capability.h>
#include <linux/fs.h>
+#include <linux/mount.h>
#include <linux/reiserfs_fs.h>
#include <linux/time.h>
#include <asm/uaccess.h>
@@ -25,6 +26,7 @@ int reiserfs_ioctl(struct inode *inode,
unsigned long arg)
{
unsigned int flags;
+ int err = 0;
switch (cmd) {
case REISERFS_IOC_UNPACK:
@@ -48,48 +50,61 @@ int reiserfs_ioctl(struct inode *inode,
if (!reiserfs_attrs(inode->i_sb))
return -ENOTTY;
- if (IS_RDONLY(inode))
- return -EROFS;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
if ((current->fsuid != inode->i_uid)
- && !capable(CAP_FOWNER))
- return -EPERM;
-
- if (get_user(flags, (int __user *)arg))
- return -EFAULT;
-
+ && !capable(CAP_FOWNER)) {
+ err = -EPERM;
+ goto setflags_out;
+ }
+ if (get_user(flags, (int __user *)arg)) {
+ err = -EFAULT;
+ goto setflags_out;
+ }
if (((flags ^ REISERFS_I(inode)->
i_attrs) & (REISERFS_IMMUTABLE_FL |
REISERFS_APPEND_FL))
- && !capable(CAP_LINUX_IMMUTABLE))
- return -EPERM;
-
+ && !capable(CAP_LINUX_IMMUTABLE)) {
+ err = -EPERM;
+ goto setflags_out;
+ }
if ((flags & REISERFS_NOTAIL_FL) &&
S_ISREG(inode->i_mode)) {
int result;
result = reiserfs_unpack(inode, filp);
- if (result)
- return result;
+ if (result) {
+ err = result;
+ goto setflags_out;
+ }
}
sd_attrs_to_i_attrs(flags, inode);
REISERFS_I(inode)->i_attrs = flags;
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- return 0;
+setflags_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return err;
}
case REISERFS_IOC_GETVERSION:
return put_user(inode->i_generation, (int __user *)arg);
case REISERFS_IOC_SETVERSION:
if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
- if (get_user(inode->i_generation, (int __user *)arg))
- return -EFAULT;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+ if (get_user(inode->i_generation, (int __user *)arg)) {
+ err = -EFAULT;
+ goto setversion_out;
+ }
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- return 0;
+setversion_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return err;
default:
return -ENOTTY;
}
diff -puN fs/xfs/linux-2.6/xfs_ioctl.c~05-24-ioctl-mnt-takers fs/xfs/linux-2.6/xfs_ioctl.c
--- lxc/fs/xfs/linux-2.6/xfs_ioctl.c~05-24-ioctl-mnt-takers 2007-06-22 10:05:25.000000000 -0700
+++ lxc-dave/fs/xfs/linux-2.6/xfs_ioctl.c 2007-06-22 10:05:25.000000000 -0700
@@ -526,8 +526,6 @@ xfs_attrmulti_attr_set(
char *kbuf;
int error = EFAULT;
- if (IS_RDONLY(&vp->v_inode))
- return -EROFS;
if (IS_IMMUTABLE(&vp->v_inode) || IS_APPEND(&vp->v_inode))
return EPERM;
if (len > XATTR_SIZE_MAX)
@@ -553,8 +551,6 @@ xfs_attrmulti_attr_remove(
char *name,
__uint32_t flags)
{
- if (IS_RDONLY(&vp->v_inode))
- return -EROFS;
if (IS_IMMUTABLE(&vp->v_inode) || IS_APPEND(&vp->v_inode))
return EPERM;
return bhv_vop_attr_remove(vp, name, flags, NULL);
@@ -564,6 +560,7 @@ STATIC int
xfs_attrmulti_by_handle(
xfs_mount_t *mp,
void __user *arg,
+ struct file *parfilp,
struct inode *parinode)
{
int error;
@@ -618,13 +615,21 @@ xfs_attrmulti_by_handle(
&ops[i].am_length, ops[i].am_flags);
break;
case ATTR_OP_SET:
+ ops[i].am_error = mnt_want_write(parfilp->f_vfsmnt);
+ if (ops[i].am_error)
+ break;
ops[i].am_error = xfs_attrmulti_attr_set(vp,
attr_name, ops[i].am_attrvalue,
ops[i].am_length, ops[i].am_flags);
+ mnt_drop_write(parfilp->f_vfsmnt);
break;
case ATTR_OP_REMOVE:
+ ops[i].am_error = mnt_want_write(parfilp->f_vfsmnt);
+ if (ops[i].am_error)
+ break;
ops[i].am_error = xfs_attrmulti_attr_remove(vp,
attr_name, ops[i].am_flags);
+ mnt_drop_write(parfilp->f_vfsmnt);
break;
default:
ops[i].am_error = EINVAL;
@@ -804,7 +809,7 @@ xfs_ioctl(
return xfs_attrlist_by_handle(mp, arg, inode);
case XFS_IOC_ATTRMULTI_BY_HANDLE:
- return xfs_attrmulti_by_handle(mp, arg, inode);
+ return xfs_attrmulti_by_handle(mp, arg, filp, inode);
case XFS_IOC_SWAPEXT: {
error = xfs_swapext((struct xfs_swapext __user *)arg);
diff -puN fs/xfs/linux-2.6/xfs_iops.c~05-24-ioctl-mnt-takers fs/xfs/linux-2.6/xfs_iops.c
--- lxc/fs/xfs/linux-2.6/xfs_iops.c~05-24-ioctl-mnt-takers 2007-06-22 10:05:25.000000000 -0700
+++ lxc-dave/fs/xfs/linux-2.6/xfs_iops.c 2007-06-22 10:05:25.000000000 -0700
@@ -156,13 +156,6 @@ xfs_ichgtime_fast(
*/
ASSERT((flags & XFS_ICHGTIME_ACC) == 0);
- /*
- * We're not supposed to change timestamps in readonly-mounted
- * filesystems. Throw it away if anyone asks us.
- */
- if (unlikely(IS_RDONLY(inode)))
- return;
-
if (flags & XFS_ICHGTIME_MOD) {
tvp = &inode->i_mtime;
ip->i_d.di_mtime.t_sec = (__int32_t)tvp->tv_sec;
diff -puN fs/xfs/linux-2.6/xfs_lrw.c~05-24-ioctl-mnt-takers fs/xfs/linux-2.6/xfs_lrw.c
--- lxc/fs/xfs/linux-2.6/xfs_lrw.c~05-24-ioctl-mnt-takers 2007-06-22 10:05:25.000000000 -0700
+++ lxc-dave/fs/xfs/linux-2.6/xfs_lrw.c 2007-06-22 10:05:25.000000000 -0700
@@ -50,6 +50,7 @@
#include "xfs_iomap.h"
#include <linux/capability.h>
+#include <linux/mount.h>
#include <linux/writeback.h>
@@ -762,10 +763,16 @@ start:
}
}
- if (likely(!(ioflags & IO_INVIS))) {
+ /*
+ * We're not supposed to change timestamps in readonly-mounted
+ * filesystems. Throw it away if anyone asks us.
+ */
+ if (likely(!(ioflags & IO_INVIS) &&
+ !mnt_want_write(file->f_vfsmnt))) {
file_update_time(file);
xfs_ichgtime_fast(xip, inode,
XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+ mnt_drop_write(file->f_vfsmnt);
}
/*
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 08/26] elevate writer count for chown and friends
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (6 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 07/26] r/o bind mounts: elevate write count for some ioctls Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:43 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 09/26] make access() use mnt check Dave Hansen
` (18 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
chown/chmod,etc... don't call permission in the same way
that the normal "open for write" calls do. They still
write to the filesystem, so bump the write count during
these operations.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/open.c | 39 ++++++++++++++++++++++++++++++---------
1 file changed, 30 insertions(+), 9 deletions(-)
diff -puN fs/open.c~06-24-elevate-writer-count-for-chown-and-friends fs/open.c
--- lxc/fs/open.c~06-24-elevate-writer-count-for-chown-and-friends 2007-06-21 23:23:16.000000000 -0700
+++ lxc-dave/fs/open.c 2007-06-21 23:23:16.000000000 -0700
@@ -508,12 +508,12 @@ asmlinkage long sys_fchmod(unsigned int
audit_inode(NULL, inode);
- err = -EROFS;
- if (IS_RDONLY(inode))
+ err = mnt_want_write(file->f_vfsmnt);
+ if (err)
goto out_putf;
err = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto out_putf;
+ goto out_drop_write;
mutex_lock(&inode->i_mutex);
if (mode == (mode_t) -1)
mode = inode->i_mode;
@@ -522,6 +522,8 @@ asmlinkage long sys_fchmod(unsigned int
err = notify_change(dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
+out_drop_write:
+ mnt_drop_write(file->f_vfsmnt);
out_putf:
fput(file);
out:
@@ -541,13 +543,13 @@ asmlinkage long sys_fchmodat(int dfd, co
goto out;
inode = nd.dentry->d_inode;
- error = -EROFS;
- if (IS_RDONLY(inode))
+ error = mnt_want_write(nd.mnt);
+ if (error)
goto dput_and_out;
error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto dput_and_out;
+ goto out_drop_write;
mutex_lock(&inode->i_mutex);
if (mode == (mode_t) -1)
@@ -557,6 +559,8 @@ asmlinkage long sys_fchmodat(int dfd, co
error = notify_change(nd.dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
+out_drop_write:
+ mnt_drop_write(nd.mnt);
dput_and_out:
path_release(&nd);
out:
@@ -579,9 +583,6 @@ static int chown_common(struct dentry *
printk(KERN_ERR "chown_common: NULL inode\n");
goto out;
}
- error = -EROFS;
- if (IS_RDONLY(inode))
- goto out;
error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
goto out;
@@ -611,7 +612,12 @@ asmlinkage long sys_chown(const char __u
error = user_path_walk(filename, &nd);
if (error)
goto out;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_release;
error = chown_common(nd.dentry, user, group);
+ mnt_drop_write(nd.mnt);
+out_release:
path_release(&nd);
out:
return error;
@@ -631,7 +637,12 @@ asmlinkage long sys_fchownat(int dfd, co
error = __user_walk_fd(dfd, filename, follow, &nd);
if (error)
goto out;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_release;
error = chown_common(nd.dentry, user, group);
+ mnt_drop_write(nd.mnt);
+out_release:
path_release(&nd);
out:
return error;
@@ -645,7 +656,12 @@ asmlinkage long sys_lchown(const char __
error = user_path_walk_link(filename, &nd);
if (error)
goto out;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_release;
error = chown_common(nd.dentry, user, group);
+ mnt_drop_write(nd.mnt);
+out_release:
path_release(&nd);
out:
return error;
@@ -662,9 +678,14 @@ asmlinkage long sys_fchown(unsigned int
if (!file)
goto out;
+ error = mnt_want_write(file->f_vfsmnt);
+ if (error)
+ goto out_fput;
dentry = file->f_path.dentry;
audit_inode(NULL, dentry->d_inode);
error = chown_common(dentry, user, group);
+ mnt_drop_write(file->f_vfsmnt);
+out_fput:
fput(file);
out:
return error;
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 09/26] make access() use mnt check
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (7 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 08/26] elevate writer count for chown and friends Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:45 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 10/26] elevate mnt writers for callers of vfs_mkdir() Dave Hansen
` (17 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
It is OK to let access() go without using a mnt_want/drop_write()
pair because it doesn't actually do writes to the filesystem,
and it is inherently racy anyway. This is a rare case when it is
OK to use __mnt_is_readonly() directly.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/open.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff -puN fs/open.c~make-access-use-helper fs/open.c
--- lxc/fs/open.c~make-access-use-helper 2007-06-21 23:23:17.000000000 -0700
+++ lxc-dave/fs/open.c 2007-06-21 23:23:17.000000000 -0700
@@ -395,7 +395,7 @@ asmlinkage long sys_faccessat(int dfd, c
special_file(nd.dentry->d_inode->i_mode))
goto out_path_release;
- if(IS_RDONLY(nd.dentry->d_inode))
+ if (__mnt_is_readonly(nd.mnt))
res = -EROFS;
out_path_release:
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 10/26] elevate mnt writers for callers of vfs_mkdir()
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (8 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 09/26] make access() use mnt check Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:45 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 11/26] elevate write count during entire ncp_ioctl() Dave Hansen
` (16 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Pretty self-explanatory. Fits in with the rest of the series.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/namei.c | 5 +++++
lxc-dave/fs/nfsd/nfs4recover.c | 4 ++++
2 files changed, 9 insertions(+)
diff -puN fs/namei.c~07-24-elevate-mnt-writers-for-callers-of-vfs-mkdir fs/namei.c
--- lxc/fs/namei.c~07-24-elevate-mnt-writers-for-callers-of-vfs-mkdir 2007-06-21 23:23:17.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-06-21 23:23:17.000000000 -0700
@@ -1971,7 +1971,12 @@ asmlinkage long sys_mkdirat(int dfd, con
if (!IS_POSIXACL(nd.dentry->d_inode))
mode &= ~current->fs->umask;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_mkdir(nd.dentry->d_inode, dentry, mode);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
diff -puN fs/nfsd/nfs4recover.c~07-24-elevate-mnt-writers-for-callers-of-vfs-mkdir fs/nfsd/nfs4recover.c
--- lxc/fs/nfsd/nfs4recover.c~07-24-elevate-mnt-writers-for-callers-of-vfs-mkdir 2007-06-21 23:23:17.000000000 -0700
+++ lxc-dave/fs/nfsd/nfs4recover.c 2007-06-21 23:23:17.000000000 -0700
@@ -156,7 +156,11 @@ nfsd4_create_clid_dir(struct nfs4_client
dprintk("NFSD: nfsd4_create_clid_dir: DIRECTORY EXISTS\n");
goto out_put;
}
+ status = mnt_want_write(rec_dir.mnt);
+ if (status)
+ goto out_put;
status = vfs_mkdir(rec_dir.dentry->d_inode, dentry, S_IRWXU);
+ mnt_drop_write(rec_dir.mnt);
out_put:
dput(dentry);
out_unlock:
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 11/26] elevate write count during entire ncp_ioctl()
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (9 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 10/26] elevate mnt writers for callers of vfs_mkdir() Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-22 20:03 ` [PATCH 12/26] elevate write count for link and symlink calls Dave Hansen
` (15 subsequent siblings)
26 siblings, 0 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Some ioctls need write access, but others don't. Make a helper
function to decide when write access is needed, and take it.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/ncpfs/ioctl.c | 55 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 54 insertions(+), 1 deletion(-)
diff -puN fs/ncpfs/ioctl.c~08-24-elevate-write-count-during-entire-ncp-ioctl fs/ncpfs/ioctl.c
--- lxc/fs/ncpfs/ioctl.c~08-24-elevate-write-count-during-entire-ncp-ioctl 2007-06-21 23:23:18.000000000 -0700
+++ lxc-dave/fs/ncpfs/ioctl.c 2007-06-21 23:23:18.000000000 -0700
@@ -14,6 +14,7 @@
#include <linux/ioctl.h>
#include <linux/time.h>
#include <linux/mm.h>
+#include <linux/mount.h>
#include <linux/highuid.h>
#include <linux/smp_lock.h>
#include <linux/vmalloc.h>
@@ -260,7 +261,7 @@ ncp_get_charsets(struct ncp_server* serv
}
#endif /* CONFIG_NCPFS_NLS */
-int ncp_ioctl(struct inode *inode, struct file *filp,
+static int __ncp_ioctl(struct inode *inode, struct file *filp,
unsigned int cmd, unsigned long arg)
{
struct ncp_server *server = NCP_SERVER(inode);
@@ -821,6 +822,58 @@ outrel:
return -EINVAL;
}
+static int ncp_ioctl_need_write(unsigned int cmd)
+{
+ switch (cmd) {
+ case NCP_IOC_GET_FS_INFO:
+ case NCP_IOC_GET_FS_INFO_V2:
+ case NCP_IOC_NCPREQUEST:
+ case NCP_IOC_SETDENTRYTTL:
+ case NCP_IOC_SIGN_INIT:
+ case NCP_IOC_LOCKUNLOCK:
+ case NCP_IOC_SET_SIGN_WANTED:
+ return 1;
+ case NCP_IOC_GETOBJECTNAME:
+ case NCP_IOC_SETOBJECTNAME:
+ case NCP_IOC_GETPRIVATEDATA:
+ case NCP_IOC_SETPRIVATEDATA:
+ case NCP_IOC_SETCHARSETS:
+ case NCP_IOC_GETCHARSETS:
+ case NCP_IOC_CONN_LOGGED_IN:
+ case NCP_IOC_GETDENTRYTTL:
+ case NCP_IOC_GETMOUNTUID2:
+ case NCP_IOC_SIGN_WANTED:
+ case NCP_IOC_GETROOT:
+ case NCP_IOC_SETROOT:
+ return 0;
+ default:
+ /* unkown IOCTL command, assume write */
+ WARN_ON(1);
+ }
+ return 1;
+}
+
+int ncp_ioctl(struct inode *inode, struct file *filp,
+ unsigned int cmd, unsigned long arg)
+{
+ int ret;
+
+ if (ncp_ioctl_need_write(cmd)) {
+ /*
+ * inside the ioctl(), any failures which
+ * are because of file_permission() are
+ * -EACCESS, so it seems consistent to keep
+ * that here.
+ */
+ if (mnt_want_write(filp->f_vfsmnt))
+ return -EACCES;
+ }
+ ret = __ncp_ioctl(inode, filp, cmd, arg);
+ if (ncp_ioctl_need_write(cmd))
+ mnt_drop_write(filp->f_vfsmnt);
+ return ret;
+}
+
#ifdef CONFIG_COMPAT
long ncp_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 12/26] elevate write count for link and symlink calls
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (10 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 11/26] elevate write count during entire ncp_ioctl() Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-22 20:03 ` [PATCH 13/26] elevate mount count for extended attributes Dave Hansen
` (14 subsequent siblings)
26 siblings, 0 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/namei.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff -puN fs/namei.c~09-24-elevate-write-count-for-link-and-symlink-calls fs/namei.c
--- lxc/fs/namei.c~09-24-elevate-write-count-for-link-and-symlink-calls 2007-06-21 23:23:19.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-06-21 23:23:19.000000000 -0700
@@ -2244,7 +2244,12 @@ asmlinkage long sys_symlinkat(const char
if (IS_ERR(dentry))
goto out_unlock;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_symlink(nd.dentry->d_inode, dentry, from, S_IALLUGO);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
@@ -2339,7 +2344,12 @@ asmlinkage long sys_linkat(int olddfd, c
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto out_unlock;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_link(old_nd.dentry, nd.dentry->d_inode, new_dentry);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(new_dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 13/26] elevate mount count for extended attributes
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (11 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 12/26] elevate write count for link and symlink calls Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-22 20:03 ` [PATCH 14/26] elevate write count for file_update_time() Dave Hansen
` (13 subsequent siblings)
26 siblings, 0 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
This basically audits the callers of xattr_permission(), which
calls permission() and can perform writes to the filesystem.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/nfsd/nfs4proc.c | 7 ++++++-
lxc-dave/fs/xattr.c | 16 ++++++++++++++--
2 files changed, 20 insertions(+), 3 deletions(-)
diff -puN fs/nfsd/nfs4proc.c~10-24-elevate-mount-count-for-extended-attributes fs/nfsd/nfs4proc.c
--- lxc/fs/nfsd/nfs4proc.c~10-24-elevate-mount-count-for-extended-attributes 2007-06-21 23:23:19.000000000 -0700
+++ lxc-dave/fs/nfsd/nfs4proc.c 2007-06-21 23:23:19.000000000 -0700
@@ -626,14 +626,19 @@ nfsd4_setattr(struct svc_rqst *rqstp, st
return status;
}
}
+ status = mnt_want_write(cstate->current_fh.fh_export->ex_mnt);
+ if (status)
+ return status;
status = nfs_ok;
if (setattr->sa_acl != NULL)
status = nfsd4_set_nfs4_acl(rqstp, &cstate->current_fh,
setattr->sa_acl);
if (status)
- return status;
+ goto out;
status = nfsd_setattr(rqstp, &cstate->current_fh, &setattr->sa_iattr,
0, (time_t)0);
+out:
+ mnt_drop_write(cstate->current_fh.fh_export->ex_mnt);
return status;
}
diff -puN fs/xattr.c~10-24-elevate-mount-count-for-extended-attributes fs/xattr.c
--- lxc/fs/xattr.c~10-24-elevate-mount-count-for-extended-attributes 2007-06-21 23:23:19.000000000 -0700
+++ lxc-dave/fs/xattr.c 2007-06-21 23:23:19.000000000 -0700
@@ -12,6 +12,7 @@
#include <linux/smp_lock.h>
#include <linux/file.h>
#include <linux/xattr.h>
+#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/security.h>
#include <linux/syscalls.h>
@@ -33,8 +34,6 @@ xattr_permission(struct inode *inode, co
* filesystem or on an immutable / append-only inode.
*/
if (mask & MAY_WRITE) {
- if (IS_RDONLY(inode))
- return -EROFS;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
return -EPERM;
}
@@ -237,7 +236,11 @@ sys_setxattr(char __user *path, char __u
error = user_path_walk(path, &nd);
if (error)
return error;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ return error;
error = setxattr(nd.dentry, name, value, size, flags);
+ mnt_drop_write(nd.mnt);
path_release(&nd);
return error;
}
@@ -252,7 +255,11 @@ sys_lsetxattr(char __user *path, char __
error = user_path_walk_link(path, &nd);
if (error)
return error;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ return error;
error = setxattr(nd.dentry, name, value, size, flags);
+ mnt_drop_write(nd.mnt);
path_release(&nd);
return error;
}
@@ -268,9 +275,14 @@ sys_fsetxattr(int fd, char __user *name,
f = fget(fd);
if (!f)
return error;
+ error = mnt_want_write(f->f_vfsmnt);
+ if (error)
+ goto out_fput;
dentry = f->f_path.dentry;
audit_inode(NULL, dentry->d_inode);
error = setxattr(dentry, name, value, size, flags);
+ mnt_drop_write(f->f_vfsmnt);
+out_fput:
fput(f);
return error;
}
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 14/26] elevate write count for file_update_time()
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (12 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 13/26] elevate mount count for extended attributes Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:46 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 15/26] mount_is_safe(): add comment Dave Hansen
` (12 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/inode.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff -puN fs/inode.c~elevate-write-count-for-file_update_time fs/inode.c
--- lxc/fs/inode.c~elevate-write-count-for-file_update_time 2007-06-21 23:23:20.000000000 -0700
+++ lxc-dave/fs/inode.c 2007-06-21 23:23:20.000000000 -0700
@@ -1217,10 +1217,13 @@ void file_update_time(struct file *file)
struct inode *inode = file->f_path.dentry->d_inode;
struct timespec now;
int sync_it = 0;
+ int err = 0;
if (IS_NOCMTIME(inode))
return;
- if (IS_RDONLY(inode))
+ if (file->f_vfsmnt)
+ err = mnt_want_write(file->f_vfsmnt);
+ if (err)
return;
now = current_fs_time(inode->i_sb);
@@ -1236,6 +1239,8 @@ void file_update_time(struct file *file)
if (sync_it)
mark_inode_dirty_sync(inode);
+ if (file->f_vfsmnt)
+ mnt_drop_write(file->f_vfsmnt);
}
EXPORT_SYMBOL(file_update_time);
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 15/26] mount_is_safe(): add comment
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (13 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 14/26] elevate write count for file_update_time() Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:47 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 16/26] unix_find_other() elevate write count for touch_atime() Dave Hansen
` (11 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
This area of code is currently #ifdef'd out, so add a comment
for the time when it is actually used.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/namespace.c | 4 ++++
1 file changed, 4 insertions(+)
diff -puN fs/namespace.c~11-24-mount-is-safe-add-comment fs/namespace.c
--- lxc/fs/namespace.c~11-24-mount-is-safe-add-comment 2007-06-21 23:23:20.000000000 -0700
+++ lxc-dave/fs/namespace.c 2007-06-21 23:23:20.000000000 -0700
@@ -691,6 +691,10 @@ static int mount_is_safe(struct nameidat
if (current->uid != nd->dentry->d_inode->i_uid)
return -EPERM;
}
+ /*
+ * We will eventually check for the mnt->writer_count here,
+ * but since the code is not used now, skip it - Dave Hansen
+ */
if (vfs_permission(nd, MAY_WRITE))
return -EPERM;
return 0;
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 16/26] unix_find_other() elevate write count for touch_atime()
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (14 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 15/26] mount_is_safe(): add comment Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:47 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 17/26] elevate write count over calls to vfs_rename() Dave Hansen
` (10 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/net/unix/af_unix.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff -puN net/unix/af_unix.c~12-24-unix-find-other-elevate-write-count-for-touch-atime net/unix/af_unix.c
--- lxc/net/unix/af_unix.c~12-24-unix-find-other-elevate-write-count-for-touch-atime 2007-06-21 23:23:21.000000000 -0700
+++ lxc-dave/net/unix/af_unix.c 2007-06-21 23:23:21.000000000 -0700
@@ -703,21 +703,27 @@ static struct sock *unix_find_other(stru
err = path_lookup(sunname->sun_path, LOOKUP_FOLLOW, &nd);
if (err)
goto fail;
+
+ err = mnt_want_write(nd.mnt);
+ if (err)
+ goto put_path_fail;
+
err = vfs_permission(&nd, MAY_WRITE);
if (err)
- goto put_fail;
+ goto mnt_drop_write_fail;
err = -ECONNREFUSED;
if (!S_ISSOCK(nd.dentry->d_inode->i_mode))
- goto put_fail;
+ goto mnt_drop_write_fail;
u=unix_find_socket_byinode(nd.dentry->d_inode);
if (!u)
- goto put_fail;
+ goto mnt_drop_write_fail;
if (u->sk_type == type)
touch_atime(nd.mnt, nd.dentry);
path_release(&nd);
+ mnt_drop_write(nd.mnt);
err=-EPROTOTYPE;
if (u->sk_type != type) {
@@ -737,7 +743,9 @@ static struct sock *unix_find_other(stru
}
return u;
-put_fail:
+mnt_drop_write_fail:
+ mnt_drop_write(nd.mnt);
+put_path_fail:
path_release(&nd);
fail:
*error=err;
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 17/26] elevate write count over calls to vfs_rename()
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (15 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 16/26] unix_find_other() elevate write count for touch_atime() Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:49 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 18/26] nfs: check mnt instead of superblock directly Dave Hansen
` (9 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
This also creates a little helper in the NFS code to
make an if() a little bit less ugly.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/namei.c | 4 ++++
lxc-dave/fs/nfsd/vfs.c | 23 +++++++++++++++++++----
2 files changed, 23 insertions(+), 4 deletions(-)
diff -puN fs/namei.c~13-24-elevate-write-count-over-calls-to-vfs-rename fs/namei.c
--- lxc/fs/namei.c~13-24-elevate-write-count-over-calls-to-vfs-rename 2007-06-21 23:23:22.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-06-21 23:23:22.000000000 -0700
@@ -2575,8 +2575,12 @@ static int do_rename(int olddfd, const c
if (new_dentry == trap)
goto exit5;
+ error = mnt_want_write(oldnd.mnt);
+ if (error)
+ goto exit5;
error = vfs_rename(old_dir->d_inode, old_dentry,
new_dir->d_inode, new_dentry);
+ mnt_drop_write(oldnd.mnt);
exit5:
dput(new_dentry);
exit4:
diff -puN fs/nfsd/vfs.c~13-24-elevate-write-count-over-calls-to-vfs-rename fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~13-24-elevate-write-count-over-calls-to-vfs-rename 2007-06-21 23:23:22.000000000 -0700
+++ lxc-dave/fs/nfsd/vfs.c 2007-06-21 23:23:22.000000000 -0700
@@ -1554,6 +1554,14 @@ out_nfserr:
goto out_unlock;
}
+static inline int svc_msnfs(struct svc_fh *ffhp)
+{
+#ifdef MSNFS
+ return (ffhp->fh_export->ex_flags & NFSEXP_MSNFS);
+#else
+ return 0;
+#endif
+}
/*
* Rename a file
* N.B. After this call _both_ ffhp and tfhp need an fh_put
@@ -1615,13 +1623,20 @@ nfsd_rename(struct svc_rqst *rqstp, stru
if (ndentry == trap)
goto out_dput_new;
-#ifdef MSNFS
- if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
+ if (svc_msnfs(ffhp) &&
((atomic_read(&odentry->d_count) > 1)
|| (atomic_read(&ndentry->d_count) > 1))) {
host_err = -EPERM;
- } else
-#endif
+ goto out_dput_new;
+ }
+
+ host_err = -EXDEV;
+ if (ffhp->fh_export->ex_mnt != tfhp->fh_export->ex_mnt)
+ goto out_dput_new;
+ host_err = mnt_want_write(ffhp->fh_export->ex_mnt);
+ if (host_err)
+ goto out_dput_new;
+
host_err = vfs_rename(fdir, odentry, tdir, ndentry);
if (!host_err && EX_ISSYNC(tfhp->fh_export)) {
host_err = nfsd_sync_dir(tdentry);
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 18/26] nfs: check mnt instead of superblock directly
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (16 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 17/26] elevate write count over calls to vfs_rename() Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:49 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 19/26] elevate writer count for do_sys_truncate() Dave Hansen
` (8 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/nfs/dir.c | 3 ++-
lxc-dave/fs/nfsd/vfs.c | 4 ++--
2 files changed, 4 insertions(+), 3 deletions(-)
diff -puN fs/nfs/dir.c~nfs-check-mnt-instead-of-sb fs/nfs/dir.c
--- lxc/fs/nfs/dir.c~nfs-check-mnt-instead-of-sb 2007-06-21 23:23:22.000000000 -0700
+++ lxc-dave/fs/nfs/dir.c 2007-06-21 23:23:22.000000000 -0700
@@ -981,7 +981,8 @@ static int is_atomic_open(struct inode *
if (nd->flags & LOOKUP_DIRECTORY)
return 0;
/* Are we trying to write to a read only partition? */
- if (IS_RDONLY(dir) && (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE)))
+ if (__mnt_is_readonly(nd->mnt) &&
+ (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE)))
return 0;
return 1;
}
diff -puN fs/nfsd/vfs.c~nfs-check-mnt-instead-of-sb fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~nfs-check-mnt-instead-of-sb 2007-06-21 23:23:22.000000000 -0700
+++ lxc-dave/fs/nfsd/vfs.c 2007-06-21 23:23:22.000000000 -0700
@@ -1811,7 +1811,7 @@ nfsd_permission(struct svc_export *exp,
inode->i_mode,
IS_IMMUTABLE(inode)? " immut" : "",
IS_APPEND(inode)? " append" : "",
- IS_RDONLY(inode)? " ro" : "");
+ __mnt_is_readonly(exp->mnt)? " ro" : "");
dprintk(" owner %d/%d user %d/%d\n",
inode->i_uid, inode->i_gid, current->fsuid, current->fsgid);
#endif
@@ -1822,7 +1822,7 @@ nfsd_permission(struct svc_export *exp,
*/
if (!(acc & MAY_LOCAL_ACCESS))
if (acc & (MAY_WRITE | MAY_SATTR | MAY_TRUNC)) {
- if (EX_RDONLY(exp) || IS_RDONLY(inode))
+ if (EX_RDONLY(exp) || __mnt_is_readonly(exp->mnt))
return nfserr_rofs;
if (/* (acc & MAY_WRITE) && */ IS_IMMUTABLE(inode))
return nfserr_perm;
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 19/26] elevate writer count for do_sys_truncate()
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (17 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 18/26] nfs: check mnt instead of superblock directly Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:49 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 20/26] elevate write count for do_utimes() Dave Hansen
` (7 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/open.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff -puN fs/open.c~15-24-elevate-writer-count-for-do-sys-truncate fs/open.c
--- lxc/fs/open.c~15-24-elevate-writer-count-for-do-sys-truncate 2007-06-21 23:23:23.000000000 -0700
+++ lxc-dave/fs/open.c 2007-06-21 23:23:23.000000000 -0700
@@ -241,28 +241,28 @@ static long do_sys_truncate(const char _
if (!S_ISREG(inode->i_mode))
goto dput_and_out;
- error = vfs_permission(&nd, MAY_WRITE);
+ error = mnt_want_write(nd.mnt);
if (error)
goto dput_and_out;
- error = -EROFS;
- if (IS_RDONLY(inode))
- goto dput_and_out;
+ error = vfs_permission(&nd, MAY_WRITE);
+ if (error)
+ goto mnt_drop_write_and_out;
error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
/*
* Make sure that there are no leases.
*/
error = break_lease(inode, FMODE_WRITE);
if (error)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
error = get_write_access(inode);
if (error)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
error = locks_verify_truncate(inode, NULL, length);
if (!error) {
@@ -271,6 +271,8 @@ static long do_sys_truncate(const char _
}
put_write_access(inode);
+mnt_drop_write_and_out:
+ mnt_drop_write(nd.mnt);
dput_and_out:
path_release(&nd);
out:
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 20/26] elevate write count for do_utimes()
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (18 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 19/26] elevate writer count for do_sys_truncate() Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:49 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 21/26] elevate write count for do_sys_utime() and touch_atime() Dave Hansen
` (6 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/utimes.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)
diff -puN fs/utimes.c~16-24-elevate-write-count-for-do-utimes fs/utimes.c
--- lxc/fs/utimes.c~16-24-elevate-write-count-for-do-utimes 2007-06-21 23:23:24.000000000 -0700
+++ lxc-dave/fs/utimes.c 2007-06-21 23:23:24.000000000 -0700
@@ -1,6 +1,7 @@
#include <linux/compiler.h>
#include <linux/fs.h>
#include <linux/linkage.h>
+#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/sched.h>
#include <linux/utime.h>
@@ -32,8 +33,8 @@ asmlinkage long sys_utime(char __user *
goto out;
inode = nd.dentry->d_inode;
- error = -EROFS;
- if (IS_RDONLY(inode))
+ error = mnt_want_write(nd.mnt);
+ if (error)
goto dput_and_out;
/* Don't worry, the checks are done in inode_change_ok() */
@@ -41,7 +42,7 @@ asmlinkage long sys_utime(char __user *
if (times) {
error = -EPERM;
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
error = get_user(newattrs.ia_atime.tv_sec, ×->actime);
newattrs.ia_atime.tv_nsec = 0;
@@ -49,21 +50,23 @@ asmlinkage long sys_utime(char __user *
error = get_user(newattrs.ia_mtime.tv_sec, ×->modtime);
newattrs.ia_mtime.tv_nsec = 0;
if (error)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
newattrs.ia_valid |= ATTR_ATIME_SET | ATTR_MTIME_SET;
} else {
error = -EACCES;
if (IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
if (current->fsuid != inode->i_uid &&
(error = vfs_permission(&nd, MAY_WRITE)) != 0)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
}
mutex_lock(&inode->i_mutex);
error = notify_change(nd.dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
+mnt_drop_write_and_out:
+ mnt_drop_write(nd.mnt);
dput_and_out:
path_release(&nd);
out:
@@ -89,8 +92,8 @@ long do_utimes(int dfd, char __user *fil
goto out;
inode = nd.dentry->d_inode;
- error = -EROFS;
- if (IS_RDONLY(inode))
+ error = mnt_want_write(nd.mnt);
+ if (error)
goto dput_and_out;
/* Don't worry, the checks are done in inode_change_ok() */
@@ -98,7 +101,7 @@ long do_utimes(int dfd, char __user *fil
if (times) {
error = -EPERM;
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
newattrs.ia_atime.tv_sec = times[0].tv_sec;
newattrs.ia_atime.tv_nsec = times[0].tv_usec * 1000;
@@ -108,15 +111,17 @@ long do_utimes(int dfd, char __user *fil
} else {
error = -EACCES;
if (IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
if (current->fsuid != inode->i_uid &&
(error = vfs_permission(&nd, MAY_WRITE)) != 0)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
}
mutex_lock(&inode->i_mutex);
error = notify_change(nd.dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
+mnt_drop_write_and_out:
+ mnt_drop_write(nd.mnt);
dput_and_out:
path_release(&nd);
out:
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 21/26] elevate write count for do_sys_utime() and touch_atime()
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (19 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 20/26] elevate write count for do_utimes() Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:50 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create() Dave Hansen
` (5 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/inode.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff -puN fs/inode.c~17-24-elevate-write-count-for-do-sys-utime-and-touch-atime fs/inode.c
--- lxc/fs/inode.c~17-24-elevate-write-count-for-do-sys-utime-and-touch-atime 2007-06-21 23:23:24.000000000 -0700
+++ lxc-dave/fs/inode.c 2007-06-21 23:23:24.000000000 -0700
@@ -1161,22 +1161,23 @@ void touch_atime(struct vfsmount *mnt, s
struct inode *inode = dentry->d_inode;
struct timespec now;
- if (inode->i_flags & S_NOATIME)
+ if (mnt && mnt_want_write(mnt))
return;
+ if (inode->i_flags & S_NOATIME)
+ goto out;
if (IS_NOATIME(inode))
- return;
+ goto out;
if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
- return;
+ goto out;
/*
* We may have a NULL vfsmount when coming from NFSD
*/
if (mnt) {
if (mnt->mnt_flags & MNT_NOATIME)
- return;
+ goto out;
if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
- return;
-
+ goto out;
if (mnt->mnt_flags & MNT_RELATIME) {
/*
* With relative atime, only update atime if the
@@ -1187,16 +1188,19 @@ void touch_atime(struct vfsmount *mnt, s
&inode->i_atime) < 0 &&
timespec_compare(&inode->i_ctime,
&inode->i_atime) < 0)
- return;
+ goto out;
}
}
now = current_fs_time(inode->i_sb);
if (timespec_equal(&inode->i_atime, &now))
- return;
+ goto out;
inode->i_atime = now;
mark_inode_dirty_sync(inode);
+out:
+ if (mnt)
+ mnt_drop_write(mnt);
}
EXPORT_SYMBOL(touch_atime);
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (20 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 21/26] elevate write count for do_sys_utime() and touch_atime() Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:51 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 23/26] elevate mnt writers for vfs_unlink() callers Dave Hansen
` (4 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
This takes care of all of the direct callers of vfs_mknod().
Since a few of these cases also handle normal file creation
as well, this also covers some calls to vfs_create().
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/namei.c | 12 ++++++++++++
lxc-dave/fs/nfsd/vfs.c | 8 ++++++--
lxc-dave/net/unix/af_unix.c | 4 ++++
3 files changed, 22 insertions(+), 2 deletions(-)
diff -puN fs/namei.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create fs/namei.c
--- lxc/fs/namei.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-06-21 23:23:25.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-06-21 23:23:25.000000000 -0700
@@ -1897,14 +1897,26 @@ asmlinkage long sys_mknodat(int dfd, con
if (!IS_ERR(dentry)) {
switch (mode & S_IFMT) {
case 0: case S_IFREG:
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ break;
error = vfs_create(nd.dentry->d_inode,dentry,mode,&nd);
+ mnt_drop_write(nd.mnt);
break;
case S_IFCHR: case S_IFBLK:
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ break;
error = vfs_mknod(nd.dentry->d_inode,dentry,mode,
new_decode_dev(dev));
+ mnt_drop_write(nd.mnt);
break;
case S_IFIFO: case S_IFSOCK:
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ break;
error = vfs_mknod(nd.dentry->d_inode,dentry,mode,0);
+ mnt_drop_write(nd.mnt);
break;
case S_IFDIR:
error = -EPERM;
diff -puN fs/nfsd/vfs.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-06-21 23:23:25.000000000 -0700
+++ lxc-dave/fs/nfsd/vfs.c 2007-06-21 23:23:25.000000000 -0700
@@ -1192,7 +1192,11 @@ nfsd_create(struct svc_rqst *rqstp, stru
case S_IFBLK:
case S_IFIFO:
case S_IFSOCK:
+ host_err = mnt_want_write(fhp->fh_export->ex_mnt);
+ if (host_err)
+ break;
host_err = vfs_mknod(dirp, dchild, iap->ia_mode, rdev);
+ mnt_drop_write(fhp->fh_export->ex_mnt);
break;
default:
printk("nfsd: bad file type %o in nfsd_create\n", type);
@@ -1811,7 +1815,7 @@ nfsd_permission(struct svc_export *exp,
inode->i_mode,
IS_IMMUTABLE(inode)? " immut" : "",
IS_APPEND(inode)? " append" : "",
- __mnt_is_readonly(exp->mnt)? " ro" : "");
+ __mnt_is_readonly(exp->ex_mnt)? " ro" : "");
dprintk(" owner %d/%d user %d/%d\n",
inode->i_uid, inode->i_gid, current->fsuid, current->fsgid);
#endif
@@ -1822,7 +1826,7 @@ nfsd_permission(struct svc_export *exp,
*/
if (!(acc & MAY_LOCAL_ACCESS))
if (acc & (MAY_WRITE | MAY_SATTR | MAY_TRUNC)) {
- if (EX_RDONLY(exp) || __mnt_is_readonly(exp->mnt))
+ if (EX_RDONLY(exp) || __mnt_is_readonly(exp->ex_mnt))
return nfserr_rofs;
if (/* (acc & MAY_WRITE) && */ IS_IMMUTABLE(inode))
return nfserr_perm;
diff -puN net/unix/af_unix.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create net/unix/af_unix.c
--- lxc/net/unix/af_unix.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-06-21 23:23:25.000000000 -0700
+++ lxc-dave/net/unix/af_unix.c 2007-06-21 23:23:25.000000000 -0700
@@ -816,7 +816,11 @@ static int unix_bind(struct socket *sock
*/
mode = S_IFSOCK |
(SOCK_INODE(sock)->i_mode & ~current->fs->umask);
+ err = mnt_want_write(nd.mnt);
+ if (err)
+ goto out_mknod_dput;
err = vfs_mknod(nd.dentry->d_inode, dentry, mode, 0);
+ mnt_drop_write(nd.mnt);
if (err)
goto out_mknod_dput;
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 23/26] elevate mnt writers for vfs_unlink() callers
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (21 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create() Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:51 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 24/26] do_rmdir(): elevate write count Dave Hansen
` (3 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/namei.c | 4 ++++
lxc-dave/ipc/mqueue.c | 5 ++++-
2 files changed, 8 insertions(+), 1 deletion(-)
diff -puN fs/namei.c~19-24-elevate-mnt-writers-for-vfs-unlink-callers fs/namei.c
--- lxc/fs/namei.c~19-24-elevate-mnt-writers-for-vfs-unlink-callers 2007-06-21 23:23:25.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-06-21 23:23:25.000000000 -0700
@@ -2175,7 +2175,11 @@ static long do_unlinkat(int dfd, const c
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto exit2;
error = vfs_unlink(nd.dentry->d_inode, dentry);
+ mnt_drop_write(nd.mnt);
exit2:
dput(dentry);
}
diff -puN ipc/mqueue.c~19-24-elevate-mnt-writers-for-vfs-unlink-callers ipc/mqueue.c
--- lxc/ipc/mqueue.c~19-24-elevate-mnt-writers-for-vfs-unlink-callers 2007-06-21 23:23:25.000000000 -0700
+++ lxc-dave/ipc/mqueue.c 2007-06-21 23:23:25.000000000 -0700
@@ -750,8 +750,11 @@ asmlinkage long sys_mq_unlink(const char
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
-
+ err = mnt_want_write(mqueue_mnt);
+ if (err)
+ goto out_err;
err = vfs_unlink(dentry->d_parent->d_inode, dentry);
+ mnt_drop_write(mqueue_mnt);
out_err:
dput(dentry);
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 24/26] do_rmdir(): elevate write count
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (22 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 23/26] elevate mnt writers for vfs_unlink() callers Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:51 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 25/26] r/o bind mounts: scalable writer count Dave Hansen
` (2 subsequent siblings)
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Elevate the write count during the vfs_rmdir() call.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/namei.c | 5 +++++
1 file changed, 5 insertions(+)
diff -puN fs/namei.c~20-24-do-rmdir-elevate-write-count fs/namei.c
--- lxc/fs/namei.c~20-24-do-rmdir-elevate-write-count 2007-06-21 23:23:26.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-06-21 23:23:26.000000000 -0700
@@ -2095,7 +2095,12 @@ static long do_rmdir(int dfd, const char
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit2;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto exit3;
error = vfs_rmdir(nd.dentry->d_inode, dentry);
+ mnt_drop_write(nd.mnt);
+exit3:
dput(dentry);
exit2:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 25/26] r/o bind mounts: scalable writer count
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (23 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 24/26] do_rmdir(): elevate write count Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 11:28 ` Miklos Szeredi
2007-06-23 16:52 ` Andrew Morton
2007-06-22 20:03 ` [PATCH 26/26] honor r/w changes at do_remount() time Dave Hansen
2007-06-23 16:52 ` [PATCH 00/26] Mount writer count and read-only bind mounts Andrew Morton
26 siblings, 2 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
This uses a statically-allocated percpu variable. All
operations are local to a cpu as long that cpu operates on
the same mount, and there are no writer count imbalances.
Writer count imbalances happen when a write is taken on one
cpu, and released on another, like when an open/close pair
is performed on two different cpus because the task moved.
I've written a little benchmark to sit in a loop for a couple
of seconds in several cpus in parallel doing open/write/close
loops.
http://sr71.net/~dave/linux/openbench.c
The code in here is a a worst-possible case for this patch.
It does opens on a _pair_ of files in two different mounts
in parallel. This should cause my code to lose its "operate
on the same mount" optimization completely. This worst-case
scenario causes a 3% degredation in the benchmark.
I could probably get rid of even this 3%, but it would be
more complex than what I have here, and I think this is
getting into acceptable territory. In practice, I expect
writing more than 3 bytes to a file, as well as disk I/O
to mask any effects that this has.
---
A little history:
These patches used to use a single atomic_t in each vfsmount
to track the number of writers to the mount at any given time.
Open a file for write anywhere on the mount, and you increment
the atomic_t.
This didn't scale on NUMA or SMP. It caused something like a
4x slowdown in open/write/close operations. It bounced
cachelines around like mad. I tried out a new system with a
per-node spinlock and atomic in each vfsmount to replace the
old single atomic. It worked _much_ better on NUMA, but still
caused a ~6% regression on the same open/write/close operation
set on a normal 4-way SMP machine, because it was still
contended inside of a node.
We could generalize this lock, but this is an awfully specialized
situation, and I'd be worried that people would try to use it
when it isn't absolutely necessary.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/namei.c | 6 +
lxc-dave/fs/namespace.c | 140 ++++++++++++++++++++++++++++++++++++++++-
lxc-dave/include/linux/mount.h | 9 ++
3 files changed, 150 insertions(+), 5 deletions(-)
diff -puN fs/namei.c~numa_mnt_want_write fs/namei.c
--- lxc/fs/namei.c~numa_mnt_want_write 2007-06-22 10:12:46.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-06-22 10:12:46.000000000 -0700
@@ -231,10 +231,12 @@ int permission(struct inode *inode, int
int retval, submask;
if (mask & MAY_WRITE) {
-
/*
- * Nobody gets write access to a read-only fs.
+ * If this WARN_ON() is hit, it likely means that
+ * there was a missed mnt_want_write() on the path
+ * leading here.
*/
+ WARN_ON(__mnt_is_readonly(nd->mnt));
if (IS_RDONLY(inode) &&
(S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
return -EROFS;
diff -puN fs/namespace.c~numa_mnt_want_write fs/namespace.c
--- lxc/fs/namespace.c~numa_mnt_want_write 2007-06-22 10:12:46.000000000 -0700
+++ lxc-dave/fs/namespace.c 2007-06-22 10:21:39.000000000 -0700
@@ -17,6 +17,7 @@
#include <linux/quotaops.h>
#include <linux/acct.h>
#include <linux/capability.h>
+#include <linux/cpumask.h>
#include <linux/module.h>
#include <linux/sysfs.h>
#include <linux/seq_file.h>
@@ -51,6 +52,8 @@ static inline unsigned long hash(struct
return tmp & hash_mask;
}
+#define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16)
+
struct vfsmount *alloc_vfsmnt(const char *name)
{
struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -64,6 +67,7 @@ struct vfsmount *alloc_vfsmnt(const char
INIT_LIST_HEAD(&mnt->mnt_share);
INIT_LIST_HEAD(&mnt->mnt_slave_list);
INIT_LIST_HEAD(&mnt->mnt_slave);
+ atomic_set(&mnt->__mnt_writers, 0);
if (name) {
int size = strlen(name) + 1;
char *newname = kmalloc(size, GFP_KERNEL);
@@ -76,19 +80,149 @@ struct vfsmount *alloc_vfsmnt(const char
return mnt;
}
-int mnt_want_write(struct vfsmount *mnt)
+
+struct mnt_writer {
+ /*
+ * If holding multiple instances of this lock, they
+ * must be ordered by cpu number.
+ */
+ spinlock_t lock;
+ unsigned long count;
+ struct vfsmount *mnt;
+} ____cacheline_aligned_in_smp;
+DEFINE_PER_CPU(struct mnt_writer, mnt_writers);
+
+static int __init init_mnt_writers(void)
{
- if (__mnt_is_readonly(mnt))
- return -EROFS;
+ int cpu;
+ for_each_possible_cpu(cpu) {
+ struct mnt_writer *writer = &per_cpu(mnt_writers, cpu);
+ spin_lock_init(&writer->lock);
+ writer->count = 0;
+ }
return 0;
}
+fs_initcall(init_mnt_writers);
+
+static inline void __clear_mnt_count(struct mnt_writer *cpu_writer)
+{
+ if (!cpu_writer->mnt)
+ return;
+ atomic_add(cpu_writer->count, &cpu_writer->mnt->__mnt_writers);
+ cpu_writer->count = 0;
+}
+
+int mnt_want_write(struct vfsmount *mnt)
+{
+ int ret = 0;
+ struct mnt_writer *cpu_writer;
+
+ cpu_writer = &get_cpu_var(mnt_writers);
+ spin_lock(&cpu_writer->lock);
+ if (__mnt_is_readonly(mnt)) {
+ ret = -EROFS;
+ goto out;
+ }
+ if (cpu_writer->mnt != mnt) {
+ __clear_mnt_count(cpu_writer);
+ cpu_writer->mnt = mnt;
+ }
+ cpu_writer->count++;
+out:
+ spin_unlock(&cpu_writer->lock);
+ put_cpu_var(mnt_writers);
+ return ret;
+}
EXPORT_SYMBOL_GPL(mnt_want_write);
+static void lock_and_coalesce_cpu_mnt_writer_counts(void)
+{
+ int cpu;
+ struct mnt_writer *cpu_writer;
+
+ for_each_possible_cpu(cpu) {
+ cpu_writer = &per_cpu(mnt_writers, cpu);
+ spin_lock(&cpu_writer->lock);
+ __clear_mnt_count(cpu_writer);
+ cpu_writer->mnt = NULL;
+ }
+}
+
+static void mnt_unlock_cpus(void)
+{
+ int cpu;
+ struct mnt_writer *cpu_writer;
+
+ for_each_possible_cpu(cpu) {
+ cpu_writer = &per_cpu(mnt_writers, cpu);
+ spin_unlock(&cpu_writer->lock);
+ }
+}
+
+/*
+ * These per-cpu write counts are not guaranteed to have
+ * matched increments and decrements on any given cpu.
+ * A file open()ed for write on one cpu and close()d on
+ * another cpu will imbalance this count. Make sure it
+ * does not get too far out of whack.
+ */
+static void handle_write_count_underflow(struct vfsmount *mnt)
+{
+ /*
+ * This should be fast the vast majority
+ * of the time because everyone will only
+ * be reading it and will share cachelines.
+ */
+ while (atomic_read(&mnt->__mnt_writers) <
+ MNT_WRITER_UNDERFLOW_LIMIT) {
+ /*
+ * It isn't necessary to hold all of the locks
+ * at the same time, but doing it this way makes
+ * us share a lot more code.
+ */
+ lock_and_coalesce_cpu_mnt_writer_counts();
+ mnt_unlock_cpus();
+ }
+}
+
void mnt_drop_write(struct vfsmount *mnt)
{
+ struct mnt_writer *cpu_writer;
+
+ cpu_writer = &get_cpu_var(mnt_writers);
+ spin_lock(&cpu_writer->lock);
+ put_cpu_var(mnt_writers);
+ if (cpu_writer->count > 0)
+ cpu_writer->count--;
+ else
+ atomic_dec(&mnt->__mnt_writers);
+ spin_unlock(&cpu_writer->lock);
+ handle_write_count_underflow(mnt);
}
EXPORT_SYMBOL_GPL(mnt_drop_write);
+int mnt_make_readonly(struct vfsmount *mnt)
+{
+ int ret = 0;
+
+ lock_and_coalesce_cpu_mnt_writer_counts();
+ /*
+ * With all the locks held, this value is stable
+ */
+ if (atomic_read(&mnt->__mnt_writers) > 0) {
+ ret = -EBUSY;
+ goto out;
+ }
+ /*
+ * actually set mount's r/o flag here to make
+ * __mnt_is_readonly() true, which keeps anyone
+ * from doing a successful mnt_want_write().
+ */
+out:
+ mnt_unlock_cpus();
+ return ret;
+}
+
int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
{
mnt->mnt_sb = sb;
diff -puN include/linux/mount.h~numa_mnt_want_write include/linux/mount.h
--- lxc/include/linux/mount.h~numa_mnt_want_write 2007-06-22 10:12:46.000000000 -0700
+++ lxc-dave/include/linux/mount.h 2007-06-22 10:21:39.000000000 -0700
@@ -14,6 +14,7 @@
#include <linux/types.h>
#include <linux/list.h>
+#include <linux/nodemask.h>
#include <linux/spinlock.h>
#include <asm/atomic.h>
@@ -61,6 +62,14 @@ struct vfsmount {
atomic_t mnt_count;
int mnt_expiry_mark; /* true if marked for expiry */
int mnt_pinned;
+
+ /*
+ * This value is not stable unless all of the
+ * mnt_writers[] spinlocks are held, and all
+ * mnt_writer[]s on this mount have 0 as
+ * their ->count
+ */
+ atomic_t __mnt_writers;
};
static inline struct vfsmount *mntget(struct vfsmount *mnt)
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 26/26] honor r/w changes at do_remount() time
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (24 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 25/26] r/o bind mounts: scalable writer count Dave Hansen
@ 2007-06-22 20:03 ` Dave Hansen
2007-06-23 7:51 ` Christoph Hellwig
2007-06-23 16:52 ` [PATCH 00/26] Mount writer count and read-only bind mounts Andrew Morton
26 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-22 20:03 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, hch, viro, Dave Hansen
Originally from: Herbert Poetzl <herbert@13thfloor.at>
This is the core of the read-only bind mount patch set.
Note that this does _not_ add a "ro" option directly to
the bind mount operation. If you require such a mount,
you must first do the bind, then follow it up with a
'mount -o remount,ro' operation.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/namespace.c | 40 ++++++++++++++++++++++++++++++++++------
lxc-dave/include/linux/mount.h | 7 ++++++-
2 files changed, 40 insertions(+), 7 deletions(-)
diff -puN fs/namespace.c~23-24-honor-r-w-changes-at-do-remount-time fs/namespace.c
--- lxc/fs/namespace.c~23-24-honor-r-w-changes-at-do-remount-time 2007-06-22 10:14:19.000000000 -0700
+++ lxc-dave/fs/namespace.c 2007-06-22 10:14:19.000000000 -0700
@@ -201,7 +201,7 @@ void mnt_drop_write(struct vfsmount *mnt
}
EXPORT_SYMBOL_GPL(mnt_drop_write);
-int mnt_make_readonly(struct vfsmount *mnt)
+static int mnt_make_readonly(struct vfsmount *mnt)
{
int ret = 0;
@@ -214,15 +214,21 @@ int mnt_make_readonly(struct vfsmount *m
goto out;
}
/*
- * actually set mount's r/o flag here to make
- * __mnt_is_readonly() true, which keeps anyone
- * from doing a successful mnt_want_write().
+ * nobody can do a successful mnt_want_write() with all
+ * of the counts in MNT_DENIED_WRITE and the locks held.
*/
+ if (!ret)
+ mnt->mnt_flags |= MNT_READONLY;
out:
mnt_unlock_cpus();
return ret;
}
+static void __mnt_unmake_readonly(struct vfsmount *mnt)
+{
+ mnt->mnt_flags &= ~MNT_READONLY;
+}
+
int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
{
mnt->mnt_sb = sb;
@@ -524,7 +530,7 @@ static int show_vfsmnt(struct seq_file *
seq_path(m, mnt, mnt->mnt_root, " \t\n\\");
seq_putc(m, ' ');
mangle(m, mnt->mnt_sb->s_type->name);
- seq_puts(m, mnt->mnt_sb->s_flags & MS_RDONLY ? " ro" : " rw");
+ seq_puts(m, __mnt_is_readonly(mnt) ? " ro" : " rw");
for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
if (mnt->mnt_sb->s_flags & fs_infop->flag)
seq_puts(m, fs_infop->str);
@@ -1093,6 +1099,23 @@ out:
return err;
}
+static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
+{
+ int error = 0;
+ int readonly_request = 0;
+
+ if (ms_flags & MS_RDONLY)
+ readonly_request = 1;
+ if (readonly_request == __mnt_is_readonly(mnt))
+ return 0;
+
+ if (readonly_request)
+ error = mnt_make_readonly(mnt);
+ else
+ __mnt_unmake_readonly(mnt);
+ return error;
+}
+
/*
* change filesystem flags. dir should be a physical root of filesystem.
* If you've mounted a non-root directory somewhere and want to do remount
@@ -1114,7 +1137,10 @@ static int do_remount(struct nameidata *
return -EINVAL;
down_write(&sb->s_umount);
- err = do_remount_sb(sb, flags, data, 0);
+ if (flags & MS_BIND)
+ err = change_mount_flags(nd->mnt, flags);
+ else
+ err = do_remount_sb(sb, flags, data, 0);
if (!err)
nd->mnt->mnt_flags = mnt_flags;
up_write(&sb->s_umount);
@@ -1558,6 +1584,8 @@ long do_mount(char *dev_name, char *dir_
mnt_flags |= MNT_NODIRATIME;
if (flags & MS_RELATIME)
mnt_flags |= MNT_RELATIME;
+ if (flags & MS_RDONLY)
+ mnt_flags |= MNT_READONLY;
flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
MS_NOATIME | MS_NODIRATIME | MS_RELATIME);
diff -puN include/linux/mount.h~23-24-honor-r-w-changes-at-do-remount-time include/linux/mount.h
--- lxc/include/linux/mount.h~23-24-honor-r-w-changes-at-do-remount-time 2007-06-22 10:14:19.000000000 -0700
+++ lxc-dave/include/linux/mount.h 2007-06-22 10:14:19.000000000 -0700
@@ -29,6 +29,7 @@ struct mnt_namespace;
#define MNT_NOATIME 0x08
#define MNT_NODIRATIME 0x10
#define MNT_RELATIME 0x20
+#define MNT_READONLY 0x40 /* does the user want this to be r/o? */
#define MNT_SHRINKABLE 0x100
@@ -87,7 +88,11 @@ static inline struct vfsmount *mntget(st
*/
static inline int __mnt_is_readonly(struct vfsmount *mnt)
{
- return (mnt->mnt_sb->s_flags & MS_RDONLY);
+ if (mnt->mnt_flags & MNT_READONLY)
+ return 1;
+ if (mnt->mnt_sb->s_flags & MS_RDONLY)
+ return 1;
+ return 0;
}
extern int mnt_want_write(struct vfsmount *mnt);
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 01/26] document nlink function
2007-06-22 20:03 ` [PATCH 01/26] document nlink function Dave Hansen
@ 2007-06-23 7:36 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:36 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:04PM -0700, Dave Hansen wrote:
>
> These should have been documented from the beginning. Fix it.
Ok. This is a trivial doc fix and should go into the 2.6.22 queue IMHO.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 02/26] ext3: remove extra IS_RDONLY() check
2007-06-22 20:03 ` [PATCH 02/26] ext3: remove extra IS_RDONLY() check Dave Hansen
@ 2007-06-23 7:36 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:36 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:05PM -0700, Dave Hansen wrote:
>
> ext3_change_inode_journal_flag() is only called from one
> location: ext3_ioctl(EXT3_IOC_SETFLAGS). That ioctl
> case already has a IS_RDONLY() call in it so this one
> is superfluous.
Ok, and stands nicely on it's own.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 03/26] ext4: remove extra IS_RDONLY() check
2007-06-22 20:03 ` [PATCH 03/26] ext4: " Dave Hansen
@ 2007-06-23 7:37 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:37 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:06PM -0700, Dave Hansen wrote:
>
> ext4_change_inode_journal_flag() is only called from one
> location: ext4_ioctl(EXT3_IOC_SETFLAGS). That ioctl
> case already has a IS_RDONLY() call in it so this one
> is superfluous.
Ditto.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 04/26] filesystem helpers for custom 'struct file's
2007-06-22 20:03 ` [PATCH 04/26] filesystem helpers for custom 'struct file's Dave Hansen
@ 2007-06-23 7:38 ` Christoph Hellwig
2007-06-25 14:53 ` Dave Hansen
2007-06-23 16:52 ` Andrew Morton
1 sibling, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:38 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:08PM -0700, Dave Hansen wrote:
>
> Christoph H. says this stands on its own and can go in before the
> rest of the r/o bind mount set.
>
> ---
>
> Some filesystems forego the vfs and may_open() and create their
> own 'struct file's.
>
> This patch creates a couple of helper functions which can be
> used by these filesystems, and will provide a unified place
> which the r/o bind mount code may patch.
>
> Also, rename two existing, static-scope init_file() to less
> generic names.
Looks good. Again this should go in ASAP after 2.6.23 independent of
the more complicated bits because it's a nice cleanup. In the end
get_empty_filp should go away as an exported symbol.
Note that we've grown more instances of the crap you're fixing here,
e.g. fs/anon_inode.c
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 05/26] r/o bind mounts: stub functions
2007-06-22 20:03 ` [PATCH 05/26] r/o bind mounts: stub functions Dave Hansen
@ 2007-06-23 7:39 ` Christoph Hellwig
2007-06-23 16:52 ` Andrew Morton
1 sibling, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:39 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:09PM -0700, Dave Hansen wrote:
>
> This patch adds two function mnt_want_write() and
> mnt_drop_write(). These are used like a lock pair around
> and fs operations that might cause a write to the filesystem.
>
> Before these can become useful, we must first cover each
> place in the VFS where writes are performed with a
> want/drop pair. When that is complete, we can actually
> introduce code that will safely check the counts before
> allowing r/w<->r/o transitions to occur.
Ok,
>
> Note that we put the linux/fs.h #include as far down in
> mount.h as possible. This is to keep the mount.h->fs.h->
> sched.h->mount.h include dependency from biting us.
except for this. Please just move __mnt_is_readonly out of line
or make it a macro, but we surely don't want to include fs.h
in mount.h
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 06/26] elevate write count open()'d files
2007-06-22 20:03 ` [PATCH 06/26] elevate write count open()'d files Dave Hansen
@ 2007-06-23 7:40 ` Christoph Hellwig
2007-06-25 15:03 ` Dave Hansen
0 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:40 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:10PM -0700, Dave Hansen wrote:
>
> This is the first really tricky patch in the series. It
> elevates the writer count on a mount each time a
> non-special file is opened for write.
>
> This is not completely apparent in the patch because the
> two if() conditions in may_open() above the
> mnt_want_write() call are, combined, equivalent to
> special_file().
>
> There is also an elevated count around the vfs_create()
> call in open_namei(). The count needs to be kept elevated
> all the way into the may_open() call. Otherwise, when the
> write is dropped, a ro->rw transisition could occur. This
> would lead to having rw access on the newly created file,
> while the vfsmount is ro. That is bad.
Looks good.
>
> Some filesystems forego the use of normal vfs calls to create
> struct files. Make sure that these users elevate the mnt writer
> count because they will get __fput(), and we need to make
> sure they're balanced.
With this you mean ipc/mqueue.c? You should probably mention
that single bastard child of a wannabe-filesystem explicitly :)
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 07/26] r/o bind mounts: elevate write count for some ioctls
2007-06-22 20:03 ` [PATCH 07/26] r/o bind mounts: elevate write count for some ioctls Dave Hansen
@ 2007-06-23 7:42 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:42 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:12PM -0700, Dave Hansen wrote:
>
> Some ioctl()s can cause writes to the filesystem. Take
> these, and make them use mnt_want/drop_write() instead.
>
> We need to pass the filp one layer deeper in XFS, but
> somebody _just_ pulled it out in February because nobody
> was using it, so I don't feel guilty for adding it back.
Could have been me :)
Would be nicer to just pass down the vfsmount instead of the file
in XFS, but it shouldn't really matter in the end.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 08/26] elevate writer count for chown and friends
2007-06-22 20:03 ` [PATCH 08/26] elevate writer count for chown and friends Dave Hansen
@ 2007-06-23 7:43 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:43 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:13PM -0700, Dave Hansen wrote:
>
>
> chown/chmod,etc... don't call permission in the same way
> that the normal "open for write" calls do. They still
> write to the filesystem, so bump the write count during
> these operations.
Looks good.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 09/26] make access() use mnt check
2007-06-22 20:03 ` [PATCH 09/26] make access() use mnt check Dave Hansen
@ 2007-06-23 7:45 ` Christoph Hellwig
2007-06-25 18:27 ` Dave Hansen
0 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:45 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:14PM -0700, Dave Hansen wrote:
>
> It is OK to let access() go without using a mnt_want/drop_write()
> pair because it doesn't actually do writes to the filesystem,
> and it is inherently racy anyway. This is a rare case when it is
> OK to use __mnt_is_readonly() directly.
You probably want to add a big comment explaining why it's fine here.
That reminds me of something else I had in mind to debug that the
writer counts are okay:
we should probably add a check in permission that we have an elevated
writercount on the vfsmount/sb. Of course we'll need some way to
overrid it for access(), which means passing down a flag to it or
something.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 10/26] elevate mnt writers for callers of vfs_mkdir()
2007-06-22 20:03 ` [PATCH 10/26] elevate mnt writers for callers of vfs_mkdir() Dave Hansen
@ 2007-06-23 7:45 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:45 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:16PM -0700, Dave Hansen wrote:
>
> Pretty self-explanatory. Fits in with the rest of the series.
Ok for this and similar patch 11-13.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 14/26] elevate write count for file_update_time()
2007-06-22 20:03 ` [PATCH 14/26] elevate write count for file_update_time() Dave Hansen
@ 2007-06-23 7:46 ` Christoph Hellwig
2007-06-25 18:32 ` Dave Hansen
2007-07-06 19:17 ` Dave Hansen
0 siblings, 2 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:46 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:21PM -0700, Dave Hansen wrote:
>
>
> Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
We really want a guaranteed non-NULL file here, but I don't want to put
this on your plate also. Please add a comment about bloody NFS exports
for now.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 15/26] mount_is_safe(): add comment
2007-06-22 20:03 ` [PATCH 15/26] mount_is_safe(): add comment Dave Hansen
@ 2007-06-23 7:47 ` Christoph Hellwig
2007-06-25 15:10 ` Dave Hansen
0 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:47 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:22PM -0700, Dave Hansen wrote:
>
>
> This area of code is currently #ifdef'd out, so add a comment
> for the time when it is actually used.
Ok. Does this clash with the user mount patches? Even if it does
I think we want this patch first in the series and fix the user mounts
up ontop of it.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 16/26] unix_find_other() elevate write count for touch_atime()
2007-06-22 20:03 ` [PATCH 16/26] unix_find_other() elevate write count for touch_atime() Dave Hansen
@ 2007-06-23 7:47 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:47 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
Ok.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 17/26] elevate write count over calls to vfs_rename()
2007-06-22 20:03 ` [PATCH 17/26] elevate write count over calls to vfs_rename() Dave Hansen
@ 2007-06-23 7:49 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:49 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:25PM -0700, Dave Hansen wrote:
>
> This also creates a little helper in the NFS code to
> make an if() a little bit less ugly.
That should probably be a separate patch. Or better one to just rip out
the MSNFS ifdefs completely, they've always been true for about the last
ten years.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 18/26] nfs: check mnt instead of superblock directly
2007-06-22 20:03 ` [PATCH 18/26] nfs: check mnt instead of superblock directly Dave Hansen
@ 2007-06-23 7:49 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:49 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
Ok.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 19/26] elevate writer count for do_sys_truncate()
2007-06-22 20:03 ` [PATCH 19/26] elevate writer count for do_sys_truncate() Dave Hansen
@ 2007-06-23 7:49 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:49 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
Ok.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 20/26] elevate write count for do_utimes()
2007-06-22 20:03 ` [PATCH 20/26] elevate write count for do_utimes() Dave Hansen
@ 2007-06-23 7:49 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:49 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
Ok.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 21/26] elevate write count for do_sys_utime() and touch_atime()
2007-06-22 20:03 ` [PATCH 21/26] elevate write count for do_sys_utime() and touch_atime() Dave Hansen
@ 2007-06-23 7:50 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:50 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
Ok.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
2007-06-22 20:03 ` [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create() Dave Hansen
@ 2007-06-23 7:51 ` Christoph Hellwig
2007-06-25 15:19 ` Dave Hansen
0 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:51 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:32PM -0700, Dave Hansen wrote:
>
>
> This takes care of all of the direct callers of vfs_mknod().
> Since a few of these cases also handle normal file creation
> as well, this also covers some calls to vfs_create().
Ok.
> diff -puN fs/namei.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create fs/namei.c
> --- lxc/fs/namei.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-06-21 23:23:25.000000000 -0700
> +++ lxc-dave/fs/namei.c 2007-06-21 23:23:25.000000000 -0700
> @@ -1897,14 +1897,26 @@ asmlinkage long sys_mknodat(int dfd, con
> if (!IS_ERR(dentry)) {
> switch (mode & S_IFMT) {
> case 0: case S_IFREG:
> + error = mnt_want_write(nd.mnt);
> + if (error)
> + break;
> error = vfs_create(nd.dentry->d_inode,dentry,mode,&nd);
> + mnt_drop_write(nd.mnt);
> break;
> case S_IFCHR: case S_IFBLK:
> + error = mnt_want_write(nd.mnt);
> + if (error)
> + break;
> error = vfs_mknod(nd.dentry->d_inode,dentry,mode,
> new_decode_dev(dev));
> + mnt_drop_write(nd.mnt);
> break;
> case S_IFIFO: case S_IFSOCK:
> + error = mnt_want_write(nd.mnt);
> + if (error)
> + break;
> error = vfs_mknod(nd.dentry->d_inode,dentry,mode,0);
> + mnt_drop_write(nd.mnt);
> break;
> case S_IFDIR:
> error = -EPERM;
Should we just take the calls outside the switch statement?
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 23/26] elevate mnt writers for vfs_unlink() callers
2007-06-22 20:03 ` [PATCH 23/26] elevate mnt writers for vfs_unlink() callers Dave Hansen
@ 2007-06-23 7:51 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:51 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
Ok.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 24/26] do_rmdir(): elevate write count
2007-06-22 20:03 ` [PATCH 24/26] do_rmdir(): elevate write count Dave Hansen
@ 2007-06-23 7:51 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:51 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
Ok.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 26/26] honor r/w changes at do_remount() time
2007-06-22 20:03 ` [PATCH 26/26] honor r/w changes at do_remount() time Dave Hansen
@ 2007-06-23 7:51 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-23 7:51 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-fsdevel, hch, viro
On Fri, Jun 22, 2007 at 01:03:37PM -0700, Dave Hansen wrote:
>
>
> Originally from: Herbert Poetzl <herbert@13thfloor.at>
>
> This is the core of the read-only bind mount patch set.
>
> Note that this does _not_ add a "ro" option directly to
> the bind mount operation. If you require such a mount,
> you must first do the bind, then follow it up with a
> 'mount -o remount,ro' operation.
>
> Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Ok.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 25/26] r/o bind mounts: scalable writer count
2007-06-22 20:03 ` [PATCH 25/26] r/o bind mounts: scalable writer count Dave Hansen
@ 2007-06-23 11:28 ` Miklos Szeredi
2007-06-23 11:31 ` Miklos Szeredi
2007-06-25 15:36 ` Dave Hansen
2007-06-23 16:52 ` Andrew Morton
1 sibling, 2 replies; 83+ messages in thread
From: Miklos Szeredi @ 2007-06-23 11:28 UTC (permalink / raw)
To: haveblue; +Cc: akpm, linux-fsdevel, hch, viro, haveblue
> This uses a statically-allocated percpu variable. All
> operations are local to a cpu as long that cpu operates on
> the same mount, and there are no writer count imbalances.
> Writer count imbalances happen when a write is taken on one
> cpu, and released on another, like when an open/close pair
> is performed on two different cpus because the task moved.
>
I'd suggest something along these lines in final mntput:
lock_and_coalesce_cpu_mnt_writer_counts();
mnt_unlock_cpus();
BUG_ON(atomic_read(&mnt->__mnt_writers));
since there's basically no other we'll notice if there _is_ an
imbalance.
Miklos
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 25/26] r/o bind mounts: scalable writer count
2007-06-23 11:28 ` Miklos Szeredi
@ 2007-06-23 11:31 ` Miklos Szeredi
2007-06-25 15:36 ` Dave Hansen
1 sibling, 0 replies; 83+ messages in thread
From: Miklos Szeredi @ 2007-06-23 11:31 UTC (permalink / raw)
To: haveblue; +Cc: akpm, linux-fsdevel, hch, viro, haveblue
> I'd suggest something along these lines in final mntput:
>
> lock_and_coalesce_cpu_mnt_writer_counts();
> mnt_unlock_cpus();
> BUG_ON(atomic_read(&mnt->__mnt_writers));
Or rather a WARN_ON(), since it's not a fatal condition.
Miklos
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 04/26] filesystem helpers for custom 'struct file's
2007-06-22 20:03 ` [PATCH 04/26] filesystem helpers for custom 'struct file's Dave Hansen
2007-06-23 7:38 ` Christoph Hellwig
@ 2007-06-23 16:52 ` Andrew Morton
2007-06-25 15:37 ` Dave Hansen
1 sibling, 1 reply; 83+ messages in thread
From: Andrew Morton @ 2007-06-23 16:52 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-fsdevel, hch, viro, haveblue
> On Fri, 22 Jun 2007 13:03:08 -0700 Dave Hansen <haveblue@us.ibm.com> wrote:
> Christoph H. says this stands on its own and can go in before the
> rest of the r/o bind mount set.
>
> ---
>
> Some filesystems forego the vfs and may_open() and create their
> own 'struct file's.
>
> This patch creates a couple of helper functions which can be
> used by these filesystems, and will provide a unified place
> which the r/o bind mount code may patch.
>
> Also, rename two existing, static-scope init_file() to less
> generic names.
the sysfs changes are no longer applicable here due to pending changes in
Greg's tree.
The net/socket.c change is a mystery. My version of sock_attach_fd()
doesn't look like yours.
So I snarfed the first three patches, passed on this one and shall await a
quality review-and-ack from someone who is appropriately familiar with this
code, thanks.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 05/26] r/o bind mounts: stub functions
2007-06-22 20:03 ` [PATCH 05/26] r/o bind mounts: stub functions Dave Hansen
2007-06-23 7:39 ` Christoph Hellwig
@ 2007-06-23 16:52 ` Andrew Morton
2007-06-25 15:49 ` Dave Hansen
1 sibling, 1 reply; 83+ messages in thread
From: Andrew Morton @ 2007-06-23 16:52 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-fsdevel, hch, viro, haveblue
> On Fri, 22 Jun 2007 13:03:09 -0700 Dave Hansen <haveblue@us.ibm.com> wrote:
> This patch adds two function mnt_want_write() and
> mnt_drop_write()
ITYM "global, exported-to-modules yet 100% undocumented" functions.
tsk.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 25/26] r/o bind mounts: scalable writer count
2007-06-22 20:03 ` [PATCH 25/26] r/o bind mounts: scalable writer count Dave Hansen
2007-06-23 11:28 ` Miklos Szeredi
@ 2007-06-23 16:52 ` Andrew Morton
2007-06-25 15:47 ` Dave Hansen
1 sibling, 1 reply; 83+ messages in thread
From: Andrew Morton @ 2007-06-23 16:52 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-fsdevel, hch, viro, haveblue
> On Fri, 22 Jun 2007 13:03:36 -0700 Dave Hansen <haveblue@us.ibm.com> wrote:
> +DEFINE_PER_CPU(struct mnt_writer, mnt_writers);
this can have static scope.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 00/26] Mount writer count and read-only bind mounts
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
` (25 preceding siblings ...)
2007-06-22 20:03 ` [PATCH 26/26] honor r/w changes at do_remount() time Dave Hansen
@ 2007-06-23 16:52 ` Andrew Morton
2007-06-25 15:45 ` Dave Hansen
2007-06-30 9:57 ` Christoph Hellwig
26 siblings, 2 replies; 83+ messages in thread
From: Andrew Morton @ 2007-06-23 16:52 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-fsdevel, hch, viro, haveblue
> On Fri, 22 Jun 2007 13:03:03 -0700 Dave Hansen <haveblue@us.ibm.com> wrote:
> Why do we need r/o bind mounts?
>
> This feature allows a read-only view into a read-write filesystem.
> In the process of doing that, it also provides infrastructure for
> keeping track of the number of writers to any given mount.
>
> This has a number of uses. It allows chroots to have parts of
> filesystems writable. It will be useful for containers in the future
> because users may have root inside a container, but should not
> be allowed to write to somefilesystems. This also replaces
> patches that vserver has had out of the tree for several years.
>
> It allows security enhancement by making sure that parts of
> your filesystem read-only (such as when you don't trust your
> FTP server), when you don't want to have entire new filesystems
> mounted, or when you want atime selectively updated.
> I've been using the following script to test that the feature is
> working as desired. It takes a directory and makes a regular
> bind and a r/o bind mount of it. It then performs some normal
> filesystem operations on the three directories, including ones
> that are expected to fail, like creating a file on the r/o
> mount.
Doesn't selinux do some of this?
My overall reaction: owch. There's a ton of tricksy code here and great
potential for us to accidentally break it in the future by forgetting a
mnt_may_write() as the kernel evolves. And then there's the added
complexity and the added runtime overhead.
Balance that against some pretty obscure-looking benefits and I'm
struggling to see how a merge is justifiable?
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 04/26] filesystem helpers for custom 'struct file's
2007-06-23 7:38 ` Christoph Hellwig
@ 2007-06-25 14:53 ` Dave Hansen
0 siblings, 0 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-25 14:53 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Hansen, akpm, linux-fsdevel, viro
On Sat, 2007-06-23 at 08:38 +0100, Christoph Hellwig wrote:
> Note that we've grown more instances of the crap you're fixing here,
> e.g. fs/anon_inode.c
Ugh. I'll go clean that one up.
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 06/26] elevate write count open()'d files
2007-06-23 7:40 ` Christoph Hellwig
@ 2007-06-25 15:03 ` Dave Hansen
0 siblings, 0 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-25 15:03 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Hansen, akpm, linux-fsdevel, viro
On Sat, 2007-06-23 at 08:40 +0100, Christoph Hellwig wrote:
>
> > Some filesystems forego the use of normal vfs calls to create
> > struct files. Make sure that these users elevate the mnt writer
> > count because they will get __fput(), and we need to make
> > sure they're balanced.
>
> With this you mean ipc/mqueue.c? You should probably mention
> that single bastard child of a wannabe-filesystem explicitly :)
I was referring to all the ones that use the "filesystem helpers" from
patch 04/26.
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 15/26] mount_is_safe(): add comment
2007-06-23 7:47 ` Christoph Hellwig
@ 2007-06-25 15:10 ` Dave Hansen
0 siblings, 0 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-25 15:10 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Hansen, akpm, linux-fsdevel, viro
On Sat, 2007-06-23 at 08:47 +0100, Christoph Hellwig wrote:
> On Fri, Jun 22, 2007 at 01:03:22PM -0700, Dave Hansen wrote:
> > This area of code is currently #ifdef'd out, so add a comment
> > for the time when it is actually used.
>
> Ok. Does this clash with the user mount patches? Even if it does
> I think we want this patch first in the series and fix the user mounts
> up ontop of it.
It will clash with those, but they completely drops the call to
vfs_permission() in here. As I think about this again, we really don't
care about r/o bind mounts here, _only_ the vfs permissions.
If it is a r/o bind mount we want to be able to mount something writable
on top of it because we're never actually writing to the mount *itself*.
I think this patch can just go away now.
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
2007-06-23 7:51 ` Christoph Hellwig
@ 2007-06-25 15:19 ` Dave Hansen
2007-06-30 9:39 ` Christoph Hellwig
0 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-25 15:19 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Hansen, akpm, linux-fsdevel, viro
On Sat, 2007-06-23 at 08:51 +0100, Christoph Hellwig wrote:
> > diff -puN fs/namei.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create fs/namei.c
> > --- lxc/fs/namei.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-06-21 23:23:25.000000000 -0700
> > +++ lxc-dave/fs/namei.c 2007-06-21 23:23:25.000000000 -0700
> > @@ -1897,14 +1897,26 @@ asmlinkage long sys_mknodat(int dfd, con
> > if (!IS_ERR(dentry)) {
> > switch (mode & S_IFMT) {
> > case 0: case S_IFREG:
> > + error = mnt_want_write(nd.mnt);
> > + if (error)
> > + break;
> > error = vfs_create(nd.dentry->d_inode,dentry,mode,&nd);
> > + mnt_drop_write(nd.mnt);
> > break;
> > case S_IFCHR: case S_IFBLK:
> > + error = mnt_want_write(nd.mnt);
> > + if (error)
> > + break;
> > error = vfs_mknod(nd.dentry->d_inode,dentry,mode,
> > new_decode_dev(dev));
> > + mnt_drop_write(nd.mnt);
> > break;
> > case S_IFIFO: case S_IFSOCK:
> > + error = mnt_want_write(nd.mnt);
> > + if (error)
> > + break;
> > error = vfs_mknod(nd.dentry->d_inode,dentry,mode,0);
> > + mnt_drop_write(nd.mnt);
> > break;
> > case S_IFDIR:
> > error = -EPERM;
>
> Should we just take the calls outside the switch statement?
Yeah, that's much better. I assume we don't care whether we're getting
-EROFS or -EPERM/-EINVAL for the S_IFDIR and default cases?
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 25/26] r/o bind mounts: scalable writer count
2007-06-23 11:28 ` Miklos Szeredi
2007-06-23 11:31 ` Miklos Szeredi
@ 2007-06-25 15:36 ` Dave Hansen
2007-06-25 19:09 ` Miklos Szeredi
1 sibling, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-25 15:36 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: haveblue, akpm, linux-fsdevel, hch, viro
On Sat, 2007-06-23 at 13:28 +0200, Miklos Szeredi wrote:
>
> I'd suggest something along these lines in final mntput:
>
> lock_and_coalesce_cpu_mnt_writer_counts();
> mnt_unlock_cpus();
> BUG_ON(atomic_read(&mnt->__mnt_writers));
>
> since there's basically no other we'll notice if there _is_ an
> imbalance.
That's a very good idea. I've added that check, but a WARN_ON() instead
of a BUG_ON().
Does __mntput() seem like the right place to you?
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 04/26] filesystem helpers for custom 'struct file's
2007-06-23 16:52 ` Andrew Morton
@ 2007-06-25 15:37 ` Dave Hansen
2007-06-25 17:25 ` Andrew Morton
0 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-25 15:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Hansen, linux-fsdevel, hch, viro
On Sat, 2007-06-23 at 09:52 -0700, Andrew Morton wrote:
> > On Fri, 22 Jun 2007 13:03:08 -0700 Dave Hansen <haveblue@us.ibm.com> wrote:
> > Christoph H. says this stands on its own and can go in before the
> > rest of the r/o bind mount set.
> >
> > ---
> >
> > Some filesystems forego the vfs and may_open() and create their
> > own 'struct file's.
> >
> > This patch creates a couple of helper functions which can be
> > used by these filesystems, and will provide a unified place
> > which the r/o bind mount code may patch.
> >
> > Also, rename two existing, static-scope init_file() to less
> > generic names.
>
> the sysfs changes are no longer applicable here due to pending changes in
> Greg's tree.
>
> The net/socket.c change is a mystery. My version of sock_attach_fd()
> doesn't look like yours.
Should I rebase them on top of -mm, or a just current -rc?
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 00/26] Mount writer count and read-only bind mounts
2007-06-23 16:52 ` [PATCH 00/26] Mount writer count and read-only bind mounts Andrew Morton
@ 2007-06-25 15:45 ` Dave Hansen
2007-06-30 9:57 ` Christoph Hellwig
1 sibling, 0 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-25 15:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Hansen, linux-fsdevel, hch, viro
On Sat, 2007-06-23 at 09:52 -0700, Andrew Morton wrote:
> > On Fri, 22 Jun 2007 13:03:03 -0700 Dave Hansen <haveblue@us.ibm.com> wrote:
> > Why do we need r/o bind mounts?
> >
> > This feature allows a read-only view into a read-write filesystem.
> > In the process of doing that, it also provides infrastructure for
> > keeping track of the number of writers to any given mount.
> >
> > This has a number of uses. It allows chroots to have parts of
> > filesystems writable. It will be useful for containers in the future
> > because users may have root inside a container, but should not
> > be allowed to write to somefilesystems. This also replaces
> > patches that vserver has had out of the tree for several years.
> >
> > It allows security enhancement by making sure that parts of
> > your filesystem read-only (such as when you don't trust your
> > FTP server), when you don't want to have entire new filesystems
> > mounted, or when you want atime selectively updated.
> > I've been using the following script to test that the feature is
> > working as desired. It takes a directory and makes a regular
> > bind and a r/o bind mount of it. It then performs some normal
> > filesystem operations on the three directories, including ones
> > that are expected to fail, like creating a file on the r/o
> > mount.
>
> Doesn't selinux do some of this?
>
> My overall reaction: owch. There's a ton of tricksy code here and great
> potential for us to accidentally break it in the future by forgetting a
> mnt_may_write() as the kernel evolves.
This is definitely a tricky thing. It takes a static, single check and
replaces it with a matched set of operations. But, it's not much
different that adding a mutex to something. People can always miss one
side of the lock pair.
People won't miss the mnt_may_write() because it will become the only
way that it is valid to check a mounted fs for the ability to write to
it. IS_RDONLY() will not be available for these kinds of checks.
> And then there's the added complexity and the added runtime overhead.
>
> Balance that against some pretty obscure-looking benefits and I'm
> struggling to see how a merge is justifiable?
One reason Al had me go through using these paired operations instead of
just passing the mount all over the vfs is that this fixes some
existing, fundamental problems: we do not properly track when writers
are _finished_ to our filesystems, and may allow a remount-r/o operation
to success when writes are still occurring. We needed to separate out
the logical "users can write to this fs" from the physical "this fs is
on r/o media" or "this fs is dying and writes will only kill it more".
That's what these patches do in the end.
One set of things that I'm going to tack on here once these go in is the
ability to increment the writer count upon a decrement of i_nlink to
zero. We'll drop the write count when the file is actually truncated.
As it stands right now, since there is never an open filp on those
files, you might unlink a file, do a r/o mount of the fs, then still
write to it when the truncate occurs. I think fixing that was one of
Al's long-term goals with this strategy.
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 25/26] r/o bind mounts: scalable writer count
2007-06-23 16:52 ` Andrew Morton
@ 2007-06-25 15:47 ` Dave Hansen
0 siblings, 0 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-25 15:47 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Hansen, linux-fsdevel, hch, viro
On Sat, 2007-06-23 at 09:52 -0700, Andrew Morton wrote:
> > On Fri, 22 Jun 2007 13:03:36 -0700 Dave Hansen <haveblue@us.ibm.com>
> wrote:
> > +DEFINE_PER_CPU(struct mnt_writer, mnt_writers);
>
> this can have static scope.
Fixed. That'll be in my update.
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 05/26] r/o bind mounts: stub functions
2007-06-23 16:52 ` Andrew Morton
@ 2007-06-25 15:49 ` Dave Hansen
0 siblings, 0 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-25 15:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Hansen, linux-fsdevel, hch, viro
On Sat, 2007-06-23 at 09:52 -0700, Andrew Morton wrote:
> > On Fri, 22 Jun 2007 13:03:09 -0700 Dave Hansen <haveblue@us.ibm.com> wrote:
> > This patch adds two function mnt_want_write() and
> > mnt_drop_write()
>
> ITYM "global, exported-to-modules yet 100% undocumented" functions.
Point taken. :)
I've added some big fat comments on these functions' definitions.
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 04/26] filesystem helpers for custom 'struct file's
2007-06-25 15:37 ` Dave Hansen
@ 2007-06-25 17:25 ` Andrew Morton
2007-06-25 17:32 ` Dave Hansen
0 siblings, 1 reply; 83+ messages in thread
From: Andrew Morton @ 2007-06-25 17:25 UTC (permalink / raw)
To: Dave Hansen; +Cc: Dave Hansen, linux-fsdevel, hch, viro
On Mon, 25 Jun 2007 08:37:29 -0700 Dave Hansen <hansendc@us.ibm.com> wrote:
> On Sat, 2007-06-23 at 09:52 -0700, Andrew Morton wrote:
> > > On Fri, 22 Jun 2007 13:03:08 -0700 Dave Hansen <haveblue@us.ibm.com> wrote:
> > > Christoph H. says this stands on its own and can go in before the
> > > rest of the r/o bind mount set.
> > >
> > > ---
> > >
> > > Some filesystems forego the vfs and may_open() and create their
> > > own 'struct file's.
> > >
> > > This patch creates a couple of helper functions which can be
> > > used by these filesystems, and will provide a unified place
> > > which the r/o bind mount code may patch.
> > >
> > > Also, rename two existing, static-scope init_file() to less
> > > generic names.
> >
> > the sysfs changes are no longer applicable here due to pending changes in
> > Greg's tree.
> >
> > The net/socket.c change is a mystery. My version of sock_attach_fd()
> > doesn't look like yours.
>
> Should I rebase them on top of -mm, or a just current -rc?
>
Weren't they already against current -rc? I really am mystified about
where your net/socket.c cam from, although I didn't search super-hard (I
was offline at the time).
Normally against current -rc, but the pending sysfs changes do mean that it
perhaps would be better if you were to runtime test it at least against
Greg's tree.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 04/26] filesystem helpers for custom 'struct file's
2007-06-25 17:25 ` Andrew Morton
@ 2007-06-25 17:32 ` Dave Hansen
2007-06-30 9:35 ` Christoph Hellwig
0 siblings, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-25 17:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, hch, viro
On Mon, 2007-06-25 at 10:25 -0700, Andrew Morton wrote:
> On Mon, 25 Jun 2007 08:37:29 -0700 Dave Hansen <hansendc@us.ibm.com> wrote:
> > On Sat, 2007-06-23 at 09:52 -0700, Andrew Morton wrote:
> > > The net/socket.c change is a mystery. My version of sock_attach_fd()
> > > doesn't look like yours.
> >
> > Should I rebase them on top of -mm, or a just current -rc?
> >
>
> Weren't they already against current -rc? I really am mystified about
> where your net/socket.c cam from, although I didn't search super-hard (I
> was offline at the time).
Looks like I had them againt plain 2.6.21. That'll do it.
I'll rebase and test with Greg's tree.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 09/26] make access() use mnt check
2007-06-23 7:45 ` Christoph Hellwig
@ 2007-06-25 18:27 ` Dave Hansen
2007-06-26 19:04 ` Dave Kleikamp
2007-06-30 9:37 ` Christoph Hellwig
0 siblings, 2 replies; 83+ messages in thread
From: Dave Hansen @ 2007-06-25 18:27 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: akpm, linux-fsdevel, viro
On Sat, 2007-06-23 at 08:45 +0100, Christoph Hellwig wrote:
> On Fri, Jun 22, 2007 at 01:03:14PM -0700, Dave Hansen wrote:
> >
> > It is OK to let access() go without using a mnt_want/drop_write()
> > pair because it doesn't actually do writes to the filesystem,
> > and it is inherently racy anyway. This is a rare case when it is
> > OK to use __mnt_is_readonly() directly.
>
> You probably want to add a big comment explaining why it's fine here.
I've got this in the next set:
-
- if(IS_RDONLY(nd.dentry->d_inode))
+ /*
+ * This is a rare case where using __mnt_is_readonly()
+ * is OK without a mnt_want/drop_write() pair. Since
+ * not actual write to the fs is performed here, we do
+ * not need to telegraph to that to anyone. Also, we
+ * accept that access is inherently racy, and know that
+ * the fs might be remounted between this syscall, and
+ * any action taken because of its result.
+ */
+ if (__mnt_is_readonly(nd.mnt))
res = -EROFS;
> That reminds me of something else I had in mind to debug that the
> writer counts are okay:
>
> we should probably add a check in permission that we have an elevated
> writercount on the vfsmount/sb. Of course we'll need some way to
> overrid it for access(), which means passing down a flag to it or
> something.
This was already in the second to last patch in the series. Good enough?
diff -puN fs/namei.c~numa_mnt_want_write fs/namei.c
--- lxc/fs/namei.c~numa_mnt_want_write 2007-06-25 11:05:50.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-06-25 11:05:50.000000000 -0700
@@ -230,10 +230,12 @@ int permission(struct inode *inode, int
int retval, submask;
if (mask & MAY_WRITE) {
-
/*
- * Nobody gets write access to a read-only fs.
+ * If this WARN_ON() is hit, it likely means that
+ * there was a missed mnt_want_write() on the path
+ * leading here.
*/
+ WARN_ON(__mnt_is_readonly(nd->mnt));
if (IS_RDONLY(inode) &&
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 14/26] elevate write count for file_update_time()
2007-06-23 7:46 ` Christoph Hellwig
@ 2007-06-25 18:32 ` Dave Hansen
2007-06-30 9:38 ` Christoph Hellwig
2007-07-06 19:17 ` Dave Hansen
1 sibling, 1 reply; 83+ messages in thread
From: Dave Hansen @ 2007-06-25 18:32 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: akpm, linux-fsdevel, viro
On Sat, 2007-06-23 at 08:46 +0100, Christoph Hellwig wrote:
> On Fri, Jun 22, 2007 at 01:03:21PM -0700, Dave Hansen wrote:
> > Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
>
> We really want a guaranteed non-NULL file here, but I don't want to put
> this on your plate also. Please add a comment about bloody NFS exports
> for now.
How does this look?
- if (IS_RDONLY(inode))
+ /*
+ * Ideally, we want to guarantee that 'f_vfsmnt'
+ * is non-NULL here. But, NFS exports need to
+ * be fixed up before we can do that. So, check
+ * it for now. - Dave Hansen
+ */
+ if (file->f_vfsmnt)
+ err = mnt_want_write(file->f_vfsmnt);
+ if (err)
return;
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 25/26] r/o bind mounts: scalable writer count
2007-06-25 15:36 ` Dave Hansen
@ 2007-06-25 19:09 ` Miklos Szeredi
0 siblings, 0 replies; 83+ messages in thread
From: Miklos Szeredi @ 2007-06-25 19:09 UTC (permalink / raw)
To: hansendc; +Cc: haveblue, akpm, linux-fsdevel, hch, viro
> > I'd suggest something along these lines in final mntput:
> >
> > lock_and_coalesce_cpu_mnt_writer_counts();
> > mnt_unlock_cpus();
> > BUG_ON(atomic_read(&mnt->__mnt_writers));
> >
> > since there's basically no other we'll notice if there _is_ an
> > imbalance.
>
> That's a very good idea. I've added that check, but a WARN_ON() instead
> of a BUG_ON().
>
> Does __mntput() seem like the right place to you?
Yes, it does.
Miklos
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 09/26] make access() use mnt check
2007-06-25 18:27 ` Dave Hansen
@ 2007-06-26 19:04 ` Dave Kleikamp
2007-06-30 9:37 ` Christoph Hellwig
1 sibling, 0 replies; 83+ messages in thread
From: Dave Kleikamp @ 2007-06-26 19:04 UTC (permalink / raw)
To: Dave Hansen; +Cc: Christoph Hellwig, akpm, linux-fsdevel, viro
On Mon, 2007-06-25 at 11:27 -0700, Dave Hansen wrote:
> On Sat, 2007-06-23 at 08:45 +0100, Christoph Hellwig wrote:
> > You probably want to add a big comment explaining why it's fine here.
>
> I've got this in the next set:
>
> -
> - if(IS_RDONLY(nd.dentry->d_inode))
> + /*
> + * This is a rare case where using __mnt_is_readonly()
> + * is OK without a mnt_want/drop_write() pair. Since
> + * not actual write to the fs is performed here, we do
s/not/no/
> + * not need to telegraph to that to anyone. Also, we
> + * accept that access is inherently racy, and know that
> + * the fs might be remounted between this syscall, and
> + * any action taken because of its result.
> + */
> + if (__mnt_is_readonly(nd.mnt))
> res = -EROFS;
>
--
David Kleikamp
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 04/26] filesystem helpers for custom 'struct file's
2007-06-25 17:32 ` Dave Hansen
@ 2007-06-30 9:35 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-30 9:35 UTC (permalink / raw)
To: Dave Hansen; +Cc: Andrew Morton, linux-fsdevel, hch, viro
On Mon, Jun 25, 2007 at 10:32:43AM -0700, Dave Hansen wrote:
> Looks like I had them againt plain 2.6.21. That'll do it.
>
> I'll rebase and test with Greg's tree.
Can you send it out ASAP? I want to do the indepth review so we can
get this in ASAP and kill the get_empty_filp export eventually. We're
growing far too many opencoded instance of this crap and need to stop
it now.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 09/26] make access() use mnt check
2007-06-25 18:27 ` Dave Hansen
2007-06-26 19:04 ` Dave Kleikamp
@ 2007-06-30 9:37 ` Christoph Hellwig
2007-07-02 16:09 ` Dave Hansen
1 sibling, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-30 9:37 UTC (permalink / raw)
To: Dave Hansen; +Cc: Christoph Hellwig, akpm, linux-fsdevel, viro
On Mon, Jun 25, 2007 at 11:27:25AM -0700, Dave Hansen wrote:
> I've got this in the next set:
>
> -
> - if(IS_RDONLY(nd.dentry->d_inode))
> + /*
> + * This is a rare case where using __mnt_is_readonly()
> + * is OK without a mnt_want/drop_write() pair. Since
> + * not actual write to the fs is performed here, we do
> + * not need to telegraph to that to anyone. Also, we
> + * accept that access is inherently racy, and know that
> + * the fs might be remounted between this syscall, and
> + * any action taken because of its result.
> + */
> + if (__mnt_is_readonly(nd.mnt))
> res = -EROFS;
Looks good.
> diff -puN fs/namei.c~numa_mnt_want_write fs/namei.c
> --- lxc/fs/namei.c~numa_mnt_want_write 2007-06-25 11:05:50.000000000 -0700
> +++ lxc-dave/fs/namei.c 2007-06-25 11:05:50.000000000 -0700
> @@ -230,10 +230,12 @@ int permission(struct inode *inode, int
> int retval, submask;
>
> if (mask & MAY_WRITE) {
> -
> /*
> - * Nobody gets write access to a read-only fs.
> + * If this WARN_ON() is hit, it likely means that
> + * there was a missed mnt_want_write() on the path
> + * leading here.
> */
> + WARN_ON(__mnt_is_readonly(nd->mnt));
> if (IS_RDONLY(inode) &&
I might be missing something, but wouldn't this trip an
access("/foo/bar", W_OK);
On a readonly filesystem?
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 14/26] elevate write count for file_update_time()
2007-06-25 18:32 ` Dave Hansen
@ 2007-06-30 9:38 ` Christoph Hellwig
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-30 9:38 UTC (permalink / raw)
To: Dave Hansen; +Cc: Christoph Hellwig, akpm, linux-fsdevel, viro
On Mon, Jun 25, 2007 at 11:32:08AM -0700, Dave Hansen wrote:
> How does this look?
>
> - if (IS_RDONLY(inode))
> + /*
> + * Ideally, we want to guarantee that 'f_vfsmnt'
> + * is non-NULL here. But, NFS exports need to
> + * be fixed up before we can do that. So, check
> + * it for now. - Dave Hansen
> + */
> + if (file->f_vfsmnt)
> + err = mnt_want_write(file->f_vfsmnt);
> + if (err)
> return;
Looks okay. Unfortunately we've grown some more callers of dentry_open
with a NULL vfsmount in the meantime (*grasp* reiserfs xattr mess *grasp*)
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
2007-06-25 15:19 ` Dave Hansen
@ 2007-06-30 9:39 ` Christoph Hellwig
2007-07-02 23:31 ` Dave Hansen
2007-07-05 22:43 ` Dave Hansen
0 siblings, 2 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-30 9:39 UTC (permalink / raw)
To: Dave Hansen; +Cc: Christoph Hellwig, Dave Hansen, akpm, linux-fsdevel, viro
On Mon, Jun 25, 2007 at 08:19:52AM -0700, Dave Hansen wrote:
> > Should we just take the calls outside the switch statement?
>
> Yeah, that's much better. I assume we don't care whether we're getting
> -EROFS or -EPERM/-EINVAL for the S_IFDIR and default cases?
We need to keep the exact error returns, so you'll have to add some
special case checks before the r/o check. It's probably still cleaner,
though.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 00/26] Mount writer count and read-only bind mounts
2007-06-23 16:52 ` [PATCH 00/26] Mount writer count and read-only bind mounts Andrew Morton
2007-06-25 15:45 ` Dave Hansen
@ 2007-06-30 9:57 ` Christoph Hellwig
1 sibling, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-06-30 9:57 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Hansen, linux-fsdevel, hch, viro
On Sat, Jun 23, 2007 at 09:52:46AM -0700, Andrew Morton wrote:
> Doesn't selinux do some of this?
No.
> My overall reaction: owch. There's a ton of tricksy code here and great
> potential for us to accidentally break it in the future by forgetting a
> mnt_may_write() as the kernel evolves. And then there's the added
> complexity and the added runtime overhead.
Most of the code is not actually implementing per-mountpoint r/o but
fixing the way we deal with mount -o remount,ro and the general writeability
state of a filesystem. Theres been a huge racy window before which is
fixed. And it lays the groundwork for interesting new things like the
posibility for the filesystem to revert to r/o after it's not been
written to for a while.
There's been a plain r/o bindmounts patch from the vserver folks that's
a lot smaller because it doesn't fix the problems (and actually widens
them a little)
>
> Balance that against some pretty obscure-looking benefits and I'm
> struggling to see how a merge is justifiable?
It's a feature people have been asking for a long time, just about a month
ago there has been a lenghty thread on the nfs list about the requirement.
And it only gets more important (not to say absolutely essential) for full
containers support, where you really don't want your untrusted container
to able to write into the shared /usr.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 09/26] make access() use mnt check
2007-06-30 9:37 ` Christoph Hellwig
@ 2007-07-02 16:09 ` Dave Hansen
0 siblings, 0 replies; 83+ messages in thread
From: Dave Hansen @ 2007-07-02 16:09 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: akpm, linux-fsdevel, viro
On Sat, 2007-06-30 at 10:37 +0100, Christoph Hellwig wrote:
> > --- lxc/fs/namei.c~numa_mnt_want_write 2007-06-25 11:05:50.000000000 -0700
> > +++ lxc-dave/fs/namei.c 2007-06-25 11:05:50.000000000 -0700
> > @@ -230,10 +230,12 @@ int permission(struct inode *inode, int
> > int retval, submask;
> >
> > if (mask & MAY_WRITE) {
> > -
> > /*
> > - * Nobody gets write access to a read-only fs.
> > + * If this WARN_ON() is hit, it likely means that
> > + * there was a missed mnt_want_write() on the path
> > + * leading here.
> > */
> > + WARN_ON(__mnt_is_readonly(nd->mnt));
> > if (IS_RDONLY(inode) &&
>
> I might be missing something, but wouldn't this trip an
>
> access("/foo/bar", W_OK);
>
> On a readonly filesystem?
Yes, it will. I will re-work it a bit.
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
2007-06-30 9:39 ` Christoph Hellwig
@ 2007-07-02 23:31 ` Dave Hansen
2007-07-05 22:43 ` Dave Hansen
1 sibling, 0 replies; 83+ messages in thread
From: Dave Hansen @ 2007-07-02 23:31 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: akpm, linux-fsdevel, viro
On Sat, 2007-06-30 at 10:39 +0100, Christoph Hellwig wrote:
> On Mon, Jun 25, 2007 at 08:19:52AM -0700, Dave Hansen wrote:
> > > Should we just take the calls outside the switch statement?
> >
> > Yeah, that's much better. I assume we don't care whether we're getting
> > -EROFS or -EPERM/-EINVAL for the S_IFDIR and default cases?
>
> We need to keep the exact error returns, so you'll have to add some
> special case checks before the r/o check. It's probably still cleaner,
> though.
I'm looking through this and having a really hard time seeing any nice
ways to make this cleaner.
Were you thinking of basically a duplicate case statement before the r/w
check?
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
2007-06-30 9:39 ` Christoph Hellwig
2007-07-02 23:31 ` Dave Hansen
@ 2007-07-05 22:43 ` Dave Hansen
2007-07-07 18:25 ` Jan Engelhardt
2007-07-11 10:22 ` Christoph Hellwig
1 sibling, 2 replies; 83+ messages in thread
From: Dave Hansen @ 2007-07-05 22:43 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: haveblue, akpm, linux-fsdevel, viro
On Sat, 2007-06-30 at 10:39 +0100, Christoph Hellwig wrote:
> On Mon, Jun 25, 2007 at 08:19:52AM -0700, Dave Hansen wrote:
> > > Should we just take the calls outside the switch statement?
> >
> > Yeah, that's much better. I assume we don't care whether we're getting
> > -EROFS or -EPERM/-EINVAL for the S_IFDIR and default cases?
>
> We need to keep the exact error returns, so you'll have to add some
> special case checks before the r/o check. It's probably still cleaner,
> though.
How does this (untested) patch look?
--
This takes care of all of the direct callers of vfs_mknod().
Since a few of these cases also handle normal file creation
as well, this also covers some calls to vfs_create().
So that we don't have to make three mnt_want/drop_write()
calls inside of the switch statement, we move some of its
logic outside of the switch.
One thing I noticed: do we actually _need_ the first
S_ISDIR() check at the top of the function?
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
lxc-dave/fs/namei.c | 29 ++++++++++++++++++++---------
lxc-dave/fs/nfsd/vfs.c | 8 ++++++--
lxc-dave/net/unix/af_unix.c | 4 ++++
3 files changed, 30 insertions(+), 11 deletions(-)
diff -puN fs/namei.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create fs/namei.c
--- lxc/fs/namei.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-07-05 15:40:18.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-07-05 15:40:26.000000000 -0700
@@ -1911,13 +1911,27 @@ asmlinkage long sys_mknodat(int dfd, con
error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, &nd);
if (error)
goto out;
+
dentry = lookup_create(&nd, 0);
error = PTR_ERR(dentry);
+ if (error)
+ goto out_unlock;
if (!IS_POSIXACL(nd.dentry->d_inode))
mode &= ~current->fs->umask;
- if (!IS_ERR(dentry)) {
- switch (mode & S_IFMT) {
+ if (S_ISDIR(mode)) {
+ error = -EPERM;
+ goto out_dput;
+ }
+ if (!S_ISREG(mode) && !S_ISCHR(mode) && !S_ISBLK(mode) &&
+ !S_ISFIFO(mode) && !S_ISSOCK(mode) && (mode != 0)) {
+ error = -EINVAL;
+ goto out_dput;
+ }
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
+ switch (mode & S_IFMT) {
case 0: case S_IFREG:
error = vfs_create(nd.dentry->d_inode,dentry,mode,&nd);
break;
@@ -1928,14 +1942,11 @@ asmlinkage long sys_mknodat(int dfd, con
case S_IFIFO: case S_IFSOCK:
error = vfs_mknod(nd.dentry->d_inode,dentry,mode,0);
break;
- case S_IFDIR:
- error = -EPERM;
- break;
- default:
- error = -EINVAL;
- }
- dput(dentry);
}
+ mnt_drop_write(nd.mnt);
+out_dput:
+ dput(dentry);
+out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
path_release(&nd);
out:
diff -puN fs/nfsd/vfs.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-07-05 15:40:18.000000000 -0700
+++ lxc-dave/fs/nfsd/vfs.c 2007-07-05 15:40:18.000000000 -0700
@@ -1199,7 +1199,11 @@ nfsd_create(struct svc_rqst *rqstp, stru
case S_IFBLK:
case S_IFIFO:
case S_IFSOCK:
+ host_err = mnt_want_write(fhp->fh_export->ex_mnt);
+ if (host_err)
+ break;
host_err = vfs_mknod(dirp, dchild, iap->ia_mode, rdev);
+ mnt_drop_write(fhp->fh_export->ex_mnt);
break;
default:
printk("nfsd: bad file type %o in nfsd_create\n", type);
@@ -1810,7 +1814,7 @@ nfsd_permission(struct svc_export *exp,
inode->i_mode,
IS_IMMUTABLE(inode)? " immut" : "",
IS_APPEND(inode)? " append" : "",
- __mnt_is_readonly(exp->mnt)? " ro" : "");
+ __mnt_is_readonly(exp->ex_mnt)? " ro" : "");
dprintk(" owner %d/%d user %d/%d\n",
inode->i_uid, inode->i_gid, current->fsuid, current->fsgid);
#endif
@@ -1821,7 +1825,7 @@ nfsd_permission(struct svc_export *exp,
*/
if (!(acc & MAY_LOCAL_ACCESS))
if (acc & (MAY_WRITE | MAY_SATTR | MAY_TRUNC)) {
- if (EX_RDONLY(exp) || __mnt_is_readonly(exp->mnt))
+ if (EX_RDONLY(exp) || __mnt_is_readonly(exp->ex_mnt))
return nfserr_rofs;
if (/* (acc & MAY_WRITE) && */ IS_IMMUTABLE(inode))
return nfserr_perm;
diff -puN net/unix/af_unix.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create net/unix/af_unix.c
--- lxc/net/unix/af_unix.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-07-05 15:40:18.000000000 -0700
+++ lxc-dave/net/unix/af_unix.c 2007-07-05 15:40:18.000000000 -0700
@@ -815,7 +815,11 @@ static int unix_bind(struct socket *sock
*/
mode = S_IFSOCK |
(SOCK_INODE(sock)->i_mode & ~current->fs->umask);
+ err = mnt_want_write(nd.mnt);
+ if (err)
+ goto out_mknod_dput;
err = vfs_mknod(nd.dentry->d_inode, dentry, mode, 0);
+ mnt_drop_write(nd.mnt);
if (err)
goto out_mknod_dput;
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 14/26] elevate write count for file_update_time()
2007-06-23 7:46 ` Christoph Hellwig
2007-06-25 18:32 ` Dave Hansen
@ 2007-07-06 19:17 ` Dave Hansen
1 sibling, 0 replies; 83+ messages in thread
From: Dave Hansen @ 2007-07-06 19:17 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: akpm, linux-fsdevel, viro
On Sat, 2007-06-23 at 08:46 +0100, Christoph Hellwig wrote:
> On Fri, Jun 22, 2007 at 01:03:21PM -0700, Dave Hansen wrote:
> > Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
>
> We really want a guaranteed non-NULL file here, but I don't want to put
> this on your plate also. Please add a comment about bloody NFS exports
> for now.
What's the deal with these? Just that the checks are only for NFS and
we shouldn't have to be doing them?
-- Dave
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
2007-07-05 22:43 ` Dave Hansen
@ 2007-07-07 18:25 ` Jan Engelhardt
2007-07-09 19:04 ` Dave Hansen
2007-07-11 10:22 ` Christoph Hellwig
2007-07-11 10:22 ` Christoph Hellwig
1 sibling, 2 replies; 83+ messages in thread
From: Jan Engelhardt @ 2007-07-07 18:25 UTC (permalink / raw)
To: Dave Hansen; +Cc: Christoph Hellwig, haveblue, akpm, linux-fsdevel, viro
On Jul 5 2007 15:43, Dave Hansen wrote:
>@@ -1911,13 +1911,27 @@ asmlinkage long sys_mknodat(int dfd, con
> error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, &nd);
> if (error)
> goto out;
>+
> dentry = lookup_create(&nd, 0);
> error = PTR_ERR(dentry);
>+ if (error)
>+ goto out_unlock;
>
> if (!IS_POSIXACL(nd.dentry->d_inode))
> mode &= ~current->fs->umask;
>- if (!IS_ERR(dentry)) {
>- switch (mode & S_IFMT) {
>+ if (S_ISDIR(mode)) {
>+ error = -EPERM;
>+ goto out_dput;
Hm, I wonder why mknod(,S_IFDIR,) returns -EPERM rather than -EINVAL like the
others? It is a bit misleading; EPERM usually means root could do it
somehow (unless the kernel is really pedantic).
IMHO this if(S_ISDIR) should go.
>+ if (!S_ISREG(mode) && !S_ISCHR(mode) && !S_ISBLK(mode) &&
>+ !S_ISFIFO(mode) && !S_ISSOCK(mode) && (mode != 0)) {
If you align the two &&s (^) you can also align ^ the others :)
>+ error = -EINVAL;
>+ goto out_dput;
>+ }
Jan
--
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
2007-07-07 18:25 ` Jan Engelhardt
@ 2007-07-09 19:04 ` Dave Hansen
2007-07-11 10:22 ` Christoph Hellwig
1 sibling, 0 replies; 83+ messages in thread
From: Dave Hansen @ 2007-07-09 19:04 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Christoph Hellwig, haveblue, akpm, linux-fsdevel, viro
On Sat, 2007-07-07 at 20:25 +0200, Jan Engelhardt wrote:
> On Jul 5 2007 15:43, Dave Hansen wrote:
>
> >@@ -1911,13 +1911,27 @@ asmlinkage long sys_mknodat(int dfd, con
> > error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, &nd);
> > if (error)
> > goto out;
> >+
> > dentry = lookup_create(&nd, 0);
> > error = PTR_ERR(dentry);
> >+ if (error)
> >+ goto out_unlock;
> >
> > if (!IS_POSIXACL(nd.dentry->d_inode))
> > mode &= ~current->fs->umask;
> >- if (!IS_ERR(dentry)) {
> >- switch (mode & S_IFMT) {
> >+ if (S_ISDIR(mode)) {
> >+ error = -EPERM;
> >+ goto out_dput;
>
> Hm, I wonder why mknod(,S_IFDIR,) returns -EPERM rather than -EINVAL like the
> others? It is a bit misleading; EPERM usually means root could do it
> somehow (unless the kernel is really pedantic).
> IMHO this if(S_ISDIR) should go.
I'd prefer to leave that fight for another day. :) It'd be easy to fix
up later.
> >+ if (!S_ISREG(mode) && !S_ISCHR(mode) && !S_ISBLK(mode) &&
> >+ !S_ISFIFO(mode) && !S_ISSOCK(mode) && (mode != 0)) {
>
> If you align the two &&s (^) you can also align ^ the others :)
Good point. I'll fix it up.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
2007-07-05 22:43 ` Dave Hansen
2007-07-07 18:25 ` Jan Engelhardt
@ 2007-07-11 10:22 ` Christoph Hellwig
1 sibling, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-07-11 10:22 UTC (permalink / raw)
To: Dave Hansen; +Cc: Christoph Hellwig, haveblue, akpm, linux-fsdevel, viro
On Thu, Jul 05, 2007 at 03:43:32PM -0700, Dave Hansen wrote:
> This takes care of all of the direct callers of vfs_mknod().
> Since a few of these cases also handle normal file creation
> as well, this also covers some calls to vfs_create().
>
> So that we don't have to make three mnt_want/drop_write()
> calls inside of the switch statement, we move some of its
> logic outside of the switch.
Looks good to me.
> One thing I noticed: do we actually _need_ the first
> S_ISDIR() check at the top of the function?
The EPERM is required by Posix.
> + if (!S_ISREG(mode) && !S_ISCHR(mode) && !S_ISBLK(mode) &&
> + !S_ISFIFO(mode) && !S_ISSOCK(mode) && (mode != 0)) {
no need for braces around the mode != 0
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
2007-07-07 18:25 ` Jan Engelhardt
2007-07-09 19:04 ` Dave Hansen
@ 2007-07-11 10:22 ` Christoph Hellwig
1 sibling, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2007-07-11 10:22 UTC (permalink / raw)
To: Jan Engelhardt
Cc: Dave Hansen, Christoph Hellwig, haveblue, akpm, linux-fsdevel,
viro
On Sat, Jul 07, 2007 at 08:25:31PM +0200, Jan Engelhardt wrote:
> Hm, I wonder why mknod(,S_IFDIR,) returns -EPERM rather than -EINVAL like the
> others?
Posix.
^ permalink raw reply [flat|nested] 83+ messages in thread
end of thread, other threads:[~2007-07-11 10:22 UTC | newest]
Thread overview: 83+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-06-22 20:03 [PATCH 00/26] Mount writer count and read-only bind mounts Dave Hansen
2007-06-22 20:03 ` [PATCH 01/26] document nlink function Dave Hansen
2007-06-23 7:36 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 02/26] ext3: remove extra IS_RDONLY() check Dave Hansen
2007-06-23 7:36 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 03/26] ext4: " Dave Hansen
2007-06-23 7:37 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 04/26] filesystem helpers for custom 'struct file's Dave Hansen
2007-06-23 7:38 ` Christoph Hellwig
2007-06-25 14:53 ` Dave Hansen
2007-06-23 16:52 ` Andrew Morton
2007-06-25 15:37 ` Dave Hansen
2007-06-25 17:25 ` Andrew Morton
2007-06-25 17:32 ` Dave Hansen
2007-06-30 9:35 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 05/26] r/o bind mounts: stub functions Dave Hansen
2007-06-23 7:39 ` Christoph Hellwig
2007-06-23 16:52 ` Andrew Morton
2007-06-25 15:49 ` Dave Hansen
2007-06-22 20:03 ` [PATCH 06/26] elevate write count open()'d files Dave Hansen
2007-06-23 7:40 ` Christoph Hellwig
2007-06-25 15:03 ` Dave Hansen
2007-06-22 20:03 ` [PATCH 07/26] r/o bind mounts: elevate write count for some ioctls Dave Hansen
2007-06-23 7:42 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 08/26] elevate writer count for chown and friends Dave Hansen
2007-06-23 7:43 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 09/26] make access() use mnt check Dave Hansen
2007-06-23 7:45 ` Christoph Hellwig
2007-06-25 18:27 ` Dave Hansen
2007-06-26 19:04 ` Dave Kleikamp
2007-06-30 9:37 ` Christoph Hellwig
2007-07-02 16:09 ` Dave Hansen
2007-06-22 20:03 ` [PATCH 10/26] elevate mnt writers for callers of vfs_mkdir() Dave Hansen
2007-06-23 7:45 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 11/26] elevate write count during entire ncp_ioctl() Dave Hansen
2007-06-22 20:03 ` [PATCH 12/26] elevate write count for link and symlink calls Dave Hansen
2007-06-22 20:03 ` [PATCH 13/26] elevate mount count for extended attributes Dave Hansen
2007-06-22 20:03 ` [PATCH 14/26] elevate write count for file_update_time() Dave Hansen
2007-06-23 7:46 ` Christoph Hellwig
2007-06-25 18:32 ` Dave Hansen
2007-06-30 9:38 ` Christoph Hellwig
2007-07-06 19:17 ` Dave Hansen
2007-06-22 20:03 ` [PATCH 15/26] mount_is_safe(): add comment Dave Hansen
2007-06-23 7:47 ` Christoph Hellwig
2007-06-25 15:10 ` Dave Hansen
2007-06-22 20:03 ` [PATCH 16/26] unix_find_other() elevate write count for touch_atime() Dave Hansen
2007-06-23 7:47 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 17/26] elevate write count over calls to vfs_rename() Dave Hansen
2007-06-23 7:49 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 18/26] nfs: check mnt instead of superblock directly Dave Hansen
2007-06-23 7:49 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 19/26] elevate writer count for do_sys_truncate() Dave Hansen
2007-06-23 7:49 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 20/26] elevate write count for do_utimes() Dave Hansen
2007-06-23 7:49 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 21/26] elevate write count for do_sys_utime() and touch_atime() Dave Hansen
2007-06-23 7:50 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create() Dave Hansen
2007-06-23 7:51 ` Christoph Hellwig
2007-06-25 15:19 ` Dave Hansen
2007-06-30 9:39 ` Christoph Hellwig
2007-07-02 23:31 ` Dave Hansen
2007-07-05 22:43 ` Dave Hansen
2007-07-07 18:25 ` Jan Engelhardt
2007-07-09 19:04 ` Dave Hansen
2007-07-11 10:22 ` Christoph Hellwig
2007-07-11 10:22 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 23/26] elevate mnt writers for vfs_unlink() callers Dave Hansen
2007-06-23 7:51 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 24/26] do_rmdir(): elevate write count Dave Hansen
2007-06-23 7:51 ` Christoph Hellwig
2007-06-22 20:03 ` [PATCH 25/26] r/o bind mounts: scalable writer count Dave Hansen
2007-06-23 11:28 ` Miklos Szeredi
2007-06-23 11:31 ` Miklos Szeredi
2007-06-25 15:36 ` Dave Hansen
2007-06-25 19:09 ` Miklos Szeredi
2007-06-23 16:52 ` Andrew Morton
2007-06-25 15:47 ` Dave Hansen
2007-06-22 20:03 ` [PATCH 26/26] honor r/w changes at do_remount() time Dave Hansen
2007-06-23 7:51 ` Christoph Hellwig
2007-06-23 16:52 ` [PATCH 00/26] Mount writer count and read-only bind mounts Andrew Morton
2007-06-25 15:45 ` Dave Hansen
2007-06-30 9:57 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).