[RFC][PATCH 2/7] get mount write in __dentry

public inbox for linux-kernel@vger.kernel.org

* [RFC][PATCH 2/7] get mount write in __dentry_open()
  2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
@ 2007-10-10 16:34 ` Dave Hansen
  2007-10-11 15:08   ` Miklos Szeredi
  2007-10-10 16:34 ` [RFC][PATCH 3/7] do namei_flags calculation inside open_namei() Dave Hansen
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen


The first patch fixes an actual bug.  I think the
rest will reduce the chance for any future bugs
to creep in.

--

The r/o bind mount patches require matching mnt_want_write()
at filp creation time with a mnt_drop_write() at __fput().

We used to do this in may_open(), but Miklos pointed out
that __dentry_open() is used as well to create filps.  We
don't currently do mnt_want_write() for these.

If a filp on a writeable file is created this way and
destroyed via __fput(), we'll get a mount count imbalance.

This patch moves the mount write count acquisition from
may_open() into __dentry_open(), where we should catch
many more of the users.
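
The balancing rule described above can be sketched in a small userspace model. This is only an illustration, not the kernel code: every model_* name and struct is hypothetical, the counters are plain ints, and the read-only check stands in for whatever mnt_want_write() really does. It shows one acquisition helper shaped like __get_file_write_access() and a release path that mirrors it, including the special-file exception.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: these structs mimic, but are not, the kernel's. */
struct model_mount { int writers; bool readonly; };
struct model_inode { int writecount; bool special; };
struct model_file {
	struct model_mount *mnt;
	struct model_inode *inode;
	bool writable;
};

/* Stand-in for mnt_want_write(): fails on a read-only mount. */
static int model_mnt_want_write(struct model_mount *mnt)
{
	if (mnt->readonly)
		return -1;	/* the kernel would return an errno here */
	mnt->writers++;
	return 0;
}

static void model_mnt_drop_write(struct model_mount *mnt)
{
	mnt->writers--;
}

/* Same shape as __get_file_write_access() in the patch. */
static int model_get_file_write_access(struct model_inode *inode,
				       struct model_mount *mnt)
{
	int error;

	inode->writecount++;		/* stand-in for get_write_access() */
	if (inode->special)
		return 0;		/* no mount write count for special files */
	error = model_mnt_want_write(mnt);
	if (error)
		inode->writecount--;	/* undo the inode count on failure */
	return error;
}

/* Open: every writable filp goes through the one acquisition helper. */
static int model_open(struct model_file *f, struct model_inode *inode,
		      struct model_mount *mnt, bool writable)
{
	f->mnt = mnt;
	f->inode = inode;
	f->writable = writable;
	return writable ? model_get_file_write_access(inode, mnt) : 0;
}

/* Release: the mirror image of the acquisition, as __fput() must be. */
static void model_fput(struct model_file *f)
{
	if (!f->writable)
		return;
	f->inode->writecount--;
	if (!f->inode->special)
		model_mnt_drop_write(f->mnt);
}
```

In this model, opening a regular file and a device node for write and then releasing both leaves the mount's writer count at zero, and a failed open on a read-only mount leaves no counts behind.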

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 lxc-dave/fs/namei.c |   12 ------------
 lxc-dave/fs/open.c  |   45 ++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 38 insertions(+), 19 deletions(-)

diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
--- lxc/fs/namei.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
+++ lxc-dave/fs/namei.c	2007-10-04 18:02:48.000000000 -0700
@@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
 			return -EACCES;
 
 		flag &= ~O_TRUNC;
-	} else if (flag & FMODE_WRITE) {
-		/*
-		 * effectively: !special_file()
-		 * balanced by __fput()
-		 */
-		error = mnt_want_write(nd->mnt);
-		if (error)
-			return error;
 	}
 
 	error = vfs_permission(nd, acc_mode);
@@ -1778,11 +1770,7 @@ do_last:
 
 	/* Negative dentry, just create the file */
 	if (!path.dentry->d_inode) {
-		error = mnt_want_write(nd->mnt);
-		if (error)
-			goto exit_mutex_unlock;
 		error = open_namei_create(nd, &path, flag, mode);
-		mnt_drop_write(nd->mnt);
 		if (error)
 			goto exit;
 		return 0;
diff -puN fs/open.c~get-write-in-__dentry_open fs/open.c
--- lxc/fs/open.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
+++ lxc-dave/fs/open.c	2007-10-04 18:02:48.000000000 -0700
@@ -766,22 +766,51 @@ out:
 	return error;
 }
 
+/*
+ * You have to be very careful that these write
+ * counts get cleaned up in error cases and
+ * upon __fput().  This should probably never
+ * be called outside of __dentry_open().
+ */
+static inline int __get_file_write_access(struct inode *inode,
+					  struct vfsmount *mnt)
+{
+	int error;
+	error = get_write_access(inode);
+	if (error)
+		return error;
+	/*
+	 * Do not take mount writer counts on
+	 * special files since no writes to
+	 * the mount itself will occur.
+	 */
+	if (special_file(inode->i_mode))
+		return 0;
+
+	/*
+	 * Balanced in __fput()
+	 */
+	error = mnt_want_write(mnt);
+	if (error)
+		put_write_access(inode);
+	return error;
+}
+
 static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
 					int flags, struct file *f,
 					int (*open)(struct inode *, struct file *))
 {
 	struct inode *inode;
-	int error;
+	int error = 0;
 
 	f->f_flags = flags;
 	f->f_mode = ((flags+1) & O_ACCMODE) | FMODE_LSEEK |
 				FMODE_PREAD | FMODE_PWRITE;
 	inode = dentry->d_inode;
-	if (f->f_mode & FMODE_WRITE) {
-		error = get_write_access(inode);
-		if (error)
-			goto cleanup_file;
-	}
+	if (f->f_mode & FMODE_WRITE)
+		error = __get_file_write_access(inode, mnt);
+	if (error)
+		goto cleanup_file;
 
 	f->f_mapping = inode->i_mapping;
 	f->f_path.dentry = dentry;
@@ -820,8 +849,10 @@ static struct file *__dentry_open(struct
 
 cleanup_all:
 	fops_put(f->f_op);
-	if (f->f_mode & FMODE_WRITE)
+	if (f->f_mode & FMODE_WRITE) {
 		put_write_access(inode);
+		mnt_drop_write(mnt);
+	}
 	file_kill(f);
 	f->f_path.dentry = NULL;
 	f->f_path.mnt = NULL;
diff -puN fs/file_table.c~get-write-in-__dentry_open fs/file_table.c
_


* [RFC][PATCH 1/7] init_file(): only take writes on normal files
@ 2007-10-10 16:34 Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 2/7] get mount write in __dentry_open() Dave Hansen
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen



---

 lxc-dave/fs/file_table.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff -puN fs/file_table.c~init_file-only-take-writes-on-normal-files fs/file_table.c
--- lxc/fs/file_table.c~init_file-only-take-writes-on-normal-files	2007-10-04 13:01:59.000000000 -0700
+++ lxc-dave/fs/file_table.c	2007-10-04 13:03:03.000000000 -0700
@@ -199,7 +199,12 @@ int init_file(struct file *file, struct 
 	file->f_mapping = dentry->d_inode->i_mapping;
 	file->f_mode = mode;
 	file->f_op = fop;
-	if (mode & FMODE_WRITE) {
+	/*
+	 * These mounts don't really matter in practice
+	 * for r/o bind mounts.  They aren't userspace-
+	 * visible.  We do this for consistency.
+	 */
+	if ((mode & FMODE_WRITE) && !special_file(dentry->d_inode->i_mode)) {
 		file->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
 		error = mnt_want_write(mnt);
 		WARN_ON(error);
_


* [RFC][PATCH 3/7] do namei_flags calculation inside open_namei()
  2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 2/7] get mount write in __dentry_open() Dave Hansen
@ 2007-10-10 16:34 ` Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 4/7] make open_namei() return a filp Dave Hansen
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen


My end goal here is to make sure all users of may_open()
return filps.  This will ensure that we properly release
mount write counts which were taken for the filp in
may_open().

This patch moves the sys_open flags to namei flags
calculation into fs/namei.c.  We'll shortly be moving
the nameidata_to_filp() calls into namei.c, and this
gets the sys_open flags to a place where we can get
at them when we need them.
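
As a standalone illustration of the conversion being moved, here is a minimal sketch of the same arithmetic outside the kernel. MODEL_ACCMODE and model_open_to_namei_flags() are hypothetical names; the mask value 3 assumes the access mode occupies the low two bits, as O_ACCMODE does on Linux.

```c
#include <assert.h>

/* Assumption: the access mode lives in the low two bits of the flags. */
#define MODEL_ACCMODE 3

/*
 * Same arithmetic as sys_open_flags_to_namei_flags() in the patch:
 * bump the low bits by one unless they are already 11 ("special").
 */
static int model_open_to_namei_flags(int flag)
{
	if ((flag + 1) & MODEL_ACCMODE)
		flag++;
	return flag;
}
```

With this mapping, read-only (00) becomes 01, write-only (01) becomes 10, read-write (10) becomes 11, and 11 is left alone. Bits above the low two are not disturbed.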

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 lxc-dave/fs/namei.c |   43 ++++++++++++++++++++++++++++++++++---------
 lxc-dave/fs/open.c  |   22 ++--------------------
 2 files changed, 36 insertions(+), 29 deletions(-)

diff -puN fs/namei.c~do-namei_flags-calculation-inside-open_namei fs/namei.c
--- lxc/fs/namei.c~do-namei_flags-calculation-inside-open_namei	2007-10-04 13:13:00.000000000 -0700
+++ lxc-dave/fs/namei.c	2007-10-04 13:13:00.000000000 -0700
@@ -1672,7 +1672,12 @@ int may_open(struct nameidata *nd, int a
 	return 0;
 }
 
-static int open_namei_create(struct nameidata *nd, struct path *path,
+/*
+ * Be careful about ever adding any more callers of this
+ * function.  Its flags must be in the namei format, not
+ * what gets passed to sys_open().
+ */
+static int __open_namei_create(struct nameidata *nd, struct path *path,
 				int flag, int mode)
 {
 	int error;
@@ -1691,26 +1696,46 @@ static int open_namei_create(struct name
 }
 
 /*
+ * Note that while the flag value (low two bits) for sys_open means:
+ *	00 - read-only
+ *	01 - write-only
+ *	10 - read-write
+ *	11 - special
+ * it is changed into
+ *	00 - no permissions needed
+ *	01 - read-permission
+ *	10 - write-permission
+ *	11 - read-write
+ * for the internal routines (ie open_namei()/follow_link() etc)
+ * This is more logical, and also allows the 00 "no perm needed"
+ * to be used for symlinks (where the permissions are checked
+ * later).
+ *
+ */
+static inline int sys_open_flags_to_namei_flags(int flag)
+{
+	if ((flag+1) & O_ACCMODE)
+		flag++;
+	return flag;
+}
+
+/*
  *	open_namei()
  *
  * namei for open - this is in fact almost the whole open-routine.
  *
  * Note that the low bits of "flag" aren't the same as in the open
- * system call - they are 00 - no permissions needed
- *			  01 - read permission needed
- *			  10 - write permission needed
- *			  11 - read/write permissions needed
- * which is a lot more logical, and also allows the "no perm" needed
- * for symlinks (where the permissions are checked later).
+ * system call.  See sys_open_flags_to_namei_flags().
  * SMP-safe
  */
-int open_namei(int dfd, const char *pathname, int flag,
+int open_namei(int dfd, const char *pathname, int sys_open_flag,
 		int mode, struct nameidata *nd)
 {
 	int acc_mode, error;
 	struct path path;
 	struct dentry *dir;
 	int count = 0;
+	int flag = sys_open_flags_to_namei_flags(sys_open_flag);
 
 	acc_mode = ACC_MODE(flag);
 
@@ -1770,7 +1795,7 @@ do_last:
 
 	/* Negative dentry, just create the file */
 	if (!path.dentry->d_inode) {
-		error = open_namei_create(nd, &path, flag, mode);
+		error = __open_namei_create(nd, &path, flag, mode);
 		if (error)
 			goto exit;
 		return 0;
diff -puN fs/open.c~do-namei_flags-calculation-inside-open_namei fs/open.c
--- lxc/fs/open.c~do-namei_flags-calculation-inside-open_namei	2007-10-04 13:13:00.000000000 -0700
+++ lxc-dave/fs/open.c	2007-10-04 13:13:00.000000000 -0700
@@ -863,31 +863,13 @@ cleanup_file:
 	return ERR_PTR(error);
 }
 
-/*
- * Note that while the flag value (low two bits) for sys_open means:
- *	00 - read-only
- *	01 - write-only
- *	10 - read-write
- *	11 - special
- * it is changed into
- *	00 - no permissions needed
- *	01 - read-permission
- *	10 - write-permission
- *	11 - read-write
- * for the internal routines (ie open_namei()/follow_link() etc). 00 is
- * used by symlinks.
- */
 static struct file *do_filp_open(int dfd, const char *filename, int flags,
 				 int mode)
 {
-	int namei_flags, error;
+	int error;
 	struct nameidata nd;
 
-	namei_flags = flags;
-	if ((namei_flags+1) & O_ACCMODE)
-		namei_flags++;
-
-	error = open_namei(dfd, filename, namei_flags, mode, &nd);
+	error = open_namei(dfd, filename, flags, mode, &nd);
 	if (!error)
 		return nameidata_to_filp(&nd, flags);
 
_


* [RFC][PATCH 4/7] make open_namei() return a filp
  2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 2/7] get mount write in __dentry_open() Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 3/7] do namei_flags calculation inside open_namei() Dave Hansen
@ 2007-10-10 16:34 ` Dave Hansen
  2007-10-11 15:24   ` Miklos Szeredi
  2007-10-10 16:34 ` [RFC][PATCH 5/7] kill do_filp_open() Dave Hansen
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen


If open_namei() succeeds, there is potentially a mnt_want_write()
that needs to get balanced.  If the caller doesn't create a
'struct file' and eventually __fput() it, or manually drop the
write count on an error, we have a bug.

Forcing open_namei() to return a filp fixes this.  Any caller
getting a 'struct file' back must consider that filp instantiated
and fput() it normally.  The callers no longer have to worry about
ever manually releasing a mnt write count.
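
The "return a filp or an error" convention relies on encoding a negative errno in the pointer itself. A rough userspace sketch of that idiom follows; the model_* helpers are hypothetical stand-ins for the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(), and the 4095 limit is an assumption taken from the kernel's MAX_ERRNO.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: errno values fit in the top 4095 addresses, as in the kernel. */
#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno as a pointer value. */
static void *model_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno from an encoded pointer. */
static long model_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer falls in the range reserved for encoded errnos. */
static int model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

struct model_file { int flags; };

static struct model_file model_storage;

/*
 * Hypothetical opener: returns either a usable file or an encoded
 * errno, never a half-opened object the caller has to clean up.
 */
static struct model_file *model_open(int should_fail)
{
	if (should_fail)
		return model_err_ptr(-ENOENT);
	return &model_storage;
}
```

A caller only has two cases to handle: an error pointer, where nothing needs releasing, or a real file that is released through the normal path.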

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 lxc-dave/fs/namei.c         |   16 ++++++++--------
 lxc-dave/fs/open.c          |    7 +------
 lxc-dave/include/linux/fs.h |    2 +-
 3 files changed, 10 insertions(+), 15 deletions(-)

diff -puN fs/namei.c~make-open_namei-return-a-filp fs/namei.c
--- lxc/fs/namei.c~make-open_namei-return-a-filp	2007-10-03 09:01:45.000000000 -0700
+++ lxc-dave/fs/namei.c	2007-10-03 09:01:45.000000000 -0700
@@ -1728,8 +1728,8 @@ static inline int sys_open_flags_to_name
  * system call.  See sys_open_flags_to_namei_flags().
  * SMP-safe
  */
-int open_namei(int dfd, const char *pathname, int sys_open_flag,
-		int mode, struct nameidata *nd)
+struct file *open_namei(int dfd, const char *pathname, int sys_open_flag,
+			int mode, struct nameidata *nd)
 {
 	int acc_mode, error;
 	struct path path;
@@ -1755,7 +1755,7 @@ int open_namei(int dfd, const char *path
 		error = path_lookup_open(dfd, pathname, lookup_flags(flag),
 					 nd, flag);
 		if (error)
-			return error;
+			return ERR_PTR(error);
 		goto ok;
 	}
 
@@ -1764,7 +1764,7 @@ int open_namei(int dfd, const char *path
 	 */
 	error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
 	if (error)
-		return error;
+		return ERR_PTR(error);
 
 	/*
 	 * We have the parent and last component. First of all, check
@@ -1798,7 +1798,7 @@ do_last:
 		error = __open_namei_create(nd, &path, flag, mode);
 		if (error)
 			goto exit;
-		return 0;
+		return nameidata_to_filp(nd, sys_open_flag);
 	}
 
 	/*
@@ -1831,7 +1831,7 @@ ok:
 	error = may_open(nd, acc_mode, flag);
 	if (error)
 		goto exit;
-	return 0;
+	return nameidata_to_filp(nd, sys_open_flag);
 
 exit_mutex_unlock:
 	mutex_unlock(&dir->d_inode->i_mutex);
@@ -1841,7 +1841,7 @@ exit:
 	if (!IS_ERR(nd->intent.open.file))
 		release_open_intent(nd);
 	path_release(nd);
-	return error;
+	return ERR_PTR(error);
 
 do_link:
 	error = -ELOOP;
@@ -1868,7 +1868,7 @@ do_link:
 		 * with "intent.open".
 		 */
 		release_open_intent(nd);
-		return error;
+		return ERR_PTR(error);
 	}
 	nd->flags &= ~LOOKUP_PARENT;
 	if (nd->last_type == LAST_BIND)
diff -puN fs/open.c~make-open_namei-return-a-filp fs/open.c
--- lxc/fs/open.c~make-open_namei-return-a-filp	2007-10-03 09:01:45.000000000 -0700
+++ lxc-dave/fs/open.c	2007-10-03 09:01:45.000000000 -0700
@@ -846,14 +846,9 @@ cleanup_file:
 static struct file *do_filp_open(int dfd, const char *filename, int flags,
 				 int mode)
 {
-	int error;
 	struct nameidata nd;
 
-	error = open_namei(dfd, filename, flags, mode, &nd);
-	if (!error)
-		return nameidata_to_filp(&nd, flags);
-
-	return ERR_PTR(error);
+	return open_namei(dfd, filename, flags, mode, &nd);
 }
 
 struct file *filp_open(const char *filename, int flags, int mode)
diff -puN include/linux/fs.h~make-open_namei-return-a-filp include/linux/fs.h
--- lxc/include/linux/fs.h~make-open_namei-return-a-filp	2007-10-03 09:01:45.000000000 -0700
+++ lxc-dave/include/linux/fs.h	2007-10-03 09:01:45.000000000 -0700
@@ -1721,7 +1721,7 @@ extern struct file *create_read_pipe(str
 extern struct file *create_write_pipe(void);
 extern void free_write_pipe(struct file *);
 
-extern int open_namei(int dfd, const char *, int, int, struct nameidata *);
+extern struct file *open_namei(int dfd, const char *, int, int, struct nameidata *);
 extern int may_open(struct nameidata *, int, int);
 
 extern int kernel_read(struct file *, unsigned long, char *, unsigned long);
_


* [RFC][PATCH 5/7] kill do_filp_open()
  2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
                   ` (2 preceding siblings ...)
  2007-10-10 16:34 ` [RFC][PATCH 4/7] make open_namei() return a filp Dave Hansen
@ 2007-10-10 16:34 ` Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 6/7] kill filp_open() Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 7/7] keep track of mnt_writer state of struct file Dave Hansen
  5 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen


This kills off the almost empty do_filp_open().  The indenting
change in do_sys_open() is because we would have gone over our
80 characters otherwise.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 lxc-dave/fs/open.c |   39 ++++++++++++++++++---------------------
 1 file changed, 18 insertions(+), 21 deletions(-)

diff -puN fs/open.c~kill-do_filp_open fs/open.c
--- lxc/fs/open.c~kill-do_filp_open	2007-10-04 13:59:44.000000000 -0700
+++ lxc-dave/fs/open.c	2007-10-04 13:59:44.000000000 -0700
@@ -863,17 +863,10 @@ cleanup_file:
 	return ERR_PTR(error);
 }
 
-static struct file *do_filp_open(int dfd, const char *filename, int flags,
-				 int mode)
-{
-	struct nameidata nd;
-
-	return open_namei(dfd, filename, flags, mode, &nd);
-}
-
 struct file *filp_open(const char *filename, int flags, int mode)
 {
-	return do_filp_open(AT_FDCWD, filename, flags, mode);
+	struct nameidata nd;
+	return open_namei(AT_FDCWD, filename, flags, mode, &nd);
 }
 EXPORT_SYMBOL(filp_open);
 
@@ -1072,20 +1065,24 @@ long do_sys_open(int dfd, const char __u
 	char *tmp = getname(filename);
 	int fd = PTR_ERR(tmp);
 
-	if (!IS_ERR(tmp)) {
-		fd = get_unused_fd_flags(flags);
-		if (fd >= 0) {
-			struct file *f = do_filp_open(dfd, tmp, flags, mode);
-			if (IS_ERR(f)) {
-				put_unused_fd(fd);
-				fd = PTR_ERR(f);
-			} else {
-				fsnotify_open(f->f_path.dentry);
-				fd_install(fd, f);
-			}
+	if (IS_ERR(tmp))
+		goto out;
+
+	fd = get_unused_fd_flags(flags);
+	if (fd >= 0) {
+		struct nameidata nd;
+		struct file *f = open_namei(dfd, tmp, flags, mode, &nd);
+
+		if (IS_ERR(f)) {
+			put_unused_fd(fd);
+			fd = PTR_ERR(f);
+		} else {
+			fsnotify_open(f->f_path.dentry);
+			fd_install(fd, f);
 		}
-		putname(tmp);
 	}
+	putname(tmp);
+out:
 	return fd;
 }
 
_


* [RFC][PATCH 6/7] kill filp_open()
  2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
                   ` (3 preceding siblings ...)
  2007-10-10 16:34 ` [RFC][PATCH 5/7] kill do_filp_open() Dave Hansen
@ 2007-10-10 16:34 ` Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 7/7] keep track of mnt_writer state of struct file Dave Hansen
  5 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen


Replace all callers with open_namei() directly, and move the
nameidata stack allocation into open_namei().

Does it make sense to still call it open_namei(), even though it
doesn't actually take a nameidata any more?

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 lxc-dave/drivers/usb/gadget/file_storage.c |    5 +-
 lxc-dave/fs/exec.c                         |    2 
 lxc-dave/fs/namei.c                        |   69 ++++++++++++++---------------
 lxc-dave/fs/open.c                         |    6 --
 lxc-dave/fs/reiserfs/journal.c             |    2 
 lxc-dave/include/linux/fs.h                |    2 
 lxc-dave/kernel/acct.c                     |    2 
 lxc-dave/mm/swapfile.c                     |    4 -
 lxc-dave/sound/sound_firmware.c            |    2 
 9 files changed, 47 insertions(+), 47 deletions(-)

diff -puN drivers/usb/gadget/file_storage.c~kill-filp_open drivers/usb/gadget/file_storage.c
--- lxc/drivers/usb/gadget/file_storage.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/drivers/usb/gadget/file_storage.c	2007-10-03 09:01:47.000000000 -0700
@@ -3468,16 +3468,17 @@ static int open_backing_file(struct lun 
 	struct inode			*inode = NULL;
 	loff_t				size;
 	loff_t				num_sectors;
+	int				mode = O_LARGEFILE;
 
 	/* R/W if we can, R/O if we must */
 	ro = curlun->ro;
 	if (!ro) {
-		filp = filp_open(filename, O_RDWR | O_LARGEFILE, 0);
+		filp = open_namei(AT_FDCWD, filename, O_RDWR | mode, 0);
 		if (-EROFS == PTR_ERR(filp))
 			ro = 1;
 	}
 	if (ro)
-		filp = filp_open(filename, O_RDONLY | O_LARGEFILE, 0);
+		filp = open_namei(AT_FDCWD, filename, O_RDONLY | mode, 0);
 	if (IS_ERR(filp)) {
 		LINFO(curlun, "unable to open backing file: %s\n", filename);
 		return PTR_ERR(filp);
diff -puN fs/exec.c~kill-filp_open fs/exec.c
--- lxc/fs/exec.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/fs/exec.c	2007-10-03 09:01:47.000000000 -0700
@@ -1764,7 +1764,7 @@ int do_coredump(long signr, int exit_cod
  			goto fail_unlock;
  		}
  	} else
- 		file = filp_open(corename,
+		file = open_namei(AT_FDCWD, corename,
 				 O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
 				 0600);
 	if (IS_ERR(file))
diff -puN fs/namei.c~kill-filp_open fs/namei.c
--- lxc/fs/namei.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/fs/namei.c	2007-10-03 09:01:47.000000000 -0700
@@ -1729,8 +1729,9 @@ static inline int sys_open_flags_to_name
  * SMP-safe
  */
 struct file *open_namei(int dfd, const char *pathname, int sys_open_flag,
-			int mode, struct nameidata *nd)
+			int mode)
 {
+	struct nameidata nd;
 	int acc_mode, error;
 	struct path path;
 	struct dentry *dir;
@@ -1753,7 +1754,7 @@ struct file *open_namei(int dfd, const c
 	 */
 	if (!(flag & O_CREAT)) {
 		error = path_lookup_open(dfd, pathname, lookup_flags(flag),
-					 nd, flag);
+					 &nd, flag);
 		if (error)
 			return ERR_PTR(error);
 		goto ok;
@@ -1762,7 +1763,7 @@ struct file *open_namei(int dfd, const c
 	/*
 	 * Create - we need to know the parent.
 	 */
-	error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
+	error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,&nd,flag,mode);
 	if (error)
 		return ERR_PTR(error);
 
@@ -1772,14 +1773,14 @@ struct file *open_namei(int dfd, const c
 	 * will not do.
 	 */
 	error = -EISDIR;
-	if (nd->last_type != LAST_NORM || nd->last.name[nd->last.len])
+	if (nd.last_type != LAST_NORM || nd.last.name[nd.last.len])
 		goto exit;
 
-	dir = nd->dentry;
-	nd->flags &= ~LOOKUP_PARENT;
+	dir = nd.dentry;
+	nd.flags &= ~LOOKUP_PARENT;
 	mutex_lock(&dir->d_inode->i_mutex);
-	path.dentry = lookup_hash(nd);
-	path.mnt = nd->mnt;
+	path.dentry = lookup_hash(&nd);
+	path.mnt = nd.mnt;
 
 do_last:
 	error = PTR_ERR(path.dentry);
@@ -1788,17 +1789,17 @@ do_last:
 		goto exit;
 	}
 
-	if (IS_ERR(nd->intent.open.file)) {
-		error = PTR_ERR(nd->intent.open.file);
+	if (IS_ERR(nd.intent.open.file)) {
+		error = PTR_ERR(nd.intent.open.file);
 		goto exit_mutex_unlock;
 	}
 
 	/* Negative dentry, just create the file */
 	if (!path.dentry->d_inode) {
-		error = __open_namei_create(nd, &path, flag, mode);
+		error = __open_namei_create(&nd, &path, flag, mode);
 		if (error)
 			goto exit;
-		return nameidata_to_filp(nd, sys_open_flag);
+		return nameidata_to_filp(&nd, sys_open_flag);
 	}
 
 	/*
@@ -1823,24 +1824,24 @@ do_last:
 	if (path.dentry->d_inode->i_op && path.dentry->d_inode->i_op->follow_link)
 		goto do_link;
 
-	path_to_nameidata(&path, nd);
+	path_to_nameidata(&path, &nd);
 	error = -EISDIR;
 	if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode))
 		goto exit;
 ok:
-	error = may_open(nd, acc_mode, flag);
+	error = may_open(&nd, acc_mode, flag);
 	if (error)
 		goto exit;
-	return nameidata_to_filp(nd, sys_open_flag);
+	return nameidata_to_filp(&nd, sys_open_flag);
 
 exit_mutex_unlock:
 	mutex_unlock(&dir->d_inode->i_mutex);
 exit_dput:
-	dput_path(&path, nd);
+	dput_path(&path, &nd);
 exit:
-	if (!IS_ERR(nd->intent.open.file))
-		release_open_intent(nd);
-	path_release(nd);
+	if (!IS_ERR(nd.intent.open.file))
+		release_open_intent(&nd);
+	path_release(&nd);
 	return ERR_PTR(error);
 
 do_link:
@@ -1854,42 +1855,42 @@ do_link:
 	 * After that we have the parent and last component, i.e.
 	 * we are in the same situation as after the first path_walk().
 	 * Well, almost - if the last component is normal we get its copy
-	 * stored in nd->last.name and we will have to putname() it when we
+	 * stored in nd.last.name and we will have to putname() it when we
 	 * are done. Procfs-like symlinks just set LAST_BIND.
 	 */
-	nd->flags |= LOOKUP_PARENT;
-	error = security_inode_follow_link(path.dentry, nd);
+	nd.flags |= LOOKUP_PARENT;
+	error = security_inode_follow_link(path.dentry, &nd);
 	if (error)
 		goto exit_dput;
-	error = __do_follow_link(&path, nd);
+	error = __do_follow_link(&path, &nd);
 	if (error) {
 		/* Does someone understand code flow here? Or it is only
 		 * me so stupid? Anathema to whoever designed this non-sense
 		 * with "intent.open".
 		 */
-		release_open_intent(nd);
+		release_open_intent(&nd);
 		return ERR_PTR(error);
 	}
-	nd->flags &= ~LOOKUP_PARENT;
-	if (nd->last_type == LAST_BIND)
+	nd.flags &= ~LOOKUP_PARENT;
+	if (nd.last_type == LAST_BIND)
 		goto ok;
 	error = -EISDIR;
-	if (nd->last_type != LAST_NORM)
+	if (nd.last_type != LAST_NORM)
 		goto exit;
-	if (nd->last.name[nd->last.len]) {
-		__putname(nd->last.name);
+	if (nd.last.name[nd.last.len]) {
+		__putname(nd.last.name);
 		goto exit;
 	}
 	error = -ELOOP;
 	if (count++==32) {
-		__putname(nd->last.name);
+		__putname(nd.last.name);
 		goto exit;
 	}
-	dir = nd->dentry;
+	dir = nd.dentry;
 	mutex_lock(&dir->d_inode->i_mutex);
-	path.dentry = lookup_hash(nd);
-	path.mnt = nd->mnt;
-	__putname(nd->last.name);
+	path.dentry = lookup_hash(&nd);
+	path.mnt = nd.mnt;
+	__putname(nd.last.name);
 	goto do_last;
 }
 
diff -puN fs/open.c~kill-filp_open fs/open.c
--- lxc/fs/open.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/fs/open.c	2007-10-03 09:01:47.000000000 -0700
@@ -845,8 +845,7 @@ cleanup_file:
 
 struct file *filp_open(const char *filename, int flags, int mode)
 {
-	struct nameidata nd;
-	return open_namei(AT_FDCWD, filename, flags, mode, &nd);
+	return open_namei(AT_FDCWD, filename, flags, mode);
 }
 EXPORT_SYMBOL(filp_open);
 
@@ -1050,8 +1049,7 @@ long do_sys_open(int dfd, const char __u
 
 	fd = get_unused_fd_flags(flags);
 	if (fd >= 0) {
-		struct nameidata nd;
-		struct file *f = open_namei(dfd, tmp, flags, mode, &nd);
+		struct file *f = open_namei(dfd, tmp, flags, mode);
 
 		if (IS_ERR(f)) {
 			put_unused_fd(fd);
diff -puN fs/reiserfs/journal.c~kill-filp_open fs/reiserfs/journal.c
--- lxc/fs/reiserfs/journal.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/fs/reiserfs/journal.c	2007-10-03 09:01:47.000000000 -0700
@@ -2623,7 +2623,7 @@ static int journal_init_dev(struct super
 		return 0;
 	}
 
-	journal->j_dev_file = filp_open(jdev_name, 0, 0);
+	journal->j_dev_file = open_namei(AT_FDCWD, jdev_name, 0, 0);
 	if (!IS_ERR(journal->j_dev_file)) {
 		struct inode *jdev_inode = journal->j_dev_file->f_mapping->host;
 		if (!S_ISBLK(jdev_inode->i_mode)) {
diff -puN include/linux/fs.h~kill-filp_open include/linux/fs.h
--- lxc/include/linux/fs.h~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/include/linux/fs.h	2007-10-03 09:01:47.000000000 -0700
@@ -1721,7 +1721,7 @@ extern struct file *create_read_pipe(str
 extern struct file *create_write_pipe(void);
 extern void free_write_pipe(struct file *);
 
-extern struct file *open_namei(int dfd, const char *, int, int, struct nameidata *);
+extern struct file *open_namei(int dfd, const char *, int, int);
 extern int may_open(struct nameidata *, int, int);
 
 extern int kernel_read(struct file *, unsigned long, char *, unsigned long);
diff -puN kernel/acct.c~kill-filp_open kernel/acct.c
--- lxc/kernel/acct.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/kernel/acct.c	2007-10-03 09:01:47.000000000 -0700
@@ -208,7 +208,7 @@ static int acct_on(char *name)
 	int error;
 
 	/* Difference from BSD - they don't do O_APPEND */
-	file = filp_open(name, O_WRONLY|O_APPEND|O_LARGEFILE, 0);
+	file = open_namei(AT_FDCWD, name, O_WRONLY|O_APPEND|O_LARGEFILE, 0);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
diff -puN mm/swapfile.c~kill-filp_open mm/swapfile.c
--- lxc/mm/swapfile.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/mm/swapfile.c	2007-10-03 09:01:47.000000000 -0700
@@ -1198,7 +1198,7 @@ asmlinkage long sys_swapoff(const char _
 	if (IS_ERR(pathname))
 		goto out;
 
-	victim = filp_open(pathname, O_RDWR|O_LARGEFILE, 0);
+	victim = open_namei(AT_FDCWD, pathname, O_RDWR|O_LARGEFILE, 0);
 	putname(pathname);
 	err = PTR_ERR(victim);
 	if (IS_ERR(victim))
@@ -1477,7 +1477,7 @@ asmlinkage long sys_swapon(const char __
 		name = NULL;
 		goto bad_swap_2;
 	}
-	swap_file = filp_open(name, O_RDWR|O_LARGEFILE, 0);
+	swap_file = open_namei(AT_FDCWD, name, O_RDWR|O_LARGEFILE, 0);
 	error = PTR_ERR(swap_file);
 	if (IS_ERR(swap_file)) {
 		swap_file = NULL;
diff -puN sound/sound_firmware.c~kill-filp_open sound/sound_firmware.c
--- lxc/sound/sound_firmware.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/sound/sound_firmware.c	2007-10-03 09:01:47.000000000 -0700
@@ -14,7 +14,7 @@ static int do_mod_firmware_load(const ch
 	char *dp;
 	loff_t pos;
 
-	filp = filp_open(fn, 0, 0);
+	filp = open_namei(AT_FDCWD, fn, 0, 0);
 	if (IS_ERR(filp))
 	{
 		printk(KERN_INFO "Unable to load '%s'.\n", fn);
_


* [RFC][PATCH 7/7] keep track of mnt_writer state of struct file
  2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
                   ` (4 preceding siblings ...)
  2007-10-10 16:34 ` [RFC][PATCH 6/7] kill filp_open() Dave Hansen
@ 2007-10-10 16:34 ` Dave Hansen
  5 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen


There have been a few oopses caused by 'struct file's with
NULL f_vfsmnts.  There was also a set of potentially missed
mnt_want_write()s from dentry_open() calls.

This patch provides a very simple debugging framework to
catch these kinds of bugs.  It will WARN_ON() them, but
should stop us from having any oopses or mnt_writer
count imbalances.
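
The state tracking can be modelled in userspace roughly as below. This is a sketch under assumptions, not the patch itself: the model_* names are hypothetical, and a warning counter stands in for WARN_ON(). It shows the two bits, the check at release time, and the check at free time.

```c
#include <assert.h>

#define MODEL_WRITE_TAKEN	1
#define MODEL_WRITE_RELEASED	2

struct model_file { unsigned long write_state; int writable; };
struct model_mount { int writers; };

/* Stand-in for WARN_ON(): count warnings instead of printing them. */
static int model_warnings;

/* Taking the mount write count records that fact in the file. */
static void model_open_writable(struct model_file *f, struct model_mount *m)
{
	f->writable = 1;
	m->writers++;
	f->write_state = MODEL_WRITE_TAKEN;
}

/* Model of the __fput() check: only drop what was recorded as taken. */
static void model_fput(struct model_file *f, struct model_mount *m)
{
	if (!f->writable)
		return;
	if (f->write_state == MODEL_WRITE_TAKEN) {
		m->writers--;
		f->write_state |= MODEL_WRITE_RELEASED;
	} else {
		model_warnings++;	/* writable file with no recorded count */
	}
}

/* Model of the file_free() check: exactly one bit set is an imbalance. */
static void model_file_free(const struct model_file *f)
{
	if (f->write_state == MODEL_WRITE_TAKEN)
		model_warnings++;
	if (f->write_state == MODEL_WRITE_RELEASED)
		model_warnings++;
}
```

A file that takes and releases its count passes silently. A writable file that never took a count warns at release without touching the mount count, and a file freed while still holding a count warns at free time.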


Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 lxc-dave/fs/file_table.c    |   21 +++++++++++++++++++--
 lxc-dave/fs/open.c          |    2 ++
 lxc-dave/include/linux/fs.h |    4 ++++
 3 files changed, 25 insertions(+), 2 deletions(-)

diff -puN fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file fs/file_table.c
--- lxc/fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file	2007-10-03 14:40:14.000000000 -0700
+++ lxc-dave/fs/file_table.c	2007-10-03 14:40:14.000000000 -0700
@@ -42,6 +42,12 @@ static inline void file_free_rcu(struct 
 static inline void file_free(struct file *f)
 {
 	percpu_counter_dec(&nr_files);
+	/*
+	 * At this point, either both or neither of these bits
+	 * should be set.
+	 */
+	WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_TAKEN);
+	WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_RELEASED);
 	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
 }
 
@@ -194,6 +200,7 @@ int init_file(struct file *file, struct 
 	file->f_mode = mode;
 	file->f_op = fop;
 	if (mode & FMODE_WRITE) {
+		file->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
 		error = mnt_want_write(mnt);
 		WARN_ON(error);
 	}
@@ -236,8 +243,18 @@ void fastcall __fput(struct file *file)
 	fops_put(file->f_op);
 	if (file->f_mode & FMODE_WRITE) {
 		put_write_access(inode);
-		if (!special_file(inode->i_mode))
-			mnt_drop_write(mnt);
+		if (!special_file(inode->i_mode)) {
+			if (file->f_mnt_write_state == FILE_MNT_WRITE_TAKEN) {
+				mnt_drop_write(mnt);
+				file->f_mnt_write_state |=
+					FILE_MNT_WRITE_RELEASED;
+			} else {
+				printk(KERN_WARNING "__fput() of writeable "
+						"file with no "
+						"mnt_want_write()\n");
+				WARN_ON(1);
+			}
+		}
 	}
 	put_pid(file->f_owner.pid);
 	file_kill(file);
diff -puN fs/namei.c~keep-track-of-mnt_writer-state-of-struct-file fs/namei.c
diff -puN include/linux/fs.h~keep-track-of-mnt_writer-state-of-struct-file include/linux/fs.h
--- lxc/include/linux/fs.h~keep-track-of-mnt_writer-state-of-struct-file	2007-10-03 14:40:14.000000000 -0700
+++ lxc-dave/include/linux/fs.h	2007-10-03 14:40:14.000000000 -0700
@@ -779,6 +779,9 @@ static inline int ra_has_index(struct fi
 		index <  ra->start + ra->size);
 }
 
+#define FILE_MNT_WRITE_TAKEN	1
+#define FILE_MNT_WRITE_RELEASED	2
+
 struct file {
 	/*
 	 * fu_list becomes invalid after file_free is called and queued via
@@ -813,6 +816,7 @@ struct file {
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
+	unsigned long f_mnt_write_state;
 };
 extern spinlock_t files_lock;
 #define file_list_lock() spin_lock(&files_lock);
diff -puN fs/open.c~keep-track-of-mnt_writer-state-of-struct-file fs/open.c
--- lxc/fs/open.c~keep-track-of-mnt_writer-state-of-struct-file	2007-10-03 14:40:14.000000000 -0700
+++ lxc-dave/fs/open.c	2007-10-03 14:42:01.000000000 -0700
@@ -790,6 +790,8 @@ static struct file *__dentry_open(struct
 			mnt_drop_write(mnt);
 			goto cleanup_file;
 		}
+		WARN_ON(f->f_mnt_write_state != 0);
+		f->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
 	}
 
 	f->f_mapping = inode->i_mapping;
_
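
The state tracking added by this patch can be modelled in userspace. The following is only a sketch, not kernel code: struct model_mnt, struct model_file and the model_*() helpers are hypothetical stand-ins for struct vfsmount, struct file, the open path, __fput() and file_free(), and special files are left out for brevity. Only the two FILE_MNT_WRITE_* values are taken from the patch.

```c
#include <stdbool.h>

/* Same values as the FILE_MNT_WRITE_* defines in the patch. */
#define FILE_MNT_WRITE_TAKEN	1
#define FILE_MNT_WRITE_RELEASED	2

/* Hypothetical stand-ins for struct vfsmount and struct file. */
struct model_mnt {
	int writers;			/* models the mount write count */
};

struct model_file {
	struct model_mnt *mnt;
	bool writable;			/* models f_mode & FMODE_WRITE */
	unsigned long write_state;	/* models f_mnt_write_state */
};

/* Open for write, taking the mount write count and recording it. */
static void model_open_tracked(struct model_file *f, struct model_mnt *m)
{
	f->mnt = m;
	f->writable = true;
	m->writers++;
	f->write_state = FILE_MNT_WRITE_TAKEN;
}

/* The buggy case: a writable file that never took the count. */
static void model_open_untracked(struct model_file *f, struct model_mnt *m)
{
	f->mnt = m;
	f->writable = true;
	f->write_state = 0;
}

/*
 * Models the __fput() hunk.  Returns true if the release was balanced,
 * false where the patch would print its warning.
 */
static bool model_fput(struct model_file *f)
{
	if (!f->writable)
		return true;
	if (f->write_state != FILE_MNT_WRITE_TAKEN)
		return false;
	f->mnt->writers--;
	f->write_state |= FILE_MNT_WRITE_RELEASED;
	return true;
}

/* Models the file_free() check: both bits set, or neither. */
static bool model_state_ok_at_free(const struct model_file *f)
{
	return f->write_state == 0 ||
	       f->write_state == (FILE_MNT_WRITE_TAKEN | FILE_MNT_WRITE_RELEASED);
}
```

In this model a second model_fput() on the same file, or a model_fput() on an untracked writable file, reports the imbalance without touching the writer count, which is the behaviour the WARN_ON()s above are meant to surface.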


* Re: [RFC][PATCH 2/7] get mount write in __dentry_open()
  2007-10-10 16:34 ` [RFC][PATCH 2/7] get mount write in __dentry_open() Dave Hansen
@ 2007-10-11 15:08   ` Miklos Szeredi
  2007-10-11 18:16     ` Dave Hansen
  0 siblings, 1 reply; 12+ messages in thread
From: Miklos Szeredi @ 2007-10-11 15:08 UTC (permalink / raw)
  To: haveblue; +Cc: linux-kernel, miklos, hch, haveblue

> 
> 
> The first patch fixes an actual bug.  I think the
> rest will reduce the chance for any future bugs
> to creep in.
> 
> --
> 
> The r/o bind mount patches require matching mnt_want_write()
> at filp creation time with a mnt_drop_write() at __fput().
> 
> We used to do this in may_open(), but Miklos pointed out
> that __dentry_open() is used as well to create filps.  We
> don't currently do mnt_want_write() for these.
> 
> If a filp on a writeable file is created this way, and
> destroyed via __fput() we'll get a mount count imbalance.
> 
> This patch moves the mount write count acquisition from
> may_open() into __dentry_open(), where we should catch
> many more of the users.
> 
> Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
> ---
> 
>  lxc-dave/fs/namei.c |   12 ------------
>  lxc-dave/fs/open.c  |   45 ++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 38 insertions(+), 19 deletions(-)
> 
> diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
> --- lxc/fs/namei.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
> +++ lxc-dave/fs/namei.c	2007-10-04 18:02:48.000000000 -0700
> @@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
>  			return -EACCES;
>  
>  		flag &= ~O_TRUNC;
> -	} else if (flag & FMODE_WRITE) {
> -		/*
> -		 * effectively: !special_file()
> -		 * balanced by __fput()
> -		 */
> -		error = mnt_want_write(nd->mnt);
> -		if (error)
> -			return error;
>  	}

Maybe readonly should still be checked here, so that the order of
error checking doesn't change.  If racing with a read-only remount the
order is irrelevant anyway.  Something like this?

	} else if ((flag & FMODE_WRITE) && __mnt_is_readonly(nd->mnt)) {
		return -EROFS;
	}

>  	error = vfs_permission(nd, acc_mode);
> @@ -1778,11 +1770,7 @@ do_last:
>  
>  	/* Negative dentry, just create the file */
>  	if (!path.dentry->d_inode) {
> -		error = mnt_want_write(nd->mnt);
> -		if (error)
> -			goto exit_mutex_unlock;
>  		error = open_namei_create(nd, &path, flag, mode);
> -		mnt_drop_write(nd->mnt);

This is still needed, isn't it?

And they should be added around do_truncate() as well, since you
remove the protection from may_open().

This one introduces an interesting race between ro-remount and
open(O_TRUNC), where the truncate can succeed but the open fail with
EROFS.  Is that a problem?

>  		if (error)
>  			goto exit;
>  		return 0;
> diff -puN fs/open.c~get-write-in-__dentry_open fs/open.c
> --- lxc/fs/open.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
> +++ lxc-dave/fs/open.c	2007-10-04 18:02:48.000000000 -0700
> @@ -766,22 +766,51 @@ out:
>  	return error;
>  }
>  
> +/*
> + * You have to be very careful that these write
> + * counts get cleaned up in error cases and
> + * upon __fput().  This should probably never
> + * be called outside of __dentry_open().
> + */
> +static inline int __get_file_write_access(struct inode *inode,
> +					  struct vfsmount *mnt)
> +{
> +	int error;
> +	error = get_write_access(inode);
> +	if (error)
> +		return error;
> +	/*
> +	 * Do not take mount writer counts on
> +	 * special files since no writes to
> +	 * the mount itself will occur.
> +	 */
> +	if (special_file(inode->i_mode))
> +		return 0;
> +
> +	/*
> +	 * Balanced in __fput()
> +	 */
> +	error = mnt_want_write(mnt);
> +	if (error)
> +		put_write_access(inode);
> +	return error;
> +}
> +
>  static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
>  					int flags, struct file *f,
>  					int (*open)(struct inode *, struct file *))
>  {
>  	struct inode *inode;
> -	int error;
> +	int error = 0;
>  
>  	f->f_flags = flags;
>  	f->f_mode = ((flags+1) & O_ACCMODE) | FMODE_LSEEK |
>  				FMODE_PREAD | FMODE_PWRITE;
>  	inode = dentry->d_inode;
> -	if (f->f_mode & FMODE_WRITE) {
> -		error = get_write_access(inode);
> -		if (error)
> -			goto cleanup_file;
> -	}
> +	if (f->f_mode & FMODE_WRITE)
> +		error = __get_file_write_access(inode, mnt);
> +	if (error)
> +		goto cleanup_file;
>  
>  	f->f_mapping = inode->i_mapping;
>  	f->f_path.dentry = dentry;
> @@ -820,8 +849,10 @@ static struct file *__dentry_open(struct
>  
>  cleanup_all:
>  	fops_put(f->f_op);
> -	if (f->f_mode & FMODE_WRITE)
> +	if (f->f_mode & FMODE_WRITE) {
>  		put_write_access(inode);
> +		mnt_drop_write(mnt);

Shouldn't this be conditional on !special_file()?

Miklos


* Re: [RFC][PATCH 4/7] make open_namei() return a filp
  2007-10-10 16:34 ` [RFC][PATCH 4/7] make open_namei() return a filp Dave Hansen
@ 2007-10-11 15:24   ` Miklos Szeredi
  0 siblings, 0 replies; 12+ messages in thread
From: Miklos Szeredi @ 2007-10-11 15:24 UTC (permalink / raw)
  To: haveblue; +Cc: linux-kernel, miklos, hch, haveblue

> If open_namei() succeeds, there is potentially a mnt_want_write()
> that needs to get balanced.  If the caller doesn't create a
> 'struct file' and eventually __fput() it, or manually drop the
> write count on an error, we have a bug.
> 
> Forcing open_namei() to return a filp fixes this.  Any caller
> getting a 'struct file' back must consider that filp instantiated
> and fput() it normally.  The callers no longer have to worry about
> ever manually releasing a mnt write count.

I think this changelog is out of date, as this problem should have
been dealt with in patch 2/7.

Otherwise I don't object to this change, it looks like a fine cleanup.

Miklos


* Re: [RFC][PATCH 2/7] get mount write in __dentry_open()
  2007-10-11 15:08   ` Miklos Szeredi
@ 2007-10-11 18:16     ` Dave Hansen
  2007-10-11 18:31       ` Miklos Szeredi
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2007-10-11 18:16 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, hch

On Thu, 2007-10-11 at 17:08 +0200, Miklos Szeredi wrote:
> > diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
> > --- lxc/fs/namei.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
> > +++ lxc-dave/fs/namei.c	2007-10-04 18:02:48.000000000 -0700
> > @@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
> >  			return -EACCES;
> >  
> >  		flag &= ~O_TRUNC;
> > -	} else if (flag & FMODE_WRITE) {
> > -		/*
> > -		 * effectively: !special_file()
> > -		 * balanced by __fput()
> > -		 */
> > -		error = mnt_want_write(nd->mnt);
> > -		if (error)
> > -			return error;
> >  	}
> 
> Maybe readonly should still be checked here, so that the order of
> error checking doesn't change.  If racing with a read-only remount the
> order is irrelevant anyway.  Something like this?
> 
> 	} else if ((flag & FMODE_WRITE) && __mnt_is_readonly(nd->mnt)) {
> 		return -EROFS;
> 	}

I think it would be a bug if anything actually managed to trip that
code.  All of the may_open() calls should have been covered by the
__dentry_open() mnt writer.

> >  	error = vfs_permission(nd, acc_mode);
> > @@ -1778,11 +1770,7 @@ do_last:
> >  
> >  	/* Negative dentry, just create the file */
> >  	if (!path.dentry->d_inode) {
> > -		error = mnt_want_write(nd->mnt);
> > -		if (error)
> > -			goto exit_mutex_unlock;
> >  		error = open_namei_create(nd, &path, flag, mode);
> > -		mnt_drop_write(nd->mnt);
> 
> This is still needed, isn't it?

Yes, it is.  I'll add a big fat comment this time about why we need it.

> And they should be added around do_truncate() as well, since you
> remove the protection from may_open().
> 
> This one introduces an interesting race between ro-remount and
> open(O_TRUNC), where the truncate can succeed but the open fail with
> EROFS.  Is that a problem?

You're right, this does introduce that race, and it is relatively hard
to fix properly.  But, the 'return a filp' patch makes it easy to fix.
I've put a temporary kludge in the updated version of this patch, and
fixed it properly in that later patch.  

> >  cleanup_all:
> >  	fops_put(f->f_op);
> > -	if (f->f_mode & FMODE_WRITE)
> > +	if (f->f_mode & FMODE_WRITE) {
> >  		put_write_access(inode);
> > +		mnt_drop_write(mnt);
> 
> Shouldn't this be conditional on !special_file()?

It certainly should.
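
The symmetry being asked for can be sketched in userspace. This is a minimal model under stated assumptions, not the patch itself: struct sketch_inode, struct sketch_mount and both helpers are hypothetical names, get_write_access() is assumed to always succeed, and mnt_want_write() is assumed to fail only when the mount is read-only.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for struct inode and struct vfsmount. */
struct sketch_inode {
	int i_writecount;
	bool special;		/* models special_file(inode->i_mode) */
};

struct sketch_mount {
	int writers;		/* models the mount write count */
	bool readonly;
};

/* Models __get_file_write_access(): skip the mount count for special files. */
static int sketch_get_file_write_access(struct sketch_inode *inode,
					struct sketch_mount *mnt)
{
	inode->i_writecount++;		/* get_write_access(), assumed to succeed */
	if (inode->special)
		return 0;
	if (mnt->readonly) {		/* mnt_want_write() failing */
		inode->i_writecount--;	/* put_write_access() */
		return -EROFS;
	}
	mnt->writers++;
	return 0;
}

/* Models the cleanup path: drop exactly what the acquire side took. */
static void sketch_put_file_write_access(struct sketch_inode *inode,
					 struct sketch_mount *mnt)
{
	inode->i_writecount--;
	if (!inode->special)
		mnt->writers--;
}
```

Without the !special check on the release side, a failed open of a device node would drive the writer count below zero in this model.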

-- Dave



* Re: [RFC][PATCH 2/7] get mount write in __dentry_open()
  2007-10-11 18:16     ` Dave Hansen
@ 2007-10-11 18:31       ` Miklos Szeredi
  2007-10-11 19:24         ` Dave Hansen
  0 siblings, 1 reply; 12+ messages in thread
From: Miklos Szeredi @ 2007-10-11 18:31 UTC (permalink / raw)
  To: haveblue; +Cc: miklos, linux-kernel, hch

> On Thu, 2007-10-11 at 17:08 +0200, Miklos Szeredi wrote:
> > > diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
> > > --- lxc/fs/namei.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
> > > +++ lxc-dave/fs/namei.c	2007-10-04 18:02:48.000000000 -0700
> > > @@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
> > >  			return -EACCES;
> > >  
> > >  		flag &= ~O_TRUNC;
> > > -	} else if (flag & FMODE_WRITE) {
> > > -		/*
> > > -		 * effectively: !special_file()
> > > -		 * balanced by __fput()
> > > -		 */
> > > -		error = mnt_want_write(nd->mnt);
> > > -		if (error)
> > > -			return error;
> > >  	}
> > 
> > Maybe readonly should still be checked here, so that the order of
> > error checking doesn't change.  If racing with a read-only remount the
> > order is irrelevant anyway.  Something like this?
> > 
> > 	} else if ((flag & FMODE_WRITE) && __mnt_is_readonly(nd->mnt)) {
> > 		return -EROFS;
> > 	}
> 
> I think it would be a bug if anything actually managed to trip that
> code.  All of the may_open() calls should have been covered by the
> __dentry_open() mnt writer.

AFAICS, __dentry_open() will normally be called later than may_open().
And we don't want it earlier, because ->open() may have side effects
that could be unsafe if done before permission checking.

> > And they should be added around do_truncate() as well, since you
> > remove the protection from may_open().
> > 
> > This one introduces an interesting race between ro-remount and
> > open(O_TRUNC), where the truncate can succeed but the open fail with
> > EROFS.  Is that a problem?
> 
> You're right, this does introduce that race, and it is relatively hard
> to fix properly.  But, the 'return a filp' patch makes it easy to fix.
> I've put a temporary kludge in the updated version of this patch, and
> fixed it properly in that later patch.  

If you fix this properly, that should take care of the first problem
as well.

Miklos


* Re: [RFC][PATCH 2/7] get mount write in __dentry_open()
  2007-10-11 18:31       ` Miklos Szeredi
@ 2007-10-11 19:24         ` Dave Hansen
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2007-10-11 19:24 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, hch

On Thu, 2007-10-11 at 20:31 +0200, Miklos Szeredi wrote:
> > On Thu, 2007-10-11 at 17:08 +0200, Miklos Szeredi wrote:
> > > > diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
> > > > --- lxc/fs/namei.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
> > > > +++ lxc-dave/fs/namei.c	2007-10-04 18:02:48.000000000 -0700
> > > > @@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
> > > >  			return -EACCES;
> > > >  
> > > >  		flag &= ~O_TRUNC;
> > > > -	} else if (flag & FMODE_WRITE) {
> > > > -		/*
> > > > -		 * effectively: !special_file()
> > > > -		 * balanced by __fput()
> > > > -		 */
> > > > -		error = mnt_want_write(nd->mnt);
> > > > -		if (error)
> > > > -			return error;
> > > >  	}
> > > 
> > > Maybe readonly should still be checked here, so that the order of
> > > error checking doesn't change.  If racing with a read-only remount the
> > > order is irrelevant anyway.  Something like this?
> > > 
> > > 	} else if ((flag & FMODE_WRITE) && __mnt_is_readonly(nd->mnt)) {
> > > 		return -EROFS;
> > > 	}
> > 
> > I think it would be a bug if anything actually managed to trip that
> > code.  All of the may_open() calls should have been covered by the
> > __dentry_open() mnt writer.
> 
> AFAICS, __dentry_open() will normally be called later than may_open().
> And we don't want it earlier, because ->open() may have side effects
> that could be unsafe if done before permission checking.

I actually check the mount write count before the ->open() in
__dentry_open().  The truncates are also definitely wrapped in their own
mnt_want_write() calls now.
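
To illustrate what wrapping the truncate in its own pair does and does not buy, here is a toy userspace model. Everything in it is an assumption made for illustration: struct toy_mount and the toy_*() helpers are hypothetical, and the model assumes that a read-only remount is refused with -EBUSY while any writer count is held.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct vfsmount. */
struct toy_mount {
	int writers;
	bool readonly;
};

/* Models mnt_want_write(): refuse new writers on a read-only mount. */
static int toy_want_write(struct toy_mount *m)
{
	if (m->readonly)
		return -EROFS;
	m->writers++;
	return 0;
}

/* Models mnt_drop_write(). */
static void toy_drop_write(struct toy_mount *m)
{
	m->writers--;
}

/* Models a read-only remount (assumed to fail while writers are held). */
static int toy_remount_ro(struct toy_mount *m)
{
	if (m->writers > 0)
		return -EBUSY;
	m->readonly = true;
	return 0;
}

/* A truncate wrapped in its own want/drop pair. */
static int toy_truncate(struct toy_mount *m, long *size)
{
	int error = toy_want_write(m);

	if (error)
		return error;
	*size = 0;
	toy_drop_write(m);
	return 0;
}
```

In this model the truncate itself is safe, but a remount that lands after toy_truncate() returns and before the file takes its own writer count still makes the later toy_want_write() fail with -EROFS. That is the window discussed above, and it only closes if one writer count is held across both steps.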

> > > And they should be added around do_truncate() as well, since you
> > > remove the protection from may_open().
> > > 
> > > This one introduces an interesting race between ro-remount and
> > > open(O_TRUNC), where the truncate can succeed but the open fail with
> > > EROFS.  Is that a problem?
> > 
> > You're right, this does introduce that race, and it is relatively hard
> > to fix properly.  But, the 'return a filp' patch makes it easy to fix.
> > I've put a temporary kludge in the updated version of this patch, and
> > fixed it properly in that later patch.  
> 
> If you fix this properly, that should take care of the first problem
> as well.

Yup.  New series coming up. 

-- Dave



Thread overview: 12+ messages
2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
2007-10-10 16:34 ` [RFC][PATCH 2/7] get mount write in __dentry_open() Dave Hansen
2007-10-11 15:08   ` Miklos Szeredi
2007-10-11 18:16     ` Dave Hansen
2007-10-11 18:31       ` Miklos Szeredi
2007-10-11 19:24         ` Dave Hansen
2007-10-10 16:34 ` [RFC][PATCH 3/7] do namei_flags calculation inside open_namei() Dave Hansen
2007-10-10 16:34 ` [RFC][PATCH 4/7] make open_namei() return a filp Dave Hansen
2007-10-11 15:24   ` Miklos Szeredi
2007-10-10 16:34 ` [RFC][PATCH 5/7] kill do_filp_open() Dave Hansen
2007-10-10 16:34 ` [RFC][PATCH 6/7] kill filp_open() Dave Hansen
2007-10-10 16:34 ` [RFC][PATCH 7/7] keep track of mnt_writer state of struct file Dave Hansen
