[RFC][PATCH 2/7] get mount write in __dentry

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC][PATCH 2/7] get mount write in __dentry_open()
  2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
@ 2007-10-10 16:34 ` Dave Hansen
  2007-10-11 15:08   ` Miklos Szeredi
  2007-10-10 16:34 ` [RFC][PATCH 3/7] do namei_flags calculation inside open_namei() Dave Hansen
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen


The first patch fixes an actual bug.  I think the
reset will reduce the chance for any future bugs
to creep in.

--

The r/o bind mount patches require matching mnt_want_write()
at filp creation time with a mnt_drop_write() at __fput().

We used to do this in may_open(), but Miklos pointed out
that __dentry_open() is used as well to create filps.  We
don't currently do mnt_want_write() for these.

If a filp on a writeable file is created this way, and
destroyed via __fput() we'll get a mount count imbalance.

This patch moves the mount write count acquisition from
may_open() into __dentry_open(), where we should catch
many more of the users.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 lxc-dave/fs/namei.c |   12 ------------
 lxc-dave/fs/open.c  |   45 ++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 38 insertions(+), 19 deletions(-)

diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
--- lxc/fs/namei.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
+++ lxc-dave/fs/namei.c	2007-10-04 18:02:48.000000000 -0700
@@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
 			return -EACCES;
 
 		flag &= ~O_TRUNC;
-	} else if (flag & FMODE_WRITE) {
-		/*
-		 * effectively: !special_file()
-		 * balanced by __fput()
-		 */
-		error = mnt_want_write(nd->mnt);
-		if (error)
-			return error;
 	}
 
 	error = vfs_permission(nd, acc_mode);
@@ -1778,11 +1770,7 @@ do_last:
 
 	/* Negative dentry, just create the file */
 	if (!path.dentry->d_inode) {
-		error = mnt_want_write(nd->mnt);
-		if (error)
-			goto exit_mutex_unlock;
 		error = open_namei_create(nd, &path, flag, mode);
-		mnt_drop_write(nd->mnt);
 		if (error)
 			goto exit;
 		return 0;
diff -puN fs/open.c~get-write-in-__dentry_open fs/open.c
--- lxc/fs/open.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
+++ lxc-dave/fs/open.c	2007-10-04 18:02:48.000000000 -0700
@@ -766,22 +766,51 @@ out:
 	return error;
 }
 
+/*
+ * You have to be very careful that these write
+ * counts get cleaned up in error cases and
+ * upon __fput().  This should probably never
+ * be called outside of __dentry_open().
+ */
+static inline int __get_file_write_access(struct inode *inode,
+					  struct vfsmount *mnt)
+{
+	int error;
+	error = get_write_access(inode);
+	if (error)
+		return error;
+	/*
+	 * Do not take mount writer counts on
+	 * special files since no writes to
+	 * the mount itself will occur.
+	 */
+	if (special_file(inode->i_mode))
+		return 0;
+
+	/*
+	 * Balanced in __fput()
+	 */
+	error = mnt_want_write(mnt);
+	if (error)
+		put_write_access(inode);
+	return error;
+}
+
 static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
 					int flags, struct file *f,
 					int (*open)(struct inode *, struct file *))
 {
 	struct inode *inode;
-	int error;
+	int error = 0;
 
 	f->f_flags = flags;
 	f->f_mode = ((flags+1) & O_ACCMODE) | FMODE_LSEEK |
 				FMODE_PREAD | FMODE_PWRITE;
 	inode = dentry->d_inode;
-	if (f->f_mode & FMODE_WRITE) {
-		error = get_write_access(inode);
-		if (error)
-			goto cleanup_file;
-	}
+	if (f->f_mode & FMODE_WRITE)
+		error = __get_file_write_access(inode, mnt);
+	if (error)
+		goto cleanup_file;
 
 	f->f_mapping = inode->i_mapping;
 	f->f_path.dentry = dentry;
@@ -820,8 +849,10 @@ static struct file *__dentry_open(struct
 
 cleanup_all:
 	fops_put(f->f_op);
-	if (f->f_mode & FMODE_WRITE)
+	if (f->f_mode & FMODE_WRITE) {
 		put_write_access(inode);
+		mnt_drop_write(mnt);
+	}
 	file_kill(f);
 	f->f_path.dentry = NULL;
 	f->f_path.mnt = NULL;
diff -puN fs/file_table.c~get-write-in-__dentry_open fs/file_table.c
_

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC][PATCH 1/7] init_file(): only take writes on normal files
@ 2007-10-10 16:34 Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 2/7] get mount write in __dentry_open() Dave Hansen
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen



---

 lxc-dave/fs/file_table.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff -puN fs/file_table.c~init_file-only-take-writes-on-normal-files fs/file_table.c
--- lxc/fs/file_table.c~init_file-only-take-writes-on-normal-files	2007-10-04 13:01:59.000000000 -0700
+++ lxc-dave/fs/file_table.c	2007-10-04 13:03:03.000000000 -0700
@@ -199,7 +199,12 @@ int init_file(struct file *file, struct 
 	file->f_mapping = dentry->d_inode->i_mapping;
 	file->f_mode = mode;
 	file->f_op = fop;
-	if (mode & FMODE_WRITE) {
+	/*
+	 * These mounts don't really matter in practice
+	 * for r/o bind mounts.  They aren't userspace-
+	 * visible.  We do this for consistency.
+	 */
+	if ((mode & FMODE_WRITE) && !special_file(dentry->d_inode->i_mode)) {
 		file->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
 		error = mnt_want_write(mnt);
 		WARN_ON(error);
_

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC][PATCH 3/7] do namei_flags calculation inside open_namei()
  2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 2/7] get mount write in __dentry_open() Dave Hansen
@ 2007-10-10 16:34 ` Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 4/7] make open_namei() return a filp Dave Hansen
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen


My end goal here is to make sure all users of may_open()
return filps.  This will ensure that we properly release
mount write counts which were taken for the filp in
may_open().

This patch moves the sys_open flags to namei flags
calculation into fs/namei.c.  We'll shortly be moving
the nameidata_to_filp() calls into namei.c, and this
gets the sys_open flags to a place where we can get
at them when we need them.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 lxc-dave/fs/namei.c |   43 ++++++++++++++++++++++++++++++++++---------
 lxc-dave/fs/open.c  |   22 ++--------------------
 2 files changed, 36 insertions(+), 29 deletions(-)

diff -puN fs/namei.c~do-namei_flags-calculation-inside-open_namei fs/namei.c
--- lxc/fs/namei.c~do-namei_flags-calculation-inside-open_namei	2007-10-04 13:13:00.000000000 -0700
+++ lxc-dave/fs/namei.c	2007-10-04 13:13:00.000000000 -0700
@@ -1672,7 +1672,12 @@ int may_open(struct nameidata *nd, int a
 	return 0;
 }
 
-static int open_namei_create(struct nameidata *nd, struct path *path,
+/*
+ * Be careful about ever adding any more callers of this
+ * function.  Its flags must be in the namei format, not
+ * what get passed to sys_open().
+ */
+static int __open_namei_create(struct nameidata *nd, struct path *path,
 				int flag, int mode)
 {
 	int error;
@@ -1691,26 +1696,46 @@ static int open_namei_create(struct name
 }
 
 /*
+ * Note that while the flag value (low two bits) for sys_open means:
+ *	00 - read-only
+ *	01 - write-only
+ *	10 - read-write
+ *	11 - special
+ * it is changed into
+ *	00 - no permissions needed
+ *	01 - read-permission
+ *	10 - write-permission
+ *	11 - read-write
+ * for the internal routines (ie open_namei()/follow_link() etc)
+ * This is more logical, and also allows the 00 "no perm needed"
+ * to be used for symlinks (where the permissions are checked
+ * later).
+ *
+*/
+static inline int sys_open_flags_to_namei_flags(int flag)
+{
+	if ((flag+1) & O_ACCMODE)
+		flag++;
+	return flag;
+}
+
+/*
  *	open_namei()
  *
  * namei for open - this is in fact almost the whole open-routine.
  *
  * Note that the low bits of "flag" aren't the same as in the open
- * system call - they are 00 - no permissions needed
- *			  01 - read permission needed
- *			  10 - write permission needed
- *			  11 - read/write permissions needed
- * which is a lot more logical, and also allows the "no perm" needed
- * for symlinks (where the permissions are checked later).
+ * system call.  See sys_open_flags_to_namei_flags().
  * SMP-safe
  */
-int open_namei(int dfd, const char *pathname, int flag,
+int open_namei(int dfd, const char *pathname, int sys_open_flag,
 		int mode, struct nameidata *nd)
 {
 	int acc_mode, error;
 	struct path path;
 	struct dentry *dir;
 	int count = 0;
+	int flag = sys_open_flags_to_namei_flags(sys_open_flag);
 
 	acc_mode = ACC_MODE(flag);
 
@@ -1770,7 +1795,7 @@ do_last:
 
 	/* Negative dentry, just create the file */
 	if (!path.dentry->d_inode) {
-		error = open_namei_create(nd, &path, flag, mode);
+		error = __open_namei_create(nd, &path, flag, mode);
 		if (error)
 			goto exit;
 		return 0;
diff -puN fs/open.c~do-namei_flags-calculation-inside-open_namei fs/open.c
--- lxc/fs/open.c~do-namei_flags-calculation-inside-open_namei	2007-10-04 13:13:00.000000000 -0700
+++ lxc-dave/fs/open.c	2007-10-04 13:13:00.000000000 -0700
@@ -863,31 +863,13 @@ cleanup_file:
 	return ERR_PTR(error);
 }
 
-/*
- * Note that while the flag value (low two bits) for sys_open means:
- *	00 - read-only
- *	01 - write-only
- *	10 - read-write
- *	11 - special
- * it is changed into
- *	00 - no permissions needed
- *	01 - read-permission
- *	10 - write-permission
- *	11 - read-write
- * for the internal routines (ie open_namei()/follow_link() etc). 00 is
- * used by symlinks.
- */
 static struct file *do_filp_open(int dfd, const char *filename, int flags,
 				 int mode)
 {
-	int namei_flags, error;
+	int error;
 	struct nameidata nd;
 
-	namei_flags = flags;
-	if ((namei_flags+1) & O_ACCMODE)
-		namei_flags++;
-
-	error = open_namei(dfd, filename, namei_flags, mode, &nd);
+	error = open_namei(dfd, filename, flags, mode, &nd);
 	if (!error)
 		return nameidata_to_filp(&nd, flags);
 
_

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC][PATCH 4/7] make open_namei() return a filp
  2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 2/7] get mount write in __dentry_open() Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 3/7] do namei_flags calculation inside open_namei() Dave Hansen
@ 2007-10-10 16:34 ` Dave Hansen
  2007-10-11 15:24   ` Miklos Szeredi
  2007-10-10 16:34 ` [RFC][PATCH 5/7] kill do_filp_open() Dave Hansen
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen


If open_namei() succeeds, there is potentially a mnt_want_write()
that needs to get balanced.  If the caller doesn't create a
'struct file' and eventually __fput() it, or manually drop the
write count on an error, we have a bug.

Forcing open_namei() to return a filp fixes this.  Any caller
getting a 'struct file' back must consider that filp instantiated
and fput() it normally.  The callers no longer have to worry about
ever manually releasing a mnt write count.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 lxc-dave/fs/namei.c         |   16 ++++++++--------
 lxc-dave/fs/open.c          |    7 +------
 lxc-dave/include/linux/fs.h |    2 +-
 3 files changed, 10 insertions(+), 15 deletions(-)

diff -puN fs/namei.c~make-open_namei-return-a-filp fs/namei.c
--- lxc/fs/namei.c~make-open_namei-return-a-filp	2007-10-03 09:01:45.000000000 -0700
+++ lxc-dave/fs/namei.c	2007-10-03 09:01:45.000000000 -0700
@@ -1728,8 +1728,8 @@ static inline int sys_open_flags_to_name
  * system call.  See sys_open_flags_to_namei_flags().
  * SMP-safe
  */
-int open_namei(int dfd, const char *pathname, int sys_open_flag,
-		int mode, struct nameidata *nd)
+struct file *open_namei(int dfd, const char *pathname, int sys_open_flag,
+			int mode, struct nameidata *nd)
 {
 	int acc_mode, error;
 	struct path path;
@@ -1755,7 +1755,7 @@ int open_namei(int dfd, const char *path
 		error = path_lookup_open(dfd, pathname, lookup_flags(flag),
 					 nd, flag);
 		if (error)
-			return error;
+			return ERR_PTR(error);
 		goto ok;
 	}
 
@@ -1764,7 +1764,7 @@ int open_namei(int dfd, const char *path
 	 */
 	error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
 	if (error)
-		return error;
+		return ERR_PTR(error);
 
 	/*
 	 * We have the parent and last component. First of all, check
@@ -1798,7 +1798,7 @@ do_last:
 		error = __open_namei_create(nd, &path, flag, mode);
 		if (error)
 			goto exit;
-		return 0;
+		return nameidata_to_filp(nd, sys_open_flag);
 	}
 
 	/*
@@ -1831,7 +1831,7 @@ ok:
 	error = may_open(nd, acc_mode, flag);
 	if (error)
 		goto exit;
-	return 0;
+	return nameidata_to_filp(nd, sys_open_flag);
 
 exit_mutex_unlock:
 	mutex_unlock(&dir->d_inode->i_mutex);
@@ -1841,7 +1841,7 @@ exit:
 	if (!IS_ERR(nd->intent.open.file))
 		release_open_intent(nd);
 	path_release(nd);
-	return error;
+	return ERR_PTR(error);
 
 do_link:
 	error = -ELOOP;
@@ -1868,7 +1868,7 @@ do_link:
 		 * with "intent.open".
 		 */
 		release_open_intent(nd);
-		return error;
+		return ERR_PTR(error);
 	}
 	nd->flags &= ~LOOKUP_PARENT;
 	if (nd->last_type == LAST_BIND)
diff -puN fs/open.c~make-open_namei-return-a-filp fs/open.c
--- lxc/fs/open.c~make-open_namei-return-a-filp	2007-10-03 09:01:45.000000000 -0700
+++ lxc-dave/fs/open.c	2007-10-03 09:01:45.000000000 -0700
@@ -846,14 +846,9 @@ cleanup_file:
 static struct file *do_filp_open(int dfd, const char *filename, int flags,
 				 int mode)
 {
-	int error;
 	struct nameidata nd;
 
-	error = open_namei(dfd, filename, flags, mode, &nd);
-	if (!error)
-		return nameidata_to_filp(&nd, flags);
-
-	return ERR_PTR(error);
+	return open_namei(dfd, filename, flags, mode, &nd);
 }
 
 struct file *filp_open(const char *filename, int flags, int mode)
diff -puN include/linux/fs.h~make-open_namei-return-a-filp include/linux/fs.h
--- lxc/include/linux/fs.h~make-open_namei-return-a-filp	2007-10-03 09:01:45.000000000 -0700
+++ lxc-dave/include/linux/fs.h	2007-10-03 09:01:45.000000000 -0700
@@ -1721,7 +1721,7 @@ extern struct file *create_read_pipe(str
 extern struct file *create_write_pipe(void);
 extern void free_write_pipe(struct file *);
 
-extern int open_namei(int dfd, const char *, int, int, struct nameidata *);
+extern struct file *open_namei(int dfd, const char *, int, int, struct nameidata *);
 extern int may_open(struct nameidata *, int, int);
 
 extern int kernel_read(struct file *, unsigned long, char *, unsigned long);
_

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC][PATCH 5/7] kill do_filp_open()
  2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
                   ` (2 preceding siblings ...)
  2007-10-10 16:34 ` [RFC][PATCH 4/7] make open_namei() return a filp Dave Hansen
@ 2007-10-10 16:34 ` Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 6/7] kill filp_open() Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 7/7] keep track of mnt_writer state of struct file Dave Hansen
  5 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen


This kills off the almost empty do_filp_open().  The indenting
change in do_sys_open() is because we would have gone over our
80 characters otherwise.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 lxc-dave/fs/open.c |   39 ++++++++++++++++++---------------------
 1 file changed, 18 insertions(+), 21 deletions(-)

diff -puN fs/open.c~kill-do_filp_open fs/open.c
--- lxc/fs/open.c~kill-do_filp_open	2007-10-04 13:59:44.000000000 -0700
+++ lxc-dave/fs/open.c	2007-10-04 13:59:44.000000000 -0700
@@ -863,17 +863,10 @@ cleanup_file:
 	return ERR_PTR(error);
 }
 
-static struct file *do_filp_open(int dfd, const char *filename, int flags,
-				 int mode)
-{
-	struct nameidata nd;
-
-	return open_namei(dfd, filename, flags, mode, &nd);
-}
-
 struct file *filp_open(const char *filename, int flags, int mode)
 {
-	return do_filp_open(AT_FDCWD, filename, flags, mode);
+	struct nameidata nd;
+	return open_namei(AT_FDCWD, filename, flags, mode, &nd);
 }
 EXPORT_SYMBOL(filp_open);
 
@@ -1072,20 +1065,24 @@ long do_sys_open(int dfd, const char __u
 	char *tmp = getname(filename);
 	int fd = PTR_ERR(tmp);
 
-	if (!IS_ERR(tmp)) {
-		fd = get_unused_fd_flags(flags);
-		if (fd >= 0) {
-			struct file *f = do_filp_open(dfd, tmp, flags, mode);
-			if (IS_ERR(f)) {
-				put_unused_fd(fd);
-				fd = PTR_ERR(f);
-			} else {
-				fsnotify_open(f->f_path.dentry);
-				fd_install(fd, f);
-			}
+	if (IS_ERR(tmp))
+		goto out;
+
+	fd = get_unused_fd_flags(flags);
+	if (fd >= 0) {
+		struct nameidata nd;
+		struct file *f = open_namei(dfd, tmp, flags, mode, &nd);
+
+		if (IS_ERR(f)) {
+			put_unused_fd(fd);
+			fd = PTR_ERR(f);
+		} else {
+			fsnotify_open(f->f_path.dentry);
+			fd_install(fd, f);
 		}
-		putname(tmp);
 	}
+	putname(tmp);
+out:
 	return fd;
 }
 
_

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC][PATCH 6/7] kill filp_open()
  2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
                   ` (3 preceding siblings ...)
  2007-10-10 16:34 ` [RFC][PATCH 5/7] kill do_filp_open() Dave Hansen
@ 2007-10-10 16:34 ` Dave Hansen
  2007-10-10 16:34 ` [RFC][PATCH 7/7] keep track of mnt_writer state of struct file Dave Hansen
  5 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen


Replace all callers with open_namei() directly, and move the
nameidata stack allocation into open_namei().

Does it make sense to still call it open_namei(), even though it
doesn't actually take a nameidata any more?

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 lxc-dave/drivers/usb/gadget/file_storage.c |    5 +-
 lxc-dave/fs/exec.c                         |    2 
 lxc-dave/fs/namei.c                        |   69 ++++++++++++++---------------
 lxc-dave/fs/open.c                         |    6 --
 lxc-dave/fs/reiserfs/journal.c             |    2 
 lxc-dave/include/linux/fs.h                |    2 
 lxc-dave/kernel/acct.c                     |    2 
 lxc-dave/mm/swapfile.c                     |    4 -
 lxc-dave/sound/sound_firmware.c            |    2 
 9 files changed, 47 insertions(+), 47 deletions(-)

diff -puN drivers/usb/gadget/file_storage.c~kill-filp_open drivers/usb/gadget/file_storage.c
--- lxc/drivers/usb/gadget/file_storage.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/drivers/usb/gadget/file_storage.c	2007-10-03 09:01:47.000000000 -0700
@@ -3468,16 +3468,17 @@ static int open_backing_file(struct lun 
 	struct inode			*inode = NULL;
 	loff_t				size;
 	loff_t				num_sectors;
+	int				mode = O_LARGEFILE;
 
 	/* R/W if we can, R/O if we must */
 	ro = curlun->ro;
 	if (!ro) {
-		filp = filp_open(filename, O_RDWR | O_LARGEFILE, 0);
+		filp = open_namei(AT_FDCWD, filename, O_RDWR | mode, 0);
 		if (-EROFS == PTR_ERR(filp))
 			ro = 1;
 	}
 	if (ro)
-		filp = filp_open(filename, O_RDONLY | O_LARGEFILE, 0);
+		filp = open_namei(AT_FDCWD, filename, O_RDONLY | mode, 0);
 	if (IS_ERR(filp)) {
 		LINFO(curlun, "unable to open backing file: %s\n", filename);
 		return PTR_ERR(filp);
diff -puN fs/exec.c~kill-filp_open fs/exec.c
--- lxc/fs/exec.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/fs/exec.c	2007-10-03 09:01:47.000000000 -0700
@@ -1764,7 +1764,7 @@ int do_coredump(long signr, int exit_cod
  			goto fail_unlock;
  		}
  	} else
- 		file = filp_open(corename,
+		file = open_namei(AT_FDCWD, corename,
 				 O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
 				 0600);
 	if (IS_ERR(file))
diff -puN fs/namei.c~kill-filp_open fs/namei.c
--- lxc/fs/namei.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/fs/namei.c	2007-10-03 09:01:47.000000000 -0700
@@ -1729,8 +1729,9 @@ static inline int sys_open_flags_to_name
  * SMP-safe
  */
 struct file *open_namei(int dfd, const char *pathname, int sys_open_flag,
-			int mode, struct nameidata *nd)
+			int mode)
 {
+	struct nameidata nd;
 	int acc_mode, error;
 	struct path path;
 	struct dentry *dir;
@@ -1753,7 +1754,7 @@ struct file *open_namei(int dfd, const c
 	 */
 	if (!(flag & O_CREAT)) {
 		error = path_lookup_open(dfd, pathname, lookup_flags(flag),
-					 nd, flag);
+					 &nd, flag);
 		if (error)
 			return ERR_PTR(error);
 		goto ok;
@@ -1762,7 +1763,7 @@ struct file *open_namei(int dfd, const c
 	/*
 	 * Create - we need to know the parent.
 	 */
-	error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
+	error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,&nd,flag,mode);
 	if (error)
 		return ERR_PTR(error);
 
@@ -1772,14 +1773,14 @@ struct file *open_namei(int dfd, const c
 	 * will not do.
 	 */
 	error = -EISDIR;
-	if (nd->last_type != LAST_NORM || nd->last.name[nd->last.len])
+	if (nd.last_type != LAST_NORM || nd.last.name[nd.last.len])
 		goto exit;
 
-	dir = nd->dentry;
-	nd->flags &= ~LOOKUP_PARENT;
+	dir = nd.dentry;
+	nd.flags &= ~LOOKUP_PARENT;
 	mutex_lock(&dir->d_inode->i_mutex);
-	path.dentry = lookup_hash(nd);
-	path.mnt = nd->mnt;
+	path.dentry = lookup_hash(&nd);
+	path.mnt = nd.mnt;
 
 do_last:
 	error = PTR_ERR(path.dentry);
@@ -1788,17 +1789,17 @@ do_last:
 		goto exit;
 	}
 
-	if (IS_ERR(nd->intent.open.file)) {
-		error = PTR_ERR(nd->intent.open.file);
+	if (IS_ERR(nd.intent.open.file)) {
+		error = PTR_ERR(nd.intent.open.file);
 		goto exit_mutex_unlock;
 	}
 
 	/* Negative dentry, just create the file */
 	if (!path.dentry->d_inode) {
-		error = __open_namei_create(nd, &path, flag, mode);
+		error = __open_namei_create(&nd, &path, flag, mode);
 		if (error)
 			goto exit;
-		return nameidata_to_filp(nd, sys_open_flag);
+		return nameidata_to_filp(&nd, sys_open_flag);
 	}
 
 	/*
@@ -1823,24 +1824,24 @@ do_last:
 	if (path.dentry->d_inode->i_op && path.dentry->d_inode->i_op->follow_link)
 		goto do_link;
 
-	path_to_nameidata(&path, nd);
+	path_to_nameidata(&path, &nd);
 	error = -EISDIR;
 	if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode))
 		goto exit;
 ok:
-	error = may_open(nd, acc_mode, flag);
+	error = may_open(&nd, acc_mode, flag);
 	if (error)
 		goto exit;
-	return nameidata_to_filp(nd, sys_open_flag);
+	return nameidata_to_filp(&nd, sys_open_flag);
 
 exit_mutex_unlock:
 	mutex_unlock(&dir->d_inode->i_mutex);
 exit_dput:
-	dput_path(&path, nd);
+	dput_path(&path, &nd);
 exit:
-	if (!IS_ERR(nd->intent.open.file))
-		release_open_intent(nd);
-	path_release(nd);
+	if (!IS_ERR(&nd.intent.open.file))
+		release_open_intent(&nd);
+	path_release(&nd);
 	return ERR_PTR(error);
 
 do_link:
@@ -1854,42 +1855,42 @@ do_link:
 	 * After that we have the parent and last component, i.e.
 	 * we are in the same situation as after the first path_walk().
 	 * Well, almost - if the last component is normal we get its copy
-	 * stored in nd->last.name and we will have to putname() it when we
+	 * stored in nd.last.name and we will have to putname() it when we
 	 * are done. Procfs-like symlinks just set LAST_BIND.
 	 */
-	nd->flags |= LOOKUP_PARENT;
-	error = security_inode_follow_link(path.dentry, nd);
+	nd.flags |= LOOKUP_PARENT;
+	error = security_inode_follow_link(path.dentry, &nd);
 	if (error)
 		goto exit_dput;
-	error = __do_follow_link(&path, nd);
+	error = __do_follow_link(&path, &nd);
 	if (error) {
 		/* Does someone understand code flow here? Or it is only
 		 * me so stupid? Anathema to whoever designed this non-sense
 		 * with "intent.open".
 		 */
-		release_open_intent(nd);
+		release_open_intent(&nd);
 		return ERR_PTR(error);
 	}
-	nd->flags &= ~LOOKUP_PARENT;
-	if (nd->last_type == LAST_BIND)
+	nd.flags &= ~LOOKUP_PARENT;
+	if (nd.last_type == LAST_BIND)
 		goto ok;
 	error = -EISDIR;
-	if (nd->last_type != LAST_NORM)
+	if (nd.last_type != LAST_NORM)
 		goto exit;
-	if (nd->last.name[nd->last.len]) {
-		__putname(nd->last.name);
+	if (nd.last.name[nd.last.len]) {
+		__putname(nd.last.name);
 		goto exit;
 	}
 	error = -ELOOP;
 	if (count++==32) {
-		__putname(nd->last.name);
+		__putname(nd.last.name);
 		goto exit;
 	}
-	dir = nd->dentry;
+	dir = nd.dentry;
 	mutex_lock(&dir->d_inode->i_mutex);
-	path.dentry = lookup_hash(nd);
-	path.mnt = nd->mnt;
-	__putname(nd->last.name);
+	path.dentry = lookup_hash(&nd);
+	path.mnt = nd.mnt;
+	__putname(nd.last.name);
 	goto do_last;
 }
 
diff -puN fs/open.c~kill-filp_open fs/open.c
--- lxc/fs/open.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/fs/open.c	2007-10-03 09:01:47.000000000 -0700
@@ -845,8 +845,7 @@ cleanup_file:
 
 struct file *filp_open(const char *filename, int flags, int mode)
 {
-	struct nameidata nd;
-	return open_namei(AT_FDCWD, filename, flags, mode, &nd);
+	return open_namei(AT_FDCWD, filename, flags, mode);
 }
 EXPORT_SYMBOL(filp_open);
 
@@ -1050,8 +1049,7 @@ long do_sys_open(int dfd, const char __u
 
 	fd = get_unused_fd_flags(flags);
 	if (fd >= 0) {
-		struct nameidata nd;
-		struct file *f = open_namei(dfd, tmp, flags, mode, &nd);
+		struct file *f = open_namei(dfd, tmp, flags, mode);
 
 		if (IS_ERR(f)) {
 			put_unused_fd(fd);
diff -puN fs/reiserfs/journal.c~kill-filp_open fs/reiserfs/journal.c
--- lxc/fs/reiserfs/journal.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/fs/reiserfs/journal.c	2007-10-03 09:01:47.000000000 -0700
@@ -2623,7 +2623,7 @@ static int journal_init_dev(struct super
 		return 0;
 	}
 
-	journal->j_dev_file = filp_open(jdev_name, 0, 0);
+	journal->j_dev_file = open_namei(AT_FDCWD, jdev_name, 0, 0);
 	if (!IS_ERR(journal->j_dev_file)) {
 		struct inode *jdev_inode = journal->j_dev_file->f_mapping->host;
 		if (!S_ISBLK(jdev_inode->i_mode)) {
diff -puN include/linux/fs.h~kill-filp_open include/linux/fs.h
--- lxc/include/linux/fs.h~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/include/linux/fs.h	2007-10-03 09:01:47.000000000 -0700
@@ -1721,7 +1721,7 @@ extern struct file *create_read_pipe(str
 extern struct file *create_write_pipe(void);
 extern void free_write_pipe(struct file *);
 
-extern struct file *open_namei(int dfd, const char *, int, int, struct nameidata *);
+extern struct file *open_namei(int dfd, const char *, int, int);
 extern int may_open(struct nameidata *, int, int);
 
 extern int kernel_read(struct file *, unsigned long, char *, unsigned long);
diff -puN kernel/acct.c~kill-filp_open kernel/acct.c
--- lxc/kernel/acct.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/kernel/acct.c	2007-10-03 09:01:47.000000000 -0700
@@ -208,7 +208,7 @@ static int acct_on(char *name)
 	int error;
 
 	/* Difference from BSD - they don't do O_APPEND */
-	file = filp_open(name, O_WRONLY|O_APPEND|O_LARGEFILE, 0);
+	file = open_namei(AT_FDCWD, name, O_WRONLY|O_APPEND|O_LARGEFILE, 0);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
diff -puN mm/swapfile.c~kill-filp_open mm/swapfile.c
--- lxc/mm/swapfile.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/mm/swapfile.c	2007-10-03 09:01:47.000000000 -0700
@@ -1198,7 +1198,7 @@ asmlinkage long sys_swapoff(const char _
 	if (IS_ERR(pathname))
 		goto out;
 
-	victim = filp_open(pathname, O_RDWR|O_LARGEFILE, 0);
+	victim = open_namei(AT_FDCWD, pathname, O_RDWR|O_LARGEFILE, 0);
 	putname(pathname);
 	err = PTR_ERR(victim);
 	if (IS_ERR(victim))
@@ -1477,7 +1477,7 @@ asmlinkage long sys_swapon(const char __
 		name = NULL;
 		goto bad_swap_2;
 	}
-	swap_file = filp_open(name, O_RDWR|O_LARGEFILE, 0);
+	swap_file = open_namei(AT_FDCWD, name, O_RDWR|O_LARGEFILE, 0);
 	error = PTR_ERR(swap_file);
 	if (IS_ERR(swap_file)) {
 		swap_file = NULL;
diff -puN sound/sound_firmware.c~kill-filp_open sound/sound_firmware.c
--- lxc/sound/sound_firmware.c~kill-filp_open	2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/sound/sound_firmware.c	2007-10-03 09:01:47.000000000 -0700
@@ -14,7 +14,7 @@ static int do_mod_firmware_load(const ch
 	char *dp;
 	loff_t pos;
 
-	filp = filp_open(fn, 0, 0);
+	filp = open_namei(AT_FDCWD, fn, 0, 0);
 	if (IS_ERR(filp))
 	{
 		printk(KERN_INFO "Unable to load '%s'.\n", fn);
_

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC][PATCH 7/7] keep track of mnt_writer state of struct file
  2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
                   ` (4 preceding siblings ...)
  2007-10-10 16:34 ` [RFC][PATCH 6/7] kill filp_open() Dave Hansen
@ 2007-10-10 16:34 ` Dave Hansen
  5 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2007-10-10 16:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: miklos, hch, Dave Hansen


There have been a few oopses caused by 'struct file's with
NULL f_vfsmnts.  There was also a set of potentially missed
mnt_want_write()s from dentry_open() calls.

This patch provides a very simple debugging framework to
catch these kinds of bugs.  It will WARN_ON() them, but
should stop us from having any oopses or mnt_writer
count imbalances.


Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 lxc-dave/fs/file_table.c    |   21 +++++++++++++++++++--
 lxc-dave/fs/open.c          |    2 ++
 lxc-dave/include/linux/fs.h |    4 ++++
 3 files changed, 25 insertions(+), 2 deletions(-)

diff -puN fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file fs/file_table.c
--- lxc/fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file	2007-10-03 14:40:14.000000000 -0700
+++ lxc-dave/fs/file_table.c	2007-10-03 14:40:14.000000000 -0700
@@ -42,6 +42,12 @@ static inline void file_free_rcu(struct 
 static inline void file_free(struct file *f)
 {
 	percpu_counter_dec(&nr_files);
+	/*
+	 * At this point, either both or neither of these bits
+	 * should be set.
+	 */
+	WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_TAKEN);
+	WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_RELEASED);
 	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
 }
 
@@ -194,6 +200,7 @@ int init_file(struct file *file, struct 
 	file->f_mode = mode;
 	file->f_op = fop;
 	if (mode & FMODE_WRITE) {
+		file->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
 		error = mnt_want_write(mnt);
 		WARN_ON(error);
 	}
@@ -236,8 +243,18 @@ void fastcall __fput(struct file *file)
 	fops_put(file->f_op);
 	if (file->f_mode & FMODE_WRITE) {
 		put_write_access(inode);
-		if (!special_file(inode->i_mode))
-			mnt_drop_write(mnt);
+		if (!special_file(inode->i_mode)) {
+			if (file->f_mnt_write_state == FILE_MNT_WRITE_TAKEN) {
+				mnt_drop_write(mnt);
+				file->f_mnt_write_state |=
+					FILE_MNT_WRITE_RELEASED;
+			} else {
+				printk(KERN_WARNING "__fput() of writeable "
+						"file with no "
+						"mnt_want_write()\n");
+				WARN_ON(1);
+			}
+		}
 	}
 	put_pid(file->f_owner.pid);
 	file_kill(file);
diff -puN fs/namei.c~keep-track-of-mnt_writer-state-of-struct-file fs/namei.c
diff -puN include/linux/fs.h~keep-track-of-mnt_writer-state-of-struct-file include/linux/fs.h
--- lxc/include/linux/fs.h~keep-track-of-mnt_writer-state-of-struct-file	2007-10-03 14:40:14.000000000 -0700
+++ lxc-dave/include/linux/fs.h	2007-10-03 14:40:14.000000000 -0700
@@ -779,6 +779,9 @@ static inline int ra_has_index(struct fi
 		index <  ra->start + ra->size);
 }
 
+#define FILE_MNT_WRITE_TAKEN	1
+#define FILE_MNT_WRITE_RELEASED	2
+
 struct file {
 	/*
 	 * fu_list becomes invalid after file_free is called and queued via
@@ -813,6 +816,7 @@ struct file {
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
+	unsigned long f_mnt_write_state;
 };
 extern spinlock_t files_lock;
 #define file_list_lock() spin_lock(&files_lock);
diff -puN fs/open.c~keep-track-of-mnt_writer-state-of-struct-file fs/open.c
--- lxc/fs/open.c~keep-track-of-mnt_writer-state-of-struct-file	2007-10-03 14:40:14.000000000 -0700
+++ lxc-dave/fs/open.c	2007-10-03 14:42:01.000000000 -0700
@@ -790,6 +790,8 @@ static struct file *__dentry_open(struct
 			mnt_drop_write(mnt);
 			goto cleanup_file;
 		}
+		WARN_ON(f->f_mnt_write_state != 0);
+		f->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
 	}
 
 	f->f_mapping = inode->i_mapping;
_

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][PATCH 2/7] get mount write in __dentry_open()
  2007-10-10 16:34 ` [RFC][PATCH 2/7] get mount write in __dentry_open() Dave Hansen
@ 2007-10-11 15:08   ` Miklos Szeredi
  2007-10-11 18:16     ` Dave Hansen
  0 siblings, 1 reply; 12+ messages in thread
From: Miklos Szeredi @ 2007-10-11 15:08 UTC (permalink / raw)
  To: haveblue; +Cc: linux-kernel, miklos, hch, haveblue

> 
> 
> The first patch fixes an actual bug.  I think the
> reset will reduce the chance for any future bugs
> to creep in.
> 
> --
> 
> The r/o bind mount patches require matching mnt_want_write()
> at filp creation time with a mnt_drop_write() at __fput().
> 
> We used to do this in may_open(), but Miklos pointed out
> that __dentry_open() is used as well to create filps.  We
> don't currently do mnt_want_write() for these.
> 
> If a filp on a writeable file is created this way, and
> destroyed via __fput() we'll get a mount count imbalance.
> 
> This patch moves the mount write count acquisition from
> may_open() into __dentry_open(), where we should catch
> many more of the users.
> 
> Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
> ---
> 
>  lxc-dave/fs/namei.c |   12 ------------
>  lxc-dave/fs/open.c  |   45 ++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 38 insertions(+), 19 deletions(-)
> 
> diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
> --- lxc/fs/namei.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
> +++ lxc-dave/fs/namei.c	2007-10-04 18:02:48.000000000 -0700
> @@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
>  			return -EACCES;
>  
>  		flag &= ~O_TRUNC;
> -	} else if (flag & FMODE_WRITE) {
> -		/*
> -		 * effectively: !special_file()
> -		 * balanced by __fput()
> -		 */
> -		error = mnt_want_write(nd->mnt);
> -		if (error)
> -			return error;
>  	}

Maybe readonly should still be checked here, so that the order of
error checking doesn't change.  If racing with a read-only remount the
order is irrelevant anyway.  Something like this?

	} else if (flag & FMODE_WRITE && __mnt_is_readonly(nd->mnt)) {
		return -EROFS
	}

>  	error = vfs_permission(nd, acc_mode);
> @@ -1778,11 +1770,7 @@ do_last:
>  
>  	/* Negative dentry, just create the file */
>  	if (!path.dentry->d_inode) {
> -		error = mnt_want_write(nd->mnt);
> -		if (error)
> -			goto exit_mutex_unlock;
>  		error = open_namei_create(nd, &path, flag, mode);
> -		mnt_drop_write(nd->mnt);

This is still needed, isn't it?

And they should be added around do_truncate() as well, since you
remove the protection from may_open().

This one introduces an interesting race between ro-remount and
open(O_TRUNC), where the truncate can succeed but the open fail with
EROFS.  Is that a problem?

>  		if (error)
>  			goto exit;
>  		return 0;
> diff -puN fs/open.c~get-write-in-__dentry_open fs/open.c
> --- lxc/fs/open.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
> +++ lxc-dave/fs/open.c	2007-10-04 18:02:48.000000000 -0700
> @@ -766,22 +766,51 @@ out:
>  	return error;
>  }
>  
> +/*
> + * You have to be very careful that these write
> + * counts get cleaned up in error cases and
> + * upon __fput().  This should probably never
> + * be called outside of __dentry_open().
> + */
> +static inline int __get_file_write_access(struct inode *inode,
> +					  struct vfsmount *mnt)
> +{
> +	int error;
> +	error = get_write_access(inode);
> +	if (error)
> +		return error;
> +	/*
> +	 * Do not take mount writer counts on
> +	 * special files since no writes to
> +	 * the mount itself will occur.
> +	 */
> +	if (special_file(inode->i_mode))
> +		return 0;
> +
> +	/*
> +	 * Balanced in __fput()
> +	 */
> +	error = mnt_want_write(mnt);
> +	if (error)
> +		put_write_access(inode);
> +	return error;
> +}
> +
>  static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
>  					int flags, struct file *f,
>  					int (*open)(struct inode *, struct file *))
>  {
>  	struct inode *inode;
> -	int error;
> +	int error = 0;
>  
>  	f->f_flags = flags;
>  	f->f_mode = ((flags+1) & O_ACCMODE) | FMODE_LSEEK |
>  				FMODE_PREAD | FMODE_PWRITE;
>  	inode = dentry->d_inode;
> -	if (f->f_mode & FMODE_WRITE) {
> -		error = get_write_access(inode);
> -		if (error)
> -			goto cleanup_file;
> -	}
> +	if (f->f_mode & FMODE_WRITE)
> +		error = __get_file_write_access(inode, mnt);
> +	if (error)
> +		goto cleanup_file;
>  
>  	f->f_mapping = inode->i_mapping;
>  	f->f_path.dentry = dentry;
> @@ -820,8 +849,10 @@ static struct file *__dentry_open(struct
>  
>  cleanup_all:
>  	fops_put(f->f_op);
> -	if (f->f_mode & FMODE_WRITE)
> +	if (f->f_mode & FMODE_WRITE) {
>  		put_write_access(inode);
> +		mnt_drop_write(mnt);

Shouldn't this be conditional on !special_file()?

Miklos

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][PATCH 4/7] make open_namei() return a filp
  2007-10-10 16:34 ` [RFC][PATCH 4/7] make open_namei() return a filp Dave Hansen
@ 2007-10-11 15:24   ` Miklos Szeredi
  0 siblings, 0 replies; 12+ messages in thread
From: Miklos Szeredi @ 2007-10-11 15:24 UTC (permalink / raw)
  To: haveblue; +Cc: linux-kernel, miklos, hch, haveblue

> If open_namei() succeeds, there is potentially a mnt_want_write()
> that needs to get balanced.  If the caller doesn't create a
> 'struct file' and eventually __fput() it, or manually drop the
> write count on an error, we have a bug.
> 
> Forcing open_namei() to return a filp fixes this.  Any caller
> getting a 'struct file' back must consider that filp instantiated
> and fput() it normally.  The callers no longer have to worry about
> ever manually releasing a mnt write count.

I think this changelog is out of date, as this problem should have
been dealt with in patch-2/7.

Otherwise I don't object to this change, it looks like a fine cleanup.

Miklos

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][PATCH 2/7] get mount write in __dentry_open()
  2007-10-11 15:08   ` Miklos Szeredi
@ 2007-10-11 18:16     ` Dave Hansen
  2007-10-11 18:31       ` Miklos Szeredi
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2007-10-11 18:16 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, hch

On Thu, 2007-10-11 at 17:08 +0200, Miklos Szeredi wrote:
> > diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
> > --- lxc/fs/namei.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
> > +++ lxc-dave/fs/namei.c	2007-10-04 18:02:48.000000000 -0700
> > @@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
> >  			return -EACCES;
> >  
> >  		flag &= ~O_TRUNC;
> > -	} else if (flag & FMODE_WRITE) {
> > -		/*
> > -		 * effectively: !special_file()
> > -		 * balanced by __fput()
> > -		 */
> > -		error = mnt_want_write(nd->mnt);
> > -		if (error)
> > -			return error;
> >  	}
> 
> Maybe readonly should still be checked here, so that the order of
> error checking doesn't change.  If racing with a read-only remount the
> order is irrelevant anyway.  Something like this?
> 
> 	} else if (flag & FMODE_WRITE && __mnt_is_readonly(nd->mnt)) {
> 		return -EROFS
> 	}

I think that would be a bug if anything actually managed to trip that
code.  all of the may_open() calls should have been covered by the
__dentry_open() mnt writer.

> >  	error = vfs_permission(nd, acc_mode);
> > @@ -1778,11 +1770,7 @@ do_last:
> >  
> >  	/* Negative dentry, just create the file */
> >  	if (!path.dentry->d_inode) {
> > -		error = mnt_want_write(nd->mnt);
> > -		if (error)
> > -			goto exit_mutex_unlock;
> >  		error = open_namei_create(nd, &path, flag, mode);
> > -		mnt_drop_write(nd->mnt);
> 
> This is still needed, isn't it?

Yes, it is.  I'll add a big fat comment this time about why we need it.

> And they should be added around do_truncate() as well, since you
> remove the protection from may_open().
> 
> This one introduces an interesting race between ro-remount and
> open(O_TRUNC), where the truncate can succeed but the open fail with
> EROFS.  Is that a problem?

You're right, this does introduce that race, and it is relatively hard
to fix properly.  But, the 'return a filp' patch makes it easy to fix.
I've put a temporary kludge in the updated version of this patch, and
fixed it properly in that later patch.  

> >  cleanup_all:
> >  	fops_put(f->f_op);
> > -	if (f->f_mode & FMODE_WRITE)
> > +	if (f->f_mode & FMODE_WRITE) {
> >  		put_write_access(inode);
> > +		mnt_drop_write(mnt);
> 
> Shouldn't this be conditional on !special_file()?

It certainly should.

-- Dave


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][PATCH 2/7] get mount write in __dentry_open()
  2007-10-11 18:16     ` Dave Hansen
@ 2007-10-11 18:31       ` Miklos Szeredi
  2007-10-11 19:24         ` Dave Hansen
  0 siblings, 1 reply; 12+ messages in thread
From: Miklos Szeredi @ 2007-10-11 18:31 UTC (permalink / raw)
  To: haveblue; +Cc: miklos, linux-kernel, hch

> On Thu, 2007-10-11 at 17:08 +0200, Miklos Szeredi wrote:
> > > diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
> > > --- lxc/fs/namei.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
> > > +++ lxc-dave/fs/namei.c	2007-10-04 18:02:48.000000000 -0700
> > > @@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
> > >  			return -EACCES;
> > >  
> > >  		flag &= ~O_TRUNC;
> > > -	} else if (flag & FMODE_WRITE) {
> > > -		/*
> > > -		 * effectively: !special_file()
> > > -		 * balanced by __fput()
> > > -		 */
> > > -		error = mnt_want_write(nd->mnt);
> > > -		if (error)
> > > -			return error;
> > >  	}
> > 
> > Maybe readonly should still be checked here, so that the order of
> > error checking doesn't change.  If racing with a read-only remount the
> > order is irrelevant anyway.  Something like this?
> > 
> > 	} else if (flag & FMODE_WRITE && __mnt_is_readonly(nd->mnt)) {
> > 		return -EROFS
> > 	}
> 
> I think that would be a bug if anything actually managed to trip that
> code.  all of the may_open() calls should have been covered by the
> __dentry_open() mnt writer.

AFACIS, __dentry_open() will normally be called later than may_open().
And we don't want it earlier, because ->open() may have side affects,
that could be unsafe if done before permission checking.

> > And they should be added around do_truncate() as well, since you
> > remove the protection from may_open().
> > 
> > This one introduces an interesting race between ro-remount and
> > open(O_TRUNC), where the truncate can succeed but the open fail with
> > EROFS.  Is that a problem?
> 
> You're right, this does introduce that race, and it is relatively hard
> to fix properly.  But, the 'return a filp' patch makes it easy to fix.
> I've put a temporary kludge in the updated version of this patch, and
> fixed it properly in that later patch.  

If you fix this properly, that should take care of the first problem
as well.

Miklos

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][PATCH 2/7] get mount write in __dentry_open()
  2007-10-11 18:31       ` Miklos Szeredi
@ 2007-10-11 19:24         ` Dave Hansen
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2007-10-11 19:24 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, hch

On Thu, 2007-10-11 at 20:31 +0200, Miklos Szeredi wrote:
> > On Thu, 2007-10-11 at 17:08 +0200, Miklos Szeredi wrote:
> > > > diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
> > > > --- lxc/fs/namei.c~get-write-in-__dentry_open	2007-10-03 14:44:52.000000000 -0700
> > > > +++ lxc-dave/fs/namei.c	2007-10-04 18:02:48.000000000 -0700
> > > > @@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
> > > >  			return -EACCES;
> > > >  
> > > >  		flag &= ~O_TRUNC;
> > > > -	} else if (flag & FMODE_WRITE) {
> > > > -		/*
> > > > -		 * effectively: !special_file()
> > > > -		 * balanced by __fput()
> > > > -		 */
> > > > -		error = mnt_want_write(nd->mnt);
> > > > -		if (error)
> > > > -			return error;
> > > >  	}
> > > 
> > > Maybe readonly should still be checked here, so that the order of
> > > error checking doesn't change.  If racing with a read-only remount the
> > > order is irrelevant anyway.  Something like this?
> > > 
> > > 	} else if (flag & FMODE_WRITE && __mnt_is_readonly(nd->mnt)) {
> > > 		return -EROFS
> > > 	}
> > 
> > I think that would be a bug if anything actually managed to trip that
> > code.  all of the may_open() calls should have been covered by the
> > __dentry_open() mnt writer.
> 
> AFACIS, __dentry_open() will normally be called later than may_open().
> And we don't want it earlier, because ->open() may have side affects,
> that could be unsafe if done before permission checking.

I actually check the mount write count before the ->open() in
__dentry_open().  The truncates are also definitely wrapped in their own
mnt_want_write() calls now.

> > > And they should be added around do_truncate() as well, since you
> > > remove the protection from may_open().
> > > 
> > > This one introduces an interesting race between ro-remount and
> > > open(O_TRUNC), where the truncate can succeed but the open fail with
> > > EROFS.  Is that a problem?
> > 
> > You're right, this does introduce that race, and it is relatively hard
> > to fix properly.  But, the 'return a filp' patch makes it easy to fix.
> > I've put a temporary kludge in the updated version of this patch, and
> > fixed it properly in that later patch.  
> 
> If you fix this properly, that should take care of the first problem
> as well.

Yup.  New series coming up. 

-- Dave


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2007-10-11 19:24 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-10 16:34 [RFC][PATCH 1/7] init_file(): only take writes on normal files Dave Hansen
2007-10-10 16:34 ` [RFC][PATCH 2/7] get mount write in __dentry_open() Dave Hansen
2007-10-11 15:08   ` Miklos Szeredi
2007-10-11 18:16     ` Dave Hansen
2007-10-11 18:31       ` Miklos Szeredi
2007-10-11 19:24         ` Dave Hansen
2007-10-10 16:34 ` [RFC][PATCH 3/7] do namei_flags calculation inside open_namei() Dave Hansen
2007-10-10 16:34 ` [RFC][PATCH 4/7] make open_namei() return a filp Dave Hansen
2007-10-11 15:24   ` Miklos Szeredi
2007-10-10 16:34 ` [RFC][PATCH 5/7] kill do_filp_open() Dave Hansen
2007-10-10 16:34 ` [RFC][PATCH 6/7] kill filp_open() Dave Hansen
2007-10-10 16:34 ` [RFC][PATCH 7/7] keep track of mnt_writer state of struct file Dave Hansen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.