linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] fs: fix i_writecount on shmem and friends
@ 2014-03-03 15:16 David Herrmann
  2014-03-11 19:05 ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: David Herrmann @ 2014-03-03 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, Linus Torvalds, Andrew Morton, David Herrmann,
	Al Viro, David Howells, Oleg Nesterov, stable

VM_DENYWRITE currently relies on i_writecount. Unless there's an active
writable reference to an inode, VM_DENYWRITE is not allowed.
Unfortunately, alloc_file() does not increase i_writecount, therefore,
does not prevent a following VM_DENYWRITE even though the new file might
have been opened with FMODE_WRITE. However, callers of alloc_file() expect
the file object to be fully instantiated so they can call fput() on it. We
could now either fix all callers to do an get_write_access() if opened
with FMODE_WRITE, or simply fix alloc_file() to do that. I chose the
latter.

Note that this bug allows some rather subtle misbehavior. The following
sequence of calls should work just fine, but currently fails:
    int p[2], orig, ro, rw;
    char buf[128];

    pipe(p);
    sprintf(buf, "/proc/self/fd/%d", p[1]);
    ro = open("/proc/self/fd/$orig", O_RDONLY);
    close(p[1]);
    rw = open("/proc/self/fd/$ro", O_RDWR);

The final open() cannot succeed as close(p[1]) caused an integer underflow
on i_writecount, effectively causing VM_DENYWRITE on the inode. The open
will fail with -ETXTBUSY.

It's a rather odd sequence of calls and given that open() doesn't use
alloc_file() (and thus not affected by this bug), it's rather unlikely
that this is a serious issue. But stuff like anon_inode shares a *single*
inode across a huge set of interfaces. If any of these is broken like
pipe(), it will affect all of these (ranging from dma-buf to epoll).

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 fs/file_table.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 5fff903..e3c8dd0 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -167,6 +167,7 @@ struct file *alloc_file(struct path *path, fmode_t mode,
 		const struct file_operations *fop)
 {
 	struct file *file;
+	int error;
 
 	file = get_empty_filp();
 	if (IS_ERR(file))
@@ -178,15 +179,23 @@ struct file *alloc_file(struct path *path, fmode_t mode,
 	file->f_mode = mode;
 	file->f_op = fop;
 
-	/*
-	 * These mounts don't really matter in practice
-	 * for r/o bind mounts.  They aren't userspace-
-	 * visible.  We do this for consistency, and so
-	 * that we can do debugging checks at __fput()
-	 */
-	if ((mode & FMODE_WRITE) && !special_file(path->dentry->d_inode->i_mode)) {
-		file_take_write(file);
-		WARN_ON(mnt_clone_write(path->mnt));
+	if (mode & FMODE_WRITE) {
+		error = get_write_access(path->dentry->d_inode);
+		if (error) {
+			put_filp(file);
+			return ERR_PTR(error);
+		}
+
+		/*
+		 * These mounts don't really matter in practice
+		 * for r/o bind mounts.  They aren't userspace-
+		 * visible.  We do this for consistency, and so
+		 * that we can do debugging checks at __fput()
+		 */
+		if (!special_file(path->dentry->d_inode->i_mode)) {
+			file_take_write(file);
+			WARN_ON(mnt_clone_write(path->mnt));
+		}
 	}
 	if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
 		i_readcount_inc(path->dentry->d_inode);
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: fix i_writecount on shmem and friends
  2014-03-03 15:16 [PATCH] fs: fix i_writecount on shmem and friends David Herrmann
@ 2014-03-11 19:05 ` Linus Torvalds
  2014-03-12 18:19   ` Al Viro
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2014-03-11 19:05 UTC (permalink / raw)
  To: David Herrmann
  Cc: Linux Kernel Mailing List, linux-fsdevel, Andrew Morton, Al Viro,
	David Howells, Oleg Nesterov, stable

Al, any comments?

David's test-program is some broken mix of C and shell scripting, but
the fixed version does show the issue he talks about:

    int main(int argc, char **argv)
    {
            int p[2], ro;
            char buf[128];

            pipe(p);
            sprintf(buf, "/proc/self/fd/%d", p[1]);
            ro = open(buf, O_RDONLY);
            sprintf(buf, "/proc/self/fd/%d", ro);
            close(p[1]);
            return open(buf, O_RDWR);
    }

which returns ETXTBSY (most easily seen by just stracing it).

The patch would also seem to make sense, with the i_readcount_inc()
being immediately below for the FMODE_READ case.

[ Quoting the whole email for context, sorry ]

                  Linus

On Mon, Mar 3, 2014 at 7:16 AM, David Herrmann <dh.herrmann@gmail.com> wrote:
> VM_DENYWRITE currently relies on i_writecount. Unless there's an active
> writable reference to an inode, VM_DENYWRITE is not allowed.
> Unfortunately, alloc_file() does not increase i_writecount, therefore,
> does not prevent a following VM_DENYWRITE even though the new file might
> have been opened with FMODE_WRITE. However, callers of alloc_file() expect
> the file object to be fully instantiated so they can call fput() on it. We
> could now either fix all callers to do an get_write_access() if opened
> with FMODE_WRITE, or simply fix alloc_file() to do that. I chose the
> latter.
>
> Note that this bug allows some rather subtle misbehavior. The following
> sequence of calls should work just fine, but currently fails:
>     int p[2], orig, ro, rw;
>     char buf[128];
>
>     pipe(p);
>     sprintf(buf, "/proc/self/fd/%d", p[1]);
>     ro = open("/proc/self/fd/$orig", O_RDONLY);
>     close(p[1]);
>     rw = open("/proc/self/fd/$ro", O_RDWR);
>
> The final open() cannot succeed as close(p[1]) caused an integer underflow
> on i_writecount, effectively causing VM_DENYWRITE on the inode. The open
> will fail with -ETXTBUSY.
>
> It's a rather odd sequence of calls and given that open() doesn't use
> alloc_file() (and thus not affected by this bug), it's rather unlikely
> that this is a serious issue. But stuff like anon_inode shares a *single*
> inode across a huge set of interfaces. If any of these is broken like
> pipe(), it will affect all of these (ranging from dma-buf to epoll).
>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
> ---
>  fs/file_table.c | 27 ++++++++++++++++++---------
>  1 file changed, 18 insertions(+), 9 deletions(-)
>
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 5fff903..e3c8dd0 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -167,6 +167,7 @@ struct file *alloc_file(struct path *path, fmode_t mode,
>                 const struct file_operations *fop)
>  {
>         struct file *file;
> +       int error;
>
>         file = get_empty_filp();
>         if (IS_ERR(file))
> @@ -178,15 +179,23 @@ struct file *alloc_file(struct path *path, fmode_t mode,
>         file->f_mode = mode;
>         file->f_op = fop;
>
> -       /*
> -        * These mounts don't really matter in practice
> -        * for r/o bind mounts.  They aren't userspace-
> -        * visible.  We do this for consistency, and so
> -        * that we can do debugging checks at __fput()
> -        */
> -       if ((mode & FMODE_WRITE) && !special_file(path->dentry->d_inode->i_mode)) {
> -               file_take_write(file);
> -               WARN_ON(mnt_clone_write(path->mnt));
> +       if (mode & FMODE_WRITE) {
> +               error = get_write_access(path->dentry->d_inode);
> +               if (error) {
> +                       put_filp(file);
> +                       return ERR_PTR(error);
> +               }
> +
> +               /*
> +                * These mounts don't really matter in practice
> +                * for r/o bind mounts.  They aren't userspace-
> +                * visible.  We do this for consistency, and so
> +                * that we can do debugging checks at __fput()
> +                */
> +               if (!special_file(path->dentry->d_inode->i_mode)) {
> +                       file_take_write(file);
> +                       WARN_ON(mnt_clone_write(path->mnt));
> +               }
>         }
>         if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
>                 i_readcount_inc(path->dentry->d_inode);
> --
> 1.9.0
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: fix i_writecount on shmem and friends
  2014-03-11 19:05 ` Linus Torvalds
@ 2014-03-12 18:19   ` Al Viro
  2014-03-12 22:30     ` David Herrmann
  2014-03-13  4:08     ` NeilBrown
  0 siblings, 2 replies; 11+ messages in thread
From: Al Viro @ 2014-03-12 18:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Herrmann, Linux Kernel Mailing List, linux-fsdevel,
	Andrew Morton, David Howells, Oleg Nesterov, stable, Neil Brown

On Tue, Mar 11, 2014 at 12:05:09PM -0700, Linus Torvalds wrote:
> 
> which returns ETXTBSY (most easily seen by just stracing it).
> 
> The patch would also seem to make sense, with the i_readcount_inc()
> being immediately below for the FMODE_READ case.

I think it's trying to fix the problem in the wrong place.  The bug is real,
all right, but it's not that alloc_file() for non-regulars doesn't grab
writecount; it's that drop_file_write_access() drops it for those.

What the hell would we want to play with that counter for, anyway?  It's not
as if they could be mmapped, so all it does is making pipe(2) and socket(2)
more costly, for no visible reason.

I would prefer to flip
        put_write_access(inode);

        if (special_file(inode->i_mode))
                return;
in drop_file_write_access() instead.

<goes to looks at i_writecount users>
Oh, shit...

drivers/md/md.c:
/* similar to deny_write_access, but accounts for our holding a reference
 * to the file ourselves */
static int deny_bitmap_write_access(struct file * file)
{
        struct inode *inode = file->f_mapping->host;

        spin_lock(&inode->i_lock);
        if (atomic_read(&inode->i_writecount) > 1) {
                spin_unlock(&inode->i_lock);
                return -ETXTBSY;
        }
        atomic_set(&inode->i_writecount, -1);
        spin_unlock(&inode->i_lock);

        return 0;
}

Broken.  get_write_access() will happily increment i_writecount e.g. from
1 to 2, without even looking at i_lock.  Moreover, it's paired with
void restore_bitmap_write_access(struct file *file)
{
        struct inode *inode = file->f_mapping->host;

        spin_lock(&inode->i_lock);
        atomic_set(&inode->i_writecount, 1);
        spin_unlock(&inode->i_lock);
}
Just what will happen if we do denywrite mmap() of that file in between?
Even worse, the caller take file straight from fget(), with no sanity
checks whatsoever.  Just what will happen if I give it e.g. a directory?
Or a procfs/sysfs/whatnot file, for that matter?  Neil?  I realize that
it's root-only, but still...

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: fix i_writecount on shmem and friends
  2014-03-12 18:19   ` Al Viro
@ 2014-03-12 22:30     ` David Herrmann
  2014-03-13  0:37       ` Al Viro
  2014-03-13  4:08     ` NeilBrown
  1 sibling, 1 reply; 11+ messages in thread
From: David Herrmann @ 2014-03-12 22:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	Andrew Morton, David Howells, Oleg Nesterov, stable, Neil Brown

Hi

> I think it's trying to fix the problem in the wrong place.  The bug is real,
> all right, but it's not that alloc_file() for non-regulars doesn't grab
> writecount; it's that drop_file_write_access() drops it for those.
>
> What the hell would we want to play with that counter for, anyway?  It's not
> as if they could be mmapped, so all it does is making pipe(2) and socket(2)
> more costly, for no visible reason.

Please see:
  shmem_zero_setup()
    shmem_file_setup()
      __shmem_file_setup()
        alloc_file()

shmem_zero_setup() is used by /dev/zero (drivers/char/mem.c) and
mmap(MAP_ANON). So yes, we do call mmap() on these files.
I also disagree on "for no visible reason". The reason to do this is
uniformity. We make i_writecount work on all inodes regardless how
they got created. Breaking consistent behavior just to save an
atomic_inc_unless_negative() in the alloc_file() path seems
unreasonable to me.

Anyhow, I haven't found any bug if we follow your recommendation.
Every path using alloc_file() either prevents write access on the
underlying inode (eg., anon-inode) or prevents user-space from getting
an FD (all the shmem_file_setup() paths). So it's up to you. Feel free
to fix it yourself, otherwise I will send a second patch following
your idea tomorrow.

Oh, and yes, the md stuff is broken, but I thought that's an orthogonal issue..

Thanks
David

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: fix i_writecount on shmem and friends
  2014-03-12 22:30     ` David Herrmann
@ 2014-03-13  0:37       ` Al Viro
  2014-03-13 11:03         ` David Herrmann
  2014-03-20 11:13         ` David Herrmann
  0 siblings, 2 replies; 11+ messages in thread
From: Al Viro @ 2014-03-13  0:37 UTC (permalink / raw)
  To: David Herrmann
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	Andrew Morton, David Howells, Oleg Nesterov, stable, Neil Brown

On Wed, Mar 12, 2014 at 11:30:09PM +0100, David Herrmann wrote:
> Hi
> 
> > I think it's trying to fix the problem in the wrong place.  The bug is real,
> > all right, but it's not that alloc_file() for non-regulars doesn't grab
> > writecount; it's that drop_file_write_access() drops it for those.
> >
> > What the hell would we want to play with that counter for, anyway?  It's not
> > as if they could be mmapped, so all it does is making pipe(2) and socket(2)
> > more costly, for no visible reason.
> 
> Please see:
>   shmem_zero_setup()
>     shmem_file_setup()
>       __shmem_file_setup()
>         alloc_file()
> 
> shmem_zero_setup() is used by /dev/zero (drivers/char/mem.c) and
> mmap(MAP_ANON). So yes, we do call mmap() on these files.

We do, but how do you get anything even attempt deny_write_access() on
those?  And what would the semantics of that be, anyway?

> I also disagree on "for no visible reason". The reason to do this is
> uniformity. We make i_writecount work on all inodes regardless how
> they got created. Breaking consistent behavior just to save an
> atomic_inc_unless_negative() in the alloc_file() path seems
> unreasonable to me.

i_writecount is a hack used to implement MAP_DENYWRITE (ignored since way
back) and to provide historical execve() behaviour (note, BTW, that it's
not consistent - e.g. it does apply to binary itself, but not to shared
libraries; scripts don't get much protection either - if the sucker was
opened for write during execve(), you get ETXTBUSY, but as soon as execve()
has opened the interpreter, script itself can be opened for write just fine).

Maintaining it for pipes and sockets is just plain nuts.

> Anyhow, I haven't found any bug if we follow your recommendation.
> Every path using alloc_file() either prevents write access on the
> underlying inode (eg., anon-inode) or prevents user-space from getting
> an FD (all the shmem_file_setup() paths). So it's up to you. Feel free
> to fix it yourself, otherwise I will send a second patch following
> your idea tomorrow.

We want the same in __get_file_write_access() as well (i.e. doing nothing
if the file isn't regular).

file_take_write(), f_mnt_write_state and the rest of CONFIG_DEBUG_WRITECOUNT
stuff should probably just disappear.

TBH, I'm none too happy about all those if (!special_file()) in that logics.
I would rather have do_dentry_open() do something like

	if ((f->f_mode & FMODE_WRITE) && !special_file(inode->i_mode)) {
		error = get_write_access(inode);
		if (error)
                        goto cleanup_file;
                error = __mnt_want_write(f->f_path.mnt);
		if (error) {
			put_write_access(inode);
			goto cleanup_file;
		}
		f->f_mode |= FMODE_WRITER;
	}
....
cleanup_all:
        fops_put(f->f_op);
        if (f->f_mode & FMODE_WRITER) {
                put_write_access(inode);
		__mnt_drop_write(f->f_path.mnt);
	}
cleanup_file:
....

fput() would just do
....
        if (file->f_mode & FMODE_WRITER) {
                put_write_access(inode);
		__mnt_drop_write(mnt);
	}
....
and alloc_file() would stop playing these games:
        /*
         * These mounts don't really matter in practice
         * for r/o bind mounts.  They aren't userspace-
         * visible.  We do this for consistency, and so
         * that we can do debugging checks at __fput()
         */
        if ((mode & FMODE_WRITE) && !special_file(path->dentry->d_inode->i_mode)) {
                file_take_write(file);
                WARN_ON(mnt_clone_write(path->mnt));
        }

And to hell with __get_file_write_access()/drop_file_write_access().
IOW, FMODE_WRITER == "we have bumped i_writecount and writer count on
vfsmount".  Makes life much simpler...

IOW, how about the following (completely untested):

diff --git a/fs/file_table.c b/fs/file_table.c
index 5b24008..ce1504f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -52,7 +52,6 @@ static void file_free_rcu(struct rcu_head *head)
 static inline void file_free(struct file *f)
 {
 	percpu_counter_dec(&nr_files);
-	file_check_state(f);
 	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
 }
 
@@ -178,47 +177,12 @@ struct file *alloc_file(struct path *path, fmode_t mode,
 	file->f_mapping = path->dentry->d_inode->i_mapping;
 	file->f_mode = mode;
 	file->f_op = fop;
-
-	/*
-	 * These mounts don't really matter in practice
-	 * for r/o bind mounts.  They aren't userspace-
-	 * visible.  We do this for consistency, and so
-	 * that we can do debugging checks at __fput()
-	 */
-	if ((mode & FMODE_WRITE) && !special_file(path->dentry->d_inode->i_mode)) {
-		file_take_write(file);
-		WARN_ON(mnt_clone_write(path->mnt));
-	}
 	if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
 		i_readcount_inc(path->dentry->d_inode);
 	return file;
 }
 EXPORT_SYMBOL(alloc_file);
 
-/**
- * drop_file_write_access - give up ability to write to a file
- * @file: the file to which we will stop writing
- *
- * This is a central place which will give up the ability
- * to write to @file, along with access to write through
- * its vfsmount.
- */
-static void drop_file_write_access(struct file *file)
-{
-	struct vfsmount *mnt = file->f_path.mnt;
-	struct dentry *dentry = file->f_path.dentry;
-	struct inode *inode = dentry->d_inode;
-
-	put_write_access(inode);
-
-	if (special_file(inode->i_mode))
-		return;
-	if (file_check_writeable(file) != 0)
-		return;
-	__mnt_drop_write(mnt);
-	file_release_write(file);
-}
-
 /* the real guts of fput() - releasing the last reference to file
  */
 static void __fput(struct file *file)
@@ -253,8 +217,10 @@ static void __fput(struct file *file)
 	put_pid(file->f_owner.pid);
 	if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
 		i_readcount_dec(inode);
-	if (file->f_mode & FMODE_WRITE)
-		drop_file_write_access(file);
+	if (file->f_mode & FMODE_WRITER) {
+		put_write_access(inode);
+		__mnt_drop_write(mnt);
+	}
 	file->f_path.dentry = NULL;
 	file->f_path.mnt = NULL;
 	file->f_inode = NULL;
diff --git a/fs/open.c b/fs/open.c
index b9ed8b2..ea765e4 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -632,35 +632,6 @@ out:
 	return error;
 }
 
-/*
- * You have to be very careful that these write
- * counts get cleaned up in error cases and
- * upon __fput().  This should probably never
- * be called outside of __dentry_open().
- */
-static inline int __get_file_write_access(struct inode *inode,
-					  struct vfsmount *mnt)
-{
-	int error;
-	error = get_write_access(inode);
-	if (error)
-		return error;
-	/*
-	 * Do not take mount writer counts on
-	 * special files since no writes to
-	 * the mount itself will occur.
-	 */
-	if (!special_file(inode->i_mode)) {
-		/*
-		 * Balanced in __fput()
-		 */
-		error = __mnt_want_write(mnt);
-		if (error)
-			put_write_access(inode);
-	}
-	return error;
-}
-
 int open_check_o_direct(struct file *f)
 {
 	/* NB: we're sure to have correct a_ops only after f_op->open */
@@ -690,12 +661,15 @@ static int do_dentry_open(struct file *f,
 
 	path_get(&f->f_path);
 	inode = f->f_inode = f->f_path.dentry->d_inode;
-	if (f->f_mode & FMODE_WRITE) {
-		error = __get_file_write_access(inode, f->f_path.mnt);
-		if (error)
+	if ((f->f_mode & FMODE_WRITE) && !special_file(inode->i_mode)) {
+		error = get_write_access(inode);
+		if (unlikely(error))
+			goto cleanup_file;
+		error = __mnt_want_write(f->f_path.mnt);
+		if (unlikely(error)) {
+			put_write_access(inode);
 			goto cleanup_file;
-		if (!special_file(inode->i_mode))
-			file_take_write(f);
+		}
 	}
 
 	f->f_mapping = inode->i_mapping;
@@ -741,18 +715,9 @@ static int do_dentry_open(struct file *f,
 
 cleanup_all:
 	fops_put(f->f_op);
-	if (f->f_mode & FMODE_WRITE) {
+	if (f->f_mode & FMODE_WRITER) {
 		put_write_access(inode);
-		if (!special_file(inode->i_mode)) {
-			/*
-			 * We don't consider this a real
-			 * mnt_want/drop_write() pair
-			 * because it all happenend right
-			 * here, so just reset the state.
-			 */
-			file_reset_write(f);
-			__mnt_drop_write(f->f_path.mnt);
-		}
+		__mnt_drop_write(f->f_path.mnt);
 	}
 cleanup_file:
 	path_put(&f->f_path);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 23b2a35..d9d88a0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -125,6 +125,8 @@ typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 
 /* File needs atomic accesses to f_pos */
 #define FMODE_ATOMIC_POS	((__force fmode_t)0x8000)
+/* Write access to underlying fs */
+#define FMODE_WRITER		((__force fmode_t)0x10000)
 
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x1000000)
@@ -769,9 +771,6 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
 		index <  ra->start + ra->size);
 }
 
-#define FILE_MNT_WRITE_TAKEN	1
-#define FILE_MNT_WRITE_RELEASED	2
-
 struct file {
 	union {
 		struct llist_node	fu_llist;
@@ -809,9 +808,6 @@ struct file {
 	struct list_head	f_tfile_llink;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
-#ifdef CONFIG_DEBUG_WRITECOUNT
-	unsigned long f_mnt_write_state;
-#endif
 } __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
 
 struct file_handle {
@@ -829,49 +825,6 @@ static inline struct file *get_file(struct file *f)
 #define fput_atomic(x)	atomic_long_add_unless(&(x)->f_count, -1, 1)
 #define file_count(x)	atomic_long_read(&(x)->f_count)
 
-#ifdef CONFIG_DEBUG_WRITECOUNT
-static inline void file_take_write(struct file *f)
-{
-	WARN_ON(f->f_mnt_write_state != 0);
-	f->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
-}
-static inline void file_release_write(struct file *f)
-{
-	f->f_mnt_write_state |= FILE_MNT_WRITE_RELEASED;
-}
-static inline void file_reset_write(struct file *f)
-{
-	f->f_mnt_write_state = 0;
-}
-static inline void file_check_state(struct file *f)
-{
-	/*
-	 * At this point, either both or neither of these bits
-	 * should be set.
-	 */
-	WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_TAKEN);
-	WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_RELEASED);
-}
-static inline int file_check_writeable(struct file *f)
-{
-	if (f->f_mnt_write_state == FILE_MNT_WRITE_TAKEN)
-		return 0;
-	printk(KERN_WARNING "writeable file with no "
-			    "mnt_want_write()\n");
-	WARN_ON(1);
-	return -EINVAL;
-}
-#else /* !CONFIG_DEBUG_WRITECOUNT */
-static inline void file_take_write(struct file *filp) {}
-static inline void file_release_write(struct file *filp) {}
-static inline void file_reset_write(struct file *filp) {}
-static inline void file_check_state(struct file *filp) {}
-static inline int file_check_writeable(struct file *filp)
-{
-	return 0;
-}
-#endif /* CONFIG_DEBUG_WRITECOUNT */
-
 #define	MAX_NON_LFS	((1UL<<31) - 1)
 
 /* Page cache limit. The filesystems should put that into their s_maxbytes 
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a48abea..7a08593 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1030,16 +1030,6 @@ config DEBUG_BUGVERBOSE
 	  of the BUG call as well as the EIP and oops trace.  This aids
 	  debugging but costs about 70-100K of memory.
 
-config DEBUG_WRITECOUNT
-	bool "Debug filesystem writers count"
-	depends on DEBUG_KERNEL
-	help
-	  Enable this to catch wrong use of the writers count in struct
-	  vfsmount.  This will increase the size of each file struct by
-	  32 bits.
-
-	  If unsure, say N.
-
 config DEBUG_LIST
 	bool "Debug linked list manipulation"
 	depends on DEBUG_KERNEL

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: fix i_writecount on shmem and friends
  2014-03-12 18:19   ` Al Viro
  2014-03-12 22:30     ` David Herrmann
@ 2014-03-13  4:08     ` NeilBrown
  2014-03-13  4:29       ` Al Viro
  1 sibling, 1 reply; 11+ messages in thread
From: NeilBrown @ 2014-03-13  4:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, David Herrmann, Linux Kernel Mailing List,
	linux-fsdevel, Andrew Morton, David Howells, Oleg Nesterov

[-- Attachment #1: Type: text/plain, Size: 6596 bytes --]

On Wed, 12 Mar 2014 18:19:25 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Tue, Mar 11, 2014 at 12:05:09PM -0700, Linus Torvalds wrote:
> > 
> > which returns ETXTBSY (most easily seen by just stracing it).
> > 
> > The patch would also seem to make sense, with the i_readcount_inc()
> > being immediately below for the FMODE_READ case.
> 
> I think it's trying to fix the problem in the wrong place.  The bug is real,
> all right, but it's not that alloc_file() for non-regulars doesn't grab
> writecount; it's that drop_file_write_access() drops it for those.
> 
> What the hell would we want to play with that counter for, anyway?  It's not
> as if they could be mmapped, so all it does is making pipe(2) and socket(2)
> more costly, for no visible reason.
> 
> I would prefer to flip
>         put_write_access(inode);
> 
>         if (special_file(inode->i_mode))
>                 return;
> in drop_file_write_access() instead.
> 
> <goes to looks at i_writecount users>
> Oh, shit...
> 
> drivers/md/md.c:
> /* similar to deny_write_access, but accounts for our holding a reference
>  * to the file ourselves */
> static int deny_bitmap_write_access(struct file * file)
> {
>         struct inode *inode = file->f_mapping->host;
> 
>         spin_lock(&inode->i_lock);
>         if (atomic_read(&inode->i_writecount) > 1) {
>                 spin_unlock(&inode->i_lock);
>                 return -ETXTBSY;
>         }
>         atomic_set(&inode->i_writecount, -1);
>         spin_unlock(&inode->i_lock);
> 
>         return 0;
> }
> 
> Broken.  get_write_access() will happily increment i_writecount e.g. from
> 1 to 2, without even looking at i_lock.

I guess someone changed exactly how i_writecount is used without check all
users ... that isn't like you Al :-)


>                                          Moreover, it's paired with
> void restore_bitmap_write_access(struct file *file)
> {
>         struct inode *inode = file->f_mapping->host;
> 
>         spin_lock(&inode->i_lock);
>         atomic_set(&inode->i_writecount, 1);
>         spin_unlock(&inode->i_lock);
> }
> Just what will happen if we do denywrite mmap() of that file in between?
> Even worse, the caller take file straight from fget(), with no sanity
> checks whatsoever.  Just what will happen if I give it e.g. a directory?
> Or a procfs/sysfs/whatnot file, for that matter?  Neil?  I realize that
> it's root-only, but still...

But as you point out, even "fixing" it to match the current i_writecount
behaviour wouldn't really be a proper fix.
Probably best to stop messing with i_writecount and just use it to guard
against using the same bitmap twice.
It won't prevent some other process writing to the file but that shouldn't
happen anyway.

Maybe something like the following once I've tested it.

Thanks,
NeilBrown

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 4195a01b1535..9a8e66ae04f5 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1988,7 +1988,6 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
 		if (mddev->bitmap_info.file) {
 			struct file *f = mddev->bitmap_info.file;
 			mddev->bitmap_info.file = NULL;
-			restore_bitmap_write_access(f);
 			fput(f);
 		}
 	} else {
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e28c9d2a1166..223126046e02 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5181,32 +5181,6 @@ static int restart_array(struct mddev *mddev)
 	return 0;
 }
 
-/* similar to deny_write_access, but accounts for our holding a reference
- * to the file ourselves */
-static int deny_bitmap_write_access(struct file * file)
-{
-	struct inode *inode = file->f_mapping->host;
-
-	spin_lock(&inode->i_lock);
-	if (atomic_read(&inode->i_writecount) > 1) {
-		spin_unlock(&inode->i_lock);
-		return -ETXTBSY;
-	}
-	atomic_set(&inode->i_writecount, -1);
-	spin_unlock(&inode->i_lock);
-
-	return 0;
-}
-
-void restore_bitmap_write_access(struct file *file)
-{
-	struct inode *inode = file->f_mapping->host;
-
-	spin_lock(&inode->i_lock);
-	atomic_set(&inode->i_writecount, 1);
-	spin_unlock(&inode->i_lock);
-}
-
 static void md_clean(struct mddev *mddev)
 {
 	mddev->array_sectors = 0;
@@ -5427,7 +5401,6 @@ static int do_md_stop(struct mddev * mddev, int mode,
 
 		bitmap_destroy(mddev);
 		if (mddev->bitmap_info.file) {
-			restore_bitmap_write_access(mddev->bitmap_info.file);
 			fput(mddev->bitmap_info.file);
 			mddev->bitmap_info.file = NULL;
 		}
@@ -5991,6 +5964,7 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
 
 
 	if (fd >= 0) {
+		struct inode *inode;
 		if (mddev->bitmap)
 			return -EEXIST; /* cannot add when bitmap is present */
 		mddev->bitmap_info.file = fget(fd);
@@ -6001,13 +5975,20 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
 			return -EBADF;
 		}
 
-		err = deny_bitmap_write_access(mddev->bitmap_info.file);
-		if (err) {
+		inode = mddev->bitmap_info.file->f_mapping->host;
+		if (!S_ISREG(inode->i_mode)) {
+			printk(KERN_ERR "%s: error: bitmap file must be a regular file\n",
+			       mdname(mddev));
+			fput(mddev->bitmap_info.file);
+			mddev->bitmap_info.file = NULL;
+			return -EBADF;
+		}
+		if (atomic_read(&inode->i_writecount) != 1) {
 			printk(KERN_ERR "%s: error: bitmap file is already in use\n",
 			       mdname(mddev));
 			fput(mddev->bitmap_info.file);
 			mddev->bitmap_info.file = NULL;
-			return err;
+			return -EBUSY;
 		}
 		mddev->bitmap_info.offset = 0; /* file overrides offset */
 	} else if (mddev->bitmap == NULL)
@@ -6027,10 +6008,8 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
 		mddev->pers->quiesce(mddev, 0);
 	}
 	if (fd < 0) {
-		if (mddev->bitmap_info.file) {
-			restore_bitmap_write_access(mddev->bitmap_info.file);
+		if (mddev->bitmap_info.file)
 			fput(mddev->bitmap_info.file);
-		}
 		mddev->bitmap_info.file = NULL;
 	}
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 07bba96de260..a49d991f3fe1 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -605,7 +605,6 @@ extern int md_check_no_bitmap(struct mddev *mddev);
 extern int md_integrity_register(struct mddev *mddev);
 extern void md_integrity_add_rdev(struct md_rdev *rdev, struct mddev *mddev);
 extern int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale);
-extern void restore_bitmap_write_access(struct file *file);
 
 extern void mddev_init(struct mddev *mddev);
 extern int md_run(struct mddev *mddev);


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: fix i_writecount on shmem and friends
  2014-03-13  4:08     ` NeilBrown
@ 2014-03-13  4:29       ` Al Viro
  2014-03-13  5:55         ` NeilBrown
  0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2014-03-13  4:29 UTC (permalink / raw)
  To: NeilBrown
  Cc: Linus Torvalds, David Herrmann, Linux Kernel Mailing List,
	linux-fsdevel, Andrew Morton, David Howells, Oleg Nesterov

On Thu, Mar 13, 2014 at 03:08:00PM +1100, NeilBrown wrote:
> +		inode = mddev->bitmap_info.file->f_mapping->host;
> +		if (!S_ISREG(inode->i_mode)) {
> +			printk(KERN_ERR "%s: error: bitmap file must be a regular file\n",
> +			       mdname(mddev));
> +			fput(mddev->bitmap_info.file);
> +			mddev->bitmap_info.file = NULL;
> +			return -EBADF;
> +		}
> +		if (atomic_read(&inode->i_writecount) != 1) {

Umm...  I think you ought to check more than that.  At the very least you
want to check that you have it opened for write - you don't want e.g.
a filesystem containing that puppy remounted r/o under you.  Another thing
is, what happens if it's not a buffer cache backed one?  Hell, what happens
if it's a file on NFS?  You are relying on bmap() working, right?  So it
looks like you ought to check if ->bmap() is there.  And I really wonder
how well does it play with journalling fs...

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: fix i_writecount on shmem and friends
  2014-03-13  4:29       ` Al Viro
@ 2014-03-13  5:55         ` NeilBrown
  2014-03-14  4:51           ` Al Viro
  0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2014-03-13  5:55 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, David Herrmann, Linux Kernel Mailing List,
	linux-fsdevel, Andrew Morton, David Howells, Oleg Nesterov

[-- Attachment #1: Type: text/plain, Size: 1458 bytes --]

On Thu, 13 Mar 2014 04:29:34 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Thu, Mar 13, 2014 at 03:08:00PM +1100, NeilBrown wrote:
> > +		inode = mddev->bitmap_info.file->f_mapping->host;
> > +		if (!S_ISREG(inode->i_mode)) {
> > +			printk(KERN_ERR "%s: error: bitmap file must be a regular file\n",
> > +			       mdname(mddev));
> > +			fput(mddev->bitmap_info.file);
> > +			mddev->bitmap_info.file = NULL;
> > +			return -EBADF;
> > +		}
> > +		if (atomic_read(&inode->i_writecount) != 1) {
> 
> Umm...  I think you ought to check more than that.  At the very least you
> want to check that you have it opened for write - you don't want e.g.
> a filesystem containing that puppy remounted r/o under you.  Another thing
> is, what happens if it's not a buffer cache backed one?  Hell, what happens
> if it's a file on NFS?  You are relying on bmap() working, right?  So it
> looks like you ought to check if ->bmap() is there.  And I really wonder
> how well does it play with journalling fs...

Can we do direct writes from kernel space yet?  If so I'll change the code to
do that so that it will work with any filesystem (which supports direct
writes).
(The documentation says we that bitmap files should only be used on ext2 or
ext3.  Most people use bitmaps on the raw devices so hopefully the few who
have a need for files will read the documentation :-)

(and yes, I check for FMODE_WRITE)

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: fix i_writecount on shmem and friends
  2014-03-13  0:37       ` Al Viro
@ 2014-03-13 11:03         ` David Herrmann
  2014-03-20 11:13         ` David Herrmann
  1 sibling, 0 replies; 11+ messages in thread
From: David Herrmann @ 2014-03-13 11:03 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	Andrew Morton, David Howells, Oleg Nesterov, stable, Neil Brown

Hi Al

On Thu, Mar 13, 2014 at 1:37 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Mar 12, 2014 at 11:30:09PM +0100, David Herrmann wrote:
>> Please see:
>>   shmem_zero_setup()
>>     shmem_file_setup()
>>       __shmem_file_setup()
>>         alloc_file()
>>
>> shmem_zero_setup() is used by /dev/zero (drivers/char/mem.c) and
>> mmap(MAP_ANON). So yes, we do call mmap() on these files.
>
> We do, but how do you get anything even attempt deny_write_access() on
> those?  And what would the semantics of that be, anyway?

I haven't found any such path either. Just wanted to point it out in
case you missed these.

> diff --git a/fs/file_table.c b/fs/file_table.c
> index 5b24008..ce1504f 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -178,47 +177,12 @@ struct file *alloc_file(struct path *path, fmode_t mode,
>         file->f_mapping = path->dentry->d_inode->i_mapping;
>         file->f_mode = mode;
>         file->f_op = fop;
> -
> -       /*
> -        * These mounts don't really matter in practice
> -        * for r/o bind mounts.  They aren't userspace-
> -        * visible.  We do this for consistency, and so
> -        * that we can do debugging checks at __fput()
> -        */
> -       if ((mode & FMODE_WRITE) && !special_file(path->dentry->d_inode->i_mode)) {
> -               file_take_write(file);
> -               WARN_ON(mnt_clone_write(path->mnt));
> -       }
>         if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
>                 i_readcount_inc(path->dentry->d_inode);
>         return file;
>  }
>  EXPORT_SYMBOL(alloc_file);
>
> -/**
> - * drop_file_write_access - give up ability to write to a file
> - * @file: the file to which we will stop writing
> - *
> - * This is a central place which will give up the ability
> - * to write to @file, along with access to write through
> - * its vfsmount.
> - */
> -static void drop_file_write_access(struct file *file)
> -{
> -       struct vfsmount *mnt = file->f_path.mnt;
> -       struct dentry *dentry = file->f_path.dentry;
> -       struct inode *inode = dentry->d_inode;
> -
> -       put_write_access(inode);
> -
> -       if (special_file(inode->i_mode))
> -               return;
> -       if (file_check_writeable(file) != 0)
> -               return;
> -       __mnt_drop_write(mnt);
> -       file_release_write(file);
> -}
> -
>  /* the real guts of fput() - releasing the last reference to file
>   */
>  static void __fput(struct file *file)
> @@ -253,8 +217,10 @@ static void __fput(struct file *file)
>         put_pid(file->f_owner.pid);
>         if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
>                 i_readcount_dec(inode);
> -       if (file->f_mode & FMODE_WRITE)
> -               drop_file_write_access(file);
> +       if (file->f_mode & FMODE_WRITER) {
> +               put_write_access(inode);
> +               __mnt_drop_write(mnt);
> +       }
>         file->f_path.dentry = NULL;
>         file->f_path.mnt = NULL;
>         file->f_inode = NULL;
> diff --git a/fs/open.c b/fs/open.c
> index b9ed8b2..ea765e4 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -632,35 +632,6 @@ out:
>         return error;
>  }
>
> -/*
> - * You have to be very careful that these write
> - * counts get cleaned up in error cases and
> - * upon __fput().  This should probably never
> - * be called outside of __dentry_open().
> - */
> -static inline int __get_file_write_access(struct inode *inode,
> -                                         struct vfsmount *mnt)
> -{
> -       int error;
> -       error = get_write_access(inode);
> -       if (error)
> -               return error;
> -       /*
> -        * Do not take mount writer counts on
> -        * special files since no writes to
> -        * the mount itself will occur.
> -        */
> -       if (!special_file(inode->i_mode)) {
> -               /*
> -                * Balanced in __fput()
> -                */
> -               error = __mnt_want_write(mnt);
> -               if (error)
> -                       put_write_access(inode);
> -       }
> -       return error;
> -}
> -
>  int open_check_o_direct(struct file *f)
>  {
>         /* NB: we're sure to have correct a_ops only after f_op->open */
> @@ -690,12 +661,15 @@ static int do_dentry_open(struct file *f,
>
>         path_get(&f->f_path);
>         inode = f->f_inode = f->f_path.dentry->d_inode;
> -       if (f->f_mode & FMODE_WRITE) {
> -               error = __get_file_write_access(inode, f->f_path.mnt);
> -               if (error)
> +       if ((f->f_mode & FMODE_WRITE) && !special_file(inode->i_mode)) {
> +               error = get_write_access(inode);
> +               if (unlikely(error))
> +                       goto cleanup_file;
> +               error = __mnt_want_write(f->f_path.mnt);
> +               if (unlikely(error)) {
> +                       put_write_access(inode);
>                         goto cleanup_file;
> -               if (!special_file(inode->i_mode))
> -                       file_take_write(f);

+                f->f_mode |= FMODE_WRITER;

Apart from that, the patch looks fine to me. In fact, if anyone wanted
the FMODE_WRITER behavior for alloc_file(), they can do that
themselves after it returns.
>From a first glance you could splitt of the removal of
DEBUG_WRITECOUNT into a separate patch. Makes reviewing easier.

Thanks
David

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: fix i_writecount on shmem and friends
  2014-03-13  5:55         ` NeilBrown
@ 2014-03-14  4:51           ` Al Viro
  0 siblings, 0 replies; 11+ messages in thread
From: Al Viro @ 2014-03-14  4:51 UTC (permalink / raw)
  To: NeilBrown
  Cc: Linus Torvalds, David Herrmann, Linux Kernel Mailing List,
	linux-fsdevel, Andrew Morton, David Howells, Oleg Nesterov

On Thu, Mar 13, 2014 at 04:55:41PM +1100, NeilBrown wrote:

> Can we do direct writes from kernel space yet?  If so I'll change the code to
> do that so that it will work with any filesystem (which supports direct
> writes).

You can - see __swap_writepage() (mm/page_io.c).  However, that area is
about to get a lot of massage, so it would make sense to wait a bit...

> (The documentation says we that bitmap files should only be used on ext2 or
> ext3.  Most people use bitmaps on the raw devices so hopefully the few who
> have a need for files will read the documentation :-)
> 
> (and yes, I check for FMODE_WRITE)

Umm...  Where?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: fix i_writecount on shmem and friends
  2014-03-13  0:37       ` Al Viro
  2014-03-13 11:03         ` David Herrmann
@ 2014-03-20 11:13         ` David Herrmann
  1 sibling, 0 replies; 11+ messages in thread
From: David Herrmann @ 2014-03-20 11:13 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	Andrew Morton, David Howells, Oleg Nesterov, stable

[-- Attachment #1: Type: text/plain, Size: 2126 bytes --]

Hi

On Thu, Mar 13, 2014 at 1:37 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> We do, but how do you get anything even attempt deny_write_access() on
> those?  And what would the semantics of that be, anyway?

I just discovered /proc/self/map_files/ (only enabled with
CONFIG_CHECKPOINT_RESTORE). Attached is an example to trigger an
i_writecount underrun on an anon-mapping with your recommended fix
applied. You can also find it at [1].

This example is a bit more complex, but basically does this:
 - get a fresh, executable zero-mapping
 - write an executable into the mapping
 - get a read-only file-descriptor to the underlying shmem file via
/proc/self/map_files/
 - execute that mapping via /proc/self/map_files/
 - try to get a writable FD via /proc/self/map_files, this will fail
with ETXTBUSY _even though_ we still own a writable mapping

Ok, maybe this example is stupid, uses non-standard functionality
(CONFIG_CHECKPOINT_RESTORE) and is a very unlikely use-case, but it
shows that there _are_ ways to trigger this underrun. The hard part is
to get access to an executable file-descriptor that was created via
alloc_file() rather than open(). Once you got this, you can always
trigger the underrun by executing the file while you still have a
_single_ writable mapping which wasn't accounted for in i_writecount.

/proc/self/map_files/ is root only, so this is not a security problem.
I'm fine if you argue /proc/self/map_files/ is insecure and shouldn't
be used. Or maybe it must be non-executable. I'm just saying there
might always be new interfaces that give you a file-descriptor to
files allocated with alloc_file() rather than open(). And once you got
this FD, you can always execve() it via /proc and get
deny_write_access() this way. By initializing i_writecount on all
files, we make sure this never happens.

Anyhow, as I really want this fixed either way, please let me know how
to proceed. I can polish your patch and resend it if you don't intend
to apply it yourself.

Thanks
david

[1] https://gist.githubusercontent.com/dvdhrm/9661331/raw/b751cc7859267bb62af393ba3817b92c5225a372/gistfile1.txt

[-- Attachment #2: test.c --]
[-- Type: text/x-csrc, Size: 2116 bytes --]

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char *args[64];
	struct stat st;
	sigset_t set;
	char path[128];
	void *map;
	size_t size;
	int sig, fd_exe, fd_map, fd;
	pid_t pid;
	ssize_t res;

	/*
	 * If called with arguments, we just act as a sleeping process that
	 * waits for any signal to arrive.
	 */
	if (argc > 1) {
		printf("dummy waiter init\n");
		/* dummy waiter; SIGTERM terminates us anyway */
		sigemptyset(&set);
		sigaddset(&set, SIGTERM);
		sigwait(&set, &sig);
		printf("dummy waiter exit\n");
		exit(0);
	}

	fd_exe = open("/proc/self/exe", O_RDONLY);
	if (fd_exe < 0) {
		printf("open(/proc/self/exe) failed: %m\n");
		abort();
	}

	if (fstat(fd_exe, &st) < 0) {
		printf("fstat(fd_exe) failed: %m\n");
		abort();
	}

	/* page-align */
	size = (st.st_size + 4095ULL) & ~4095ULL;

	map = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED) {
		printf("mmap(MAP_ANON) failed: %m\n");
		abort();
	}

	res = read(fd_exe, map, st.st_size);
	if (res != st.st_size) {
		if (res < 0)
			printf("read(fd_exe) failed: %m\n");
		else
			printf("read(fd_exe) was truncated to %zu\n",
			       (size_t)res);
		abort();
	}

	sprintf(path, "/proc/self/map_files/%lx-%lx",
		(unsigned long)map,
		(unsigned long)map + size);
	fd_map = open(path, O_RDONLY);
	if (fd_map < 0) {
		printf("open(%s) failed: %m\n", path);
		abort();
	}

	pid = fork();
	if (pid < 0) {
		printf("fork() failed: %m\n");
		abort();
	} else if (!pid) {
		args[0] = argv[0];
		args[1] = "dummy";
		args[2] = NULL;
		execve(path, args, NULL);
		printf("execve() failed: %m\n");
		abort();
	}

	/* sleep 100ms to make sure the execve() really worked */
	usleep(100 * 1000ULL);

	sprintf(path, "/proc/self/fd/%d", fd_map);
	fd = open(path, O_RDWR);
	if (fd < 0) {
		printf("open(%s) failed: %m\n", path);
		abort();
	}

	munmap(map, size);
	close(fd_exe);

	close(fd);
	close(fd_map);
	printf("exiting\n");

	return 0;
}

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-03-20 11:13 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-03-03 15:16 [PATCH] fs: fix i_writecount on shmem and friends David Herrmann
2014-03-11 19:05 ` Linus Torvalds
2014-03-12 18:19   ` Al Viro
2014-03-12 22:30     ` David Herrmann
2014-03-13  0:37       ` Al Viro
2014-03-13 11:03         ` David Herrmann
2014-03-20 11:13         ` David Herrmann
2014-03-13  4:08     ` NeilBrown
2014-03-13  4:29       ` Al Viro
2014-03-13  5:55         ` NeilBrown
2014-03-14  4:51           ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).