linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* i_mutex locking in generic_file_splice_write()
@ 2006-10-12 19:01 Mark Fasheh
  2006-10-12 19:54 ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Fasheh @ 2006-10-12 19:01 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Andrew Morton, Jens Axboe

Hi,
	generic_file_splice_write() will call into a file systems
->prepare_write() and ->commit_write() via the the pipe_to_file() actor.
pipe_to_file() is careful to take the pipe inode i_mutex, but nowhere in the
call path do I see i_mutex on the inode being written to taken.

Shouldn't we be taking this before calling into ->prepare_write() and
->commit_write(). What's preventing generic_file_splice_write() from racing
a truncate? Or maybe even another write?

A quick look through other callers reveals that generic_file_aio_write() and
do_lo_send_aops() both are careful to take i_mutex.

Thanks,
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: i_mutex locking in generic_file_splice_write()
  2006-10-12 19:01 i_mutex locking in generic_file_splice_write() Mark Fasheh
@ 2006-10-12 19:54 ` Andrew Morton
  2006-10-13  0:17   ` Mark Fasheh
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2006-10-12 19:54 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-fsdevel, Jens Axboe, Michael Halcrow

On Thu, 12 Oct 2006 12:01:52 -0700
Mark Fasheh <mark.fasheh@oracle.com> wrote:

> Hi,
> 	generic_file_splice_write() will call into a file systems
> ->prepare_write() and ->commit_write() via the the pipe_to_file() actor.
> pipe_to_file() is careful to take the pipe inode i_mutex, but nowhere in the
> call path do I see i_mutex on the inode being written to taken.

I'm not sure that ecryptfs has full i_mutex coverage either.

> Shouldn't we be taking this before calling into ->prepare_write() and
> ->commit_write(). What's preventing generic_file_splice_write() from racing
> a truncate? Or maybe even another write?

The lock_page() will block truncate and will block write()s to this particular
page.

> A quick look through other callers reveals that generic_file_aio_write() and
> do_lo_send_aops() both are careful to take i_mutex.

I'm trying to remember what i_mutex actually protects in this context. 
i_size, certainly - if we go changing the file size without locks then
other places might get surprised.  For example, a concurrent write() at a
larger file offset might try to increase i_size but if it loses the race
against the unlocked i_size-changing thread, the inode ends up with the
smaller i_size.

So yup, we need i_mutex if only for that reason.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: i_mutex locking in generic_file_splice_write()
  2006-10-12 19:54 ` Andrew Morton
@ 2006-10-13  0:17   ` Mark Fasheh
  2006-10-13  7:45     ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Fasheh @ 2006-10-13  0:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, Jens Axboe, Michael Halcrow

On Thu, Oct 12, 2006 at 12:54:09PM -0700, Andrew Morton wrote:
> > Shouldn't we be taking this before calling into ->prepare_write() and
> > ->commit_write(). What's preventing generic_file_splice_write() from racing
> > a truncate? Or maybe even another write?
> 
> The lock_page() will block truncate and will block write()s to this particular
> page.
Ok.


> > A quick look through other callers reveals that generic_file_aio_write() and
> > do_lo_send_aops() both are careful to take i_mutex.
> 
> I'm trying to remember what i_mutex actually protects in this context. 
> i_size, certainly - if we go changing the file size without locks then
> other places might get surprised.  For example, a concurrent write() at a
> larger file offset might try to increase i_size but if it loses the race
> against the unlocked i_size-changing thread, the inode ends up with the
> smaller i_size.
I'm also worried about concurrent allocation tree changes. Perhaps I'm
mistaken and all file systems we care about can handle them happening
concurrently, but otherwise couldn't two processes writing to different
sparse regions in a file cause problems? One process via file write and the
other via a splice write.


> So yup, we need i_mutex if only for that reason.
Ok. Here's a first pass. The double lock is ugly, but as far as I can tell
we need it. Unless there's a rule about ordering between pipe inodes and
"other" inodes that I don't know about.

Compile tested only. I probably won't get a chance to actually run it until
late this weekend at the earliest :/
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


From: Mark Fasheh <mark.fasheh@oracle.com>

[PATCH] Take i_mutex in splice_from_pipe()

The splice_actor may be calling ->prepare_write() and ->commit_write(). We
want i_mutex on the inode being written to before calling those so that we
don't race i_size changes.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>

---

 fs/splice.c |   21 +++++++++++++++++++--
 1 files changed, 19 insertions(+), 2 deletions(-)

96e880f22bd1c0e2809ebbfe5bf122ed67019e33
diff --git a/fs/splice.c b/fs/splice.c
index 13e92dd..e1ecb9e 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -713,6 +713,7 @@ ssize_t splice_from_pipe(struct pipe_ino
 {
 	int ret, do_wakeup, err;
 	struct splice_desc sd;
+	struct inode *inode = out->f_mapping->host;
 
 	ret = 0;
 	do_wakeup = 0;
@@ -722,8 +723,23 @@ ssize_t splice_from_pipe(struct pipe_ino
 	sd.file = out;
 	sd.pos = *ppos;
 
-	if (pipe->inode)
-		mutex_lock(&pipe->inode->i_mutex);
+	/*
+	 * The actor worker might be calling ->prepare_write and
+	 * ->commit_write. Most of the time, these expect i_mutex to
+	 * be held. Since this may result in an ABBA deadlock with
+	 * pipe->inode, we have to order lock acquiry here.
+	 */
+	if (pipe->inode) {
+		if (pipe->inode < inode) {
+			mutex_lock_nested(&pipe->inode->i_mutex, I_MUTEX_PARENT);
+			mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD);
+		} else {
+			mutex_lock_nested(&inode->i_mutex, I_MUTEX_PARENT);
+			mutex_lock_nested(&pipe->inode->i_mutex, I_MUTEX_CHILD);
+		}
+	} else {
+		mutex_lock(&inode->i_mutex);
+	}
 
 	for (;;) {
 		if (pipe->nrbufs) {
@@ -799,6 +815,7 @@ ssize_t splice_from_pipe(struct pipe_ino
 
 	if (pipe->inode)
 		mutex_unlock(&pipe->inode->i_mutex);
+	mutex_unlock(&inode->i_mutex);
 
 	if (do_wakeup) {
 		smp_mb();
-- 
1.3.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: i_mutex locking in generic_file_splice_write()
  2006-10-13  0:17   ` Mark Fasheh
@ 2006-10-13  7:45     ` Jens Axboe
  2006-10-13  8:11       ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2006-10-13  7:45 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Andrew Morton, linux-fsdevel, Michael Halcrow

On Thu, Oct 12 2006, Mark Fasheh wrote:
> On Thu, Oct 12, 2006 at 12:54:09PM -0700, Andrew Morton wrote:
> > > Shouldn't we be taking this before calling into ->prepare_write() and
> > > ->commit_write(). What's preventing generic_file_splice_write() from racing
> > > a truncate? Or maybe even another write?
> > 
> > The lock_page() will block truncate and will block write()s to this particular
> > page.
> Ok.
> 
> 
> > > A quick look through other callers reveals that generic_file_aio_write() and
> > > do_lo_send_aops() both are careful to take i_mutex.
> > 
> > I'm trying to remember what i_mutex actually protects in this context. 
> > i_size, certainly - if we go changing the file size without locks then
> > other places might get surprised.  For example, a concurrent write() at a
> > larger file offset might try to increase i_size but if it loses the race
> > against the unlocked i_size-changing thread, the inode ends up with the
> > smaller i_size.
> I'm also worried about concurrent allocation tree changes. Perhaps I'm
> mistaken and all file systems we care about can handle them happening
> concurrently, but otherwise couldn't two processes writing to different
> sparse regions in a file cause problems? One process via file write and the
> other via a splice write.
> 
> 
> > So yup, we need i_mutex if only for that reason.
> Ok. Here's a first pass. The double lock is ugly, but as far as I can tell
> we need it. Unless there's a rule about ordering between pipe inodes and
> "other" inodes that I don't know about.
> 
> Compile tested only. I probably won't get a chance to actually run it until
> late this weekend at the earliest :/

Patch looks ok to me. The double lock test only works, as long as splice
is the only one ever locking both mutexes. Or if others follow the same
ordering rules. I'm not very well versed in vfs matters, is that
guarenteed?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: i_mutex locking in generic_file_splice_write()
  2006-10-13  7:45     ` Jens Axboe
@ 2006-10-13  8:11       ` Andrew Morton
  2006-10-13  8:18         ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2006-10-13  8:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mark Fasheh, linux-fsdevel, Michael Halcrow

On Fri, 13 Oct 2006 09:45:17 +0200
Jens Axboe <jens.axboe@oracle.com> wrote:

> > Compile tested only. I probably won't get a chance to actually run it until
> > late this weekend at the earliest :/
> 
> Patch looks ok to me. The double lock test only works, as long as splice
> is the only one ever locking both mutexes. Or if others follow the same
> ordering rules. I'm not very well versed in vfs matters, is that
> guarenteed?

Taking the lowest-addressed lock first is the usual convention, if we really
have to do that.  I'm not aware of anywhere else where we pull this
trick with i_mutex though.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: i_mutex locking in generic_file_splice_write()
  2006-10-13  8:11       ` Andrew Morton
@ 2006-10-13  8:18         ` Jens Axboe
  2006-10-13 19:44           ` Mark Fasheh
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2006-10-13  8:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mark Fasheh, linux-fsdevel, Michael Halcrow

On Fri, Oct 13 2006, Andrew Morton wrote:
> On Fri, 13 Oct 2006 09:45:17 +0200
> Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > > Compile tested only. I probably won't get a chance to actually run it until
> > > late this weekend at the earliest :/
> > 
> > Patch looks ok to me. The double lock test only works, as long as splice
> > is the only one ever locking both mutexes. Or if others follow the same
> > ordering rules. I'm not very well versed in vfs matters, is that
> > guarenteed?
> 
> Taking the lowest-addressed lock first is the usual convention, if we really
> have to do that.  I'm not aware of anywhere else where we pull this
> trick with i_mutex though.

It is, but people had concerns with that approach when it was originally
done for pipe tee'ing. Things like lock_rename() look scary.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: i_mutex locking in generic_file_splice_write()
  2006-10-13  8:18         ` Jens Axboe
@ 2006-10-13 19:44           ` Mark Fasheh
  2006-10-15 18:05             ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Fasheh @ 2006-10-13 19:44 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-fsdevel, Michael Halcrow

On Fri, Oct 13, 2006 at 10:18:56AM +0200, Jens Axboe wrote:
> > Taking the lowest-addressed lock first is the usual convention, if we really
> > have to do that.  I'm not aware of anywhere else where we pull this
> > trick with i_mutex though.
> 
> It is, but people had concerns with that approach when it was originally
> done for pipe tee'ing. Things like lock_rename() look scary.
Yeah, I'm not too thrilled with the double locking either. Eventually I
think we'll need some _nolock variant of generic_file_splice_write() so that
OCFS2 can order cluster locks within i_mutex (much like you see with
ocfs2_file_aio_write(), etc).

At the very least, I'd rather not have this stuff open coded everywhere. If
we consolidate the behavior within a set of functions, we'll be able to
avoid questions later about locking order, etc.

I felt dirty just typing up this patch. Sigh. Let me know what you think.
	--Mark


From: Mark Fasheh <mark.fasheh@oracle.com>

[PATCH] Take i_mutex in splice_from_pipe()

The splice_actor may be calling ->prepare_write() and ->commit_write(). We
want i_mutex on the inode being written to before calling those so that we
don't race i_size changes.

The double locking behavior is done elsewhere in splice.c, and if we
eventually want _nolock variants of generic_file_splice_write(), fs modules
might have to replicate the nasty locking code. We introduce
inode_double_lock() and inode_double_unlock() to consolidate the locking
rules into one set of functions.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
---
 fs/splice.c        |   24 +++++++++++-------------
 include/linux/fs.h |   29 +++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 13e92dd..4d8a774 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -713,6 +713,7 @@ ssize_t splice_from_pipe(struct pipe_ino
 {
 	int ret, do_wakeup, err;
 	struct splice_desc sd;
+	struct inode *inode = out->f_mapping->host;
 
 	ret = 0;
 	do_wakeup = 0;
@@ -722,8 +723,13 @@ ssize_t splice_from_pipe(struct pipe_ino
 	sd.file = out;
 	sd.pos = *ppos;
 
-	if (pipe->inode)
-		mutex_lock(&pipe->inode->i_mutex);
+	/*
+	 * The actor worker might be calling ->prepare_write and
+	 * ->commit_write. Most of the time, these expect i_mutex to
+	 * be held. Since this may result in an ABBA deadlock with
+	 * pipe->inode, we have to order lock acquiry here.
+	 */
+	inode_double_lock(inode, pipe->inode);
 
 	for (;;) {
 		if (pipe->nrbufs) {
@@ -797,8 +803,7 @@ ssize_t splice_from_pipe(struct pipe_ino
 		pipe_wait(pipe);
 	}
 
-	if (pipe->inode)
-		mutex_unlock(&pipe->inode->i_mutex);
+	inode_double_unlock(inode, pipe->inode);
 
 	if (do_wakeup) {
 		smp_mb();
@@ -1400,13 +1405,7 @@ static int link_pipe(struct pipe_inode_i
 	 * grabbing by inode address. Otherwise two different processes
 	 * could deadlock (one doing tee from A -> B, the other from B -> A).
 	 */
-	if (ipipe->inode < opipe->inode) {
-		mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_CHILD);
-	} else {
-		mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_CHILD);
-	}
+	inode_double_lock(ipipe->inode, opipe->inode);
 
 	do {
 		if (!opipe->readers) {
@@ -1450,8 +1449,7 @@ static int link_pipe(struct pipe_inode_i
 		i++;
 	} while (len);
 
-	mutex_unlock(&ipipe->inode->i_mutex);
-	mutex_unlock(&opipe->inode->i_mutex);
+	inode_double_unlock(ipipe->inode, opipe->inode);
 
 	/*
 	 * If we put data in the output pipe, wakeup any potential readers.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 34406ed..31d8548 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -624,6 +624,35 @@ enum inode_i_mutex_lock_class
 };
 
 /*
+ * We rarely want to lock two inodes that do not have a parent/child
+ * relationship (such as directory, child inode) simultaneously. The
+ * vast majority of file systems should be able to get along fine
+ * without this. Do not use these functions except as a last resort.
+ */
+static inline void inode_double_lock(struct inode *inode1, struct inode *inode2)
+{
+	if (!inode2) {
+		mutex_lock(&inode1->i_mutex);
+		return;
+	}
+
+	if (inode1 < inode2) {
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+	} else {
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+	}
+}
+
+static inline void inode_double_unlock(struct inode *inode1, struct inode *inode2)
+{
+	mutex_unlock(&inode1->i_mutex);
+	if (inode2)
+		mutex_unlock(&inode2->i_mutex);
+}
+
+/*
  * NOTE: in a 32bit arch with a preemptable kernel and
  * an UP compile the i_size_read/write must be atomic
  * with respect to the local cpu (unlike with preempt disabled),
-- 
1.4.2.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: i_mutex locking in generic_file_splice_write()
  2006-10-13 19:44           ` Mark Fasheh
@ 2006-10-15 18:05             ` Jens Axboe
  2006-10-15 19:56               ` Mark Fasheh
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2006-10-15 18:05 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Andrew Morton, linux-fsdevel, Michael Halcrow

On Fri, Oct 13 2006, Mark Fasheh wrote:
>  /*
> + * We rarely want to lock two inodes that do not have a parent/child
> + * relationship (such as directory, child inode) simultaneously. The
> + * vast majority of file systems should be able to get along fine
> + * without this. Do not use these functions except as a last resort.
> + */
> +static inline void inode_double_lock(struct inode *inode1, struct inode *inode2)
> +{
> +	if (!inode2) {
> +		mutex_lock(&inode1->i_mutex);
> +		return;
> +	}
> +
> +	if (inode1 < inode2) {
> +		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
> +		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
> +	} else {
> +		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
> +		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
> +	}
> +}

This wont fly, it's completely tailored to splice in that one one inode
can be NULL. It also requires the 2nd inode to be that potentially NULL
pointer.

The function should work for any of the inodes to be NULL, if exported
as a helper.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: i_mutex locking in generic_file_splice_write()
  2006-10-15 18:05             ` Jens Axboe
@ 2006-10-15 19:56               ` Mark Fasheh
  2006-10-15 20:08                 ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Fasheh @ 2006-10-15 19:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-fsdevel, Michael Halcrow

On Sun, Oct 15, 2006 at 08:05:20PM +0200, Jens Axboe wrote:
> This wont fly, it's completely tailored to splice in that one one inode
> can be NULL. It also requires the 2nd inode to be that potentially NULL
> pointer.
*shrug* from my memory, 1st pointer always valid, 2nd possibly NULL seemed
to be a common paradigm, though I can't really point to any code, now that I
think of it.


> The function should work for any of the inodes to be NULL, if exported
> as a helper.
Here's a version that also handles both inodes being the same. We're quickly
approaching an elevated level of scariness in inode_double_lock :) Also,
things have been un-inlined.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


From: Mark Fasheh <mark.fasheh@oracle.com>

[PATCH] Take i_mutex in splice_from_pipe()

The splice_actor may be calling ->prepare_write() and ->commit_write(). We
want i_mutex on the inode being written to before calling those so that we
don't race i_size changes.

The double locking behavior is done elsewhere in splice.c, and if we
eventually want _nolock variants of generic_file_splice_write(), fs modules
might have to replicate the nasty locking code. We introduce
inode_double_lock() and inode_double_unlock() to consolidate the locking
rules into one set of functions.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
---
 fs/inode.c         |   31 +++++++++++++++++++++++++++++++
 fs/splice.c        |   24 +++++++++++-------------
 include/linux/fs.h |    3 +++
 3 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index bf6bec4..f42920d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1306,6 +1306,37 @@ void wake_up_inode(struct inode *inode)
 	wake_up_bit(&inode->i_state, __I_LOCK);
 }
 
+/*
+ * We rarely want to lock two inodes that do not have a parent/child
+ * relationship (such as directory, child inode) simultaneously. The
+ * vast majority of file systems should be able to get along fine
+ * without this. Do not use these functions except as a last resort.
+ */
+void inode_double_lock(struct inode *inode1, struct inode *inode2)
+{
+	if (!inode2) {
+		mutex_lock(&inode1->i_mutex);
+		return;
+	}
+
+	if (inode1 < inode2) {
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+	} else {
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+	}
+}
+EXPORT_SYMBOL(inode_double_lock);
+
+void inode_double_unlock(struct inode *inode1, struct inode *inode2)
+{
+	mutex_unlock(&inode1->i_mutex);
+	if (inode2)
+		mutex_unlock(&inode2->i_mutex);
+}
+EXPORT_SYMBOL(inode_double_unlock);
+
 static __initdata unsigned long ihash_entries;
 static int __init set_ihash_entries(char *str)
 {
diff --git a/fs/splice.c b/fs/splice.c
index 13e92dd..4d8a774 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -713,6 +713,7 @@ ssize_t splice_from_pipe(struct pipe_ino
 {
 	int ret, do_wakeup, err;
 	struct splice_desc sd;
+	struct inode *inode = out->f_mapping->host;
 
 	ret = 0;
 	do_wakeup = 0;
@@ -722,8 +723,13 @@ ssize_t splice_from_pipe(struct pipe_ino
 	sd.file = out;
 	sd.pos = *ppos;
 
-	if (pipe->inode)
-		mutex_lock(&pipe->inode->i_mutex);
+	/*
+	 * The actor worker might be calling ->prepare_write and
+	 * ->commit_write. Most of the time, these expect i_mutex to
+	 * be held. Since this may result in an ABBA deadlock with
+	 * pipe->inode, we have to order lock acquiry here.
+	 */
+	inode_double_lock(inode, pipe->inode);
 
 	for (;;) {
 		if (pipe->nrbufs) {
@@ -797,8 +803,7 @@ ssize_t splice_from_pipe(struct pipe_ino
 		pipe_wait(pipe);
 	}
 
-	if (pipe->inode)
-		mutex_unlock(&pipe->inode->i_mutex);
+	inode_double_unlock(inode, pipe->inode);
 
 	if (do_wakeup) {
 		smp_mb();
@@ -1400,13 +1405,7 @@ static int link_pipe(struct pipe_inode_i
 	 * grabbing by inode address. Otherwise two different processes
 	 * could deadlock (one doing tee from A -> B, the other from B -> A).
 	 */
-	if (ipipe->inode < opipe->inode) {
-		mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_CHILD);
-	} else {
-		mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_CHILD);
-	}
+	inode_double_lock(ipipe->inode, opipe->inode);
 
 	do {
 		if (!opipe->readers) {
@@ -1450,8 +1449,7 @@ static int link_pipe(struct pipe_inode_i
 		i++;
 	} while (len);
 
-	mutex_unlock(&ipipe->inode->i_mutex);
-	mutex_unlock(&opipe->inode->i_mutex);
+	inode_double_unlock(ipipe->inode, opipe->inode);
 
 	/*
 	 * If we put data in the output pipe, wakeup any potential readers.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 34406ed..ddbb833 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -623,6 +623,9 @@ enum inode_i_mutex_lock_class
 	I_MUTEX_QUOTA
 };
 
+extern void inode_double_lock(struct inode *inode1, struct inode *inode2);
+extern void inode_double_unlock(struct inode *inode1, struct inode *inode2);
+
 /*
  * NOTE: in a 32bit arch with a preemptable kernel and
  * an UP compile the i_size_read/write must be atomic
-- 
1.4.2.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: i_mutex locking in generic_file_splice_write()
  2006-10-15 19:56               ` Mark Fasheh
@ 2006-10-15 20:08                 ` Jens Axboe
  2006-10-15 20:14                   ` Mark Fasheh
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2006-10-15 20:08 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Andrew Morton, linux-fsdevel, Michael Halcrow

On Sun, Oct 15 2006, Mark Fasheh wrote:
> On Sun, Oct 15, 2006 at 08:05:20PM +0200, Jens Axboe wrote:
> > This wont fly, it's completely tailored to splice in that one one inode
> > can be NULL. It also requires the 2nd inode to be that potentially NULL
> > pointer.
> *shrug* from my memory, 1st pointer always valid, 2nd possibly NULL seemed
> to be a common paradigm, though I can't really point to any code, now that I
> think of it.
> 
> 
> > The function should work for any of the inodes to be NULL, if exported
> > as a helper.
> Here's a version that also handles both inodes being the same. We're quickly
> approaching an elevated level of scariness in inode_double_lock :) Also,
> things have been un-inlined.

Hmm, it doesn't seem any different :-). Bad diff attached?

> +void inode_double_lock(struct inode *inode1, struct inode *inode2)
> +{
> +	if (!inode2) {
> +		mutex_lock(&inode1->i_mutex);
> +		return;
> +	}
> +
> +	if (inode1 < inode2) {
> +		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
> +		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
> +	} else {
> +		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
> +		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
> +	}
> +}

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: i_mutex locking in generic_file_splice_write()
  2006-10-15 20:08                 ` Jens Axboe
@ 2006-10-15 20:14                   ` Mark Fasheh
  2006-10-16 17:58                     ` Andreas Dilger
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Fasheh @ 2006-10-15 20:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-fsdevel, Michael Halcrow

On Sun, Oct 15, 2006 at 10:08:28PM +0200, Jens Axboe wrote:
> Hmm, it doesn't seem any different :-). Bad diff attached?
Ahh, silly me - I applied the wrong version of the patch :) Here's the
correct one.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


From: Mark Fasheh <mark.fasheh@oracle.com>

[PATCH] Take i_mutex in splice_from_pipe()

The splice_actor may be calling ->prepare_write() and ->commit_write(). We
want i_mutex on the inode being written to before calling those so that we
don't race i_size changes.

The double locking behavior is done elsewhere in splice.c, and if we
eventually want _nolock variants of generic_file_splice_write(), fs modules
might have to replicate the nasty locking code. We introduce
inode_double_lock() and inode_double_unlock() to consolidate the locking
rules into one set of functions.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
---
 fs/inode.c         |   38 ++++++++++++++++++++++++++++++++++++++
 fs/splice.c        |   24 +++++++++++-------------
 include/linux/fs.h |    3 +++
 3 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index d9a21d1..f94b02b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1306,6 +1306,44 @@ void wake_up_inode(struct inode *inode)
 	wake_up_bit(&inode->i_state, __I_LOCK);
 }
 
+/*
+ * We rarely want to lock two inodes that do not have a parent/child
+ * relationship (such as directory, child inode) simultaneously. The
+ * vast majority of file systems should be able to get along fine
+ * without this. Do not use these functions except as a last resort.
+ */
+void inode_double_lock(struct inode *inode1, struct inode *inode2)
+{
+	if (!inode1 || !inode2) {
+		if (inode1)
+			mutex_lock(&inode1->i_mutex);
+		else if (inode2)
+			mutex_lock(&inode2->i_mutex);
+		return;
+	}
+
+	if (inode1 == inode2) {
+		mutex_lock(&inode1->i_mutex);
+	} else if (inode1 < inode2) {
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+	} else {
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+	}
+}
+EXPORT_SYMBOL(inode_double_lock);
+
+void inode_double_unlock(struct inode *inode1, struct inode *inode2)
+{
+	if (inode1)
+		mutex_unlock(&inode1->i_mutex);
+
+	if (inode2 && inode2 != inode1)
+		mutex_unlock(&inode2->i_mutex);
+}
+EXPORT_SYMBOL(inode_double_unlock);
+
 static __initdata unsigned long ihash_entries;
 static int __init set_ihash_entries(char *str)
 {
diff --git a/fs/splice.c b/fs/splice.c
index a567010..c1072b6 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -713,6 +713,7 @@ ssize_t splice_from_pipe(struct pipe_ino
 {
 	int ret, do_wakeup, err;
 	struct splice_desc sd;
+	struct inode *inode = out->f_mapping->host;
 
 	ret = 0;
 	do_wakeup = 0;
@@ -722,8 +723,13 @@ ssize_t splice_from_pipe(struct pipe_ino
 	sd.file = out;
 	sd.pos = *ppos;
 
-	if (pipe->inode)
-		mutex_lock(&pipe->inode->i_mutex);
+	/*
+	 * The actor worker might be calling ->prepare_write and
+	 * ->commit_write. Most of the time, these expect i_mutex to
+	 * be held. Since this may result in an ABBA deadlock with
+	 * pipe->inode, we have to order lock acquiry here.
+	 */
+	inode_double_lock(inode, pipe->inode);
 
 	for (;;) {
 		if (pipe->nrbufs) {
@@ -797,8 +803,7 @@ ssize_t splice_from_pipe(struct pipe_ino
 		pipe_wait(pipe);
 	}
 
-	if (pipe->inode)
-		mutex_unlock(&pipe->inode->i_mutex);
+	inode_double_unlock(inode, pipe->inode);
 
 	if (do_wakeup) {
 		smp_mb();
@@ -1400,13 +1405,7 @@ static int link_pipe(struct pipe_inode_i
 	 * grabbing by inode address. Otherwise two different processes
 	 * could deadlock (one doing tee from A -> B, the other from B -> A).
 	 */
-	if (ipipe->inode < opipe->inode) {
-		mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_CHILD);
-	} else {
-		mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_CHILD);
-	}
+	inode_double_lock(ipipe->inode, opipe->inode);
 
 	do {
 		if (!opipe->readers) {
@@ -1450,8 +1449,7 @@ static int link_pipe(struct pipe_inode_i
 		i++;
 	} while (len);
 
-	mutex_unlock(&ipipe->inode->i_mutex);
-	mutex_unlock(&opipe->inode->i_mutex);
+	inode_double_unlock(ipipe->inode, opipe->inode);
 
 	/*
 	 * If we put data in the output pipe, wakeup any potential readers.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 34406ed..ddbb833 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -623,6 +623,9 @@ enum inode_i_mutex_lock_class
 	I_MUTEX_QUOTA
 };
 
+extern void inode_double_lock(struct inode *inode1, struct inode *inode2);
+extern void inode_double_unlock(struct inode *inode1, struct inode *inode2);
+
 /*
  * NOTE: in a 32bit arch with a preemptable kernel and
  * an UP compile the i_size_read/write must be atomic
-- 
1.4.2.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: i_mutex locking in generic_file_splice_write()
  2006-10-15 20:14                   ` Mark Fasheh
@ 2006-10-16 17:58                     ` Andreas Dilger
  2006-10-16 22:24                       ` Mark Fasheh
  0 siblings, 1 reply; 13+ messages in thread
From: Andreas Dilger @ 2006-10-16 17:58 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Jens Axboe, Andrew Morton, linux-fsdevel, Michael Halcrow

On Oct 15, 2006  13:14 -0700, Mark Fasheh wrote:
> +void inode_double_lock(struct inode *inode1, struct inode *inode2)
> +{
> +	if (!inode1 || !inode2) {
> +		if (inode1)
> +			mutex_lock(&inode1->i_mutex);
> +		else if (inode2)
> +			mutex_lock(&inode2->i_mutex);
> +		return;
> +	}
> +
> +	if (inode1 == inode2) {
> +		mutex_lock(&inode1->i_mutex);

Could be a tiny bit cleaner if you put the "inode1 == inode2" case above:

void inode_double_lock(struct inode *inode1, struct inode *inode2)
{
	if (inode1 == NULL || inode2 == NULL || inode1 == inode2) {
		if (inode1)
			mutex_lock(&inode1->i_mutex);
		else if (inode2)
			mutex_lock(&inode2->i_mutex);
		return;
	}

	if (inode1 < inode2) {
		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
	} else {
		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
	}
}

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: i_mutex locking in generic_file_splice_write()
  2006-10-16 17:58                     ` Andreas Dilger
@ 2006-10-16 22:24                       ` Mark Fasheh
  0 siblings, 0 replies; 13+ messages in thread
From: Mark Fasheh @ 2006-10-16 22:24 UTC (permalink / raw)
  To: Jens Axboe, Andrew Morton, linux-fsdevel, Michael Halcrow

On Mon, Oct 16, 2006 at 11:58:08AM -0600, Andreas Dilger wrote:
> Could be a tiny bit cleaner if you put the "inode1 == inode2" case above:
Good catch - thanks for pointing it out. Here's a version with your cleanup.
This version of the patch was actually tested too.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


From: Mark Fasheh <mark.fasheh@oracle.com>
[PATCH] Take i_mutex in splice_from_pipe()

The splice_actor may be calling ->prepare_write() and ->commit_write(). We
want i_mutex on the inode being written to before calling those so that we
don't race i_size changes.

The double locking behavior is done elsewhere in splice.c, and if we
eventually want _nolock variants of generic_file_splice_write(), fs modules
might have to replicate the nasty locking code. We introduce
inode_double_lock() and inode_double_unlock() to consolidate the locking
rules into one set of functions.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>

---

 fs/inode.c         |   36 ++++++++++++++++++++++++++++++++++++
 fs/splice.c        |   24 +++++++++++-------------
 include/linux/fs.h |    3 +++
 3 files changed, 50 insertions(+), 13 deletions(-)

049fb2a0358c842a6c5256a4c0d488441bfd8e11
diff --git a/fs/inode.c b/fs/inode.c
index d9a21d1..26cdb11 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1306,6 +1306,42 @@ void wake_up_inode(struct inode *inode)
 	wake_up_bit(&inode->i_state, __I_LOCK);
 }
 
+/*
+ * We rarely want to lock two inodes that do not have a parent/child
+ * relationship (such as directory, child inode) simultaneously. The
+ * vast majority of file systems should be able to get along fine
+ * without this. Do not use these functions except as a last resort.
+ */
+void inode_double_lock(struct inode *inode1, struct inode *inode2)
+{
+	if (inode1 == NULL || inode2 == NULL || inode1 == inode2) {
+		if (inode1)
+			mutex_lock(&inode1->i_mutex);
+		else if (inode2)
+			mutex_lock(&inode2->i_mutex);
+		return;
+	}
+
+	if (inode1 < inode2) {
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+	} else {
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+	}
+}
+EXPORT_SYMBOL(inode_double_lock);
+
+void inode_double_unlock(struct inode *inode1, struct inode *inode2)
+{
+	if (inode1)
+		mutex_unlock(&inode1->i_mutex);
+
+	if (inode2 && inode2 != inode1)
+		mutex_unlock(&inode2->i_mutex);
+}
+EXPORT_SYMBOL(inode_double_unlock);
+
 static __initdata unsigned long ihash_entries;
 static int __init set_ihash_entries(char *str)
 {
diff --git a/fs/splice.c b/fs/splice.c
index a567010..c1072b6 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -713,6 +713,7 @@ ssize_t splice_from_pipe(struct pipe_ino
 {
 	int ret, do_wakeup, err;
 	struct splice_desc sd;
+	struct inode *inode = out->f_mapping->host;
 
 	ret = 0;
 	do_wakeup = 0;
@@ -722,8 +723,13 @@ ssize_t splice_from_pipe(struct pipe_ino
 	sd.file = out;
 	sd.pos = *ppos;
 
-	if (pipe->inode)
-		mutex_lock(&pipe->inode->i_mutex);
+	/*
+	 * The actor worker might be calling ->prepare_write and
+	 * ->commit_write. Most of the time, these expect i_mutex to
+	 * be held. Since this may result in an ABBA deadlock with
+	 * pipe->inode, we have to order lock acquiry here.
+	 */
+	inode_double_lock(inode, pipe->inode);
 
 	for (;;) {
 		if (pipe->nrbufs) {
@@ -797,8 +803,7 @@ ssize_t splice_from_pipe(struct pipe_ino
 		pipe_wait(pipe);
 	}
 
-	if (pipe->inode)
-		mutex_unlock(&pipe->inode->i_mutex);
+	inode_double_unlock(inode, pipe->inode);
 
 	if (do_wakeup) {
 		smp_mb();
@@ -1400,13 +1405,7 @@ static int link_pipe(struct pipe_inode_i
 	 * grabbing by inode address. Otherwise two different processes
 	 * could deadlock (one doing tee from A -> B, the other from B -> A).
 	 */
-	if (ipipe->inode < opipe->inode) {
-		mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_CHILD);
-	} else {
-		mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_CHILD);
-	}
+	inode_double_lock(ipipe->inode, opipe->inode);
 
 	do {
 		if (!opipe->readers) {
@@ -1450,8 +1449,7 @@ static int link_pipe(struct pipe_inode_i
 		i++;
 	} while (len);
 
-	mutex_unlock(&ipipe->inode->i_mutex);
-	mutex_unlock(&opipe->inode->i_mutex);
+	inode_double_unlock(ipipe->inode, opipe->inode);
 
 	/*
 	 * If we put data in the output pipe, wakeup any potential readers.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 34406ed..ddbb833 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -623,6 +623,9 @@ enum inode_i_mutex_lock_class
 	I_MUTEX_QUOTA
 };
 
+extern void inode_double_lock(struct inode *inode1, struct inode *inode2);
+extern void inode_double_unlock(struct inode *inode1, struct inode *inode2);
+
 /*
  * NOTE: in a 32bit arch with a preemptable kernel and
  * an UP compile the i_size_read/write must be atomic
-- 
1.3.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2006-10-16 22:24 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-10-12 19:01 i_mutex locking in generic_file_splice_write() Mark Fasheh
2006-10-12 19:54 ` Andrew Morton
2006-10-13  0:17   ` Mark Fasheh
2006-10-13  7:45     ` Jens Axboe
2006-10-13  8:11       ` Andrew Morton
2006-10-13  8:18         ` Jens Axboe
2006-10-13 19:44           ` Mark Fasheh
2006-10-15 18:05             ` Jens Axboe
2006-10-15 19:56               ` Mark Fasheh
2006-10-15 20:08                 ` Jens Axboe
2006-10-15 20:14                   ` Mark Fasheh
2006-10-16 17:58                     ` Andreas Dilger
2006-10-16 22:24                       ` Mark Fasheh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).