* i_mutex locking in generic_file_splice_write() @ 2006-10-12 19:01 Mark Fasheh 2006-10-12 19:54 ` Andrew Morton 0 siblings, 1 reply; 13+ messages in thread From: Mark Fasheh @ 2006-10-12 19:01 UTC (permalink / raw) To: linux-fsdevel; +Cc: Andrew Morton, Jens Axboe Hi, generic_file_splice_write() will call into a file systems ->prepare_write() and ->commit_write() via the the pipe_to_file() actor. pipe_to_file() is careful to take the pipe inode i_mutex, but nowhere in the call path do I see i_mutex on the inode being written to taken. Shouldn't we be taking this before calling into ->prepare_write() and ->commit_write(). What's preventing generic_file_splice_write() from racing a truncate? Or maybe even another write? A quick look through other callers reveals that generic_file_aio_write() and do_lo_send_aops() both are careful to take i_mutex. Thanks, --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh@oracle.com ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: i_mutex locking in generic_file_splice_write() 2006-10-12 19:01 i_mutex locking in generic_file_splice_write() Mark Fasheh @ 2006-10-12 19:54 ` Andrew Morton 2006-10-13 0:17 ` Mark Fasheh 0 siblings, 1 reply; 13+ messages in thread From: Andrew Morton @ 2006-10-12 19:54 UTC (permalink / raw) To: Mark Fasheh; +Cc: linux-fsdevel, Jens Axboe, Michael Halcrow On Thu, 12 Oct 2006 12:01:52 -0700 Mark Fasheh <mark.fasheh@oracle.com> wrote: > Hi, > generic_file_splice_write() will call into a file systems > ->prepare_write() and ->commit_write() via the the pipe_to_file() actor. > pipe_to_file() is careful to take the pipe inode i_mutex, but nowhere in the > call path do I see i_mutex on the inode being written to taken. I'm not sure that ecryptfs has full i_mutex coverage either. > Shouldn't we be taking this before calling into ->prepare_write() and > ->commit_write(). What's preventing generic_file_splice_write() from racing > a truncate? Or maybe even another write? The lock_page() will block truncate and will block write()s to this particular page. > A quick look through other callers reveals that generic_file_aio_write() and > do_lo_send_aops() both are careful to take i_mutex. I'm trying to remember what i_mutex actually protects in this context. i_size, certainly - if we go changing the file size without locks then other places might get surprised. For example, a concurrent write() at a larger file offset might try to increase i_size but if it loses the race against the unlocked i_size-changing thread, the inode ends up with the smaller i_size. So yup, we need i_mutex if only for that reason. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: i_mutex locking in generic_file_splice_write() 2006-10-12 19:54 ` Andrew Morton @ 2006-10-13 0:17 ` Mark Fasheh 2006-10-13 7:45 ` Jens Axboe 0 siblings, 1 reply; 13+ messages in thread From: Mark Fasheh @ 2006-10-13 0:17 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-fsdevel, Jens Axboe, Michael Halcrow On Thu, Oct 12, 2006 at 12:54:09PM -0700, Andrew Morton wrote: > > Shouldn't we be taking this before calling into ->prepare_write() and > > ->commit_write(). What's preventing generic_file_splice_write() from racing > > a truncate? Or maybe even another write? > > The lock_page() will block truncate and will block write()s to this particular > page. Ok. > > A quick look through other callers reveals that generic_file_aio_write() and > > do_lo_send_aops() both are careful to take i_mutex. > > I'm trying to remember what i_mutex actually protects in this context. > i_size, certainly - if we go changing the file size without locks then > other places might get surprised. For example, a concurrent write() at a > larger file offset might try to increase i_size but if it loses the race > against the unlocked i_size-changing thread, the inode ends up with the > smaller i_size. I'm also worried about concurrent allocation tree changes. Perhaps I'm mistaken and all file systems we care about can handle them happening concurrently, but otherwise couldn't two processes writing to different sparse regions in a file cause problems? One process via file write and the other via a splice write. > So yup, we need i_mutex if only for that reason. Ok. Here's a first pass. The double lock is ugly, but as far as I can tell we need it. Unless there's a rule about ordering between pipe inodes and "other" inodes that I don't know about. Compile tested only. I probably won't get a chance to actually run it until late this weekend at the earliest :/ --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh@oracle.com From: Mark Fasheh <mark.fasheh@oracle.com> [PATCH] Take i_mutex in splice_from_pipe() The splice_actor may be calling ->prepare_write() and ->commit_write(). We want i_mutex on the inode being written to before calling those so that we don't race i_size changes. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> --- fs/splice.c | 21 +++++++++++++++++++-- 1 files changed, 19 insertions(+), 2 deletions(-) 96e880f22bd1c0e2809ebbfe5bf122ed67019e33 diff --git a/fs/splice.c b/fs/splice.c index 13e92dd..e1ecb9e 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -713,6 +713,7 @@ ssize_t splice_from_pipe(struct pipe_ino { int ret, do_wakeup, err; struct splice_desc sd; + struct inode *inode = out->f_mapping->host; ret = 0; do_wakeup = 0; @@ -722,8 +723,23 @@ ssize_t splice_from_pipe(struct pipe_ino sd.file = out; sd.pos = *ppos; - if (pipe->inode) - mutex_lock(&pipe->inode->i_mutex); + /* + * The actor worker might be calling ->prepare_write and + * ->commit_write. Most of the time, these expect i_mutex to + * be held. Since this may result in an ABBA deadlock with + * pipe->inode, we have to order lock acquiry here. + */ + if (pipe->inode) { + if (pipe->inode < inode) { + mutex_lock_nested(&pipe->inode->i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD); + } else { + mutex_lock_nested(&inode->i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(&pipe->inode->i_mutex, I_MUTEX_CHILD); + } + } else { + mutex_lock(&inode->i_mutex); + } for (;;) { if (pipe->nrbufs) { @@ -799,6 +815,7 @@ ssize_t splice_from_pipe(struct pipe_ino if (pipe->inode) mutex_unlock(&pipe->inode->i_mutex); + mutex_unlock(&inode->i_mutex); if (do_wakeup) { smp_mb(); -- 1.3.3 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: i_mutex locking in generic_file_splice_write() 2006-10-13 0:17 ` Mark Fasheh @ 2006-10-13 7:45 ` Jens Axboe 2006-10-13 8:11 ` Andrew Morton 0 siblings, 1 reply; 13+ messages in thread From: Jens Axboe @ 2006-10-13 7:45 UTC (permalink / raw) To: Mark Fasheh; +Cc: Andrew Morton, linux-fsdevel, Michael Halcrow On Thu, Oct 12 2006, Mark Fasheh wrote: > On Thu, Oct 12, 2006 at 12:54:09PM -0700, Andrew Morton wrote: > > > Shouldn't we be taking this before calling into ->prepare_write() and > > > ->commit_write(). What's preventing generic_file_splice_write() from racing > > > a truncate? Or maybe even another write? > > > > The lock_page() will block truncate and will block write()s to this particular > > page. > Ok. > > > > > A quick look through other callers reveals that generic_file_aio_write() and > > > do_lo_send_aops() both are careful to take i_mutex. > > > > I'm trying to remember what i_mutex actually protects in this context. > > i_size, certainly - if we go changing the file size without locks then > > other places might get surprised. For example, a concurrent write() at a > > larger file offset might try to increase i_size but if it loses the race > > against the unlocked i_size-changing thread, the inode ends up with the > > smaller i_size. > I'm also worried about concurrent allocation tree changes. Perhaps I'm > mistaken and all file systems we care about can handle them happening > concurrently, but otherwise couldn't two processes writing to different > sparse regions in a file cause problems? One process via file write and the > other via a splice write. > > > > So yup, we need i_mutex if only for that reason. > Ok. Here's a first pass. The double lock is ugly, but as far as I can tell > we need it. Unless there's a rule about ordering between pipe inodes and > "other" inodes that I don't know about. > > Compile tested only. I probably won't get a chance to actually run it until > late this weekend at the earliest :/ Patch looks ok to me. The double lock test only works, as long as splice is the only one ever locking both mutexes. Or if others follow the same ordering rules. I'm not very well versed in vfs matters, is that guarenteed? -- Jens Axboe ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: i_mutex locking in generic_file_splice_write() 2006-10-13 7:45 ` Jens Axboe @ 2006-10-13 8:11 ` Andrew Morton 2006-10-13 8:18 ` Jens Axboe 0 siblings, 1 reply; 13+ messages in thread From: Andrew Morton @ 2006-10-13 8:11 UTC (permalink / raw) To: Jens Axboe; +Cc: Mark Fasheh, linux-fsdevel, Michael Halcrow On Fri, 13 Oct 2006 09:45:17 +0200 Jens Axboe <jens.axboe@oracle.com> wrote: > > Compile tested only. I probably won't get a chance to actually run it until > > late this weekend at the earliest :/ > > Patch looks ok to me. The double lock test only works, as long as splice > is the only one ever locking both mutexes. Or if others follow the same > ordering rules. I'm not very well versed in vfs matters, is that > guarenteed? Taking the lowest-addressed lock first is the usual convention, if we really have to do that. I'm not aware of anywhere else where we pull this trick with i_mutex though. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: i_mutex locking in generic_file_splice_write() 2006-10-13 8:11 ` Andrew Morton @ 2006-10-13 8:18 ` Jens Axboe 2006-10-13 19:44 ` Mark Fasheh 0 siblings, 1 reply; 13+ messages in thread From: Jens Axboe @ 2006-10-13 8:18 UTC (permalink / raw) To: Andrew Morton; +Cc: Mark Fasheh, linux-fsdevel, Michael Halcrow On Fri, Oct 13 2006, Andrew Morton wrote: > On Fri, 13 Oct 2006 09:45:17 +0200 > Jens Axboe <jens.axboe@oracle.com> wrote: > > > > Compile tested only. I probably won't get a chance to actually run it until > > > late this weekend at the earliest :/ > > > > Patch looks ok to me. The double lock test only works, as long as splice > > is the only one ever locking both mutexes. Or if others follow the same > > ordering rules. I'm not very well versed in vfs matters, is that > > guarenteed? > > Taking the lowest-addressed lock first is the usual convention, if we really > have to do that. I'm not aware of anywhere else where we pull this > trick with i_mutex though. It is, but people had concerns with that approach when it was originally done for pipe tee'ing. Things like lock_rename() look scary. -- Jens Axboe ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: i_mutex locking in generic_file_splice_write() 2006-10-13 8:18 ` Jens Axboe @ 2006-10-13 19:44 ` Mark Fasheh 2006-10-15 18:05 ` Jens Axboe 0 siblings, 1 reply; 13+ messages in thread From: Mark Fasheh @ 2006-10-13 19:44 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, linux-fsdevel, Michael Halcrow On Fri, Oct 13, 2006 at 10:18:56AM +0200, Jens Axboe wrote: > > Taking the lowest-addressed lock first is the usual convention, if we really > > have to do that. I'm not aware of anywhere else where we pull this > > trick with i_mutex though. > > It is, but people had concerns with that approach when it was originally > done for pipe tee'ing. Things like lock_rename() look scary. Yeah, I'm not too thrilled with the double locking either. Eventually I think we'll need some _nolock variant of generic_file_splice_write() so that OCFS2 can order cluster locks within i_mutex (much like you see with ocfs2_file_aio_write(), etc). At the very least, I'd rather not have this stuff open coded everywhere. If we consolidate the behavior within a set of functions, we'll be able to avoid questions later about locking order, etc. I felt dirty just typing up this patch. Sigh. Let me know what you think. --Mark From: Mark Fasheh <mark.fasheh@oracle.com> [PATCH] Take i_mutex in splice_from_pipe() The splice_actor may be calling ->prepare_write() and ->commit_write(). We want i_mutex on the inode being written to before calling those so that we don't race i_size changes. The double locking behavior is done elsewhere in splice.c, and if we eventually want _nolock variants of generic_file_splice_write(), fs modules might have to replicate the nasty locking code. We introduce inode_double_lock() and inode_double_unlock() to consolidate the locking rules into one set of functions. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> --- fs/splice.c | 24 +++++++++++------------- include/linux/fs.h | 29 +++++++++++++++++++++++++++++ 2 files changed, 40 insertions(+), 13 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index 13e92dd..4d8a774 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -713,6 +713,7 @@ ssize_t splice_from_pipe(struct pipe_ino { int ret, do_wakeup, err; struct splice_desc sd; + struct inode *inode = out->f_mapping->host; ret = 0; do_wakeup = 0; @@ -722,8 +723,13 @@ ssize_t splice_from_pipe(struct pipe_ino sd.file = out; sd.pos = *ppos; - if (pipe->inode) - mutex_lock(&pipe->inode->i_mutex); + /* + * The actor worker might be calling ->prepare_write and + * ->commit_write. Most of the time, these expect i_mutex to + * be held. Since this may result in an ABBA deadlock with + * pipe->inode, we have to order lock acquiry here. + */ + inode_double_lock(inode, pipe->inode); for (;;) { if (pipe->nrbufs) { @@ -797,8 +803,7 @@ ssize_t splice_from_pipe(struct pipe_ino pipe_wait(pipe); } - if (pipe->inode) - mutex_unlock(&pipe->inode->i_mutex); + inode_double_unlock(inode, pipe->inode); if (do_wakeup) { smp_mb(); @@ -1400,13 +1405,7 @@ static int link_pipe(struct pipe_inode_i * grabbing by inode address. Otherwise two different processes * could deadlock (one doing tee from A -> B, the other from B -> A). */ - if (ipipe->inode < opipe->inode) { - mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_PARENT); - mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_CHILD); - } else { - mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_PARENT); - mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_CHILD); - } + inode_double_lock(ipipe->inode, opipe->inode); do { if (!opipe->readers) { @@ -1450,8 +1449,7 @@ static int link_pipe(struct pipe_inode_i i++; } while (len); - mutex_unlock(&ipipe->inode->i_mutex); - mutex_unlock(&opipe->inode->i_mutex); + inode_double_unlock(ipipe->inode, opipe->inode); /* * If we put data in the output pipe, wakeup any potential readers. diff --git a/include/linux/fs.h b/include/linux/fs.h index 34406ed..31d8548 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -624,6 +624,35 @@ enum inode_i_mutex_lock_class }; /* + * We rarely want to lock two inodes that do not have a parent/child + * relationship (such as directory, child inode) simultaneously. The + * vast majority of file systems should be able to get along fine + * without this. Do not use these functions except as a last resort. + */ +static inline void inode_double_lock(struct inode *inode1, struct inode *inode2) +{ + if (!inode2) { + mutex_lock(&inode1->i_mutex); + return; + } + + if (inode1 < inode2) { + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD); + } else { + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD); + } +} + +static inline void inode_double_unlock(struct inode *inode1, struct inode *inode2) +{ + mutex_unlock(&inode1->i_mutex); + if (inode2) + mutex_unlock(&inode2->i_mutex); +} + +/* * NOTE: in a 32bit arch with a preemptable kernel and * an UP compile the i_size_read/write must be atomic * with respect to the local cpu (unlike with preempt disabled), -- 1.4.2.3 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: i_mutex locking in generic_file_splice_write() 2006-10-13 19:44 ` Mark Fasheh @ 2006-10-15 18:05 ` Jens Axboe 2006-10-15 19:56 ` Mark Fasheh 0 siblings, 1 reply; 13+ messages in thread From: Jens Axboe @ 2006-10-15 18:05 UTC (permalink / raw) To: Mark Fasheh; +Cc: Andrew Morton, linux-fsdevel, Michael Halcrow On Fri, Oct 13 2006, Mark Fasheh wrote: > /* > + * We rarely want to lock two inodes that do not have a parent/child > + * relationship (such as directory, child inode) simultaneously. The > + * vast majority of file systems should be able to get along fine > + * without this. Do not use these functions except as a last resort. > + */ > +static inline void inode_double_lock(struct inode *inode1, struct inode *inode2) > +{ > + if (!inode2) { > + mutex_lock(&inode1->i_mutex); > + return; > + } > + > + if (inode1 < inode2) { > + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT); > + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD); > + } else { > + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT); > + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD); > + } > +} This wont fly, it's completely tailored to splice in that one one inode can be NULL. It also requires the 2nd inode to be that potentially NULL pointer. The function should work for any of the inodes to be NULL, if exported as a helper. -- Jens Axboe ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: i_mutex locking in generic_file_splice_write() 2006-10-15 18:05 ` Jens Axboe @ 2006-10-15 19:56 ` Mark Fasheh 2006-10-15 20:08 ` Jens Axboe 0 siblings, 1 reply; 13+ messages in thread From: Mark Fasheh @ 2006-10-15 19:56 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, linux-fsdevel, Michael Halcrow On Sun, Oct 15, 2006 at 08:05:20PM +0200, Jens Axboe wrote: > This wont fly, it's completely tailored to splice in that one one inode > can be NULL. It also requires the 2nd inode to be that potentially NULL > pointer. *shrug* from my memory, 1st pointer always valid, 2nd possibly NULL seemed to be a common paradigm, though I can't really point to any code, now that I think of it. > The function should work for any of the inodes to be NULL, if exported > as a helper. Here's a version that also handles both inodes being the same. We're quickly approaching an elevated level of scariness in inode_double_lock :) Also, things have been un-inlined. --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh@oracle.com From: Mark Fasheh <mark.fasheh@oracle.com> [PATCH] Take i_mutex in splice_from_pipe() The splice_actor may be calling ->prepare_write() and ->commit_write(). We want i_mutex on the inode being written to before calling those so that we don't race i_size changes. The double locking behavior is done elsewhere in splice.c, and if we eventually want _nolock variants of generic_file_splice_write(), fs modules might have to replicate the nasty locking code. We introduce inode_double_lock() and inode_double_unlock() to consolidate the locking rules into one set of functions. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> --- fs/inode.c | 31 +++++++++++++++++++++++++++++++ fs/splice.c | 24 +++++++++++------------- include/linux/fs.h | 3 +++ 3 files changed, 45 insertions(+), 13 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index bf6bec4..f42920d 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1306,6 +1306,37 @@ void wake_up_inode(struct inode *inode) wake_up_bit(&inode->i_state, __I_LOCK); } +/* + * We rarely want to lock two inodes that do not have a parent/child + * relationship (such as directory, child inode) simultaneously. The + * vast majority of file systems should be able to get along fine + * without this. Do not use these functions except as a last resort. + */ +void inode_double_lock(struct inode *inode1, struct inode *inode2) +{ + if (!inode2) { + mutex_lock(&inode1->i_mutex); + return; + } + + if (inode1 < inode2) { + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD); + } else { + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD); + } +} +EXPORT_SYMBOL(inode_double_lock); + +void inode_double_unlock(struct inode *inode1, struct inode *inode2) +{ + mutex_unlock(&inode1->i_mutex); + if (inode2) + mutex_unlock(&inode2->i_mutex); +} +EXPORT_SYMBOL(inode_double_unlock); + static __initdata unsigned long ihash_entries; static int __init set_ihash_entries(char *str) { diff --git a/fs/splice.c b/fs/splice.c index 13e92dd..4d8a774 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -713,6 +713,7 @@ ssize_t splice_from_pipe(struct pipe_ino { int ret, do_wakeup, err; struct splice_desc sd; + struct inode *inode = out->f_mapping->host; ret = 0; do_wakeup = 0; @@ -722,8 +723,13 @@ ssize_t splice_from_pipe(struct pipe_ino sd.file = out; sd.pos = *ppos; - if (pipe->inode) - mutex_lock(&pipe->inode->i_mutex); + /* + * The actor worker might be calling ->prepare_write and + * ->commit_write. Most of the time, these expect i_mutex to + * be held. Since this may result in an ABBA deadlock with + * pipe->inode, we have to order lock acquiry here. + */ + inode_double_lock(inode, pipe->inode); for (;;) { if (pipe->nrbufs) { @@ -797,8 +803,7 @@ ssize_t splice_from_pipe(struct pipe_ino pipe_wait(pipe); } - if (pipe->inode) - mutex_unlock(&pipe->inode->i_mutex); + inode_double_unlock(inode, pipe->inode); if (do_wakeup) { smp_mb(); @@ -1400,13 +1405,7 @@ static int link_pipe(struct pipe_inode_i * grabbing by inode address. Otherwise two different processes * could deadlock (one doing tee from A -> B, the other from B -> A). */ - if (ipipe->inode < opipe->inode) { - mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_PARENT); - mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_CHILD); - } else { - mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_PARENT); - mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_CHILD); - } + inode_double_lock(ipipe->inode, opipe->inode); do { if (!opipe->readers) { @@ -1450,8 +1449,7 @@ static int link_pipe(struct pipe_inode_i i++; } while (len); - mutex_unlock(&ipipe->inode->i_mutex); - mutex_unlock(&opipe->inode->i_mutex); + inode_double_unlock(ipipe->inode, opipe->inode); /* * If we put data in the output pipe, wakeup any potential readers. diff --git a/include/linux/fs.h b/include/linux/fs.h index 34406ed..ddbb833 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -623,6 +623,9 @@ enum inode_i_mutex_lock_class I_MUTEX_QUOTA }; +extern void inode_double_lock(struct inode *inode1, struct inode *inode2); +extern void inode_double_unlock(struct inode *inode1, struct inode *inode2); + /* * NOTE: in a 32bit arch with a preemptable kernel and * an UP compile the i_size_read/write must be atomic -- 1.4.2.3 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: i_mutex locking in generic_file_splice_write() 2006-10-15 19:56 ` Mark Fasheh @ 2006-10-15 20:08 ` Jens Axboe 2006-10-15 20:14 ` Mark Fasheh 0 siblings, 1 reply; 13+ messages in thread From: Jens Axboe @ 2006-10-15 20:08 UTC (permalink / raw) To: Mark Fasheh; +Cc: Andrew Morton, linux-fsdevel, Michael Halcrow On Sun, Oct 15 2006, Mark Fasheh wrote: > On Sun, Oct 15, 2006 at 08:05:20PM +0200, Jens Axboe wrote: > > This wont fly, it's completely tailored to splice in that one one inode > > can be NULL. It also requires the 2nd inode to be that potentially NULL > > pointer. > *shrug* from my memory, 1st pointer always valid, 2nd possibly NULL seemed > to be a common paradigm, though I can't really point to any code, now that I > think of it. > > > > The function should work for any of the inodes to be NULL, if exported > > as a helper. > Here's a version that also handles both inodes being the same. We're quickly > approaching an elevated level of scariness in inode_double_lock :) Also, > things have been un-inlined. Hmm, it doesn't seem any different :-). Bad diff attached? > +void inode_double_lock(struct inode *inode1, struct inode *inode2) > +{ > + if (!inode2) { > + mutex_lock(&inode1->i_mutex); > + return; > + } > + > + if (inode1 < inode2) { > + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT); > + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD); > + } else { > + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT); > + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD); > + } > +} -- Jens Axboe ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: i_mutex locking in generic_file_splice_write() 2006-10-15 20:08 ` Jens Axboe @ 2006-10-15 20:14 ` Mark Fasheh 2006-10-16 17:58 ` Andreas Dilger 0 siblings, 1 reply; 13+ messages in thread From: Mark Fasheh @ 2006-10-15 20:14 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, linux-fsdevel, Michael Halcrow On Sun, Oct 15, 2006 at 10:08:28PM +0200, Jens Axboe wrote: > Hmm, it doesn't seem any different :-). Bad diff attached? Ahh, silly me - I applied the wrong version of the patch :) Here's the correct one. --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh@oracle.com From: Mark Fasheh <mark.fasheh@oracle.com> [PATCH] Take i_mutex in splice_from_pipe() The splice_actor may be calling ->prepare_write() and ->commit_write(). We want i_mutex on the inode being written to before calling those so that we don't race i_size changes. The double locking behavior is done elsewhere in splice.c, and if we eventually want _nolock variants of generic_file_splice_write(), fs modules might have to replicate the nasty locking code. We introduce inode_double_lock() and inode_double_unlock() to consolidate the locking rules into one set of functions. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> --- fs/inode.c | 38 ++++++++++++++++++++++++++++++++++++++ fs/splice.c | 24 +++++++++++------------- include/linux/fs.h | 3 +++ 3 files changed, 52 insertions(+), 13 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index d9a21d1..f94b02b 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1306,6 +1306,44 @@ void wake_up_inode(struct inode *inode) wake_up_bit(&inode->i_state, __I_LOCK); } +/* + * We rarely want to lock two inodes that do not have a parent/child + * relationship (such as directory, child inode) simultaneously. The + * vast majority of file systems should be able to get along fine + * without this. Do not use these functions except as a last resort. + */ +void inode_double_lock(struct inode *inode1, struct inode *inode2) +{ + if (!inode1 || !inode2) { + if (inode1) + mutex_lock(&inode1->i_mutex); + else if (inode2) + mutex_lock(&inode2->i_mutex); + return; + } + + if (inode1 == inode2) { + mutex_lock(&inode1->i_mutex); + } else if (inode1 < inode2) { + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD); + } else { + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD); + } +} +EXPORT_SYMBOL(inode_double_lock); + +void inode_double_unlock(struct inode *inode1, struct inode *inode2) +{ + if (inode1) + mutex_unlock(&inode1->i_mutex); + + if (inode2 && inode2 != inode1) + mutex_unlock(&inode2->i_mutex); +} +EXPORT_SYMBOL(inode_double_unlock); + static __initdata unsigned long ihash_entries; static int __init set_ihash_entries(char *str) { diff --git a/fs/splice.c b/fs/splice.c index a567010..c1072b6 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -713,6 +713,7 @@ ssize_t splice_from_pipe(struct pipe_ino { int ret, do_wakeup, err; struct splice_desc sd; + struct inode *inode = out->f_mapping->host; ret = 0; do_wakeup = 0; @@ -722,8 +723,13 @@ ssize_t splice_from_pipe(struct pipe_ino sd.file = out; sd.pos = *ppos; - if (pipe->inode) - mutex_lock(&pipe->inode->i_mutex); + /* + * The actor worker might be calling ->prepare_write and + * ->commit_write. Most of the time, these expect i_mutex to + * be held. Since this may result in an ABBA deadlock with + * pipe->inode, we have to order lock acquiry here. + */ + inode_double_lock(inode, pipe->inode); for (;;) { if (pipe->nrbufs) { @@ -797,8 +803,7 @@ ssize_t splice_from_pipe(struct pipe_ino pipe_wait(pipe); } - if (pipe->inode) - mutex_unlock(&pipe->inode->i_mutex); + inode_double_unlock(inode, pipe->inode); if (do_wakeup) { smp_mb(); @@ -1400,13 +1405,7 @@ static int link_pipe(struct pipe_inode_i * grabbing by inode address. Otherwise two different processes * could deadlock (one doing tee from A -> B, the other from B -> A). */ - if (ipipe->inode < opipe->inode) { - mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_PARENT); - mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_CHILD); - } else { - mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_PARENT); - mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_CHILD); - } + inode_double_lock(ipipe->inode, opipe->inode); do { if (!opipe->readers) { @@ -1450,8 +1449,7 @@ static int link_pipe(struct pipe_inode_i i++; } while (len); - mutex_unlock(&ipipe->inode->i_mutex); - mutex_unlock(&opipe->inode->i_mutex); + inode_double_unlock(ipipe->inode, opipe->inode); /* * If we put data in the output pipe, wakeup any potential readers. diff --git a/include/linux/fs.h b/include/linux/fs.h index 34406ed..ddbb833 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -623,6 +623,9 @@ enum inode_i_mutex_lock_class I_MUTEX_QUOTA }; +extern void inode_double_lock(struct inode *inode1, struct inode *inode2); +extern void inode_double_unlock(struct inode *inode1, struct inode *inode2); + /* * NOTE: in a 32bit arch with a preemptable kernel and * an UP compile the i_size_read/write must be atomic -- 1.4.2.3 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: i_mutex locking in generic_file_splice_write() 2006-10-15 20:14 ` Mark Fasheh @ 2006-10-16 17:58 ` Andreas Dilger 2006-10-16 22:24 ` Mark Fasheh 0 siblings, 1 reply; 13+ messages in thread From: Andreas Dilger @ 2006-10-16 17:58 UTC (permalink / raw) To: Mark Fasheh; +Cc: Jens Axboe, Andrew Morton, linux-fsdevel, Michael Halcrow On Oct 15, 2006 13:14 -0700, Mark Fasheh wrote: > +void inode_double_lock(struct inode *inode1, struct inode *inode2) > +{ > + if (!inode1 || !inode2) { > + if (inode1) > + mutex_lock(&inode1->i_mutex); > + else if (inode2) > + mutex_lock(&inode2->i_mutex); > + return; > + } > + > + if (inode1 == inode2) { > + mutex_lock(&inode1->i_mutex); Could be a tiny bit cleaner if you put the "inode1 == inode2" case above: void inode_double_lock(struct inode *inode1, struct inode *inode2) { if (inode1 == NULL || inode2 == NULL || inode1 == inode2) { if (inode1) mutex_lock(&inode1->i_mutex); else if (inode2) mutex_lock(&inode2->i_mutex); return; } if (inode1 < inode2) { mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT); mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD); } else { mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT); mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD); } } Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: i_mutex locking in generic_file_splice_write() 2006-10-16 17:58 ` Andreas Dilger @ 2006-10-16 22:24 ` Mark Fasheh 0 siblings, 0 replies; 13+ messages in thread From: Mark Fasheh @ 2006-10-16 22:24 UTC (permalink / raw) To: Jens Axboe, Andrew Morton, linux-fsdevel, Michael Halcrow On Mon, Oct 16, 2006 at 11:58:08AM -0600, Andreas Dilger wrote: > Could be a tiny bit cleaner if you put the "inode1 == inode2" case above: Good catch - thanks for pointing it out. Here's a version with your cleanup. This version of the patch was actually tested too. --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh@oracle.com From: Mark Fasheh <mark.fasheh@oracle.com> [PATCH] Take i_mutex in splice_from_pipe() The splice_actor may be calling ->prepare_write() and ->commit_write(). We want i_mutex on the inode being written to before calling those so that we don't race i_size changes. The double locking behavior is done elsewhere in splice.c, and if we eventually want _nolock variants of generic_file_splice_write(), fs modules might have to replicate the nasty locking code. We introduce inode_double_lock() and inode_double_unlock() to consolidate the locking rules into one set of functions. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> --- fs/inode.c | 36 ++++++++++++++++++++++++++++++++++++ fs/splice.c | 24 +++++++++++------------- include/linux/fs.h | 3 +++ 3 files changed, 50 insertions(+), 13 deletions(-) 049fb2a0358c842a6c5256a4c0d488441bfd8e11 diff --git a/fs/inode.c b/fs/inode.c index d9a21d1..26cdb11 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1306,6 +1306,42 @@ void wake_up_inode(struct inode *inode) wake_up_bit(&inode->i_state, __I_LOCK); } +/* + * We rarely want to lock two inodes that do not have a parent/child + * relationship (such as directory, child inode) simultaneously. The + * vast majority of file systems should be able to get along fine + * without this. Do not use these functions except as a last resort. + */ +void inode_double_lock(struct inode *inode1, struct inode *inode2) +{ + if (inode1 == NULL || inode2 == NULL || inode1 == inode2) { + if (inode1) + mutex_lock(&inode1->i_mutex); + else if (inode2) + mutex_lock(&inode2->i_mutex); + return; + } + + if (inode1 < inode2) { + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD); + } else { + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD); + } +} +EXPORT_SYMBOL(inode_double_lock); + +void inode_double_unlock(struct inode *inode1, struct inode *inode2) +{ + if (inode1) + mutex_unlock(&inode1->i_mutex); + + if (inode2 && inode2 != inode1) + mutex_unlock(&inode2->i_mutex); +} +EXPORT_SYMBOL(inode_double_unlock); + static __initdata unsigned long ihash_entries; static int __init set_ihash_entries(char *str) { diff --git a/fs/splice.c b/fs/splice.c index a567010..c1072b6 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -713,6 +713,7 @@ ssize_t splice_from_pipe(struct pipe_ino { int ret, do_wakeup, err; struct splice_desc sd; + struct inode *inode = out->f_mapping->host; ret = 0; do_wakeup = 0; @@ -722,8 +723,13 @@ ssize_t splice_from_pipe(struct pipe_ino sd.file = out; sd.pos = *ppos; - if (pipe->inode) - mutex_lock(&pipe->inode->i_mutex); + /* + * The actor worker might be calling ->prepare_write and + * ->commit_write. Most of the time, these expect i_mutex to + * be held. Since this may result in an ABBA deadlock with + * pipe->inode, we have to order lock acquiry here. + */ + inode_double_lock(inode, pipe->inode); for (;;) { if (pipe->nrbufs) { @@ -797,8 +803,7 @@ ssize_t splice_from_pipe(struct pipe_ino pipe_wait(pipe); } - if (pipe->inode) - mutex_unlock(&pipe->inode->i_mutex); + inode_double_unlock(inode, pipe->inode); if (do_wakeup) { smp_mb(); @@ -1400,13 +1405,7 @@ static int link_pipe(struct pipe_inode_i * grabbing by inode address. Otherwise two different processes * could deadlock (one doing tee from A -> B, the other from B -> A). */ - if (ipipe->inode < opipe->inode) { - mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_PARENT); - mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_CHILD); - } else { - mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_PARENT); - mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_CHILD); - } + inode_double_lock(ipipe->inode, opipe->inode); do { if (!opipe->readers) { @@ -1450,8 +1449,7 @@ static int link_pipe(struct pipe_inode_i i++; } while (len); - mutex_unlock(&ipipe->inode->i_mutex); - mutex_unlock(&opipe->inode->i_mutex); + inode_double_unlock(ipipe->inode, opipe->inode); /* * If we put data in the output pipe, wakeup any potential readers. diff --git a/include/linux/fs.h b/include/linux/fs.h index 34406ed..ddbb833 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -623,6 +623,9 @@ enum inode_i_mutex_lock_class I_MUTEX_QUOTA }; +extern void inode_double_lock(struct inode *inode1, struct inode *inode2); +extern void inode_double_unlock(struct inode *inode1, struct inode *inode2); + /* * NOTE: in a 32bit arch with a preemptable kernel and * an UP compile the i_size_read/write must be atomic -- 1.3.3 ^ permalink raw reply related [flat|nested] 13+ messages in thread
end of thread, other threads:[~2006-10-16 22:24 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-10-12 19:01 i_mutex locking in generic_file_splice_write() Mark Fasheh 2006-10-12 19:54 ` Andrew Morton 2006-10-13 0:17 ` Mark Fasheh 2006-10-13 7:45 ` Jens Axboe 2006-10-13 8:11 ` Andrew Morton 2006-10-13 8:18 ` Jens Axboe 2006-10-13 19:44 ` Mark Fasheh 2006-10-15 18:05 ` Jens Axboe 2006-10-15 19:56 ` Mark Fasheh 2006-10-15 20:08 ` Jens Axboe 2006-10-15 20:14 ` Mark Fasheh 2006-10-16 17:58 ` Andreas Dilger 2006-10-16 22:24 ` Mark Fasheh
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).