* [PATCH 0/3] xfs: lockdep and stack reduction fixes
@ 2014-02-19 4:16 Dave Chinner
From: Dave Chinner @ 2014-02-19 4:16 UTC
To: xfs
Hi folks,
The following patches fix the lockdep regression caused by adding
the inode ilock to the readdir code, and reduce the stack usage of
XFS in a couple of critical paths. The log force change is the
critical one, as a log force can happen in leaf functions and hence
greatly increases the maximum stack depth of any path that needs to
unpin an object or lock a buffer.
Cheers,
Dave.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* [PATCH 1/3] xfs: always do log forces via the workqueue
From: Dave Chinner @ 2014-02-19 4:16 UTC
To: xfs
From: Dave Chinner <dchinner@redhat.com>

Log forces can occur deep in the call chain when we have relatively
little stack free. Log forces can also happen close to the leaves of
the call chain (e.g. xfs_buf_lock()) and hence we can trigger IO from
places where we really don't want to add more stack overhead.

This stack overhead occurs because log forces do foreground CIL
pushes (xlog_cil_push_foreground()) rather than waking the background
push workqueue and waiting for the push to complete. This foreground
push was done to avoid confusing the CFQ IO scheduler when fsync()s
were issued, as it has trouble dealing with dependent IOs being
issued from different process contexts.

Avoiding blowing the stack is much more critical than performance
optimisations for CFQ, especially as we've been recommending against
the use of CFQ for XFS since 3.2 kernels were released because of its
problems with multi-threaded IO workloads.

Hence convert xlog_cil_push_foreground() to move the push work to the
CIL workqueue. We already do the waiting for the push to complete in
xlog_cil_force_lsn(), so there's nothing else we need to modify to
make this work.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 39 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b57a8e0..7e54553 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -499,13 +499,6 @@ xlog_cil_push(
 	cil->xc_ctx = new_ctx;
 
 	/*
-	 * mirror the new sequence into the cil structure so that we can do
-	 * unlocked checks against the current sequence in log forces without
-	 * risking deferencing a freed context pointer.
-	 */
-	cil->xc_current_sequence = new_ctx->sequence;
-
-	/*
 	 * The switch is now done, so we can drop the context lock and move out
 	 * of a shared context. We can't just go straight to the commit record,
 	 * though - we need to synchronise with previous and future commits so
@@ -523,8 +516,15 @@ xlog_cil_push(
 	 * Hence we need to add this context to the committing context list so
 	 * that higher sequences will wait for us to write out a commit record
 	 * before they do.
+	 *
+	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
+	 * structure atomically with the addition of this sequence to the
+	 * committing list. This also ensures that we can do unlocked checks
+	 * against the current sequence in log forces without risking
+	 * dereferencing a freed context pointer.
 	 */
 	spin_lock(&cil->xc_push_lock);
+	cil->xc_current_sequence = new_ctx->sequence;
 	list_add(&ctx->committing, &cil->xc_committing);
 	spin_unlock(&cil->xc_push_lock);
 	up_write(&cil->xc_ctx_lock);
@@ -662,8 +662,14 @@ xlog_cil_push_background(
 
 }
 
+/*
+ * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
+ * number that is passed. When it returns, the work will be queued for
+ * @push_seq, but it won't be completed. The caller is expected to do any
+ * waiting for push_seq to complete if it is required.
+ */
 static void
-xlog_cil_push_foreground(
+xlog_cil_push_now(
 	struct xlog	*log,
 	xfs_lsn_t	push_seq)
 {
@@ -688,10 +694,8 @@ xlog_cil_push_foreground(
 	}
 
 	cil->xc_push_seq = push_seq;
+	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
 	spin_unlock(&cil->xc_push_lock);
-
-	/* do the push now */
-	xlog_cil_push(log);
 }
 
 bool
@@ -795,7 +799,8 @@ xlog_cil_force_lsn(
 	 * xlog_cil_push() handles racing pushes for the same sequence,
 	 * so no need to deal with it here.
 	 */
-	xlog_cil_push_foreground(log, sequence);
+restart:
+	xlog_cil_push_now(log, sequence);
 
 	/*
 	 * See if we can find a previous sequence still committing.
@@ -803,7 +808,6 @@ xlog_cil_force_lsn(
 	 * before allowing the force of push_seq to go ahead. Hence block
 	 * on commits for those as well.
 	 */
-restart:
 	spin_lock(&cil->xc_push_lock);
 	list_for_each_entry(ctx, &cil->xc_committing, committing) {
 		if (ctx->sequence > sequence)
@@ -821,6 +825,28 @@ restart:
 		/* found it! */
 		commit_lsn = ctx->commit_lsn;
 	}
+
+	/*
+	 * The call to xlog_cil_push_now() executes the push in the background.
+	 * Hence by the time we have got here our sequence may not have been
+	 * pushed yet. This is true if the current sequence still matches the
+	 * push sequence after the above wait loop and the CIL still contains
+	 * dirty objects.
+	 *
+	 * When the push occurs, it will empty the CIL and atomically increment
+	 * the current sequence past the push sequence and move it into the
+	 * committing list. Of course, if the CIL is clean at the time of the
+	 * push, it won't have pushed the CIL at all, so in that case we should
+	 * try the push for this sequence again from the start just in case.
+	 */
+	if (sequence == cil->xc_current_sequence &&
+	    !list_empty(&cil->xc_cil)) {
+		spin_unlock(&cil->xc_push_lock);
+		goto restart;
+	}
+
 	spin_unlock(&cil->xc_push_lock);
 	return commit_lsn;
 }
-- 
1.8.4.rc3
* Re: [PATCH 1/3] xfs: always do log forces via the workqueue
From: Brian Foster @ 2014-02-19 18:24 UTC
To: Dave Chinner, xfs

On 02/18/2014 11:16 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Log forces can occur deep in the call chain when we have relatively
> little stack free.
.....
> +	if (sequence == cil->xc_current_sequence &&
> +	    !list_empty(&cil->xc_cil)) {
> +		spin_unlock(&cil->xc_push_lock);
> +		goto restart;
> +	}
> +

IIUC, the objective here is to make sure we don't leave this code path
before the push even starts and the ctx makes it onto the committing
list, due to xlog_cil_push_now() moving things to a workqueue.

Given that, what's the purpose of re-executing the background push as
opposed to restarting the wait sequence (as done previously)? It looks
like push_now() won't queue the work again due to cil->xc_push_seq, but
it will flush the queue and I suppose make it more likely the push
starts. Is that the intent?

Brian
* Re: [PATCH 1/3] xfs: always do log forces via the workqueue
From: Dave Chinner @ 2014-02-20 0:23 UTC
To: Brian Foster; +Cc: xfs

On Wed, Feb 19, 2014 at 01:24:54PM -0500, Brian Foster wrote:
> On 02/18/2014 11:16 PM, Dave Chinner wrote:
.....
> IIUC, the objective here is to make sure we don't leave this code path
> before the push even starts and the ctx makes it onto the committing
> list, due to xlog_cil_push_now() moving things to a workqueue.

Right.

> Given that, what's the purpose of re-executing the background push as
> opposed to restarting the wait sequence (as done previously)? It looks
> like push_now() won't queue the work again due to cil->xc_push_seq, but
> it will flush the queue and I suppose make it more likely the push
> starts. Is that the intent?

Effectively. But the other thing it is protecting against is that the
foreground push is done without holding the cil->xc_ctx_lock, and so
we can get the situation where we try a foreground push of the current
sequence, see that the CIL is empty and return without pushing, wait
for previous sequences to commit, then find that the CIL has items in
the sequence we are supposed to be committing.

In this case, we don't know if this occurred because the workqueue has
not started working on our push, or whether we raced on an empty CIL,
and hence we need to make sure that everything in the sequence we are
supposed to commit is pushed to the log.

Hence if the current sequence is dirty after we've ensured that all
prior sequences are fully checkpointed, we need to go back and push
the CIL again to ensure that when we return to the caller the CIL is
checkpointed up to the point in time of the log force occurring.

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 1/3] xfs: always do log forces via the workqueue
From: Mark Tinguely @ 2014-02-20 14:51 UTC
To: Dave Chinner; +Cc: Brian Foster, xfs

On 02/19/14 18:23, Dave Chinner wrote:
> On Wed, Feb 19, 2014 at 01:24:54PM -0500, Brian Foster wrote:
>> On 02/18/2014 11:16 PM, Dave Chinner wrote:
.....
>>> + if (sequence == cil->xc_current_sequence&&
>>> + !list_empty(&cil->xc_cil)) {
>>> + spin_unlock(&cil->xc_push_lock);
>>> + goto restart;
>>> + }
>>> +
>>
>> Given that, what's the purpose of re-executing the background push as
>> opposed to restarting the wait sequence (as done previously)?
.....
> Hence if the current sequence is dirty after we've ensured that all
> prior sequences are fully checkpointed, we need to go back and push
> the CIL again to ensure that when we return to the caller the CIL is
> checkpointed up to the point in time of the log force occurring.

The desired push sequence was taken from an item on the CIL (either
when added or from a pinned item). How could the CIL now be empty
other than someone else having pushed to at least the desired
sequence?

A flush_work() should be enough in the case where the ctx of the
desired sequence is not on the xc_committing list. The flush_work
will wait for the worker to start and place the ctx of the desired
sequence into the xc_committing list, preventing a tight loop waiting
for the cil push worker to start.

Starting the cil push worker for every wakeup for a smaller sequence
in the list_for_each_entry loop seems wasteful.

We know the later error paths in xlog_cil_push() will not do a wake;
now is a good time to fix that.

--Mark.
* Re: [PATCH 1/3] xfs: always do log forces via the workqueue
From: Dave Chinner @ 2014-02-20 22:07 UTC
To: Mark Tinguely; +Cc: Brian Foster, xfs

On Thu, Feb 20, 2014 at 08:51:55AM -0600, Mark Tinguely wrote:
> On 02/19/14 18:23, Dave Chinner wrote:
.....
> >>> + if (sequence == cil->xc_current_sequence&&
                                                ^^^^^

FYI, your mailer is still mangling whitespace when quoting code....

> >>> + !list_empty(&cil->xc_cil)) {
> >>> + spin_unlock(&cil->xc_push_lock);
> >>> + goto restart;
> >>> + }
> >>> +
.....
> The desired push sequence was taken from an item on the CIL (either
> when added or from a pinned item). How could the CIL now be empty
> other than someone else having pushed to at least the desired
> sequence?

The push sequence is only taken from an object on the CIL through
xfs_log_force_lsn(). For xfs_log_force(), the sequence is taken
directly from the current CIL context:

static inline void
xlog_cil_force(struct xlog *log)
{
	xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence);
}

And that's how you get an empty CIL when entering xlog_cil_force_lsn(),
and hence how you can get the race condition that the code is
protecting against.

> A flush_work() should be enough in the case where the ctx of the
> desired sequence is not on the xc_committing list. The flush_work
> will wait for the worker to start and place the ctx of the desired
> sequence into the xc_committing list, preventing a tight loop waiting
> for the cil push worker to start.

Yes, that's exactly what the code does.

> Starting the cil push worker for every wakeup for a smaller sequence
> in the list_for_each_entry loop seems wasteful.

As Brian pointed out, it won't restart on every wakeup - the
cil->xc_push_seq checks prevent that from happening, so a specific
sequence will only ever be queued for a push once.

> We know the later error paths in xlog_cil_push() will not do a wake;
> now is a good time to fix that.

I'm not sure what you are talking about here. If there's a problem,
please send patches.

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 1/3] xfs: always do log forces via the workqueue 2014-02-20 22:07 ` Dave Chinner @ 2014-02-20 22:35 ` Mark Tinguely 2014-02-21 0:02 ` Dave Chinner 0 siblings, 1 reply; 19+ messages in thread From: Mark Tinguely @ 2014-02-20 22:35 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, xfs On 02/20/14 16:07, Dave Chinner wrote: > On Thu, Feb 20, 2014 at 08:51:55AM -0600, Mark Tinguely wrote: >> On 02/19/14 18:23, Dave Chinner wrote: >>> On Wed, Feb 19, 2014 at 01:24:54PM -0500, Brian Foster wrote: >>>> On 02/18/2014 11:16 PM, Dave Chinner wrote: >>>>> From: Dave Chinner<dchinner@redhat.com> >>>>> >>>>> Log forces can occur deep in the call chain when we have relatively >>>>> little stack free. Log forces can also happen at close to the call >>>>> chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from >>>>> places where we really don't want to add more stack overhead. >>>>> >>>>> This stack overhead occurs because log forces do foreground CIL >>>>> pushes (xlog_cil_push_foreground()) rather than waking the >>>>> background push wq and waiting for the for the push to complete. >>>>> This foreground push was done to avoid confusing the CFQ Io >>>>> scheduler when fsync()s were issued, as it has trouble dealing with >>>>> dependent IOs being issued from different process contexts. >>>>> >>>>> Avoiding blowing the stack is much more critical than performance >>>>> optimisations for CFQ, especially as we've been recommending against >>>>> the use of CFQ for XFS since 3.2 kernels were release because of >>>>> it's problems with multi-threaded IO workloads. >>>>> >>>>> Hence convert xlog_cil_push_foreground() to move the push work >>>>> to the CIL workqueue. We already do the waiting for the push to >>>>> complete in xlog_cil_force_lsn(), so there's nothing else we need to >>>>> modify to make this work. >>>>> >>>>> Signed-off-by: Dave Chinner<dchinner@redhat.com> > ..... 
>>>>> @@ -803,7 +808,6 @@ xlog_cil_force_lsn( >>>>> * before allowing the force of push_seq to go ahead. Hence block >>>>> * on commits for those as well. >>>>> */ >>>>> -restart: >>>>> spin_lock(&cil->xc_push_lock); >>>>> list_for_each_entry(ctx,&cil->xc_committing, committing) { >>>>> if (ctx->sequence> sequence) >>>>> @@ -821,6 +825,28 @@ restart: >>>>> /* found it! */ >>>>> commit_lsn = ctx->commit_lsn; >>>>> } >>>>> + >>>>> + /* >>>>> + * The call to xlog_cil_push_now() executes the push in the background. >>>>> + * Hence by the time we have got here it our sequence may not have been >>>>> + * pushed yet. This is true if the current sequence still matches the >>>>> + * push sequence after the above wait loop and the CIL still contains >>>>> + * dirty objects. >>>>> + * >>>>> + * When the push occurs, it will empty the CIL and >>>>> + * atomically increment the currect sequence past the push sequence and >>>>> + * move it into the committing list. Of course, if the CIL is clean at >>>>> + * the time of the push, it won't have pushed the CIL at all, so in that >>>>> + * case we should try the push for this sequence again from the start >>>>> + * just in case. >>>>> + */ >>>>> + >>>>> + if (sequence == cil->xc_current_sequence&& > ^^^^^ > FYI, your mailer is still mangling whitespace when quoting code.... > >>>>> + !list_empty(&cil->xc_cil)) { >>>>> + spin_unlock(&cil->xc_push_lock); >>>>> + goto restart; >>>>> + } >>>>> + >>>> >>>> IIUC, the objective here is to make sure we don't leave this code path >>>> before the push even starts and the ctx makes it onto the committing >>>> list, due to xlog_cil_push_now() moving things to a workqueue. >>> >>> Right. >>> >>>> Given that, what's the purpose of re-executing the background push as >>>> opposed to restarting the wait sequence (as done previously)? 
It looks >>>> like push_now() won't queue the work again due to cil->xc_push_seq, but >>>> it will flush the queue and I suppose make it more likely the push >>>> starts. Is that the intent? >>> >>> Effectively. But the other thing that it is protecting against is >>> that foreground push is done without holding the cil->xc_ctx_lock, >>> and so we can get the situation where we try a foreground push >>> of the current sequence, see that the CIL is empty and return >>> without pushing, wait for previous sequences to commit, then find >>> that the CIL has items on the CIL in the sequence we are supposed to >>> be committing. >>> >>> In this case, we don't know if this occurred because the workqueue >>> has not started working on our push, or whether we raced on an empty >>> CIL, and hence we need to make sure that everything in the sequence >>> we are support to commit is pushed to the log. >>> >>> Hence if the current sequence is dirty after we've ensure that all >>> prior sequences are fully checkpointed, need to go back and >>> push the CIL again to ensure that when we return to the caller the >>> CIL is checkpointed up to the point in time of the log force >>> occurring. >> >> The desired push sequence was taken from an item on the CIL (either >> when added or from a pinned item). How could the CIL now be empty >> other than someone else pushed to at least the desire sequence? > > The push sequence is only taken from an object on the CIL through > xfs_log_force_lsn(). For xfs_log_force(), the sequence is taken > directly from the current CIL context: > > static inline void > xlog_cil_force(struct xlog *log) > { > xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence); > } > > And that's how you get an empty CIL when entering > xlog_cil_force_lsn(), and hence how you can get the race condition > that the code is protecting against. > >> A flush_work() should be enough in the case where the ctx of the >> desire sequence is not on the xc_committing list. 
The flush_work >> will wait for the worker to start and place the ctx of the desired >> sequence into the xc_committing list. This preventing a tight loop >> waiting for the cil push worker to start. > > Yes, that's exactly what the code does. > >> Starting the cil push worker for every wakeup of smaller sequence in >> the list_for_each_entry loop seems wasteful. > > As Brian pointed out, it won't restart on every wakeup - the > cil->xc_push_seq checks prevent that from happening, so a specific > sequence will only ever be queued for a push once. > >> We know the later error paths in xfs_cil_push() will not do a wake, >> now is a good time to fix that. > > I'm not sure what you are talking about here. If there's a problem, > please send patches. > > Cheers, > > Dave. http://oss.sgi.com/archives/xfs/2013-12/msg00870.html --Mark. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] xfs: always do log forces via the workqueue 2014-02-20 22:35 ` Mark Tinguely @ 2014-02-21 0:02 ` Dave Chinner 0 siblings, 0 replies; 19+ messages in thread From: Dave Chinner @ 2014-02-21 0:02 UTC (permalink / raw) To: Mark Tinguely; +Cc: Brian Foster, xfs On Thu, Feb 20, 2014 at 04:35:41PM -0600, Mark Tinguely wrote: > On 02/20/14 16:07, Dave Chinner wrote: > >On Thu, Feb 20, 2014 at 08:51:55AM -0600, Mark Tinguely wrote: > >>We know the later error paths in xfs_cil_push() will not do a wake, > >>now is a good time to fix that. > > > >I'm not sure what you are talking about here. If there's a problem, > >please send patches. > > http://oss.sgi.com/archives/xfs/2013-12/msg00870.html Which resulted in a discussion and discovery of more problems that also needed fixing, so I'm waiting for patches to be posted that fix them. You know what the problems are, you know how they should be fixed (it's in the discussion thread), and you are capable of writing the fixes and testing them. So, please send patches. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] xfs: always do log forces via the workqueue 2014-02-20 0:23 ` Dave Chinner 2014-02-20 14:51 ` Mark Tinguely @ 2014-02-21 15:04 ` Brian Foster 2014-02-21 22:21 ` Dave Chinner 1 sibling, 1 reply; 19+ messages in thread From: Brian Foster @ 2014-02-21 15:04 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On 02/19/2014 07:23 PM, Dave Chinner wrote: > On Wed, Feb 19, 2014 at 01:24:54PM -0500, Brian Foster wrote: >> On 02/18/2014 11:16 PM, Dave Chinner wrote: >>> From: Dave Chinner <dchinner@redhat.com> >>> ... >>> + >>> + /* >>> + * The call to xlog_cil_push_now() executes the push in the background. >>> + * Hence by the time we have got here it our sequence may not have been >>> + * pushed yet. This is true if the current sequence still matches the >>> + * push sequence after the above wait loop and the CIL still contains >>> + * dirty objects. >>> + * >>> + * When the push occurs, it will empty the CIL and >>> + * atomically increment the currect sequence past the push sequence and >>> + * move it into the committing list. Of course, if the CIL is clean at >>> + * the time of the push, it won't have pushed the CIL at all, so in that >>> + * case we should try the push for this sequence again from the start >>> + * just in case. >>> + */ >>> + >>> + if (sequence == cil->xc_current_sequence && >>> + !list_empty(&cil->xc_cil)) { >>> + spin_unlock(&cil->xc_push_lock); >>> + goto restart; >>> + } >>> + >> >> IIUC, the objective here is to make sure we don't leave this code path >> before the push even starts and the ctx makes it onto the committing >> list, due to xlog_cil_push_now() moving things to a workqueue. > > Right. > >> Given that, what's the purpose of re-executing the background push as >> opposed to restarting the wait sequence (as done previously)? It looks >> like push_now() won't queue the work again due to cil->xc_push_seq, but >> it will flush the queue and I suppose make it more likely the push >> starts. Is that the intent? 
> > Effectively. But the other thing that it is protecting against is > that foreground push is done without holding the cil->xc_ctx_lock, > and so we can get the situation where we try a foreground push > of the current sequence, see that the CIL is empty and return > without pushing, wait for previous sequences to commit, then find > that the CIL has items on the CIL in the sequence we are supposed to > be committing. > > In this case, we don't know if this occurred because the workqueue > has not started working on our push, or whether we raced on an empty > CIL, and hence we need to make sure that everything in the sequence > we are support to commit is pushed to the log. > > Hence if the current sequence is dirty after we've ensure that all > prior sequences are fully checkpointed, need to go back and > push the CIL again to ensure that when we return to the caller the > CIL is checkpointed up to the point in time of the log force > occurring. > Ok, I see. The foreground variant checks the push sequence and xc_cil list state without xc_ctx_lock. The background variant (down in xlog_cil_push()) will repeat the list check while under xc_ctx_lock. From what I can see, this potential behavior of a foreground push seeing an empty cil is possible even before this patch. xlog_cil_force_lsn() will just wait on all of the previous sequences and return. Another push can move things along for that sequence, as xc_push_seq has not been updated. The workqueue pattern opens up potential for the "wait before handler runs" race, and the sequence and list check is necessary to detect that. Doing the push_now() again seems technically unnecessary, but the flush executes and thus should guarantee we iterate this sequence twice at most. Seems reasonable. General follow up question - what makes not taking xc_ctx_lock anywhere in here safe in the first place? 
In the current implementation, if the push has already been queued (note that we flush before we take the spinlock and check the push sequence) and we get into the ctx wait sequence, isn't it possible to see xc_committing before the ctx we're pushing is even added? With this patch, what prevents us from seeing the updated xc_current_sequence and thus skipping the restart (xc_current_sequence isn't updated under the spinlock) before the pushed ctx has been added to xc_committing? Brian > Cheers, > > Dave. > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] xfs: always do log forces via the workqueue 2014-02-21 15:04 ` Brian Foster @ 2014-02-21 22:21 ` Dave Chinner 2014-02-24 13:35 ` Brian Foster 0 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2014-02-21 22:21 UTC (permalink / raw) To: Brian Foster; +Cc: xfs On Fri, Feb 21, 2014 at 10:04:37AM -0500, Brian Foster wrote: > On 02/19/2014 07:23 PM, Dave Chinner wrote: > > On Wed, Feb 19, 2014 at 01:24:54PM -0500, Brian Foster wrote: > >> On 02/18/2014 11:16 PM, Dave Chinner wrote: > >>> From: Dave Chinner <dchinner@redhat.com> > >>> > ... > >>> + > >>> + /* > >>> + * The call to xlog_cil_push_now() executes the push in the background. > >>> + * Hence by the time we have got here it our sequence may not have been > >>> + * pushed yet. This is true if the current sequence still matches the > >>> + * push sequence after the above wait loop and the CIL still contains > >>> + * dirty objects. > >>> + * > >>> + * When the push occurs, it will empty the CIL and > >>> + * atomically increment the currect sequence past the push sequence and > >>> + * move it into the committing list. Of course, if the CIL is clean at > >>> + * the time of the push, it won't have pushed the CIL at all, so in that > >>> + * case we should try the push for this sequence again from the start > >>> + * just in case. > >>> + */ > >>> + > >>> + if (sequence == cil->xc_current_sequence && > >>> + !list_empty(&cil->xc_cil)) { > >>> + spin_unlock(&cil->xc_push_lock); > >>> + goto restart; > >>> + } > >>> + > >> > >> IIUC, the objective here is to make sure we don't leave this code path > >> before the push even starts and the ctx makes it onto the committing > >> list, due to xlog_cil_push_now() moving things to a workqueue. > > > > Right. > > > >> Given that, what's the purpose of re-executing the background push as > >> opposed to restarting the wait sequence (as done previously)? 
It looks > >> like push_now() won't queue the work again due to cil->xc_push_seq, but > >> it will flush the queue and I suppose make it more likely the push > >> starts. Is that the intent? > > > > Effectively. But the other thing that it is protecting against is > > that foreground push is done without holding the cil->xc_ctx_lock, > > and so we can get the situation where we try a foreground push > > of the current sequence, see that the CIL is empty and return > > without pushing, wait for previous sequences to commit, then find > > that the CIL has items on the CIL in the sequence we are supposed to > > be committing. > > > > In this case, we don't know if this occurred because the workqueue > > has not started working on our push, or whether we raced on an empty > > CIL, and hence we need to make sure that everything in the sequence > > we are support to commit is pushed to the log. > > > > Hence if the current sequence is dirty after we've ensure that all > > prior sequences are fully checkpointed, need to go back and > > push the CIL again to ensure that when we return to the caller the > > CIL is checkpointed up to the point in time of the log force > > occurring. > > > > Ok, I see. The foreground variant checks the push sequence and xc_cil > list state without xc_ctx_lock. The background variant (down in > xlog_cil_push()) will repeat the list check while under xc_ctx_lock. > > From what I can see, this potential behavior of a foreground push seeing > an empty cil is possible even before this patch. xlog_cil_force_lsn() > will just wait on all of the previous sequences and return. Another push > can move things along for that sequence, as xc_push_seq has not been > updated. > > The workqueue pattern opens up potential for the "wait before handler > runs" race, and the sequence and list check is necessary to detect that. 
> Doing the push_now() again seems technically unnecessary, but the flush > executes and thus should guarantee we iterate this sequence twice at > most. Seems reasonable. > > General follow up question - what makes not taking xc_ctx_lock anywhere > in here safe in the first place? In the current implementation, if the > push has already been queued (note that we flush before we take the > spinlock and check the push sequence) and we get into the ctx wait > sequence, isn't it possible to see xc_committing before the ctx we're > pushing is even added? The waiting is serialised on the push lock, not the context lock. The context lock is used to serialise addition to a CIL context against the pushing of that sequence. Triggering a push of a CIL context does not need to be serialised against addition to the CIL, nor directly against the push of the CIL. A blocking push needs to be serialised against the checkpoint of a CIL context to the iclog, which is a different thing altogether. Hence we don't want to use the xc_ctx_lock for this - it is already a contended lock and we don't want to hold off commits into a new sequence while we wait for a previous sequence to finish pushing. Yes, there are potential races in the existing code. They are fixed by this patch. > With this patch, what prevents us from seeing the updated > xc_current_sequence and thus skipping the restart (xc_current_sequence > isn't updated under the spinlock) before the pushed ctx has been added > to xc_committing? The fact that the patch moves the xc_current_sequence update under the push_lock avoids this. i.e. it is now only updated atomically with adding the context to the committing list. Both are now explicitly updated at the same time, so you can't see a sequence number greater than what you might find on the list... Cheers, Dave. 
-- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] xfs: always do log forces via the workqueue 2014-02-21 22:21 ` Dave Chinner @ 2014-02-24 13:35 ` Brian Foster 0 siblings, 0 replies; 19+ messages in thread From: Brian Foster @ 2014-02-24 13:35 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Sat, Feb 22, 2014 at 09:21:06AM +1100, Dave Chinner wrote: > On Fri, Feb 21, 2014 at 10:04:37AM -0500, Brian Foster wrote: > > On 02/19/2014 07:23 PM, Dave Chinner wrote: > > > On Wed, Feb 19, 2014 at 01:24:54PM -0500, Brian Foster wrote: > > >> On 02/18/2014 11:16 PM, Dave Chinner wrote: > > >>> From: Dave Chinner <dchinner@redhat.com> > > >>> > > ... ... > > > > General follow up question - what makes not taking xc_ctx_lock anywhere > > in here safe in the first place? In the current implementation, if the > > push has already been queued (note that we flush before we take the > > spinlock and check the push sequence) and we get into the ctx wait > > sequence, isn't it possible to see xc_committing before the ctx we're > > pushing is even added? > > The waiting is serialised on the push lock, not the context lock. > > The context lock is used to serialise addition to a CIL context with > the against the pushing of that sequence. Triggering a push of a CIL > context does not need to be serialised addition to the CIL, nor > directly against the push of the CIL. A blocking push needs to be > serialised against the checkpoint of a CIL context to the iclog, > which is a different thing altogether. > > Hence we don't want to use the xc_ctx_lock for this - it is already > a contended lock and we don't want to hold off commits into a new > sequence while we wait for a previous sequence to finish pushing. > > Yes, there are potential races in the exist code. They are fixed by > this patch. > Ok, thanks. 
> > With this patch, what prevents us from seeing the updated > > xc_current_sequence and thus skipping the restart (xc_current_sequence > > isn't updated under the spinlock) before the pushed ctx has been added > > to xc_committing? > > The fact that the patch moves the xc_current_sequence update under > the the push_lock avoids this. i.e. it is now only updated atomically > with adding the context to the committing list. Both are now > explicitly updated at the same time, so you can't see a sequence > number greater than what you might find on the list... > Ah, right. I was reading through your patch and the original code to understand it better and lost the fact that you moved xc_current_sequence under spinlock (e.g., my assumption above about it not updated under lock is incorrect). That clears that up. Thanks for the explanations. Reviewed-by: Brian Foster <bfoster@redhat.com> Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive 2014-02-19 4:16 [PATCH 0/3] xfs: lockdep and stack reduction fixes Dave Chinner 2014-02-19 4:16 ` [PATCH 1/3] xfs: always do log forces via the workqueue Dave Chinner @ 2014-02-19 4:16 ` Dave Chinner 2014-02-19 18:25 ` Brian Foster 2014-02-20 14:51 ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Christoph Hellwig 2014-02-19 4:16 ` [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint Dave Chinner 2 siblings, 2 replies; 19+ messages in thread From: Dave Chinner @ 2014-02-19 4:16 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> The change to add the IO lock to protect the directory extent map during readdir operations has caused lockdep to have a heart attack as it now sees a different locking order on inodes w.r.t. the mmap_sem because readdir has a different ordering to write(). Add a new lockdep class for directory inodes to avoid this false positive. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_iops.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 9ddfb81..bb3bb65 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -48,6 +48,18 @@ #include <linux/fiemap.h> #include <linux/slab.h> +/* + * Directories have different lock order w.r.t. mmap_sem compared to regular + * files. This is due to readdir potentially triggering page faults on a user + * buffer inside filldir(), and this happens with the ilock on the directory + * held. For regular files, the lock order is the other way around - the + * mmap_sem is taken during the page fault, and then we lock the ilock to do + * block mapping. Hence we need a different class for the directory ilock so + * that lockdep can tell them apart. 
+ */ +static struct lock_class_key xfs_nondir_ilock_class; +static struct lock_class_key xfs_dir_ilock_class; + static int xfs_initxattrs( struct inode *inode, @@ -1191,6 +1203,7 @@ xfs_setup_inode( xfs_diflags_to_iflags(inode, ip); ip->d_ops = ip->i_mount->m_nondir_inode_ops; + lockdep_set_class(&ip->i_lock.mr_lock, &xfs_nondir_ilock_class); switch (inode->i_mode & S_IFMT) { case S_IFREG: inode->i_op = &xfs_inode_operations; @@ -1198,6 +1211,7 @@ xfs_setup_inode( inode->i_mapping->a_ops = &xfs_address_space_operations; break; case S_IFDIR: + lockdep_set_class(&ip->i_lock.mr_lock, &xfs_dir_ilock_class); if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb)) inode->i_op = &xfs_dir_ci_inode_operations; else -- 1.8.4.rc3 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive 2014-02-19 4:16 ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Dave Chinner @ 2014-02-19 18:25 ` Brian Foster 2014-02-20 0:13 ` mmap_sem -> isec->lock lockdep issues with shmem (was Re: [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive) Dave Chinner 2014-02-20 14:51 ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Christoph Hellwig 1 sibling, 1 reply; 19+ messages in thread From: Brian Foster @ 2014-02-19 18:25 UTC (permalink / raw) To: Dave Chinner, xfs [-- Attachment #1: Type: text/plain, Size: 2347 bytes --] On 02/18/2014 11:16 PM, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > The change to add the IO lock to protect the directory extent map > during readdir operations has cause lockdep to have a heart attack > as it now sees a different locking order on inodes w.r.t. the > mmap_sem because readdir has a different ordering to write(). > > Add a new lockdep class for directory inodes to avoid this false > positive. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- Hey Dave, I'm not terribly familiar with lockdep, but I hit the attached "possible circular locking dependency detected" warning when running with this patch. (Reproduces by running generic/001 after a reboot). Brian > fs/xfs/xfs_iops.c | 14 ++++++++++++++ > 1 file changed, 14 insertions(+) > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > index 9ddfb81..bb3bb65 100644 > --- a/fs/xfs/xfs_iops.c > +++ b/fs/xfs/xfs_iops.c > @@ -48,6 +48,18 @@ > #include <linux/fiemap.h> > #include <linux/slab.h> > > +/* > + * Directories have different lock order w.r.t. mmap_sem compared to regular > + * files. This is due to readdir potentially triggering page faults on a user > + * buffer inside filldir(), and this happens with the ilock on the directory > + * held. 
For regular files, the lock order is the other way around - the > + * mmap_sem is taken during the page fault, and then we lock the ilock to do > + * block mapping. Hence we need a different class for the directory ilock so > + * that lockdep can tell them apart. > + */ > +static struct lock_class_key xfs_nondir_ilock_class; > +static struct lock_class_key xfs_dir_ilock_class; > + > static int > xfs_initxattrs( > struct inode *inode, > @@ -1191,6 +1203,7 @@ xfs_setup_inode( > xfs_diflags_to_iflags(inode, ip); > > ip->d_ops = ip->i_mount->m_nondir_inode_ops; > + lockdep_set_class(&ip->i_lock.mr_lock, &xfs_nondir_ilock_class); > switch (inode->i_mode & S_IFMT) { > case S_IFREG: > inode->i_op = &xfs_inode_operations; > @@ -1198,6 +1211,7 @@ xfs_setup_inode( > inode->i_mapping->a_ops = &xfs_address_space_operations; > break; > case S_IFDIR: > + lockdep_set_class(&ip->i_lock.mr_lock, &xfs_dir_ilock_class); > if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb)) > inode->i_op = &xfs_dir_ci_inode_operations; > else > [-- Attachment #2: messages.lockdep --] [-- Type: text/plain, Size: 17959 bytes --] Feb 19 12:22:03 localhost kernel: [ 101.486725] Feb 19 12:22:03 localhost kernel: [ 101.486903] ====================================================== Feb 19 12:22:03 localhost kernel: [ 101.487018] [ INFO: possible circular locking dependency detected ] Feb 19 12:22:03 localhost kernel: [ 101.487018] 3.14.0-rc1+ #6 Tainted: GF W O Feb 19 12:22:03 localhost kernel: [ 101.487018] ------------------------------------------------------- Feb 19 12:22:03 localhost kernel: [ 101.487018] rm/4171 is trying to acquire lock: Feb 19 12:22:03 localhost kernel: [ 101.487018] (&mm->mmap_sem){++++++}, at: [<ffffffff811cc8cf>] might_fault+0x5f/0xb0 Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] but task is already holding lock: Feb 19 12:22:03 localhost kernel: [ 101.487018] (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa05a0022>] 
xfs_ilock+0x122/0x250 [xfs] Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] which lock already depends on the new lock. Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] the existing dependency chain (in reverse order) is: Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] -> #2 (&xfs_dir_ilock_class){++++..}: Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff810ed147>] down_read_nested+0x57/0xa0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa05a0022>] xfs_ilock+0x122/0x250 [xfs] Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa05a01af>] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs] Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa0565d50>] xfs_attr_get+0x90/0xe0 [xfs] Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa055b9d7>] xfs_xattr_get+0x37/0x50 [xfs] Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff812483ef>] generic_getxattr+0x4f/0x70 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8133fd5e>] inode_doinit_with_dentry+0x1ae/0x650 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff813402d8>] sb_finish_set_opts+0xd8/0x270 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81340702>] selinux_set_mnt_opts+0x292/0x5f0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81340ac8>] superblock_doinit+0x68/0xd0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81340b8d>] selinux_sb_kern_mount+0x3d/0xa0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81335536>] security_sb_kern_mount+0x16/0x20 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8122333a>] mount_fs+0x8a/0x1b0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8124285b>] vfs_kern_mount+0x6b/0x150 Feb 19 12:22:03 
localhost kernel: [ 101.487018] [<ffffffff8124561e>] do_mount+0x23e/0xb90 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff812462a3>] SyS_mount+0x83/0xc0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] -> #1 (&isec->lock){+.+.+.}: Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81780d77>] mutex_lock_nested+0x77/0x3f0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8133fc42>] inode_doinit_with_dentry+0x92/0x650 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81340dcc>] selinux_d_instantiate+0x1c/0x20 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8133517b>] security_d_instantiate+0x1b/0x30 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81237d70>] d_instantiate+0x50/0x70 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811bcb70>] __shmem_file_setup+0xe0/0x1d0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811bf988>] shmem_zero_setup+0x28/0x70 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811d8653>] mmap_region+0x543/0x5a0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811d89b1>] do_mmap_pgoff+0x301/0x3c0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811c18f0>] vm_mmap_pgoff+0x90/0xc0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811d6f26>] SyS_mmap_pgoff+0x116/0x270 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8101f9b2>] SyS_mmap+0x22/0x30 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] -> #0 (&mm->mmap_sem){++++++}: Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff810f351c>] __lock_acquire+0x18ec/0x1aa0 Feb 19 12:22:03 localhost kernel: 
[ 101.487018] [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811cc8fc>] might_fault+0x8c/0xb0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff812341c1>] filldir+0x91/0x120 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa053f2f7>] xfs_dir2_sf_getdents+0x317/0x380 [xfs] Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa054001b>] xfs_readdir+0x16b/0x230 [xfs] Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa05427fb>] xfs_file_readdir+0x2b/0x40 [xfs] Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81234008>] iterate_dir+0xa8/0xe0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff812344b3>] SyS_getdents+0x93/0x120 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] other info that might help us debug this: Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] Chain exists of: Feb 19 12:22:03 localhost kernel: [ 101.487018] &mm->mmap_sem --> &isec->lock --> &xfs_dir_ilock_class Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] Possible unsafe locking scenario: Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] CPU0 CPU1 Feb 19 12:22:03 localhost kernel: [ 101.487018] ---- ---- Feb 19 12:22:03 localhost kernel: [ 101.487018] lock(&xfs_dir_ilock_class); Feb 19 12:22:03 localhost kernel: [ 101.487018] lock(&isec->lock); Feb 19 12:22:03 localhost kernel: [ 101.487018] lock(&xfs_dir_ilock_class); Feb 19 12:22:03 localhost kernel: [ 101.487018] lock(&mm->mmap_sem); Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] *** DEADLOCK *** Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] 2 locks held by 
rm/4171: Feb 19 12:22:03 localhost kernel: [ 101.487018] #0: (&type->i_mutex_dir_key#4){+.+.+.}, at: [<ffffffff81233fc2>] iterate_dir+0x62/0xe0 Feb 19 12:22:03 localhost kernel: [ 101.487018] #1: (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa05a0022>] xfs_ilock+0x122/0x250 [xfs] Feb 19 12:22:03 localhost kernel: [ 101.487018] Feb 19 12:22:03 localhost kernel: [ 101.487018] stack backtrace: Feb 19 12:22:03 localhost kernel: [ 101.487018] CPU: 1 PID: 4171 Comm: rm Tainted: GF W O 3.14.0-rc1+ #6 Feb 19 12:22:03 localhost kernel: [ 101.487018] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Feb 19 12:22:03 localhost kernel: [ 101.487018] ffffffff82597d80 ffff8800c43cdc60 ffffffff8177ba90 ffffffff825cd9c0 Feb 19 12:22:03 localhost kernel: [ 101.487018] ffff8800c43cdca0 ffffffff81777168 ffff8800c43cdcf0 ffff8800d44ba630 Feb 19 12:22:03 localhost kernel: [ 101.487018] ffff8800d44b9aa0 0000000000000002 0000000000000002 ffff8800d44ba630 Feb 19 12:22:03 localhost kernel: [ 101.487018] Call Trace: Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8177ba90>] dump_stack+0x4d/0x66 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81777168>] print_circular_bug+0x201/0x20f Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff810f351c>] __lock_acquire+0x18ec/0x1aa0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811cc8cf>] ? might_fault+0x5f/0xb0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811cc8fc>] might_fault+0x8c/0xb0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811cc8cf>] ? might_fault+0x5f/0xb0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff812341c1>] filldir+0x91/0x120 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa053f2f7>] xfs_dir2_sf_getdents+0x317/0x380 [xfs] Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa05a0022>] ? 
xfs_ilock+0x122/0x250 [xfs] Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa054001b>] xfs_readdir+0x16b/0x230 [xfs] Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa05427fb>] xfs_file_readdir+0x2b/0x40 [xfs] Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81234008>] iterate_dir+0xa8/0xe0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff812344b3>] SyS_getdents+0x93/0x120 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81234130>] ? fillonedir+0xf0/0xf0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8114a2cc>] ? __audit_syscall_entry+0x9c/0xf0 Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b Feb 19 12:22:03 localhost kernel: Feb 19 12:22:03 localhost kernel: ====================================================== Feb 19 12:22:03 localhost kernel: [ INFO: possible circular locking dependency detected ] Feb 19 12:22:03 localhost kernel: 3.14.0-rc1+ #6 Tainted: GF W O Feb 19 12:22:03 localhost kernel: ------------------------------------------------------- Feb 19 12:22:03 localhost kernel: rm/4171 is trying to acquire lock: Feb 19 12:22:03 localhost kernel: (&mm->mmap_sem){++++++}, at: [<ffffffff811cc8cf>] might_fault+0x5f/0xb0 Feb 19 12:22:03 localhost kernel: but task is already holding lock: Feb 19 12:22:03 localhost kernel: (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa05a0022>] xfs_ilock+0x122/0x250 [xfs] Feb 19 12:22:03 localhost kernel: which lock already depends on the new lock. 
Feb 19 12:22:03 localhost kernel: the existing dependency chain (in reverse order) is: Feb 19 12:22:03 localhost kernel: -> #2 (&xfs_dir_ilock_class){++++..}: Feb 19 12:22:03 localhost kernel: [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0 Feb 19 12:22:03 localhost kernel: [<ffffffff810ed147>] down_read_nested+0x57/0xa0 Feb 19 12:22:03 localhost kernel: [<ffffffffa05a0022>] xfs_ilock+0x122/0x250 [xfs] Feb 19 12:22:03 localhost kernel: [<ffffffffa05a01af>] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs] Feb 19 12:22:03 localhost kernel: [<ffffffffa0565d50>] xfs_attr_get+0x90/0xe0 [xfs] Feb 19 12:22:03 localhost kernel: [<ffffffffa055b9d7>] xfs_xattr_get+0x37/0x50 [xfs] Feb 19 12:22:03 localhost kernel: [<ffffffff812483ef>] generic_getxattr+0x4f/0x70 Feb 19 12:22:03 localhost kernel: [<ffffffff8133fd5e>] inode_doinit_with_dentry+0x1ae/0x650 Feb 19 12:22:03 localhost kernel: [<ffffffff813402d8>] sb_finish_set_opts+0xd8/0x270 Feb 19 12:22:03 localhost kernel: [<ffffffff81340702>] selinux_set_mnt_opts+0x292/0x5f0 Feb 19 12:22:03 localhost kernel: [<ffffffff81340ac8>] superblock_doinit+0x68/0xd0 Feb 19 12:22:03 localhost kernel: [<ffffffff81340b8d>] selinux_sb_kern_mount+0x3d/0xa0 Feb 19 12:22:03 localhost kernel: [<ffffffff81335536>] security_sb_kern_mount+0x16/0x20 Feb 19 12:22:03 localhost kernel: [<ffffffff8122333a>] mount_fs+0x8a/0x1b0 Feb 19 12:22:03 localhost kernel: [<ffffffff8124285b>] vfs_kern_mount+0x6b/0x150 Feb 19 12:22:03 localhost kernel: [<ffffffff8124561e>] do_mount+0x23e/0xb90 Feb 19 12:22:03 localhost kernel: [<ffffffff812462a3>] SyS_mount+0x83/0xc0 Feb 19 12:22:03 localhost kernel: [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b Feb 19 12:22:03 localhost kernel: -> #1 (&isec->lock){+.+.+.}: Feb 19 12:22:03 localhost kernel: [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0 Feb 19 12:22:03 localhost kernel: [<ffffffff81780d77>] mutex_lock_nested+0x77/0x3f0 Feb 19 12:22:03 localhost kernel: [<ffffffff8133fc42>] inode_doinit_with_dentry+0x92/0x650 Feb 19 12:22:03 
localhost kernel: [<ffffffff81340dcc>] selinux_d_instantiate+0x1c/0x20 Feb 19 12:22:03 localhost kernel: [<ffffffff8133517b>] security_d_instantiate+0x1b/0x30 Feb 19 12:22:03 localhost kernel: [<ffffffff81237d70>] d_instantiate+0x50/0x70 Feb 19 12:22:03 localhost kernel: [<ffffffff811bcb70>] __shmem_file_setup+0xe0/0x1d0 Feb 19 12:22:03 localhost kernel: [<ffffffff811bf988>] shmem_zero_setup+0x28/0x70 Feb 19 12:22:03 localhost kernel: [<ffffffff811d8653>] mmap_region+0x543/0x5a0 Feb 19 12:22:03 localhost kernel: [<ffffffff811d89b1>] do_mmap_pgoff+0x301/0x3c0 Feb 19 12:22:03 localhost kernel: [<ffffffff811c18f0>] vm_mmap_pgoff+0x90/0xc0 Feb 19 12:22:03 localhost kernel: [<ffffffff811d6f26>] SyS_mmap_pgoff+0x116/0x270 Feb 19 12:22:03 localhost kernel: [<ffffffff8101f9b2>] SyS_mmap+0x22/0x30 Feb 19 12:22:03 localhost kernel: [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b Feb 19 12:22:03 localhost kernel: -> #0 (&mm->mmap_sem){++++++}: Feb 19 12:22:03 localhost kernel: [<ffffffff810f351c>] __lock_acquire+0x18ec/0x1aa0 Feb 19 12:22:03 localhost kernel: [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0 Feb 19 12:22:03 localhost kernel: [<ffffffff811cc8fc>] might_fault+0x8c/0xb0 Feb 19 12:22:03 localhost kernel: [<ffffffff812341c1>] filldir+0x91/0x120 Feb 19 12:22:03 localhost kernel: [<ffffffffa053f2f7>] xfs_dir2_sf_getdents+0x317/0x380 [xfs] Feb 19 12:22:03 localhost kernel: [<ffffffffa054001b>] xfs_readdir+0x16b/0x230 [xfs] Feb 19 12:22:03 localhost kernel: [<ffffffffa05427fb>] xfs_file_readdir+0x2b/0x40 [xfs] Feb 19 12:22:03 localhost kernel: [<ffffffff81234008>] iterate_dir+0xa8/0xe0 Feb 19 12:22:03 localhost kernel: [<ffffffff812344b3>] SyS_getdents+0x93/0x120 Feb 19 12:22:03 localhost kernel: [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b Feb 19 12:22:03 localhost kernel: other info that might help us debug this: Feb 19 12:22:03 localhost kernel: Chain exists of: &mm->mmap_sem --> &isec->lock --> &xfs_dir_ilock_class Feb 19 12:22:03 localhost kernel: Possible 
unsafe locking scenario: Feb 19 12:22:03 localhost kernel: CPU0 CPU1 Feb 19 12:22:03 localhost kernel: ---- ---- Feb 19 12:22:03 localhost kernel: lock(&xfs_dir_ilock_class); Feb 19 12:22:03 localhost kernel: lock(&isec->lock); Feb 19 12:22:03 localhost kernel: lock(&xfs_dir_ilock_class); Feb 19 12:22:03 localhost kernel: lock(&mm->mmap_sem); Feb 19 12:22:03 localhost kernel: *** DEADLOCK *** Feb 19 12:22:03 localhost kernel: 2 locks held by rm/4171: Feb 19 12:22:03 localhost kernel: #0: (&type->i_mutex_dir_key#4){+.+.+.}, at: [<ffffffff81233fc2>] iterate_dir+0x62/0xe0 Feb 19 12:22:03 localhost kernel: #1: (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa05a0022>] xfs_ilock+0x122/0x250 [xfs] Feb 19 12:22:03 localhost kernel: stack backtrace: Feb 19 12:22:03 localhost kernel: CPU: 1 PID: 4171 Comm: rm Tainted: GF W O 3.14.0-rc1+ #6 Feb 19 12:22:03 localhost kernel: Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Feb 19 12:22:03 localhost kernel: ffffffff82597d80 ffff8800c43cdc60 ffffffff8177ba90 ffffffff825cd9c0 Feb 19 12:22:03 localhost kernel: ffff8800c43cdca0 ffffffff81777168 ffff8800c43cdcf0 ffff8800d44ba630 Feb 19 12:22:03 localhost kernel: ffff8800d44b9aa0 0000000000000002 0000000000000002 ffff8800d44ba630 Feb 19 12:22:03 localhost kernel: Call Trace: Feb 19 12:22:03 localhost kernel: [<ffffffff8177ba90>] dump_stack+0x4d/0x66 Feb 19 12:22:03 localhost kernel: [<ffffffff81777168>] print_circular_bug+0x201/0x20f Feb 19 12:22:03 localhost kernel: [<ffffffff810f351c>] __lock_acquire+0x18ec/0x1aa0 Feb 19 12:22:03 localhost kernel: [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0 Feb 19 12:22:03 localhost kernel: [<ffffffff811cc8cf>] ? might_fault+0x5f/0xb0 Feb 19 12:22:03 localhost kernel: [<ffffffff811cc8fc>] might_fault+0x8c/0xb0 Feb 19 12:22:03 localhost kernel: [<ffffffff811cc8cf>] ? 
might_fault+0x5f/0xb0 Feb 19 12:22:03 localhost kernel: [<ffffffff812341c1>] filldir+0x91/0x120 Feb 19 12:22:03 localhost kernel: [<ffffffffa053f2f7>] xfs_dir2_sf_getdents+0x317/0x380 [xfs] Feb 19 12:22:03 localhost kernel: [<ffffffffa05a0022>] ? xfs_ilock+0x122/0x250 [xfs] Feb 19 12:22:03 localhost kernel: [<ffffffffa054001b>] xfs_readdir+0x16b/0x230 [xfs] Feb 19 12:22:03 localhost kernel: [<ffffffffa05427fb>] xfs_file_readdir+0x2b/0x40 [xfs] Feb 19 12:22:03 localhost kernel: [<ffffffff81234008>] iterate_dir+0xa8/0xe0 Feb 19 12:22:03 localhost kernel: [<ffffffff812344b3>] SyS_getdents+0x93/0x120 Feb 19 12:22:03 localhost kernel: [<ffffffff81234130>] ? fillonedir+0xf0/0xf0 Feb 19 12:22:03 localhost kernel: [<ffffffff8114a2cc>] ? __audit_syscall_entry+0x9c/0xf0 Feb 19 12:22:03 localhost kernel: [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b [-- Attachment #3: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
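[Editorial note: the report above is lockdep doing its basic job, remembering which lock classes were taken while which others were held, and complaining when a new acquisition inverts a recorded order. A minimal userspace sketch of that bookkeeping follows; the class numbers, names, and functions here are all invented for illustration and are not kernel APIs.]

```c
#include <assert.h>

/*
 * Toy model of lockdep's order tracking. Lock classes are small ints
 * (e.g. 0 = a directory ilock class, 1 = mmap_sem). toy_before[a][b]
 * records that class a was held while class b was acquired.
 */
#define TOY_MAX_CLASSES 8

static int toy_before[TOY_MAX_CLASSES][TOY_MAX_CLASSES];

/*
 * Record taking "class" while the classes in held[0..nheld) are held.
 * Returns 0 if the order is consistent with everything seen so far,
 * -1 if it directly inverts a previously recorded dependency.
 */
int toy_lockdep_acquire(const int *held, int nheld, int class)
{
    int i;

    for (i = 0; i < nheld; i++)
        if (toy_before[class][held[i]])
            return -1;      /* class -> held[i] is already known, so
                             * held[i] -> class would be an inversion */
    for (i = 0; i < nheld; i++)
        toy_before[held[i]][class] = 1;
    return 0;
}
```

The real lockdep tracks far more state (read vs write, irq context, trylocks, transitive chains), but the inversion check is the core of why a single run can report an ordering that never actually deadlocked.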
* mmap_sem -> isec->lock lockdep issues with shmem (was Re: [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive) 2014-02-19 18:25 ` Brian Foster @ 2014-02-20 0:13 ` Dave Chinner 0 siblings, 0 replies; 19+ messages in thread From: Dave Chinner @ 2014-02-20 0:13 UTC (permalink / raw) To: Brian Foster; +Cc: linux-mm, linux-security-module, xfs [cc linux-mm because it is shmem craziness that is causing the problem] [cc linux-security-module because it is security contexts that need lockdep annotations.] On Wed, Feb 19, 2014 at 01:25:16PM -0500, Brian Foster wrote: > On 02/18/2014 11:16 PM, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > The change to add the IO lock to protect the directory extent map > > during readdir operations has caused lockdep to have a heart attack > > as it now sees a different locking order on inodes w.r.t. the > > mmap_sem because readdir has a different ordering to write(). > > > > Add a new lockdep class for directory inodes to avoid this false > > positive. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > Hey Dave, > > I'm not terribly familiar with lockdep, but I hit the attached "possible > circular locking dependency detected" warning when running with this patch. > > (Reproduces by running generic/001 after a reboot). Ok, you're testing on an selinux-enabled system, I didn't. 
> Feb 19 12:22:03 localhost kernel: [ 101.487018] > Feb 19 12:22:03 localhost kernel: [ 101.487018] -> #2 (&xfs_dir_ilock_class){++++..}: > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff810ed147>] down_read_nested+0x57/0xa0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa05a0022>] xfs_ilock+0x122/0x250 [xfs] > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa05a01af>] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs] > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa0565d50>] xfs_attr_get+0x90/0xe0 [xfs] > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa055b9d7>] xfs_xattr_get+0x37/0x50 [xfs] > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff812483ef>] generic_getxattr+0x4f/0x70 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8133fd5e>] inode_doinit_with_dentry+0x1ae/0x650 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff813402d8>] sb_finish_set_opts+0xd8/0x270 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81340702>] selinux_set_mnt_opts+0x292/0x5f0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81340ac8>] superblock_doinit+0x68/0xd0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81340b8d>] selinux_sb_kern_mount+0x3d/0xa0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81335536>] security_sb_kern_mount+0x16/0x20 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8122333a>] mount_fs+0x8a/0x1b0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8124285b>] vfs_kern_mount+0x6b/0x150 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8124561e>] do_mount+0x23e/0xb90 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff812462a3>] SyS_mount+0x83/0xc0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b So, we take the ilock on the directory xattr read path during 
security attribute initialisation so we have an inode->i_isec->lock -> ilock path, which is normal. > Feb 19 12:22:03 localhost kernel: [ 101.487018] -> #1 (&isec->lock){+.+.+.}: > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81780d77>] mutex_lock_nested+0x77/0x3f0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8133fc42>] inode_doinit_with_dentry+0x92/0x650 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81340dcc>] selinux_d_instantiate+0x1c/0x20 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8133517b>] security_d_instantiate+0x1b/0x30 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81237d70>] d_instantiate+0x50/0x70 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811bcb70>] __shmem_file_setup+0xe0/0x1d0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811bf988>] shmem_zero_setup+0x28/0x70 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811d8653>] mmap_region+0x543/0x5a0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811d89b1>] do_mmap_pgoff+0x301/0x3c0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811c18f0>] vm_mmap_pgoff+0x90/0xc0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811d6f26>] SyS_mmap_pgoff+0x116/0x270 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8101f9b2>] SyS_mmap+0x22/0x30 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b What the hell? We instantiate a shmem filesystem *inode* under the mmap_sem? And so we have a mmap_sem -> inode->i_isec->lock path on a *shmem* inode. 
> Feb 19 12:22:03 localhost kernel: [ 101.487018] > Feb 19 12:22:03 localhost kernel: [ 101.487018] -> #0 (&mm->mmap_sem){++++++}: > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff810f351c>] __lock_acquire+0x18ec/0x1aa0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff810f3ec2>] lock_acquire+0xa2/0x1d0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff811cc8fc>] might_fault+0x8c/0xb0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff812341c1>] filldir+0x91/0x120 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa053f2f7>] xfs_dir2_sf_getdents+0x317/0x380 [xfs] > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa054001b>] xfs_readdir+0x16b/0x230 [xfs] > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffffa05427fb>] xfs_file_readdir+0x2b/0x40 [xfs] > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff81234008>] iterate_dir+0xa8/0xe0 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff812344b3>] SyS_getdents+0x93/0x120 > Feb 19 12:22:03 localhost kernel: [ 101.487018] [<ffffffff8178ed69>] system_call_fastpath+0x16/0x1b And then we have the mmap_sem in readdir, which gives an inode->ilock -> mmap_sem ordering. 
> Feb 19 12:22:03 localhost kernel: [ 101.487018] > Feb 19 12:22:03 localhost kernel: [ 101.487018] other info that might help us debug this: > Feb 19 12:22:03 localhost kernel: [ 101.487018] > Feb 19 12:22:03 localhost kernel: [ 101.487018] Chain exists of: > Feb 19 12:22:03 localhost kernel: [ 101.487018] &mm->mmap_sem --> &isec->lock --> &xfs_dir_ilock_class > Feb 19 12:22:03 localhost kernel: [ 101.487018] > Feb 19 12:22:03 localhost kernel: [ 101.487018] Possible unsafe locking scenario: > Feb 19 12:22:03 localhost kernel: [ 101.487018] > Feb 19 12:22:03 localhost kernel: [ 101.487018] CPU0 CPU1 > Feb 19 12:22:03 localhost kernel: [ 101.487018] ---- ---- > Feb 19 12:22:03 localhost kernel: [ 101.487018] lock(&xfs_dir_ilock_class); > Feb 19 12:22:03 localhost kernel: [ 101.487018] lock(&isec->lock); > Feb 19 12:22:03 localhost kernel: [ 101.487018] lock(&xfs_dir_ilock_class); > Feb 19 12:22:03 localhost kernel: [ 101.487018] lock(&mm->mmap_sem); So that's just another goddamn false positive. The problem here is that it's many, many layers away from XFS, and really doesn't involve XFS at all. It's caused by shmem instantiating an inode under the mmap_sem... Basically, the only way I can see that this is even remotely preventable is that inode->isec->ilock needs a per-sb lockdep context so that lockdep doesn't confuse the lock hierarchies of completely unrelated filesystems when someone does something crazy like the page fault path is currently doing. Fmeh: struct super_block { .... #ifdef CONFIG_SECURITY void *s_security; #endif So I can't even isolate it to the security subsystem pointer in the superblock because there isn't a generic structure to abstract security-specific stuff from the superblock without having to implement the same lockdep annotations in every security module that uses xattrs to store security information. Cheers, Dave. 
-- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
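[Editorial note: the chain walked through above is transitive; no single call site takes mmap_sem and the directory ilock in both orders. Lockdep only sees a cycle by chaining mmap_sem -> isec->lock (the shmem mmap path) with isec->lock -> ilock (the xattr path) and ilock -> mmap_sem (readdir). A sketch of that closure check, with invented class names standing in for the real lock classes:]

```c
#include <assert.h>

/* Invented stand-ins for the three lock classes in the report. */
enum { MMAP_SEM, ISEC_LOCK, DIR_ILOCK, NCLASSES };

static int dep_edge[NCLASSES][NCLASSES];    /* dep_edge[a][b]: a taken before b */

void add_dep(int a, int b) { dep_edge[a][b] = 1; }

/* DFS: is "to" reachable from "from" through recorded dependencies? */
static int reach(int from, int to, int *seen)
{
    int n;

    if (from == to)
        return 1;
    seen[from] = 1;
    for (n = 0; n < NCLASSES; n++)
        if (dep_edge[from][n] && !seen[n] && reach(n, to, seen))
            return 1;
    return 0;
}

/* Would taking "want" while holding "held" close a dependency cycle? */
int creates_cycle(int held, int want)
{
    int seen[NCLASSES] = {0};

    /* if "held" is already reachable from "want", adding
     * held -> want completes a loop */
    return reach(want, held, seen);
}
```

This is why the false positive is so hard to annotate away locally: the offending edge (isec->lock taken under mmap_sem) belongs to shmem and the security module, two layers away from XFS.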
* Re: [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive 2014-02-19 4:16 ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Dave Chinner 2014-02-19 18:25 ` Brian Foster @ 2014-02-20 14:51 ` Christoph Hellwig 1 sibling, 0 replies; 19+ messages in thread From: Christoph Hellwig @ 2014-02-20 14:51 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs Looks good, Reviewed-by: Christoph Hellwig <hch@lst.de> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint 2014-02-19 4:16 [PATCH 0/3] xfs: lockdep and stack reduction fixes Dave Chinner 2014-02-19 4:16 ` [PATCH 1/3] xfs: always do log forces via the workqueue Dave Chinner 2014-02-19 4:16 ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Dave Chinner @ 2014-02-19 4:16 ` Dave Chinner 2014-02-19 18:25 ` Brian Foster 2014-02-20 14:56 ` Christoph Hellwig 2 siblings, 2 replies; 19+ messages in thread From: Dave Chinner @ 2014-02-19 4:16 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> The struct xfs_da_args used to pass directory/attribute operation information to the lower layers is 128 bytes in size and is allocated on the stack. Dynamically allocate them to reduce the stack footprint of directory operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_dir2.c | 344 +++++++++++++++++++++++++++++++++--------------------- 1 file changed, 214 insertions(+), 130 deletions(-) diff --git a/fs/xfs/xfs_dir2.c b/fs/xfs/xfs_dir2.c index ce16ef0..fc9a41f 100644 --- a/fs/xfs/xfs_dir2.c +++ b/fs/xfs/xfs_dir2.c @@ -180,16 +180,23 @@ xfs_dir_init( xfs_inode_t *dp, xfs_inode_t *pdp) { - xfs_da_args_t args; + struct xfs_da_args *args; int error; - memset((char *)&args, 0, sizeof(args)); - args.dp = dp; - args.trans = tp; ASSERT(S_ISDIR(dp->i_d.di_mode)); - if ((error = xfs_dir_ino_validate(tp->t_mountp, pdp->i_ino))) + error = xfs_dir_ino_validate(tp->t_mountp, pdp->i_ino); + if (error) return error; - return xfs_dir2_sf_create(&args, pdp->i_ino); + + args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS); + if (!args) + return ENOMEM; + + args->dp = dp; + args->trans = tp; + error = xfs_dir2_sf_create(args, pdp->i_ino); + kmem_free(args); + return error; } /* @@ -205,41 +212,56 @@ xfs_dir_createname( xfs_bmap_free_t *flist, /* bmap's freeblock list */ xfs_extlen_t total) /* bmap's total block count */ { - xfs_da_args_t args; + struct xfs_da_args *args; int rval; 
int v; /* type-checking value */ ASSERT(S_ISDIR(dp->i_d.di_mode)); - if ((rval = xfs_dir_ino_validate(tp->t_mountp, inum))) + rval = xfs_dir_ino_validate(tp->t_mountp, inum); + if (rval) return rval; XFS_STATS_INC(xs_dir_create); - memset(&args, 0, sizeof(xfs_da_args_t)); - args.name = name->name; - args.namelen = name->len; - args.filetype = name->type; - args.hashval = dp->i_mount->m_dirnameops->hashname(name); - args.inumber = inum; - args.dp = dp; - args.firstblock = first; - args.flist = flist; - args.total = total; - args.whichfork = XFS_DATA_FORK; - args.trans = tp; - args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT; - - if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) - rval = xfs_dir2_sf_addname(&args); - else if ((rval = xfs_dir2_isblock(tp, dp, &v))) - return rval; - else if (v) - rval = xfs_dir2_block_addname(&args); - else if ((rval = xfs_dir2_isleaf(tp, dp, &v))) - return rval; - else if (v) - rval = xfs_dir2_leaf_addname(&args); + args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS); + if (!args) + return ENOMEM; + + args->name = name->name; + args->namelen = name->len; + args->filetype = name->type; + args->hashval = dp->i_mount->m_dirnameops->hashname(name); + args->inumber = inum; + args->dp = dp; + args->firstblock = first; + args->flist = flist; + args->total = total; + args->whichfork = XFS_DATA_FORK; + args->trans = tp; + args->op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT; + + if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { + rval = xfs_dir2_sf_addname(args); + goto out_free; + } + + rval = xfs_dir2_isblock(tp, dp, &v); + if (rval) + goto out_free; + if (v) { + rval = xfs_dir2_block_addname(args); + goto out_free; + } + + rval = xfs_dir2_isleaf(tp, dp, &v); + if (rval) + goto out_free; + if (v) + rval = xfs_dir2_leaf_addname(args); else - rval = xfs_dir2_node_addname(&args); + rval = xfs_dir2_node_addname(args); + +out_free: + kmem_free(args); return rval; } @@ -282,46 +304,68 @@ xfs_dir_lookup( xfs_ino_t *inum, /* out: inode number 
*/ struct xfs_name *ci_name) /* out: actual name if CI match */ { - xfs_da_args_t args; + struct xfs_da_args *args; int rval; int v; /* type-checking value */ ASSERT(S_ISDIR(dp->i_d.di_mode)); XFS_STATS_INC(xs_dir_lookup); - memset(&args, 0, sizeof(xfs_da_args_t)); - args.name = name->name; - args.namelen = name->len; - args.filetype = name->type; - args.hashval = dp->i_mount->m_dirnameops->hashname(name); - args.dp = dp; - args.whichfork = XFS_DATA_FORK; - args.trans = tp; - args.op_flags = XFS_DA_OP_OKNOENT; + /* + * If we don't use KM_NOFS here, lockdep will throw false positive + * deadlock warnings when we come through here via the non-transactional + * lookup path because the allocation can recurse into inode reclaim. + * Doing this avoids having to add a bunch of lockdep class + * annotations into the reclaim path for the ilock. + */ + args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS); + if (!args) + return ENOMEM; + + args->name = name->name; + args->namelen = name->len; + args->filetype = name->type; + args->hashval = dp->i_mount->m_dirnameops->hashname(name); + args->dp = dp; + args->whichfork = XFS_DATA_FORK; + args->trans = tp; + args->op_flags = XFS_DA_OP_OKNOENT; if (ci_name) - args.op_flags |= XFS_DA_OP_CILOOKUP; + args->op_flags |= XFS_DA_OP_CILOOKUP; - if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) - rval = xfs_dir2_sf_lookup(&args); - else if ((rval = xfs_dir2_isblock(tp, dp, &v))) - return rval; - else if (v) - rval = xfs_dir2_block_lookup(&args); - else if ((rval = xfs_dir2_isleaf(tp, dp, &v))) - return rval; - else if (v) - rval = xfs_dir2_leaf_lookup(&args); + if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { + rval = xfs_dir2_sf_lookup(args); + goto out_check_rval; + } + + rval = xfs_dir2_isblock(tp, dp, &v); + if (rval) + goto out_free; + if (v) { + rval = xfs_dir2_block_lookup(args); + goto out_check_rval; + } + + rval = xfs_dir2_isleaf(tp, dp, &v); + if (rval) + goto out_free; + if (v) + rval = xfs_dir2_leaf_lookup(args); else - rval 
= xfs_dir2_node_lookup(&args); + rval = xfs_dir2_node_lookup(args); + +out_check_rval: if (rval == EEXIST) rval = 0; if (!rval) { - *inum = args.inumber; + *inum = args->inumber; if (ci_name) { - ci_name->name = args.value; - ci_name->len = args.valuelen; + ci_name->name = args->value; + ci_name->len = args->valuelen; } } +out_free: + kmem_free(args); return rval; } @@ -338,38 +382,51 @@ xfs_dir_removename( xfs_bmap_free_t *flist, /* bmap's freeblock list */ xfs_extlen_t total) /* bmap's total block count */ { - xfs_da_args_t args; + struct xfs_da_args *args; int rval; int v; /* type-checking value */ ASSERT(S_ISDIR(dp->i_d.di_mode)); XFS_STATS_INC(xs_dir_remove); - memset(&args, 0, sizeof(xfs_da_args_t)); - args.name = name->name; - args.namelen = name->len; - args.filetype = name->type; - args.hashval = dp->i_mount->m_dirnameops->hashname(name); - args.inumber = ino; - args.dp = dp; - args.firstblock = first; - args.flist = flist; - args.total = total; - args.whichfork = XFS_DATA_FORK; - args.trans = tp; - - if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) - rval = xfs_dir2_sf_removename(&args); - else if ((rval = xfs_dir2_isblock(tp, dp, &v))) - return rval; - else if (v) - rval = xfs_dir2_block_removename(&args); - else if ((rval = xfs_dir2_isleaf(tp, dp, &v))) - return rval; - else if (v) - rval = xfs_dir2_leaf_removename(&args); + args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS); + if (!args) + return ENOMEM; + + args->name = name->name; + args->namelen = name->len; + args->filetype = name->type; + args->hashval = dp->i_mount->m_dirnameops->hashname(name); + args->inumber = ino; + args->dp = dp; + args->firstblock = first; + args->flist = flist; + args->total = total; + args->whichfork = XFS_DATA_FORK; + args->trans = tp; + + if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { + rval = xfs_dir2_sf_removename(args); + goto out_free; + } + + rval = xfs_dir2_isblock(tp, dp, &v); + if (rval) + goto out_free; + if (v) { + rval = xfs_dir2_block_removename(args); + 
goto out_free; + } + + rval = xfs_dir2_isleaf(tp, dp, &v); + if (rval) + goto out_free; + if (v) + rval = xfs_dir2_leaf_removename(args); else - rval = xfs_dir2_node_removename(&args); + rval = xfs_dir2_node_removename(args); +out_free: + kmem_free(args); return rval; } @@ -386,40 +443,54 @@ xfs_dir_replace( xfs_bmap_free_t *flist, /* bmap's freeblock list */ xfs_extlen_t total) /* bmap's total block count */ { - xfs_da_args_t args; + struct xfs_da_args *args; int rval; int v; /* type-checking value */ ASSERT(S_ISDIR(dp->i_d.di_mode)); - if ((rval = xfs_dir_ino_validate(tp->t_mountp, inum))) + rval = xfs_dir_ino_validate(tp->t_mountp, inum); + if (rval) return rval; - memset(&args, 0, sizeof(xfs_da_args_t)); - args.name = name->name; - args.namelen = name->len; - args.filetype = name->type; - args.hashval = dp->i_mount->m_dirnameops->hashname(name); - args.inumber = inum; - args.dp = dp; - args.firstblock = first; - args.flist = flist; - args.total = total; - args.whichfork = XFS_DATA_FORK; - args.trans = tp; - - if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) - rval = xfs_dir2_sf_replace(&args); - else if ((rval = xfs_dir2_isblock(tp, dp, &v))) - return rval; - else if (v) - rval = xfs_dir2_block_replace(&args); - else if ((rval = xfs_dir2_isleaf(tp, dp, &v))) - return rval; - else if (v) - rval = xfs_dir2_leaf_replace(&args); + args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS); + if (!args) + return ENOMEM; + + args->name = name->name; + args->namelen = name->len; + args->filetype = name->type; + args->hashval = dp->i_mount->m_dirnameops->hashname(name); + args->inumber = inum; + args->dp = dp; + args->firstblock = first; + args->flist = flist; + args->total = total; + args->whichfork = XFS_DATA_FORK; + args->trans = tp; + + if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { + rval = xfs_dir2_sf_replace(args); + goto out_free; + } + + rval = xfs_dir2_isblock(tp, dp, &v); + if (rval) + goto out_free; + if (v) { + rval = xfs_dir2_block_replace(args); + goto 
out_free; + } + + rval = xfs_dir2_isleaf(tp, dp, &v); + if (rval) + goto out_free; + if (v) + rval = xfs_dir2_leaf_replace(args); else - rval = xfs_dir2_node_replace(&args); + rval = xfs_dir2_node_replace(args); +out_free: + kmem_free(args); return rval; } @@ -434,7 +505,7 @@ xfs_dir_canenter( struct xfs_name *name, /* name of entry to add */ uint resblks) { - xfs_da_args_t args; + struct xfs_da_args *args; int rval; int v; /* type-checking value */ @@ -443,29 +514,42 @@ xfs_dir_canenter( ASSERT(S_ISDIR(dp->i_d.di_mode)); - memset(&args, 0, sizeof(xfs_da_args_t)); - args.name = name->name; - args.namelen = name->len; - args.filetype = name->type; - args.hashval = dp->i_mount->m_dirnameops->hashname(name); - args.dp = dp; - args.whichfork = XFS_DATA_FORK; - args.trans = tp; - args.op_flags = XFS_DA_OP_JUSTCHECK | XFS_DA_OP_ADDNAME | + args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS); + if (!args) + return ENOMEM; + + args->name = name->name; + args->namelen = name->len; + args->filetype = name->type; + args->hashval = dp->i_mount->m_dirnameops->hashname(name); + args->dp = dp; + args->whichfork = XFS_DATA_FORK; + args->trans = tp; + args->op_flags = XFS_DA_OP_JUSTCHECK | XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT; - if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) - rval = xfs_dir2_sf_addname(&args); - else if ((rval = xfs_dir2_isblock(tp, dp, &v))) - return rval; - else if (v) - rval = xfs_dir2_block_addname(&args); - else if ((rval = xfs_dir2_isleaf(tp, dp, &v))) - return rval; - else if (v) - rval = xfs_dir2_leaf_addname(&args); + if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { + rval = xfs_dir2_sf_addname(args); + goto out_free; + } + + rval = xfs_dir2_isblock(tp, dp, &v); + if (rval) + goto out_free; + if (v) { + rval = xfs_dir2_block_addname(args); + goto out_free; + } + + rval = xfs_dir2_isleaf(tp, dp, &v); + if (rval) + goto out_free; + if (v) + rval = xfs_dir2_leaf_addname(args); else - rval = xfs_dir2_node_addname(&args); + rval = 
xfs_dir2_node_addname(args); +out_free: + kmem_free(args); return rval; } -- 1.8.4.rc3 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint 2014-02-19 4:16 ` [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint Dave Chinner @ 2014-02-19 18:25 ` Brian Foster 2014-02-20 14:56 ` Christoph Hellwig 1 sibling, 0 replies; 19+ messages in thread From: Brian Foster @ 2014-02-19 18:25 UTC (permalink / raw) To: Dave Chinner, xfs On 02/18/2014 11:16 PM, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > The struct xfs_da_args used to pass directory/attribute operation > information to the lower layers is 128 bytes in size and is > allocated on the stack. Dynamically allocate them to reduce the > stack footprint of directory operations. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_dir2.c | 344 +++++++++++++++++++++++++++++++++--------------------- > 1 file changed, 214 insertions(+), 130 deletions(-) > > diff --git a/fs/xfs/xfs_dir2.c b/fs/xfs/xfs_dir2.c > index ce16ef0..fc9a41f 100644 > --- a/fs/xfs/xfs_dir2.c > +++ b/fs/xfs/xfs_dir2.c > @@ -180,16 +180,23 @@ xfs_dir_init( > xfs_inode_t *dp, > xfs_inode_t *pdp) > { > - xfs_da_args_t args; > + struct xfs_da_args *args; > int error; > > - memset((char *)&args, 0, sizeof(args)); > - args.dp = dp; > - args.trans = tp; > ASSERT(S_ISDIR(dp->i_d.di_mode)); > - if ((error = xfs_dir_ino_validate(tp->t_mountp, pdp->i_ino))) > + error = xfs_dir_ino_validate(tp->t_mountp, pdp->i_ino); > + if (error) > return error; > - return xfs_dir2_sf_create(&args, pdp->i_ino); > + > + args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS); > + if (!args) > + return ENOMEM; > + > + args->dp = dp; > + args->trans = tp; > + error = xfs_dir2_sf_create(args, pdp->i_ino); > + kmem_free(args); > + return error; > } > > /* > @@ -205,41 +212,56 @@ xfs_dir_createname( > xfs_bmap_free_t *flist, /* bmap's freeblock list */ > xfs_extlen_t total) /* bmap's total block count */ > { > - xfs_da_args_t args; > + struct xfs_da_args *args; > int rval; > int v; /* type-checking value */ 
> > ASSERT(S_ISDIR(dp->i_d.di_mode)); > - if ((rval = xfs_dir_ino_validate(tp->t_mountp, inum))) > + rval = xfs_dir_ino_validate(tp->t_mountp, inum); > + if (rval) > return rval; > XFS_STATS_INC(xs_dir_create); > > - memset(&args, 0, sizeof(xfs_da_args_t)); > - args.name = name->name; > - args.namelen = name->len; > - args.filetype = name->type; > - args.hashval = dp->i_mount->m_dirnameops->hashname(name); > - args.inumber = inum; > - args.dp = dp; > - args.firstblock = first; > - args.flist = flist; > - args.total = total; > - args.whichfork = XFS_DATA_FORK; > - args.trans = tp; > - args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT; > - > - if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) > - rval = xfs_dir2_sf_addname(&args); > - else if ((rval = xfs_dir2_isblock(tp, dp, &v))) > - return rval; > - else if (v) > - rval = xfs_dir2_block_addname(&args); > - else if ((rval = xfs_dir2_isleaf(tp, dp, &v))) > - return rval; > - else if (v) > - rval = xfs_dir2_leaf_addname(&args); > + args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS); > + if (!args) > + return ENOMEM; > + > + args->name = name->name; > + args->namelen = name->len; > + args->filetype = name->type; > + args->hashval = dp->i_mount->m_dirnameops->hashname(name); > + args->inumber = inum; > + args->dp = dp; > + args->firstblock = first; > + args->flist = flist; > + args->total = total; > + args->whichfork = XFS_DATA_FORK; > + args->trans = tp; > + args->op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT; > + > + if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { > + rval = xfs_dir2_sf_addname(args); > + goto out_free; > + } > + > + rval = xfs_dir2_isblock(tp, dp, &v); > + if (rval) > + goto out_free; > + if (v) { > + rval = xfs_dir2_block_addname(args); > + goto out_free; > + } > + > + rval = xfs_dir2_isleaf(tp, dp, &v); > + if (rval) > + goto out_free; > + if (v) > + rval = xfs_dir2_leaf_addname(args); > else > - rval = xfs_dir2_node_addname(&args); > + rval = xfs_dir2_node_addname(args); > + > +out_free: 
> + kmem_free(args); > return rval; > } > > @@ -282,46 +304,68 @@ xfs_dir_lookup( > xfs_ino_t *inum, /* out: inode number */ > struct xfs_name *ci_name) /* out: actual name if CI match */ > { > - xfs_da_args_t args; > + struct xfs_da_args *args; > int rval; > int v; /* type-checking value */ > > ASSERT(S_ISDIR(dp->i_d.di_mode)); > XFS_STATS_INC(xs_dir_lookup); > > - memset(&args, 0, sizeof(xfs_da_args_t)); > - args.name = name->name; > - args.namelen = name->len; > - args.filetype = name->type; > - args.hashval = dp->i_mount->m_dirnameops->hashname(name); > - args.dp = dp; > - args.whichfork = XFS_DATA_FORK; > - args.trans = tp; > - args.op_flags = XFS_DA_OP_OKNOENT; > + /* > + * If we don't use KM_NOFS here, lockdep will through false positive throw? Reviewed-by: Brian Foster <bfoster@redhat.com> > + * deadlock warnings when we come through here of the non-transactional > + * lookup path because the allocation can recurse into inode reclaim. > + * Doing this avoids having to add a bunch of lockdep class > + * annotations into the reclaim patch for the ilock. 
> + */ > + args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS); > + if (!args) > + return ENOMEM; > + > + args->name = name->name; > + args->namelen = name->len; > + args->filetype = name->type; > + args->hashval = dp->i_mount->m_dirnameops->hashname(name); > + args->dp = dp; > + args->whichfork = XFS_DATA_FORK; > + args->trans = tp; > + args->op_flags = XFS_DA_OP_OKNOENT; > if (ci_name) > - args.op_flags |= XFS_DA_OP_CILOOKUP; > + args->op_flags |= XFS_DA_OP_CILOOKUP; > > - if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) > - rval = xfs_dir2_sf_lookup(&args); > - else if ((rval = xfs_dir2_isblock(tp, dp, &v))) > - return rval; > - else if (v) > - rval = xfs_dir2_block_lookup(&args); > - else if ((rval = xfs_dir2_isleaf(tp, dp, &v))) > - return rval; > - else if (v) > - rval = xfs_dir2_leaf_lookup(&args); > + if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { > + rval = xfs_dir2_sf_lookup(args); > + goto out_check_rval; > + } > + > + rval = xfs_dir2_isblock(tp, dp, &v); > + if (rval) > + goto out_free; > + if (v) { > + rval = xfs_dir2_block_lookup(args); > + goto out_check_rval; > + } > + > + rval = xfs_dir2_isleaf(tp, dp, &v); > + if (rval) > + goto out_free; > + if (v) > + rval = xfs_dir2_leaf_lookup(args); > else > - rval = xfs_dir2_node_lookup(&args); > + rval = xfs_dir2_node_lookup(args); > + > +out_check_rval: > if (rval == EEXIST) > rval = 0; > if (!rval) { > - *inum = args.inumber; > + *inum = args->inumber; > if (ci_name) { > - ci_name->name = args.value; > - ci_name->len = args.valuelen; > + ci_name->name = args->value; > + ci_name->len = args->valuelen; > } > } > +out_free: > + kmem_free(args); > return rval; > } > > @@ -338,38 +382,51 @@ xfs_dir_removename( > xfs_bmap_free_t *flist, /* bmap's freeblock list */ > xfs_extlen_t total) /* bmap's total block count */ > { > - xfs_da_args_t args; > + struct xfs_da_args *args; > int rval; > int v; /* type-checking value */ > > ASSERT(S_ISDIR(dp->i_d.di_mode)); > XFS_STATS_INC(xs_dir_remove); > > - memset(&args, 
0, sizeof(xfs_da_args_t)); > - args.name = name->name; > - args.namelen = name->len; > - args.filetype = name->type; > - args.hashval = dp->i_mount->m_dirnameops->hashname(name); > - args.inumber = ino; > - args.dp = dp; > - args.firstblock = first; > - args.flist = flist; > - args.total = total; > - args.whichfork = XFS_DATA_FORK; > - args.trans = tp; > - > - if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) > - rval = xfs_dir2_sf_removename(&args); > - else if ((rval = xfs_dir2_isblock(tp, dp, &v))) > - return rval; > - else if (v) > - rval = xfs_dir2_block_removename(&args); > - else if ((rval = xfs_dir2_isleaf(tp, dp, &v))) > - return rval; > - else if (v) > - rval = xfs_dir2_leaf_removename(&args); > + args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS); > + if (!args) > + return ENOMEM; > + > + args->name = name->name; > + args->namelen = name->len; > + args->filetype = name->type; > + args->hashval = dp->i_mount->m_dirnameops->hashname(name); > + args->inumber = ino; > + args->dp = dp; > + args->firstblock = first; > + args->flist = flist; > + args->total = total; > + args->whichfork = XFS_DATA_FORK; > + args->trans = tp; > + > + if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { > + rval = xfs_dir2_sf_removename(args); > + goto out_free; > + } > + > + rval = xfs_dir2_isblock(tp, dp, &v); > + if (rval) > + goto out_free; > + if (v) { > + rval = xfs_dir2_block_removename(args); > + goto out_free; > + } > + > + rval = xfs_dir2_isleaf(tp, dp, &v); > + if (rval) > + goto out_free; > + if (v) > + rval = xfs_dir2_leaf_removename(args); > else > - rval = xfs_dir2_node_removename(&args); > + rval = xfs_dir2_node_removename(args); > +out_free: > + kmem_free(args); > return rval; > } > > @@ -386,40 +443,54 @@ xfs_dir_replace( > xfs_bmap_free_t *flist, /* bmap's freeblock list */ > xfs_extlen_t total) /* bmap's total block count */ > { > - xfs_da_args_t args; > + struct xfs_da_args *args; > int rval; > int v; /* type-checking value */ > > ASSERT(S_ISDIR(dp->i_d.di_mode)); 
> > - if ((rval = xfs_dir_ino_validate(tp->t_mountp, inum))) > + rval = xfs_dir_ino_validate(tp->t_mountp, inum); > + if (rval) > return rval; > > - memset(&args, 0, sizeof(xfs_da_args_t)); > - args.name = name->name; > - args.namelen = name->len; > - args.filetype = name->type; > - args.hashval = dp->i_mount->m_dirnameops->hashname(name); > - args.inumber = inum; > - args.dp = dp; > - args.firstblock = first; > - args.flist = flist; > - args.total = total; > - args.whichfork = XFS_DATA_FORK; > - args.trans = tp; > - > - if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) > - rval = xfs_dir2_sf_replace(&args); > - else if ((rval = xfs_dir2_isblock(tp, dp, &v))) > - return rval; > - else if (v) > - rval = xfs_dir2_block_replace(&args); > - else if ((rval = xfs_dir2_isleaf(tp, dp, &v))) > - return rval; > - else if (v) > - rval = xfs_dir2_leaf_replace(&args); > + args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS); > + if (!args) > + return ENOMEM; > + > + args->name = name->name; > + args->namelen = name->len; > + args->filetype = name->type; > + args->hashval = dp->i_mount->m_dirnameops->hashname(name); > + args->inumber = inum; > + args->dp = dp; > + args->firstblock = first; > + args->flist = flist; > + args->total = total; > + args->whichfork = XFS_DATA_FORK; > + args->trans = tp; > + > + if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { > + rval = xfs_dir2_sf_replace(args); > + goto out_free; > + } > + > + rval = xfs_dir2_isblock(tp, dp, &v); > + if (rval) > + goto out_free; > + if (v) { > + rval = xfs_dir2_block_replace(args); > + goto out_free; > + } > + > + rval = xfs_dir2_isleaf(tp, dp, &v); > + if (rval) > + goto out_free; > + if (v) > + rval = xfs_dir2_leaf_replace(args); > else > - rval = xfs_dir2_node_replace(&args); > + rval = xfs_dir2_node_replace(args); > +out_free: > + kmem_free(args); > return rval; > } > > @@ -434,7 +505,7 @@ xfs_dir_canenter( > struct xfs_name *name, /* name of entry to add */ > uint resblks) > { > - xfs_da_args_t args; > + struct 
xfs_da_args *args; > int rval; > int v; /* type-checking value */ > > @@ -443,29 +514,42 @@ xfs_dir_canenter( > > ASSERT(S_ISDIR(dp->i_d.di_mode)); > > - memset(&args, 0, sizeof(xfs_da_args_t)); > - args.name = name->name; > - args.namelen = name->len; > - args.filetype = name->type; > - args.hashval = dp->i_mount->m_dirnameops->hashname(name); > - args.dp = dp; > - args.whichfork = XFS_DATA_FORK; > - args.trans = tp; > - args.op_flags = XFS_DA_OP_JUSTCHECK | XFS_DA_OP_ADDNAME | > + args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS); > + if (!args) > + return ENOMEM; > + > + args->name = name->name; > + args->namelen = name->len; > + args->filetype = name->type; > + args->hashval = dp->i_mount->m_dirnameops->hashname(name); > + args->dp = dp; > + args->whichfork = XFS_DATA_FORK; > + args->trans = tp; > + args->op_flags = XFS_DA_OP_JUSTCHECK | XFS_DA_OP_ADDNAME | > XFS_DA_OP_OKNOENT; > > - if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) > - rval = xfs_dir2_sf_addname(&args); > - else if ((rval = xfs_dir2_isblock(tp, dp, &v))) > - return rval; > - else if (v) > - rval = xfs_dir2_block_addname(&args); > - else if ((rval = xfs_dir2_isleaf(tp, dp, &v))) > - return rval; > - else if (v) > - rval = xfs_dir2_leaf_addname(&args); > + if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) { > + rval = xfs_dir2_sf_addname(args); > + goto out_free; > + } > + > + rval = xfs_dir2_isblock(tp, dp, &v); > + if (rval) > + goto out_free; > + if (v) { > + rval = xfs_dir2_block_addname(args); > + goto out_free; > + } > + > + rval = xfs_dir2_isleaf(tp, dp, &v); > + if (rval) > + goto out_free; > + if (v) > + rval = xfs_dir2_leaf_addname(args); > else > - rval = xfs_dir2_node_addname(&args); > + rval = xfs_dir2_node_addname(args); > +out_free: > + kmem_free(args); > return rval; > } > > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
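Stripped of the xfs-specific types, the shape of the conversion reviewed above — heap-allocate the args structure instead of declaring it on the stack, dispatch on directory format, and funnel every exit through a single free label — can be sketched in userspace C. This is a sketch only: `calloc`/`free` stand in for `kmem_zalloc`/`kmem_free`, and the dispatch predicate and handlers are hypothetical stubs, not real XFS functions.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for struct xfs_da_args -- the real structure is 128 bytes,
 * which is why the patch moves it off the stack. */
struct da_args {
	const char	*name;
	size_t		namelen;
};

/* Hypothetical stubs standing in for xfs_dir2_isblock() and the
 * per-format operations. */
static int is_block(int *v)		{ *v = 1; return 0; }
static int block_op(struct da_args *a)	{ return a->namelen ? 0 : 22; }
static int leaf_op(struct da_args *a)	{ (void)a; return 0; }

/* Mirrors the post-patch control flow of xfs_dir_removename(). */
int dir_op(const char *name)
{
	struct da_args	*args;
	int		rval;
	int		v;

	/* Heap allocation replaces the on-stack struct. */
	args = calloc(1, sizeof(*args));
	if (!args)
		return 12;	/* ENOMEM */

	args->name = name;
	args->namelen = strlen(name);

	rval = is_block(&v);
	if (rval)
		goto out_free;
	if (v) {
		rval = block_op(args);
		goto out_free;
	}
	rval = leaf_op(args);
out_free:
	/* Single exit point: every path frees the args. */
	free(args);
	return rval;
}
```

This is why the patch unrolls the original `else if` chains: each early `return rval` would leak the allocation, so every exit has to become a `goto out_free`.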
* Re: [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint
  2014-02-19  4:16 ` [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint Dave Chinner
  2014-02-19 18:25 ` Brian Foster
@ 2014-02-20 14:56 ` Christoph Hellwig
  2014-02-20 21:09   ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2014-02-20 14:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Feb 19, 2014 at 03:16:42PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The struct xfs_da_args used to pass directory/attribute operation
> information to the lower layers is 128 bytes in size and is
> allocated on the stack. Dynamically allocate them to reduce the
> stack footprint of directory operations.

Are we having stack space problems in the directory code as well,
without all the VM code above it?  I'm definitely a bit scared about
adding another memory allocation to every single directory operation.

> +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
> +	if (!args)
> +		return ENOMEM;

KM_SLEEP is the default when KM_NOFS is set.  Also, KM_SLEEP will never
return a NULL pointer, so either remove the handling or make it a
KM_MAYFAIL allocation.

> +	/*
> +	 * If we don't use KM_NOFS here, lockdep will through false positive
> +	 * deadlock warnings when we come through here of the non-transactional
> +	 * lookup path because the allocation can recurse into inode reclaim.
> +	 * Doing this avoids having to add a bunch of lockdep class
> +	 * annotations into the reclaim patch for the ilock.
> +	 */
> +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);

I don't understand that comment.  We do use KM_NOFS here, unlike what
the comment claims, and the comment seems to explain why we actually
need KM_NOFS as far as I can tell.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint
  2014-02-20 14:56 ` Christoph Hellwig
@ 2014-02-20 21:09   ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2014-02-20 21:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, Feb 20, 2014 at 06:56:01AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 19, 2014 at 03:16:42PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > The struct xfs_da_args used to pass directory/attribute operation
> > information to the lower layers is 128 bytes in size and is
> > allocated on the stack. Dynamically allocate them to reduce the
> > stack footprint of directory operations.
>
> Are we having stack space problems in the directory code as well,
> without all the VM code above it? I'm definitely a bit scared about
> adding another memory allocation to every single directory operation.

Oh, yeah. We've been having problems for some time now. When I
introduced the stack switch in the allocation path, I also did it
for metadata allocations because we'd seen a couple of reports of
stack overflows through the directory code. That got reverted
because of the performance issues it caused, so the directory code
is simply just another ticking time bomb.

See the thread here:

http://oss.sgi.com/pipermail/xfs/2014-February/034018.html

and, specifically:

http://oss.sgi.com/pipermail/xfs/2014-February/034156.html

Of further interest:

http://oss.sgi.com/pipermail/xfs/2014-February/034051.html
http://oss.sgi.com/pipermail/xfs/2014-February/034057.html
http://oss.sgi.com/pipermail/xfs/2014-February/034053.html

> > +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
> > +	if (!args)
> > +		return ENOMEM;
>
> KM_SLEEP is the default when KM_NOFS is set. Also KM_SLEEP will never
> return a NULL pointer, either remove the handling or make it a
> KM_MAYFAIL allocation.

It might be, but it's the same pattern we use everywhere else.
It's as much documentation of what the code requires, because we
have KM_MAYFAIL | KM_NOFS in other places. I'll kill the ENOMEM
check - that's just habit....

> > +	/*
> > +	 * If we don't use KM_NOFS here, lockdep will through false positive
> > +	 * deadlock warnings when we come through here of the non-transactional
> > +	 * lookup path because the allocation can recurse into inode reclaim.
> > +	 * Doing this avoids having to add a bunch of lockdep class
> > +	 * annotations into the reclaim patch for the ilock.
> > +	 */
> > +	args = kmem_zalloc(sizeof(*args), KM_SLEEP | KM_NOFS);
>
> I don't understand that comment. We do use KM_NOFS here unlike what the
> comment claims, and the comment seems to explain why we actually need
> KM_NOFS as far as I can tell.

The comment says "if we *don't* use KM_NOFS", so AFAICT, it's
correct. I can rewrite it to be more obvious.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread
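Christoph's point about the redundant NULL check can be made concrete: with a sleeping allocator that never returns NULL, the error leg is dead code, so either the check goes or the allocation becomes may-fail. The two resulting shapes are sketched below in userspace C. This is only an illustration of the semantics described in the thread: `calloc` stands in for `kmem_zalloc`, and the function names are hypothetical.

```c
#include <stdlib.h>

struct da_args { int inumber; };

/* Variant 1: treat the allocation as non-failing (the KM_SLEEP-style
 * default described in the thread, where the kernel allocator retries
 * rather than returning NULL) -- no NULL check at the call site. */
struct da_args *alloc_args_sleep(void)
{
	return calloc(1, sizeof(struct da_args));
}

/* Variant 2: a may-fail allocation (KM_MAYFAIL-style) keeps the check
 * and propagates ENOMEM to the caller. */
int use_args_mayfail(int *out)
{
	struct da_args *args = calloc(1, sizeof(*args));

	if (!args)
		return 12;		/* ENOMEM */
	*out = args->inumber;		/* zeroed by the z-alloc */
	free(args);
	return 0;
}
```

Mixing the two — a KM_SLEEP allocation followed by a NULL check, as in the patch as posted — is harmless but misleading, which is exactly the review feedback here.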
end of thread, other threads: [~2014-02-24 13:35 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-19  4:16 [PATCH 0/3] xfs: lockdep and stack reduction fixes Dave Chinner
2014-02-19  4:16 ` [PATCH 1/3] xfs: always do log forces via the workqueue Dave Chinner
2014-02-19 18:24   ` Brian Foster
2014-02-20  0:23     ` Dave Chinner
2014-02-20 14:51       ` Mark Tinguely
2014-02-20 22:07         ` Dave Chinner
2014-02-20 22:35           ` Mark Tinguely
2014-02-21  0:02             ` Dave Chinner
2014-02-21 15:04     ` Brian Foster
2014-02-21 22:21       ` Dave Chinner
2014-02-24 13:35         ` Brian Foster
2014-02-19  4:16 ` [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive Dave Chinner
2014-02-19 18:25   ` Brian Foster
2014-02-20  0:13     ` mmap_sem -> isec->lock lockdep issues with shmem (was Re: [PATCH 2/3] xfs: fix directory inode iolock lockdep false positive) Dave Chinner
2014-02-20 14:51   ` Christoph Hellwig
2014-02-19  4:16 ` [PATCH 3/3] xfs: allocate xfs_da_args to reduce stack footprint Dave Chinner
2014-02-19 18:25   ` Brian Foster
2014-02-20 14:56   ` Christoph Hellwig
2014-02-20 21:09     ` Dave Chinner