* [PATCH 0/4] xfs fixes for Linux 3.2-rc3
@ 2011-11-28 8:17 Christoph Hellwig
2011-11-28 8:17 ` [PATCH 1/4] xfs: fix attr2 vs large data fork assert Christoph Hellwig
` (5 more replies)
0 siblings, 6 replies; 24+ messages in thread
From: Christoph Hellwig @ 2011-11-28 8:17 UTC (permalink / raw)
To: xfs
These are the patches that I think should still go into Linux 3.2.
The first two are fairly trivial fixes for crashes that can be reproduced
by users action or filesystem corruption. The third is a a fix for the
hang reported by Simon, and although he hasn't tested the final variant
yet I'm fairly confident it should fix it given we use the same pattern
for dquots. The last one is the biggest and fixes the log space hang
first reported by Alex and then debugged by Chandra.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 24+ messages in thread* [PATCH 1/4] xfs: fix attr2 vs large data fork assert 2011-11-28 8:17 [PATCH 0/4] xfs fixes for Linux 3.2-rc3 Christoph Hellwig @ 2011-11-28 8:17 ` Christoph Hellwig 2011-11-28 18:01 ` Ben Myers 2011-11-28 8:17 ` [PATCH 2/4] xfs: validate acl count Christoph Hellwig ` (4 subsequent siblings) 5 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2011-11-28 8:17 UTC (permalink / raw) To: xfs [-- Attachment #1: xfs-fix-attr-oops --] [-- Type: text/plain, Size: 3801 bytes --] With Dmitrys fsstress updates I've seen very reproducible crashes in xfs_attr_shortform_remove because xfs_attr_shortform_bytesfit claims that the attributes would not fit inline into the inode after removing an attribute. It turns out that we were operating on an inode with lots of delalloc extents, and thus an if_bytes values for the data fork that is larger than biggest possible on-disk storage for it which utterly confuses the code near the end of xfs_attr_shortform_bytesfit. Fix this by always allowing the current attribute fork, like we already do for the attr1 format, given that delalloc conversion will take care for moving either the data or attribute area out of line if it doesn't fit at that point - or making the point moot by merging extents at this point. Also document the function better, and clean up some lose bits. Signed-off-by: Christoph Hellwig <hch@lst.de> Index: xfs/fs/xfs/xfs_attr_leaf.c =================================================================== --- xfs.orig/fs/xfs/xfs_attr_leaf.c 2011-11-04 16:31:32.244656675 +0100 +++ xfs/fs/xfs/xfs_attr_leaf.c 2011-11-05 09:01:32.613995075 +0100 @@ -110,6 +110,7 @@ xfs_attr_namesp_match(int arg_flags, int /* * Query whether the requested number of additional bytes of extended * attribute space will be able to fit inline. + * * Returns zero if not, else the di_forkoff fork offset to be used in the * literal area for attribute data once the new bytes have been added. * @@ -136,11 +137,26 @@ xfs_attr_shortform_bytesfit(xfs_inode_t return (offset >= minforkoff) ? minforkoff : 0; } - if (!(mp->m_flags & XFS_MOUNT_ATTR2)) { - if (bytes <= XFS_IFORK_ASIZE(dp)) - return dp->i_d.di_forkoff; + /* + * If the requested numbers of bytes is smaller or equal to the + * current attribute fork size we can always proceed. + * + * Note that if_bytes in the data fork might actually be larger than + * the current data fork size is due to delalloc extents. In that + * case either the extent count will go down when they are converted + * to ral extents, or the delalloc conversion will take care of the + * literal area rebalancing. + */ + if (bytes <= XFS_IFORK_ASIZE(dp)) + return dp->i_d.di_forkoff; + + /* + * For attr2 we can try to move the forkoff if there is space in the + * literal area, but for the old format we are done if there is no + * space in the fixes attribute fork. + */ + if (!(mp->m_flags & XFS_MOUNT_ATTR2)) return 0; - } dsize = dp->i_df.if_bytes; @@ -157,10 +173,9 @@ xfs_attr_shortform_bytesfit(xfs_inode_t xfs_default_attroffset(dp)) dsize = XFS_BMDR_SPACE_CALC(MINDBTPTRS); break; - case XFS_DINODE_FMT_BTREE: /* - * If have data btree then keep forkoff if we have one, + * If have a data btree then keep forkoff if we have one, * otherwise we are adding a new attr, so then we set * minforkoff to where the btree root can finish so we have * plenty of room for attrs @@ -168,10 +183,9 @@ xfs_attr_shortform_bytesfit(xfs_inode_t if (dp->i_d.di_forkoff) { if (offset < dp->i_d.di_forkoff) return 0; - else - return dp->i_d.di_forkoff; - } else - dsize = XFS_BMAP_BROOT_SPACE(dp->i_df.if_broot); + return dp->i_d.di_forkoff; + } + dsize = XFS_BMAP_BROOT_SPACE(dp->i_df.if_broot); break; } @@ -186,10 +200,10 @@ xfs_attr_shortform_bytesfit(xfs_inode_t maxforkoff = XFS_LITINO(mp) - XFS_BMDR_SPACE_CALC(MINABTPTRS); maxforkoff = maxforkoff >> 3; /* rounded down */ - if (offset >= minforkoff && offset < maxforkoff) - return offset; if (offset >= maxforkoff) return maxforkoff; + if (offset >= minforkoff) + return offset; return 0; } _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/4] xfs: fix attr2 vs large data fork assert 2011-11-28 8:17 ` [PATCH 1/4] xfs: fix attr2 vs large data fork assert Christoph Hellwig @ 2011-11-28 18:01 ` Ben Myers 2011-11-28 18:03 ` Christoph Hellwig 0 siblings, 1 reply; 24+ messages in thread From: Ben Myers @ 2011-11-28 18:01 UTC (permalink / raw) To: Christoph Hellwig; +Cc: xfs Hey Christoph, On Mon, Nov 28, 2011 at 03:17:33AM -0500, Christoph Hellwig wrote: > With Dmitrys fsstress updates I've seen very reproducible crashes in > xfs_attr_shortform_remove because xfs_attr_shortform_bytesfit claims that > the attributes would not fit inline into the inode after removing an > attribute. It turns out that we were operating on an inode with lots > of delalloc extents, and thus an if_bytes values for the data fork that > is larger than biggest possible on-disk storage for it which utterly > confuses the code near the end of xfs_attr_shortform_bytesfit. > > Fix this by always allowing the current attribute fork, like we already > do for the attr1 format, given that delalloc conversion will take care > for moving either the data or attribute area out of line if it doesn't > fit at that point - or making the point moot by merging extents at this > point. > > Also document the function better, and clean up some lose bits. > > Signed-off-by: Christoph Hellwig <hch@lst.de> This is missing some cleanups that you did for v2. I'll use that version, unless you have an objection. Thanks, Ben > Index: xfs/fs/xfs/xfs_attr_leaf.c > =================================================================== > --- xfs.orig/fs/xfs/xfs_attr_leaf.c 2011-11-04 16:31:32.244656675 +0100 > +++ xfs/fs/xfs/xfs_attr_leaf.c 2011-11-05 09:01:32.613995075 +0100 > @@ -110,6 +110,7 @@ xfs_attr_namesp_match(int arg_flags, int > /* > * Query whether the requested number of additional bytes of extended > * attribute space will be able to fit inline. > + * > * Returns zero if not, else the di_forkoff fork offset to be used in the > * literal area for attribute data once the new bytes have been added. > * > @@ -136,11 +137,26 @@ xfs_attr_shortform_bytesfit(xfs_inode_t > return (offset >= minforkoff) ? minforkoff : 0; > } > > - if (!(mp->m_flags & XFS_MOUNT_ATTR2)) { > - if (bytes <= XFS_IFORK_ASIZE(dp)) > - return dp->i_d.di_forkoff; > + /* > + * If the requested numbers of bytes is smaller or equal to the > + * current attribute fork size we can always proceed. > + * > + * Note that if_bytes in the data fork might actually be larger than > + * the current data fork size is due to delalloc extents. In that > + * case either the extent count will go down when they are converted > + * to ral extents, or the delalloc conversion will take care of the > + * literal area rebalancing. > + */ > + if (bytes <= XFS_IFORK_ASIZE(dp)) > + return dp->i_d.di_forkoff; > + > + /* > + * For attr2 we can try to move the forkoff if there is space in the > + * literal area, but for the old format we are done if there is no > + * space in the fixes attribute fork. > + */ > + if (!(mp->m_flags & XFS_MOUNT_ATTR2)) > return 0; > - } > > dsize = dp->i_df.if_bytes; > > @@ -157,10 +173,9 @@ xfs_attr_shortform_bytesfit(xfs_inode_t > xfs_default_attroffset(dp)) > dsize = XFS_BMDR_SPACE_CALC(MINDBTPTRS); > break; > - > case XFS_DINODE_FMT_BTREE: > /* > - * If have data btree then keep forkoff if we have one, > + * If have a data btree then keep forkoff if we have one, > * otherwise we are adding a new attr, so then we set > * minforkoff to where the btree root can finish so we have > * plenty of room for attrs > @@ -168,10 +183,9 @@ xfs_attr_shortform_bytesfit(xfs_inode_t > if (dp->i_d.di_forkoff) { > if (offset < dp->i_d.di_forkoff) > return 0; > - else > - return dp->i_d.di_forkoff; > - } else > - dsize = XFS_BMAP_BROOT_SPACE(dp->i_df.if_broot); > + return dp->i_d.di_forkoff; > + } > + dsize = XFS_BMAP_BROOT_SPACE(dp->i_df.if_broot); > break; > } > > @@ -186,10 +200,10 @@ xfs_attr_shortform_bytesfit(xfs_inode_t > maxforkoff = XFS_LITINO(mp) - XFS_BMDR_SPACE_CALC(MINABTPTRS); > maxforkoff = maxforkoff >> 3; /* rounded down */ > > - if (offset >= minforkoff && offset < maxforkoff) > - return offset; > if (offset >= maxforkoff) > return maxforkoff; > + if (offset >= minforkoff) > + return offset; > return 0; > } > > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/4] xfs: fix attr2 vs large data fork assert 2011-11-28 18:01 ` Ben Myers @ 2011-11-28 18:03 ` Christoph Hellwig 0 siblings, 0 replies; 24+ messages in thread From: Christoph Hellwig @ 2011-11-28 18:03 UTC (permalink / raw) To: Ben Myers; +Cc: Christoph Hellwig, xfs On Mon, Nov 28, 2011 at 12:01:36PM -0600, Ben Myers wrote: > This is missing some cleanups that you did for v2. I'll use that > version, unless you have an objection. Looks like I got lost in my maze of trees. Pick whichever you want. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 2/4] xfs: validate acl count 2011-11-28 8:17 [PATCH 0/4] xfs fixes for Linux 3.2-rc3 Christoph Hellwig 2011-11-28 8:17 ` [PATCH 1/4] xfs: fix attr2 vs large data fork assert Christoph Hellwig @ 2011-11-28 8:17 ` Christoph Hellwig 2011-11-28 8:17 ` [PATCH 3/4] xfs: force buffer writeback before blocking on the ilock in inode reclaim Christoph Hellwig ` (3 subsequent siblings) 5 siblings, 0 replies; 24+ messages in thread From: Christoph Hellwig @ 2011-11-28 8:17 UTC (permalink / raw) To: xfs [-- Attachment #1: xfs-validate-acls --] [-- Type: text/plain, Size: 728 bytes --] This prevents in-memory corruption and possible panics if the on-disk ACL is badly corrupted. Signed-off-by: Christoph Hellwig <hch@lst.de> Index: xfs/fs/xfs/xfs_acl.c =================================================================== --- xfs.orig/fs/xfs/xfs_acl.c 2011-11-20 12:49:11.181244706 +0100 +++ xfs/fs/xfs/xfs_acl.c 2011-11-20 12:49:50.637697619 +0100 @@ -42,6 +42,8 @@ xfs_acl_from_disk(struct xfs_acl *aclp) int count, i; count = be32_to_cpu(aclp->acl_cnt); + if (count > XFS_ACL_MAX_ENTRIES) + return ERR_PTR(-EFSCORRUPTED); acl = posix_acl_alloc(count, GFP_KERNEL); if (!acl) _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 3/4] xfs: force buffer writeback before blocking on the ilock in inode reclaim 2011-11-28 8:17 [PATCH 0/4] xfs fixes for Linux 3.2-rc3 Christoph Hellwig 2011-11-28 8:17 ` [PATCH 1/4] xfs: fix attr2 vs large data fork assert Christoph Hellwig 2011-11-28 8:17 ` [PATCH 2/4] xfs: validate acl count Christoph Hellwig @ 2011-11-28 8:17 ` Christoph Hellwig 2011-11-28 8:17 ` [PATCH 4/4] xfs: fix the logspace waiting algorithm Christoph Hellwig ` (2 subsequent siblings) 5 siblings, 0 replies; 24+ messages in thread From: Christoph Hellwig @ 2011-11-28 8:17 UTC (permalink / raw) To: xfs [-- Attachment #1: xfs-fix-inode-reclaim-hang --] [-- Type: text/plain, Size: 2695 bytes --] If we are doing synchronous inode reclaim we block the VM from making progress in memory reclaim. So if we encouter a flush locked inode promote it in the delwri list and wake up xfsbufd to write it out now. Without this we can get hangs of up to 30 seconds during workloads hitting synchronous inode reclaim. The scheme is copied from what we do for dquot reclaims. Reported-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Christoph Hellwig <hch@lst.de> Index: xfs/fs/xfs/xfs_sync.c =================================================================== --- xfs.orig/fs/xfs/xfs_sync.c 2011-11-20 12:48:36.664765032 +0100 +++ xfs/fs/xfs/xfs_sync.c 2011-11-20 13:51:55.594184465 +0100 @@ -770,6 +770,17 @@ restart: if (!xfs_iflock_nowait(ip)) { if (!(sync_mode & SYNC_WAIT)) goto out; + + /* + * If we only have a single dirty inode in a cluster there is + * a fair chance that the AIL push may have pushed it into + * the buffer, but xfsbufd won't touch it until 30 seconds + * from now, and thus we will lock up here. + * + * Promote the inode buffer to the front of the delwri list + * and wake up xfsbufd now. + */ + xfs_promote_inode(ip); xfs_iflock(ip); } Index: xfs/fs/xfs/xfs_inode.c =================================================================== --- xfs.orig/fs/xfs/xfs_inode.c 2011-11-20 13:50:51.457865253 +0100 +++ xfs/fs/xfs/xfs_inode.c 2011-11-20 13:52:30.420662460 +0100 @@ -2835,6 +2835,27 @@ corrupt_out: return XFS_ERROR(EFSCORRUPTED); } +void +xfs_promote_inode( + struct xfs_inode *ip) +{ + struct xfs_buf *bp; + + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED)); + + bp = xfs_incore(ip->i_mount->m_ddev_targp, ip->i_imap.im_blkno, + ip->i_imap.im_len, XBF_TRYLOCK); + if (!bp) + return; + + if (XFS_BUF_ISDELAYWRITE(bp)) { + xfs_buf_delwri_promote(bp); + wake_up_process(ip->i_mount->m_ddev_targp->bt_task); + } + + xfs_buf_relse(bp); +} + /* * Return a pointer to the extent record at file index idx. */ Index: xfs/fs/xfs/xfs_inode.h =================================================================== --- xfs.orig/fs/xfs/xfs_inode.h 2011-11-20 13:50:51.487865091 +0100 +++ xfs/fs/xfs/xfs_inode.h 2011-11-20 13:51:39.224273148 +0100 @@ -498,6 +498,7 @@ int xfs_iunlink(struct xfs_trans *, xfs void xfs_iext_realloc(xfs_inode_t *, int, int); void xfs_iunpin_wait(xfs_inode_t *); int xfs_iflush(xfs_inode_t *, uint); +void xfs_promote_inode(struct xfs_inode *); void xfs_lock_inodes(xfs_inode_t **, int, uint); void xfs_lock_two_inodes(xfs_inode_t *, xfs_inode_t *, uint); _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 4/4] xfs: fix the logspace waiting algorithm 2011-11-28 8:17 [PATCH 0/4] xfs fixes for Linux 3.2-rc3 Christoph Hellwig ` (2 preceding siblings ...) 2011-11-28 8:17 ` [PATCH 3/4] xfs: force buffer writeback before blocking on the ilock in inode reclaim Christoph Hellwig @ 2011-11-28 8:17 ` Christoph Hellwig 2011-11-30 23:56 ` Ben Myers ` (2 more replies) 2011-11-29 19:23 ` [PATCH 0/4] xfs fixes for Linux 3.2-rc3 Ben Myers 2011-11-30 8:58 ` [PATCH 5/4] xfs: fix nfs export of 64-bit inodes numbers on 32-bit kernels Christoph Hellwig 5 siblings, 3 replies; 24+ messages in thread From: Christoph Hellwig @ 2011-11-28 8:17 UTC (permalink / raw) To: xfs [-- Attachment #1: xfs-fix-log-race --] [-- Type: text/plain, Size: 15031 bytes --] Apply the scheme used in log_regrant_write_log_space to wake up any other threads waiting for log space before the newly added one to log_regrant_write_log_space as well, and factor the code into readable helpers. For each of the queues we have add two helpers: - one to try to wake up all waiting threads. This helper will also be usable by xfs_log_move_tail once we remove the current opportunistic wakeups in it. - one to sleep on t_wait until enough log space is available, loosely modelled after Linux waitqueues. And use them to reimplement the guts of log_regrant_write_log_space and log_regrant_write_log_space. These two function now use one and the same algorithm for waiting on log space instead of subtly different ones before, with an option to completely unify them in the near future. Also move the filesystem shutdown handling to the common caller given that we had to touch it anyway. Based on hard debugging and an earlier patch from Chandra Seetharaman <sekharan@us.ibm.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_log.c | 348 ++++++++++++++++++++++++++--------------------------- fs/xfs/xfs_trace.h | 12 - 2 files changed, 177 insertions(+), 183 deletions(-) Index: xfs/fs/xfs/xfs_log.c =================================================================== --- xfs.orig/fs/xfs/xfs_log.c 2011-11-20 16:29:49.356194023 +0100 +++ xfs/fs/xfs/xfs_log.c 2011-11-20 16:46:33.440754431 +0100 @@ -150,6 +150,117 @@ xlog_grant_add_space( } while (head_val != old); } +STATIC bool +xlog_reserveq_wake( + struct log *log, + int *free_bytes) +{ + struct xlog_ticket *tic; + int need_bytes; + + list_for_each_entry(tic, &log->l_reserveq, t_queue) { + if (tic->t_flags & XLOG_TIC_PERM_RESERV) + need_bytes = tic->t_unit_res * tic->t_cnt; + else + need_bytes = tic->t_unit_res; + + if (*free_bytes < need_bytes) + return false; + *free_bytes -= need_bytes; + + trace_xfs_log_grant_wake_up(log, tic); + wake_up(&tic->t_wait); + } + + return true; +} + +STATIC bool +xlog_writeq_wake( + struct log *log, + int *free_bytes) +{ + struct xlog_ticket *tic; + int need_bytes; + + list_for_each_entry(tic, &log->l_writeq, t_queue) { + ASSERT(tic->t_flags & XLOG_TIC_PERM_RESERV); + + need_bytes = tic->t_unit_res; + + if (*free_bytes < need_bytes) + return false; + *free_bytes -= need_bytes; + + trace_xfs_log_regrant_write_wake_up(log, tic); + wake_up(&tic->t_wait); + } + + return true; +} + +STATIC int +xlog_reserveq_wait( + struct log *log, + struct xlog_ticket *tic, + int need_bytes) +{ + list_add_tail(&tic->t_queue, &log->l_reserveq); + + do { + if (XLOG_FORCED_SHUTDOWN(log)) + goto shutdown; + xlog_grant_push_ail(log, need_bytes); + + XFS_STATS_INC(xs_sleep_logspace); + trace_xfs_log_grant_sleep(log, tic); + + xlog_wait(&tic->t_wait, &log->l_grant_reserve_lock); + trace_xfs_log_grant_wake(log, tic); + + spin_lock(&log->l_grant_reserve_lock); + if (XLOG_FORCED_SHUTDOWN(log)) + goto shutdown; + } while (xlog_space_left(log, &log->l_grant_reserve_head) < need_bytes); + + list_del_init(&tic->t_queue); + return 0; +shutdown: + list_del_init(&tic->t_queue); + return XFS_ERROR(EIO); +} + +STATIC int +xlog_writeq_wait( + struct log *log, + struct xlog_ticket *tic, + int need_bytes) +{ + list_add_tail(&tic->t_queue, &log->l_writeq); + + do { + if (XLOG_FORCED_SHUTDOWN(log)) + goto shutdown; + xlog_grant_push_ail(log, need_bytes); + + XFS_STATS_INC(xs_sleep_logspace); + trace_xfs_log_regrant_write_sleep(log, tic); + + xlog_wait(&tic->t_wait, &log->l_grant_write_lock); + trace_xfs_log_regrant_write_wake(log, tic); + + spin_lock(&log->l_grant_write_lock); + if (XLOG_FORCED_SHUTDOWN(log)) + goto shutdown; + } while (xlog_space_left(log, &log->l_grant_write_head) < need_bytes); + + list_del_init(&tic->t_queue); + return 0; +shutdown: + list_del_init(&tic->t_queue); + return XFS_ERROR(EIO); +} + static void xlog_tic_reset_res(xlog_ticket_t *tic) { @@ -350,8 +461,19 @@ xfs_log_reserve( retval = xlog_grant_log_space(log, internal_ticket); } + if (unlikely(retval)) { + /* + * If we are failing, make sure the ticket doesn't have any + * current reservations. We don't want to add this back + * when the ticket/ transaction gets cancelled. + */ + internal_ticket->t_curr_res = 0; + /* ungrant will give back unit_res * t_cnt. */ + internal_ticket->t_cnt = 0; + } + return retval; -} /* xfs_log_reserve */ +} /* @@ -2481,8 +2603,8 @@ restart: /* * Atomically get the log space required for a log ticket. * - * Once a ticket gets put onto the reserveq, it will only return after - * the needed reservation is satisfied. + * Once a ticket gets put onto the reserveq, it will only return after the + * needed reservation is satisfied. * * This function is structured so that it has a lock free fast path. This is * necessary because every new transaction reservation will come through this @@ -2490,113 +2612,53 @@ restart: * every pass. * * As tickets are only ever moved on and off the reserveq under the - * l_grant_reserve_lock, we only need to take that lock if we are going - * to add the ticket to the queue and sleep. We can avoid taking the lock if the - * ticket was never added to the reserveq because the t_queue list head will be - * empty and we hold the only reference to it so it can safely be checked - * unlocked. + * l_grant_reserve_lock, we only need to take that lock if we are going to add + * the ticket to the queue and sleep. We can avoid taking the lock if the ticket + * was never added to the reserveq because the t_queue list head will be empty + * and we hold the only reference to it so it can safely be checked unlocked. */ STATIC int -xlog_grant_log_space(xlog_t *log, - xlog_ticket_t *tic) +xlog_grant_log_space( + struct log *log, + struct xlog_ticket *tic) { - int free_bytes; - int need_bytes; + int free_bytes, need_bytes; + int error = 0; -#ifdef DEBUG - if (log->l_flags & XLOG_ACTIVE_RECOVERY) - panic("grant Recovery problem"); -#endif + ASSERT(!(log->l_flags & XLOG_ACTIVE_RECOVERY)); trace_xfs_log_grant_enter(log, tic); + /* + * If there are other waiters on the queue then give them a chance at + * logspace before us. Wake up the first waiters, if we do not wake + * up all the waiters then go to sleep waiting for more free space, + * otherwise try to get some space for this transaction. + */ need_bytes = tic->t_unit_res; if (tic->t_flags & XFS_LOG_PERM_RESERV) need_bytes *= tic->t_ocnt; - - /* something is already sleeping; insert new transaction at end */ - if (!list_empty_careful(&log->l_reserveq)) { - spin_lock(&log->l_grant_reserve_lock); - /* recheck the queue now we are locked */ - if (list_empty(&log->l_reserveq)) { - spin_unlock(&log->l_grant_reserve_lock); - goto redo; - } - list_add_tail(&tic->t_queue, &log->l_reserveq); - - trace_xfs_log_grant_sleep1(log, tic); - - /* - * Gotta check this before going to sleep, while we're - * holding the grant lock. - */ - if (XLOG_FORCED_SHUTDOWN(log)) - goto error_return; - - XFS_STATS_INC(xs_sleep_logspace); - xlog_wait(&tic->t_wait, &log->l_grant_reserve_lock); - - /* - * If we got an error, and the filesystem is shutting down, - * we'll catch it down below. So just continue... - */ - trace_xfs_log_grant_wake1(log, tic); - } - -redo: - if (XLOG_FORCED_SHUTDOWN(log)) - goto error_return_unlocked; - free_bytes = xlog_space_left(log, &log->l_grant_reserve_head); - if (free_bytes < need_bytes) { + if (!list_empty_careful(&log->l_reserveq)) { spin_lock(&log->l_grant_reserve_lock); - if (list_empty(&tic->t_queue)) - list_add_tail(&tic->t_queue, &log->l_reserveq); - - trace_xfs_log_grant_sleep2(log, tic); - - if (XLOG_FORCED_SHUTDOWN(log)) - goto error_return; - - xlog_grant_push_ail(log, need_bytes); - - XFS_STATS_INC(xs_sleep_logspace); - xlog_wait(&tic->t_wait, &log->l_grant_reserve_lock); - - trace_xfs_log_grant_wake2(log, tic); - goto redo; - } - - if (!list_empty(&tic->t_queue)) { + if (!xlog_reserveq_wake(log, &free_bytes) || + free_bytes < need_bytes) + error = xlog_reserveq_wait(log, tic, need_bytes); + spin_unlock(&log->l_grant_reserve_lock); + } else if (free_bytes < need_bytes) { spin_lock(&log->l_grant_reserve_lock); - list_del_init(&tic->t_queue); + error = xlog_reserveq_wait(log, tic, need_bytes); spin_unlock(&log->l_grant_reserve_lock); } + if (error) + return error; - /* we've got enough space */ xlog_grant_add_space(log, &log->l_grant_reserve_head, need_bytes); xlog_grant_add_space(log, &log->l_grant_write_head, need_bytes); trace_xfs_log_grant_exit(log, tic); xlog_verify_grant_tail(log); return 0; - -error_return_unlocked: - spin_lock(&log->l_grant_reserve_lock); -error_return: - list_del_init(&tic->t_queue); - spin_unlock(&log->l_grant_reserve_lock); - trace_xfs_log_grant_error(log, tic); - - /* - * If we are failing, make sure the ticket doesn't have any - * current reservations. We don't want to add this back when - * the ticket/transaction gets cancelled. - */ - tic->t_curr_res = 0; - tic->t_cnt = 0; /* ungrant will give back unit_res * t_cnt. */ - return XFS_ERROR(EIO); -} /* xlog_grant_log_space */ - +} /* * Replenish the byte reservation required by moving the grant write head. @@ -2605,10 +2667,12 @@ error_return: * free fast path. */ STATIC int -xlog_regrant_write_log_space(xlog_t *log, - xlog_ticket_t *tic) +xlog_regrant_write_log_space( + struct log *log, + struct xlog_ticket *tic) { - int free_bytes, need_bytes; + int free_bytes, need_bytes; + int error = 0; tic->t_curr_res = tic->t_unit_res; xlog_tic_reset_res(tic); @@ -2616,104 +2680,38 @@ xlog_regrant_write_log_space(xlog_t * if (tic->t_cnt > 0) return 0; -#ifdef DEBUG - if (log->l_flags & XLOG_ACTIVE_RECOVERY) - panic("regrant Recovery problem"); -#endif + ASSERT(!(log->l_flags & XLOG_ACTIVE_RECOVERY)); trace_xfs_log_regrant_write_enter(log, tic); - if (XLOG_FORCED_SHUTDOWN(log)) - goto error_return_unlocked; - /* If there are other waiters on the queue then give them a - * chance at logspace before us. Wake up the first waiters, - * if we do not wake up all the waiters then go to sleep waiting - * for more free space, otherwise try to get some space for - * this transaction. + /* + * If there are other waiters on the queue then give them a chance at + * logspace before us. Wake up the first waiters, if we do not wake + * up all the waiters then go to sleep waiting for more free space, + * otherwise try to get some space for this transaction. */ need_bytes = tic->t_unit_res; - if (!list_empty_careful(&log->l_writeq)) { - struct xlog_ticket *ntic; - - spin_lock(&log->l_grant_write_lock); - free_bytes = xlog_space_left(log, &log->l_grant_write_head); - list_for_each_entry(ntic, &log->l_writeq, t_queue) { - ASSERT(ntic->t_flags & XLOG_TIC_PERM_RESERV); - - if (free_bytes < ntic->t_unit_res) - break; - free_bytes -= ntic->t_unit_res; - wake_up(&ntic->t_wait); - } - - if (ntic != list_first_entry(&log->l_writeq, - struct xlog_ticket, t_queue)) { - if (list_empty(&tic->t_queue)) - list_add_tail(&tic->t_queue, &log->l_writeq); - trace_xfs_log_regrant_write_sleep1(log, tic); - - xlog_grant_push_ail(log, need_bytes); - - XFS_STATS_INC(xs_sleep_logspace); - xlog_wait(&tic->t_wait, &log->l_grant_write_lock); - trace_xfs_log_regrant_write_wake1(log, tic); - } else - spin_unlock(&log->l_grant_write_lock); - } - -redo: - if (XLOG_FORCED_SHUTDOWN(log)) - goto error_return_unlocked; - free_bytes = xlog_space_left(log, &log->l_grant_write_head); - if (free_bytes < need_bytes) { + if (!list_empty_careful(&log->l_writeq)) { spin_lock(&log->l_grant_write_lock); - if (list_empty(&tic->t_queue)) - list_add_tail(&tic->t_queue, &log->l_writeq); - - if (XLOG_FORCED_SHUTDOWN(log)) - goto error_return; - - xlog_grant_push_ail(log, need_bytes); - - XFS_STATS_INC(xs_sleep_logspace); - trace_xfs_log_regrant_write_sleep2(log, tic); - xlog_wait(&tic->t_wait, &log->l_grant_write_lock); - - trace_xfs_log_regrant_write_wake2(log, tic); - goto redo; - } - - if (!list_empty(&tic->t_queue)) { + if (!xlog_writeq_wake(log, &free_bytes) || + free_bytes < need_bytes) + error = xlog_writeq_wait(log, tic, need_bytes); + spin_unlock(&log->l_grant_write_lock); + } else if (free_bytes < need_bytes) { spin_lock(&log->l_grant_write_lock); - list_del_init(&tic->t_queue); + error = xlog_writeq_wait(log, tic, need_bytes); spin_unlock(&log->l_grant_write_lock); } - /* we've got enough space */ + if (error) + return error; + xlog_grant_add_space(log, &log->l_grant_write_head, need_bytes); trace_xfs_log_regrant_write_exit(log, tic); xlog_verify_grant_tail(log); return 0; - - - error_return_unlocked: - spin_lock(&log->l_grant_write_lock); - error_return: - list_del_init(&tic->t_queue); - spin_unlock(&log->l_grant_write_lock); - trace_xfs_log_regrant_write_error(log, tic); - - /* - * If we are failing, make sure the ticket doesn't have any - * current reservations. We don't want to add this back when - * the ticket/transaction gets cancelled. - */ - tic->t_curr_res = 0; - tic->t_cnt = 0; /* ungrant will give back unit_res * t_cnt. */ - return XFS_ERROR(EIO); -} /* xlog_regrant_write_log_space */ - +} /* The first cnt-1 times through here we don't need to * move the grant write head because the permanent Index: xfs/fs/xfs/xfs_trace.h =================================================================== --- xfs.orig/fs/xfs/xfs_trace.h 2011-11-20 16:29:49.362860654 +0100 +++ xfs/fs/xfs/xfs_trace.h 2011-11-20 16:34:23.954706395 +0100 @@ -834,18 +834,14 @@ DEFINE_LOGGRANT_EVENT(xfs_log_umount_wri DEFINE_LOGGRANT_EVENT(xfs_log_grant_enter); DEFINE_LOGGRANT_EVENT(xfs_log_grant_exit); DEFINE_LOGGRANT_EVENT(xfs_log_grant_error); -DEFINE_LOGGRANT_EVENT(xfs_log_grant_sleep1); -DEFINE_LOGGRANT_EVENT(xfs_log_grant_wake1); -DEFINE_LOGGRANT_EVENT(xfs_log_grant_sleep2); -DEFINE_LOGGRANT_EVENT(xfs_log_grant_wake2); +DEFINE_LOGGRANT_EVENT(xfs_log_grant_sleep); +DEFINE_LOGGRANT_EVENT(xfs_log_grant_wake); DEFINE_LOGGRANT_EVENT(xfs_log_grant_wake_up); DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_enter); DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_exit); DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_error); -DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_sleep1); -DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_wake1); -DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_sleep2); -DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_wake2); +DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_sleep); +DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_wake); DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_wake_up); DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_enter); DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_exit); _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] xfs: fix the logspace waiting algorithm 2011-11-28 8:17 ` [PATCH 4/4] xfs: fix the logspace waiting algorithm Christoph Hellwig @ 2011-11-30 23:56 ` Ben Myers 2011-12-01 7:21 ` Christoph Hellwig 2011-12-01 18:32 ` Ben Myers 2011-12-01 19:28 ` Chandra Seetharaman 2 siblings, 1 reply; 24+ messages in thread From: Ben Myers @ 2011-11-30 23:56 UTC (permalink / raw) To: Christoph Hellwig; +Cc: xfs Christoph, I feel that the ordering of the accesses to l_grant_head and the writeq list may be important in the lockless path, and notice that they used to be in opposite order. It could use a comment explaining why (if) it is ok. -Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] xfs: fix the logspace waiting algorithm 2011-11-30 23:56 ` Ben Myers @ 2011-12-01 7:21 ` Christoph Hellwig 2011-12-01 19:51 ` Ben Myers 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2011-12-01 7:21 UTC (permalink / raw) To: Ben Myers; +Cc: Christoph Hellwig, xfs On Wed, Nov 30, 2011 at 05:56:41PM -0600, Ben Myers wrote: > Christoph, > > I feel that the ordering of the accesses to l_grant_head and the writeq > list may be important in the lockless path, and notice that they used to > be in opposite order. It could use a comment explaining why (if) it is > ok. Do you mean moving the xlog_space_left after the first list_empty_careful check in xlog_grant_log_space? It doesn't matter given that we re-check the available space after each wakeup. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] xfs: fix the logspace waiting algorithm 2011-12-01 7:21 ` Christoph Hellwig @ 2011-12-01 19:51 ` Ben Myers 2011-12-02 11:51 ` Christoph Hellwig 0 siblings, 1 reply; 24+ messages in thread From: Ben Myers @ 2011-12-01 19:51 UTC (permalink / raw) To: Christoph Hellwig; +Cc: xfs On Thu, Dec 01, 2011 at 02:21:49AM -0500, Christoph Hellwig wrote: > On Wed, Nov 30, 2011 at 05:56:41PM -0600, Ben Myers wrote: > > Christoph, > > > > I feel that the ordering of the accesses to l_grant_head and the writeq > > list may be important in the lockless path, and notice that they used to > > be in opposite order. It could use a comment explaining why (if) it is > > ok. > > Do you mean moving the xlog_space_left after the first > list_empty_careful check in xlog_grant_log_space? That's what I mean, but I'm not sure I was correct. > It doesn't matter > given that we re-check the available space after each wakeup. 2620 STATIC int 2621 xlog_grant_log_space( 2622 struct log *log, 2623 struct xlog_ticket *tic) 2624 { 2625 int free_bytes, need_bytes; 2626 int error = 0; 2627 2628 ASSERT(!(log->l_flags & XLOG_ACTIVE_RECOVERY)); 2629 2630 trace_xfs_log_grant_enter(log, tic); 2631 2632 /* 2633 * If there are other waiters on the queue then give them a chance at 2634 * logspace before us. Wake up the first waiters, if we do not wake 2635 * up all the waiters then go to sleep waiting for more free space, 2636 * otherwise try to get some space for this transaction. 2637 */ 2638 need_bytes = tic->t_unit_res; 2639 if (tic->t_flags & XFS_LOG_PERM_RESERV) 2640 need_bytes *= tic->t_ocnt; 2641 free_bytes = xlog_space_left(log, &log->l_grant_reserve_head); 2642 if (!list_empty_careful(&log->l_reserveq)) { 2643 spin_lock(&log->l_grant_reserve_lock); 2644 if (!xlog_reserveq_wake(log, &free_bytes) || 2645 free_bytes < need_bytes) 2646 error = xlog_reserveq_wait(log, tic, need_bytes); 2647 spin_unlock(&log->l_grant_reserve_lock); 2648 } else if (free_bytes < need_bytes) { 2649 spin_lock(&log->l_grant_reserve_lock); 2650 error = xlog_reserveq_wait(log, tic, need_bytes); 2651 spin_unlock(&log->l_grant_reserve_lock); 2652 } 2653 if (error) 2654 return error; 2655 2656 xlog_grant_add_space(log, &log->l_grant_reserve_head, need_bytes); 2657 xlog_grant_add_space(log, &log->l_grant_write_head, need_bytes); 2658 trace_xfs_log_grant_exit(log, tic); 2659 xlog_verify_grant_tail(log); 2660 return 0; 2661 } Process A reads from the grant reserve head at 2641 (and there currently is enough space) Process B wakes at either 2646 or 2650, in xlog_reserveq_wait, locks, and reads from the grant reserve head (and currently there is enough space) Process B removes itself from the list Process A reads from the reservq list and finds it to be empty Process A finds that there was enough space at 2646 Process B returns from xlog_reserveq_wait, unlocks, grants space at 2656, returns Process A grants log space at 2656, and returns AFAICS there is nothing that prevents these guys from granting the same space when you approach free_bytes >= need_bytes concurrently. This lockless stuff is always a mind job for me. I'll take another look at some of the other aspects of the patch. Even if it doesn't resolve my question about the lockless issue, it seems to resolve Chandra's race. Thanks, Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] xfs: fix the logspace waiting algorithm 2011-12-01 19:51 ` Ben Myers @ 2011-12-02 11:51 ` Christoph Hellwig 2011-12-05 2:53 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2011-12-02 11:51 UTC (permalink / raw) To: Ben Myers; +Cc: Dave Chinner, xfs On Thu, Dec 01, 2011 at 01:51:28PM -0600, Ben Myers wrote: > Process A reads from the grant reserve head at 2641 (and there currently is enough space) > Process B wakes at either 2646 or 2650, in xlog_reserveq_wait, locks, and reads from the grant reserve head (and currently there is enough space) > Process B removes itself from the list > Process A reads from the reservq list and finds it to be empty > Process A finds that there was enough space at 2646 > Process B returns from xlog_reserveq_wait, unlocks, grants space at 2656, returns > Process A grants log space at 2656, and returns > > AFAICS there is nothing that prevents these guys from granting the same > space when you approach free_bytes >= need_bytes concurrently. > > This lockless stuff is always a mind job for me. I'll take another look at > some of the other aspects of the patch. Even if it doesn't resolve my > question about the lockless issue, it seems to resolve Chandra's race. Indeed, I think we have this race. Then again I I think we had exactly the same one before, too. The only way to fix it would be to do a sort of double cmpxchg that only moves the grant head forward if it's still in available space vs the tails lsn. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] xfs: fix the logspace waiting algorithm 2011-12-02 11:51 ` Christoph Hellwig @ 2011-12-05 2:53 ` Dave Chinner 2011-12-05 9:05 ` Christoph Hellwig 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2011-12-05 2:53 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Ben Myers, xfs, Dave Chinner On Fri, Dec 02, 2011 at 06:51:41AM -0500, Christoph Hellwig wrote: > On Thu, Dec 01, 2011 at 01:51:28PM -0600, Ben Myers wrote: > > Process A reads from the grant reserve head at 2641 (and there currently is enough space) > > Process B wakes at either 2646 or 2650, in xlog_reserveq_wait, locks, and reads from the grant reserve head (and currently there is enough space) > > Process B removes itself from the list > > Process A reads from the reservq list and finds it to be empty > > Process A finds that there was enough space at 2646 > > Process B returns from xlog_reserveq_wait, unlocks, grants space at 2656, returns > > Process A grants log space at 2656, and returns > > > > AFAICS there is nothing that prevents these guys from granting the same > > space when you approach free_bytes >= need_bytes concurrently. > > > > This lockless stuff is always a mind job for me. I'll take another look at > > some of the other aspects of the patch. Even if it doesn't resolve my > > question about the lockless issue, it seems to resolve Chandra's race. > > Indeed, I think we have this race. Then again I I think we had > exactly the same one before, too. What the current code does is this (ignoring this patch): again: check free space if (no free space) { add to list; sleep; space becomes available wake up goto again; } remove from list grant space do { grant calc } while (head does not move) The problem being talked about, as I understand it, is that free space can change between the check and the grant, resulting in process B also seeing free space because the space process A thinks it has but has not yet accounted for it. It seems to me that this can happen regardless of whether we have a queue of waiters or not, so the change to how the waiters are woken in Christoph's patch is irrelevant. > The only way to fix it would be > to do a sort of double cmpxchg that only moves the grant head forward > if it's still in available space vs the tails lsn. Actually, it's quite simple to fix simply by having the free space check return the grant head value at the time of the check and then using that as the old value in the cmpxchg loop for updating the grant head. That is: again: free = xlog_space_left(..... &old_head); ...... if (xlog_grant_add_space(log, l_grant_reserve_head, need_bytes, old_head) goto again; and xlog_grant_log_space() returns 1 if the old_head does not match the current head so we go araound again. I don't think we don't need to do this for the write head update in xlog_grant_log_space() because once we have been granted reserve head space we are guarantee to have write head space. We do, however, need to make the same change for the write head in xlog_regrant_write_log_space() as that has potential for the same race when two permanent transactions commit and race to re-reserve the write space that was just consumed by the committed transaction. And to retain existing behaviour for other callers, perhaps an old_head value of -1 (NULL LSN value) can be passed to indicate that current head value should be used and to loop until the change is made.... Does that make sense? It retains the same lockless behaviour, but closes the race condition of the grant head changing from the time of read to the time of actual modification.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] xfs: fix the logspace waiting algorithm 2011-12-05 2:53 ` Dave Chinner @ 2011-12-05 9:05 ` Christoph Hellwig 0 siblings, 0 replies; 24+ messages in thread From: Christoph Hellwig @ 2011-12-05 9:05 UTC (permalink / raw) To: Dave Chinner; +Cc: Christoph Hellwig, Ben Myers, Dave Chinner, xfs On Mon, Dec 05, 2011 at 01:53:16PM +1100, Dave Chinner wrote: > Actually, it's quite simple to fix simply by having the free space > check return the grant head value at the time of the check and then > using that as the old value in the cmpxchg loop for updating the > grant head. That is: I;ll give it a go for 3.3. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] xfs: fix the logspace waiting algorithm 2011-11-28 8:17 ` [PATCH 4/4] xfs: fix the logspace waiting algorithm Christoph Hellwig 2011-11-30 23:56 ` Ben Myers @ 2011-12-01 18:32 ` Ben Myers 2011-12-01 20:43 ` Chandra Seetharaman 2011-12-01 19:28 ` Chandra Seetharaman 2 siblings, 1 reply; 24+ messages in thread From: Ben Myers @ 2011-12-01 18:32 UTC (permalink / raw) To: Christoph Hellwig, Chandra Seetharaman; +Cc: xfs Hi, On Mon, Nov 28, 2011 at 03:17:36AM -0500, Christoph Hellwig wrote: > Apply the scheme used in log_regrant_write_log_space to wake up any other > threads waiting for log space before the newly added one to > log_regrant_write_log_space as well, and factor the code into readable > helpers. For each of the queues we have add two helpers: > > - one to try to wake up all waiting threads. This helper will also be > usable by xfs_log_move_tail once we remove the current opportunistic > wakeups in it. > - one to sleep on t_wait until enough log space is available, loosely > modelled after Linux waitqueues. > > And use them to reimplement the guts of log_regrant_write_log_space and > log_regrant_write_log_space. These two function now use one and the same > algorithm for waiting on log space instead of subtly different ones before, > with an option to completely unify them in the near future. > > Also move the filesystem shutdown handling to the common caller given > that we had to touch it anyway. > > Based on hard debugging and an earlier patch from > Chandra Seetharaman <sekharan@us.ibm.com>. > > Signed-off-by: Christoph Hellwig <hch@lst.de> I'd like to make sure that I understand the race that Chandra debugged and reported. 2499 STATIC int 2500 xlog_grant_log_space(xlog_t *log, 2501 xlog_ticket_t *tic) 2502 { 2503 int free_bytes; 2504 int need_bytes; ... 2517 /* something is already sleeping; insert new transaction at end */ 2518 if (!list_empty_careful(&log->l_reserveq)) { 2519 spin_lock(&log->l_grant_reserve_lock); 2520 /* recheck the queue now we are locked */ 2521 if (list_empty(&log->l_reserveq)) { 2522 spin_unlock(&log->l_grant_reserve_lock); 2523 goto redo; 2524 } 2525 list_add_tail(&tic->t_queue, &log->l_reserveq); 2526 2527 trace_xfs_log_grant_sleep1(log, tic); 2528 2529 /* 2530 * Gotta check this before going to sleep, while we're 2531 * holding the grant lock. 2532 */ 2533 if (XLOG_FORCED_SHUTDOWN(log)) 2534 goto error_return; 2535 2536 XFS_STATS_INC(xs_sleep_logspace); 2537 xlog_wait(&tic->t_wait, &log->l_grant_reserve_lock); 2538 2539 /* 2540 * If we got an error, and the filesystem is shutting down, 2541 * we'll catch it down below. So just continue... 2542 */ 2543 trace_xfs_log_grant_wake1(log, tic); 2544 } 2545 2546 redo: 2547 if (XLOG_FORCED_SHUTDOWN(log)) 2548 goto error_return_unlocked; 2549 2550 free_bytes = xlog_space_left(log, &log->l_grant_reserve_head); 2551 if (free_bytes < need_bytes) { 2552 spin_lock(&log->l_grant_reserve_lock); 2553 if (list_empty(&tic->t_queue)) 2554 list_add_tail(&tic->t_queue, &log->l_reserveq); 2555 2556 trace_xfs_log_grant_sleep2(log, tic); 2557 2558 if (XLOG_FORCED_SHUTDOWN(log)) 2559 goto error_return; 2560 2561 xlog_grant_push_ail(log, need_bytes); 2562 2563 XFS_STATS_INC(xs_sleep_logspace); 2564 xlog_wait(&tic->t_wait, &log->l_grant_reserve_lock); 2565 2566 trace_xfs_log_grant_wake2(log, tic); 2567 goto redo; 2568 } 2569 2570 if (!list_empty(&tic->t_queue)) { 2571 spin_lock(&log->l_grant_reserve_lock); 2572 list_del_init(&tic->t_queue); 2573 spin_unlock(&log->l_grant_reserve_lock); 2574 } So the race that we're looking at here is: process A was added to the reserve queue at either 2525 or 2554 and, pushes the AIL at 2561, xfsaild frees up enough log space for process A (possibly B?), eventually xfs_log_move_tail is called to wake process A, process A wakes at line 2564, and he is on the reserveq already, process B sees that there are tickets on the queue at 2518 and gets the grant reserve lock at 2519, process A spins at 2571 waiting for the grant reserve lock, process B adds itself to the queue at 2525, process B drops the grant reserve lock and goes to sleep at 2537 process A takes the grant reserve lock at 2571 and removes it's ticket from the list. ...and there is nothing to wake process B until the ail is pushed by some other process. Is that about right? Thanks, Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] xfs: fix the logspace waiting algorithm 2011-12-01 18:32 ` Ben Myers @ 2011-12-01 20:43 ` Chandra Seetharaman 0 siblings, 0 replies; 24+ messages in thread From: Chandra Seetharaman @ 2011-12-01 20:43 UTC (permalink / raw) To: Ben Myers; +Cc: Christoph Hellwig, xfs Ben, Yes, that is the race I found thru debugging. Chandra On Thu, 2011-12-01 at 12:32 -0600, Ben Myers wrote: > Hi, > > On Mon, Nov 28, 2011 at 03:17:36AM -0500, Christoph Hellwig wrote: > > Apply the scheme used in log_regrant_write_log_space to wake up any other > > threads waiting for log space before the newly added one to > > log_regrant_write_log_space as well, and factor the code into readable > > helpers. For each of the queues we have add two helpers: > > > > - one to try to wake up all waiting threads. This helper will also be > > usable by xfs_log_move_tail once we remove the current opportunistic > > wakeups in it. > > - one to sleep on t_wait until enough log space is available, loosely > > modelled after Linux waitqueues. > > > > And use them to reimplement the guts of log_regrant_write_log_space and > > log_regrant_write_log_space. These two function now use one and the same > > algorithm for waiting on log space instead of subtly different ones before, > > with an option to completely unify them in the near future. > > > > Also move the filesystem shutdown handling to the common caller given > > that we had to touch it anyway. > > > > Based on hard debugging and an earlier patch from > > Chandra Seetharaman <sekharan@us.ibm.com>. > > > > Signed-off-by: Christoph Hellwig <hch@lst.de> > > I'd like to make sure that I understand the race that Chandra > debugged and reported. > > 2499 STATIC int > 2500 xlog_grant_log_space(xlog_t *log, > 2501 xlog_ticket_t *tic) > 2502 { > 2503 int free_bytes; > 2504 int need_bytes; > ... > 2517 /* something is already sleeping; insert new transaction at end */ > 2518 if (!list_empty_careful(&log->l_reserveq)) { > 2519 spin_lock(&log->l_grant_reserve_lock); > 2520 /* recheck the queue now we are locked */ > 2521 if (list_empty(&log->l_reserveq)) { > 2522 spin_unlock(&log->l_grant_reserve_lock); > 2523 goto redo; > 2524 } > 2525 list_add_tail(&tic->t_queue, &log->l_reserveq); > 2526 > 2527 trace_xfs_log_grant_sleep1(log, tic); > 2528 > 2529 /* > 2530 * Gotta check this before going to sleep, while we're > 2531 * holding the grant lock. > 2532 */ > 2533 if (XLOG_FORCED_SHUTDOWN(log)) > 2534 goto error_return; > 2535 > 2536 XFS_STATS_INC(xs_sleep_logspace); > 2537 xlog_wait(&tic->t_wait, &log->l_grant_reserve_lock); > 2538 > 2539 /* > 2540 * If we got an error, and the filesystem is shutting down, > 2541 * we'll catch it down below. So just continue... > 2542 */ > 2543 trace_xfs_log_grant_wake1(log, tic); > 2544 } > 2545 > 2546 redo: > 2547 if (XLOG_FORCED_SHUTDOWN(log)) > 2548 goto error_return_unlocked; > 2549 > 2550 free_bytes = xlog_space_left(log, &log->l_grant_reserve_head); > 2551 if (free_bytes < need_bytes) { > 2552 spin_lock(&log->l_grant_reserve_lock); > 2553 if (list_empty(&tic->t_queue)) > 2554 list_add_tail(&tic->t_queue, &log->l_reserveq); > 2555 > 2556 trace_xfs_log_grant_sleep2(log, tic); > 2557 > 2558 if (XLOG_FORCED_SHUTDOWN(log)) > 2559 goto error_return; > 2560 > 2561 xlog_grant_push_ail(log, need_bytes); > 2562 > 2563 XFS_STATS_INC(xs_sleep_logspace); > 2564 xlog_wait(&tic->t_wait, &log->l_grant_reserve_lock); > 2565 > 2566 trace_xfs_log_grant_wake2(log, tic); > 2567 goto redo; > 2568 } > 2569 > 2570 if (!list_empty(&tic->t_queue)) { > 2571 spin_lock(&log->l_grant_reserve_lock); > 2572 list_del_init(&tic->t_queue); > 2573 spin_unlock(&log->l_grant_reserve_lock); > 2574 } > > So the race that we're looking at here is: > > process A was added to the reserve queue at either 2525 or 2554 and, pushes the AIL at 2561, > xfsaild frees up enough log space for process A (possibly B?), eventually xfs_log_move_tail is called to wake process A, > process A wakes at line 2564, and he is on the reserveq already, > process B sees that there are tickets on the queue at 2518 and gets the grant reserve lock at 2519, > process A spins at 2571 waiting for the grant reserve lock, > process B adds itself to the queue at 2525, > process B drops the grant reserve lock and goes to sleep at 2537 > process A takes the grant reserve lock at 2571 and removes it's ticket from the list. > > ...and there is nothing to wake process B until the ail is pushed by > some other process. > > Is that about right? > > Thanks, > Ben > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] xfs: fix the logspace waiting algorithm 2011-11-28 8:17 ` [PATCH 4/4] xfs: fix the logspace waiting algorithm Christoph Hellwig 2011-11-30 23:56 ` Ben Myers 2011-12-01 18:32 ` Ben Myers @ 2011-12-01 19:28 ` Chandra Seetharaman 2011-12-02 11:16 ` Christoph Hellwig 2 siblings, 1 reply; 24+ messages in thread From: Chandra Seetharaman @ 2011-12-01 19:28 UTC (permalink / raw) To: Christoph Hellwig; +Cc: xfs Tested the patch with testcases 234 and 273. They ran for more than 350 iterations without getting into the hang situation. Tested-by: Chandra Seetharaman <sekharan@us.ibm.com> Few generic comments on the patch 1. xlog_*_wake could use something to indicate that they are looking for log space in the specific queue. ex: xlog_reserveq_available() 2. All new functions expect a lock to be held on entry. Can be explicitly specified in a comment. 3. Change the trace function names to reflect the names of the function ?! Otherwise patch looks good and works fine. Chandra ------- end of comments ------ On Mon, 2011-11-28 at 03:17 -0500, Christoph Hellwig wrote: > plain text document attachment (xfs-fix-log-race) > Apply the scheme used in log_regrant_write_log_space to wake up any other > threads waiting for log space before the newly added one to > log_regrant_write_log_space as well, and factor the code into readable > helpers. For each of the queues we have add two helpers: > > - one to try to wake up all waiting threads. This helper will also be > usable by xfs_log_move_tail once we remove the current opportunistic > wakeups in it. > - one to sleep on t_wait until enough log space is available, loosely > modelled after Linux waitqueues. > > And use them to reimplement the guts of log_regrant_write_log_space and > log_regrant_write_log_space. These two function now use one and the same > algorithm for waiting on log space instead of subtly different ones before, > with an option to completely unify them in the near future. > > Also move the filesystem shutdown handling to the common caller given > that we had to touch it anyway. > > Based on hard debugging and an earlier patch from > Chandra Seetharaman <sekharan@us.ibm.com>. > > Signed-off-by: Christoph Hellwig <hch@lst.de> > > --- > fs/xfs/xfs_log.c | 348 ++++++++++++++++++++++++++--------------------------- > fs/xfs/xfs_trace.h | 12 - > 2 files changed, 177 insertions(+), 183 deletions(-) > > Index: xfs/fs/xfs/xfs_log.c > =================================================================== > --- xfs.orig/fs/xfs/xfs_log.c 2011-11-20 16:29:49.356194023 +0100 > +++ xfs/fs/xfs/xfs_log.c 2011-11-20 16:46:33.440754431 +0100 > @@ -150,6 +150,117 @@ xlog_grant_add_space( > } while (head_val != old); > } > > +STATIC bool > +xlog_reserveq_wake( > + struct log *log, > + int *free_bytes) > +{ > + struct xlog_ticket *tic; > + int need_bytes; > + > + list_for_each_entry(tic, &log->l_reserveq, t_queue) { > + if (tic->t_flags & XLOG_TIC_PERM_RESERV) > + need_bytes = tic->t_unit_res * tic->t_cnt; > + else > + need_bytes = tic->t_unit_res; > + > + if (*free_bytes < need_bytes) > + return false; > + *free_bytes -= need_bytes; > + > + trace_xfs_log_grant_wake_up(log, tic); > + wake_up(&tic->t_wait); > + } > + > + return true; > +} > + > +STATIC bool > +xlog_writeq_wake( > + struct log *log, > + int *free_bytes) > +{ > + struct xlog_ticket *tic; > + int need_bytes; > + > + list_for_each_entry(tic, &log->l_writeq, t_queue) { > + ASSERT(tic->t_flags & XLOG_TIC_PERM_RESERV); > + > + need_bytes = tic->t_unit_res; > + > + if (*free_bytes < need_bytes) > + return false; > + *free_bytes -= need_bytes; > + > + trace_xfs_log_regrant_write_wake_up(log, tic); > + wake_up(&tic->t_wait); > + } > + > + return true; > +} > + > +STATIC int > +xlog_reserveq_wait( > + struct log *log, > + struct xlog_ticket *tic, > + int need_bytes) > +{ > + list_add_tail(&tic->t_queue, &log->l_reserveq); > + > + do { > + if (XLOG_FORCED_SHUTDOWN(log)) > + goto shutdown; > + xlog_grant_push_ail(log, need_bytes); > + > + XFS_STATS_INC(xs_sleep_logspace); > + trace_xfs_log_grant_sleep(log, tic); > + > + xlog_wait(&tic->t_wait, &log->l_grant_reserve_lock); > + trace_xfs_log_grant_wake(log, tic); > + > + spin_lock(&log->l_grant_reserve_lock); > + if (XLOG_FORCED_SHUTDOWN(log)) > + goto shutdown; > + } while (xlog_space_left(log, &log->l_grant_reserve_head) < need_bytes); > + > + list_del_init(&tic->t_queue); > + return 0; > +shutdown: > + list_del_init(&tic->t_queue); > + return XFS_ERROR(EIO); > +} > + > +STATIC int > +xlog_writeq_wait( > + struct log *log, > + struct xlog_ticket *tic, > + int need_bytes) > +{ > + list_add_tail(&tic->t_queue, &log->l_writeq); > + > + do { > + if (XLOG_FORCED_SHUTDOWN(log)) > + goto shutdown; > + xlog_grant_push_ail(log, need_bytes); > + > + XFS_STATS_INC(xs_sleep_logspace); > + trace_xfs_log_regrant_write_sleep(log, tic); > + > + xlog_wait(&tic->t_wait, &log->l_grant_write_lock); > + trace_xfs_log_regrant_write_wake(log, tic); > + > + spin_lock(&log->l_grant_write_lock); > + if (XLOG_FORCED_SHUTDOWN(log)) > + goto shutdown; > + } while (xlog_space_left(log, &log->l_grant_write_head) < need_bytes); > + > + list_del_init(&tic->t_queue); > + return 0; > +shutdown: > + list_del_init(&tic->t_queue); > + return XFS_ERROR(EIO); > +} > + > static void > xlog_tic_reset_res(xlog_ticket_t *tic) > { > @@ -350,8 +461,19 @@ xfs_log_reserve( > retval = xlog_grant_log_space(log, internal_ticket); > } > > + if (unlikely(retval)) { > + /* > + * If we are failing, make sure the ticket doesn't have any > + * current reservations. We don't want to add this back > + * when the ticket/ transaction gets cancelled. > + */ > + internal_ticket->t_curr_res = 0; > + /* ungrant will give back unit_res * t_cnt. */ > + internal_ticket->t_cnt = 0; > + } > + > return retval; > -} /* xfs_log_reserve */ > +} > > > /* > @@ -2481,8 +2603,8 @@ restart: > /* > * Atomically get the log space required for a log ticket. > * > - * Once a ticket gets put onto the reserveq, it will only return after > - * the needed reservation is satisfied. > + * Once a ticket gets put onto the reserveq, it will only return after the > + * needed reservation is satisfied. > * > * This function is structured so that it has a lock free fast path. This is > * necessary because every new transaction reservation will come through this > @@ -2490,113 +2612,53 @@ restart: > * every pass. > * > * As tickets are only ever moved on and off the reserveq under the > - * l_grant_reserve_lock, we only need to take that lock if we are going > - * to add the ticket to the queue and sleep. We can avoid taking the lock if the > - * ticket was never added to the reserveq because the t_queue list head will be > - * empty and we hold the only reference to it so it can safely be checked > - * unlocked. > + * l_grant_reserve_lock, we only need to take that lock if we are going to add > + * the ticket to the queue and sleep. We can avoid taking the lock if the ticket > + * was never added to the reserveq because the t_queue list head will be empty > + * and we hold the only reference to it so it can safely be checked unlocked. > */ > STATIC int > -xlog_grant_log_space(xlog_t *log, > - xlog_ticket_t *tic) > +xlog_grant_log_space( > + struct log *log, > + struct xlog_ticket *tic) > { > - int free_bytes; > - int need_bytes; > + int free_bytes, need_bytes; > + int error = 0; > > -#ifdef DEBUG > - if (log->l_flags & XLOG_ACTIVE_RECOVERY) > - panic("grant Recovery problem"); > -#endif > + ASSERT(!(log->l_flags & XLOG_ACTIVE_RECOVERY)); > > trace_xfs_log_grant_enter(log, tic); > > + /* > + * If there are other waiters on the queue then give them a chance at > + * logspace before us. Wake up the first waiters, if we do not wake > + * up all the waiters then go to sleep waiting for more free space, > + * otherwise try to get some space for this transaction. > + */ > need_bytes = tic->t_unit_res; > if (tic->t_flags & XFS_LOG_PERM_RESERV) > need_bytes *= tic->t_ocnt; > - > - /* something is already sleeping; insert new transaction at end */ > - if (!list_empty_careful(&log->l_reserveq)) { > - spin_lock(&log->l_grant_reserve_lock); > - /* recheck the queue now we are locked */ > - if (list_empty(&log->l_reserveq)) { > - spin_unlock(&log->l_grant_reserve_lock); > - goto redo; > - } > - list_add_tail(&tic->t_queue, &log->l_reserveq); > - > - trace_xfs_log_grant_sleep1(log, tic); > - > - /* > - * Gotta check this before going to sleep, while we're > - * holding the grant lock. > - */ > - if (XLOG_FORCED_SHUTDOWN(log)) > - goto error_return; > - > - XFS_STATS_INC(xs_sleep_logspace); > - xlog_wait(&tic->t_wait, &log->l_grant_reserve_lock); > - > - /* > - * If we got an error, and the filesystem is shutting down, > - * we'll catch it down below. So just continue... > - */ > - trace_xfs_log_grant_wake1(log, tic); > - } > - > -redo: > - if (XLOG_FORCED_SHUTDOWN(log)) > - goto error_return_unlocked; > - > free_bytes = xlog_space_left(log, &log->l_grant_reserve_head); > - if (free_bytes < need_bytes) { > + if (!list_empty_careful(&log->l_reserveq)) { > spin_lock(&log->l_grant_reserve_lock); > - if (list_empty(&tic->t_queue)) > - list_add_tail(&tic->t_queue, &log->l_reserveq); > - > - trace_xfs_log_grant_sleep2(log, tic); > - > - if (XLOG_FORCED_SHUTDOWN(log)) > - goto error_return; > - > - xlog_grant_push_ail(log, need_bytes); > - > - XFS_STATS_INC(xs_sleep_logspace); > - xlog_wait(&tic->t_wait, &log->l_grant_reserve_lock); > - > - trace_xfs_log_grant_wake2(log, tic); > - goto redo; > - } > - > - if (!list_empty(&tic->t_queue)) { > + if (!xlog_reserveq_wake(log, &free_bytes) || > + free_bytes < need_bytes) > + error = xlog_reserveq_wait(log, tic, need_bytes); > + spin_unlock(&log->l_grant_reserve_lock); > + } else if (free_bytes < need_bytes) { > spin_lock(&log->l_grant_reserve_lock); > - list_del_init(&tic->t_queue); > + error = xlog_reserveq_wait(log, tic, need_bytes); > spin_unlock(&log->l_grant_reserve_lock); > } > + if (error) > + return error; > > - /* we've got enough space */ > xlog_grant_add_space(log, &log->l_grant_reserve_head, need_bytes); > xlog_grant_add_space(log, &log->l_grant_write_head, need_bytes); > trace_xfs_log_grant_exit(log, tic); > xlog_verify_grant_tail(log); > return 0; > - > -error_return_unlocked: > - spin_lock(&log->l_grant_reserve_lock); > -error_return: > - list_del_init(&tic->t_queue); > - spin_unlock(&log->l_grant_reserve_lock); > - trace_xfs_log_grant_error(log, tic); > - > - /* > - * If we are failing, make sure the ticket doesn't have any > - * current reservations. We don't want to add this back when > - * the ticket/transaction gets cancelled. > - */ > - tic->t_curr_res = 0; > - tic->t_cnt = 0; /* ungrant will give back unit_res * t_cnt. */ > - return XFS_ERROR(EIO); > -} /* xlog_grant_log_space */ > - > +} > > /* > * Replenish the byte reservation required by moving the grant write head. > @@ -2605,10 +2667,12 @@ error_return: > * free fast path. > */ > STATIC int > -xlog_regrant_write_log_space(xlog_t *log, > - xlog_ticket_t *tic) > +xlog_regrant_write_log_space( > + struct log *log, > + struct xlog_ticket *tic) > { > - int free_bytes, need_bytes; > + int free_bytes, need_bytes; > + int error = 0; > > tic->t_curr_res = tic->t_unit_res; > xlog_tic_reset_res(tic); > @@ -2616,104 +2680,38 @@ xlog_regrant_write_log_space(xlog_t * > if (tic->t_cnt > 0) > return 0; > > -#ifdef DEBUG > - if (log->l_flags & XLOG_ACTIVE_RECOVERY) > - panic("regrant Recovery problem"); > -#endif > + ASSERT(!(log->l_flags & XLOG_ACTIVE_RECOVERY)); > > trace_xfs_log_regrant_write_enter(log, tic); > - if (XLOG_FORCED_SHUTDOWN(log)) > - goto error_return_unlocked; > > - /* If there are other waiters on the queue then give them a > - * chance at logspace before us. Wake up the first waiters, > - * if we do not wake up all the waiters then go to sleep waiting > - * for more free space, otherwise try to get some space for > - * this transaction. > + /* > + * If there are other waiters on the queue then give them a chance at > + * logspace before us. Wake up the first waiters, if we do not wake > + * up all the waiters then go to sleep waiting for more free space, > + * otherwise try to get some space for this transaction. > */ > need_bytes = tic->t_unit_res; > - if (!list_empty_careful(&log->l_writeq)) { > - struct xlog_ticket *ntic; > - > - spin_lock(&log->l_grant_write_lock); > - free_bytes = xlog_space_left(log, &log->l_grant_write_head); > - list_for_each_entry(ntic, &log->l_writeq, t_queue) { > - ASSERT(ntic->t_flags & XLOG_TIC_PERM_RESERV); > - > - if (free_bytes < ntic->t_unit_res) > - break; > - free_bytes -= ntic->t_unit_res; > - wake_up(&ntic->t_wait); > - } > - > - if (ntic != list_first_entry(&log->l_writeq, > - struct xlog_ticket, t_queue)) { > - if (list_empty(&tic->t_queue)) > - list_add_tail(&tic->t_queue, &log->l_writeq); > - trace_xfs_log_regrant_write_sleep1(log, tic); > - > - xlog_grant_push_ail(log, need_bytes); > - > - XFS_STATS_INC(xs_sleep_logspace); > - xlog_wait(&tic->t_wait, &log->l_grant_write_lock); > - trace_xfs_log_regrant_write_wake1(log, tic); > - } else > - spin_unlock(&log->l_grant_write_lock); > - } > - > -redo: > - if (XLOG_FORCED_SHUTDOWN(log)) > - goto error_return_unlocked; > - > free_bytes = xlog_space_left(log, &log->l_grant_write_head); > - if (free_bytes < need_bytes) { > + if (!list_empty_careful(&log->l_writeq)) { > spin_lock(&log->l_grant_write_lock); > - if (list_empty(&tic->t_queue)) > - list_add_tail(&tic->t_queue, &log->l_writeq); > - > - if (XLOG_FORCED_SHUTDOWN(log)) > - goto error_return; > - > - xlog_grant_push_ail(log, need_bytes); > - > - XFS_STATS_INC(xs_sleep_logspace); > - trace_xfs_log_regrant_write_sleep2(log, tic); > - xlog_wait(&tic->t_wait, &log->l_grant_write_lock); > - > - trace_xfs_log_regrant_write_wake2(log, tic); > - goto redo; > - } > - > - if (!list_empty(&tic->t_queue)) { > + if (!xlog_writeq_wake(log, &free_bytes) || > + free_bytes < need_bytes) > + error = xlog_writeq_wait(log, tic, need_bytes); > + spin_unlock(&log->l_grant_write_lock); > + } else if (free_bytes < need_bytes) { > spin_lock(&log->l_grant_write_lock); > - list_del_init(&tic->t_queue); > + error = xlog_writeq_wait(log, tic, need_bytes); > spin_unlock(&log->l_grant_write_lock); > } > > - /* we've got enough space */ > + if (error) > + return error; > + > xlog_grant_add_space(log, &log->l_grant_write_head, need_bytes); > trace_xfs_log_regrant_write_exit(log, tic); > xlog_verify_grant_tail(log); > return 0; > - > - > - error_return_unlocked: > - spin_lock(&log->l_grant_write_lock); > - error_return: > - list_del_init(&tic->t_queue); > - spin_unlock(&log->l_grant_write_lock); > - trace_xfs_log_regrant_write_error(log, tic); > - > - /* > - * If we are failing, make sure the ticket doesn't have any > - * current reservations. We don't want to add this back when > - * the ticket/transaction gets cancelled. > - */ > - tic->t_curr_res = 0; > - tic->t_cnt = 0; /* ungrant will give back unit_res * t_cnt. */ > - return XFS_ERROR(EIO); > -} /* xlog_regrant_write_log_space */ > - > +} > > /* The first cnt-1 times through here we don't need to > * move the grant write head because the permanent > Index: xfs/fs/xfs/xfs_trace.h > =================================================================== > --- xfs.orig/fs/xfs/xfs_trace.h 2011-11-20 16:29:49.362860654 +0100 > +++ xfs/fs/xfs/xfs_trace.h 2011-11-20 16:34:23.954706395 +0100 > @@ -834,18 +834,14 @@ DEFINE_LOGGRANT_EVENT(xfs_log_umount_wri > DEFINE_LOGGRANT_EVENT(xfs_log_grant_enter); > DEFINE_LOGGRANT_EVENT(xfs_log_grant_exit); > DEFINE_LOGGRANT_EVENT(xfs_log_grant_error); > -DEFINE_LOGGRANT_EVENT(xfs_log_grant_sleep1); > -DEFINE_LOGGRANT_EVENT(xfs_log_grant_wake1); > -DEFINE_LOGGRANT_EVENT(xfs_log_grant_sleep2); > -DEFINE_LOGGRANT_EVENT(xfs_log_grant_wake2); > +DEFINE_LOGGRANT_EVENT(xfs_log_grant_sleep); > +DEFINE_LOGGRANT_EVENT(xfs_log_grant_wake); > DEFINE_LOGGRANT_EVENT(xfs_log_grant_wake_up); > DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_enter); > DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_exit); > DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_error); > -DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_sleep1); > -DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_wake1); > -DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_sleep2); > -DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_wake2); > +DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_sleep); > +DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_wake); > DEFINE_LOGGRANT_EVENT(xfs_log_regrant_write_wake_up); > DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_enter); > DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_exit); > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] xfs: fix the logspace waiting algorithm 2011-12-01 19:28 ` Chandra Seetharaman @ 2011-12-02 11:16 ` Christoph Hellwig 2011-12-02 16:02 ` Chandra Seetharaman 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2011-12-02 11:16 UTC (permalink / raw) To: Chandra Seetharaman; +Cc: Christoph Hellwig, xfs On Thu, Dec 01, 2011 at 01:28:33PM -0600, Chandra Seetharaman wrote: > Tested the patch with testcases 234 and 273. They ran for more than 350 > iterations without getting into the hang situation. > > Tested-by: Chandra Seetharaman <sekharan@us.ibm.com> > > Few generic comments on the patch > > 1. xlog_*_wake could use something to indicate that they are looking for > log space in the specific queue. ex: xlog_reserveq_available() > > 2. All new functions expect a lock to be held on entry. Can be > explicitly specified in a comment. > > 3. Change the trace function names to reflect the names of the > function ?! All this is going to change with the refactoring of these to use a common data structure which I have in my queue for 3.3. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] xfs: fix the logspace waiting algorithm 2011-12-02 11:16 ` Christoph Hellwig @ 2011-12-02 16:02 ` Chandra Seetharaman 0 siblings, 0 replies; 24+ messages in thread From: Chandra Seetharaman @ 2011-12-02 16:02 UTC (permalink / raw) To: Christoph Hellwig; +Cc: xfs Then it is all good Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com> On Fri, 2011-12-02 at 06:16 -0500, Christoph Hellwig wrote: > On Thu, Dec 01, 2011 at 01:28:33PM -0600, Chandra Seetharaman wrote: > > Tested the patch with testcases 234 and 273. They ran for more than 350 > > iterations without getting into the hang situation. > > > > Tested-by: Chandra Seetharaman <sekharan@us.ibm.com> > > > > Few generic comments on the patch > > > > 1. xlog_*_wake could use something to indicate that they are looking for > > log space in the specific queue. ex: xlog_reserveq_available() > > > > 2. All new functions expect a lock to be held on entry. Can be > > explicitly specified in a comment. > > > > 3. Change the trace function names to reflect the names of the > > function ?! > > All this is going to change with the refactoring of these to use a > common data structure which I have in my queue for 3.3. > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/4] xfs fixes for Linux 3.2-rc3 2011-11-28 8:17 [PATCH 0/4] xfs fixes for Linux 3.2-rc3 Christoph Hellwig ` (3 preceding siblings ...) 2011-11-28 8:17 ` [PATCH 4/4] xfs: fix the logspace waiting algorithm Christoph Hellwig @ 2011-11-29 19:23 ` Ben Myers 2011-11-30 8:54 ` Christoph Hellwig 2011-11-30 8:58 ` [PATCH 5/4] xfs: fix nfs export of 64-bit inodes numbers on 32-bit kernels Christoph Hellwig 5 siblings, 1 reply; 24+ messages in thread From: Ben Myers @ 2011-11-29 19:23 UTC (permalink / raw) To: Christoph Hellwig; +Cc: xfs On Mon, Nov 28, 2011 at 03:17:32AM -0500, Christoph Hellwig wrote: > These are the patches that I think should still go into Linux 3.2. > > The first two are fairly trivial fixes for crashes that can be reproduced > by users action or filesystem corruption. The third is a a fix for the > hang reported by Simon, and although he hasn't tested the final variant > yet I'm fairly confident it should fix it given we use the same pattern > for dquots. Yep. > The last one is the biggest and fixes the log space hang > first reported by Alex and then debugged by Chandra. This last one is pretty big. Could it wait for 3.3? -Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/4] xfs fixes for Linux 3.2-rc3 2011-11-29 19:23 ` [PATCH 0/4] xfs fixes for Linux 3.2-rc3 Ben Myers @ 2011-11-30 8:54 ` Christoph Hellwig 0 siblings, 0 replies; 24+ messages in thread From: Christoph Hellwig @ 2011-11-30 8:54 UTC (permalink / raw) To: Ben Myers; +Cc: xfs On Tue, Nov 29, 2011 at 01:23:00PM -0600, Ben Myers wrote: > > The last one is the biggest and fixes the log space hang > > first reported by Alex and then debugged by Chandra. > > This last one is pretty big. Could it wait for 3.3? It's not that big, and if we manage to push it out ASAP it's not even that late in the cycle. Given that people are getting fast hardware managing to hit all kinds of weird AIL and log reservation hangs that had been hidden forever I'd much prefer to get it out into the field ASAP. That also includes a -stable backport, btw. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 5/4] xfs: fix nfs export of 64-bit inodes numbers on 32-bit kernels 2011-11-28 8:17 [PATCH 0/4] xfs fixes for Linux 3.2-rc3 Christoph Hellwig ` (4 preceding siblings ...) 2011-11-29 19:23 ` [PATCH 0/4] xfs fixes for Linux 3.2-rc3 Ben Myers @ 2011-11-30 8:58 ` Christoph Hellwig 2011-12-02 16:07 ` Ben Myers 5 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2011-11-30 8:58 UTC (permalink / raw) To: xfs The i_ino field in the VFS inode is of type unsigned long and thus can't hold the full 64-bit inode number on 32-bit kernels. We have the full inode number in the XFS inode, so use that one for nfs exports. Note that I've also switched the 32-bit file handles types to it, just to make the code more consistent and copy & paste errors less likely to happen. Reported-by: Guoquan Yang <ygq51@hotmail.com> Reported-by: Hank Peng <pengxihan@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Index: xfs/fs/xfs/xfs_export.c =================================================================== --- xfs.orig/fs/xfs/xfs_export.c 2011-11-28 12:11:08.923630697 +0100 +++ xfs/fs/xfs/xfs_export.c 2011-11-29 05:48:12.682177223 +0100 @@ -98,22 +98,22 @@ xfs_fs_encode_fh( switch (fileid_type) { case FILEID_INO32_GEN_PARENT: spin_lock(&dentry->d_lock); - fid->i32.parent_ino = dentry->d_parent->d_inode->i_ino; + fid->i32.parent_ino = XFS_I(dentry->d_parent->d_inode)->i_ino; fid->i32.parent_gen = dentry->d_parent->d_inode->i_generation; spin_unlock(&dentry->d_lock); /*FALLTHRU*/ case FILEID_INO32_GEN: - fid->i32.ino = inode->i_ino; + fid->i32.ino = XFS_I(inode)->i_ino; fid->i32.gen = inode->i_generation; break; case FILEID_INO32_GEN_PARENT | XFS_FILEID_TYPE_64FLAG: spin_lock(&dentry->d_lock); - fid64->parent_ino = dentry->d_parent->d_inode->i_ino; + fid64->parent_ino = XFS_I(dentry->d_parent->d_inode)->i_ino; fid64->parent_gen = dentry->d_parent->d_inode->i_generation; spin_unlock(&dentry->d_lock); /*FALLTHRU*/ case FILEID_INO32_GEN | XFS_FILEID_TYPE_64FLAG: - fid64->ino = inode->i_ino; + fid64->ino = XFS_I(inode)->i_ino; fid64->gen = inode->i_generation; break; } _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 5/4] xfs: fix nfs export of 64-bit inodes numbers on 32-bit kernels 2011-11-30 8:58 ` [PATCH 5/4] xfs: fix nfs export of 64-bit inodes numbers on 32-bit kernels Christoph Hellwig @ 2011-12-02 16:07 ` Ben Myers 2011-12-02 17:29 ` Christoph Hellwig 0 siblings, 1 reply; 24+ messages in thread From: Ben Myers @ 2011-12-02 16:07 UTC (permalink / raw) To: Christoph Hellwig; +Cc: xfs Hey Christoph, On Wed, Nov 30, 2011 at 03:58:18AM -0500, Christoph Hellwig wrote: > The i_ino field in the VFS inode is of type unsigned long and thus can't > hold the full 64-bit inode number on 32-bit kernels. We have the full > inode number in the XFS inode, so use that one for nfs exports. Note > that I've also switched the 32-bit file handles types to it, just to make > the code more consistent and copy & paste errors less likely to happen. > > Reported-by: Guoquan Yang <ygq51@hotmail.com> > Reported-by: Hank Peng <pengxihan@gmail.com> > Signed-off-by: Christoph Hellwig <hch@lst.de> To what extent did you test this one? Anything in particular I should look for when I test it? Thanks, Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 5/4] xfs: fix nfs export of 64-bit inodes numbers on 32-bit kernels 2011-12-02 16:07 ` Ben Myers @ 2011-12-02 17:29 ` Christoph Hellwig 2011-12-06 16:39 ` Ben Myers 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2011-12-02 17:29 UTC (permalink / raw) To: Ben Myers; +Cc: xfs On Fri, Dec 02, 2011 at 10:07:12AM -0600, Ben Myers wrote: > To what extent did you test this one? Anything in particular I should > look for when I test it? Create a large filesystem, mount it using -o inode64 on a 32-bit kernel, find an inode that actually uses more than 32-bits, and try to access it from an nfs client. Alternatively you could probably reproduce it using the open by handle system calls, I have an idea how to turn that into a test case for xfstests using a large loopback filesystem. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 5/4] xfs: fix nfs export of 64-bit inodes numbers on 32-bit kernels 2011-12-02 17:29 ` Christoph Hellwig @ 2011-12-06 16:39 ` Ben Myers 0 siblings, 0 replies; 24+ messages in thread From: Ben Myers @ 2011-12-06 16:39 UTC (permalink / raw) To: Christoph Hellwig; +Cc: xfs On Fri, Dec 02, 2011 at 12:29:08PM -0500, Christoph Hellwig wrote: > On Fri, Dec 02, 2011 at 10:07:12AM -0600, Ben Myers wrote: > > To what extent did you test this one? Anything in particular I should > > look for when I test it? > > Create a large filesystem, mount it using -o inode64 on a 32-bit kernel, > find an inode that actually uses more than 32-bits, and try to access > it from an nfs client. > > Alternatively you could probably reproduce it using the open by handle > system calls, I have an idea how to turn that into a test case for > xfstests using a large loopback filesystem. I tested this on a pair of i386 boxes with a large loop filesystem and inode64. Without the fix I got ESTALE on any inodes > 32 bits, and with the patch it works great. Signed-off-by: Ben Myers <bpm@sgi.com> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2011-12-06 16:38 UTC | newest] Thread overview: 24+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-11-28 8:17 [PATCH 0/4] xfs fixes for Linux 3.2-rc3 Christoph Hellwig 2011-11-28 8:17 ` [PATCH 1/4] xfs: fix attr2 vs large data fork assert Christoph Hellwig 2011-11-28 18:01 ` Ben Myers 2011-11-28 18:03 ` Christoph Hellwig 2011-11-28 8:17 ` [PATCH 2/4] xfs: validate acl count Christoph Hellwig 2011-11-28 8:17 ` [PATCH 3/4] xfs: force buffer writeback before blocking on the ilock in inode reclaim Christoph Hellwig 2011-11-28 8:17 ` [PATCH 4/4] xfs: fix the logspace waiting algorithm Christoph Hellwig 2011-11-30 23:56 ` Ben Myers 2011-12-01 7:21 ` Christoph Hellwig 2011-12-01 19:51 ` Ben Myers 2011-12-02 11:51 ` Christoph Hellwig 2011-12-05 2:53 ` Dave Chinner 2011-12-05 9:05 ` Christoph Hellwig 2011-12-01 18:32 ` Ben Myers 2011-12-01 20:43 ` Chandra Seetharaman 2011-12-01 19:28 ` Chandra Seetharaman 2011-12-02 11:16 ` Christoph Hellwig 2011-12-02 16:02 ` Chandra Seetharaman 2011-11-29 19:23 ` [PATCH 0/4] xfs fixes for Linux 3.2-rc3 Ben Myers 2011-11-30 8:54 ` Christoph Hellwig 2011-11-30 8:58 ` [PATCH 5/4] xfs: fix nfs export of 64-bit inodes numbers on 32-bit kernels Christoph Hellwig 2011-12-02 16:07 ` Ben Myers 2011-12-02 17:29 ` Christoph Hellwig 2011-12-06 16:39 ` Ben Myers
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox