public inbox for linux-xfs@vger.kernel.org
* [PATCH] xfs: Improve scalability of busy extent tracking
@ 2010-04-21  5:47 Dave Chinner
  2010-04-22 11:01 ` Christoph Hellwig
  2010-04-22 11:07 ` Christoph Hellwig
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Chinner @ 2010-04-21  5:47 UTC (permalink / raw)
  To: xfs

When we free a metadata extent, we record it in the per-AG busy
extent array so that it is not re-used before the freeing
transaction hits the disk. This array is fixed size, so when it
overflows we make further allocation transactions synchronous
because we cannot track more freed extents until those transactions
hit the disk and are completed. Under heavy mixed allocation and
freeing workloads with large log buffers, we can overflow this array
quite easily.
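For illustration, the overflow behaviour of the old fixed-size array can
be sketched in user-space C (slot count shrunk for brevity; all names
here are illustrative stand-ins, not the kernel's):

```c
#include <assert.h>

#define PAGB_NUM_SLOTS 4	/* tiny for illustration; the real array is larger */

struct busy_slot {
	int in_use;
	unsigned int bno;
	unsigned int length;
};

struct perag {
	struct busy_slot slots[PAGB_NUM_SLOTS];
};

/*
 * Mark an extent busy in the fixed-size array.  Returns 1 if a free
 * slot was found, 0 if the array overflowed and the caller must fall
 * back to making the transaction synchronous.
 */
static int mark_busy(struct perag *pag, unsigned int bno, unsigned int len)
{
	int i;

	/* linear scan for a free slot - this is the scalability problem */
	for (i = 0; i < PAGB_NUM_SLOTS; i++) {
		if (!pag->slots[i].in_use) {
			pag->slots[i].in_use = 1;
			pag->slots[i].bno = bno;
			pag->slots[i].length = len;
			return 1;
		}
	}
	return 0;	/* overflow: caller marks transaction synchronous */
}
```

Once the array fills, every further free in that AG forces a synchronous
transaction until the tracked transactions hit the disk.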

Further, the array is sparsely populated, which means that inserts
need to search for a free slot, and array searches often have to
search many more slots than are actually used to check all the
busy extents. Quite inefficient, really.

To enable this aspect of extent freeing to scale better, we need
a structure that can grow dynamically. While in other areas of
XFS we have used radix trees, the extents being freed are at random
locations on disk so are better suited to being indexed by an rbtree.

So, use a per-AG rbtree indexed by block number to track busy
extents.  This incurs a memory allocation when marking an extent
busy, but allocation failures should be rare even in low memory
situations. This should scale to an arbitrary number of extents, so
it should not be a
limitation for features such as in-memory aggregation of
transactions.
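A user-space sketch of the idea (a plain, unbalanced binary search tree
keyed by start block standing in for the kernel rbtree; names and the
0/-1/1 overlap convention mirror the patch but are simplified):

```c
#include <assert.h>
#include <stddef.h>

/* Busy extent keyed by start block, as in the per-AG tree. */
struct busy_extent {
	unsigned int bno;	/* start block */
	unsigned int length;	/* extent length */
	struct busy_extent *left, *right;
};

/* Insert keyed by start block; duplicate start blocks not handled here. */
static struct busy_extent *busy_insert(struct busy_extent *root,
				       struct busy_extent *new)
{
	if (!root)
		return new;
	if (new->bno < root->bno)
		root->left = busy_insert(root->left, new);
	else
		root->right = busy_insert(root->right, new);
	return root;
}

/*
 * Search for overlap with [bno, bno + len): 0 = no overlap,
 * -1 = partial overlap, 1 = exact match.
 */
static int busy_search(struct busy_extent *root, unsigned int bno,
		       unsigned int len)
{
	unsigned int uend = bno + len - 1;
	int match = 0;

	while (root) {
		unsigned int bend = root->bno + root->length - 1;

		if (bno < root->bno) {
			if (uend >= root->bno)
				match = -1;	/* overlaps a higher extent */
			root = root->left;
		} else if (bno > root->bno) {
			if (bno <= bend)
				match = -1;	/* overlaps a lower extent */
			root = root->right;
		} else {
			return root->length == len ? 1 : -1;
		}
	}
	return match;
}
```

Insert and search are O(log n) in a balanced tree, versus the O(n) scan
of the old sparse array.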

However, there are still situations where we can't avoid allocating
busy extents (such as allocation from the AGFL). To minimise the
overhead of such occurrences, we need to avoid doing a synchronous
log force while holding the AGF locked to ensure that the previous
transactions are safely on disk before we use the extent. We can do
this by marking the transaction doing the allocation as synchronous
rather than issuing a log force.

Because of the locking involved and the ordering of transactions,
the synchronous transaction provides the same guarantees as a
synchronous log force because it ensures that all the prior
transactions are already on disk when the synchronous transaction
hits the disk. i.e. it preserves the free->allocate order of the
extent correctly in recovery.

By doing this, we avoid holding the AGF locked while log writes are
in progress, hence reducing the length of time the lock is held and
therefore we increase the rate at which we can allocate and free
from the allocation group, thereby increasing overall throughput.

The only problem with this approach is that when a metadata buffer is
marked stale (e.g. a directory block is removed), the buffer remains
pinned and locked until the log goes to disk. The issue here is that
if that stale buffer is reallocated in a subsequent transaction, the
attempt to lock that buffer in the transaction will hang waiting for
the log to go to disk to unlock and unpin the buffer. Hence if
someone tries to lock a pinned, stale, locked buffer we need to
push on the log to get it unlocked ASAP. Effectively we are trading
off a guaranteed log force for a much less common trigger for log
force to occur.

Ideally we should not reallocate busy extents. That is a much more
complex fix to the problem as it involves direct intervention in the
allocation btree searches in many places. This is left to a future
set of modifications.

Finally, now that we track busy extents in allocated memory, we
don't need the descriptors in the transaction structure to point to
them. We can replace the complex busy chunk infrastructure with a
simple linked list of busy extents. This allows us to remove a large
chunk of code, making the overall change a net reduction in code
size.
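The replacement of the busy chunk machinery with a plain list can be
pictured in user space roughly like this (a minimal intrusive doubly
linked list standing in for the kernel's list_head; all names are
illustrative):

```c
#include <assert.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *head)
{
	head->next = head->prev = head;
}

static void list_add_node(struct list_node *head, struct list_node *node)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

static void list_del_node(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = node->prev = node;
}

/* Busy extent carrying its own list linkage, as in the patch. */
struct busy_extent {
	struct list_node list;	/* chained off the transaction */
	unsigned int bno;
	unsigned int length;
};

/* Walk a transaction's busy list, unlinking each entry in turn. */
static int busy_list_drain(struct list_node *t_busy)
{
	int count = 0;

	while (t_busy->next != t_busy) {
		list_del_node(t_busy->next);
		count++;
	}
	return count;
}
```

Because each busy extent is heap-allocated and carries its own linkage,
the transaction no longer needs descriptor slots, bitmasks, or chunk
chaining at all.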

Changes since last version:
o folded all three patches into one - they were too intertwined to
  keep separate without introducing bugs in the code that got
  removed in the next patch. This also resulted in significant
  maintenance overhead because fixing the first patch would make
  the others reject.
o busy extent transaction owner tracked by transaction id, not by
  address of transaction structure. Tracking by address of the
  transaction was triggering false ASSERT failures when searching for
  busy extents with matching start block numbers.
o minor code cleanups
o tested against blocksize < pagesize filesystems

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_buf.c      |    9 ++
 fs/xfs/linux-2.6/xfs_quotaops.c |    1 +
 fs/xfs/linux-2.6/xfs_trace.h    |   80 ++++++++----
 fs/xfs/support/debug.c          |    1 +
 fs/xfs/xfs_ag.h                 |   21 ++--
 fs/xfs/xfs_alloc.c              |  272 ++++++++++++++++++++++++--------------
 fs/xfs/xfs_alloc.h              |    5 +-
 fs/xfs/xfs_filestream.c         |    1 +
 fs/xfs/xfs_log.h                |    2 +
 fs/xfs/xfs_log_priv.h           |    2 -
 fs/xfs/xfs_trans.c              |   61 +++++----
 fs/xfs/xfs_trans.h              |   36 +-----
 fs/xfs/xfs_trans_extfree.c      |    1 +
 fs/xfs/xfs_trans_item.c         |  109 ----------------
 fs/xfs/xfs_trans_priv.h         |    6 +-
 15 files changed, 292 insertions(+), 315 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 5f7a355..a8a3eaf 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -37,6 +37,7 @@
 
 #include "xfs_sb.h"
 #include "xfs_inum.h"
+#include "xfs_log.h"
 #include "xfs_ag.h"
 #include "xfs_dmapi.h"
 #include "xfs_mount.h"
@@ -850,6 +851,12 @@ xfs_buf_lock_value(
  *	Note that this in no way locks the underlying pages, so it is only
  *	useful for synchronizing concurrent use of buffer objects, not for
  *	synchronizing independent access to the underlying pages.
+ *
+ *	If we come across a stale, pinned, locked buffer, we know that we
+ *	are being asked to lock a buffer that has been reallocated. Because
+ *	it is pinned, we know that the log has not been pushed to disk and
+ *	hence it will still be locked. Rather than sleeping until someone
+ *	else pushes the log, push it ourselves before trying to get the lock.
  */
 void
 xfs_buf_lock(
@@ -857,6 +864,8 @@ xfs_buf_lock(
 {
 	trace_xfs_buf_lock(bp, _RET_IP_);
 
+	if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
+		xfs_log_force(bp->b_mount, 0);
 	if (atomic_read(&bp->b_io_remaining))
 		blk_run_address_space(bp->b_target->bt_mapping);
 	down(&bp->b_sema);
diff --git a/fs/xfs/linux-2.6/xfs_quotaops.c b/fs/xfs/linux-2.6/xfs_quotaops.c
index 1947514..2e73688 100644
--- a/fs/xfs/linux-2.6/xfs_quotaops.c
+++ b/fs/xfs/linux-2.6/xfs_quotaops.c
@@ -19,6 +19,7 @@
 #include "xfs_dmapi.h"
 #include "xfs_sb.h"
 #include "xfs_inum.h"
+#include "xfs_log.h"
 #include "xfs_ag.h"
 #include "xfs_mount.h"
 #include "xfs_quota.h"
diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index 8a319cf..0934a27 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -1059,83 +1059,109 @@ TRACE_EVENT(xfs_bunmap,
 
 );
 
+#define XFS_BUSY_SYNC \
+	{ 0,	"async" }, \
+	{ 1,	"sync" }
+
 TRACE_EVENT(xfs_alloc_busy,
-	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
-		 xfs_extlen_t len, int slot),
-	TP_ARGS(mp, agno, agbno, len, slot),
+	TP_PROTO(struct xfs_trans *trans, xfs_agnumber_t agno,
+		 xfs_agblock_t agbno, xfs_extlen_t len, int sync),
+	TP_ARGS(trans, agno, agbno, len, sync),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
+		__field(struct xfs_trans *, tp)
 		__field(xfs_agnumber_t, agno)
 		__field(xfs_agblock_t, agbno)
 		__field(xfs_extlen_t, len)
-		__field(int, slot)
+		__field(int, sync)
 	),
 	TP_fast_assign(
-		__entry->dev = mp->m_super->s_dev;
+		__entry->dev = trans->t_mountp->m_super->s_dev;
+		__entry->tp = trans;
 		__entry->agno = agno;
 		__entry->agbno = agbno;
 		__entry->len = len;
-		__entry->slot = slot;
+		__entry->sync = sync;
 	),
-	TP_printk("dev %d:%d agno %u agbno %u len %u slot %d",
+	TP_printk("dev %d:%d trans 0x%p agno %u agbno %u len %u %s",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->tp,
 		  __entry->agno,
 		  __entry->agbno,
 		  __entry->len,
-		  __entry->slot)
+		  __print_symbolic(__entry->sync, XFS_BUSY_SYNC))
 
 );
 
-#define XFS_BUSY_STATES \
-	{ 0,	"found" }, \
-	{ 1,	"missing" }
-
 TRACE_EVENT(xfs_alloc_unbusy,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
-		 int slot, int found),
-	TP_ARGS(mp, agno, slot, found),
+		 xfs_agblock_t agbno, xfs_extlen_t len),
+	TP_ARGS(mp, agno, agbno, len),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
-		__field(int, slot)
-		__field(int, found)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
 	),
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
 		__entry->agno = agno;
-		__entry->slot = slot;
-		__entry->found = found;
+		__entry->agbno = agbno;
+		__entry->len = len;
 	),
-	TP_printk("dev %d:%d agno %u slot %d %s",
+	TP_printk("dev %d:%d agno %u agbno %u len %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
-		  __entry->slot,
-		  __print_symbolic(__entry->found, XFS_BUSY_STATES))
+		  __entry->agbno,
+		  __entry->len)
 );
 
+#define XFS_BUSY_STATES \
+	{ 0,	"missing" }, \
+	{ 1,	"found" }
+
 TRACE_EVENT(xfs_alloc_busysearch,
-	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
-		 xfs_extlen_t len, xfs_lsn_t lsn),
-	TP_ARGS(mp, agno, agbno, len, lsn),
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 xfs_agblock_t agbno, xfs_extlen_t len, int found),
+	TP_ARGS(mp, agno, agbno, len, found),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
 		__field(xfs_agblock_t, agbno)
 		__field(xfs_extlen_t, len)
-		__field(xfs_lsn_t, lsn)
+		__field(int, found)
 	),
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
 		__entry->agno = agno;
 		__entry->agbno = agbno;
 		__entry->len = len;
-		__entry->lsn = lsn;
+		__entry->found = found;
 	),
-	TP_printk("dev %d:%d agno %u agbno %u len %u force lsn 0x%llx",
+	TP_printk("dev %d:%d agno %u agbno %u len %u %s",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
 		  __entry->agbno,
 		  __entry->len,
+		  __print_symbolic(__entry->found, XFS_BUSY_STATES))
+);
+
+TRACE_EVENT(xfs_trans_commit_lsn,
+	TP_PROTO(struct xfs_trans *trans),
+	TP_ARGS(trans),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(struct xfs_trans *, tp)
+		__field(xfs_lsn_t, lsn)
+	),
+	TP_fast_assign(
+		__entry->dev = trans->t_mountp->m_super->s_dev;
+		__entry->tp = trans;
+		__entry->lsn = trans->t_commit_lsn;
+	),
+	TP_printk("dev %d:%d trans 0x%p commit_lsn 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->tp,
 		  __entry->lsn)
 );
 
diff --git a/fs/xfs/support/debug.c b/fs/xfs/support/debug.c
index 3f3610a..c52472f 100644
--- a/fs/xfs/support/debug.c
+++ b/fs/xfs/support/debug.c
@@ -21,6 +21,7 @@
 /* xfs_mount.h drags a lot of crap in, sorry.. */
 #include "xfs_sb.h"
 #include "xfs_inum.h"
+#include "xfs_log.h"
 #include "xfs_ag.h"
 #include "xfs_dmapi.h"
 #include "xfs_mount.h"
diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index abb8222..f2c08c6 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -175,14 +175,17 @@ typedef struct xfs_agfl {
 } xfs_agfl_t;
 
 /*
- * Busy block/extent entry.  Used in perag to mark blocks that have been freed
- * but whose transactions aren't committed to disk yet.
+ * Busy block/extent entry.  Indexed by a rbtree in perag to mark blocks that
+ * have been freed but whose transactions aren't committed to disk yet.
  */
-typedef struct xfs_perag_busy {
-	xfs_agblock_t	busy_start;
-	xfs_extlen_t	busy_length;
-	struct xfs_trans *busy_tp;	/* transaction that did the free */
-} xfs_perag_busy_t;
+struct xfs_busy_extent {
+	struct rb_node	rb_node;	/* ag by-bno indexed search tree */
+	struct list_head list;		/* transaction busy extent list */
+	xfs_agnumber_t	agno;
+	xfs_agblock_t	bno;
+	xfs_extlen_t	length;
+	xlog_tid_t	tid;		/* transaction identifier */
+};
 
 /*
  * Per-ag incore structure, copies of information in agf and agi,
@@ -216,7 +219,8 @@ typedef struct xfs_perag {
 	xfs_agino_t	pagl_leftrec;
 	xfs_agino_t	pagl_rightrec;
 #ifdef __KERNEL__
-	spinlock_t	pagb_lock;	/* lock for pagb_list */
+	spinlock_t	pagb_lock;	/* lock for pagb_tree */
+	struct rb_root	pagb_tree;	/* ordered tree of busy extents */
 
 	atomic_t        pagf_fstrms;    /* # of filestreams active in this AG */
 
@@ -226,7 +230,6 @@ typedef struct xfs_perag {
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
 #endif
 	int		pagb_count;	/* pagb slots in use */
-	xfs_perag_busy_t pagb_list[XFS_PAGB_NUM_SLOTS];	/* unstable blocks */
 } xfs_perag_t;
 
 /*
diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
index 94cddbf..04de39e 100644
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -46,11 +46,9 @@
 #define	XFSA_FIXUP_BNO_OK	1
 #define	XFSA_FIXUP_CNT_OK	2
 
-STATIC void
-xfs_alloc_search_busy(xfs_trans_t *tp,
-		    xfs_agnumber_t agno,
-		    xfs_agblock_t bno,
-		    xfs_extlen_t len);
+static int
+xfs_alloc_busy_search(struct xfs_mount *mp, xfs_agnumber_t agno,
+		    xfs_agblock_t bno, xfs_extlen_t len);
 
 /*
  * Prototypes for per-ag allocation routines
@@ -540,9 +538,16 @@ xfs_alloc_ag_vextent(
 				be32_to_cpu(agf->agf_length));
 			xfs_alloc_log_agf(args->tp, args->agbp,
 						XFS_AGF_FREEBLKS);
-			/* search the busylist for these blocks */
-			xfs_alloc_search_busy(args->tp, args->agno,
-					args->agbno, args->len);
+			/*
+			 * Search the busylist for these blocks and mark the
+			 * transaction as synchronous if blocks are found. This
+			 * avoids the need to block due to a synchronous log
+			 * force to ensure correct ordering as the synchronous
+			 * transaction will guarantee that for us.
+			 */
+			if (xfs_alloc_busy_search(args->mp, args->agno,
+						args->agbno, args->len))
+				xfs_trans_set_sync(args->tp);
 		}
 		if (!args->isfl)
 			xfs_trans_mod_sb(args->tp,
@@ -1993,10 +1998,17 @@ xfs_alloc_get_freelist(
 	 * and remain there until the freeing transaction is committed to
 	 * disk.  Now that we have allocated blocks, this list must be
 	 * searched to see if a block is being reused.  If one is, then
-	 * the freeing transaction must be pushed to disk NOW by forcing
-	 * to disk all iclogs up that transaction's LSN.
-	 */
-	xfs_alloc_search_busy(tp, be32_to_cpu(agf->agf_seqno), bno, 1);
+	 * the freeing transaction must be pushed to disk before this
+	 * transaction.
+	 *
+	 * We do this by setting the current transaction
+	 * to a sync transaction which guarantees that the freeing transaction
+	 * is on disk before this transaction. This is done instead of a
+	 * synchronous log force here so that we don't sit and wait with
+	 * the AGF locked in the transaction during the log force.
+	 */
+	if (xfs_alloc_busy_search(mp, be32_to_cpu(agf->agf_seqno), bno, 1))
+		xfs_trans_set_sync(tp);
 	return 0;
 }
 
@@ -2201,7 +2213,7 @@ xfs_alloc_read_agf(
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
 		spin_lock_init(&pag->pagb_lock);
 		pag->pagb_count = 0;
-		memset(pag->pagb_list, 0, sizeof(pag->pagb_list));
+		pag->pagb_tree = RB_ROOT;
 		pag->pagf_init = 1;
 	}
 #ifdef DEBUG
@@ -2482,124 +2494,184 @@ error0:
  * xfs_alloc_mark_busy - add to the per-ag busy list
  * xfs_alloc_clear_busy - remove an item from the per-ag busy list
  */
-void
-xfs_alloc_mark_busy(xfs_trans_t *tp,
-		    xfs_agnumber_t agno,
-		    xfs_agblock_t bno,
-		    xfs_extlen_t len)
+static void
+xfs_alloc_busy_insert(
+	struct xfs_trans	*tp,
+	struct xfs_busy_extent	*new)
 {
-	xfs_perag_busy_t	*bsy;
 	struct xfs_perag	*pag;
-	int			n;
+	struct rb_node		**rbp;
+	struct rb_node		*parent;
+	struct xfs_busy_extent	*busyp;
 
-	pag = xfs_perag_get(tp->t_mountp, agno);
+	pag = xfs_perag_get(tp->t_mountp, new->agno);
+restart:
 	spin_lock(&pag->pagb_lock);
+	rbp = &pag->pagb_tree.rb_node;
+	parent = NULL;
+	while (*rbp) {
+		parent = *rbp;
+		busyp = rb_entry(parent, struct xfs_busy_extent, rb_node);
+
+		if (new->bno < busyp->bno)
+			rbp = &(*rbp)->rb_left;
+		else if (new->bno > busyp->bno)
+			rbp = &(*rbp)->rb_right;
+		else {
 
-	/* search pagb_list for an open slot */
-	for (bsy = pag->pagb_list, n = 0;
-	     n < XFS_PAGB_NUM_SLOTS;
-	     bsy++, n++) {
-		if (bsy->busy_tp == NULL) {
-			break;
+			/*
+			 * We're trying to reuse an already busy extent?
+			 *
+			 * That means the transaction that marked it busy must
+			 * still be committing, but we are freeing it again
+			 * here. This could be the same transaction (btree
+			 * manipulations may allocate and free blocks
+			 * multiple times in a transaction), so if it is, make
+			 * sure the transaction is marked synchronous already
+			 * and update the length of the busy extent to match
+			 * the new range being freed.
+			 *
+			 * If it is not the same transaction, then we need to
+			 * wait for the transaction that marked this extent
+			 * busy to complete. I don't think we can avoid a log
+			 * force in this case. We can't rely on the contents of
+			 * the transaction pointer in busyp, so we have to force
+			 * everything.
+			 *
+			 * Note that we do not use the transaction structure
+			 * for identifying equal transactions. This is because
+			 * there is the possibility of transaction structures
+			 * being reallocated from the slab after being freed
+			 * and triggering false detections here. Hence use the
+			 * transaction ticket ID to determine if it is the same
+			 * transaction.
+			 */
+			if (busyp->tid != xfs_trans_get_tid(tp)) {
+				spin_unlock(&pag->pagb_lock);
+				xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
+				goto restart;
+			}
+			ASSERT(tp->t_flags & XFS_TRANS_SYNC);
+			busyp->length = max(busyp->length, new->length);
+			spin_unlock(&pag->pagb_lock);
+			xfs_perag_put(pag);
+			kmem_free(new);
+			return;
 		}
 	}
 
-	trace_xfs_alloc_busy(tp->t_mountp, agno, bno, len, n);
-
-	if (n < XFS_PAGB_NUM_SLOTS) {
-		bsy = &pag->pagb_list[n];
-		pag->pagb_count++;
-		bsy->busy_start = bno;
-		bsy->busy_length = len;
-		bsy->busy_tp = tp;
-		xfs_trans_add_busy(tp, agno, n);
-	} else {
-		/*
-		 * The busy list is full!  Since it is now not possible to
-		 * track the free block, make this a synchronous transaction
-		 * to insure that the block is not reused before this
-		 * transaction commits.
-		 */
-		xfs_trans_set_sync(tp);
-	}
+	rb_link_node(&new->rb_node, parent, rbp);
+	rb_insert_color(&new->rb_node, &pag->pagb_tree);
 
+	list_add(&new->list, &tp->t_busy);
 	spin_unlock(&pag->pagb_lock);
 	xfs_perag_put(pag);
 }
 
 void
-xfs_alloc_clear_busy(xfs_trans_t *tp,
-		     xfs_agnumber_t agno,
-		     int idx)
+xfs_alloc_mark_busy(
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len)
 {
-	struct xfs_perag	*pag;
-	xfs_perag_busy_t	*list;
+	struct xfs_busy_extent	*busyp;
 
-	ASSERT(idx < XFS_PAGB_NUM_SLOTS);
-	pag = xfs_perag_get(tp->t_mountp, agno);
-	spin_lock(&pag->pagb_lock);
-	list = pag->pagb_list;
-
-	trace_xfs_alloc_unbusy(tp->t_mountp, agno, idx, list[idx].busy_tp == tp);
-
-	if (list[idx].busy_tp == tp) {
-		list[idx].busy_tp = NULL;
-		pag->pagb_count--;
+	busyp = kmem_zalloc(sizeof(struct xfs_busy_extent), KM_MAYFAIL);
+	if (!busyp) {
+		/*
+		 * No Memory!  Since it is now not possible to track the free
+		 * block, make this a synchronous transaction to ensure that
+		 * the block is not reused before this transaction commits.
+		 */
+		trace_xfs_alloc_busy(tp, agno, bno, len, 1);
+		xfs_trans_set_sync(tp);
+		return;
 	}
 
-	spin_unlock(&pag->pagb_lock);
-	xfs_perag_put(pag);
-}
+	busyp->agno = agno;
+	busyp->bno = bno;
+	busyp->length = len;
+	busyp->tid = xfs_trans_get_tid(tp);
+	INIT_LIST_HEAD(&busyp->list);
 
+	/* trace before insert to be able to see failed inserts */
+	trace_xfs_alloc_busy(tp, agno, bno, len, 0);
+	xfs_alloc_busy_insert(tp, busyp);
+}
 
 /*
- * If we find the extent in the busy list, force the log out to get the
- * extent out of the busy list so the caller can use it straight away.
+ * Search for a busy extent within the range of the extent we are about to
+ * allocate.  The busy extent tree lock is taken internally, so the caller
+ * must not already hold it when calling xfs_alloc_busy_search(). This
+ * function returns 0 for no overlapping busy
+ * extent, -1 for an overlapping but not exact busy extent, and 1 for an exact
+ * match. This is done so that a non-zero return indicates an overlap that
+ * will require a synchronous transaction, but it can still be
+ * used to distinguish between a partial or exact match.
  */
-STATIC void
-xfs_alloc_search_busy(xfs_trans_t *tp,
-		    xfs_agnumber_t agno,
-		    xfs_agblock_t bno,
-		    xfs_extlen_t len)
+static int
+xfs_alloc_busy_search(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len)
 {
 	struct xfs_perag	*pag;
-	xfs_perag_busy_t	*bsy;
+	struct rb_node		*rbp;
 	xfs_agblock_t		uend, bend;
-	xfs_lsn_t		lsn = 0;
-	int			cnt;
+	struct xfs_busy_extent	*busyp;
+	int			match = 0;
 
-	pag = xfs_perag_get(tp->t_mountp, agno);
+	pag = xfs_perag_get(mp, agno);
 	spin_lock(&pag->pagb_lock);
-	cnt = pag->pagb_count;
 
-	/*
-	 * search pagb_list for this slot, skipping open slots. We have to
-	 * search the entire array as there may be multiple overlaps and
-	 * we have to get the most recent LSN for the log force to push out
-	 * all the transactions that span the range.
-	 */
 	uend = bno + len - 1;
-	for (cnt = 0; cnt < pag->pagb_count; cnt++) {
-		bsy = &pag->pagb_list[cnt];
-		if (!bsy->busy_tp)
-			continue;
-
-		bend = bsy->busy_start + bsy->busy_length - 1;
-		if (bno > bend || uend < bsy->busy_start)
-			continue;
-
-		/* (start1,length1) within (start2, length2) */
-		if (XFS_LSN_CMP(bsy->busy_tp->t_commit_lsn, lsn) > 0)
-			lsn = bsy->busy_tp->t_commit_lsn;
+	rbp = pag->pagb_tree.rb_node;
+
+	/* find closest start bno overlap */
+	while (rbp) {
+		busyp = rb_entry(rbp, struct xfs_busy_extent, rb_node);
+		bend = busyp->bno + busyp->length - 1;
+		if (bno < busyp->bno) {
+			/* may overlap, but exact start block is lower */
+			if (uend >= busyp->bno)
+				match = -1;
+			rbp = rbp->rb_left;
+		} else if (bno > busyp->bno) {
+			/* may overlap, but exact start block is higher */
+			if (bno <= bend)
+				match = -1;
+			rbp = rbp->rb_right;
+		} else {
+			/* bno matches busyp, length determines exact match */
+			match = (busyp->length == len) ? 1 : -1;
+			break;
+		}
 	}
 	spin_unlock(&pag->pagb_lock);
+	trace_xfs_alloc_busysearch(mp, agno, bno, len, !!match);
+	xfs_perag_put(pag);
+	return match;
+}
+
+void
+xfs_alloc_clear_busy(
+	struct xfs_mount	*mp,
+	struct xfs_busy_extent	*busyp)
+{
+	struct xfs_perag	*pag;
+
+	trace_xfs_alloc_unbusy(mp, busyp->agno, busyp->bno,
+						busyp->length);
+
+	ASSERT(xfs_alloc_busy_search(mp, busyp->agno, busyp->bno,
+						busyp->length) == 1);
+
+	pag = xfs_perag_get(mp, busyp->agno);
+	spin_lock(&pag->pagb_lock);
+	rb_erase(&busyp->rb_node, &pag->pagb_tree);
+	spin_unlock(&pag->pagb_lock);
 	xfs_perag_put(pag);
-	trace_xfs_alloc_busysearch(tp->t_mountp, agno, bno, len, lsn);
 
-	/*
-	 * If a block was found, force the log through the LSN of the
-	 * transaction that freed the block
-	 */
-	if (lsn)
-		xfs_log_force_lsn(tp->t_mountp, lsn, XFS_LOG_SYNC);
+	kmem_free(busyp);
 }
diff --git a/fs/xfs/xfs_alloc.h b/fs/xfs/xfs_alloc.h
index 599bffa..3b88a6b 100644
--- a/fs/xfs/xfs_alloc.h
+++ b/fs/xfs/xfs_alloc.h
@@ -22,6 +22,7 @@ struct xfs_buf;
 struct xfs_mount;
 struct xfs_perag;
 struct xfs_trans;
+struct xfs_busy_extent;
 
 /*
  * Freespace allocation types.  Argument to xfs_alloc_[v]extent.
@@ -125,9 +126,7 @@ xfs_alloc_mark_busy(xfs_trans_t *tp,
 		xfs_extlen_t len);
 
 void
-xfs_alloc_clear_busy(xfs_trans_t *tp,
-		xfs_agnumber_t ag,
-		int idx);
+xfs_alloc_clear_busy(struct xfs_mount *mp, struct xfs_busy_extent *busyp);
 
 #endif	/* __KERNEL__ */
 
diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c
index 390850e..f9e1c2d 100644
--- a/fs/xfs/xfs_filestream.c
+++ b/fs/xfs/xfs_filestream.c
@@ -23,6 +23,7 @@
 #include "xfs_attr_sf.h"
 #include "xfs_dinode.h"
 #include "xfs_inode.h"
+#include "xfs_log.h"
 #include "xfs_ag.h"
 #include "xfs_dmapi.h"
 #include "xfs_log.h"
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 229d1f3..b8579b7 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -18,6 +18,8 @@
 #ifndef	__XFS_LOG_H__
 #define __XFS_LOG_H__
 
+typedef __uint32_t xlog_tid_t;
+
 /* get lsn fields */
 
 #define CYCLE_LSN(lsn) ((uint)((lsn)>>32))
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 9cf6951..ac97bdd 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -152,8 +152,6 @@ static inline uint xlog_get_client_id(__be32 i)
 #define	XLOG_RECOVERY_NEEDED	0x4	/* log was recovered */
 #define XLOG_IO_ERROR		0x8	/* log hit an I/O error, and being
 					   shutdown */
-typedef __uint32_t xlog_tid_t;
-
 
 #ifdef __KERNEL__
 /*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index be578ec..da75ccd 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -44,6 +44,8 @@
 #include "xfs_trans_priv.h"
 #include "xfs_trans_space.h"
 #include "xfs_inode_item.h"
+#include "xfs_trace.h"
+#include "xfs_log_priv.h"
 
 kmem_zone_t	*xfs_trans_zone;
 
@@ -243,9 +245,8 @@ _xfs_trans_alloc(
 	tp->t_type = type;
 	tp->t_mountp = mp;
 	tp->t_items_free = XFS_LIC_NUM_SLOTS;
-	tp->t_busy_free = XFS_LBC_NUM_SLOTS;
 	xfs_lic_init(&(tp->t_items));
-	XFS_LBC_INIT(&(tp->t_busy));
+	INIT_LIST_HEAD(&tp->t_busy);
 	return tp;
 }
 
@@ -285,9 +286,8 @@ xfs_trans_dup(
 	ntp->t_type = tp->t_type;
 	ntp->t_mountp = tp->t_mountp;
 	ntp->t_items_free = XFS_LIC_NUM_SLOTS;
-	ntp->t_busy_free = XFS_LBC_NUM_SLOTS;
 	xfs_lic_init(&(ntp->t_items));
-	XFS_LBC_INIT(&(ntp->t_busy));
+	INIT_LIST_HEAD(&ntp->t_busy);
 
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
 	ASSERT(tp->t_ticket != NULL);
@@ -306,6 +306,13 @@ xfs_trans_dup(
 	return ntp;
 }
 
+xlog_tid_t
+xfs_trans_get_tid(
+	struct xfs_trans	*tp)
+{
+	return tp->t_ticket->t_tid;
+}
+
 /*
  * This is called to reserve free disk blocks and log space for the
  * given transaction.  This must be done before allocating any resources
@@ -423,6 +430,22 @@ undo_blocks:
 	return error;
 }
 
+/*
+ * xfs_trans_free_busy
+ * Free all of the busy extents from a transaction
+ */
+void
+xfs_trans_free_busy(
+	struct xfs_mount	*mp,
+	struct list_head	*busy_list)
+{
+	struct xfs_busy_extent	*busyp, *n;
+
+	list_for_each_entry_safe(busyp, n, busy_list, list) {
+		list_del_init(&busyp->list);
+		xfs_alloc_clear_busy(mp, busyp);
+	}
+}
 
 /*
  * Record the indicated change to the given field for application
@@ -930,26 +953,6 @@ xfs_trans_item_committed(
 	IOP_UNPIN(lip);
 }
 
-/* Clear all the per-AG busy list items listed in this transaction */
-static void
-xfs_trans_clear_busy_extents(
-	struct xfs_trans	*tp)
-{
-	xfs_log_busy_chunk_t	*lbcp;
-	xfs_log_busy_slot_t	*lbsp;
-	int			i;
-
-	for (lbcp = &tp->t_busy; lbcp != NULL; lbcp = lbcp->lbc_next) {
-		i = 0;
-		for (lbsp = lbcp->lbc_busy; i < lbcp->lbc_unused; i++, lbsp++) {
-			if (XFS_LBC_ISFREE(lbcp, i))
-				continue;
-			xfs_alloc_clear_busy(tp, lbsp->lbc_ag, lbsp->lbc_idx);
-		}
-	}
-	xfs_trans_free_busy(tp);
-}
-
 /*
  * This is typically called by the LM when a transaction has been fully
  * committed to disk.  It needs to unpin the items which have
@@ -984,7 +987,7 @@ xfs_trans_committed(
 		kmem_free(licp);
 	}
 
-	xfs_trans_clear_busy_extents(tp);
+	xfs_trans_free_busy(tp->t_mountp, &tp->t_busy);
 	xfs_trans_free(tp);
 }
 
@@ -1013,7 +1016,7 @@ xfs_trans_uncommit(
 	xfs_trans_unreserve_and_mod_dquots(tp);
 
 	xfs_trans_free_items(tp, flags);
-	xfs_trans_free_busy(tp);
+	xfs_trans_free_busy(tp->t_mountp, &tp->t_busy);
 	xfs_trans_free(tp);
 }
 
@@ -1075,6 +1078,8 @@ xfs_trans_commit_iclog(
 	*commit_lsn = xfs_log_done(mp, tp->t_ticket, &commit_iclog, log_flags);
 
 	tp->t_commit_lsn = *commit_lsn;
+	trace_xfs_trans_commit_lsn(tp);
+
 	if (nvec > XFS_TRANS_LOGVEC_COUNT)
 		kmem_free(log_vector);
 
@@ -1260,7 +1265,7 @@ out_unreserve:
 	}
 	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
 	xfs_trans_free_items(tp, error ? XFS_TRANS_ABORT : 0);
-	xfs_trans_free_busy(tp);
+	xfs_trans_free_busy(tp->t_mountp, &tp->t_busy);
 	xfs_trans_free(tp);
 
 	XFS_STATS_INC(xs_trans_empty);
@@ -1339,7 +1344,7 @@ xfs_trans_cancel(
 	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
 
 	xfs_trans_free_items(tp, flags);
-	xfs_trans_free_busy(tp);
+	xfs_trans_free_busy(tp->t_mountp, &tp->t_busy);
 	xfs_trans_free(tp);
 }
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index c62beee..0c4a1b5 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -813,6 +813,7 @@ struct xfs_log_item_desc;
 struct xfs_mount;
 struct xfs_trans;
 struct xfs_dquot_acct;
+struct xfs_busy_extent;
 
 typedef struct xfs_log_item {
 	struct list_head		li_ail;		/* AIL pointers */
@@ -872,34 +873,6 @@ typedef struct xfs_item_ops {
 #define XFS_ITEM_PUSHBUF	3
 
 /*
- * This structure is used to maintain a list of block ranges that have been
- * freed in the transaction.  The ranges are listed in the perag[] busy list
- * between when they're freed and the transaction is committed to disk.
- */
-
-typedef struct xfs_log_busy_slot {
-	xfs_agnumber_t		lbc_ag;
-	ushort			lbc_idx;	/* index in perag.busy[] */
-} xfs_log_busy_slot_t;
-
-#define XFS_LBC_NUM_SLOTS	31
-typedef struct xfs_log_busy_chunk {
-	struct xfs_log_busy_chunk	*lbc_next;
-	uint				lbc_free;	/* free slots bitmask */
-	ushort				lbc_unused;	/* first unused */
-	xfs_log_busy_slot_t		lbc_busy[XFS_LBC_NUM_SLOTS];
-} xfs_log_busy_chunk_t;
-
-#define	XFS_LBC_MAX_SLOT	(XFS_LBC_NUM_SLOTS - 1)
-#define	XFS_LBC_FREEMASK	((1U << XFS_LBC_NUM_SLOTS) - 1)
-
-#define	XFS_LBC_INIT(cp)	((cp)->lbc_free = XFS_LBC_FREEMASK)
-#define	XFS_LBC_CLAIM(cp, slot)	((cp)->lbc_free &= ~(1 << (slot)))
-#define	XFS_LBC_SLOT(cp, slot)	(&((cp)->lbc_busy[(slot)]))
-#define	XFS_LBC_VACANCY(cp)	(((cp)->lbc_free) & XFS_LBC_FREEMASK)
-#define	XFS_LBC_ISFREE(cp, slot) ((cp)->lbc_free & (1 << (slot)))
-
-/*
  * This is the type of function which can be given to xfs_trans_callback()
  * to be called upon the transaction's commit to disk.
  */
@@ -950,8 +923,7 @@ typedef struct xfs_trans {
 	unsigned int		t_items_free;	/* log item descs free */
 	xfs_log_item_chunk_t	t_items;	/* first log item desc chunk */
 	xfs_trans_header_t	t_header;	/* header for in-log trans */
-	unsigned int		t_busy_free;	/* busy descs free */
-	xfs_log_busy_chunk_t	t_busy;		/* busy/async free blocks */
+	struct list_head	t_busy;		/* list of busy extents */
 	unsigned long		t_pflags;	/* saved process flags state */
 } xfs_trans_t;
 
@@ -980,6 +952,7 @@ typedef struct xfs_trans {
 xfs_trans_t	*xfs_trans_alloc(struct xfs_mount *, uint);
 xfs_trans_t	*_xfs_trans_alloc(struct xfs_mount *, uint, uint);
 xfs_trans_t	*xfs_trans_dup(xfs_trans_t *);
+xlog_tid_t	xfs_trans_get_tid(struct xfs_trans *);
 int		xfs_trans_reserve(xfs_trans_t *, uint, uint, uint,
 				  uint, uint);
 void		xfs_trans_mod_sb(xfs_trans_t *, uint, int64_t);
@@ -1025,9 +998,6 @@ int		_xfs_trans_commit(xfs_trans_t *,
 void		xfs_trans_cancel(xfs_trans_t *, int);
 int		xfs_trans_ail_init(struct xfs_mount *);
 void		xfs_trans_ail_destroy(struct xfs_mount *);
-xfs_log_busy_slot_t *xfs_trans_add_busy(xfs_trans_t *tp,
-					xfs_agnumber_t ag,
-					xfs_extlen_t idx);
 
 extern kmem_zone_t	*xfs_trans_zone;
 
diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
index 27cce2a..1940eef 100644
--- a/fs/xfs/xfs_trans_extfree.c
+++ b/fs/xfs/xfs_trans_extfree.c
@@ -21,6 +21,7 @@
 #include "xfs_log.h"
 #include "xfs_inum.h"
 #include "xfs_trans.h"
+#include "xfs_log.h"
 #include "xfs_sb.h"
 #include "xfs_ag.h"
 #include "xfs_dmapi.h"
diff --git a/fs/xfs/xfs_trans_item.c b/fs/xfs/xfs_trans_item.c
index eb3fc57..2937a1e 100644
--- a/fs/xfs/xfs_trans_item.c
+++ b/fs/xfs/xfs_trans_item.c
@@ -438,112 +438,3 @@ xfs_trans_unlock_chunk(
 
 	return freed;
 }
-
-
-/*
- * This is called to add the given busy item to the transaction's
- * list of busy items.  It must find a free busy item descriptor
- * or allocate a new one and add the item to that descriptor.
- * The function returns a pointer to busy descriptor used to point
- * to the new busy entry.  The log busy entry will now point to its new
- * descriptor with its ???? field.
- */
-xfs_log_busy_slot_t *
-xfs_trans_add_busy(xfs_trans_t *tp, xfs_agnumber_t ag, xfs_extlen_t idx)
-{
-	xfs_log_busy_chunk_t	*lbcp;
-	xfs_log_busy_slot_t	*lbsp;
-	int			i=0;
-
-	/*
-	 * If there are no free descriptors, allocate a new chunk
-	 * of them and put it at the front of the chunk list.
-	 */
-	if (tp->t_busy_free == 0) {
-		lbcp = (xfs_log_busy_chunk_t*)
-		       kmem_alloc(sizeof(xfs_log_busy_chunk_t), KM_SLEEP);
-		ASSERT(lbcp != NULL);
-		/*
-		 * Initialize the chunk, and then
-		 * claim the first slot in the newly allocated chunk.
-		 */
-		XFS_LBC_INIT(lbcp);
-		XFS_LBC_CLAIM(lbcp, 0);
-		lbcp->lbc_unused = 1;
-		lbsp = XFS_LBC_SLOT(lbcp, 0);
-
-		/*
-		 * Link in the new chunk and update the free count.
-		 */
-		lbcp->lbc_next = tp->t_busy.lbc_next;
-		tp->t_busy.lbc_next = lbcp;
-		tp->t_busy_free = XFS_LIC_NUM_SLOTS - 1;
-
-		/*
-		 * Initialize the descriptor and the generic portion
-		 * of the log item.
-		 *
-		 * Point the new slot at this item and return it.
-		 * Also point the log item at its currently active
-		 * descriptor and set the item's mount pointer.
-		 */
-		lbsp->lbc_ag = ag;
-		lbsp->lbc_idx = idx;
-		return lbsp;
-	}
-
-	/*
-	 * Find the free descriptor. It is somewhere in the chunklist
-	 * of descriptors.
-	 */
-	lbcp = &tp->t_busy;
-	while (lbcp != NULL) {
-		if (XFS_LBC_VACANCY(lbcp)) {
-			if (lbcp->lbc_unused <= XFS_LBC_MAX_SLOT) {
-				i = lbcp->lbc_unused;
-				break;
-			} else {
-				/* out-of-order vacancy */
-				cmn_err(CE_DEBUG, "OOO vacancy lbcp 0x%p\n", lbcp);
-				ASSERT(0);
-			}
-		}
-		lbcp = lbcp->lbc_next;
-	}
-	ASSERT(lbcp != NULL);
-	/*
-	 * If we find a free descriptor, claim it,
-	 * initialize it, and return it.
-	 */
-	XFS_LBC_CLAIM(lbcp, i);
-	if (lbcp->lbc_unused <= i) {
-		lbcp->lbc_unused = i + 1;
-	}
-	lbsp = XFS_LBC_SLOT(lbcp, i);
-	tp->t_busy_free--;
-	lbsp->lbc_ag = ag;
-	lbsp->lbc_idx = idx;
-	return lbsp;
-}
-
-
-/*
- * xfs_trans_free_busy
- * Free all of the busy lists from a transaction
- */
-void
-xfs_trans_free_busy(xfs_trans_t *tp)
-{
-	xfs_log_busy_chunk_t	*lbcp;
-	xfs_log_busy_chunk_t	*lbcq;
-
-	lbcp = tp->t_busy.lbc_next;
-	while (lbcp != NULL) {
-		lbcq = lbcp->lbc_next;
-		kmem_free(lbcp);
-		lbcp = lbcq;
-	}
-
-	XFS_LBC_INIT(&tp->t_busy);
-	tp->t_busy.lbc_unused = 0;
-}
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 73e2ad3..6698db6 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -38,10 +38,8 @@ struct xfs_log_item_desc	*xfs_trans_next_item(struct xfs_trans *,
 void				xfs_trans_free_items(struct xfs_trans *, int);
 void				xfs_trans_unlock_items(struct xfs_trans *,
 							xfs_lsn_t);
-void				xfs_trans_free_busy(xfs_trans_t *tp);
-xfs_log_busy_slot_t		*xfs_trans_add_busy(xfs_trans_t *tp,
-						    xfs_agnumber_t ag,
-						    xfs_extlen_t idx);
+
+void	xfs_trans_free_busy(struct xfs_mount *mp, struct list_head *busy_list);
 
 /*
  * AIL traversal cursor.
-- 
1.5.6.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* Re: [PATCH] xfs: Improve scalability of busy extent tracking
  2010-04-21  5:47 [PATCH] xfs: Improve scalability of busy extent tracking Dave Chinner
@ 2010-04-22 11:01 ` Christoph Hellwig
  2010-04-22 16:16   ` Dave Chinner
  2010-04-22 11:07 ` Christoph Hellwig
  1 sibling, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2010-04-22 11:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Having the patches merged into one certainly helps with readability.
This version also passes xfsqa fine, whereas I had some problems with
it in the delayed logging tree.

> o busy extent transaction owner tracked by transaction id, not by
>   address of transaction structure. Tracking by address of the
>   transaction was triggering false ASSERT failures when searching for
>   busy extents with matching start block numbers.

I'm a bit confused by that.  How can we ever find an xfs_busy_extent
structure for a transaction that's been freed?  We always call
xfs_trans_free_busy before freeing the transaction, so this should
never happen.  And if it can happen we can also get a reuse of the tid,
even if that is much less likely.  So if there's a corner case I've
missed we really need to keep a pointer to the xlog_ticket and keep
a reference on it as long as we have busy extents belonging to it in
the rbtree.

Besides that here's a few cosmetic comments:

> -STATIC void
> -xfs_alloc_search_busy(xfs_trans_t *tp,
> -		    xfs_agnumber_t agno,
> -		    xfs_agblock_t bno,
> -		    xfs_extlen_t len);
> +static int
> +xfs_alloc_busy_search(struct xfs_mount *mp, xfs_agnumber_t agno,
> +		    xfs_agblock_t bno, xfs_extlen_t len);

The switch in naming convention looks good to me, but it should
also be applied to xfs_alloc_clear_busy (-> xfs_alloc_busy_clear?)
and xfs_alloc_mark_busy (-> xfs_alloc_busy_add/insert?)

> +xfs_alloc_mark_busy(
> +	struct xfs_trans	*tp,
> +	xfs_agnumber_t		agno,
> +	xfs_agblock_t		bno,
> +	xfs_extlen_t		len)
>  {
> +	struct xfs_busy_extent	*busyp;
>  
> +	busyp = kmem_zalloc(sizeof(struct xfs_busy_extent), KM_MAYFAIL);
> +	if (!busyp) {
> +		/*
> +		 * No Memory!  Since it is now not possible to track the free
> +		 * block, make this a synchronous transaction to insure that
> +		 * the block is not reused before this transaction commits.
> +		 */
> +		trace_xfs_alloc_busy(tp, agno, bno, len, 1);
> +		xfs_trans_set_sync(tp);
> +		return;
>  	}
>  
> +	busyp->agno = agno;
> +	busyp->bno = bno;
> +	busyp->length = len;
> +	busyp->tid = xfs_trans_get_tid(tp);
> +	INIT_LIST_HEAD(&busyp->list);
>  
> +	/* trace before insert to be able to see failed inserts */
> +	trace_xfs_alloc_busy(tp, agno, bno, len, 0);
> +	xfs_alloc_busy_insert(tp, busyp);
> +}

I'd rather inline xfs_alloc_busy_insert into this function; there's no
good reason to keep it separate.

> -	xfs_trans_clear_busy_extents(tp);
> +	xfs_trans_free_busy(tp->t_mountp, &tp->t_busy);
>  	xfs_trans_free(tp);
>  }
>  
> @@ -1013,7 +1016,7 @@ xfs_trans_uncommit(
>  	xfs_trans_unreserve_and_mod_dquots(tp);
>  
>  	xfs_trans_free_items(tp, flags);
> -	xfs_trans_free_busy(tp);
> +	xfs_trans_free_busy(tp->t_mountp, &tp->t_busy);

There's no real need for the prototype change at this point.  I know
you'll need it for the delayed logging, but let's keep the prototype
change in that patchset.


* Re: [PATCH] xfs: Improve scalability of busy extent tracking
  2010-04-21  5:47 [PATCH] xfs: Improve scalability of busy extent tracking Dave Chinner
  2010-04-22 11:01 ` Christoph Hellwig
@ 2010-04-22 11:07 ` Christoph Hellwig
  2010-04-22 17:10   ` Christoph Hellwig
  1 sibling, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2010-04-22 11:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Oh, and while I'm at it:

I'd move the list_del_init(&busyp->list); from xfs_trans_free_busy into
xfs_alloc_clear_busy; we already do the list insertion inside
xfs_alloc_busy_insert, so that makes it symmetric.


* Re: [PATCH] xfs: Improve scalability of busy extent tracking
  2010-04-22 11:01 ` Christoph Hellwig
@ 2010-04-22 16:16   ` Dave Chinner
  2010-04-22 17:08     ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-04-22 16:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, Apr 22, 2010 at 07:01:43AM -0400, Christoph Hellwig wrote:
> Having the patches merged into one certainly helps with readability.
> This version also passes xfsqa fine, whereas I had some problems with
> it in the delayed logging tree.
> 
> > o busy extent transaction owner tracked by transaction id, not by
> >   address of transaction structure. Tracking by address of the
> >   transaction was triggering false ASSERT failures when searching for
> >   busy extents with matching start block numbers.
> 
> I'm a bit confused by that.  How can we ever find an xfs_busy_extent
> structure for a transaction that's been freed?  We always call
> xfs_trans_free_busy before freeing the transaction, so this should
> never happen.

Yeah, I can't explain the root cause, either.  Short story: at the
time I did this I was getting desperate - I'd been trying to work
out the failure you reported for three days and had tried just
about everything I could think of. I was thinking that "I know this
can't happen but I'm going to rule it out, anyway." 

Long story:

I managed to reproduce the xfsqa failures you were seeing in the
previous version - I could only reproduce it on a single CPU VM
with full kernel preemption enabled. I could get it to either
trigger the ASSERT(sync transaction) or corrupt the rbtree.

I confirmed via gdb that when the ASSERT fired, the transaction
pointed to by the busy extent matched the address of the transaction
passed in, but it did not look like the right transaction in any
way: no tracked busy extents, not marked sync, very few items
attached, and it was usually the first allocation in the transaction.
I turned on slab poisoning, the kmemleak detector and so on, and
none of them fired to indicate a use-after-free or anything like
that. If I looked at the rbtree at this time, it was almost always
corrupted in some way.

Then there was the fact that none of the debug I put in to verify
the rbtree structure ever caught the corruption at the moment it
occurred: I only ever caught a corrupted tree before an operation
was done, never after one.  On top of that, the busy extent tracing
only ever showed valid operation sequences leading up to the
corruption being detected, and indicated that the busy extent should
not have been in the tree whenever the ASSERT fired.

Basically, I could only conclude that the tree was being corrupted
by some outside event. Every time I triggered the problem, I ended
up seeing exactly the same sort of corruption, but never any new
information that would lead me to what corrupted the tree.

This is where I started on trying things that couldn't happen.
Reference counting transactions was the first thing I tried, and
that didn't fix the problem. Then I randomised the transaction IDs
and switched to tracking them to avoid the possibility that
somewhere, somehow we incorrectly matched a re-allocated
transaction.

To my surprise that made all the failures go away. I can't corrupt
the rbtree, I can't trigger the sync transaction assert, none of the
debug I added to verify the rbtree found any problems any more, etc,
and no matter how much stress or how many different workloads I test
against, I cannot get the problem to trip again on any sort of
configuration I can test here.

Hence I delayed sending this out again because I'm still clueless as
to the root cause and I've been hoping I'd be able to see a new
problem with it. But I haven't seen any problems, with or without
delayed logging, so wider testing is the only way I'm going to
uncover any lingering problem. In a way, I'm kind of disappointed
that it fixed the problem you reported...

> And if it can happen we can also get a reuse of the tid,
> even if it is much more likely.  So if there's a corner case I've
> missed we really need to keep a pointer to the xlog_ticket and keep
> a reference on it as long as we have busy extents belonging to it in
> the rbtree.

Like I said, I tried that and it didn't help.

> Besides that here's a few cosmetic comments:
> 
> > -STATIC void
> > -xfs_alloc_search_busy(xfs_trans_t *tp,
> > -		    xfs_agnumber_t agno,
> > -		    xfs_agblock_t bno,
> > -		    xfs_extlen_t len);
> > +static int
> > +xfs_alloc_busy_search(struct xfs_mount *mp, xfs_agnumber_t agno,
> > +		    xfs_agblock_t bno, xfs_extlen_t len);
> 
> The switch in naming convention looks good to me, but it should
> also be applied to xfs_alloc_clear_busy (-> xfs_alloc_busy_clear?)
> and xfs_alloc_mark_busy (-> xfs_alloc_busy_add/insert?)

Yeah, I can do that.

> > +xfs_alloc_mark_busy(
> > +	struct xfs_trans	*tp,
> > +	xfs_agnumber_t		agno,
> > +	xfs_agblock_t		bno,
> > +	xfs_extlen_t		len)
> >  {
> > +	struct xfs_busy_extent	*busyp;
> >  
> > +	busyp = kmem_zalloc(sizeof(struct xfs_busy_extent), KM_MAYFAIL);
> > +	if (!busyp) {
> > +		/*
> > +		 * No Memory!  Since it is now not possible to track the free
> > +		 * block, make this a synchronous transaction to insure that
> > +		 * the block is not reused before this transaction commits.
> > +		 */
> > +		trace_xfs_alloc_busy(tp, agno, bno, len, 1);
> > +		xfs_trans_set_sync(tp);
> > +		return;
> >  	}
> >  
> > +	busyp->agno = agno;
> > +	busyp->bno = bno;
> > +	busyp->length = len;
> > +	busyp->tid = xfs_trans_get_tid(tp);
> > +	INIT_LIST_HEAD(&busyp->list);
> >  
> > +	/* trace before insert to be able to see failed inserts */
> > +	trace_xfs_alloc_busy(tp, agno, bno, len, 0);
> > +	xfs_alloc_busy_insert(tp, busyp);
> > +}
> 
> I'd rather inline xfs_alloc_busy_insert into this function; there's no
> good reason to keep it separate.

OK.

> > -	xfs_trans_clear_busy_extents(tp);
> > +	xfs_trans_free_busy(tp->t_mountp, &tp->t_busy);
> >  	xfs_trans_free(tp);
> >  }
> >  
> > @@ -1013,7 +1016,7 @@ xfs_trans_uncommit(
> >  	xfs_trans_unreserve_and_mod_dquots(tp);
> >  
> >  	xfs_trans_free_items(tp, flags);
> > -	xfs_trans_free_busy(tp);
> > +	xfs_trans_free_busy(tp->t_mountp, &tp->t_busy);
> 
> There's no real need for the prototype change at this point.  I know
> you'll need it for the delayed logging, but let's keep the prototype
> change in that patchset.

Ok. will fix.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] xfs: Improve scalability of busy extent tracking
  2010-04-22 16:16   ` Dave Chinner
@ 2010-04-22 17:08     ` Christoph Hellwig
  2010-04-23  3:24       ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2010-04-22 17:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

Been looking at this a bit and I have a theory:

 - a tid is not actually unique to an xfs_trans structure: if
   we call xfs_trans_dup, a single xlog_ticket, and with it the
   tid, is re-used by multiple transaction structures.
 - because of that, the major semantic change in this version vs
   the previous one is that we no longer force a synchronous
   transaction in the case where we re-use a block in
   the rolled-over transaction.

Still not quite sure about the implications of this.


* Re: [PATCH] xfs: Improve scalability of busy extent tracking
  2010-04-22 11:07 ` Christoph Hellwig
@ 2010-04-22 17:10   ` Christoph Hellwig
  2010-04-23  3:25     ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2010-04-22 17:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Apr 22, 2010 at 07:07:58AM -0400, Christoph Hellwig wrote:
> Oh, and while I'm at it:
> 
> > I'd move the list_del_init(&busyp->list); from xfs_trans_free_busy into
> > xfs_alloc_clear_busy; we already do the list insertion inside
> > xfs_alloc_busy_insert, so that makes it symmetric.

Or just merge the loop calling xfs_alloc_clear_busy directly in
xfs_trans_free, ala:


Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2010-04-22 18:59:11.031004018 +0200
+++ xfs/fs/xfs/xfs_alloc.c	2010-04-22 18:59:38.639005695 +0200
@@ -2667,6 +2667,8 @@ xfs_alloc_clear_busy(
 	ASSERT(xfs_alloc_busy_search(mp, busyp->agno, busyp->bno,
 						busyp->length) == 1);
 
+	list_del_init(&busyp->list);
+
 	pag = xfs_perag_get(mp, busyp->agno);
 	spin_lock(&pag->pagb_lock);
 	rb_erase(&busyp->rb_node, &pag->pagb_tree);
Index: xfs/fs/xfs/xfs_trans.c
===================================================================
--- xfs.orig/fs/xfs/xfs_trans.c	2010-04-22 18:59:11.047273888 +0200
+++ xfs/fs/xfs/xfs_trans.c	2010-04-22 19:03:53.613255663 +0200
@@ -620,8 +620,13 @@ _xfs_trans_alloc(
  */
 STATIC void
 xfs_trans_free(
-	xfs_trans_t	*tp)
+	struct xfs_trans	*tp)
 {
+	struct xfs_busy_extent	*busyp, *n;
+
+	list_for_each_entry_safe(busyp, n, &tp->t_busy, list)
+		xfs_alloc_clear_busy(tp->t_mountp, busyp);
+
 	atomic_dec(&tp->t_mountp->m_active_trans);
 	xfs_trans_free_dqinfo(tp);
 	kmem_zone_free(xfs_trans_zone, tp);
@@ -795,23 +800,6 @@ undo_blocks:
 }
 
 /*
- * xfs_trans_free_busy
- * Free all of the busy extents from a transaction
- */
-void
-xfs_trans_free_busy(
-	struct xfs_mount	*mp,
-	struct list_head	*busy_list)
-{
-	struct xfs_busy_extent	*busyp, *n;
-
-	list_for_each_entry_safe(busyp, n, busy_list, list) {
-		list_del_init(&busyp->list);
-		xfs_alloc_clear_busy(mp, busyp);
-	}
-}
-
-/*
  * Record the indicated change to the given field for application
  * to the file system's superblock when the transaction commits.
  * For now, just store the change in the transaction structure.
@@ -1351,7 +1339,6 @@ xfs_trans_committed(
 		kmem_free(licp);
 	}
 
-	xfs_trans_free_busy(tp->t_mountp, &tp->t_busy);
 	xfs_trans_free(tp);
 }
 
@@ -1380,7 +1367,6 @@ xfs_trans_uncommit(
 	xfs_trans_unreserve_and_mod_dquots(tp);
 
 	xfs_trans_free_items(tp, flags);
-	xfs_trans_free_busy(tp->t_mountp, &tp->t_busy);
 	xfs_trans_free(tp);
 }
 
@@ -1629,7 +1615,6 @@ out_unreserve:
 	}
 	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
 	xfs_trans_free_items(tp, error ? XFS_TRANS_ABORT : 0);
-	xfs_trans_free_busy(tp->t_mountp, &tp->t_busy);
 	xfs_trans_free(tp);
 
 	XFS_STATS_INC(xs_trans_empty);
@@ -1708,7 +1693,6 @@ xfs_trans_cancel(
 	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
 
 	xfs_trans_free_items(tp, flags);
-	xfs_trans_free_busy(tp->t_mountp, &tp->t_busy);
 	xfs_trans_free(tp);
 }
 
Index: xfs/fs/xfs/xfs_trans_priv.h
===================================================================
--- xfs.orig/fs/xfs/xfs_trans_priv.h	2010-04-22 18:59:59.745004088 +0200
+++ xfs/fs/xfs/xfs_trans_priv.h	2010-04-22 19:00:03.734255103 +0200
@@ -39,8 +39,6 @@ void				xfs_trans_free_items(struct xfs_
 void				xfs_trans_unlock_items(struct xfs_trans *,
 							xfs_lsn_t);
 
-void	xfs_trans_free_busy(struct xfs_mount *mp, struct list_head *busy_list);
-
 /*
  * AIL traversal cursor.
  *


* Re: [PATCH] xfs: Improve scalability of busy extent tracking
  2010-04-22 17:08     ` Christoph Hellwig
@ 2010-04-23  3:24       ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2010-04-23  3:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, Apr 22, 2010 at 01:08:49PM -0400, Christoph Hellwig wrote:
> Been looking at this a bit and I have a theory:
> 
>  - a tid is not actually unique to an xfs_trans structure: if
>    we call xfs_trans_dup, a single xlog_ticket, and with it the
>    tid, is re-used by multiple transaction structures.

Good point, and an angle I had missed.

>  - because of that, the major semantic change in this version vs
>    the previous one is that we no longer force a synchronous
>    transaction in the case where we re-use a block in
>    the rolled-over transaction.

The problem case is not re-allocation of the busy extent, it's the
subsequent freeing of an extent that is already busy, i.e. we've
done "free - alloc - free" before the transaction containing the
alloc has hit the disk and cleared the original busy extent from the
tree.

In the single transaction case, the alloc should mark the transaction
synchronous, so the second free should match the tid and assert that
it is synchronous.

The multiple transaction case can be split up:

	{free} {alloc} {free}: mismatched tids -> log force
	{free - alloc} {free}: mismatched tids -> log force
	{free} {alloc - free}: same as the single transaction case

But as you point out, the multiple transaction case could have
duplicate tids, so the first two cases could be getting it wrong.
However, if the tids match, then the ASSERT should fire if the
transactions were not synchronous.

Hence all I can conclude is that xfsqa is not triggering the first
two cases from duplicated transactions, but we need to change
tid of the log ticket when we dup transactions.

FWIW, now that you've pointed this out, this appears to be another
existing source of duplicate TIDs in the log that I did not identify
when I first saw them.  It seems to me that the best approach to
fixing this is to regenerate the ticket tid when we regrant the log
space on a permanent log ticket, so that the individual components
of rolling transactions are not confused in log replay. Does this
seem reasonable?

Doing this would handle the above two cases correctly as well.
Despite this, I still can't see how such a scenario would cause the
problems with transaction pointer matching, because the "same
transaction pointer but different transactions" situation still
can't occur even with duplicated transactions....

> Still not quite sure about the implications of this.

I'll change it and run some tests. It'll be next week before I can
get to this, though...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] xfs: Improve scalability of busy extent tracking
  2010-04-22 17:10   ` Christoph Hellwig
@ 2010-04-23  3:25     ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2010-04-23  3:25 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, Apr 22, 2010 at 01:10:35PM -0400, Christoph Hellwig wrote:
> On Thu, Apr 22, 2010 at 07:07:58AM -0400, Christoph Hellwig wrote:
> > Oh, and while I'm at it:
> > 
> > I'd move the list_del_init(&busyp->list); from xfs_trans_free_busy into
> > xfs_alloc_clear_busy; we already do the list insertion inside
> > xfs_alloc_busy_insert, so that makes it symmetric.
> 
> Or just merge the loop calling xfs_alloc_clear_busy directly in
> xfs_trans_free, ala:

Yeah, that looks like a sane thing to do. I'll update it this way.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

