public inbox for linux-xfs@vger.kernel.org
* [PATCH 00/16] xfs: current patch stack for 2.6.38 window
@ 2010-11-08  8:55 Dave Chinner
  2010-11-08  8:55 ` [PATCH 01/16] xfs: fix per-ag reference counting in inode reclaim tree walking Dave Chinner
                   ` (16 more replies)
  0 siblings, 17 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

Folks,

FYI, here is my current XFS patch stack that I'll be trying to get ready in
time for the 2.6.38 merge window.  Note that the first two patches are
candidates for 2.6.37-rc. They are a perag reference counting fix and the
movement of a trace point.

My tree is currently based on the VFS locking changes I have out for review,
so there are a couple of patches that won't apply sanely to a mainline or OSS xfs
dev tree. See below for a pointer to a git tree with all the patches in it.

First patch is a per-cpu superblock counter rewrite. This uses the generic
per-cpu counter infrastructure to do the heavy lifting. It still needs to be
split into two patches.
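The key operation the generic counter code gains here is an "add unless the result would drop below a threshold" primitive for ENOSPC detection. A minimal single-threaded userspace sketch of those semantics (the real kernel version distributes the count per-CPU and only falls back to an accurate sum near the threshold; all names here are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of "add delta unless the counter would drop below
 * threshold". The kernel variant operates on a distributed per-cpu
 * counter; this sketch shows only the semantics.
 */
struct sketch_counter {
	int64_t count;
};

/* Returns 0 and applies the delta, or -1 (ENOSPC-like) leaving it alone. */
static int sketch_add_unless_lt(struct sketch_counter *c, int64_t delta,
				int64_t threshold)
{
	if (c->count + delta < threshold)
		return -1;
	c->count += delta;
	return 0;
}
```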

Following this are the dynamic speculative allocation patches. These have been
rewritten to be based on the current inode size rather than a thumb-in-the-air
how-many-preallocs-have-we-already-done algorithm. The second patch also fixes
some assumptions about ip->i_delayed_blks being zero after a flush.
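To illustrate the sizing change: a size-based scheme reserves space proportional to how big the file already is, clamped to sane bounds, so the reservation grows with the file. This is only a sketch of that idea; the names and bounds are assumptions, not the actual XFS code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch of sizing speculative EOF preallocation from the
 * current inode size: preallocate roughly the file's current size,
 * clamped between a floor and a ceiling.
 */
static uint64_t sketch_eof_prealloc(uint64_t isize,
				    uint64_t min_alloc, uint64_t max_alloc)
{
	uint64_t want = isize;	/* grow the file by ~its own size */

	if (want < min_alloc)
		want = min_alloc;
	if (want > max_alloc)
		want = max_alloc;
	return want;
}
```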

Next up we have the inode cache RCU freeing and lookup patches, including one
that avoids putting the inode in the VFS hash (similar to Christoph's patch,
but using different VFS code).
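RCU-freeing the inodes means a lockless lookup can race with reclaim, so the lookup has to find a candidate and then re-validate it under the inode's lock before taking a reference. A single-threaded userspace model of that pattern (all names illustrative; in the kernel the lookup walks the per-ag radix tree under rcu_read_lock()):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the lookup/validate pattern an RCU-freed inode cache needs:
 * after finding a candidate, re-check a "being reclaimed" flag under
 * the inode's lock before taking a reference.
 */
struct sketch_inode {
	unsigned long	ino;
	int		reclaiming;	/* set by reclaim before freeing */
	int		refcount;
};

static struct sketch_inode *
sketch_iget(struct sketch_inode **cache, int n, unsigned long ino)
{
	for (int i = 0; i < n; i++) {
		struct sketch_inode *ip = cache[i];

		if (!ip || ip->ino != ino)
			continue;
		/* in the kernel: take ip's spin lock here */
		if (ip->reclaiming)
			return NULL;	/* lost the race; caller retries */
		ip->refcount++;
		return ip;
	}
	return NULL;
}
```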

Then there are buffer cache reclaim changes. First is a per-buftarg shrinker
interface, followed by a lazily updated per-buftarg buffer LRU. Building on
this, the prioritised buffer reclaim hooks are connected up to ensure more
critical buffers are harder to reclaim.
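One simple way to make critical buffers harder to reclaim, sketched below, is to give each buffer a reclaim grace count that a shrinker pass decrements, freeing the buffer only when it reaches zero, so higher-priority buffers survive more passes. This is an illustrative model of the idea, not the patch itself:

```c
#include <assert.h>

/*
 * Sketch of prioritised LRU reclaim: each buffer carries a grace count;
 * a shrinker pass decrements it and only frees the buffer once it hits
 * zero, so more critical buffers (higher count) survive more passes.
 */
struct sketch_buf {
	int lru_ref;	/* reclaim grace count, set at buffer setup */
};

/* Returns 1 if the buffer should be freed on this shrinker pass. */
static int sketch_shrink_one(struct sketch_buf *bp)
{
	if (bp->lru_ref > 1) {
		bp->lru_ref--;
		return 0;
	}
	bp->lru_ref = 0;
	return 1;
}
```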

AIL lock contention fixes are next, with bulk AIL insert and removal functions
being implemented and connected up to the transaction commit and inode buffer
IO completion routines. These significantly reduce AIL lock contention, and
combined with a reduction in the granularity of xfsaild push wakeups, the AIL
lock drops out of the "top 10" contended locks on 8-way workloads.
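The win from bulk insertion is amortisation: one AIL lock round trip per batch of log items instead of one per item. A toy model with a sorted array standing in for the LSN-ordered AIL (names illustrative, not the xfs_trans_ail API):

```c
#include <assert.h>

/*
 * Sketch of bulk AIL insertion: the caller gathers a batch of log items
 * and splices them into the LSN-sorted AIL under a single lock round
 * trip, instead of locking once per item.
 */
#define AIL_MAX 64

struct sketch_ail {
	int lsn[AIL_MAX];
	int count;
	int lock_acquisitions;	/* what the bulk API is minimising */
};

static void sketch_ail_insert_bulk(struct sketch_ail *ail,
				   const int *items, int n)
{
	ail->lock_acquisitions++;	/* one lock for the whole batch */
	for (int i = 0; i < n; i++) {
		int j = ail->count++;

		/* keep the AIL sorted by LSN */
		while (j > 0 && ail->lsn[j - 1] > items[i]) {
			ail->lsn[j] = ail->lsn[j - 1];
			j--;
		}
		ail->lsn[j] = items[i];
	}
}
```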

There's a fix to stop error injection from burning CPU on debug kernels - with
a badly fragmented freespace tree, the btree block validation was taking ~60%
of the CPU time, with most of that running error injection checks.
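The shape of that fix is to gate the expensive per-block test on a cheap "is any injection armed at all?" check, so the common no-injection case costs a single flag read. A hedged sketch of that fast-path gate (names are made up for illustration):

```c
#include <assert.h>

/*
 * Sketch of gating a debug-only error injection test on whether any
 * injection is actually configured: the common case is one flag read.
 */
static int sketch_errortag_armed;	/* 0 = no injection configured */

static int sketch_error_test(int tag)
{
	if (!sketch_errortag_armed)	/* fast path: nothing armed */
		return 0;
	/* slow path: check the configured injection point(s) */
	return tag == sketch_errortag_armed;
}
```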

Finally, there's a patch to split up the log grant lock. This needs splitting
into 4 or 5 smaller patches (as you can see from the commit log, it was
originally one big change). It splits the grant lock into two list locks
(reserve and write queues), and converts all the other variables that the
grant lock protected into atomic variables. Grant head calculations are made
atomic by converting them into 64 bit "LSNs" and using cmpxchg loops on atomic
64 bit variables. All log tail and sync LSN updates are made atomic via
conversion to atomic variables.
With this, the grant lock goes away completely, and the transaction reserve
fast path now only has two cmpxchg loops instead of a heavily contended spin
lock.
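The cmpxchg-loop idea can be shown in userspace with C11 atomics: pack the grant head's cycle and byte offset into one 64-bit word (here, cycle in the high 32 bits and offset in the low 32 - an illustrative layout, not the exact xlog representation) and advance it with a compare-exchange loop instead of a spin lock:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Sketch of a lock-free grant head update: the head is one 64-bit
 * value (cycle << 32 | byte offset) advanced with a cmpxchg loop.
 */
static void sketch_grant_add(_Atomic uint64_t *head, uint32_t bytes,
			     uint32_t log_size)
{
	uint64_t old, new;

	old = atomic_load(head);
	do {
		uint32_t cycle = old >> 32;
		uint32_t space = (uint32_t)old + bytes;

		if (space >= log_size) {	/* wrap to the next cycle */
			space -= log_size;
			cycle++;
		}
		new = ((uint64_t)cycle << 32) | space;
		/* on failure, 'old' is reloaded and we recompute */
	} while (!atomic_compare_exchange_weak(head, &old, new));
}
```

Two of these loops on the transaction reserve fast path replace the heavily contended spin lock.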

The result of all this is raw cpu bound 8-way create performance of just over
100,000 inodes/s, and unlink performance of over 90,000 inodes/s. 8-way dbench
performance is improved from ~1150MB/s to ~1650MB/s by this patchset.

For 8-way creation and unlink of small files (~50 million), the lockstat
profiles look like:


				contended	total		Lock
		Lock		acquisitions  acquisitions	Description
-----------------------------   -----------  ------------	-------------------
           inode_wb_list_lock:    496330785    836287347	VFS
                  dcache_lock:    116299583    681450027	VFS
        &(&vblk->lock)->rlock:     52829329    131054495	virtio block device
    &sb->s_type->i_lock_key#1:     41772196   2375571240	VFS (inode->i_lock)
  &(&cil->xc_cil_lock)->rlock:     29549897    410553961	XFS (CIL commit lock)
         &irq_desc_lock_class:     27520142     63908701	IRQ edge lock
 &(&pag->pag_buf_lock)->rlock:     11756249   1838039685	XFS (buffer cache lock)
    &(&dentry->d_lock)->rlock:      5735657   1225028487	VFS
 &(&parent->list_lock)->rlock:      4356293    249408696	VM (SLAB list lock)
           inode_sb_list_lock:      3616366    203712449	VFS
                        key#5:      2075310    139221312	XFS SB percpu counter
              inode_hash_lock:      1529969    102359626	VFS
             rcu_node_level_0:      1363470     13730113	RCU
        &(&zone->lock)->rlock:      1247467     16469316	VM (free list lock)
 &(&pag->pag_ici_lock)->rlock:       770880    337090972	XFS (inode cache lock)
                    &rq->lock:       589111    184220946	Scheduler
               inode_lru_lock:       527163    102791204	VFS
g->l_grant_write_lock)->rlock:       526471     51279626	XFS (grant write lock)
    &(&pag->pagb_lock)->rlock:       402878    208861744	XFS (busy extent list)
    &(&zone->lru_lock)->rlock:       167692     25383748	VM (page cache LRU)
              &on_slab_l3_key:       166183     58470153	VM (slab cache)
            semaphore->lock#2:       161321   3659173925	???
     &(&ailp->xa_lock)->rlock:       143859    164470123	XFS (AIL lock)
          &cil->xc_ctx_lock-W:        32850       173279	XFS (CIL push lock)
          &cil->xc_ctx_lock-R:        90868    357572724	XFS (CIL push lock)

I've yet to determine whether I'll have time to finish removing the page cache
from the buffer cache - for pure inode create/unlink workloads the buftarg
mapping tree lock is the second most heavily contended lock in the system.
Hence this definitely needs solving in some way or another....

Anyway, comments are welcome - just keep in mind that there is still some
polish required for these patches. ;)

If you want the git version, everything is here:

  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git working

Dave Chinner (16):
      xfs: fix per-ag reference counting in inode reclaim tree walking
      xfs: move delayed write buffer trace
      [RFC] xfs: use generic per-cpu counter infrastructure
      xfs: dynamic speculative EOF preallocation
      xfs: don't truncate prealloc from frequently accessed inodes
      patch xfs-inode-hash-fake
      xfs: convert inode cache lookups to use RCU locking
      xfs: convert pag_ici_lock to a spin lock
      xfs: convert xfsbud shrinker to a per-buftarg shrinker.
      xfs: add a lru to the XFS buffer cache
      xfs: connect up buffer reclaim priority hooks
      xfs: bulk AIL insertion during transaction commit
      xfs: reduce the number of AIL push wakeups
      xfs: remove all the inodes on a buffer from the AIL in bulk
      xfs: only run xfs_error_test if error injection is active
      xfs: make xlog_space_left() independent of the grant lock

 fs/xfs/linux-2.6/xfs_buf.c     |  239 ++++++++----
 fs/xfs/linux-2.6/xfs_buf.h     |   43 ++-
 fs/xfs/linux-2.6/xfs_iops.c    |   11 +-
 fs/xfs/linux-2.6/xfs_linux.h   |    9 -
 fs/xfs/linux-2.6/xfs_super.c   |   22 +-
 fs/xfs/linux-2.6/xfs_sync.c    |   28 +-
 fs/xfs/linux-2.6/xfs_trace.h   |   36 +-
 fs/xfs/quota/xfs_dquot.c       |    2 +-
 fs/xfs/quota/xfs_qm_syscalls.c |    3 +
 fs/xfs/xfs_ag.h                |    2 +-
 fs/xfs/xfs_alloc.c             |    4 +-
 fs/xfs/xfs_bmap.c              |    9 +-
 fs/xfs/xfs_btree.c             |   11 +-
 fs/xfs/xfs_buf_item.c          |   17 +-
 fs/xfs/xfs_da_btree.c          |    4 +-
 fs/xfs/xfs_dfrag.c             |   13 +
 fs/xfs/xfs_error.c             |    3 +
 fs/xfs/xfs_error.h             |    5 +-
 fs/xfs/xfs_extfree_item.c      |   85 +++--
 fs/xfs/xfs_extfree_item.h      |   12 +-
 fs/xfs/xfs_fsops.c             |    4 +-
 fs/xfs/xfs_ialloc.c            |    2 +-
 fs/xfs/xfs_iget.c              |   55 ++-
 fs/xfs/xfs_inode.c             |   24 +-
 fs/xfs/xfs_inode.h             |    1 +
 fs/xfs/xfs_inode_item.c        |  112 +++++-
 fs/xfs/xfs_iomap.c             |   53 ++-
 fs/xfs/xfs_log.c               |  678 +++++++++++++++++---------------
 fs/xfs/xfs_log_cil.c           |    9 +-
 fs/xfs/xfs_log_priv.h          |   40 ++-
 fs/xfs/xfs_log_recover.c       |   27 +-
 fs/xfs/xfs_mount.c             |  837 +++++++++++-----------------------------
 fs/xfs/xfs_mount.h             |   80 +---
 fs/xfs/xfs_trans.c             |   70 ++++-
 fs/xfs/xfs_trans.h             |    2 +-
 fs/xfs/xfs_trans_ail.c         |  189 ++++++++-
 fs/xfs/xfs_trans_extfree.c     |    4 +-
 fs/xfs/xfs_trans_priv.h        |   13 +-
 fs/xfs/xfs_vnodeops.c          |   61 ++-
 include/linux/percpu_counter.h |   16 +
 lib/percpu_counter.c           |   79 ++++
 41 files changed, 1593 insertions(+), 1321 deletions(-)

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 01/16] xfs: fix per-ag reference counting in inode reclaim tree walking
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08  9:23   ` Christoph Hellwig
  2010-11-08  8:55 ` [PATCH 02/16] xfs: move delayed write buffer trace Dave Chinner
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The walk fails to decrement the per-ag reference count when the
non-blocking walk fails to obtain the per-ag reclaim lock, leading
to an assert failure on debug kernels when unmounting a filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |    1 +
 fs/xfs/xfs_mount.c          |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 37d3325..afb0d7c 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -853,6 +853,7 @@ restart:
 		if (trylock) {
 			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
 				skipped++;
+				xfs_perag_put(pag);
 				continue;
 			}
 			first_index = pag->pag_ici_reclaim_cursor;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index b1498ab..19e9dfa 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -275,6 +275,7 @@ xfs_free_perag(
 		pag = radix_tree_delete(&mp->m_perag_tree, agno);
 		spin_unlock(&mp->m_perag_lock);
 		ASSERT(pag);
+		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
 	}
 }
-- 
1.7.2.3



* [PATCH 02/16] xfs: move delayed write buffer trace
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
  2010-11-08  8:55 ` [PATCH 01/16] xfs: fix per-ag reference counting in inode reclaim tree walking Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08  9:24   ` Christoph Hellwig
  2010-11-08  8:55 ` [PATCH 03/16] [RFC] xfs: use generic per-cpu counter infrastructure Dave Chinner
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The delayed write buffer split trace currently issues a trace for
every buffer it scans. These buffers are not necessarily queued for
delayed write. Indeed, when buffers are pinned, there can be
thousands of traces of buffers that aren't actually queued for
delayed write, and the ones that are get lost in the noise. Move the
trace point to record only buffers that are split out for IO to be
issued on.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 63fd2c0..aa1d353 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -1781,7 +1781,6 @@ xfs_buf_delwri_split(
 	INIT_LIST_HEAD(list);
 	spin_lock(dwlk);
 	list_for_each_entry_safe(bp, n, dwq, b_list) {
-		trace_xfs_buf_delwri_split(bp, _RET_IP_);
 		ASSERT(bp->b_flags & XBF_DELWRI);
 
 		if (!XFS_BUF_ISPINNED(bp) && !xfs_buf_cond_lock(bp)) {
@@ -1795,6 +1794,7 @@ xfs_buf_delwri_split(
 					 _XBF_RUN_QUEUES);
 			bp->b_flags |= XBF_WRITE;
 			list_move_tail(&bp->b_list, list);
+			trace_xfs_buf_delwri_split(bp, _RET_IP_);
 		} else
 			skipped++;
 	}
-- 
1.7.2.3



* [PATCH 03/16] [RFC] xfs: use generic per-cpu counter infrastructure
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
  2010-11-08  8:55 ` [PATCH 01/16] xfs: fix per-ag reference counting in inode reclaim tree walking Dave Chinner
  2010-11-08  8:55 ` [PATCH 02/16] xfs: move delayed write buffer trace Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08 12:13   ` Christoph Hellwig
  2010-11-08  8:55 ` [PATCH 04/16] xfs: dynamic speculative EOF preallocation Dave Chinner
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

XFS has a per-cpu counter implementation for in-core superblock
counters that pre-dated the generic implementation. It is complex
and baroque as it is tailored directly to the needs of ENOSPC
detection. Implement the complex accurate-compare-and-add
calculation in the generic per-cpu counter code and convert the
XFS counters to use the much simpler generic counter code.

Passes xfsqa on an SMP system.

Still to do:

	1. UP build and test.
	2. split into separate patches

For discussion:
	1. kill the no-per-cpu-counter mode?
	2. do we need a custom batch size?
	3. do we need to factor xfs_mod_sb_incore()?
	4. should all the readers just sum the counters themselves
	   and kill the wrappers?

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_linux.h   |    9 -
 fs/xfs/linux-2.6/xfs_super.c   |    4 +-
 fs/xfs/xfs_fsops.c             |    4 +-
 fs/xfs/xfs_mount.c             |  834 +++++++++++-----------------------------
 fs/xfs/xfs_mount.h             |   80 +---
 include/linux/percpu_counter.h |   16 +
 lib/percpu_counter.c           |   79 ++++
 7 files changed, 341 insertions(+), 685 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_linux.h b/fs/xfs/linux-2.6/xfs_linux.h
index 214ddd7..9fa4f2a 100644
--- a/fs/xfs/linux-2.6/xfs_linux.h
+++ b/fs/xfs/linux-2.6/xfs_linux.h
@@ -88,15 +88,6 @@
 #include <xfs_super.h>
 #include <xfs_buf.h>
 
-/*
- * Feature macros (disable/enable)
- */
-#ifdef CONFIG_SMP
-#define HAVE_PERCPU_SB	/* per cpu superblock counters are a 2.6 feature */
-#else
-#undef  HAVE_PERCPU_SB	/* per cpu superblock counters are a 2.6 feature */
-#endif
-
 #define irix_sgid_inherit	xfs_params.sgid_inherit.val
 #define irix_symlink_mode	xfs_params.symlink_mode.val
 #define xfs_panic_mask		xfs_params.panic_mask.val
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 53ab47f..fa789b7 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1230,9 +1230,9 @@ xfs_fs_statfs(
 	statp->f_fsid.val[0] = (u32)id;
 	statp->f_fsid.val[1] = (u32)(id >> 32);
 
-	xfs_icsb_sync_counters(mp, XFS_ICSB_LAZY_COUNT);
-
 	spin_lock(&mp->m_sb_lock);
+	xfs_icsb_sync_counters_locked(mp);
+
 	statp->f_bsize = sbp->sb_blocksize;
 	lsize = sbp->sb_logstart ? sbp->sb_logblocks : 0;
 	statp->f_blocks = sbp->sb_dblocks - lsize;
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index a7c116e..44ecf1b 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -478,7 +478,7 @@ xfs_fs_counts(
 	xfs_mount_t		*mp,
 	xfs_fsop_counts_t	*cnt)
 {
-	xfs_icsb_sync_counters(mp, XFS_ICSB_LAZY_COUNT);
+	xfs_icsb_sync_counters(mp);
 	spin_lock(&mp->m_sb_lock);
 	cnt->freedata = mp->m_sb.sb_fdblocks - XFS_ALLOC_SET_ASIDE(mp);
 	cnt->freertx = mp->m_sb.sb_frextents;
@@ -540,7 +540,7 @@ xfs_reserve_blocks(
 	 */
 retry:
 	spin_lock(&mp->m_sb_lock);
-	xfs_icsb_sync_counters_locked(mp, 0);
+	xfs_icsb_sync_counters_locked(mp);
 
 	/*
 	 * If our previous reservation was larger than the current value,
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 19e9dfa..0d9a030 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -46,19 +46,6 @@
 
 STATIC void	xfs_unmountfs_wait(xfs_mount_t *);
 
-
-#ifdef HAVE_PERCPU_SB
-STATIC void	xfs_icsb_balance_counter(xfs_mount_t *, xfs_sb_field_t,
-						int);
-STATIC void	xfs_icsb_balance_counter_locked(xfs_mount_t *, xfs_sb_field_t,
-						int);
-STATIC void	xfs_icsb_disable_counter(xfs_mount_t *, xfs_sb_field_t);
-#else
-
-#define xfs_icsb_balance_counter(mp, a, b)		do { } while (0)
-#define xfs_icsb_balance_counter_locked(mp, a, b)	do { } while (0)
-#endif
-
 static const struct {
 	short offset;
 	short type;	/* 0 = integer
@@ -280,6 +267,111 @@ xfs_free_perag(
 	}
 }
 
+
+/*
+ * Per-cpu incore superblock counters
+ *
+ * Simple concept, difficult implementation, now somewhat simplified by generic
+ * per-cpu counter support.  This provides distributed per cpu counters for
+ * contended fields (e.g.  free block count).
+ *
+ * Difficulties arise in that the incore sb is used for ENOSPC checking, and
+ * hence needs to be accurately read when we are running low on space. Hence We
+ * need to check against counter error bounds and determine how accurately to
+ * sum based on that metric. The percpu counters take care of this for us,
+ * so we only need to modify the fast path to handle per-cpu counter error
+ * cases.
+ */
+static inline int
+xfs_icsb_add(
+	struct xfs_mount	*mp,
+	int			counter,
+	int64_t			delta,
+	int64_t			threshold)
+{
+	int			ret;
+
+	ret = percpu_counter_add_unless_lt(&mp->m_icsb[counter], delta,
+								threshold);
+	if (ret < 0)
+		return -ENOSPC;
+	return 0;
+}
+
+static inline void
+xfs_icsb_set(
+	struct xfs_mount	*mp,
+	int			counter,
+	int64_t			value)
+{
+	percpu_counter_set(&mp->m_icsb[counter], value);
+}
+
+static inline int64_t
+xfs_icsb_sum(
+	struct xfs_mount	*mp,
+	int			counter)
+{
+	return percpu_counter_sum_positive(&mp->m_icsb[counter]);
+}
+
+static inline int64_t
+xfs_icsb_read(
+	struct xfs_mount	*mp,
+	int			counter)
+{
+	return percpu_counter_read_positive(&mp->m_icsb[counter]);
+}
+
+void
+xfs_icsb_reinit_counters(
+	struct xfs_mount	*mp)
+{
+	xfs_icsb_set(mp, XFS_ICSB_FDBLOCKS, mp->m_sb.sb_fdblocks);
+	xfs_icsb_set(mp, XFS_ICSB_IFREE, mp->m_sb.sb_ifree);
+	xfs_icsb_set(mp, XFS_ICSB_ICOUNT, mp->m_sb.sb_icount);
+}
+
+int
+xfs_icsb_init_counters(
+	struct xfs_mount	*mp)
+{
+	int			i;
+	int			error;
+
+	for (i = 0; i < XFS_ICSB_MAX; i++) {
+		error = percpu_counter_init(&mp->m_icsb[i], 0);
+		if (error)
+			goto out_error;
+	}
+	xfs_icsb_reinit_counters(mp);
+	return 0;
+
+out_error:
+	for (; i >= 0; i--)
+		percpu_counter_destroy(&mp->m_icsb[i]);
+	return error;
+}
+
+void
+xfs_icsb_destroy_counters(
+	xfs_mount_t	*mp)
+{
+	int		i;
+
+	for (i = 0; i < XFS_ICSB_MAX; i++)
+		percpu_counter_destroy(&mp->m_icsb[i]);
+}
+
+void
+xfs_icsb_sync_counters_locked(
+	xfs_mount_t	*mp)
+{
+	mp->m_sb.sb_icount = xfs_icsb_sum(mp, XFS_ICSB_ICOUNT);
+	mp->m_sb.sb_ifree = xfs_icsb_sum(mp, XFS_ICSB_IFREE);
+	mp->m_sb.sb_fdblocks = xfs_icsb_sum(mp, XFS_ICSB_FDBLOCKS);
+}
+
 /*
  * Check size of device based on the (data/realtime) block count.
  * Note: this check is used by the growfs code as well as mount.
@@ -1562,7 +1654,7 @@ xfs_log_sbcount(
 	if (!xfs_fs_writable(mp))
 		return 0;
 
-	xfs_icsb_sync_counters(mp, 0);
+	xfs_icsb_sync_counters(mp);
 
 	/*
 	 * we don't need to do this if we are updating the superblock
@@ -1674,9 +1766,9 @@ xfs_mod_incore_sb_unlocked(
 	int64_t		delta,
 	int		rsvd)
 {
-	int		scounter;	/* short counter for 32 bit fields */
-	long long	lcounter;	/* long counter for 64 bit fields */
-	long long	res_used, rem;
+	int		scounter = 0;	/* short counter for 32 bit fields */
+	long long	lcounter = 0;	/* long counter for 64 bit fields */
+	long long	res_used;
 
 	/*
 	 * With the in-core superblock spin lock held, switch
@@ -1708,43 +1800,45 @@ xfs_mod_incore_sb_unlocked(
 			mp->m_sb.sb_fdblocks - XFS_ALLOC_SET_ASIDE(mp);
 		res_used = (long long)(mp->m_resblks - mp->m_resblks_avail);
 
-		if (delta > 0) {		/* Putting blocks back */
+		/*
+		 * if we are putting blocks back, put them into the reserve
+		 * block pool first.
+		 */
+		if (res_used && delta > 0) {
 			if (res_used > delta) {
 				mp->m_resblks_avail += delta;
+				delta = 0;
 			} else {
-				rem = delta - res_used;
 				mp->m_resblks_avail = mp->m_resblks;
-				lcounter += rem;
+				delta -= res_used;
 			}
-		} else {				/* Taking blocks away */
-			lcounter += delta;
-			if (lcounter >= 0) {
-				mp->m_sb.sb_fdblocks = lcounter +
-							XFS_ALLOC_SET_ASIDE(mp);
+			if (!delta)
 				return 0;
-			}
+		}
 
-			/*
-			 * We are out of blocks, use any available reserved
-			 * blocks if were allowed to.
-			 */
-			if (!rsvd)
-				return XFS_ERROR(ENOSPC);
+		lcounter += delta;
+		if (likely(lcounter >= 0)) {
+			mp->m_sb.sb_fdblocks = lcounter +
+						XFS_ALLOC_SET_ASIDE(mp);
+			return 0;
+		}
 
-			lcounter = (long long)mp->m_resblks_avail + delta;
-			if (lcounter >= 0) {
-				mp->m_resblks_avail = lcounter;
-				return 0;
-			}
-			printk_once(KERN_WARNING
-				"Filesystem \"%s\": reserve blocks depleted! "
-				"Consider increasing reserve pool size.",
-				mp->m_fsname);
+		/* ENOSPC */
+		ASSERT(delta < 0);
+		if (!rsvd)
 			return XFS_ERROR(ENOSPC);
+
+		lcounter = (long long)mp->m_resblks_avail + delta;
+		if (lcounter >= 0) {
+			mp->m_resblks_avail = lcounter;
+			return 0;
 		}
+		printk_once(KERN_WARNING
+			"Filesystem \"%s\": reserve blocks depleted! "
+			"Consider increasing reserve pool size.",
+			mp->m_fsname);
+		return XFS_ERROR(ENOSPC);
 
-		mp->m_sb.sb_fdblocks = lcounter + XFS_ALLOC_SET_ASIDE(mp);
-		return 0;
 	case XFS_SBS_FREXTENTS:
 		lcounter = (long long)mp->m_sb.sb_frextents;
 		lcounter += delta;
@@ -1846,9 +1940,7 @@ xfs_mod_incore_sb(
 {
 	int			status;
 
-#ifdef HAVE_PERCPU_SB
 	ASSERT(field < XFS_SBS_ICOUNT || field > XFS_SBS_FDBLOCKS);
-#endif
 	spin_lock(&mp->m_sb_lock);
 	status = xfs_mod_incore_sb_unlocked(mp, field, delta, rsvd);
 	spin_unlock(&mp->m_sb_lock);
@@ -1907,6 +1999,89 @@ unwind:
 	return error;
 }
 
+int
+xfs_icsb_modify_counters(
+	xfs_mount_t	*mp,
+	xfs_sb_field_t	field,
+	int64_t		delta,
+	int		rsvd)
+{
+	int64_t		lcounter;
+	int64_t		res_used;
+	int		ret = 0;
+
+
+	switch (field) {
+	case XFS_SBS_ICOUNT:
+		ret = xfs_icsb_add(mp, XFS_ICSB_ICOUNT, delta, 0);
+		if (ret < 0) {
+			ASSERT(0);
+			return XFS_ERROR(EINVAL);
+		}
+		return 0;
+
+	case XFS_SBS_IFREE:
+		ret = xfs_icsb_add(mp, XFS_ICSB_IFREE, delta, 0);
+		if (ret < 0) {
+			ASSERT(0);
+			return XFS_ERROR(EINVAL);
+		}
+		return 0;
+
+	case XFS_SBS_FDBLOCKS:
+		/*
+		 * if we are putting blocks back, put them into the reserve
+		 * block pool first.
+		 */
+		if (mp->m_resblks != mp->m_resblks_avail && delta > 0) {
+			spin_lock(&mp->m_sb_lock);
+			res_used = (int64_t)(mp->m_resblks -
+						mp->m_resblks_avail);
+			if (res_used > delta) {
+				mp->m_resblks_avail += delta;
+				delta = 0;
+			} else {
+				delta -= res_used;
+				mp->m_resblks_avail = mp->m_resblks;
+			}
+			spin_unlock(&mp->m_sb_lock);
+			if (!delta)
+				return 0;
+		}
+
+		/* try the change */
+		ret = xfs_icsb_add(mp, XFS_ICSB_FDBLOCKS, delta,
+						XFS_ALLOC_SET_ASIDE(mp));
+		if (likely(ret >= 0))
+			return 0;
+
+		/* ENOSPC */
+		ASSERT(ret == -ENOSPC);
+		ASSERT(delta < 0);
+
+		if (!rsvd)
+			return XFS_ERROR(ENOSPC);
+
+		spin_lock(&mp->m_sb_lock);
+		lcounter = (int64_t)mp->m_resblks_avail + delta;
+		if (lcounter >= 0) {
+			mp->m_resblks_avail = lcounter;
+			spin_unlock(&mp->m_sb_lock);
+			return 0;
+		}
+		spin_unlock(&mp->m_sb_lock);
+		printk_once(KERN_WARNING
+			"Filesystem \"%s\": reserve blocks depleted! "
+			"Consider increasing reserve pool size.",
+			mp->m_fsname);
+		return XFS_ERROR(ENOSPC);
+	default:
+		ASSERT(0);
+		return XFS_ERROR(EINVAL);
+	}
+	return 0;
+}
+
 /*
  * xfs_getsb() is called to obtain the buffer for the superblock.
  * The buffer is returned locked and read in from disk.
@@ -2000,572 +2175,3 @@ xfs_dev_is_read_only(
 	}
 	return 0;
 }
-
-#ifdef HAVE_PERCPU_SB
-/*
- * Per-cpu incore superblock counters
- *
- * Simple concept, difficult implementation
- *
- * Basically, replace the incore superblock counters with a distributed per cpu
- * counter for contended fields (e.g.  free block count).
- *
- * Difficulties arise in that the incore sb is used for ENOSPC checking, and
- * hence needs to be accurately read when we are running low on space. Hence
- * there is a method to enable and disable the per-cpu counters based on how
- * much "stuff" is available in them.
- *
- * Basically, a counter is enabled if there is enough free resource to justify
- * running a per-cpu fast-path. If the per-cpu counter runs out (i.e. a local
- * ENOSPC), then we disable the counters to synchronise all callers and
- * re-distribute the available resources.
- *
- * If, once we redistributed the available resources, we still get a failure,
- * we disable the per-cpu counter and go through the slow path.
- *
- * The slow path is the current xfs_mod_incore_sb() function.  This means that
- * when we disable a per-cpu counter, we need to drain its resources back to
- * the global superblock. We do this after disabling the counter to prevent
- * more threads from queueing up on the counter.
- *
- * Essentially, this means that we still need a lock in the fast path to enable
- * synchronisation between the global counters and the per-cpu counters. This
- * is not a problem because the lock will be local to a CPU almost all the time
- * and have little contention except when we get to ENOSPC conditions.
- *
- * Basically, this lock becomes a barrier that enables us to lock out the fast
- * path while we do things like enabling and disabling counters and
- * synchronising the counters.
- *
- * Locking rules:
- *
- * 	1. m_sb_lock before picking up per-cpu locks
- * 	2. per-cpu locks always picked up via for_each_online_cpu() order
- * 	3. accurate counter sync requires m_sb_lock + per cpu locks
- * 	4. modifying per-cpu counters requires holding per-cpu lock
- * 	5. modifying global counters requires holding m_sb_lock
- *	6. enabling or disabling a counter requires holding the m_sb_lock 
- *	   and _none_ of the per-cpu locks.
- *
- * Disabled counters are only ever re-enabled by a balance operation
- * that results in more free resources per CPU than a given threshold.
- * To ensure counters don't remain disabled, they are rebalanced when
- * the global resource goes above a higher threshold (i.e. some hysteresis
- * is present to prevent thrashing).
- */
-
-#ifdef CONFIG_HOTPLUG_CPU
-/*
- * hot-plug CPU notifier support.
- *
- * We need a notifier per filesystem as we need to be able to identify
- * the filesystem to balance the counters out. This is achieved by
- * having a notifier block embedded in the xfs_mount_t and doing pointer
- * magic to get the mount pointer from the notifier block address.
- */
-STATIC int
-xfs_icsb_cpu_notify(
-	struct notifier_block *nfb,
-	unsigned long action,
-	void *hcpu)
-{
-	xfs_icsb_cnts_t *cntp;
-	xfs_mount_t	*mp;
-
-	mp = (xfs_mount_t *)container_of(nfb, xfs_mount_t, m_icsb_notifier);
-	cntp = (xfs_icsb_cnts_t *)
-			per_cpu_ptr(mp->m_sb_cnts, (unsigned long)hcpu);
-	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		/* Easy Case - initialize the area and locks, and
-		 * then rebalance when online does everything else for us. */
-		memset(cntp, 0, sizeof(xfs_icsb_cnts_t));
-		break;
-	case CPU_ONLINE:
-	case CPU_ONLINE_FROZEN:
-		xfs_icsb_lock(mp);
-		xfs_icsb_balance_counter(mp, XFS_SBS_ICOUNT, 0);
-		xfs_icsb_balance_counter(mp, XFS_SBS_IFREE, 0);
-		xfs_icsb_balance_counter(mp, XFS_SBS_FDBLOCKS, 0);
-		xfs_icsb_unlock(mp);
-		break;
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		/* Disable all the counters, then fold the dead cpu's
-		 * count into the total on the global superblock and
-		 * re-enable the counters. */
-		xfs_icsb_lock(mp);
-		spin_lock(&mp->m_sb_lock);
-		xfs_icsb_disable_counter(mp, XFS_SBS_ICOUNT);
-		xfs_icsb_disable_counter(mp, XFS_SBS_IFREE);
-		xfs_icsb_disable_counter(mp, XFS_SBS_FDBLOCKS);
-
-		mp->m_sb.sb_icount += cntp->icsb_icount;
-		mp->m_sb.sb_ifree += cntp->icsb_ifree;
-		mp->m_sb.sb_fdblocks += cntp->icsb_fdblocks;
-
-		memset(cntp, 0, sizeof(xfs_icsb_cnts_t));
-
-		xfs_icsb_balance_counter_locked(mp, XFS_SBS_ICOUNT, 0);
-		xfs_icsb_balance_counter_locked(mp, XFS_SBS_IFREE, 0);
-		xfs_icsb_balance_counter_locked(mp, XFS_SBS_FDBLOCKS, 0);
-		spin_unlock(&mp->m_sb_lock);
-		xfs_icsb_unlock(mp);
-		break;
-	}
-
-	return NOTIFY_OK;
-}
-#endif /* CONFIG_HOTPLUG_CPU */
-
-int
-xfs_icsb_init_counters(
-	xfs_mount_t	*mp)
-{
-	xfs_icsb_cnts_t *cntp;
-	int		i;
-
-	mp->m_sb_cnts = alloc_percpu(xfs_icsb_cnts_t);
-	if (mp->m_sb_cnts == NULL)
-		return -ENOMEM;
-
-#ifdef CONFIG_HOTPLUG_CPU
-	mp->m_icsb_notifier.notifier_call = xfs_icsb_cpu_notify;
-	mp->m_icsb_notifier.priority = 0;
-	register_hotcpu_notifier(&mp->m_icsb_notifier);
-#endif /* CONFIG_HOTPLUG_CPU */
-
-	for_each_online_cpu(i) {
-		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
-		memset(cntp, 0, sizeof(xfs_icsb_cnts_t));
-	}
-
-	mutex_init(&mp->m_icsb_mutex);
-
-	/*
-	 * start with all counters disabled so that the
-	 * initial balance kicks us off correctly
-	 */
-	mp->m_icsb_counters = -1;
-	return 0;
-}
-
-void
-xfs_icsb_reinit_counters(
-	xfs_mount_t	*mp)
-{
-	xfs_icsb_lock(mp);
-	/*
-	 * start with all counters disabled so that the
-	 * initial balance kicks us off correctly
-	 */
-	mp->m_icsb_counters = -1;
-	xfs_icsb_balance_counter(mp, XFS_SBS_ICOUNT, 0);
-	xfs_icsb_balance_counter(mp, XFS_SBS_IFREE, 0);
-	xfs_icsb_balance_counter(mp, XFS_SBS_FDBLOCKS, 0);
-	xfs_icsb_unlock(mp);
-}
-
-void
-xfs_icsb_destroy_counters(
-	xfs_mount_t	*mp)
-{
-	if (mp->m_sb_cnts) {
-		unregister_hotcpu_notifier(&mp->m_icsb_notifier);
-		free_percpu(mp->m_sb_cnts);
-	}
-	mutex_destroy(&mp->m_icsb_mutex);
-}
-
-STATIC void
-xfs_icsb_lock_cntr(
-	xfs_icsb_cnts_t	*icsbp)
-{
-	while (test_and_set_bit(XFS_ICSB_FLAG_LOCK, &icsbp->icsb_flags)) {
-		ndelay(1000);
-	}
-}
-
-STATIC void
-xfs_icsb_unlock_cntr(
-	xfs_icsb_cnts_t	*icsbp)
-{
-	clear_bit(XFS_ICSB_FLAG_LOCK, &icsbp->icsb_flags);
-}
-
-
-STATIC void
-xfs_icsb_lock_all_counters(
-	xfs_mount_t	*mp)
-{
-	xfs_icsb_cnts_t *cntp;
-	int		i;
-
-	for_each_online_cpu(i) {
-		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
-		xfs_icsb_lock_cntr(cntp);
-	}
-}
-
-STATIC void
-xfs_icsb_unlock_all_counters(
-	xfs_mount_t	*mp)
-{
-	xfs_icsb_cnts_t *cntp;
-	int		i;
-
-	for_each_online_cpu(i) {
-		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
-		xfs_icsb_unlock_cntr(cntp);
-	}
-}
-
-STATIC void
-xfs_icsb_count(
-	xfs_mount_t	*mp,
-	xfs_icsb_cnts_t	*cnt,
-	int		flags)
-{
-	xfs_icsb_cnts_t *cntp;
-	int		i;
-
-	memset(cnt, 0, sizeof(xfs_icsb_cnts_t));
-
-	if (!(flags & XFS_ICSB_LAZY_COUNT))
-		xfs_icsb_lock_all_counters(mp);
-
-	for_each_online_cpu(i) {
-		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
-		cnt->icsb_icount += cntp->icsb_icount;
-		cnt->icsb_ifree += cntp->icsb_ifree;
-		cnt->icsb_fdblocks += cntp->icsb_fdblocks;
-	}
-
-	if (!(flags & XFS_ICSB_LAZY_COUNT))
-		xfs_icsb_unlock_all_counters(mp);
-}
-
-STATIC int
-xfs_icsb_counter_disabled(
-	xfs_mount_t	*mp,
-	xfs_sb_field_t	field)
-{
-	ASSERT((field >= XFS_SBS_ICOUNT) && (field <= XFS_SBS_FDBLOCKS));
-	return test_bit(field, &mp->m_icsb_counters);
-}
-
-STATIC void
-xfs_icsb_disable_counter(
-	xfs_mount_t	*mp,
-	xfs_sb_field_t	field)
-{
-	xfs_icsb_cnts_t	cnt;
-
-	ASSERT((field >= XFS_SBS_ICOUNT) && (field <= XFS_SBS_FDBLOCKS));
-
-	/*
-	 * If we are already disabled, then there is nothing to do
-	 * here. We check before locking all the counters to avoid
-	 * the expensive lock operation when being called in the
-	 * slow path and the counter is already disabled. This is
-	 * safe because the only time we set or clear this state is under
-	 * the m_icsb_mutex.
-	 */
-	if (xfs_icsb_counter_disabled(mp, field))
-		return;
-
-	xfs_icsb_lock_all_counters(mp);
-	if (!test_and_set_bit(field, &mp->m_icsb_counters)) {
-		/* drain back to superblock */
-
-		xfs_icsb_count(mp, &cnt, XFS_ICSB_LAZY_COUNT);
-		switch(field) {
-		case XFS_SBS_ICOUNT:
-			mp->m_sb.sb_icount = cnt.icsb_icount;
-			break;
-		case XFS_SBS_IFREE:
-			mp->m_sb.sb_ifree = cnt.icsb_ifree;
-			break;
-		case XFS_SBS_FDBLOCKS:
-			mp->m_sb.sb_fdblocks = cnt.icsb_fdblocks;
-			break;
-		default:
-			BUG();
-		}
-	}
-
-	xfs_icsb_unlock_all_counters(mp);
-}
-
-STATIC void
-xfs_icsb_enable_counter(
-	xfs_mount_t	*mp,
-	xfs_sb_field_t	field,
-	uint64_t	count,
-	uint64_t	resid)
-{
-	xfs_icsb_cnts_t	*cntp;
-	int		i;
-
-	ASSERT((field >= XFS_SBS_ICOUNT) && (field <= XFS_SBS_FDBLOCKS));
-
-	xfs_icsb_lock_all_counters(mp);
-	for_each_online_cpu(i) {
-		cntp = per_cpu_ptr(mp->m_sb_cnts, i);
-		switch (field) {
-		case XFS_SBS_ICOUNT:
-			cntp->icsb_icount = count + resid;
-			break;
-		case XFS_SBS_IFREE:
-			cntp->icsb_ifree = count + resid;
-			break;
-		case XFS_SBS_FDBLOCKS:
-			cntp->icsb_fdblocks = count + resid;
-			break;
-		default:
-			BUG();
-			break;
-		}
-		resid = 0;
-	}
-	clear_bit(field, &mp->m_icsb_counters);
-	xfs_icsb_unlock_all_counters(mp);
-}
-
-void
-xfs_icsb_sync_counters_locked(
-	xfs_mount_t	*mp,
-	int		flags)
-{
-	xfs_icsb_cnts_t	cnt;
-
-	xfs_icsb_count(mp, &cnt, flags);
-
-	if (!xfs_icsb_counter_disabled(mp, XFS_SBS_ICOUNT))
-		mp->m_sb.sb_icount = cnt.icsb_icount;
-	if (!xfs_icsb_counter_disabled(mp, XFS_SBS_IFREE))
-		mp->m_sb.sb_ifree = cnt.icsb_ifree;
-	if (!xfs_icsb_counter_disabled(mp, XFS_SBS_FDBLOCKS))
-		mp->m_sb.sb_fdblocks = cnt.icsb_fdblocks;
-}
-
-/*
- * Accurate update of per-cpu counters to incore superblock
- */
-void
-xfs_icsb_sync_counters(
-	xfs_mount_t	*mp,
-	int		flags)
-{
-	spin_lock(&mp->m_sb_lock);
-	xfs_icsb_sync_counters_locked(mp, flags);
-	spin_unlock(&mp->m_sb_lock);
-}
-
-/*
- * Balance and enable/disable counters as necessary.
- *
- * Thresholds for re-enabling counters are somewhat magic.  inode counts are
- * chosen to be the same number as single on disk allocation chunk per CPU, and
- * free blocks is something far enough zero that we aren't going thrash when we
- * get near ENOSPC. We also need to supply a minimum we require per cpu to
- * prevent looping endlessly when xfs_alloc_space asks for more than will
- * be distributed to a single CPU but each CPU has enough blocks to be
- * reenabled.
- *
- * Note that we can be called when counters are already disabled.
- * xfs_icsb_disable_counter() optimises the counter locking in this case to
- * prevent locking every per-cpu counter needlessly.
- */
-
-#define XFS_ICSB_INO_CNTR_REENABLE	(uint64_t)64
-#define XFS_ICSB_FDBLK_CNTR_REENABLE(mp) \
-		(uint64_t)(512 + XFS_ALLOC_SET_ASIDE(mp))
-STATIC void
-xfs_icsb_balance_counter_locked(
-	xfs_mount_t	*mp,
-	xfs_sb_field_t  field,
-	int		min_per_cpu)
-{
-	uint64_t	count, resid;
-	int		weight = num_online_cpus();
-	uint64_t	min = (uint64_t)min_per_cpu;
-
-	/* disable counter and sync counter */
-	xfs_icsb_disable_counter(mp, field);
-
-	/* update counters  - first CPU gets residual*/
-	switch (field) {
-	case XFS_SBS_ICOUNT:
-		count = mp->m_sb.sb_icount;
-		resid = do_div(count, weight);
-		if (count < max(min, XFS_ICSB_INO_CNTR_REENABLE))
-			return;
-		break;
-	case XFS_SBS_IFREE:
-		count = mp->m_sb.sb_ifree;
-		resid = do_div(count, weight);
-		if (count < max(min, XFS_ICSB_INO_CNTR_REENABLE))
-			return;
-		break;
-	case XFS_SBS_FDBLOCKS:
-		count = mp->m_sb.sb_fdblocks;
-		resid = do_div(count, weight);
-		if (count < max(min, XFS_ICSB_FDBLK_CNTR_REENABLE(mp)))
-			return;
-		break;
-	default:
-		BUG();
-		count = resid = 0;	/* quiet, gcc */
-		break;
-	}
-
-	xfs_icsb_enable_counter(mp, field, count, resid);
-}
-
-STATIC void
-xfs_icsb_balance_counter(
-	xfs_mount_t	*mp,
-	xfs_sb_field_t  fields,
-	int		min_per_cpu)
-{
-	spin_lock(&mp->m_sb_lock);
-	xfs_icsb_balance_counter_locked(mp, fields, min_per_cpu);
-	spin_unlock(&mp->m_sb_lock);
-}
-
-int
-xfs_icsb_modify_counters(
-	xfs_mount_t	*mp,
-	xfs_sb_field_t	field,
-	int64_t		delta,
-	int		rsvd)
-{
-	xfs_icsb_cnts_t	*icsbp;
-	long long	lcounter;	/* long counter for 64 bit fields */
-	int		ret = 0;
-
-	might_sleep();
-again:
-	preempt_disable();
-	icsbp = this_cpu_ptr(mp->m_sb_cnts);
-
-	/*
-	 * if the counter is disabled, go to slow path
-	 */
-	if (unlikely(xfs_icsb_counter_disabled(mp, field)))
-		goto slow_path;
-	xfs_icsb_lock_cntr(icsbp);
-	if (unlikely(xfs_icsb_counter_disabled(mp, field))) {
-		xfs_icsb_unlock_cntr(icsbp);
-		goto slow_path;
-	}
-
-	switch (field) {
-	case XFS_SBS_ICOUNT:
-		lcounter = icsbp->icsb_icount;
-		lcounter += delta;
-		if (unlikely(lcounter < 0))
-			goto balance_counter;
-		icsbp->icsb_icount = lcounter;
-		break;
-
-	case XFS_SBS_IFREE:
-		lcounter = icsbp->icsb_ifree;
-		lcounter += delta;
-		if (unlikely(lcounter < 0))
-			goto balance_counter;
-		icsbp->icsb_ifree = lcounter;
-		break;
-
-	case XFS_SBS_FDBLOCKS:
-		BUG_ON((mp->m_resblks - mp->m_resblks_avail) != 0);
-
-		lcounter = icsbp->icsb_fdblocks - XFS_ALLOC_SET_ASIDE(mp);
-		lcounter += delta;
-		if (unlikely(lcounter < 0))
-			goto balance_counter;
-		icsbp->icsb_fdblocks = lcounter + XFS_ALLOC_SET_ASIDE(mp);
-		break;
-	default:
-		BUG();
-		break;
-	}
-	xfs_icsb_unlock_cntr(icsbp);
-	preempt_enable();
-	return 0;
-
-slow_path:
-	preempt_enable();
-
-	/*
-	 * serialise with a mutex so we don't burn lots of cpu on
-	 * the superblock lock. We still need to hold the superblock
-	 * lock, however, when we modify the global structures.
-	 */
-	xfs_icsb_lock(mp);
-
-	/*
-	 * Now running atomically.
-	 *
-	 * If the counter is enabled, someone has beaten us to rebalancing.
-	 * Drop the lock and try again in the fast path....
-	 */
-	if (!(xfs_icsb_counter_disabled(mp, field))) {
-		xfs_icsb_unlock(mp);
-		goto again;
-	}
-
-	/*
-	 * The counter is currently disabled. Because we are
-	 * running atomically here, we know a rebalance cannot
-	 * be in progress. Hence we can go straight to operating
-	 * on the global superblock. We do not call xfs_mod_incore_sb()
-	 * here even though we need to get the m_sb_lock. Doing so
-	 * will cause us to re-enter this function and deadlock.
-	 * Hence we get the m_sb_lock ourselves and then call
-	 * xfs_mod_incore_sb_unlocked() as the unlocked path operates
-	 * directly on the global counters.
-	 */
-	spin_lock(&mp->m_sb_lock);
-	ret = xfs_mod_incore_sb_unlocked(mp, field, delta, rsvd);
-	spin_unlock(&mp->m_sb_lock);
-
-	/*
-	 * Now that we've modified the global superblock, we
-	 * may be able to re-enable the distributed counters
-	 * (e.g. lots of space just got freed). After that
-	 * we are done.
-	 */
-	if (ret != ENOSPC)
-		xfs_icsb_balance_counter(mp, field, 0);
-	xfs_icsb_unlock(mp);
-	return ret;
-
-balance_counter:
-	xfs_icsb_unlock_cntr(icsbp);
-	preempt_enable();
-
-	/*
-	 * We may have multiple threads here if multiple per-cpu
-	 * counters run dry at the same time. This will mean we can
-	 * do more balances than strictly necessary but it is not
-	 * the common slowpath case.
-	 */
-	xfs_icsb_lock(mp);
-
-	/*
-	 * running atomically.
-	 *
-	 * This will leave the counter in the correct state for future
-	 * accesses. After the rebalance, we simply try again and our retry
-	 * will either succeed through the fast path or slow path without
-	 * another balance operation being required.
-	 */
-	xfs_icsb_balance_counter(mp, field, delta);
-	xfs_icsb_unlock(mp);
-	goto again;
-}
-
-#endif
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 5861b49..7efae1d 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -65,44 +65,19 @@ struct xfs_nameops;
 struct xfs_ail;
 struct xfs_quotainfo;
 
-#ifdef HAVE_PERCPU_SB
-
 /*
- * Valid per-cpu incore superblock counters. Note that if you add new counters,
- * you may need to define new counter disabled bit field descriptors as there
- * are more possible fields in the superblock that can fit in a bitfield on a
- * 32 bit platform. The XFS_SBS_* values for the current current counters just
- * fit.
+ * Per-cpu incore superblock counters.
  */
-typedef struct xfs_icsb_cnts {
-	uint64_t	icsb_fdblocks;
-	uint64_t	icsb_ifree;
-	uint64_t	icsb_icount;
-	unsigned long	icsb_flags;
-} xfs_icsb_cnts_t;
-
-#define XFS_ICSB_FLAG_LOCK	(1 << 0)	/* counter lock bit */
+enum {
+	XFS_ICSB_FDBLOCKS = 0,
+	XFS_ICSB_IFREE,
+	XFS_ICSB_ICOUNT,
+	XFS_ICSB_MAX,
+};
 
-#define XFS_ICSB_LAZY_COUNT	(1 << 1)	/* accuracy not needed */
-
-extern int	xfs_icsb_init_counters(struct xfs_mount *);
-extern void	xfs_icsb_reinit_counters(struct xfs_mount *);
-extern void	xfs_icsb_destroy_counters(struct xfs_mount *);
-extern void	xfs_icsb_sync_counters(struct xfs_mount *, int);
-extern void	xfs_icsb_sync_counters_locked(struct xfs_mount *, int);
 extern int	xfs_icsb_modify_counters(struct xfs_mount *, xfs_sb_field_t,
 						int64_t, int);
 
-#else
-#define xfs_icsb_init_counters(mp)		(0)
-#define xfs_icsb_destroy_counters(mp)		do { } while (0)
-#define xfs_icsb_reinit_counters(mp)		do { } while (0)
-#define xfs_icsb_sync_counters(mp, flags)	do { } while (0)
-#define xfs_icsb_sync_counters_locked(mp, flags) do { } while (0)
-#define xfs_icsb_modify_counters(mp, field, delta, rsvd) \
-	xfs_mod_incore_sb(mp, field, delta, rsvd)
-#endif
-
 typedef struct xfs_mount {
 	struct super_block	*m_super;
 	xfs_tid_t		m_tid;		/* next unused tid for fs */
@@ -186,12 +161,6 @@ typedef struct xfs_mount {
 	struct xfs_chash	*m_chash;	/* fs private inode per-cluster
 						 * hash table */
 	atomic_t		m_active_trans;	/* number trans frozen */
-#ifdef HAVE_PERCPU_SB
-	xfs_icsb_cnts_t __percpu *m_sb_cnts;	/* per-cpu superblock counters */
-	unsigned long		m_icsb_counters; /* disabled per-cpu counters */
-	struct notifier_block	m_icsb_notifier; /* hotplug cpu notifier */
-	struct mutex		m_icsb_mutex;	/* balancer sync lock */
-#endif
 	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
 	struct task_struct	*m_sync_task;	/* generalised sync thread */
 	xfs_sync_work_t		m_sync_work;	/* work item for VFS_SYNC */
@@ -202,6 +171,7 @@ typedef struct xfs_mount {
 	__int64_t		m_update_flags;	/* sb flags we need to update
 						   on the next remount,rw */
 	struct shrinker		m_inode_shrink;	/* inode reclaim shrinker */
+	struct percpu_counter	m_icsb[XFS_ICSB_MAX];
 } xfs_mount_t;
 
 /*
@@ -333,26 +303,6 @@ struct xfs_perag *xfs_perag_get_tag(struct xfs_mount *mp, xfs_agnumber_t agno,
 void	xfs_perag_put(struct xfs_perag *pag);
 
 /*
- * Per-cpu superblock locking functions
- */
-#ifdef HAVE_PERCPU_SB
-static inline void
-xfs_icsb_lock(xfs_mount_t *mp)
-{
-	mutex_lock(&mp->m_icsb_mutex);
-}
-
-static inline void
-xfs_icsb_unlock(xfs_mount_t *mp)
-{
-	mutex_unlock(&mp->m_icsb_mutex);
-}
-#else
-#define xfs_icsb_lock(mp)
-#define xfs_icsb_unlock(mp)
-#endif
-
-/*
  * This structure is for use by the xfs_mod_incore_sb_batch() routine.
  * xfs_growfs can specify a few fields which are more than int limit
  */
@@ -379,6 +329,20 @@ extern int	xfs_sb_validate_fsb_count(struct xfs_sb *, __uint64_t);
 
 extern int	xfs_dev_is_read_only(struct xfs_mount *, char *);
 
+extern int	xfs_icsb_init_counters(struct xfs_mount *);
+extern void	xfs_icsb_reinit_counters(struct xfs_mount *);
+extern void	xfs_icsb_destroy_counters(struct xfs_mount *);
+extern void	xfs_icsb_sync_counters_locked(struct xfs_mount *);
+
+static inline void
+xfs_icsb_sync_counters(
+	struct xfs_mount	*mp)
+{
+	spin_lock(&mp->m_sb_lock);
+	xfs_icsb_sync_counters_locked(mp);
+	spin_unlock(&mp->m_sb_lock);
+}
+
 #endif	/* __KERNEL__ */
 
 extern void	xfs_mod_sb(struct xfs_trans *, __int64_t);
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 46f6ba5..32014a4 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -41,6 +41,8 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
 s64 __percpu_counter_sum(struct percpu_counter *fbc);
 int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs);
+int percpu_counter_add_unless_lt(struct percpu_counter *fbc, s64 amount,
+							s64 threshold);
 
 static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 {
@@ -153,6 +155,20 @@ static inline int percpu_counter_initialized(struct percpu_counter *fbc)
 	return 1;
 }
 
+static inline int
+percpu_counter_test_and_add_delta(struct percpu_counter *fbc, s64 delta)
+{
+	s64 count;
+
+	preempt_disable();
+	count = fbc->count + delta;
+	if (count >= 0)
+		fbc->count = count;
+	preempt_enable();
+	/* -1 on underrun (counter unchanged), 0 if zero, 1 if positive */
+	return count < 0 ? -1 : (count ? 1 : 0);
+}
+
+
 #endif	/* CONFIG_SMP */
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 604678d..13c4ff3 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -213,6 +213,85 @@ int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
 }
 EXPORT_SYMBOL(percpu_counter_compare);
 
+/**
+ *
+ * percpu_counter_add_unless_lt - add to a counter avoiding underruns
+ * @fbc:	counter
+ * @amount:	amount to add
+ * @threshold:	underrun threshold
+ *
+ * Add @amount to @fbc if and only if the result of the addition is greater
+ * than or equal to @threshold. Return 1 if greater and added, 0 if equal and
+ * added, and -1 if an underrun would have occurred.
+ *
+ * This is useful for operations that must accurately and atomically only add a
+ * delta to a counter if the result is above a given threshold (e.g. freespace
+ * accounting with ENOSPC checking in filesystems).
+ */
+int percpu_counter_add_unless_lt(struct percpu_counter *fbc, s64 amount,
+				 s64 threshold)
+{
+	s64	count;
+	s64	error = 2 * percpu_counter_batch * num_online_cpus();
+	int	cpu;
+	int	ret = -1;
+
+	preempt_disable();
+
+	/* Check to see if rough count will be sufficient for comparison */
+	count = percpu_counter_read(fbc);
+	if (count + amount < threshold - error)
+		goto out;
+
+	/*
+	 * If the counter is over the threshold and the change is less than the
+	 * batch size, we might be able to avoid locking.
+	 */
+	if (count > threshold + error && abs(amount) < percpu_counter_batch) {
+		__percpu_counter_add(fbc, amount, percpu_counter_batch);
+		ret = 1;
+		goto out;
+	}
+
+	/*
+	 * If the result is over the error threshold, we can just add it
+	 * into the global counter ignoring what is in the per-cpu counters
+	 * as they will not change the result of the calculation.
+	 */
+	spin_lock(&fbc->lock);
+	if (fbc->count + amount > threshold + error) {
+		fbc->count += amount;
+		ret = 1;
+		goto out_unlock;
+	}
+
+	/*
+	 * Result is within the error margin. Run an open-coded sum of the
+	 * per-cpu counters to get the exact value at this point in time,
+	 * and if the result would be at or above the threshold, add the amount to
+	 * the global counter.
+	 */
+	count = fbc->count;
+	for_each_online_cpu(cpu) {
+		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+		count += *pcount;
+	}
+	WARN_ON(count < threshold);
+
+	if (count + amount >= threshold) {
+		ret = 0;
+		if (count + amount > threshold)
+			ret = 1;
+		fbc->count += amount;
+	}
+out_unlock:
+	spin_unlock(&fbc->lock);
+out:
+	preempt_enable();
+	return ret;
+}
+EXPORT_SYMBOL(percpu_counter_add_unless_lt);
+
 static int __init percpu_counter_startup(void)
 {
 	compute_batch_value();
-- 
1.7.2.3

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 04/16] xfs: dynamic speculative EOF preallocation
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (2 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 03/16] [RFC] xfs: use generic per-cpu counter infrastructure Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08 11:43   ` Christoph Hellwig
  2010-11-08  8:55 ` [PATCH 05/16] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Currently the size of the speculative preallocation during delayed
allocation is fixed by either the allocsize mount option or a
default size. We are seeing a lot of cases where we need to
recommend using the allocsize mount option to prevent fragmentation
when buffered writes land in the same AG.

Rather than using a fixed preallocation size by default (up to 64k),
make it dynamic by basing it on the current inode size. That way the
EOF preallocation will increase as the file size increases.  Hence
for streaming writes we are much more likely to get large
preallocations exactly when we need it to reduce fragmentation.

For default settings, the size of the initial extents is determined
by the number of parallel writers and the amount of memory in the
machine. For 4GB RAM and 4 concurrent 32GB file writes:

EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
   0: [0..1048575]:         1048672..2097247      0 (1048672..2097247)      1048576
   1: [1048576..2097151]:   5242976..6291551      0 (5242976..6291551)      1048576
   2: [2097152..4194303]:   12583008..14680159    0 (12583008..14680159)    2097152
   3: [4194304..8388607]:   25165920..29360223    0 (25165920..29360223)    4194304
   4: [8388608..16777215]:  58720352..67108959    0 (58720352..67108959)    8388608
   5: [16777216..33554423]: 117440584..134217791  0 (117440584..134217791) 16777208
   6: [33554424..50331511]: 184549056..201326143  0 (184549056..201326143) 16777088
   7: [50331512..67108599]: 251657408..268434495  0 (251657408..268434495) 16777088

and for 16 concurrent 16GB file writes:

 EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
   0: [0..262143]:          2490472..2752615      0 (2490472..2752615)       262144
   1: [262144..524287]:     6291560..6553703      0 (6291560..6553703)       262144
   2: [524288..1048575]:    13631592..14155879    0 (13631592..14155879)     524288
   3: [1048576..2097151]:   30408808..31457383    0 (30408808..31457383)    1048576
   4: [2097152..4194303]:   52428904..54526055    0 (52428904..54526055)    2097152
   5: [4194304..8388607]:   104857704..109052007  0 (104857704..109052007)  4194304
   6: [8388608..16777215]:  209715304..218103911  0 (209715304..218103911)  8388608
   7: [16777216..33554423]: 452984848..469762055  0 (452984848..469762055) 16777208

The allocsize mount option still controls the minimum preallocation size, so
the smallest extent size can still be bound in situations where this behaviour
is not sufficient.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iomap.c |   53 ++++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 2057614..0227ac1 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -389,6 +389,9 @@ error_out:
  * If the caller is doing a write at the end of the file, then extend the
  * allocation out to the file system's write iosize.  We clean up any extra
  * space left over when the file is closed in xfs_inactive().
+ *
+ * If we find we already have delalloc preallocation beyond EOF, don't do more
+ * preallocation as it is not needed.
  */
 STATIC int
 xfs_iomap_eof_want_preallocate(
@@ -405,6 +408,7 @@ xfs_iomap_eof_want_preallocate(
 	xfs_filblks_t   count_fsb;
 	xfs_fsblock_t	firstblock;
 	int		n, error, imaps;
+	int		found_delalloc = 0;
 
 	*prealloc = 0;
 	if ((offset + count) <= ip->i_size)
@@ -427,11 +431,16 @@ xfs_iomap_eof_want_preallocate(
 			if ((imap[n].br_startblock != HOLESTARTBLOCK) &&
 			    (imap[n].br_startblock != DELAYSTARTBLOCK))
 				return 0;
+
 			start_fsb += imap[n].br_blockcount;
 			count_fsb -= imap[n].br_blockcount;
+
+			if (imap[n].br_startblock == DELAYSTARTBLOCK)
+				found_delalloc = 1;
 		}
 	}
-	*prealloc = 1;
+	if (!found_delalloc)
+		*prealloc = 1;
 	return 0;
 }
 
@@ -469,6 +478,7 @@ xfs_iomap_write_delay(
 	extsz = xfs_get_extsz_hint(ip);
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 
+
 	error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count,
 				ioflag, imap, XFS_WRITE_IMAPS, &prealloc);
 	if (error)
@@ -476,9 +486,23 @@ xfs_iomap_write_delay(
 
 retry:
 	if (prealloc) {
+		xfs_fileoff_t	alloc_blocks = 0;
+		/*
+		 * If we don't have a user specified preallocation size, dynamically
+		 * increase the preallocation size as the size of the file
+		 * grows. Cap the maximum size at a single extent.
+		 */
+		if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)) {
+			alloc_blocks = XFS_B_TO_FSB(mp, ip->i_size);
+			alloc_blocks = XFS_FILEOFF_MIN(MAXEXTLEN,
+					rounddown_pow_of_two(alloc_blocks));
+		}
+		if (alloc_blocks < mp->m_writeio_blocks)
+			alloc_blocks = mp->m_writeio_blocks;
+
 		aligned_offset = XFS_WRITEIO_ALIGN(mp, (offset + count - 1));
 		ioalign = XFS_B_TO_FSBT(mp, aligned_offset);
-		last_fsb = ioalign + mp->m_writeio_blocks;
+		last_fsb = ioalign + alloc_blocks;
 	} else {
 		last_fsb = XFS_B_TO_FSB(mp, ((xfs_ufsize_t)(offset + count)));
 	}
@@ -496,22 +520,31 @@ retry:
 			  XFS_BMAPI_DELAY | XFS_BMAPI_WRITE |
 			  XFS_BMAPI_ENTIRE, &firstblock, 1, imap,
 			  &nimaps, NULL);
-	if (error && (error != ENOSPC))
+	switch (error) {
+	case 0:
+	case ENOSPC:
+	case EDQUOT:
+		break;
+	default:
 		return XFS_ERROR(error);
+	}
 
 	/*
-	 * If bmapi returned us nothing, and if we didn't get back EDQUOT,
-	 * then we must have run out of space - flush all other inodes with
-	 * delalloc blocks and retry without EOF preallocation.
+	 * If bmapi returned us nothing, we got either ENOSPC or EDQUOT.  For
+	 * ENOSPC, flush all other inodes with delalloc blocks to free up
+	 * some of the excess reserved metadata space. For both cases, retry
+	 * without EOF preallocation.
 	 */
 	if (nimaps == 0) {
 		trace_xfs_delalloc_enospc(ip, offset, count);
 		if (flushed)
-			return XFS_ERROR(ENOSPC);
+			return XFS_ERROR(error ? error : ENOSPC);
 
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		xfs_flush_inodes(ip);
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
+		if (error == ENOSPC) {
+			xfs_iunlock(ip, XFS_ILOCK_EXCL);
+			xfs_flush_inodes(ip);
+			xfs_ilock(ip, XFS_ILOCK_EXCL);
+		}
 
 		flushed = 1;
 		error = 0;
-- 
1.7.2.3


* [PATCH 05/16] xfs: don't truncate prealloc from frequently accessed inodes
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (3 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 04/16] xfs: dynamic speculative EOF preallocation Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08 11:36   ` Christoph Hellwig
  2010-11-08  8:55 ` [PATCH 06/16] patch xfs-inode-hash-fake Dave Chinner
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

A long standing problem for streaming writes through the NFS server
has been that the NFS server opens and closes file descriptors on an
inode for every write. The result of this behaviour is that the
->release() function is called on every close and that results in
XFS truncating speculative preallocation beyond the EOF.  This has
an adverse effect on file layout when multiple files are being
written at the same time - they interleave their extents and can
result in severe fragmentation.

To avoid this problem, keep a count of the number of ->release calls
made on an inode. For most cases, an inode is only going to be opened
once for writing and then closed again during its lifetime in
cache. Hence if there are multiple ->release calls, there is a good
chance that the inode is being accessed by the NFS server. Hence
count up every time ->release is called while there are delalloc
blocks still outstanding on the inode.

If this count is non-zero when ->release is next called, then do not
truncate away the speculative preallocation - leave it there so that
subsequent writes do not need to reallocate the delalloc space. This
will prevent interleaving of extents of different inodes written
concurrently to the same AG.

If we get this wrong, it is not a big deal as we truncate
speculative allocation beyond EOF anyway in xfs_inactive() when the
inode is thrown out of the cache.

The new counter in the struct xfs_inode fits into a hole in the
structure on 64 bit machines, so does not grow the size of the inode
at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap.c     |    9 +++++-
 fs/xfs/xfs_dfrag.c    |   13 ++++++++++
 fs/xfs/xfs_iget.c     |    1 +
 fs/xfs/xfs_inode.h    |    1 +
 fs/xfs/xfs_vnodeops.c |   61 ++++++++++++++++++++++++++++++++-----------------
 5 files changed, 62 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 8abd12e..7764a4f 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -5471,8 +5471,13 @@ xfs_getbmap(
 			if (error)
 				goto out_unlock_iolock;
 		}
-
-		ASSERT(ip->i_delayed_blks == 0);
+		/*
+		 * even after flushing the inode, there can still be delalloc
+		 * blocks on the inode beyond EOF due to speculative
+		 * preallocation. These are not removed until the release
+		 * function is called or the inode is inactivated. Hence we
+		 * cannot assert here that ip->i_delayed_blks == 0.
+		 */
 	}
 
 	lock = xfs_ilock_map_shared(ip);
diff --git a/fs/xfs/xfs_dfrag.c b/fs/xfs/xfs_dfrag.c
index 3b9582c..e60490b 100644
--- a/fs/xfs/xfs_dfrag.c
+++ b/fs/xfs/xfs_dfrag.c
@@ -377,6 +377,19 @@ xfs_swap_extents(
 	ip->i_d.di_format = tip->i_d.di_format;
 	tip->i_d.di_format = tmp;
 
+	/*
+	 * The extents in the source inode could still contain speculative
+	 * preallocation beyond EOF (e.g. the file is open but not modified
+	 * while defrag is in progress). In that case, we need to copy over the
+	 * number of delalloc blocks the data fork in the source inode is
+	 * tracking beyond EOF so that when the fork is truncated away when the
+	 * temporary inode is unlinked we don't underrun the i_delayed_blks
+	 * counter on that inode.
+	 */
+	ASSERT(tip->i_delayed_blks == 0);
+	tip->i_delayed_blks = ip->i_delayed_blks;
+	ip->i_delayed_blks = 0;
+
 	ilf_fields = XFS_ILOG_CORE;
 
 	switch(ip->i_d.di_format) {
diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index 0cdd269..18991a9 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -84,6 +84,7 @@ xfs_inode_alloc(
 	memset(&ip->i_d, 0, sizeof(xfs_icdinode_t));
 	ip->i_size = 0;
 	ip->i_new_size = 0;
+	ip->i_dirty_releases = 0;
 
 	/* prevent anyone from using this yet */
 	VFS_I(ip)->i_state = I_NEW;
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index fb2ca2e..ea2f34e 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -260,6 +260,7 @@ typedef struct xfs_inode {
 	xfs_fsize_t		i_size;		/* in-memory size */
 	xfs_fsize_t		i_new_size;	/* size when write completes */
 	atomic_t		i_iocount;	/* outstanding I/O count */
+	int			i_dirty_releases; /* dirty ->release calls */
 
 	/* VFS inode */
 	struct inode		i_vnode;	/* embedded VFS inode */
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index 8e4a63c..49f3a5a 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -964,29 +964,48 @@ xfs_release(
 			xfs_flush_pages(ip, 0, -1, XBF_ASYNC, FI_NONE);
 	}
 
-	if (ip->i_d.di_nlink != 0) {
-		if ((((ip->i_d.di_mode & S_IFMT) == S_IFREG) &&
-		     ((ip->i_size > 0) || (VN_CACHED(VFS_I(ip)) > 0 ||
-		       ip->i_delayed_blks > 0)) &&
-		     (ip->i_df.if_flags & XFS_IFEXTENTS))  &&
-		    (!(ip->i_d.di_flags &
-				(XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)))) {
+	if (ip->i_d.di_nlink == 0)
+		return 0;
 
-			/*
-			 * If we can't get the iolock just skip truncating
-			 * the blocks past EOF because we could deadlock
-			 * with the mmap_sem otherwise.  We'll get another
-			 * chance to drop them once the last reference to
-			 * the inode is dropped, so we'll never leak blocks
-			 * permanently.
-			 */
-			error = xfs_free_eofblocks(mp, ip,
-						   XFS_FREE_EOF_TRYLOCK);
-			if (error)
-				return error;
-		}
-	}
+	if ((((ip->i_d.di_mode & S_IFMT) == S_IFREG) &&
+	     ((ip->i_size > 0) || (VN_CACHED(VFS_I(ip)) > 0 ||
+	       ip->i_delayed_blks > 0)) &&
+	     (ip->i_df.if_flags & XFS_IFEXTENTS))  &&
+	    (!(ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)))) {
+		/*
+		 * If we can't get the iolock just skip truncating the blocks
+		 * past EOF because we could deadlock with the mmap_sem
+		 * otherwise.  We'll get another chance to drop them once the
+		 * last reference to the inode is dropped, so we'll never leak
+		 * blocks permanently.
+		 *
+		 * Further, count the number of times we get here in the life
+		 * of this inode. If the inode is being opened, written and
+		 * closed frequently and we have delayed allocation blocks
+		 * outstanding (e.g. streaming writes from the NFS server),
+		 * truncating the blocks past EOF will cause fragmentation to
+		 * occur.
+		 *
+		 * In this case don't do the truncation, either, but we have to
+		 * be careful how we detect this case. Blocks beyond EOF show
+		 * up as i_delayed_blks even when the inode is clean, so we
+		 * need to truncate them away first before checking for a dirty
+		 * release. Hence on the first couple of dirty closes, we will
+		 * still remove the speculative allocation, but then we will
+		 * leave it in place.
+		 */
+		if (ip->i_dirty_releases > 1)
+			return 0;
 
+		error = xfs_free_eofblocks(mp, ip,
+					   XFS_FREE_EOF_TRYLOCK);
+		if (error)
+			return error;
+
+		/* delalloc blocks after truncation means it really is dirty */
+		if (ip->i_delayed_blks)
+			ip->i_dirty_releases++;
+	}
 	return 0;
 }
 
-- 
1.7.2.3


* [PATCH 06/16] patch xfs-inode-hash-fake
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (4 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 05/16] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08  9:19   ` Christoph Hellwig
  2010-11-08  8:55 ` [PATCH 07/16] xfs: convert inode cache lookups to use RCU locking Dave Chinner
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

---
 fs/xfs/linux-2.6/xfs_iops.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index 496455a..8b46867 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -753,6 +753,10 @@ xfs_diflags_to_iflags(
  * We are always called with an uninitialised linux inode here.
  * We need to initialise the necessary fields and take a reference
  * on it.
+ *
+ * We don't use the VFS inode hash for lookups anymore, so make the inode look
+ * hashed to the VFS by faking it. This avoids needing to touch inode hash
+ * locks in this path, but makes the VFS believe the inode is validly hashed.
  */
 void
 xfs_setup_inode(
@@ -764,7 +768,7 @@ xfs_setup_inode(
 	inode->i_state = I_NEW;
 
 	inode_sb_list_add(inode);
-	insert_inode_hash(inode);
+	hlist_nulls_add_fake(&inode->i_hash);
 
 	inode->i_mode	= ip->i_d.di_mode;
 	inode->i_nlink	= ip->i_d.di_nlink;
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 07/16] xfs: convert inode cache lookups to use RCU locking
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (5 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 06/16] patch xfs-inode-hash-fake Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08 23:09   ` Christoph Hellwig
  2010-11-08  8:55 ` [PATCH 08/16] xfs: convert pag_ici_lock to a spin lock Dave Chinner
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

With delayed logging greatly increasing the sustained parallelism of inode
operations, the inode cache locking is showing significant read vs write
contention when inode reclaim runs at the same time as lookups. There are
also far more write lock acquisitions than read locks (a 4:1 ratio), so
the read locking is not really buying us much in the way of parallelism.

To avoid the read vs write contention, change the cache to use RCU locking on
the read side. To avoid needing to RCU free every single inode, use the built
in slab RCU freeing mechanism. This requires us to be able to detect lookups of
freed inodes, so ensure that every freed inode has an inode number of zero and
the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in the cache
hit lookup path, but add a check for a zero inode number as well.

We can then convert all the read-locking lookups to use RCU read side
locking, and hence remove the rwlock from the read side entirely.
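
For illustration only, the reuse-detection protocol described above can be
sketched in userspace C. The names here (fake_inode, cache_hit_valid) are
illustrative stand-ins, not the actual XFS structures, and the real code does
these checks under RCU and ip->i_flags_lock:

```c
#include <assert.h>
#include <stdbool.h>

#define IRECLAIM 0x1

/* Stand-in for struct xfs_inode: with SLAB_DESTROY_BY_RCU the memory can
 * be recycled for a new inode within a grace period, so a freed object
 * must stay distinguishable to concurrent lookups. */
struct fake_inode {
	unsigned long	i_ino;		/* zeroed on free */
	unsigned int	i_flags;	/* IRECLAIM set on free */
};

/* On teardown, publish the "freed" state before the slab object can be
 * recycled (done under i_flags_lock in the real code). */
static void fake_inode_free(struct fake_inode *ip)
{
	ip->i_flags = IRECLAIM;
	ip->i_ino = 0;
}

/* Cache-hit path: reject objects freed or recycled for a different inode
 * between the radix tree lookup and this check. */
static bool cache_hit_valid(struct fake_inode *ip, unsigned long want_ino)
{
	if (ip->i_ino == 0)
		return false;		/* freed, not yet recycled */
	if (ip->i_ino != want_ino)
		return false;		/* recycled for another inode */
	if (ip->i_flags & IRECLAIM)
		return false;		/* being torn down */
	return true;
}
```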

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
---
 fs/xfs/linux-2.6/xfs_iops.c    |    7 +++++-
 fs/xfs/linux-2.6/xfs_sync.c    |   13 +++++++++--
 fs/xfs/quota/xfs_qm_syscalls.c |    3 ++
 fs/xfs/xfs_iget.c              |   44 ++++++++++++++++++++++++++++++---------
 fs/xfs/xfs_inode.c             |   22 ++++++++++++-------
 5 files changed, 67 insertions(+), 22 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index 8b46867..909bd9c 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -757,6 +757,8 @@ xfs_diflags_to_iflags(
  * We don't use the VFS inode hash for lookups anymore, so make the inode look
  * hashed to the VFS by faking it. This avoids needing to touch inode hash
  * locks in this path, but makes the VFS believe the inode is validly hashed.
+ * We initialise i_state and i_hash under the i_lock so that we follow the same
+ * setup rules that the rest of the VFS follows.
  */
 void
 xfs_setup_inode(
@@ -765,10 +767,13 @@ xfs_setup_inode(
 	struct inode		*inode = &ip->i_vnode;
 
 	inode->i_ino = ip->i_ino;
+
+	spin_lock(&inode->i_lock);
 	inode->i_state = I_NEW;
+	hlist_nulls_add_fake(&inode->i_hash);
+	spin_unlock(&inode->i_lock);
 
 	inode_sb_list_add(inode);
-	hlist_nulls_add_fake(&inode->i_hash);
 
 	inode->i_mode	= ip->i_d.di_mode;
 	inode->i_nlink	= ip->i_d.di_nlink;
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index afb0d7c..9a53cc9 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -53,6 +53,10 @@ xfs_inode_ag_walk_grab(
 {
 	struct inode		*inode = VFS_I(ip);
 
+	/* check for stale RCU freed inode */
+	if (!ip->i_ino)
+		return ENOENT;
+
 	/* nothing to sync during shutdown */
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return EFSCORRUPTED;
@@ -98,12 +102,12 @@ restart:
 		int		error = 0;
 		int		i;
 
-		read_lock(&pag->pag_ici_lock);
+		rcu_read_lock();
 		nr_found = radix_tree_gang_lookup(&pag->pag_ici_root,
 					(void **)batch, first_index,
 					XFS_LOOKUP_BATCH);
 		if (!nr_found) {
-			read_unlock(&pag->pag_ici_lock);
+			rcu_read_unlock();
 			break;
 		}
 
@@ -129,7 +133,7 @@ restart:
 		}
 
 		/* unlock now we've grabbed the inodes. */
-		read_unlock(&pag->pag_ici_lock);
+		rcu_read_unlock();
 
 		for (i = 0; i < nr_found; i++) {
 			if (!batch[i])
@@ -639,6 +643,9 @@ xfs_reclaim_inode_grab(
 	struct xfs_inode	*ip,
 	int			flags)
 {
+	/* check for stale RCU freed inode */
+	if (!ip->i_ino)
+		return 1;
 
 	/*
 	 * do some unlocked checks first to avoid unnecceary lock traffic.
diff --git a/fs/xfs/quota/xfs_qm_syscalls.c b/fs/xfs/quota/xfs_qm_syscalls.c
index bdebc18..8b207fc 100644
--- a/fs/xfs/quota/xfs_qm_syscalls.c
+++ b/fs/xfs/quota/xfs_qm_syscalls.c
@@ -875,6 +875,9 @@ xfs_dqrele_inode(
 	struct xfs_perag	*pag,
 	int			flags)
 {
+	if (!ip->i_ino)
+		return ENOENT;
+
 	/* skip quota inodes */
 	if (ip == ip->i_mount->m_quotainfo->qi_uquotaip ||
 	    ip == ip->i_mount->m_quotainfo->qi_gquotaip) {
diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index 18991a9..edeb918 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -69,6 +69,7 @@ xfs_inode_alloc(
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
 	ASSERT(!spin_is_locked(&ip->i_flags_lock));
 	ASSERT(completion_done(&ip->i_flush));
+	ASSERT(ip->i_ino == 0);
 
 	mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
 
@@ -86,9 +87,6 @@ xfs_inode_alloc(
 	ip->i_new_size = 0;
 	ip->i_dirty_releases = 0;
 
-	/* prevent anyone from using this yet */
-	VFS_I(ip)->i_state = I_NEW;
-
 	return ip;
 }
 
@@ -135,6 +133,16 @@ xfs_inode_free(
 	ASSERT(!spin_is_locked(&ip->i_flags_lock));
 	ASSERT(completion_done(&ip->i_flush));
 
+	/*
+	 * because we use SLAB_DESTROY_BY_RCU freeing, ensure the inode
+	 * always appears to be reclaimed with an invalid inode number
+	 * when in the free state. The ip->i_flags_lock provides the barrier
+	 * against lookup races.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	ip->i_flags = XFS_IRECLAIM;
+	ip->i_ino = 0;
+	spin_unlock(&ip->i_flags_lock);
 	kmem_zone_free(xfs_inode_zone, ip);
 }
 
@@ -146,12 +154,28 @@ xfs_iget_cache_hit(
 	struct xfs_perag	*pag,
 	struct xfs_inode	*ip,
 	int			flags,
-	int			lock_flags) __releases(pag->pag_ici_lock)
+	int			lock_flags) __releases(RCU)
 {
 	struct inode		*inode = VFS_I(ip);
 	struct xfs_mount	*mp = ip->i_mount;
 	int			error;
 
+	/*
+	 * check for re-use of an inode within an RCU grace period due to the
+	 * radix tree nodes not being updated yet. We monitor for this by
+	 * setting the inode number to zero before freeing the inode structure.
+	 * We don't need to recheck this after taking the i_flags_lock because
+	 * the check against XFS_IRECLAIM will catch a freed inode.
+	 */
+	if (ip->i_ino == 0) {
+		trace_xfs_iget_skip(ip);
+		XFS_STATS_INC(xs_ig_frecycle);
+		rcu_read_unlock();
+		/* Expire the grace period so we don't trip over it again. */
+		synchronize_rcu();
+		return EAGAIN;
+	}
+
 	spin_lock(&ip->i_flags_lock);
 
 	/*
@@ -195,7 +219,7 @@ xfs_iget_cache_hit(
 		ip->i_flags |= XFS_IRECLAIM;
 
 		spin_unlock(&ip->i_flags_lock);
-		read_unlock(&pag->pag_ici_lock);
+		rcu_read_unlock();
 
 		error = -inode_init_always(mp->m_super, inode);
 		if (error) {
@@ -203,7 +227,7 @@ xfs_iget_cache_hit(
 			 * Re-initializing the inode failed, and we are in deep
 			 * trouble.  Try to re-add it to the reclaim list.
 			 */
-			read_lock(&pag->pag_ici_lock);
+			rcu_read_lock();
 			spin_lock(&ip->i_flags_lock);
 
 			ip->i_flags &= ~XFS_INEW;
@@ -231,7 +255,7 @@ xfs_iget_cache_hit(
 
 		/* We've got a live one. */
 		spin_unlock(&ip->i_flags_lock);
-		read_unlock(&pag->pag_ici_lock);
+		rcu_read_unlock();
 		trace_xfs_iget_hit(ip);
 	}
 
@@ -245,7 +269,7 @@ xfs_iget_cache_hit(
 
 out_error:
 	spin_unlock(&ip->i_flags_lock);
-	read_unlock(&pag->pag_ici_lock);
+	rcu_read_unlock();
 	return error;
 }
 
@@ -376,7 +400,7 @@ xfs_iget(
 
 again:
 	error = 0;
-	read_lock(&pag->pag_ici_lock);
+	rcu_read_lock();
 	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
 
 	if (ip) {
@@ -384,7 +408,7 @@ again:
 		if (error)
 			goto out_error_or_again;
 	} else {
-		read_unlock(&pag->pag_ici_lock);
+		rcu_read_unlock();
 		XFS_STATS_INC(xs_ig_missed);
 
 		error = xfs_iget_cache_miss(mp, pag, tp, ino, &ip,
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 108c7a0..25becb1 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2000,13 +2000,14 @@ xfs_ifree_cluster(
 		 */
 		for (i = 0; i < ninodes; i++) {
 retry:
-			read_lock(&pag->pag_ici_lock);
+			rcu_read_lock();
 			ip = radix_tree_lookup(&pag->pag_ici_root,
 					XFS_INO_TO_AGINO(mp, (inum + i)));
 
 			/* Inode not in memory or stale, nothing to do */
-			if (!ip || xfs_iflags_test(ip, XFS_ISTALE)) {
-				read_unlock(&pag->pag_ici_lock);
+			if (!ip || !ip->i_ino ||
+			    xfs_iflags_test(ip, XFS_ISTALE)) {
+				rcu_read_unlock();
 				continue;
 			}
 
@@ -2019,11 +2020,11 @@ retry:
 			 */
 			if (ip != free_ip &&
 			    !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
-				read_unlock(&pag->pag_ici_lock);
+				rcu_read_unlock();
 				delay(1);
 				goto retry;
 			}
-			read_unlock(&pag->pag_ici_lock);
+			rcu_read_unlock();
 
 			xfs_iflock(ip);
 			xfs_iflags_set(ip, XFS_ISTALE);
@@ -2629,7 +2630,7 @@ xfs_iflush_cluster(
 
 	mask = ~(((XFS_INODE_CLUSTER_SIZE(mp) >> mp->m_sb.sb_inodelog)) - 1);
 	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
-	read_lock(&pag->pag_ici_lock);
+	rcu_read_lock();
 	/* really need a gang lookup range call here */
 	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)ilist,
 					first_index, inodes_per_cluster);
@@ -2640,6 +2641,11 @@ xfs_iflush_cluster(
 		iq = ilist[i];
 		if (iq == ip)
 			continue;
+
+		/* check we've got a valid inode */
+		if (!iq->i_ino)
+			continue;
+
 		/* if the inode lies outside this cluster, we're done. */
 		if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index)
 			break;
@@ -2692,7 +2698,7 @@ xfs_iflush_cluster(
 	}
 
 out_free:
-	read_unlock(&pag->pag_ici_lock);
+	rcu_read_unlock();
 	kmem_free(ilist);
 out_put:
 	xfs_perag_put(pag);
@@ -2704,7 +2710,7 @@ cluster_corrupt_out:
 	 * Corruption detected in the clustering loop.  Invalidate the
 	 * inode buffer and shut down the filesystem.
 	 */
-	read_unlock(&pag->pag_ici_lock);
+	rcu_read_unlock();
 	/*
 	 * Clean up the buffer.  If it was B_DELWRI, just release it --
 	 * brelse can handle it with no problems.  If not, shut down the
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 08/16] xfs: convert pag_ici_lock to a spin lock
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (6 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 07/16] xfs: convert inode cache lookups to use RCU locking Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08 23:10   ` Christoph Hellwig
  2010-11-08  8:55 ` [PATCH 09/16] xfs: convert xfsbud shrinker to a per-buftarg shrinker Dave Chinner
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we are using RCU protection for the inode cache lookups,
the lock is only needed on the modification side. Hence it is not
necessary for the lock to be a rwlock as there are no read side
holders anymore. Convert it to a spin lock to reflect its exclusive
nature.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |   14 +++++++-------
 fs/xfs/xfs_ag.h             |    2 +-
 fs/xfs/xfs_iget.c           |   10 +++++-----
 fs/xfs/xfs_mount.c          |    2 +-
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 9a53cc9..0b3d367 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -596,12 +596,12 @@ xfs_inode_set_reclaim_tag(
 	struct xfs_perag *pag;
 
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-	write_lock(&pag->pag_ici_lock);
+	spin_lock(&pag->pag_ici_lock);
 	spin_lock(&ip->i_flags_lock);
 	__xfs_inode_set_reclaim_tag(pag, ip);
 	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
 	spin_unlock(&ip->i_flags_lock);
-	write_unlock(&pag->pag_ici_lock);
+	spin_unlock(&pag->pag_ici_lock);
 	xfs_perag_put(pag);
 }
 
@@ -802,12 +802,12 @@ reclaim:
 	 * added to the tree assert that it's been there before to catch
 	 * problems with the inode life time early on.
 	 */
-	write_lock(&pag->pag_ici_lock);
+	spin_lock(&pag->pag_ici_lock);
 	if (!radix_tree_delete(&pag->pag_ici_root,
 				XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino)))
 		ASSERT(0);
 	__xfs_inode_clear_reclaim(pag, ip);
-	write_unlock(&pag->pag_ici_lock);
+	spin_unlock(&pag->pag_ici_lock);
 
 	/*
 	 * Here we do an (almost) spurious inode lock in order to coordinate
@@ -871,14 +871,14 @@ restart:
 			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
 			int	i;
 
-			write_lock(&pag->pag_ici_lock);
+			spin_lock(&pag->pag_ici_lock);
 			nr_found = radix_tree_gang_lookup_tag(
 					&pag->pag_ici_root,
 					(void **)batch, first_index,
 					XFS_LOOKUP_BATCH,
 					XFS_ICI_RECLAIM_TAG);
 			if (!nr_found) {
-				write_unlock(&pag->pag_ici_lock);
+				spin_unlock(&pag->pag_ici_lock);
 				break;
 			}
 
@@ -905,7 +905,7 @@ restart:
 			}
 
 			/* unlock now we've grabbed the inodes. */
-			write_unlock(&pag->pag_ici_lock);
+			spin_unlock(&pag->pag_ici_lock);
 
 			for (i = 0; i < nr_found; i++) {
 				if (!batch[i])
diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index 63c7a1a..58632cc 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -227,7 +227,7 @@ typedef struct xfs_perag {
 
 	atomic_t        pagf_fstrms;    /* # of filestreams active in this AG */
 
-	rwlock_t	pag_ici_lock;	/* incore inode lock */
+	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
 	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index edeb918..e00d88c 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -237,14 +237,14 @@ xfs_iget_cache_hit(
 			goto out_error;
 		}
 
-		write_lock(&pag->pag_ici_lock);
+		spin_lock(&pag->pag_ici_lock);
 		spin_lock(&ip->i_flags_lock);
 		ip->i_flags &= ~(XFS_IRECLAIMABLE | XFS_IRECLAIM);
 		ip->i_flags |= XFS_INEW;
 		__xfs_inode_clear_reclaim_tag(mp, pag, ip);
 		inode->i_state = I_NEW;
 		spin_unlock(&ip->i_flags_lock);
-		write_unlock(&pag->pag_ici_lock);
+		spin_unlock(&pag->pag_ici_lock);
 	} else {
 		/* If the VFS inode is being torn down, pause and try again. */
 		if (!igrab(inode)) {
@@ -322,7 +322,7 @@ xfs_iget_cache_miss(
 			BUG();
 	}
 
-	write_lock(&pag->pag_ici_lock);
+	spin_lock(&pag->pag_ici_lock);
 
 	/* insert the new inode */
 	error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
@@ -337,14 +337,14 @@ xfs_iget_cache_miss(
 	ip->i_udquot = ip->i_gdquot = NULL;
 	xfs_iflags_set(ip, XFS_INEW);
 
-	write_unlock(&pag->pag_ici_lock);
+	spin_unlock(&pag->pag_ici_lock);
 	radix_tree_preload_end();
 
 	*ipp = ip;
 	return 0;
 
 out_preload_end:
-	write_unlock(&pag->pag_ici_lock);
+	spin_unlock(&pag->pag_ici_lock);
 	radix_tree_preload_end();
 	if (lock_flags)
 		xfs_iunlock(ip, lock_flags);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 0d9a030..7544258 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -564,7 +564,7 @@ xfs_initialize_perag(
 			goto out_unwind;
 		pag->pag_agno = index;
 		pag->pag_mount = mp;
-		rwlock_init(&pag->pag_ici_lock);
+		spin_lock_init(&pag->pag_ici_lock);
 		mutex_init(&pag->pag_ici_reclaim_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		spin_lock_init(&pag->pag_buf_lock);
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 09/16] xfs: convert xfsbud shrinker to a per-buftarg shrinker.
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (7 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 08/16] xfs: convert pag_ici_lock to a spin lock Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08  8:55 ` [PATCH 10/16] xfs: add a lru to the XFS buffer cache Dave Chinner
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Before we introduce per-buftarg LRU lists, split the shrinker
implementation into per-buftarg shrinker callbacks. At the moment
we wake all the xfsbufds to run the delayed write queues to free
the dirty buffers and make their pages available for reclaim.
However, with an LRU, we want to be able to free clean, unused
buffers as well, so we need to separate the xfsbufd from the
shrinker callbacks.
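
As a rough userspace sketch of the mechanism this enables, the callback
receives only the embedded struct shrinker and recovers the owning buftarg
with container_of(). The struct and function names below are illustrative,
not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace container_of(), as in the kernel. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct fake_shrinker {
	int (*shrink)(struct fake_shrinker *sh, int nr_to_scan);
};

/* One shrinker embedded per buftarg, so each callback only ever sees
 * its own target's state. */
struct fake_buftarg {
	int			queued;		/* delwri queue depth */
	struct fake_shrinker	shrinker;
};

static int fake_buftarg_shrink(struct fake_shrinker *sh, int nr_to_scan)
{
	struct fake_buftarg *btp =
		container_of(sh, struct fake_buftarg, shrinker);

	if (!btp->queued)
		return -1;	/* nothing this target can free */
	/* the real callback would wake this target's xfsbufd here */
	return btp->queued;
}
```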

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |   89 ++++++++++++--------------------------------
 fs/xfs/linux-2.6/xfs_buf.h |    4 +-
 2 files changed, 27 insertions(+), 66 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index aa1d353..f21803b 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -44,12 +44,7 @@
 
 static kmem_zone_t *xfs_buf_zone;
 STATIC int xfsbufd(void *);
-STATIC int xfsbufd_wakeup(struct shrinker *, int, gfp_t);
 STATIC void xfs_buf_delwri_queue(xfs_buf_t *, int);
-static struct shrinker xfs_buf_shake = {
-	.shrink = xfsbufd_wakeup,
-	.seeks = DEFAULT_SEEKS,
-};
 
 static struct workqueue_struct *xfslogd_workqueue;
 struct workqueue_struct *xfsdatad_workqueue;
@@ -337,7 +332,6 @@ _xfs_buf_lookup_pages(
 					__func__, gfp_mask);
 
 			XFS_STATS_INC(xb_page_retries);
-			xfsbufd_wakeup(NULL, 0, gfp_mask);
 			congestion_wait(BLK_RW_ASYNC, HZ/50);
 			goto retry;
 		}
@@ -1464,28 +1458,23 @@ xfs_wait_buftarg(
 	}
 }
 
-/*
- *	buftarg list for delwrite queue processing
- */
-static LIST_HEAD(xfs_buftarg_list);
-static DEFINE_SPINLOCK(xfs_buftarg_lock);
-
-STATIC void
-xfs_register_buftarg(
-	xfs_buftarg_t           *btp)
-{
-	spin_lock(&xfs_buftarg_lock);
-	list_add(&btp->bt_list, &xfs_buftarg_list);
-	spin_unlock(&xfs_buftarg_lock);
-}
-
-STATIC void
-xfs_unregister_buftarg(
-	xfs_buftarg_t           *btp)
+int
+xfs_buftarg_shrink(
+	struct shrinker		*shrink,
+	int			nr_to_scan,
+	gfp_t			mask)
 {
-	spin_lock(&xfs_buftarg_lock);
-	list_del(&btp->bt_list);
-	spin_unlock(&xfs_buftarg_lock);
+	struct xfs_buftarg	*btp = container_of(shrink,
+					struct xfs_buftarg, bt_shrinker);
+	if (nr_to_scan) {
+		if (test_bit(XBT_FORCE_SLEEP, &btp->bt_flags))
+			return -1;
+		if (list_empty(&btp->bt_delwrite_queue))
+			return -1;
+		set_bit(XBT_FORCE_FLUSH, &btp->bt_flags);
+		wake_up_process(btp->bt_task);
+	}
+	return list_empty(&btp->bt_delwrite_queue) ? -1 : 1;
 }
 
 void
@@ -1493,17 +1482,14 @@ xfs_free_buftarg(
 	struct xfs_mount	*mp,
 	struct xfs_buftarg	*btp)
 {
+	unregister_shrinker(&btp->bt_shrinker);
+
 	xfs_flush_buftarg(btp, 1);
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
 		xfs_blkdev_issue_flush(btp);
 	iput(btp->bt_mapping->host);
 
-	/* Unregister the buftarg first so that we don't get a
-	 * wakeup finding a non-existent task
-	 */
-	xfs_unregister_buftarg(btp);
 	kthread_stop(btp->bt_task);
-
 	kmem_free(btp);
 }
 
@@ -1600,20 +1586,13 @@ xfs_alloc_delwrite_queue(
 	xfs_buftarg_t		*btp,
 	const char		*fsname)
 {
-	int	error = 0;
-
-	INIT_LIST_HEAD(&btp->bt_list);
 	INIT_LIST_HEAD(&btp->bt_delwrite_queue);
 	spin_lock_init(&btp->bt_delwrite_lock);
 	btp->bt_flags = 0;
 	btp->bt_task = kthread_run(xfsbufd, btp, "xfsbufd/%s", fsname);
-	if (IS_ERR(btp->bt_task)) {
-		error = PTR_ERR(btp->bt_task);
-		goto out_error;
-	}
-	xfs_register_buftarg(btp);
-out_error:
-	return error;
+	if (IS_ERR(btp->bt_task))
+		return PTR_ERR(btp->bt_task);
+	return 0;
 }
 
 xfs_buftarg_t *
@@ -1636,6 +1615,9 @@ xfs_alloc_buftarg(
 		goto error;
 	if (xfs_alloc_delwrite_queue(btp, fsname))
 		goto error;
+	btp->bt_shrinker.shrink = xfs_buftarg_shrink;
+	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
+	register_shrinker(&btp->bt_shrinker);
 	return btp;
 
 error:
@@ -1740,27 +1722,6 @@ xfs_buf_runall_queues(
 	flush_workqueue(queue);
 }
 
-STATIC int
-xfsbufd_wakeup(
-	struct shrinker		*shrink,
-	int			priority,
-	gfp_t			mask)
-{
-	xfs_buftarg_t		*btp;
-
-	spin_lock(&xfs_buftarg_lock);
-	list_for_each_entry(btp, &xfs_buftarg_list, bt_list) {
-		if (test_bit(XBT_FORCE_SLEEP, &btp->bt_flags))
-			continue;
-		if (list_empty(&btp->bt_delwrite_queue))
-			continue;
-		set_bit(XBT_FORCE_FLUSH, &btp->bt_flags);
-		wake_up_process(btp->bt_task);
-	}
-	spin_unlock(&xfs_buftarg_lock);
-	return 0;
-}
-
 /*
  * Move as many buffers as specified to the supplied list
  * idicating if we skipped any buffers to prevent deadlocks.
@@ -1955,7 +1916,6 @@ xfs_buf_init(void)
 	if (!xfsconvertd_workqueue)
 		goto out_destroy_xfsdatad_workqueue;
 
-	register_shrinker(&xfs_buf_shake);
 	return 0;
 
  out_destroy_xfsdatad_workqueue:
@@ -1971,7 +1931,6 @@ xfs_buf_init(void)
 void
 xfs_buf_terminate(void)
 {
-	unregister_shrinker(&xfs_buf_shake);
 	destroy_workqueue(xfsconvertd_workqueue);
 	destroy_workqueue(xfsdatad_workqueue);
 	destroy_workqueue(xfslogd_workqueue);
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index 383a3f3..9344103 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -128,10 +128,12 @@ typedef struct xfs_buftarg {
 
 	/* per device delwri queue */
 	struct task_struct	*bt_task;
-	struct list_head	bt_list;
 	struct list_head	bt_delwrite_queue;
 	spinlock_t		bt_delwrite_lock;
 	unsigned long		bt_flags;
+
+	/* LRU control structures */
+	struct shrinker		bt_shrinker;
 } xfs_buftarg_t;
 
 /*
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 10/16] xfs: add a lru to the XFS buffer cache
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (8 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 09/16] xfs: convert xfsbud shrinker to a per-buftarg shrinker Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08 23:19   ` Christoph Hellwig
  2010-11-08  8:55 ` [PATCH 11/16] xfs: connect up buffer reclaim priority hooks Dave Chinner
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Introduce a per-buftarg LRU for memory reclaim to operate on. This
is the last piece we need to put in place so that we can fully
control the buffer lifecycle. This allows XFS to be responsible for
maintaining the working set of buffers under memory pressure instead
of relying on the VM reclaim not to take pages we need out from
underneath us.

The implementation introduces a b_lru_ref counter into the buffer.
This is currently set to 1 whenever the buffer is referenced and so is used to
determine if the buffer should be added to the LRU or not when freed.
Effectively it allows lazy LRU initialisation of the buffer so we do not need
to touch the LRU list and locks in xfs_buf_find().

Instead, when the buffer is being released and we drop the last
reference to it, we check the b_lru_ref count and if it is non-zero
we re-add the buffer reference and add the buffer to the LRU. The
b_lru_ref counter is decremented by the shrinker, and whenever the
shrinker comes across a buffer with a zero b_lru_ref counter it
releases the LRU reference on the buffer. In the absence of a lookup
race, this will result in the buffer being freed.

This counting mechanism is used instead of a reference flag so that
it is simple to re-introduce buffer-type specific reclaim reference
counts to prioritise reclaim more effectively. We still have all
those hooks in the XFS code, so this will provide the infrastructure
to re-implement that functionality.
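
The decrement-unless-zero step the shrinker relies on (atomic_add_unless()
in the patch) can be sketched in userspace C11; the function name here is
illustrative:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace equivalent of atomic_add_unless(&ref, -1, 0): decrement the
 * LRU reference count unless it is already zero, returning true if we
 * decremented. The shrinker reclaims a buffer only once this fails,
 * giving the buffer one LRU pass per reference it has accumulated. */
static bool lru_ref_dec_unless_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		if (atomic_compare_exchange_weak(ref, &old, old - 1))
			return true;
		/* old was reloaded on failure; retry */
	}
	return false;
}
```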

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |  168 ++++++++++++++++++++++++++++++++++++++------
 fs/xfs/linux-2.6/xfs_buf.h |    8 ++-
 2 files changed, 153 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index f21803b..80d9f13 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -163,8 +163,79 @@ test_page_region(
 }
 
 /*
- *	Internal xfs_buf_t object manipulation
+ * xfs_buf_lru_add - add a buffer to the LRU.
+ *
+ * The LRU takes a new reference to the buffer so that it will only be freed
+ * once the shrinker takes the buffer off the LRU.
+ */
+STATIC void
+xfs_buf_lru_add(
+	struct xfs_buf	*bp)
+{
+	struct xfs_buftarg *btp = bp->b_target;
+
+	spin_lock(&btp->bt_lru_lock);
+	if (list_empty(&bp->b_lru)) {
+		atomic_inc(&bp->b_hold);
+		list_add_tail(&bp->b_lru, &btp->bt_lru);
+		btp->bt_lru_nr++;
+	}
+	spin_unlock(&btp->bt_lru_lock);
+}
+
+/*
+ * xfs_buf_lru_del - remove a buffer from the LRU
+ *
+ * The unlocked check is safe here because it only occurs when there are no
+ * b_lru_ref counts left on the buffer under the pag->pag_buf_lock. It is there
+ * to optimise the shrinker removing the buffer from the LRU and calling
+ * xfs_buf_free(), i.e. it removes an unnecessary round trip on the
+ * bt_lru_lock.
+ */
+STATIC void
+xfs_buf_lru_del(
+	struct xfs_buf	*bp)
+{
+	struct xfs_buftarg *btp = bp->b_target;
+
+	if (list_empty(&bp->b_lru))
+		return;
+
+	spin_lock(&btp->bt_lru_lock);
+	if (!list_empty(&bp->b_lru)) {
+		list_del_init(&bp->b_lru);
+		btp->bt_lru_nr--;
+	}
+	spin_unlock(&btp->bt_lru_lock);
+}
+
+/*
+ * When we mark a buffer stale, we remove the buffer from the LRU and clear the
+ * b_lru_ref count so that the buffer is freed immediately when the buffer
+ * reference count falls to zero. If the buffer is already on the LRU, we need
+ * to remove the reference that LRU holds on the buffer.
+ *
+ * This prevents build-up of stale buffers on the LRU.
  */
+void
+xfs_buf_stale(
+	struct xfs_buf	*bp)
+{
+	bp->b_flags |= XBF_STALE;
+	atomic_set(&(bp)->b_lru_ref, 0);
+	if (!list_empty(&bp->b_lru)) {
+		struct xfs_buftarg *btp = bp->b_target;
+
+		spin_lock(&btp->bt_lru_lock);
+		if (!list_empty(&bp->b_lru)) {
+			list_del_init(&bp->b_lru);
+			btp->bt_lru_nr--;
+			atomic_dec(&bp->b_hold);
+		}
+		spin_unlock(&btp->bt_lru_lock);
+	}
+	ASSERT(atomic_read(&bp->b_hold) >= 1);
+}
 
 STATIC void
 _xfs_buf_initialize(
@@ -181,7 +252,9 @@ _xfs_buf_initialize(
 
 	memset(bp, 0, sizeof(xfs_buf_t));
 	atomic_set(&bp->b_hold, 1);
+	atomic_set(&bp->b_lru_ref, 1);
 	init_completion(&bp->b_iowait);
+	INIT_LIST_HEAD(&bp->b_lru);
 	INIT_LIST_HEAD(&bp->b_list);
 	RB_CLEAR_NODE(&bp->b_rbnode);
 	sema_init(&bp->b_sema, 0); /* held, no waiters */
@@ -257,6 +330,8 @@ xfs_buf_free(
 {
 	trace_xfs_buf_free(bp, _RET_IP_);
 
+	ASSERT(list_empty(&bp->b_lru));
+
 	if (bp->b_flags & (_XBF_PAGE_CACHE|_XBF_PAGES)) {
 		uint		i;
 
@@ -471,6 +546,8 @@ _xfs_buf_find(
 		/* the buffer keeps the perag reference until it is freed */
 		new_bp->b_pag = pag;
 		spin_unlock(&pag->pag_buf_lock);
+
+		xfs_buf_lru_add(new_bp);
 	} else {
 		XFS_STATS_INC(xb_miss_locked);
 		spin_unlock(&pag->pag_buf_lock);
@@ -835,6 +912,7 @@ xfs_buf_rele(
 
 	if (!pag) {
 		ASSERT(!bp->b_relse);
+		ASSERT(list_empty(&bp->b_lru));
 		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
 		if (atomic_dec_and_test(&bp->b_hold))
 			xfs_buf_free(bp);
@@ -842,13 +920,19 @@ xfs_buf_rele(
 	}
 
 	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
+
 	ASSERT(atomic_read(&bp->b_hold) > 0);
 	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
 		if (bp->b_relse) {
 			atomic_inc(&bp->b_hold);
 			spin_unlock(&pag->pag_buf_lock);
 			bp->b_relse(bp);
+		} else if (!(bp->b_flags & XBF_STALE) &&
+			   atomic_read(&bp->b_lru_ref)) {
+			xfs_buf_lru_add(bp);
+			spin_unlock(&pag->pag_buf_lock);
 		} else {
+			xfs_buf_lru_del(bp);
 			ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q)));
 			rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
 			spin_unlock(&pag->pag_buf_lock);
@@ -1435,27 +1519,35 @@ xfs_buf_iomove(
  */
 
 /*
- *	Wait for any bufs with callbacks that have been submitted but
- *	have not yet returned... walk the hash list for the target.
+ * Wait for any bufs with callbacks that have been submitted but have not yet
+ * returned. These buffers will have an elevated hold count, so wait on those
+ * while freeing all the buffers only held by the LRU.
  */
 void
 xfs_wait_buftarg(
 	struct xfs_buftarg	*btp)
 {
-	struct xfs_perag	*pag;
-	uint			i;
+	struct xfs_buf		*bp;
 
-	for (i = 0; i < btp->bt_mount->m_sb.sb_agcount; i++) {
-		pag = xfs_perag_get(btp->bt_mount, i);
-		spin_lock(&pag->pag_buf_lock);
-		while (rb_first(&pag->pag_buf_tree)) {
-			spin_unlock(&pag->pag_buf_lock);
+restart:
+	spin_lock(&btp->bt_lru_lock);
+	while (!list_empty(&btp->bt_lru)) {
+		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
+		if (atomic_read(&bp->b_hold) > 1) {
+			spin_unlock(&btp->bt_lru_lock);
 			delay(100);
-			spin_lock(&pag->pag_buf_lock);
+			goto restart;
 		}
-		spin_unlock(&pag->pag_buf_lock);
-		xfs_perag_put(pag);
+		/*
+		 * clear the LRU reference count so the buffer doesn't get
+		 * ignored in xfs_buf_rele().
+		 */
+		atomic_set(&bp->b_lru_ref, 0);
+		spin_unlock(&btp->bt_lru_lock);
+		xfs_buf_rele(bp);
+		spin_lock(&btp->bt_lru_lock);
 	}
+	spin_unlock(&btp->bt_lru_lock);
 }
 
 int
@@ -1466,15 +1558,45 @@ xfs_buftarg_shrink(
 {
 	struct xfs_buftarg	*btp = container_of(shrink,
 					struct xfs_buftarg, bt_shrinker);
-	if (nr_to_scan) {
-		if (test_bit(XBT_FORCE_SLEEP, &btp->bt_flags))
-			return -1;
-		if (list_empty(&btp->bt_delwrite_queue))
-			return -1;
-		set_bit(XBT_FORCE_FLUSH, &btp->bt_flags);
-		wake_up_process(btp->bt_task);
+	struct xfs_buf		*bp;
+	LIST_HEAD(dispose);
+
+	if (!nr_to_scan)
+		return btp->bt_lru_nr;
+
+	spin_lock(&btp->bt_lru_lock);
+	while (!list_empty(&btp->bt_lru)) {
+		if (nr_to_scan-- <= 0)
+			break;
+
+		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
+
+		/*
+		 * Decrement the b_lru_ref count unless the value is already
+		 * zero. If the value is already zero, we need to reclaim the
+		 * buffer, otherwise it gets another trip through the LRU.
+		 */
+		if (!atomic_add_unless(&bp->b_lru_ref, -1, 0)) {
+			list_move_tail(&bp->b_lru, &btp->bt_lru);
+			continue;
+		}
+
+		/*
+		 * remove the buffer from the LRU now to avoid needing another
+		 * lock round trip inside xfs_buf_rele().
+		 */
+		list_move(&bp->b_lru, &dispose);
+		btp->bt_lru_nr--;
 	}
-	return list_empty(&btp->bt_delwrite_queue) ? -1 : 1;
+	spin_unlock(&btp->bt_lru_lock);
+
+	while (!list_empty(&dispose)) {
+		bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
+		list_del_init(&bp->b_lru);
+		xfs_buf_rele(bp);
+	}
+
+	return btp->bt_lru_nr;
 }
 
 void
@@ -1609,6 +1731,8 @@ xfs_alloc_buftarg(
 	btp->bt_mount = mp;
 	btp->bt_dev =  bdev->bd_dev;
 	btp->bt_bdev = bdev;
+	INIT_LIST_HEAD(&btp->bt_lru);
+	spin_lock_init(&btp->bt_lru_lock);
 	if (xfs_setsize_buftarg_early(btp, bdev))
 		goto error;
 	if (xfs_mapping_buftarg(btp, bdev))
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index 9344103..4601eab 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -134,6 +134,9 @@ typedef struct xfs_buftarg {
 
 	/* LRU control structures */
 	struct shrinker		bt_shrinker;
+	struct list_head	bt_lru;
+	spinlock_t		bt_lru_lock;
+	unsigned int		bt_lru_nr;
 } xfs_buftarg_t;
 
 /*
@@ -166,9 +169,11 @@ typedef struct xfs_buf {
 	xfs_off_t		b_file_offset;	/* offset in file */
 	size_t			b_buffer_length;/* size of buffer in bytes */
 	atomic_t		b_hold;		/* reference count */
+	atomic_t		b_lru_ref;	/* lru reclaim ref count */
 	xfs_buf_flags_t		b_flags;	/* status flags */
 	struct semaphore	b_sema;		/* semaphore for lockables */
 
+	struct list_head	b_lru;		/* lru list */
 	wait_queue_head_t	b_waiters;	/* unpin waiters */
 	struct list_head	b_list;
 	struct xfs_perag	*b_pag;		/* contains rbtree root */
@@ -266,7 +271,8 @@ extern void xfs_buf_terminate(void);
 #define XFS_BUF_ZEROFLAGS(bp)	((bp)->b_flags &= \
 		~(XBF_READ|XBF_WRITE|XBF_ASYNC|XBF_DELWRI|XBF_ORDERED))
 
-#define XFS_BUF_STALE(bp)	((bp)->b_flags |= XBF_STALE)
+void xfs_buf_stale(struct xfs_buf *bp);
+#define XFS_BUF_STALE(bp)	xfs_buf_stale(bp);
 #define XFS_BUF_UNSTALE(bp)	((bp)->b_flags &= ~XBF_STALE)
 #define XFS_BUF_ISSTALE(bp)	((bp)->b_flags & XBF_STALE)
 #define XFS_BUF_SUPER_STALE(bp)	do {				\
-- 
1.7.2.3

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 11/16] xfs: connect up buffer reclaim priority hooks
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (9 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 10/16] xfs: add a lru to the XFS buffer cache Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08 11:25   ` Christoph Hellwig
  2010-11-08  8:55 ` [PATCH 12/16] xfs: bulk AIL insertion during transaction commit Dave Chinner
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now that the buffer reclaim infrastructure can handle different reclaim
priorities for different types of buffers, reconnect the hooks in the
XFS code that have been sitting dormant since it was ported to Linux.
This should finally give us reclaim prioritisation that is on a par
with the functionality that Irix provided XFS 15 years ago.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.h |   31 +++++++++++++++++++++++++++++--
 fs/xfs/quota/xfs_dquot.c   |    2 +-
 fs/xfs/xfs_alloc.c         |    4 ++--
 fs/xfs/xfs_btree.c         |   11 +++++------
 fs/xfs/xfs_da_btree.c      |    4 ++--
 fs/xfs/xfs_ialloc.c        |    2 +-
 fs/xfs/xfs_inode.c         |    2 +-
 fs/xfs/xfs_trans.h         |    2 +-
 8 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index 4601eab..2dff03f 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -336,9 +336,36 @@ void xfs_buf_stale(struct xfs_buf *bp);
 #define XFS_BUF_SIZE(bp)		((bp)->b_buffer_length)
 #define XFS_BUF_SET_SIZE(bp, cnt)	((bp)->b_buffer_length = (cnt))
 
-#define XFS_BUF_SET_VTYPE_REF(bp, type, ref)	do { } while (0)
+/*
+ * buffer types 
+ */
+#define	B_FS_DQUOT	1
+#define	B_FS_AGFL	2
+#define	B_FS_AGF	3
+#define	B_FS_ATTR_BTREE	4
+#define	B_FS_DIR_BTREE	5
+#define	B_FS_MAP	6
+#define	B_FS_INOMAP	7
+#define	B_FS_AGI	8
+#define	B_FS_INO	9
+
+static inline void
+xfs_buf_set_vtype_ref(
+	struct xfs_buf	*bp,
+	int		type,
+	int		lru_ref)
+{
+	atomic_set(&bp->b_lru_ref, lru_ref);
+}
+
+static inline void
+xfs_buf_set_ref(
+	struct xfs_buf	*bp,
+	int		lru_ref)
+{
+	atomic_set(&bp->b_lru_ref, lru_ref);
+}
 #define XFS_BUF_SET_VTYPE(bp, type)		do { } while (0)
-#define XFS_BUF_SET_REF(bp, ref)		do { } while (0)
 
 #define XFS_BUF_ISPINNED(bp)	atomic_read(&((bp)->b_pin_count))
 
diff --git a/fs/xfs/quota/xfs_dquot.c b/fs/xfs/quota/xfs_dquot.c
index faf8e1a..682cbf5 100644
--- a/fs/xfs/quota/xfs_dquot.c
+++ b/fs/xfs/quota/xfs_dquot.c
@@ -607,7 +607,7 @@ xfs_qm_dqread(
 	dqp->q_res_rtbcount = be64_to_cpu(ddqp->d_rtbcount);
 
 	/* Mark the buf so that this will stay incore a little longer */
-	XFS_BUF_SET_VTYPE_REF(bp, B_FS_DQUOT, XFS_DQUOT_REF);
+	xfs_buf_set_vtype_ref(bp, B_FS_DQUOT, XFS_DQUOT_REF);
 
 	/*
 	 * We got the buffer with a xfs_trans_read_buf() (in dqtobp())
diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
index 112abc4..702b097 100644
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -463,7 +463,7 @@ xfs_alloc_read_agfl(
 		return error;
 	ASSERT(bp);
 	ASSERT(!XFS_BUF_GETERROR(bp));
-	XFS_BUF_SET_VTYPE_REF(bp, B_FS_AGFL, XFS_AGFL_REF);
+	xfs_buf_set_vtype_ref(bp, B_FS_AGFL, XFS_AGFL_REF);
 	*bpp = bp;
 	return 0;
 }
@@ -2160,7 +2160,7 @@ xfs_read_agf(
 		xfs_trans_brelse(tp, *bpp);
 		return XFS_ERROR(EFSCORRUPTED);
 	}
-	XFS_BUF_SET_VTYPE_REF(*bpp, B_FS_AGF, XFS_AGF_REF);
+	xfs_buf_set_vtype_ref(*bpp, B_FS_AGF, XFS_AGF_REF);
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
index 04f9cca..20cec22 100644
--- a/fs/xfs/xfs_btree.c
+++ b/fs/xfs/xfs_btree.c
@@ -634,9 +634,8 @@ xfs_btree_read_bufl(
 		return error;
 	}
 	ASSERT(!bp || !XFS_BUF_GETERROR(bp));
-	if (bp != NULL) {
-		XFS_BUF_SET_VTYPE_REF(bp, B_FS_MAP, refval);
-	}
+	if (bp)
+		xfs_buf_set_vtype_ref(bp, B_FS_MAP, refval);
 	*bpp = bp;
 	return 0;
 }
@@ -944,13 +943,13 @@ xfs_btree_set_refs(
 	switch (cur->bc_btnum) {
 	case XFS_BTNUM_BNO:
 	case XFS_BTNUM_CNT:
-		XFS_BUF_SET_VTYPE_REF(*bpp, B_FS_MAP, XFS_ALLOC_BTREE_REF);
+		xfs_buf_set_vtype_ref(bp, B_FS_MAP, XFS_ALLOC_BTREE_REF);
 		break;
 	case XFS_BTNUM_INO:
-		XFS_BUF_SET_VTYPE_REF(*bpp, B_FS_INOMAP, XFS_INO_BTREE_REF);
+		xfs_buf_set_vtype_ref(bp, B_FS_INOMAP, XFS_INO_BTREE_REF);
 		break;
 	case XFS_BTNUM_BMAP:
-		XFS_BUF_SET_VTYPE_REF(*bpp, B_FS_MAP, XFS_BMAP_BTREE_REF);
+		xfs_buf_set_vtype_ref(bp, B_FS_MAP, XFS_BMAP_BTREE_REF);
 		break;
 	default:
 		ASSERT(0);
diff --git a/fs/xfs/xfs_da_btree.c b/fs/xfs/xfs_da_btree.c
index 1c00bed..eea90ff 100644
--- a/fs/xfs/xfs_da_btree.c
+++ b/fs/xfs/xfs_da_btree.c
@@ -2056,10 +2056,10 @@ xfs_da_do_buf(
 			continue;
 		if (caller == 1) {
 			if (whichfork == XFS_ATTR_FORK) {
-				XFS_BUF_SET_VTYPE_REF(bp, B_FS_ATTR_BTREE,
+				xfs_buf_set_vtype_ref(bp, B_FS_ATTR_BTREE,
 						XFS_ATTR_BTREE_REF);
 			} else {
-				XFS_BUF_SET_VTYPE_REF(bp, B_FS_DIR_BTREE,
+				xfs_buf_set_vtype_ref(bp, B_FS_DIR_BTREE,
 						XFS_DIR_BTREE_REF);
 			}
 		}
diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index 0626a32..7fe7f35 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -1517,7 +1517,7 @@ xfs_read_agi(
 		return XFS_ERROR(EFSCORRUPTED);
 	}
 
-	XFS_BUF_SET_VTYPE_REF(*bpp, B_FS_AGI, XFS_AGI_REF);
+	xfs_buf_set_vtype_ref(*bpp, B_FS_AGI, XFS_AGI_REF);
 
 	xfs_check_agi_unlinked(agi);
 	return 0;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 25becb1..fc09a22 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -887,7 +887,7 @@ xfs_iread(
 	 * around for a while.  This helps to keep recently accessed
 	 * meta-data in-core longer.
 	 */
-	XFS_BUF_SET_REF(bp, XFS_INO_REF);
+	xfs_buf_set_ref(bp, XFS_INO_REF);
 
 	/*
 	 * Use xfs_trans_brelse() to release the buffer containing the
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 246286b..c2042b7 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -294,8 +294,8 @@ struct xfs_log_item_desc {
 #define	XFS_ALLOC_BTREE_REF	2
 #define	XFS_BMAP_BTREE_REF	2
 #define	XFS_DIR_BTREE_REF	2
+#define	XFS_INO_REF		2
 #define	XFS_ATTR_BTREE_REF	1
-#define	XFS_INO_REF		1
 #define	XFS_DQUOT_REF		1
 
 #ifdef __KERNEL__
-- 
1.7.2.3


* [PATCH 12/16] xfs: bulk AIL insertion during transaction commit
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (10 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 11/16] xfs: connect up buffer reclaim priority hooks Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08  8:55 ` [PATCH 13/16] xfs: reduce the number of AIL push wakeups Dave Chinner
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When inserting items into the AIL from the transaction committed
callbacks, we take the AIL lock for every single item that is to be
inserted. For a CIL checkpoint commit, this can be tens of thousands
of individual inserts, yet almost all of the items will be inserted
at the same point in the AIL because they have the same index.

To reduce the overhead and contention on the AIL lock for such
operations, introduce a "bulk insert" operation which allows a list
of log items with the same LSN to be inserted in a single operation
via a list splice. To do this, we need to pre-sort the log items
being committed into a temporary list for insertion.

The complexity is that not every log item will end up with the same
LSN, and not every item is actually inserted into the AIL. Items
that don't match the commit LSN will be inserted and unpinned as per
the current one-at-a-time method (relatively rare), while items that
are not to be inserted will be unpinned and freed immediately. Items
that are to be inserted at the given commit lsn are placed in a
temporary array and inserted into the AIL in bulk each time the
array fills up.

As a result of this, we trade off AIL hold time for a significant
reduction in traffic. lock_stat output shows that the worst case
hold time is unchanged, but contention from AIL inserts drops by an
order of magnitude and the number of lock traversals decreases
significantly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_extfree_item.c  |   85 +++++++++++++++++-------------
 fs/xfs/xfs_extfree_item.h  |   12 ++--
 fs/xfs/xfs_inode_item.c    |   20 +++++++
 fs/xfs/xfs_log_cil.c       |    9 +---
 fs/xfs/xfs_log_recover.c   |    4 +-
 fs/xfs/xfs_trans.c         |   70 ++++++++++++++++++++++++-
 fs/xfs/xfs_trans_ail.c     |  124 +++++++++++++++++++++++++++++++++++++-------
 fs/xfs/xfs_trans_extfree.c |    4 +-
 fs/xfs/xfs_trans_priv.h    |    9 +++-
 9 files changed, 259 insertions(+), 78 deletions(-)

diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index a55e687..5e16d7d 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -74,7 +74,8 @@ xfs_efi_item_format(
 	struct xfs_efi_log_item	*efip = EFI_ITEM(lip);
 	uint			size;
 
-	ASSERT(efip->efi_next_extent == efip->efi_format.efi_nextents);
+	ASSERT(atomic_read(&efip->efi_next_extent) ==
+				efip->efi_format.efi_nextents);
 
 	efip->efi_format.efi_type = XFS_LI_EFI;
 
@@ -99,10 +100,10 @@ xfs_efi_item_pin(
 }
 
 /*
- * While EFIs cannot really be pinned, the unpin operation is the
- * last place at which the EFI is manipulated during a transaction.
- * Here we coordinate with xfs_efi_cancel() to determine who gets to
- * free the EFI.
+ * While EFIs cannot really be pinned, the unpin operation is the last place at
+ * which the EFI is manipulated during a transaction.  Here we coordinate with
+ * xfs_efi_release() (via XFS_EFI_COMMITTED) to determine who gets to free
+ * the EFI.
  */
 STATIC void
 xfs_efi_item_unpin(
@@ -112,18 +113,18 @@ xfs_efi_item_unpin(
 	struct xfs_efi_log_item	*efip = EFI_ITEM(lip);
 	struct xfs_ail		*ailp = lip->li_ailp;
 
-	spin_lock(&ailp->xa_lock);
-	if (efip->efi_flags & XFS_EFI_CANCELED) {
-		if (remove)
-			xfs_trans_del_item(lip);
-
-		/* xfs_trans_ail_delete() drops the AIL lock. */
-		xfs_trans_ail_delete(ailp, lip);
-		xfs_efi_item_free(efip);
-	} else {
-		efip->efi_flags |= XFS_EFI_COMMITTED;
-		spin_unlock(&ailp->xa_lock);
+	if (remove) {
+		/* transaction cancel - delete and free the item */
+		xfs_trans_del_item(lip);
+	} else if (test_and_clear_bit(XFS_EFI_COMMITTED, &efip->efi_flags)) {
+		/* efd has not been processed yet, it will free the efi */
+		return;
 	}
+
+	spin_lock(&ailp->xa_lock);
+	/* xfs_trans_ail_delete() drops the AIL lock. */
+	xfs_trans_ail_delete(ailp, lip);
+	xfs_efi_item_free(efip);
 }
 
 /*
@@ -152,16 +153,22 @@ xfs_efi_item_unlock(
 }
 
 /*
- * The EFI is logged only once and cannot be moved in the log, so
- * simply return the lsn at which it's been logged.  The canceled
- * flag is not paid any attention here.  Checking for that is delayed
- * until the EFI is unpinned.
+ * The EFI is logged only once and cannot be moved in the log, so simply return
+ * the lsn at which it's been logged.  The canceled flag is not paid any
+ * attention here.  Checking for that is delayed until the EFI is unpinned.
+ *
+ * For bulk transaction committed processing, the EFI may be processed but not
+ * yet unpinned prior to the EFD being processed. Set the XFS_EFI_COMMITTED
+ * flag so this case can be detected when processing the EFD.
  */
 STATIC xfs_lsn_t
 xfs_efi_item_committed(
 	struct xfs_log_item	*lip,
 	xfs_lsn_t		lsn)
 {
+	struct xfs_efi_log_item	*efip = EFI_ITEM(lip);
+
+	set_bit(XFS_EFI_COMMITTED, &efip->efi_flags);
 	return lsn;
 }
 
@@ -289,36 +296,38 @@ xfs_efi_copy_format(xfs_log_iovec_t *buf, xfs_efi_log_format_t *dst_efi_fmt)
 }
 
 /*
- * This is called by the efd item code below to release references to
- * the given efi item.  Each efd calls this with the number of
- * extents that it has logged, and when the sum of these reaches
- * the total number of extents logged by this efi item we can free
- * the efi item.
+ * This is called by the efd item code below to release references to the given
+ * efi item.  Each efd calls this with the number of extents that it has
+ * logged, and when the sum of these reaches the total number of extents logged
+ * by this efi item we can free the efi item.
+ *
+ * Freeing the efi item requires that we remove it from the AIL if it has
+ * already been placed there. However, the EFI may not yet have been placed in
+ * the AIL due to a bulk insert operation, so we have to be careful here. This
+ * case is detected if the XFS_EFI_COMMITTED flag is set. This code is
+ * tricky - both xfs_efi_item_unpin() and this code do test_and_clear_bit()
+ * operations on this flag - if it is not set here, then it means that the
+ * unpin has run and we don't need to free it. If it is set here, then we clear
+ * it to tell the unpin we have run and that the unpin needs to free the EFI.
  *
- * Freeing the efi item requires that we remove it from the AIL.
- * We'll use the AIL lock to protect our counters as well as
- * the removal from the AIL.
  */
 void
 xfs_efi_release(xfs_efi_log_item_t	*efip,
 		uint			nextents)
 {
 	struct xfs_ail		*ailp = efip->efi_item.li_ailp;
-	int			extents_left;
 
-	ASSERT(efip->efi_next_extent > 0);
-	ASSERT(efip->efi_flags & XFS_EFI_COMMITTED);
+	ASSERT(atomic_read(&efip->efi_next_extent) > 0);
 
-	spin_lock(&ailp->xa_lock);
-	ASSERT(efip->efi_next_extent >= nextents);
-	efip->efi_next_extent -= nextents;
-	extents_left = efip->efi_next_extent;
-	if (extents_left == 0) {
+	ASSERT(atomic_read(&efip->efi_next_extent) >= nextents);
+	if (!atomic_sub_and_test(nextents, &efip->efi_next_extent))
+		return;
+
+	if (!test_and_clear_bit(XFS_EFI_COMMITTED, &efip->efi_flags)) {
+		spin_lock(&ailp->xa_lock);
 		/* xfs_trans_ail_delete() drops the AIL lock. */
 		xfs_trans_ail_delete(ailp, (xfs_log_item_t *)efip);
 		xfs_efi_item_free(efip);
-	} else {
-		spin_unlock(&ailp->xa_lock);
 	}
 }
 
diff --git a/fs/xfs/xfs_extfree_item.h b/fs/xfs/xfs_extfree_item.h
index 0d22c56..26a7550 100644
--- a/fs/xfs/xfs_extfree_item.h
+++ b/fs/xfs/xfs_extfree_item.h
@@ -111,11 +111,11 @@ typedef struct xfs_efd_log_format_64 {
 #define	XFS_EFI_MAX_FAST_EXTENTS	16
 
 /*
- * Define EFI flags.
+ * Define EFI flag bits. Manipulated by set/clear/test_bit operators.
  */
-#define	XFS_EFI_RECOVERED	0x1
-#define	XFS_EFI_COMMITTED	0x2
-#define	XFS_EFI_CANCELED	0x4
+#define	XFS_EFI_RECOVERED	1
+#define	XFS_EFI_CANCELED	2
+#define	XFS_EFI_COMMITTED	3
 
 /*
  * This is the "extent free intention" log item.  It is used
@@ -125,8 +125,8 @@ typedef struct xfs_efd_log_format_64 {
  */
 typedef struct xfs_efi_log_item {
 	xfs_log_item_t		efi_item;
-	uint			efi_flags;	/* misc flags */
-	uint			efi_next_extent;
+	atomic_t		efi_next_extent;
+	unsigned long		efi_flags;	/* misc flags */
 	xfs_efi_log_format_t	efi_format;
 } xfs_efi_log_item_t;
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index c7ac020..3be7bdc 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -663,12 +663,32 @@ xfs_inode_item_unlock(
  * all dirty data in an inode, the latest copy in the on disk log
  * is the only one that matters.  Therefore, simply return the
  * given lsn.
+ *
+ * If the inode has been marked stale because the cluster is being freed,
+ * we don't want to (re-)insert this inode into the AIL. There is a race
+ * condition where the cluster buffer may be unpinned before the inode is
+ * inserted into the AIL during transaction committed processing. If the buffer
+ * is unpinned before the inode item has been committed and inserted, then
+ * it is possible for the buffer to be written and process IO completions
+ * before the inode is inserted into the AIL. In that case, we'd be inserting a
+ * clean, stale inode into the AIL which will never get removed. It will,
+ * however, get reclaimed which triggers an assert in xfs_inode_free()
+ * complaining about freeing an inode still in the AIL.
+ *
+ * To avoid this, return a lower LSN than the one passed in so that the
+ * transaction committed code will not move the inode forward in the AIL
+ * but will still unpin it properly.
  */
 STATIC xfs_lsn_t
 xfs_inode_item_committed(
 	struct xfs_log_item	*lip,
 	xfs_lsn_t		lsn)
 {
+	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
+	struct xfs_inode	*ip = iip->ili_inode;
+
+	if (xfs_iflags_test(ip, XFS_ISTALE))
+		return lsn - 1;
 	return lsn;
 }
 
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 23d6ceb..f36f1a2 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -361,15 +361,10 @@ xlog_cil_committed(
 	int	abort)
 {
 	struct xfs_cil_ctx	*ctx = args;
-	struct xfs_log_vec	*lv;
-	int			abortflag = abort ? XFS_LI_ABORTED : 0;
 	struct xfs_busy_extent	*busyp, *n;
 
-	/* unpin all the log items */
-	for (lv = ctx->lv_chain; lv; lv = lv->lv_next ) {
-		xfs_trans_item_committed(lv->lv_item, ctx->start_lsn,
-							abortflag);
-	}
+	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
+					ctx->start_lsn, abort);
 
 	list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list)
 		xfs_alloc_busy_clear(ctx->cil->xc_log->l_mp, busyp);
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 966d3f9..baad94a 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2717,8 +2717,8 @@ xlog_recover_do_efi_trans(
 		xfs_efi_item_free(efip);
 		return error;
 	}
-	efip->efi_next_extent = efi_formatp->efi_nextents;
-	efip->efi_flags |= XFS_EFI_COMMITTED;
+	atomic_set(&efip->efi_next_extent, efi_formatp->efi_nextents);
+	clear_bit(XFS_EFI_COMMITTED, &efip->efi_flags);
 
 	spin_lock(&log->l_ailp->xa_lock);
 	/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index f6d956b..5180b18 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -1350,7 +1350,7 @@ xfs_trans_fill_vecs(
  * they could be immediately flushed and we'd have to race with the flusher
  * trying to pull the item from the AIL as we add it.
  */
-void
+static void
 xfs_trans_item_committed(
 	struct xfs_log_item	*lip,
 	xfs_lsn_t		commit_lsn,
@@ -1426,6 +1426,74 @@ xfs_trans_committed(
 }
 
 /*
+ * Bulk operation version of xfs_trans_committed that takes a log vector of
+ * items to insert into the AIL. This uses bulk AIL insertion techniques to
+ * minimise lock traffic.
+ */
+void
+xfs_trans_committed_bulk(
+	struct xfs_ail		*ailp,
+	struct xfs_log_vec	*log_vector,
+	xfs_lsn_t		commit_lsn,
+	int			aborted)
+{
+#define LGIA_SIZE	32
+	struct xfs_log_item	*lgia[LGIA_SIZE];
+	struct xfs_log_vec	*lv;
+	int			i = 0;
+
+	/* unpin all the log items */
+	for (lv = log_vector; lv; lv = lv->lv_next ) {
+		struct xfs_log_item	*lip = lv->lv_item;
+		xfs_lsn_t		item_lsn;
+
+		if (aborted)
+			lip->li_flags |= XFS_LI_ABORTED;
+		item_lsn = IOP_COMMITTED(lip, commit_lsn);
+
+		/* item_lsn of -1 means the item was freed */
+		if (XFS_LSN_CMP(item_lsn, (xfs_lsn_t)-1) == 0)
+			continue;
+
+		if (item_lsn != commit_lsn) {
+
+			/*
+			 * Not a bulk update option due to unusual item_lsn.
+			 * Push into AIL immediately, rechecking the lsn once
+			 * we have the ail lock. Then unpin the item.
+			 */
+			spin_lock(&ailp->xa_lock);
+			if (XFS_LSN_CMP(item_lsn, lip->li_lsn) > 0) {
+				xfs_trans_ail_update(ailp, lip, item_lsn);
+			} else {
+				spin_unlock(&ailp->xa_lock);
+			}
+			IOP_UNPIN(lip, 0);
+			continue;
+		}
+
+		/* Item is a candidate for bulk AIL insert.  */
+		lgia[i++] = lv->lv_item;
+		if (i >= LGIA_SIZE) {
+			xfs_trans_ail_update_bulk(ailp, lgia, LGIA_SIZE,
+							commit_lsn);
+			for (i = 0; i < LGIA_SIZE; i++)
+				IOP_UNPIN(lgia[i], 0);
+			i = 0;
+		}
+	}
+
+	/* make sure we insert the remainder! */
+	if (i) {
+		int j;
+
+		xfs_trans_ail_update_bulk(ailp, lgia, i, commit_lsn);
+		for (j = 0; j < i; j++)
+			IOP_UNPIN(lgia[j], 0);
+	}
+}
+
+/*
  * Called from the trans_commit code when we notice that
  * the filesystem is in the middle of a forced shutdown.
  */
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index dc90695..c83e6e9 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -29,7 +29,8 @@
 #include "xfs_error.h"
 
 STATIC void xfs_ail_insert(struct xfs_ail *, xfs_log_item_t *);
-STATIC xfs_log_item_t * xfs_ail_delete(struct xfs_ail *, xfs_log_item_t *);
+STATIC void xfs_ail_splice(struct xfs_ail *, struct list_head *, xfs_lsn_t);
+STATIC void xfs_ail_delete(struct xfs_ail *, xfs_log_item_t *);
 STATIC xfs_log_item_t * xfs_ail_min(struct xfs_ail *);
 STATIC xfs_log_item_t * xfs_ail_next(struct xfs_ail *, xfs_log_item_t *);
 
@@ -468,16 +469,13 @@ xfs_trans_ail_update(
 	xfs_log_item_t	*lip,
 	xfs_lsn_t	lsn) __releases(ailp->xa_lock)
 {
-	xfs_log_item_t		*dlip = NULL;
 	xfs_log_item_t		*mlip;	/* ptr to minimum lip */
 	xfs_lsn_t		tail_lsn;
 
 	mlip = xfs_ail_min(ailp);
 
 	if (lip->li_flags & XFS_LI_IN_AIL) {
-		dlip = xfs_ail_delete(ailp, lip);
-		ASSERT(dlip == lip);
-		xfs_trans_ail_cursor_clear(ailp, dlip);
+		xfs_ail_delete(ailp, lip);
 	} else {
 		lip->li_flags |= XFS_LI_IN_AIL;
 	}
@@ -485,7 +483,7 @@ xfs_trans_ail_update(
 	lip->li_lsn = lsn;
 	xfs_ail_insert(ailp, lip);
 
-	if (mlip == dlip) {
+	if (mlip == lip) {
 		mlip = xfs_ail_min(ailp);
 		/*
 		 * It is not safe to access mlip after the AIL lock is
@@ -505,6 +503,74 @@ xfs_trans_ail_update(
 }	/* xfs_trans_update_ail */
 
 /*
+ * Bulk update version of xfs_trans_ail_update.
+ *
+ * This version takes an array of log items that all need to be positioned at
+ * the same LSN in the AIL. This function takes the AIL lock once to execute
+ * the update operations on all the items in the array, and as such should not
+ * be called with the AIL lock held. As a result, once we have the AIL lock,
+ * we need to check each log item LSN to confirm it needs to be moved forward
+ * in the AIL.
+ *
+ * To optimise the insert operation, we delete all the items from the AIL in
+ * the first pass, moving them into a temporary list, then splice the temporary
+ * list into the correct position in the AIL. This avoids needing to do an
+ * insert operation on every item.
+ */
+void
+xfs_trans_ail_update_bulk(
+	struct xfs_ail		*ailp,
+	struct xfs_log_item	**lgia,
+	int			nr_items,
+	xfs_lsn_t		lsn)
+{
+	xfs_log_item_t		*mlip;
+	int			mlip_changed = 0;
+	int			i;
+	LIST_HEAD(tmp);
+
+	spin_lock(&ailp->xa_lock);
+	mlip = xfs_ail_min(ailp);
+
+	for (i = 0; i < nr_items; i++) {
+		struct xfs_log_item *lip = lgia[i];
+		if (lip->li_flags & XFS_LI_IN_AIL) {
+			/* check if we really need to move the item */
+			if (XFS_LSN_CMP(lsn, lip->li_lsn) <= 0)
+				continue;
+
+			xfs_ail_delete(ailp, lip);
+			if (mlip == lip)
+				mlip_changed = 1;
+		} else {
+			lip->li_flags |= XFS_LI_IN_AIL;
+		}
+		lip->li_lsn = lsn;
+		list_add(&lip->li_ail, &tmp);
+	}
+
+	xfs_ail_splice(ailp, &tmp, lsn);
+
+	if (mlip_changed) {
+		/*
+		 * It is not safe to access mlip after the AIL lock is
+		 * dropped, so we must get a copy of li_lsn before we do
+		 * so.  This is especially important on 32-bit platforms
+		 * where accessing and updating 64-bit values like li_lsn
+		 * is not atomic.
+		 */
+		xfs_lsn_t	tail_lsn;
+
+		mlip = xfs_ail_min(ailp);
+		tail_lsn = mlip->li_lsn;
+		spin_unlock(&ailp->xa_lock);
+		xfs_log_move_tail(ailp->xa_mount, tail_lsn);
+		return;
+	}
+	spin_unlock(&ailp->xa_lock);
+}
+
+/*
  * Delete the given item from the AIL.  It must already be in
  * the AIL.
  *
@@ -524,21 +590,18 @@ xfs_trans_ail_delete(
 	struct xfs_ail	*ailp,
 	xfs_log_item_t	*lip) __releases(ailp->xa_lock)
 {
-	xfs_log_item_t		*dlip;
 	xfs_log_item_t		*mlip;
 	xfs_lsn_t		tail_lsn;
 
 	if (lip->li_flags & XFS_LI_IN_AIL) {
 		mlip = xfs_ail_min(ailp);
-		dlip = xfs_ail_delete(ailp, lip);
-		ASSERT(dlip == lip);
-		xfs_trans_ail_cursor_clear(ailp, dlip);
+		xfs_ail_delete(ailp, lip);
 
 
 		lip->li_flags &= ~XFS_LI_IN_AIL;
 		lip->li_lsn = 0;
 
-		if (mlip == dlip) {
+		if (mlip == lip) {
 			mlip = xfs_ail_min(ailp);
 			/*
 			 * It is not safe to access mlip after the AIL lock
@@ -632,7 +695,6 @@ STATIC void
 xfs_ail_insert(
 	struct xfs_ail	*ailp,
 	xfs_log_item_t	*lip)
-/* ARGSUSED */
 {
 	xfs_log_item_t	*next_lip;
 
@@ -658,21 +720,45 @@ xfs_ail_insert(
 	return;
 }
 
+STATIC void
+xfs_ail_splice(
+	struct xfs_ail	*ailp,
+	struct list_head *list,
+	xfs_lsn_t	lsn)
+{
+	xfs_log_item_t	*next_lip;
+
+	/*
+	 * If the list is empty, just insert the item.
+	 */
+	if (list_empty(&ailp->xa_ail)) {
+		list_splice(list, &ailp->xa_ail);
+		return;
+	}
+
+	list_for_each_entry_reverse(next_lip, &ailp->xa_ail, li_ail) {
+		if (XFS_LSN_CMP(next_lip->li_lsn, lsn) <= 0)
+			break;
+	}
+
+	ASSERT((&next_lip->li_ail == &ailp->xa_ail) ||
+	       (XFS_LSN_CMP(next_lip->li_lsn, lsn) <= 0));
+
+	list_splice_init(list, &next_lip->li_ail);
+	return;
+}
+
 /*
  * Delete the given item from the AIL.  Return a pointer to the item.
  */
-/*ARGSUSED*/
-STATIC xfs_log_item_t *
+STATIC void
 xfs_ail_delete(
 	struct xfs_ail	*ailp,
 	xfs_log_item_t	*lip)
-/* ARGSUSED */
 {
 	xfs_ail_check(ailp, lip);
-
 	list_del(&lip->li_ail);
-
-	return lip;
+	xfs_trans_ail_cursor_clear(ailp, lip);
 }
 
 /*
@@ -682,7 +768,6 @@ xfs_ail_delete(
 STATIC xfs_log_item_t *
 xfs_ail_min(
 	struct xfs_ail	*ailp)
-/* ARGSUSED */
 {
 	if (list_empty(&ailp->xa_ail))
 		return NULL;
@@ -699,7 +784,6 @@ STATIC xfs_log_item_t *
 xfs_ail_next(
 	struct xfs_ail	*ailp,
 	xfs_log_item_t	*lip)
-/* ARGSUSED */
 {
 	if (lip->li_ail.next == &ailp->xa_ail)
 		return NULL;
diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
index f783d5e..143ff840 100644
--- a/fs/xfs/xfs_trans_extfree.c
+++ b/fs/xfs/xfs_trans_extfree.c
@@ -69,12 +69,12 @@ xfs_trans_log_efi_extent(xfs_trans_t		*tp,
 	tp->t_flags |= XFS_TRANS_DIRTY;
 	efip->efi_item.li_desc->lid_flags |= XFS_LID_DIRTY;
 
-	next_extent = efip->efi_next_extent;
+	next_extent = atomic_read(&efip->efi_next_extent);
 	ASSERT(next_extent < efip->efi_format.efi_nextents);
 	extp = &(efip->efi_format.efi_extents[next_extent]);
 	extp->ext_start = start_block;
 	extp->ext_len = ext_len;
-	efip->efi_next_extent++;
+	atomic_inc(&efip->efi_next_extent);
 }
 
 
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 62da86c..d25460f 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -22,15 +22,17 @@ struct xfs_log_item;
 struct xfs_log_item_desc;
 struct xfs_mount;
 struct xfs_trans;
+struct xfs_ail;
+struct xfs_log_vec;
 
 void	xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *);
 void	xfs_trans_del_item(struct xfs_log_item *);
 void	xfs_trans_free_items(struct xfs_trans *tp, xfs_lsn_t commit_lsn,
 				int flags);
-void	xfs_trans_item_committed(struct xfs_log_item *lip,
-				xfs_lsn_t commit_lsn, int aborted);
 void	xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp);
 
+void	xfs_trans_committed_bulk(struct xfs_ail *ailp, struct xfs_log_vec *lv,
+				xfs_lsn_t commit_lsn, int aborted);
 /*
  * AIL traversal cursor.
  *
@@ -76,6 +78,9 @@ struct xfs_ail {
 void			xfs_trans_ail_update(struct xfs_ail *ailp,
 					struct xfs_log_item *lip, xfs_lsn_t lsn)
 					__releases(ailp->xa_lock);
+void			xfs_trans_ail_update_bulk(struct xfs_ail *ailp,
+					struct xfs_log_item **lgia,
+					int nr_items, xfs_lsn_t lsn);
 void			xfs_trans_ail_delete(struct xfs_ail *ailp,
 					struct xfs_log_item *lip)
 					__releases(ailp->xa_lock);
-- 
1.7.2.3


* [PATCH 13/16] xfs: reduce the number of AIL push wakeups
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (11 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 12/16] xfs: bulk AIL insertion during transaction commit Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08 11:32   ` Christoph Hellwig
  2010-11-08  8:55 ` [PATCH 14/16] xfs: remove all the inodes on a buffer from the AIL in bulk Dave Chinner
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The xfsaild often tries to rest to wait for congestion to pass or for
IO to complete, but is regularly woken in tail-pushing situations.
In severe cases, the xfsaild is getting woken tens of thousands of
times a second. Reduce the number of needless wakeups by only waking
the xfsaild if the new target is larger than the old one. Further,
make short sleeps uninterruptible, as they occur when the xfsaild has
decided it needs to back off to allow some IO to complete and being
woken early is counter-productive.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_super.c |   18 +++++++++++++++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index fa789b7..668b010 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -837,8 +837,11 @@ xfsaild_wakeup(
 	struct xfs_ail		*ailp,
 	xfs_lsn_t		threshold_lsn)
 {
-	ailp->xa_target = threshold_lsn;
-	wake_up_process(ailp->xa_task);
+	/* only ever move the target forwards */
+	if (XFS_LSN_CMP(threshold_lsn, ailp->xa_target) > 0) {
+		ailp->xa_target = threshold_lsn;
+		wake_up_process(ailp->xa_task);
+	}
 }
 
 STATIC int
@@ -850,8 +853,17 @@ xfsaild(
 	long		tout = 0; /* milliseconds */
 
 	while (!kthread_should_stop()) {
-		schedule_timeout_interruptible(tout ?
+		/*
+		 * for short sleeps indicating congestion, don't allow us to
+		 * get woken early. Otherwise all we do is bang on the AIL lock
+		 * without making progress.
+		 */
+		if (tout && tout <= 20) {
+			schedule_timeout_uninterruptible(msecs_to_jiffies(tout));
+		} else {
+			schedule_timeout_interruptible(tout ?
 				msecs_to_jiffies(tout) : MAX_SCHEDULE_TIMEOUT);
+		}
 
 		/* swsusp */
 		try_to_freeze();
-- 
1.7.2.3


* [PATCH 14/16] xfs: remove all the inodes on a buffer from the AIL in bulk
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (12 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 13/16] xfs: reduce the number of AIL push wakeups Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08  8:55 ` [PATCH 15/16] xfs: only run xfs_error_test if error injection is active Dave Chinner
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When inode buffer IO completes, usually all of the inodes are removed from
the AIL. This involves processing them one at a time and taking the AIL lock
once for every inode. When all CPUs are processing inode IO completions, this
causes an excessive amount of contention on the AIL lock.

Instead, change the way we process inode IO completion in the buffer
IO done callback. Allow the inode IO done callback to walk the list
of IO done callbacks and pull all the inodes off the buffer in one
go and then process them as a batch.

Once all the inodes for removal are collected, take the AIL lock
once and do a bulk removal operation to minimise traffic on the AIL
lock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c   |   17 +++++++--
 fs/xfs/xfs_inode_item.c |   92 ++++++++++++++++++++++++++++++++++++++---------
 fs/xfs/xfs_trans_ail.c  |   65 +++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans_priv.h |    4 ++
 4 files changed, 158 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 2686d0d..46a7ef2 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -918,6 +918,18 @@ xfs_buf_attach_iodone(
 	XFS_BUF_SET_IODONE_FUNC(bp, xfs_buf_iodone_callbacks);
 }
 
+/*
+ * We can have many callbacks on a buffer. Running the callbacks individually
+ * can cause a lot of contention on the AIL lock, so we allow for a single
+ * callback to be able to scan the remaining lip->li_bio_list for other items
+ * of the same type and callback to be processed in the first call.
+ *
+ * As a result, the loop walking the callback list below will also modify the
+ * list. It removes the first item from the list and then runs the callback.
+ * The loop then restarts from the new head of the list. This allows the
+ * callback to scan and modify the list attached to the buffer and we don't
+ * have to care about maintaining a next item pointer.
+ */
 STATIC void
 xfs_buf_do_callbacks(
 	xfs_buf_t	*bp,
@@ -925,8 +937,8 @@ xfs_buf_do_callbacks(
 {
 	xfs_log_item_t	*nlip;
 
-	while (lip != NULL) {
-		nlip = lip->li_bio_list;
+	while ((lip = XFS_BUF_FSPRIVATE(bp, xfs_log_item_t *)) != NULL) {
+		XFS_BUF_SET_FSPRIVATE(bp, lip->li_bio_list);
 		ASSERT(lip->li_cb != NULL);
 		/*
 		 * Clear the next pointer so we don't have any
@@ -936,7 +948,6 @@ xfs_buf_do_callbacks(
 		 */
 		lip->li_bio_list = NULL;
 		lip->li_cb(bp, lip);
-		lip = nlip;
 	}
 }
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 3be7bdc..cb341ba 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -843,15 +843,64 @@ xfs_inode_item_destroy(
  * flushed to disk.  It is responsible for removing the inode item
  * from the AIL if it has not been re-logged, and unlocking the inode's
  * flush lock.
+ *
+ * To reduce AIL lock traffic as much as possible, we scan the buffer log item
+ * list for other inodes that will run this function. We remove them from the
+ * buffer list so we can process all the inode IO completions in one AIL lock
+ * traversal.
  */
 void
 xfs_iflush_done(
 	struct xfs_buf		*bp,
 	struct xfs_log_item	*lip)
 {
-	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
-	xfs_inode_t		*ip = iip->ili_inode;
+	struct xfs_inode_log_item *iip;
+	struct xfs_log_item	*blip;
+	struct xfs_log_item	*next;
+	struct xfs_log_item	*prev;
 	struct xfs_ail		*ailp = lip->li_ailp;
+	int			need_ail = 0;
+
+	/*
+	 * Scan the buffer IO completions for other inodes being completed and
+	 * attach them to the current inode log item.
+	 */
+	blip = XFS_BUF_FSPRIVATE(bp, xfs_log_item_t *);
+	prev = NULL;
+	while (blip != NULL) {
+		if (blip->li_cb != xfs_iflush_done) {
+			prev = blip;
+			blip = blip->li_bio_list;
+			continue;
+		}
+
+		/* remove from list */
+		next = blip->li_bio_list;
+		if (!prev) {
+			XFS_BUF_SET_FSPRIVATE(bp, next);
+		} else {
+			prev->li_bio_list = next;
+		}
+
+		/* add to current list */
+		blip->li_bio_list = lip->li_bio_list;
+		lip->li_bio_list = blip;
+
+		/*
+		 * while we have the item, do the unlocked check for needing
+		 * the AIL lock.
+		 */
+		iip = INODE_ITEM(blip);
+		if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn)
+			need_ail++;
+
+		blip = next;
+	}
+
+	/* make sure we capture the state of the initial inode. */
+	iip = INODE_ITEM(lip);
+	if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn)
+		need_ail++;
 
 	/*
 	 * We only want to pull the item from the AIL if it is
@@ -862,28 +911,37 @@ xfs_iflush_done(
 	 * the lock since it's cheaper, and then we recheck while
 	 * holding the lock before removing the inode from the AIL.
 	 */
-	if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) {
+	if (need_ail) {
+		struct xfs_log_item *lgia[need_ail];
+		int i = 0;
 		spin_lock(&ailp->xa_lock);
-		if (lip->li_lsn == iip->ili_flush_lsn) {
-			/* xfs_trans_ail_delete() drops the AIL lock. */
-			xfs_trans_ail_delete(ailp, lip);
-		} else {
-			spin_unlock(&ailp->xa_lock);
+		for (blip = lip; blip; blip = blip->li_bio_list) {
+			iip = INODE_ITEM(blip);
+			if (iip->ili_logged &&
+			    blip->li_lsn == iip->ili_flush_lsn) {
+				lgia[i++] = blip;
+			}
+			ASSERT(i <= need_ail);
 		}
+		/* xfs_trans_ail_delete_bulk() drops the AIL lock. */
+		xfs_trans_ail_delete_bulk(ailp, lgia, i);
 	}
 
-	iip->ili_logged = 0;
 
 	/*
-	 * Clear the ili_last_fields bits now that we know that the
-	 * data corresponding to them is safely on disk.
+	 * Clean up and unlock the flush lock now that we are done. We can clear the
+	 * ili_last_fields bits now that we know that the data corresponding to
+	 * them is safely on disk.
 	 */
-	iip->ili_last_fields = 0;
+	for (blip = lip; blip; blip = next) {
+		next = blip->li_bio_list;
+		blip->li_bio_list = NULL;
 
-	/*
-	 * Release the inode's flush lock since we're done with it.
-	 */
-	xfs_ifunlock(ip);
+		iip = INODE_ITEM(blip);
+		iip->ili_logged = 0;
+		iip->ili_last_fields = 0;
+		xfs_ifunlock(iip->ili_inode);
+	}
 }
 
 /*
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index c83e6e9..4261d75 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -634,6 +634,71 @@ xfs_trans_ail_delete(
 	}
 }
 
+/*
+ * Bulk update version of xfs_trans_ail_delete
+ *
+ * This version takes an array of log items that all need to be removed from the
+ * AIL. The caller is already holding the AIL lock, and has done all the checks
+ * necessary to ensure the items passed in via @lgia are ready for deletion.
+ *
+ * This function will not drop the AIL lock until all items are removed from
+ * the AIL to minimise the amount of lock traffic on the AIL. This does not
+ * greatly increase the AIL hold time, but does significantly reduce the amount
+ * of traffic on the lock, especially during IO completion.
+ */
+void
+xfs_trans_ail_delete_bulk(
+	struct xfs_ail		*ailp,
+	struct xfs_log_item	**lgia,
+	int			nr_items) __releases(ailp->xa_lock)
+{
+	xfs_log_item_t		*mlip;
+	int			mlip_changed = 0;
+	int			i;
+
+	mlip = xfs_ail_min(ailp);
+
+	for (i = 0; i < nr_items; i++) {
+		struct xfs_log_item *lip = lgia[i];
+		if (!(lip->li_flags & XFS_LI_IN_AIL)) {
+			struct xfs_mount	*mp = ailp->xa_mount;
+
+			spin_unlock(&ailp->xa_lock);
+			if (!XFS_FORCED_SHUTDOWN(mp)) {
+				xfs_cmn_err(XFS_PTAG_AILDELETE, CE_ALERT, mp,
+		"%s: attempting to delete a log item that is not in the AIL",
+						__func__);
+				xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+			}
+			return;
+		}
+
+		xfs_ail_delete(ailp, lip);
+		lip->li_flags &= ~XFS_LI_IN_AIL;
+		lip->li_lsn = 0;
+		if (mlip == lip)
+			mlip_changed = 1;
+	}
+
+	if (mlip_changed) {
+		/*
+		 * It is not safe to access mlip after the AIL lock is
+		 * dropped, so we must get a copy of li_lsn before we do
+		 * so.  This is especially important on 32-bit platforms
+		 * where accessing and updating 64-bit values like li_lsn
+		 * is not atomic. It is possible we've emptied the AIL here,
+		 * so if that is the case, return a LSN of 0.
+		 */
+		xfs_lsn_t	tail_lsn;
+
+		mlip = xfs_ail_min(ailp);
+		tail_lsn = mlip ? mlip->li_lsn : 0;
+		spin_unlock(&ailp->xa_lock);
+		xfs_log_move_tail(ailp->xa_mount, tail_lsn);
+		return;
+	}
+	spin_unlock(&ailp->xa_lock);
+}
 
 
 /*
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index d25460f..8110e6f 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -84,6 +84,10 @@ void			 xfs_trans_ail_update_bulk(struct xfs_ail *ailp,
 void			xfs_trans_ail_delete(struct xfs_ail *ailp,
 					struct xfs_log_item *lip)
 					__releases(ailp->xa_lock);
+void			xfs_trans_ail_delete_bulk(struct xfs_ail *ailp,
+					struct xfs_log_item **lgia,
+					int nr_items)
+					__releases(ailp->xa_lock);
 void			xfs_trans_ail_push(struct xfs_ail *, xfs_lsn_t);
 void			xfs_trans_unlocked_item(struct xfs_ail *,
 					xfs_log_item_t *);
-- 
1.7.2.3

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 15/16] xfs: only run xfs_error_test if error injection is active
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (13 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 14/16] xfs: remove all the inodes on a buffer from the AIL in bulk Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08 11:33   ` Christoph Hellwig
  2010-11-08  8:55 ` [PATCH 16/16] xfs: make xlog_space_left() independent of the grant lock Dave Chinner
  2010-11-08 14:17 ` [PATCH 00/16] xfs: current patch stack for 2.6.38 window Christoph Hellwig
  16 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Recent tests writing lots of small files showed the flusher thread
being CPU bound and taking a long time to do allocations on a debug
kernel. perf showed this as the prime reason:

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _________________

           224648.00 36.8% xfs_error_test              [kernel.kallsyms]
            86045.00 14.1% xfs_btree_check_sblock      [kernel.kallsyms]
            39778.00  6.5% prandom32                   [kernel.kallsyms]
            37436.00  6.1% xfs_btree_increment         [kernel.kallsyms]
            29278.00  4.8% xfs_btree_get_rec           [kernel.kallsyms]
            27717.00  4.5% random32                    [kernel.kallsyms]

Walking the btree blocks during allocation and checking them requires
that each block (a cache hit, so no I/O) call xfs_error_test(), which
then does a random32() call as the first operation. IOWs, ~50% of the
CPU is being consumed just testing whether we need to inject an
error, even though error injection is not active.

Kill this overhead when error injection is not active by adding a
global counter of active error traps and only calling into
xfs_error_test when fault injection is active.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_error.c |    3 +++
 fs/xfs/xfs_error.h |    5 +++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index ed99902..c78cc6a 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -58,6 +58,7 @@ xfs_error_trap(int e)
 int	xfs_etest[XFS_NUM_INJECT_ERROR];
 int64_t	xfs_etest_fsid[XFS_NUM_INJECT_ERROR];
 char *	xfs_etest_fsname[XFS_NUM_INJECT_ERROR];
+int	xfs_error_test_active;
 
 int
 xfs_error_test(int error_tag, int *fsidp, char *expression,
@@ -108,6 +109,7 @@ xfs_errortag_add(int error_tag, xfs_mount_t *mp)
 			len = strlen(mp->m_fsname);
 			xfs_etest_fsname[i] = kmem_alloc(len + 1, KM_SLEEP);
 			strcpy(xfs_etest_fsname[i], mp->m_fsname);
+			xfs_error_test_active++;
 			return 0;
 		}
 	}
@@ -137,6 +139,7 @@ xfs_errortag_clearall(xfs_mount_t *mp, int loud)
 			xfs_etest_fsid[i] = 0LL;
 			kmem_free(xfs_etest_fsname[i]);
 			xfs_etest_fsname[i] = NULL;
+			xfs_error_test_active--;
 		}
 	}
 
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index c2c1a07..f338847 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -127,13 +127,14 @@ extern void xfs_corruption_error(const char *tag, int level,
 #define	XFS_RANDOM_BMAPIFORMAT				XFS_RANDOM_DEFAULT
 
 #ifdef DEBUG
+extern int xfs_error_test_active;
 extern int xfs_error_test(int, int *, char *, int, char *, unsigned long);
 
 #define	XFS_NUM_INJECT_ERROR				10
 #define XFS_TEST_ERROR(expr, mp, tag, rf)		\
-	((expr) || \
+	((expr) || (xfs_error_test_active && \
 	 xfs_error_test((tag), (mp)->m_fixedfsid, "expr", __LINE__, __FILE__, \
-			(rf)))
+			(rf))))
 
 extern int xfs_errortag_add(int error_tag, xfs_mount_t *mp);
 extern int xfs_errortag_clearall(xfs_mount_t *mp, int loud);
-- 
1.7.2.3

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 16/16] xfs: make xlog_space_left() independent of the grant lock
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (14 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 15/16] xfs: only run xfs_error_test if error injection is active Dave Chinner
@ 2010-11-08  8:55 ` Dave Chinner
  2010-11-08 14:17 ` [PATCH 00/16] xfs: current patch stack for 2.6.38 window Christoph Hellwig
  16 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-08  8:55 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Convert the xlog_space_left() calculation to take the tail_lsn as a
parameter.  This allows the function to be called with fixed values
rather than sampling the tail_lsn during the call and hence
requiring it to be called under the log grant lock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

Header from folded patch 'xfs-log-ail-push-tail-unlocked':

xfs: make AIL tail pushing independent of the grant lock

Convert the xlog_grant_push_ail() calculation to take the tail_lsn
and the last_sync_lsn as parameters.  This allows the function to be
called with fixed values rather than sampling variables protected by
the grant lock.  This allows us to move the grant lock outside the
push function which immediately reduces unnecessary grant lock
traffic, but also allows use to split the function away from the
grant lock in future.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

Header from folded patch 'xfs-log-ticket-queue-list-head':

xfs: Convert the log space ticket queue to use list_heads

The current code uses a roll-your-own double linked list, so convert
it to a standard list_head structure and convert all the list
traversals to use list_for_each_entry(). We can also get rid of the
XLOG_TIC_IN_Q flag as we can use the list_empty() check to tell if
the ticket is in a list or not.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_trace.h |   36 +--
 fs/xfs/xfs_log.c             |  678 ++++++++++++++++++++++--------------------
 fs/xfs/xfs_log_priv.h        |   40 ++-
 fs/xfs/xfs_log_recover.c     |   23 +-
 4 files changed, 409 insertions(+), 368 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index acef2e9..1a029bd 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -766,12 +766,10 @@ DECLARE_EVENT_CLASS(xfs_loggrant_class,
 		__field(int, curr_res)
 		__field(int, unit_res)
 		__field(unsigned int, flags)
-		__field(void *, reserve_headq)
-		__field(void *, write_headq)
-		__field(int, grant_reserve_cycle)
-		__field(int, grant_reserve_bytes)
-		__field(int, grant_write_cycle)
-		__field(int, grant_write_bytes)
+		__field(void *, reserveq)
+		__field(void *, writeq)
+		__field(xfs_lsn_t, grant_reserve_lsn)
+		__field(xfs_lsn_t, grant_write_lsn)
 		__field(int, curr_cycle)
 		__field(int, curr_block)
 		__field(xfs_lsn_t, tail_lsn)
@@ -784,15 +782,15 @@ DECLARE_EVENT_CLASS(xfs_loggrant_class,
 		__entry->curr_res = tic->t_curr_res;
 		__entry->unit_res = tic->t_unit_res;
 		__entry->flags = tic->t_flags;
-		__entry->reserve_headq = log->l_reserve_headq;
-		__entry->write_headq = log->l_write_headq;
-		__entry->grant_reserve_cycle = log->l_grant_reserve_cycle;
-		__entry->grant_reserve_bytes = log->l_grant_reserve_bytes;
-		__entry->grant_write_cycle = log->l_grant_write_cycle;
-		__entry->grant_write_bytes = log->l_grant_write_bytes;
+		__entry->reserveq = log->l_reserveq.next;
+		__entry->writeq = log->l_writeq.next;
+		__entry->grant_reserve_lsn =
+				atomic64_read(&log->l_grant_reserve_lsn);
+		__entry->grant_write_lsn =
+				atomic64_read(&log->l_grant_write_lsn);
 		__entry->curr_cycle = log->l_curr_cycle;
 		__entry->curr_block = log->l_curr_block;
-		__entry->tail_lsn = log->l_tail_lsn;
+		__entry->tail_lsn = atomic64_read(&log->l_tail_lsn);
 	),
 	TP_printk("dev %d:%d type %s t_ocnt %u t_cnt %u t_curr_res %u "
 		  "t_unit_res %u t_flags %s reserve_headq 0x%p "
@@ -807,12 +805,12 @@ DECLARE_EVENT_CLASS(xfs_loggrant_class,
 		  __entry->curr_res,
 		  __entry->unit_res,
 		  __print_flags(__entry->flags, "|", XLOG_TIC_FLAGS),
-		  __entry->reserve_headq,
-		  __entry->write_headq,
-		  __entry->grant_reserve_cycle,
-		  __entry->grant_reserve_bytes,
-		  __entry->grant_write_cycle,
-		  __entry->grant_write_bytes,
+		  __entry->reserveq,
+		  __entry->writeq,
+		  CYCLE_LSN(__entry->grant_reserve_lsn),
+		  BLOCK_LSN(__entry->grant_reserve_lsn),
+		  CYCLE_LSN(__entry->grant_write_lsn),
+		  BLOCK_LSN(__entry->grant_write_lsn),
 		  __entry->curr_cycle,
 		  __entry->curr_block,
 		  CYCLE_LSN(__entry->tail_lsn),
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index cee4ab9..12c726b 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -47,7 +47,8 @@ STATIC xlog_t *  xlog_alloc_log(xfs_mount_t	*mp,
 				xfs_buftarg_t	*log_target,
 				xfs_daddr_t	blk_offset,
 				int		num_bblks);
-STATIC int	 xlog_space_left(xlog_t *log, int cycle, int bytes);
+STATIC int	 xlog_space_left(xfs_lsn_t tail_lsn, int log_size,
+				xfs_lsn_t marker);
 STATIC int	 xlog_sync(xlog_t *log, xlog_in_core_t *iclog);
 STATIC void	 xlog_dealloc_log(xlog_t *log);
 
@@ -70,8 +71,8 @@ STATIC void xlog_state_want_sync(xlog_t	*log, xlog_in_core_t *iclog);
 /* local functions to manipulate grant head */
 STATIC int  xlog_grant_log_space(xlog_t		*log,
 				 xlog_ticket_t	*xtic);
-STATIC void xlog_grant_push_ail(xfs_mount_t	*mp,
-				int		need_bytes);
+STATIC void xlog_grant_push_ail(struct log *log, xfs_lsn_t tail_lsn,
+				xfs_lsn_t last_sync_lsn, int need_bytes);
 STATIC void xlog_regrant_reserve_log_space(xlog_t	 *log,
 					   xlog_ticket_t *ticket);
 STATIC int xlog_regrant_write_log_space(xlog_t		*log,
@@ -81,7 +82,8 @@ STATIC void xlog_ungrant_log_space(xlog_t	 *log,
 
 #if defined(DEBUG)
 STATIC void	xlog_verify_dest_ptr(xlog_t *log, char *ptr);
-STATIC void	xlog_verify_grant_head(xlog_t *log, int equals);
+STATIC void	xlog_verify_grant_head(struct log *log, int equals);
+STATIC void	xlog_verify_grant_tail(struct log *log);
 STATIC void	xlog_verify_iclog(xlog_t *log, xlog_in_core_t *iclog,
 				  int count, boolean_t syncing);
 STATIC void	xlog_verify_tail_lsn(xlog_t *log, xlog_in_core_t *iclog,
@@ -89,90 +91,85 @@ STATIC void	xlog_verify_tail_lsn(xlog_t *log, xlog_in_core_t *iclog,
 #else
 #define xlog_verify_dest_ptr(a,b)
 #define xlog_verify_grant_head(a,b)
+#define xlog_verify_grant_tail(a)
 #define xlog_verify_iclog(a,b,c,d)
 #define xlog_verify_tail_lsn(a,b,c)
 #endif
 
 STATIC int	xlog_iclogs_empty(xlog_t *log);
 
-
-static void
-xlog_ins_ticketq(struct xlog_ticket **qp, struct xlog_ticket *tic)
-{
-	if (*qp) {
-		tic->t_next	    = (*qp);
-		tic->t_prev	    = (*qp)->t_prev;
-		(*qp)->t_prev->t_next = tic;
-		(*qp)->t_prev	    = tic;
-	} else {
-		tic->t_prev = tic->t_next = tic;
-		*qp = tic;
-	}
-
-	tic->t_flags |= XLOG_TIC_IN_Q;
-}
-
-static void
-xlog_del_ticketq(struct xlog_ticket **qp, struct xlog_ticket *tic)
+/*
+ * Grant space calculations use 64 bit atomic variables to store the current reserve
+ * and write grant markers. However, these are really two 32 bit numbers which
+ * need to be cracked out of the 64 bit variable, modified, recombined and then
+ * written back into the 64 bit atomic variable. And it has to be done
+ * atomically (i.e. without locks).
+ *
+ * The upper 32 bits is the log cycle, just like a xfs_lsn_t. The lower 32 bits
+ * is the byte offset into the log for the marker. Unlike the xfs_lsn_t, this
+ * is held in bytes rather than basic blocks, even though it uses the
+ * BLOCK_LSN() macro to extract it.
+ *
+ * Essentially, we use a compare and exchange algorithm to atomically update
+ * the markers. That is, we sample the current marker, crack it, perform the
+ * calculation, recombine it into a new value, and then conditionally set the
+ * value back into the atomic variable only if it hasn't changed since we first
+ * sampled it. This provides atomic updates of the marker, even though we do
+ * non-atomic, multi-step calculation on the value.
+ */
+static inline void
+xlog_grant_sub_space(
+	struct log	*log,
+	int		space,
+	atomic64_t	*val)
 {
-	if (tic == tic->t_next) {
-		*qp = NULL;
-	} else {
-		*qp = tic->t_next;
-		tic->t_next->t_prev = tic->t_prev;
-		tic->t_prev->t_next = tic->t_next;
-	}
+	xfs_lsn_t	last, old, new;
 
-	tic->t_next = tic->t_prev = NULL;
-	tic->t_flags &= ~XLOG_TIC_IN_Q;
-}
-
-static void
-xlog_grant_sub_space(struct log *log, int bytes)
-{
-	log->l_grant_write_bytes -= bytes;
-	if (log->l_grant_write_bytes < 0) {
-		log->l_grant_write_bytes += log->l_logsize;
-		log->l_grant_write_cycle--;
-	}
+	last = atomic64_read(val);
+	do {
+		int	cycle, bytes;
 
-	log->l_grant_reserve_bytes -= bytes;
-	if ((log)->l_grant_reserve_bytes < 0) {
-		log->l_grant_reserve_bytes += log->l_logsize;
-		log->l_grant_reserve_cycle--;
-	}
+		old = last;
+		cycle = CYCLE_LSN(old);
+		bytes = BLOCK_LSN(old);
 
+		bytes -= space;
+		if (bytes < 0) {
+			bytes += log->l_logsize;
+			cycle--;
+		}
+		new = xlog_assign_lsn(cycle, bytes);
+		last = atomic64_cmpxchg(val, old, new);
+	} while (last != old);
 }
 
 static void
-xlog_grant_add_space_write(struct log *log, int bytes)
+xlog_grant_add_space(
+	struct log	*log,
+	int		space,
+	atomic64_t	*val)
 {
-	int tmp = log->l_logsize - log->l_grant_write_bytes;
-	if (tmp > bytes)
-		log->l_grant_write_bytes += bytes;
-	else {
-		log->l_grant_write_cycle++;
-		log->l_grant_write_bytes = bytes - tmp;
-	}
-}
+	xfs_lsn_t	last, old, new;
 
-static void
-xlog_grant_add_space_reserve(struct log *log, int bytes)
-{
-	int tmp = log->l_logsize - log->l_grant_reserve_bytes;
-	if (tmp > bytes)
-		log->l_grant_reserve_bytes += bytes;
-	else {
-		log->l_grant_reserve_cycle++;
-		log->l_grant_reserve_bytes = bytes - tmp;
-	}
-}
+	last = atomic64_read(val);
+	do {
+		int	cycle, bytes, available;
+
+		old = last;
+		cycle = CYCLE_LSN(old);
+		bytes = BLOCK_LSN(old);
+		available = log->l_logsize - bytes;
+
+		if (available > space)
+			bytes += space;
+		else {
+			cycle++;
+			bytes = space - available;
+		}
 
-static inline void
-xlog_grant_add_space(struct log *log, int bytes)
-{
-	xlog_grant_add_space_write(log, bytes);
-	xlog_grant_add_space_reserve(log, bytes);
+		new = xlog_assign_lsn(cycle, bytes);
+		last = atomic64_cmpxchg(val, old, new);
+	} while (last != old);
 }
 
 static void
@@ -321,12 +318,12 @@ xfs_log_release_iclog(
 int
 xfs_log_reserve(
 	struct xfs_mount	*mp,
-	int		 	unit_bytes,
-	int		 	cnt,
+	int			unit_bytes,
+	int			cnt,
 	struct xlog_ticket	**ticket,
-	__uint8_t	 	client,
-	uint		 	flags,
-	uint		 	t_type)
+	__uint8_t		client,
+	uint			flags,
+	uint			t_type)
 {
 	struct log		*log = mp->m_log;
 	struct xlog_ticket	*internal_ticket;
@@ -339,7 +336,6 @@ xfs_log_reserve(
 
 	XFS_STATS_INC(xs_try_logspace);
 
-
 	if (*ticket != NULL) {
 		ASSERT(flags & XFS_LOG_PERM_RESERV);
 		internal_ticket = *ticket;
@@ -355,7 +351,9 @@ xfs_log_reserve(
 
 		trace_xfs_log_reserve(log, internal_ticket);
 
-		xlog_grant_push_ail(mp, internal_ticket->t_unit_res);
+		xlog_grant_push_ail(log, atomic64_read(&log->l_tail_lsn),
+				    atomic64_read(&log->l_last_sync_lsn),
+				    internal_ticket->t_unit_res);
 		retval = xlog_regrant_write_log_space(log, internal_ticket);
 	} else {
 		/* may sleep if need to allocate more tickets */
@@ -369,14 +367,15 @@ xfs_log_reserve(
 
 		trace_xfs_log_reserve(log, internal_ticket);
 
-		xlog_grant_push_ail(mp,
+		xlog_grant_push_ail(log, atomic64_read(&log->l_tail_lsn),
+				    atomic64_read(&log->l_last_sync_lsn),
 				    (internal_ticket->t_unit_res *
 				     internal_ticket->t_cnt));
 		retval = xlog_grant_log_space(log, internal_ticket);
 	}
 
 	return retval;
-}	/* xfs_log_reserve */
+}
 
 
 /*
@@ -699,73 +698,80 @@ xfs_log_write(
 
 void
 xfs_log_move_tail(xfs_mount_t	*mp,
-		  xfs_lsn_t	tail_lsn)
+		  xfs_lsn_t	new_tail_lsn)
 {
 	xlog_ticket_t	*tic;
 	xlog_t		*log = mp->m_log;
-	int		need_bytes, free_bytes, cycle, bytes;
+	int		need_bytes, free_bytes;
 
 	if (XLOG_FORCED_SHUTDOWN(log))
 		return;
 
-	if (tail_lsn == 0) {
-		/* needed since sync_lsn is 64 bits */
-		spin_lock(&log->l_icloglock);
-		tail_lsn = log->l_last_sync_lsn;
-		spin_unlock(&log->l_icloglock);
-	}
-
-	spin_lock(&log->l_grant_lock);
-
-	/* Also an invalid lsn.  1 implies that we aren't passing in a valid
-	 * tail_lsn.
+	/*
+	 * new_tail_lsn == 1 implies that we aren't passing in a valid
+	 * tail_lsn, so don't set the tail.
 	 */
-	if (tail_lsn != 1) {
-		log->l_tail_lsn = tail_lsn;
+	switch (new_tail_lsn) {
+	case 0:
+		/* AIL is empty, so tail is what was last written to disk */
+		atomic64_set(&log->l_tail_lsn,
+				atomic64_read(&log->l_last_sync_lsn));
+		break;
+	case 1:
+		/* Current tail is unknown, so just use the existing one */
+		break;
+	default:
+		/* update the tail with the new lsn. */
+		atomic64_set(&log->l_tail_lsn, new_tail_lsn);
+		break;
 	}
 
-	if ((tic = log->l_write_headq)) {
+	if (!list_empty(&log->l_writeq)) {
 #ifdef DEBUG
 		if (log->l_flags & XLOG_ACTIVE_RECOVERY)
 			panic("Recovery problem");
 #endif
-		cycle = log->l_grant_write_cycle;
-		bytes = log->l_grant_write_bytes;
-		free_bytes = xlog_space_left(log, cycle, bytes);
-		do {
+		spin_lock(&log->l_grant_write_lock);
+		free_bytes = xlog_space_left(atomic64_read(&log->l_tail_lsn),
+				log->l_logsize,
+				atomic64_read(&log->l_grant_write_lsn));
+
+		list_for_each_entry(tic, &log->l_writeq, t_queue) {
 			ASSERT(tic->t_flags & XLOG_TIC_PERM_RESERV);
 
-			if (free_bytes < tic->t_unit_res && tail_lsn != 1)
+			if (free_bytes < tic->t_unit_res && new_tail_lsn != 1)
 				break;
-			tail_lsn = 0;
+			new_tail_lsn = 0;
 			free_bytes -= tic->t_unit_res;
 			sv_signal(&tic->t_wait);
-			tic = tic->t_next;
-		} while (tic != log->l_write_headq);
+		}
+		spin_unlock(&log->l_grant_write_lock);
 	}
-	if ((tic = log->l_reserve_headq)) {
+
+	if (!list_empty(&log->l_reserveq)) {
 #ifdef DEBUG
 		if (log->l_flags & XLOG_ACTIVE_RECOVERY)
 			panic("Recovery problem");
 #endif
-		cycle = log->l_grant_reserve_cycle;
-		bytes = log->l_grant_reserve_bytes;
-		free_bytes = xlog_space_left(log, cycle, bytes);
-		do {
+		spin_lock(&log->l_grant_reserve_lock);
+		free_bytes = xlog_space_left(atomic64_read(&log->l_tail_lsn),
+				log->l_logsize,
+				atomic64_read(&log->l_grant_reserve_lsn));
+
+		list_for_each_entry(tic, &log->l_reserveq, t_queue) {
 			if (tic->t_flags & XLOG_TIC_PERM_RESERV)
 				need_bytes = tic->t_unit_res*tic->t_cnt;
 			else
 				need_bytes = tic->t_unit_res;
-			if (free_bytes < need_bytes && tail_lsn != 1)
+			if (free_bytes < need_bytes && new_tail_lsn != 1)
 				break;
-			tail_lsn = 0;
+			new_tail_lsn = 0;
 			free_bytes -= need_bytes;
 			sv_signal(&tic->t_wait);
-			tic = tic->t_next;
-		} while (tic != log->l_reserve_headq);
+		}
+		spin_unlock(&log->l_grant_reserve_lock);
 	}
-	spin_unlock(&log->l_grant_lock);
-}	/* xfs_log_move_tail */
+}
 
 /*
  * Determine if we have a transaction that has gone to disk
@@ -837,16 +843,13 @@ xlog_assign_tail_lsn(xfs_mount_t *mp)
 	xlog_t	  *log = mp->m_log;
 
 	tail_lsn = xfs_trans_ail_tail(mp->m_ail);
-	spin_lock(&log->l_grant_lock);
-	if (tail_lsn != 0) {
-		log->l_tail_lsn = tail_lsn;
-	} else {
-		tail_lsn = log->l_tail_lsn = log->l_last_sync_lsn;
+	if (tail_lsn) {
+		atomic64_set(&log->l_tail_lsn, tail_lsn);
+		return tail_lsn;
 	}
-	spin_unlock(&log->l_grant_lock);
-
-	return tail_lsn;
-}	/* xlog_assign_tail_lsn */
+	atomic64_set(&log->l_tail_lsn, atomic64_read(&log->l_last_sync_lsn));
+	return atomic64_read(&log->l_tail_lsn);
+}
 
 
 /*
@@ -864,16 +867,21 @@ xlog_assign_tail_lsn(xfs_mount_t *mp)
  * result is that we return the size of the log as the amount of space left.
  */
 STATIC int
-xlog_space_left(xlog_t *log, int cycle, int bytes)
+xlog_space_left(
+	xfs_lsn_t	tail_lsn,
+	int		log_size,
+	xfs_lsn_t	head)
 {
 	int free_bytes;
-	int tail_bytes;
-	int tail_cycle;
+	int tail_bytes = BBTOB(BLOCK_LSN(tail_lsn));
+	int tail_cycle = CYCLE_LSN(tail_lsn);
+	int cycle = CYCLE_LSN(head);
+	int bytes = BLOCK_LSN(head);
 
-	tail_bytes = BBTOB(BLOCK_LSN(log->l_tail_lsn));
-	tail_cycle = CYCLE_LSN(log->l_tail_lsn);
+	tail_bytes = BBTOB(BLOCK_LSN(tail_lsn));
+	tail_cycle = CYCLE_LSN(tail_lsn);
 	if ((tail_cycle == cycle) && (bytes >= tail_bytes)) {
-		free_bytes = log->l_logsize - (bytes - tail_bytes);
+		free_bytes = log_size - (bytes - tail_bytes);
 	} else if ((tail_cycle + 1) < cycle) {
 		return 0;
 	} else if (tail_cycle < cycle) {
@@ -885,13 +893,13 @@ xlog_space_left(xlog_t *log, int cycle, int bytes)
 		 * In this case we just want to return the size of the
 		 * log as the amount of space left.
 		 */
-		xfs_fs_cmn_err(CE_ALERT, log->l_mp,
+		cmn_err(CE_ALERT,
 			"xlog_space_left: head behind tail\n"
 			"  tail_cycle = %d, tail_bytes = %d\n"
 			"  GH   cycle = %d, GH   bytes = %d",
 			tail_cycle, tail_bytes, cycle, bytes);
 		ASSERT(0);
-		free_bytes = log->l_logsize;
+		free_bytes = log_size;
 	}
 	return free_bytes;
 }	/* xlog_space_left */
@@ -1047,12 +1055,17 @@ xlog_alloc_log(xfs_mount_t	*mp,
 	log->l_flags	   |= XLOG_ACTIVE_RECOVERY;
 
 	log->l_prev_block  = -1;
-	log->l_tail_lsn	   = xlog_assign_lsn(1, 0);
 	/* log->l_tail_lsn = 0x100000000LL; cycle = 1; current block = 0 */
-	log->l_last_sync_lsn = log->l_tail_lsn;
 	log->l_curr_cycle  = 1;	    /* 0 is bad since this is initial value */
-	log->l_grant_reserve_cycle = 1;
-	log->l_grant_write_cycle = 1;
+	atomic64_set(&log->l_tail_lsn, xlog_assign_lsn(log->l_curr_cycle, 0));
+	atomic64_set(&log->l_last_sync_lsn, atomic64_read(&log->l_tail_lsn));
+	atomic64_set(&log->l_grant_reserve_lsn, atomic64_read(&log->l_tail_lsn));
+	atomic64_set(&log->l_grant_write_lsn, atomic64_read(&log->l_tail_lsn));
+
+	spin_lock_init(&log->l_grant_reserve_lock);
+	INIT_LIST_HEAD(&log->l_reserveq);
+	spin_lock_init(&log->l_grant_write_lock);
+	INIT_LIST_HEAD(&log->l_writeq);
 
 	error = EFSCORRUPTED;
 	if (xfs_sb_version_hassector(&mp->m_sb)) {
@@ -1094,7 +1107,6 @@ xlog_alloc_log(xfs_mount_t	*mp,
 	log->l_xbuf = bp;
 
 	spin_lock_init(&log->l_icloglock);
-	spin_lock_init(&log->l_grant_lock);
 	sv_init(&log->l_flush_wait, 0, "flush_wait");
 
 	/* log record size must be multiple of BBSIZE; see xlog_rec_header_t */
@@ -1175,7 +1187,6 @@ out_free_iclog:
 		kmem_free(iclog);
 	}
 	spinlock_destroy(&log->l_icloglock);
-	spinlock_destroy(&log->l_grant_lock);
 	xfs_buf_free(log->l_xbuf);
 out_free_log:
 	kmem_free(log);
@@ -1223,11 +1234,12 @@ xlog_commit_record(
  * water mark.  In this manner, we would be creating a low water mark.
  */
 STATIC void
-xlog_grant_push_ail(xfs_mount_t	*mp,
-		    int		need_bytes)
+xlog_grant_push_ail(
+	struct log	*log,
+	xfs_lsn_t	tail_lsn,
+	xfs_lsn_t	last_sync_lsn,
+	int		need_bytes)
 {
-    xlog_t	*log = mp->m_log;	/* pointer to the log */
-    xfs_lsn_t	tail_lsn;		/* lsn of the log tail */
     xfs_lsn_t	threshold_lsn = 0;	/* lsn we'd like to be at */
     int		free_blocks;		/* free blocks left to write to */
     int		free_bytes;		/* free bytes left to write to */
@@ -1237,11 +1249,8 @@ xlog_grant_push_ail(xfs_mount_t	*mp,
 
     ASSERT(BTOBB(need_bytes) < log->l_logBBsize);
 
-    spin_lock(&log->l_grant_lock);
-    free_bytes = xlog_space_left(log,
-				 log->l_grant_reserve_cycle,
-				 log->l_grant_reserve_bytes);
-    tail_lsn = log->l_tail_lsn;
+    free_bytes = xlog_space_left(tail_lsn, log->l_logsize,
+				atomic64_read(&log->l_grant_reserve_lsn));
     free_blocks = BTOBBT(free_bytes);
 
     /*
@@ -1264,10 +1273,9 @@ xlog_grant_push_ail(xfs_mount_t	*mp,
 	/* Don't pass in an lsn greater than the lsn of the last
 	 * log record known to be on disk.
 	 */
-	if (XFS_LSN_CMP(threshold_lsn, log->l_last_sync_lsn) > 0)
-	    threshold_lsn = log->l_last_sync_lsn;
+	if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0)
+	    threshold_lsn = last_sync_lsn;
     }
-    spin_unlock(&log->l_grant_lock);
 
     /*
      * Get the transaction layer to kick the dirty buffers out to
@@ -1277,7 +1285,7 @@ xlog_grant_push_ail(xfs_mount_t	*mp,
     if (threshold_lsn &&
 	!XLOG_FORCED_SHUTDOWN(log))
 	    xfs_trans_ail_push(log->l_ailp, threshold_lsn);
-}	/* xlog_grant_push_ail */
+}
 
 /*
  * The bdstrat callback function for log bufs. This gives us a central
@@ -1365,19 +1373,17 @@ xlog_sync(xlog_t		*log,
 	}
 	roundoff = count - count_init;
 	ASSERT(roundoff >= 0);
-	ASSERT((v2 && log->l_mp->m_sb.sb_logsunit > 1 && 
-                roundoff < log->l_mp->m_sb.sb_logsunit)
-		|| 
-		(log->l_mp->m_sb.sb_logsunit <= 1 && 
+	ASSERT((v2 && log->l_mp->m_sb.sb_logsunit > 1 &&
+                roundoff < log->l_mp->m_sb.sb_logsunit) ||
+		(log->l_mp->m_sb.sb_logsunit <= 1 &&
 		 roundoff < BBTOB(1)));
 
 	/* move grant heads by roundoff in sync */
-	spin_lock(&log->l_grant_lock);
-	xlog_grant_add_space(log, roundoff);
-	spin_unlock(&log->l_grant_lock);
+	xlog_grant_add_space(log, roundoff, &log->l_grant_reserve_lsn);
+	xlog_grant_add_space(log, roundoff, &log->l_grant_write_lsn);
 
 	/* put cycle number in every block */
-	xlog_pack_data(log, iclog, roundoff); 
+	xlog_pack_data(log, iclog, roundoff);
 
 	/* real byte length */
 	if (v2) {
@@ -1497,7 +1503,6 @@ xlog_dealloc_log(xlog_t *log)
 		iclog = next_iclog;
 	}
 	spinlock_destroy(&log->l_icloglock);
-	spinlock_destroy(&log->l_grant_lock);
 
 	xfs_buf_free(log->l_xbuf);
 	log->l_mp->m_log = NULL;
@@ -2240,19 +2245,14 @@ xlog_state_do_callback(
 
 				iclog->ic_state = XLOG_STATE_CALLBACK;
 
-				spin_unlock(&log->l_icloglock);
-
-				/* l_last_sync_lsn field protected by
-				 * l_grant_lock. Don't worry about iclog's lsn.
-				 * No one else can be here except us.
-				 */
-				spin_lock(&log->l_grant_lock);
-				ASSERT(XFS_LSN_CMP(log->l_last_sync_lsn,
+				ASSERT(XFS_LSN_CMP(
+				       atomic64_read(&log->l_last_sync_lsn),
 				       be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
-				log->l_last_sync_lsn =
-					be64_to_cpu(iclog->ic_header.h_lsn);
-				spin_unlock(&log->l_grant_lock);
 
+				atomic64_set(&log->l_last_sync_lsn,
+					be64_to_cpu(iclog->ic_header.h_lsn));
+
+				spin_unlock(&log->l_icloglock);
 			} else {
 				spin_unlock(&log->l_icloglock);
 				ioerrors++;
@@ -2527,6 +2527,18 @@ restart:
  *
  * Once a ticket gets put onto the reserveq, it will only return after
  * the needed reservation is satisfied.
+ *
+ * This function is structured so that it has a lock free fast path. This is
+ * necessary because every new transaction reservation will come through this
+ * path. Hence any lock will be globally hot if we take it unconditionally on
+ * every pass.
+ *
+ * As tickets are only ever moved on and off the reserveq under the
+ * l_grant_reserve_lock, we only need to take that lock if we are going
+ * to add the ticket to the queue and sleep. We can avoid taking the lock if the
+ * ticket was never added to the reserveq because the t_queue list head will be
+ * empty and we hold the only reference to it so it can safely be checked
+ * unlocked.
  */
 STATIC int
 xlog_grant_log_space(xlog_t	   *log,
@@ -2534,24 +2546,27 @@ xlog_grant_log_space(xlog_t	   *log,
 {
 	int		 free_bytes;
 	int		 need_bytes;
-#ifdef DEBUG
-	xfs_lsn_t	 tail_lsn;
-#endif
-
 
 #ifdef DEBUG
 	if (log->l_flags & XLOG_ACTIVE_RECOVERY)
 		panic("grant Recovery problem");
 #endif
 
-	/* Is there space or do we need to sleep? */
-	spin_lock(&log->l_grant_lock);
-
 	trace_xfs_log_grant_enter(log, tic);
 
+	need_bytes = tic->t_unit_res;
+	if (tic->t_flags & XFS_LOG_PERM_RESERV)
+		need_bytes *= tic->t_ocnt;
+
 	/* something is already sleeping; insert new transaction at end */
-	if (log->l_reserve_headq) {
-		xlog_ins_ticketq(&log->l_reserve_headq, tic);
+	if (!list_empty(&log->l_reserveq)) {
+		spin_lock(&log->l_grant_reserve_lock);
+		if (list_empty(&log->l_reserveq)) {
+			spin_unlock(&log->l_grant_reserve_lock);
+			goto redo;
+		}
+
+		list_add_tail(&tic->t_queue, &log->l_reserveq);
 
 		trace_xfs_log_grant_sleep1(log, tic);
 
@@ -2563,71 +2578,64 @@ xlog_grant_log_space(xlog_t	   *log,
 			goto error_return;
 
 		XFS_STATS_INC(xs_sleep_logspace);
-		sv_wait(&tic->t_wait, PINOD|PLTWAIT, &log->l_grant_lock, s);
+		sv_wait(&tic->t_wait, PINOD|PLTWAIT,
+			&log->l_grant_reserve_lock, s);
 		/*
 		 * If we got an error, and the filesystem is shutting down,
 		 * we'll catch it down below. So just continue...
 		 */
 		trace_xfs_log_grant_wake1(log, tic);
-		spin_lock(&log->l_grant_lock);
 	}
-	if (tic->t_flags & XFS_LOG_PERM_RESERV)
-		need_bytes = tic->t_unit_res*tic->t_ocnt;
-	else
-		need_bytes = tic->t_unit_res;
 
 redo:
-	if (XLOG_FORCED_SHUTDOWN(log))
+	if (XLOG_FORCED_SHUTDOWN(log)) {
+		spin_lock(&log->l_grant_reserve_lock);
 		goto error_return;
+	}
 
-	free_bytes = xlog_space_left(log, log->l_grant_reserve_cycle,
-				     log->l_grant_reserve_bytes);
+	free_bytes = xlog_space_left(atomic64_read(&log->l_tail_lsn),
+				log->l_logsize,
+				atomic64_read(&log->l_grant_reserve_lsn));
 	if (free_bytes < need_bytes) {
-		if ((tic->t_flags & XLOG_TIC_IN_Q) == 0)
-			xlog_ins_ticketq(&log->l_reserve_headq, tic);
+		spin_lock(&log->l_grant_reserve_lock);
+		if (list_empty(&tic->t_queue))
+			list_add_tail(&tic->t_queue, &log->l_reserveq);
 
-		trace_xfs_log_grant_sleep2(log, tic);
-
-		spin_unlock(&log->l_grant_lock);
-		xlog_grant_push_ail(log->l_mp, need_bytes);
-		spin_lock(&log->l_grant_lock);
+		xlog_grant_push_ail(log, atomic64_read(&log->l_tail_lsn),
+				    atomic64_read(&log->l_last_sync_lsn),
+				    need_bytes);
 
-		XFS_STATS_INC(xs_sleep_logspace);
-		sv_wait(&tic->t_wait, PINOD|PLTWAIT, &log->l_grant_lock, s);
+		trace_xfs_log_grant_sleep2(log, tic);
 
-		spin_lock(&log->l_grant_lock);
 		if (XLOG_FORCED_SHUTDOWN(log))
 			goto error_return;
 
+		XFS_STATS_INC(xs_sleep_logspace);
+		sv_wait(&tic->t_wait, PINOD|PLTWAIT,
+			&log->l_grant_reserve_lock, s);
+
 		trace_xfs_log_grant_wake2(log, tic);
 
 		goto redo;
-	} else if (tic->t_flags & XLOG_TIC_IN_Q)
-		xlog_del_ticketq(&log->l_reserve_headq, tic);
+	}
 
 	/* we've got enough space */
-	xlog_grant_add_space(log, need_bytes);
-#ifdef DEBUG
-	tail_lsn = log->l_tail_lsn;
-	/*
-	 * Check to make sure the grant write head didn't just over lap the
-	 * tail.  If the cycles are the same, we can't be overlapping.
-	 * Otherwise, make sure that the cycles differ by exactly one and
-	 * check the byte count.
-	 */
-	if (CYCLE_LSN(tail_lsn) != log->l_grant_write_cycle) {
-		ASSERT(log->l_grant_write_cycle-1 == CYCLE_LSN(tail_lsn));
-		ASSERT(log->l_grant_write_bytes <= BBTOB(BLOCK_LSN(tail_lsn)));
+	if (!list_empty(&tic->t_queue)) {
+		spin_lock(&log->l_grant_reserve_lock);
+		list_del_init(&tic->t_queue);
+		spin_unlock(&log->l_grant_reserve_lock);
 	}
-#endif
+	xlog_grant_add_space(log, need_bytes, &log->l_grant_reserve_lsn);
+	xlog_grant_add_space(log, need_bytes, &log->l_grant_write_lsn);
+
 	trace_xfs_log_grant_exit(log, tic);
+	xlog_verify_grant_tail(log);
 	xlog_verify_grant_head(log, 1);
-	spin_unlock(&log->l_grant_lock);
 	return 0;
 
  error_return:
-	if (tic->t_flags & XLOG_TIC_IN_Q)
-		xlog_del_ticketq(&log->l_reserve_headq, tic);
+	list_del_init(&tic->t_queue);
+	spin_unlock(&log->l_grant_reserve_lock);
 
 	trace_xfs_log_grant_error(log, tic);
 
@@ -2638,25 +2646,23 @@ redo:
 	 */
 	tic->t_curr_res = 0;
 	tic->t_cnt = 0; /* ungrant will give back unit_res * t_cnt. */
-	spin_unlock(&log->l_grant_lock);
 	return XFS_ERROR(EIO);
-}	/* xlog_grant_log_space */
+}
 
 
 /*
  * Replenish the byte reservation required by moving the grant write head.
  *
- *
+ * Regranting log space is not a particularly hot path, so no real effort has
+ * been made to make the fast path lock free. If contention on the
+ * l_grant_write_lock becomes evident, it should be easy to apply the same
+ * modifications made to xlog_grant_log_space to this function.
  */
 STATIC int
 xlog_regrant_write_log_space(xlog_t	   *log,
 			     xlog_ticket_t *tic)
 {
 	int		free_bytes, need_bytes;
-	xlog_ticket_t	*ntic;
-#ifdef DEBUG
-	xfs_lsn_t	tail_lsn;
-#endif
 
 	tic->t_curr_res = tic->t_unit_res;
 	xlog_tic_reset_res(tic);
@@ -2669,10 +2675,9 @@ xlog_regrant_write_log_space(xlog_t	   *log,
 		panic("regrant Recovery problem");
 #endif
 
-	spin_lock(&log->l_grant_lock);
-
 	trace_xfs_log_regrant_write_enter(log, tic);
 
+	spin_lock(&log->l_grant_write_lock);
 	if (XLOG_FORCED_SHUTDOWN(log))
 		goto error_return;
 
@@ -2683,36 +2688,43 @@ xlog_regrant_write_log_space(xlog_t	   *log,
 	 * this transaction.
 	 */
 	need_bytes = tic->t_unit_res;
-	if ((ntic = log->l_write_headq)) {
-		free_bytes = xlog_space_left(log, log->l_grant_write_cycle,
-					     log->l_grant_write_bytes);
-		do {
+	if (!list_empty(&log->l_writeq)) {
+		struct xlog_ticket *ntic;
+		free_bytes = xlog_space_left(atomic64_read(&log->l_tail_lsn),
+				log->l_logsize,
+				atomic64_read(&log->l_grant_write_lsn));
+		list_for_each_entry(ntic, &log->l_writeq, t_queue) {
 			ASSERT(ntic->t_flags & XLOG_TIC_PERM_RESERV);
 
 			if (free_bytes < ntic->t_unit_res)
 				break;
 			free_bytes -= ntic->t_unit_res;
 			sv_signal(&ntic->t_wait);
-			ntic = ntic->t_next;
-		} while (ntic != log->l_write_headq);
+		}
 
-		if (ntic != log->l_write_headq) {
-			if ((tic->t_flags & XLOG_TIC_IN_Q) == 0)
-				xlog_ins_ticketq(&log->l_write_headq, tic);
+		if (ntic != list_first_entry(&log->l_writeq,
+						struct xlog_ticket, t_queue)) {
+			if (list_empty(&tic->t_queue))
+				list_add_tail(&tic->t_queue, &log->l_writeq);
 
 			trace_xfs_log_regrant_write_sleep1(log, tic);
 
-			spin_unlock(&log->l_grant_lock);
-			xlog_grant_push_ail(log->l_mp, need_bytes);
-			spin_lock(&log->l_grant_lock);
+			spin_unlock(&log->l_grant_write_lock);
+
+			xlog_grant_push_ail(log,
+					atomic64_read(&log->l_tail_lsn),
+					atomic64_read(&log->l_last_sync_lsn),
+					need_bytes);
+
+			spin_lock(&log->l_grant_write_lock);
 
 			XFS_STATS_INC(xs_sleep_logspace);
 			sv_wait(&tic->t_wait, PINOD|PLTWAIT,
-				&log->l_grant_lock, s);
+				&log->l_grant_write_lock, s);
 
 			/* If we're shutting down, this tic is already
 			 * off the queue */
-			spin_lock(&log->l_grant_lock);
+			spin_lock(&log->l_grant_write_lock);
 			if (XLOG_FORCED_SHUTDOWN(log))
 				goto error_return;
 
@@ -2724,50 +2736,48 @@ redo:
 	if (XLOG_FORCED_SHUTDOWN(log))
 		goto error_return;
 
-	free_bytes = xlog_space_left(log, log->l_grant_write_cycle,
-				     log->l_grant_write_bytes);
+	free_bytes = xlog_space_left(atomic64_read(&log->l_tail_lsn),
+				log->l_logsize,
+				atomic64_read(&log->l_grant_write_lsn));
 	if (free_bytes < need_bytes) {
-		if ((tic->t_flags & XLOG_TIC_IN_Q) == 0)
-			xlog_ins_ticketq(&log->l_write_headq, tic);
-		spin_unlock(&log->l_grant_lock);
-		xlog_grant_push_ail(log->l_mp, need_bytes);
-		spin_lock(&log->l_grant_lock);
+		if (list_empty(&tic->t_queue))
+			list_add_tail(&tic->t_queue, &log->l_writeq);
 
+		spin_unlock(&log->l_grant_write_lock);
+
+		xlog_grant_push_ail(log, atomic64_read(&log->l_tail_lsn),
+					atomic64_read(&log->l_last_sync_lsn),
+					need_bytes);
+
+		spin_lock(&log->l_grant_write_lock);
 		XFS_STATS_INC(xs_sleep_logspace);
 		trace_xfs_log_regrant_write_sleep2(log, tic);
-
-		sv_wait(&tic->t_wait, PINOD|PLTWAIT, &log->l_grant_lock, s);
+		sv_wait(&tic->t_wait, PINOD|PLTWAIT,
+			&log->l_grant_write_lock, s);
 
 		/* If we're shutting down, this tic is already off the queue */
-		spin_lock(&log->l_grant_lock);
+		spin_lock(&log->l_grant_write_lock);
 		if (XLOG_FORCED_SHUTDOWN(log))
 			goto error_return;
 
 		trace_xfs_log_regrant_write_wake2(log, tic);
 		goto redo;
-	} else if (tic->t_flags & XLOG_TIC_IN_Q)
-		xlog_del_ticketq(&log->l_write_headq, tic);
+	}
 
 	/* we've got enough space */
-	xlog_grant_add_space_write(log, need_bytes);
-#ifdef DEBUG
-	tail_lsn = log->l_tail_lsn;
-	if (CYCLE_LSN(tail_lsn) != log->l_grant_write_cycle) {
-		ASSERT(log->l_grant_write_cycle-1 == CYCLE_LSN(tail_lsn));
-		ASSERT(log->l_grant_write_bytes <= BBTOB(BLOCK_LSN(tail_lsn)));
-	}
-#endif
+	list_del_init(&tic->t_queue);
+	spin_unlock(&log->l_grant_write_lock);
+	xlog_grant_add_space(log, need_bytes, &log->l_grant_write_lsn);
 
 	trace_xfs_log_regrant_write_exit(log, tic);
-
+	xlog_verify_grant_tail(log);
 	xlog_verify_grant_head(log, 1);
-	spin_unlock(&log->l_grant_lock);
 	return 0;
 
 
  error_return:
-	if (tic->t_flags & XLOG_TIC_IN_Q)
-		xlog_del_ticketq(&log->l_reserve_headq, tic);
+	list_del_init(&tic->t_queue);
+	spin_unlock(&log->l_grant_write_lock);
 
 	trace_xfs_log_regrant_write_error(log, tic);
 
@@ -2778,9 +2788,8 @@ redo:
 	 */
 	tic->t_curr_res = 0;
 	tic->t_cnt = 0; /* ungrant will give back unit_res * t_cnt. */
-	spin_unlock(&log->l_grant_lock);
 	return XFS_ERROR(EIO);
-}	/* xlog_regrant_write_log_space */
+}
 
 
 /* The first cnt-1 times through here we don't need to
@@ -2799,30 +2808,27 @@ xlog_regrant_reserve_log_space(xlog_t	     *log,
 	if (ticket->t_cnt > 0)
 		ticket->t_cnt--;
 
-	spin_lock(&log->l_grant_lock);
-	xlog_grant_sub_space(log, ticket->t_curr_res);
+	xlog_grant_sub_space(log, ticket->t_curr_res, &log->l_grant_write_lsn);
+	xlog_grant_sub_space(log, ticket->t_curr_res, &log->l_grant_reserve_lsn);
+
 	ticket->t_curr_res = ticket->t_unit_res;
 	xlog_tic_reset_res(ticket);
 
 	trace_xfs_log_regrant_reserve_sub(log, ticket);
-
 	xlog_verify_grant_head(log, 1);
 
 	/* just return if we still have some of the pre-reserved space */
-	if (ticket->t_cnt > 0) {
-		spin_unlock(&log->l_grant_lock);
+	if (ticket->t_cnt > 0)
 		return;
-	}
 
-	xlog_grant_add_space_reserve(log, ticket->t_unit_res);
+	xlog_grant_add_space(log, ticket->t_unit_res, &log->l_grant_reserve_lsn);
 
 	trace_xfs_log_regrant_reserve_exit(log, ticket);
-
 	xlog_verify_grant_head(log, 0);
-	spin_unlock(&log->l_grant_lock);
+
 	ticket->t_curr_res = ticket->t_unit_res;
 	xlog_tic_reset_res(ticket);
-}	/* xlog_regrant_reserve_log_space */
+}
 
 
 /*
@@ -2843,28 +2849,31 @@ STATIC void
 xlog_ungrant_log_space(xlog_t	     *log,
 		       xlog_ticket_t *ticket)
 {
-	if (ticket->t_cnt > 0)
-		ticket->t_cnt--;
+	int	space;
 
-	spin_lock(&log->l_grant_lock);
 	trace_xfs_log_ungrant_enter(log, ticket);
 
-	xlog_grant_sub_space(log, ticket->t_curr_res);
-
-	trace_xfs_log_ungrant_sub(log, ticket);
+	if (ticket->t_cnt > 0)
+		ticket->t_cnt--;
 
-	/* If this is a permanent reservation ticket, we may be able to free
+	/*
+	 * If this is a permanent reservation ticket, we may be able to free
 	 * up more space based on the remaining count.
 	 */
+	space = ticket->t_curr_res;
 	if (ticket->t_cnt > 0) {
 		ASSERT(ticket->t_flags & XLOG_TIC_PERM_RESERV);
-		xlog_grant_sub_space(log, ticket->t_unit_res*ticket->t_cnt);
+		space += ticket->t_unit_res * ticket->t_cnt;
 	}
 
-	trace_xfs_log_ungrant_exit(log, ticket);
+	trace_xfs_log_ungrant_sub(log, ticket);
+
+	xlog_grant_sub_space(log, space, &log->l_grant_write_lsn);
+	xlog_grant_sub_space(log, space, &log->l_grant_reserve_lsn);
 
+	trace_xfs_log_ungrant_exit(log, ticket);
 	xlog_verify_grant_head(log, 1);
-	spin_unlock(&log->l_grant_lock);
+
 	xfs_log_move_tail(log->l_mp, 1);
 }	/* xlog_ungrant_log_space */
 
@@ -2901,11 +2910,12 @@ xlog_state_release_iclog(
 
 	if (iclog->ic_state == XLOG_STATE_WANT_SYNC) {
 		/* update tail before writing to iclog */
-		xlog_assign_tail_lsn(log->l_mp);
+		xfs_lsn_t tail_lsn = xlog_assign_tail_lsn(log->l_mp);
+
 		sync++;
 		iclog->ic_state = XLOG_STATE_SYNCING;
-		iclog->ic_header.h_tail_lsn = cpu_to_be64(log->l_tail_lsn);
-		xlog_verify_tail_lsn(log, iclog, log->l_tail_lsn);
+		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
+		xlog_verify_tail_lsn(log, iclog, tail_lsn);
 		/* cycle incremented when incrementing curr_block */
 	}
 	spin_unlock(&log->l_icloglock);
@@ -3435,6 +3445,7 @@ xlog_ticket_alloc(
         }
 
 	atomic_set(&tic->t_ref, 1);
+	INIT_LIST_HEAD(&tic->t_queue);
 	tic->t_unit_res		= unit_bytes;
 	tic->t_curr_res		= unit_bytes;
 	tic->t_cnt		= cnt;
@@ -3484,18 +3495,48 @@ xlog_verify_dest_ptr(
 }
 
 STATIC void
-xlog_verify_grant_head(xlog_t *log, int equals)
+xlog_verify_grant_head(
+	struct log	*log,
+	int		equals)
 {
-    if (log->l_grant_reserve_cycle == log->l_grant_write_cycle) {
-	if (equals)
-	    ASSERT(log->l_grant_reserve_bytes >= log->l_grant_write_bytes);
-	else
-	    ASSERT(log->l_grant_reserve_bytes > log->l_grant_write_bytes);
-    } else {
-	ASSERT(log->l_grant_reserve_cycle-1 == log->l_grant_write_cycle);
-	ASSERT(log->l_grant_write_bytes >= log->l_grant_reserve_bytes);
-    }
-}	/* xlog_verify_grant_head */
+/* this check is racy under concurrent modifications */
+#if 0
+	xfs_lsn_t reserve = atomic64_read(&log->l_grant_reserve_lsn);
+	xfs_lsn_t write = atomic64_read(&log->l_grant_write_lsn);
+
+	if (CYCLE_LSN(reserve) == CYCLE_LSN(write)) {
+		if (equals)
+			ASSERT(BLOCK_LSN(reserve) >= BLOCK_LSN(write));
+		else
+			ASSERT(BLOCK_LSN(reserve) > BLOCK_LSN(write));
+	} else {
+		ASSERT(CYCLE_LSN(reserve) - 1 == CYCLE_LSN(write));
+		ASSERT(BLOCK_LSN(write) >= BLOCK_LSN(reserve));
+	}
+#endif
+}
+
+STATIC void
+xlog_verify_grant_tail(
+	struct log	*log)
+{
+	xfs_lsn_t	 tail_lsn;
+	xfs_lsn_t	 write_lsn;
+
+	tail_lsn = atomic64_read(&log->l_tail_lsn);
+	write_lsn = atomic64_read(&log->l_grant_write_lsn);
+
+	/*
+	 * Check to make sure the grant write head didn't just overlap the
+	 * tail.  If the cycles are the same, we can't be overlapping.
+	 * Otherwise, make sure that the cycles differ by exactly one and
+	 * check the byte count.
+	 */
+	if (CYCLE_LSN(tail_lsn) != CYCLE_LSN(write_lsn)) {
+		ASSERT(CYCLE_LSN(write_lsn) - 1 == CYCLE_LSN(tail_lsn));
+		ASSERT(BLOCK_LSN(write_lsn) <= BBTOB(BLOCK_LSN(tail_lsn)));
+	}
+}
 
 /* check if it will fit */
 STATIC void
@@ -3721,7 +3762,6 @@ xfs_log_force_umount(
 	 * everybody up to tell the bad news.
 	 */
 	spin_lock(&log->l_icloglock);
-	spin_lock(&log->l_grant_lock);
 	mp->m_flags |= XFS_MOUNT_FS_SHUTDOWN;
 	if (mp->m_sb_bp)
 		XFS_BUF_DONE(mp->m_sb_bp);
@@ -3742,27 +3782,21 @@ xfs_log_force_umount(
 	spin_unlock(&log->l_icloglock);
 
 	/*
-	 * We don't want anybody waiting for log reservations
-	 * after this. That means we have to wake up everybody
-	 * queued up on reserve_headq as well as write_headq.
-	 * In addition, we make sure in xlog_{re}grant_log_space
-	 * that we don't enqueue anything once the SHUTDOWN flag
-	 * is set, and this action is protected by the GRANTLOCK.
+	 * We don't want anybody waiting for log reservations after this. That
+	 * means we have to wake up everybody queued up on reserveq as well as
+	 * writeq.  In addition, we make sure in xlog_{re}grant_log_space that
+	 * we don't enqueue anything once the SHUTDOWN flag is set, and this
+	 * action is protected by the grant locks.
 	 */
-	if ((tic = log->l_reserve_headq)) {
-		do {
-			sv_signal(&tic->t_wait);
-			tic = tic->t_next;
-		} while (tic != log->l_reserve_headq);
-	}
-
-	if ((tic = log->l_write_headq)) {
-		do {
-			sv_signal(&tic->t_wait);
-			tic = tic->t_next;
-		} while (tic != log->l_write_headq);
-	}
-	spin_unlock(&log->l_grant_lock);
+	spin_lock(&log->l_grant_reserve_lock);
+	list_for_each_entry(tic, &log->l_reserveq, t_queue)
+		sv_signal(&tic->t_wait);
+	spin_unlock(&log->l_grant_reserve_lock);
+
+	spin_lock(&log->l_grant_write_lock);
+	list_for_each_entry(tic, &log->l_writeq, t_queue)
+		sv_signal(&tic->t_wait);
+	spin_unlock(&log->l_grant_write_lock);
 
 	if (!(log->l_iclog->ic_state & XLOG_STATE_IOERROR)) {
 		ASSERT(!logerror);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index edcdfe0..4d6bf38 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -133,12 +133,10 @@ static inline uint xlog_get_client_id(__be32 i)
  */
 #define XLOG_TIC_INITED		0x1	/* has been initialized */
 #define XLOG_TIC_PERM_RESERV	0x2	/* permanent reservation */
-#define XLOG_TIC_IN_Q		0x4
 
 #define XLOG_TIC_FLAGS \
 	{ XLOG_TIC_INITED,	"XLOG_TIC_INITED" }, \
-	{ XLOG_TIC_PERM_RESERV,	"XLOG_TIC_PERM_RESERV" }, \
-	{ XLOG_TIC_IN_Q,	"XLOG_TIC_IN_Q" }
+	{ XLOG_TIC_PERM_RESERV,	"XLOG_TIC_PERM_RESERV" }
 
 #endif	/* __KERNEL__ */
 
@@ -245,8 +243,7 @@ typedef struct xlog_res {
 
 typedef struct xlog_ticket {
 	sv_t		   t_wait;	 /* ticket wait queue            : 20 */
-	struct xlog_ticket *t_next;	 /*			         :4|8 */
-	struct xlog_ticket *t_prev;	 /*				 :4|8 */
+	struct list_head   t_queue;	 /* reserve/write queue */
 	xlog_tid_t	   t_tid;	 /* transaction identifier	 : 4  */
 	atomic_t	   t_ref;	 /* ticket reference count       : 4  */
 	int		   t_curr_res;	 /* current reservation in bytes : 4  */
@@ -509,23 +506,34 @@ typedef struct log {
 						 * log entries" */
 	xlog_in_core_t		*l_iclog;       /* head log queue	*/
 	spinlock_t		l_icloglock;    /* grab to change iclog state */
-	xfs_lsn_t		l_tail_lsn;     /* lsn of 1st LR with unflushed
-						 * buffers */
-	xfs_lsn_t		l_last_sync_lsn;/* lsn of last LR on disk */
 	int			l_curr_cycle;   /* Cycle number of log writes */
 	int			l_prev_cycle;   /* Cycle number before last
 						 * block increment */
 	int			l_curr_block;   /* current logical log block */
 	int			l_prev_block;   /* previous logical log block */
 
-	/* The following block of fields are changed while holding grant_lock */
-	spinlock_t		l_grant_lock ____cacheline_aligned_in_smp;
-	xlog_ticket_t		*l_reserve_headq;
-	xlog_ticket_t		*l_write_headq;
-	int			l_grant_reserve_cycle;
-	int			l_grant_reserve_bytes;
-	int			l_grant_write_cycle;
-	int			l_grant_write_bytes;
+	/*
+	 * The l_tail_lsn and l_last_sync_lsn variables are set up as atomic
+	 * variables so they can be safely set and read without locking. While
+	 * they are often read together, they are updated differently with the
+	 * l_tail_lsn being quite hot, so place them on separate cachelines.
+	 */
+	/* lsn of 1st LR with unflushed buffers */
+	atomic64_t		l_tail_lsn ____cacheline_aligned_in_smp;
+	/* lsn of last LR on disk */
+	atomic64_t		l_last_sync_lsn ____cacheline_aligned_in_smp;
+
+	/*
+	 * ticket grant locks, queues and accounting have their own cachelines
+	 * as these are quite hot and can be operated on concurrently.
+	 */
+	spinlock_t		l_grant_reserve_lock ____cacheline_aligned_in_smp;
+	struct list_head	l_reserveq;
+	atomic64_t		l_grant_reserve_lsn;
+
+	spinlock_t		l_grant_write_lock ____cacheline_aligned_in_smp;
+	struct list_head	l_writeq;
+	atomic64_t		l_grant_write_lsn;
 
 	/* The following field are used for debugging; need to hold icloglock */
 #ifdef DEBUG
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index baad94a..f73a215 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -925,12 +925,13 @@ xlog_find_tail(
 	log->l_curr_cycle = be32_to_cpu(rhead->h_cycle);
 	if (found == 2)
 		log->l_curr_cycle++;
-	log->l_tail_lsn = be64_to_cpu(rhead->h_tail_lsn);
-	log->l_last_sync_lsn = be64_to_cpu(rhead->h_lsn);
-	log->l_grant_reserve_cycle = log->l_curr_cycle;
-	log->l_grant_reserve_bytes = BBTOB(log->l_curr_block);
-	log->l_grant_write_cycle = log->l_curr_cycle;
-	log->l_grant_write_bytes = BBTOB(log->l_curr_block);
+	atomic64_set(&log->l_tail_lsn, be64_to_cpu(rhead->h_tail_lsn));
+	atomic64_set(&log->l_last_sync_lsn, be64_to_cpu(rhead->h_lsn));
+
+	atomic64_set(&log->l_grant_reserve_lsn,
+		xlog_assign_lsn(log->l_curr_cycle, BBTOB(log->l_curr_block)));
+	atomic64_set(&log->l_grant_write_lsn,
+		xlog_assign_lsn(log->l_curr_cycle, BBTOB(log->l_curr_block)));
 
 	/*
 	 * Look for unmount record.  If we find it, then we know there
@@ -960,7 +961,7 @@ xlog_find_tail(
 	}
 	after_umount_blk = (i + hblks + (int)
 		BTOBB(be32_to_cpu(rhead->h_len))) % log->l_logBBsize;
-	tail_lsn = log->l_tail_lsn;
+	tail_lsn = atomic64_read(&log->l_tail_lsn);
 	if (*head_blk == after_umount_blk &&
 	    be32_to_cpu(rhead->h_num_logops) == 1) {
 		umount_data_blk = (i + hblks) % log->l_logBBsize;
@@ -975,12 +976,12 @@ xlog_find_tail(
 			 * log records will point recovery to after the
 			 * current unmount record.
 			 */
-			log->l_tail_lsn =
+			atomic64_set(&log->l_tail_lsn,
 				xlog_assign_lsn(log->l_curr_cycle,
-						after_umount_blk);
-			log->l_last_sync_lsn =
+						after_umount_blk));
+			atomic64_set(&log->l_last_sync_lsn,
 				xlog_assign_lsn(log->l_curr_cycle,
-						after_umount_blk);
+						after_umount_blk));
 			*tail_blk = after_umount_blk;
 
 			/*
-- 
1.7.2.3

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/16] patch xfs-inode-hash-fake
  2010-11-08  8:55 ` [PATCH 06/16] patch xfs-inode-hash-fake Dave Chinner
@ 2010-11-08  9:19   ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2010-11-08  9:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Missing a good subject and description.

Anyway, I think we should add an inode_mark_hashed helper, similar to
the read-side inode_unhashed, to avoid exposing the exact list
implementation to filesystems.


* Re: [PATCH 01/16] xfs: fix per-ag reference counting in inode reclaim tree walking
  2010-11-08  8:55 ` [PATCH 01/16] xfs: fix per-ag reference counting in inode reclaim tree walking Dave Chinner
@ 2010-11-08  9:23   ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2010-11-08  9:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

> +++ b/fs/xfs/linux-2.6/xfs_sync.c
> @@ -853,6 +853,7 @@ restart:
>  		if (trylock) {
>  			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
>  				skipped++;
> +				xfs_perag_put(pag);
>  				continue;
>  			}
>  			first_index = pag->pag_ici_reclaim_cursor;

One way to make this loop more maintainable is to split the guts of it
into a xfs_reclaim_inodes_ag helper, and make the existing function a
wrapper around it (and remove the incorrect _ag prefix), but that's .38
material.

For .37 the patch looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 02/16] xfs: move delayed write buffer trace
  2010-11-08  8:55 ` [PATCH 02/16] xfs: move delayed write buffer trace Dave Chinner
@ 2010-11-08  9:24   ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2010-11-08  9:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Nov 08, 2010 at 07:55:05PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The delayed write buffer split trace currently issues a trace for
> every buffer it scans. These buffers are not necessarily queued for
> delayed write. Indeed, when buffers are pinned, there can be
> thousands of traces of buffers that aren't actually queued for
> delayed write and the ones that are are lost in the noise. Move the
> trace point to record only buffers that are split out for IO to be
> issued on.

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 11/16] xfs: connect up buffer reclaim priority hooks
  2010-11-08  8:55 ` [PATCH 11/16] xfs: connect up buffer reclaim priority hooks Dave Chinner
@ 2010-11-08 11:25   ` Christoph Hellwig
  2010-11-08 23:50     ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2010-11-08 11:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

> +/*
> + * buffer types 
> + */
> +#define	B_FS_DQUOT	1
> +#define	B_FS_AGFL	2
> +#define	B_FS_AGF	3
> +#define	B_FS_ATTR_BTREE	4
> +#define	B_FS_DIR_BTREE	5
> +#define	B_FS_MAP	6
> +#define	B_FS_INOMAP	7
> +#define	B_FS_AGI	8
> +#define	B_FS_INO	9

Is there any good reason to keep/reintroduce the buffer types?  In this
series we're only using the refcounts, and I can't see any good use for
the types either.


* Re: [PATCH 13/16] xfs: reduce the number of AIL push wakeups
  2010-11-08  8:55 ` [PATCH 13/16] xfs: reduce the number of AIL push wakeups Dave Chinner
@ 2010-11-08 11:32   ` Christoph Hellwig
  2010-11-08 23:51     ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2010-11-08 11:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

>  STATIC int
> @@ -850,8 +853,17 @@ xfsaild(
>  	long		tout = 0; /* milliseconds */
>  
>  	while (!kthread_should_stop()) {
> -		schedule_timeout_interruptible(tout ?
> +		/*
> +		 * for short sleeps indicating congestion, don't allow us to
> +		 * get woken early. Otherwise all we do is bang on the AIL lock
> +		 * without making progress.
> +		 */
> +		if (tout && tout <= 20) {
> +			schedule_timeout_uninterruptible(msecs_to_jiffies(tout));
> +		} else {
> +			schedule_timeout_interruptible(tout ?
>  				msecs_to_jiffies(tout) : MAX_SCHEDULE_TIMEOUT);
> +		}

How about just setting the state ourselves and calling schedule_timeout?
That seems a lot more readable to me.  Also we can switch to
TASK_KILLABLE for the short sleeps, just to not introduce any delay
in shutting down aild when kthread_stop is called.  It would look
something like this:

		if (tout && tout <= 20)
			__set_current_state(TASK_KILLABLE);
		else
			__set_current_state(TASK_UNINTERRUPTIBLE);
		schedule_timeout(tout ?
				 msecs_to_jiffies(tout) : MAX_SCHEDULE_TIMEOUT);


* Re: [PATCH 15/16] xfs: only run xfs_error_test if error injection is active
  2010-11-08  8:55 ` [PATCH 15/16] xfs: only run xfs_error_test if error injection is active Dave Chinner
@ 2010-11-08 11:33   ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2010-11-08 11:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

> Walking btree blocks during allocation to check them requires each
> block (a cache hit, so no I/O) to call xfs_error_test(), which then
> does a random32() call as the first operation.  IOWs, ~50% of the
> CPU is being consumed just testing whether we need to inject an
> error, even though error injection is not active.
> 
> Kill this overhead when error injection is not active by adding a
> global counter of active error traps and only calling into
> xfs_error_test when fault injection is active.

Looks good.  And a good reminder that we should optimize the code to not
call xfs_btree_check_block on cache hits once we put the CRC checks
into it later.

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 05/16] xfs: don't truncate prealloc from frequently accessed inodes
  2010-11-08  8:55 ` [PATCH 05/16] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
@ 2010-11-08 11:36   ` Christoph Hellwig
  2010-11-08 23:56     ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2010-11-08 11:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

I'd be much more happy about fixing this properly in nfsd.  But I guess
the fix is simple enough that we can put it into XFS for now.  Any
reason you use up a whole int in the inode instead of using a flag in
i_flags?

> -
> -		ASSERT(ip->i_delayed_blks == 0);
> +		/*
> +		 * even after flushing the inode, there can still be delalloc
> +		 * blocks on the inode beyond EOF due to speculative
> +		 * preallocation. These are not removed until the release
> +		 * function is called or the inode is inactivated. Hence we
> +		 * cannot assert here that ip->i_delayed_blks == 0.
> +		 */

Shouldn't this be in a separate patch given that we can fail the flush
due to iolock contention?  I think this and the swapext fix are .37
material in fact.


* Re: [PATCH 04/16] xfs: dynamic speculative EOF preallocation
  2010-11-08  8:55 ` [PATCH 04/16] xfs: dynamic speculative EOF preallocation Dave Chinner
@ 2010-11-08 11:43   ` Christoph Hellwig
  2010-11-09  0:08     ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2010-11-08 11:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

> For default settings, ???e size and the initial extents is determined

weird character.

> The allocsize mount option still controls the minimum preallocation size, so
> the smallest extent size can still be bound in situations where this behaviour
> is not sufficient.

Do we also need a way to keep an upper boundary?  Think lots of slowly
growing log files on a filesystem not having tons of free space.


* Re: [PATCH 03/16] [RFC] xfs: use generic per-cpu counter infrastructure
  2010-11-08  8:55 ` [PATCH 03/16] [RFC] xfs: use generic per-cpu counter infrastructure Dave Chinner
@ 2010-11-08 12:13   ` Christoph Hellwig
  2010-11-09  0:20     ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2010-11-08 12:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Nov 08, 2010 at 07:55:06PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> XFS has a per-cpu counter implementation for in-core superblock
> counters that pre-dated the generic implementation. It is complex
> and baroque as it is tailored directly to the needs of ENOSPC
> detection. Implement the complex accurate-compare-and-add
> calculation in the generic per-cpu counter code and convert the
> XFS counters to use the much simpler generic counter code.
> 
> Passes xfsqa on SMP system.

Some mostly cosmetic comments below.  I haven't looked at the more
hairy bits like the changes to the generic percpu code and the
reservation handling yet.

> 	1. kill the no-per-cpu-counter mode?

already done.

> 	3. do we need to factor xfs_mod_sb_incore()?

Doesn't exist anymore. 

> -	xfs_icsb_sync_counters(mp, XFS_ICSB_LAZY_COUNT);
> +	xfs_icsb_sync_counters(mp);
>  	spin_lock(&mp->m_sb_lock);

Can be moved inside the lock and use the unlocked version, too.

> +static inline int
> +xfs_icsb_add(
> +	struct xfs_mount	*mp,
> +	int			counter,
> +	int64_t			delta,
> +	int64_t			threshold)
> +{
> +	int			ret;
> +
> +	ret = percpu_counter_add_unless_lt(&mp->m_icsb[counter], delta,
> +								threshold);
> +	if (ret < 0)
> +		return -ENOSPC;
> +	return 0;
> +}
> +
> +static inline void
> +xfs_icsb_set(
> +	struct xfs_mount	*mp,
> +	int			counter,
> +	int64_t			value)
> +{
> +	percpu_counter_set(&mp->m_icsb[counter], value);
> +}
> +
> +static inline int64_t
> +xfs_icsb_sum(
> +	struct xfs_mount	*mp,
> +	int			counter)
> +{
> +	return percpu_counter_sum_positive(&mp->m_icsb[counter]);
> +}
> +
> +static inline int64_t
> +xfs_icsb_read(
> +	struct xfs_mount	*mp,
> +	int			counter)
> +{
> +	return percpu_counter_read_positive(&mp->m_icsb[counter]);
> +}

I would just opencode all these helpers in their callers.  There's
generally just one caller of each, which iterates over the three
counters anyway.


> +int
> +xfs_icsb_modify_counters(
> +	xfs_mount_t	*mp,
> +	xfs_sb_field_t	field,
> +	int64_t		delta,
> +	int		rsvd)

I can't see the point of keeping this multiplexer.  The inode counts
are handled entirely differently from the block count, so they should
have separate functions.

> +{
> +	int64_t		lcounter;
> +	int64_t		res_used;
> +	int		ret = 0;
> +
> +
> +	switch (field) {
> +	case XFS_SBS_ICOUNT:
> +		ret = xfs_icsb_add(mp, XFS_ICSB_ICOUNT, delta, 0);
> +		if (ret < 0) {
> +			ASSERT(0);
> +			return XFS_ERROR(EINVAL);
> +		}
> +		return 0;
> +
> +	case XFS_SBS_IFREE:
> +		ret = xfs_icsb_add(mp, XFS_ICSB_IFREE, delta, 0);
> +		if (ret < 0) {
> +			ASSERT(0);
> +			return XFS_ERROR(EINVAL);
> +		}
> +		return 0;

If you're keeping a common helper for both inode counts this can be
simplified by sharing the code and just passing on the field instead
of having two cases.

> +	struct percpu_counter	m_icsb[XFS_ICSB_MAX];

I wonder if there's all that much of a point in keeping the array.
We basically only use the fact it's an array for the init/destroy
code.  Maybe it would be a tad cleaner to just have three separate
percpu counters.

> +static inline void
> +xfs_icsb_sync_counters(
> +	struct xfs_mount	*mp)
> +{
> +	spin_lock(&mp->m_sb_lock);
> +	xfs_icsb_sync_counters_locked(mp);
> +	spin_unlock(&mp->m_sb_lock);
> +}

There's only one caller of this left after my comment above is
addressed. I'd just make xfs_icsb_sync_counters the locked version,
throw in an assert_spin_locked and have the one remaining caller
take the lock opencoded as well.

> --- a/include/linux/percpu_counter.h
> +++ b/include/linux/percpu_counter.h
> @@ -41,6 +41,8 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
>  void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
>  s64 __percpu_counter_sum(struct percpu_counter *fbc);
>  int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs);
> +int percpu_counter_add_unless_lt(struct percpu_counter *fbc, s64 amount,
> +							s64 threshold);
>  
>  static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
>  {
> @@ -153,6 +155,20 @@ static inline int percpu_counter_initialized(struct percpu_counter *fbc)
>  	return 1;
>  }
>  
> +static inline int percpu_counter_test_and_add_delta(struct percpu_counter *fbc, s64 delta)

This doesn't match the function provided for CONFIG_SMP.

> +/**
> + *

spurious line.

> +int percpu_counter_add_unless_lt(struct percpu_counter *fbc, s64 amount, s64
> +threshold)

too long line


* Re: [PATCH 00/16] xfs: current patch stack for 2.6.38 window
  2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
                   ` (15 preceding siblings ...)
  2010-11-08  8:55 ` [PATCH 16/16] xfs: make xlog_space_left() independent of the grant lock Dave Chinner
@ 2010-11-08 14:17 ` Christoph Hellwig
  2010-11-09  0:21   ` Dave Chinner
  16 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2010-11-08 14:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Nov 08, 2010 at 07:55:03PM +1100, Dave Chinner wrote:
> My tree is currently based on the VFS locking changes I have out for review,
> so there's a couple fo patches that won't apply sanely to a mainline or OSS xfs
> dev tree. See below for a pointer to a git tree with all the patches in it.

The only thing that should depend on it are the inode hash changes.  I
suspect it might be a better idea if we feed those via Al together with
the VFS scalability bits, and only feed the rest through the XFS tree to
avoid having nasty dependencies.


* Re: [PATCH 07/16] xfs: convert inode cache lookups to use RCU locking
  2010-11-08  8:55 ` [PATCH 07/16] xfs: convert inode cache lookups to use RCU locking Dave Chinner
@ 2010-11-08 23:09   ` Christoph Hellwig
  2010-11-09  0:24     ` Dave Chinner
  2010-11-09  3:36     ` Paul E. McKenney
  0 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2010-11-08 23:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: paulmck, eric.dumazet, xfs

This patch generally looks good to me, but with so much RCU magic I'd prefer
if Paul & Eric could look over it.

On Mon, Nov 08, 2010 at 07:55:10PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> With delayed logging greatly increasing the sustained parallelism of inode
> operations, the inode cache locking is showing significant read vs write
> contention when inode reclaim runs at the same time as lookups. There is
> also a lot more write lock acquisitions than there are read locks (4:1 ratio)
> so the read locking is not really buying us much in the way of parallelism.
> 
> To avoid the read vs write contention, change the cache to use RCU locking on
> the read side. To avoid needing to RCU free every single inode, use the built
> in slab RCU freeing mechanism. This requires us to be able to detect lookups of
> freed inodes, so ensure that every freed inode has an inode number of zero and
> the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
> lookup path, but also add a check for a zero inode number as well.
> 
> We can then convert all the read locking lookups to use RCU read side locking
> and hence remove all read side locking.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Alex Elder <aelder@sgi.com>
> ---
>  fs/xfs/linux-2.6/xfs_iops.c    |    7 +++++-
>  fs/xfs/linux-2.6/xfs_sync.c    |   13 +++++++++--
>  fs/xfs/quota/xfs_qm_syscalls.c |    3 ++
>  fs/xfs/xfs_iget.c              |   44 ++++++++++++++++++++++++++++++---------
>  fs/xfs/xfs_inode.c             |   22 ++++++++++++-------
>  5 files changed, 67 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
> index 8b46867..909bd9c 100644
> --- a/fs/xfs/linux-2.6/xfs_iops.c
> +++ b/fs/xfs/linux-2.6/xfs_iops.c
> @@ -757,6 +757,8 @@ xfs_diflags_to_iflags(
>   * We don't use the VFS inode hash for lookups anymore, so make the inode look
>   * hashed to the VFS by faking it. This avoids needing to touch inode hash
>   * locks in this path, but makes the VFS believe the inode is validly hashed.
> + * We initialise i_state and i_hash under the i_lock so that we follow the same
> + * setup rules that the rest of the VFS follows.
>   */
>  void
>  xfs_setup_inode(
> @@ -765,10 +767,13 @@ xfs_setup_inode(
>  	struct inode		*inode = &ip->i_vnode;
>  
>  	inode->i_ino = ip->i_ino;
> +
> +	spin_lock(&inode->i_lock);
>  	inode->i_state = I_NEW;
> +	hlist_nulls_add_fake(&inode->i_hash);
> +	spin_unlock(&inode->i_lock);

This screams for another VFS helper, even if it's XFS-specific for now.
Having to duplicate inode.c-private locking rules in XFS seems a bit
nasty to me.

>  
>  	inode_sb_list_add(inode);
> -	hlist_nulls_add_fake(&inode->i_hash);
>  
>  	inode->i_mode	= ip->i_d.di_mode;
>  	inode->i_nlink	= ip->i_d.di_nlink;
> diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> index afb0d7c..9a53cc9 100644
> --- a/fs/xfs/linux-2.6/xfs_sync.c
> +++ b/fs/xfs/linux-2.6/xfs_sync.c
> @@ -53,6 +53,10 @@ xfs_inode_ag_walk_grab(
>  {
>  	struct inode		*inode = VFS_I(ip);
>  
> +	/* check for stale RCU freed inode */
> +	if (!ip->i_ino)
> +		return ENOENT;

Assuming i_ino is never 0 is fine for XFS, unlike for the generic VFS
code, so ACK.

>  	/* nothing to sync during shutdown */
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
>  		return EFSCORRUPTED;
> @@ -98,12 +102,12 @@ restart:
>  		int		error = 0;
>  		int		i;
>  
> -		read_lock(&pag->pag_ici_lock);
> +		rcu_read_lock();
>  		nr_found = radix_tree_gang_lookup(&pag->pag_ici_root,
>  					(void **)batch, first_index,
>  					XFS_LOOKUP_BATCH);
>  		if (!nr_found) {
> -			read_unlock(&pag->pag_ici_lock);
> +			rcu_read_unlock();
>  			break;
>  		}
>  
> @@ -129,7 +133,7 @@ restart:
>  		}
>  
>  		/* unlock now we've grabbed the inodes. */
> -		read_unlock(&pag->pag_ici_lock);
> +		rcu_read_unlock();
>  
>  		for (i = 0; i < nr_found; i++) {
>  			if (!batch[i])
> @@ -639,6 +643,9 @@ xfs_reclaim_inode_grab(
>  	struct xfs_inode	*ip,
>  	int			flags)
>  {
> +	/* check for stale RCU freed inode */
> +	if (!ip->i_ino)
> +		return 1;
>  
>  	/*
>  	 * do some unlocked checks first to avoid unnecceary lock traffic.
> diff --git a/fs/xfs/quota/xfs_qm_syscalls.c b/fs/xfs/quota/xfs_qm_syscalls.c
> index bdebc18..8b207fc 100644
> --- a/fs/xfs/quota/xfs_qm_syscalls.c
> +++ b/fs/xfs/quota/xfs_qm_syscalls.c
> @@ -875,6 +875,9 @@ xfs_dqrele_inode(
>  	struct xfs_perag	*pag,
>  	int			flags)
>  {
> +	if (!ip->i_ino)
> +		return ENOENT;
> +

Why do we need the check here again?  Having it in
xfs_inode_ag_walk_grab should be enough.

>  	/* skip quota inodes */
>  	if (ip == ip->i_mount->m_quotainfo->qi_uquotaip ||
>  	    ip == ip->i_mount->m_quotainfo->qi_gquotaip) {
> diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> index 18991a9..edeb918 100644
> --- a/fs/xfs/xfs_iget.c
> +++ b/fs/xfs/xfs_iget.c
> @@ -69,6 +69,7 @@ xfs_inode_alloc(
>  	ASSERT(atomic_read(&ip->i_pincount) == 0);
>  	ASSERT(!spin_is_locked(&ip->i_flags_lock));
>  	ASSERT(completion_done(&ip->i_flush));
> +	ASSERT(ip->i_ino == 0);
>  
>  	mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
>  
> @@ -86,9 +87,6 @@ xfs_inode_alloc(
>  	ip->i_new_size = 0;
>  	ip->i_dirty_releases = 0;
>  
> -	/* prevent anyone from using this yet */
> -	VFS_I(ip)->i_state = I_NEW;
> -
>  	return ip;
>  }
>  
> @@ -135,6 +133,16 @@ xfs_inode_free(
>  	ASSERT(!spin_is_locked(&ip->i_flags_lock));
>  	ASSERT(completion_done(&ip->i_flush));
>  
> +	/*
> +	 * because we use SLAB_DESTROY_BY_RCU freeing, ensure the inode
> +	 * always appears to be reclaimed with an invalid inode number
> +	 * when in the free state. The ip->i_flags_lock provides the barrier
> +	 * against lookup races.
> +	 */
> +	spin_lock(&ip->i_flags_lock);
> +	ip->i_flags = XFS_IRECLAIM;
> +	ip->i_ino = 0;
> +	spin_unlock(&ip->i_flags_lock);
>  	kmem_zone_free(xfs_inode_zone, ip);
>  }
>  
> @@ -146,12 +154,28 @@ xfs_iget_cache_hit(
>  	struct xfs_perag	*pag,
>  	struct xfs_inode	*ip,
>  	int			flags,
> -	int			lock_flags) __releases(pag->pag_ici_lock)
> +	int			lock_flags) __releases(RCU)
>  {
>  	struct inode		*inode = VFS_I(ip);
>  	struct xfs_mount	*mp = ip->i_mount;
>  	int			error;
>  
> +	/*
> +	 * check for re-use of an inode within an RCU grace period due to the
> +	 * radix tree nodes not being updated yet. We monitor for this by
> +	 * setting the inode number to zero before freeing the inode structure.
> +	 * We don't need to recheck this after taking the i_flags_lock because
> +	 * the check against XFS_IRECLAIM will catch a freed inode.
> +	 */
> +	if (ip->i_ino == 0) {
> +		trace_xfs_iget_skip(ip);
> +		XFS_STATS_INC(xs_ig_frecycle);
> +		rcu_read_unlock();
> +		/* Expire the grace period so we don't trip over it again. */
> +		synchronize_rcu();
> +		return EAGAIN;
> +	}
> +
>  	spin_lock(&ip->i_flags_lock);
>  
>  	/*
> @@ -195,7 +219,7 @@ xfs_iget_cache_hit(
>  		ip->i_flags |= XFS_IRECLAIM;
>  
>  		spin_unlock(&ip->i_flags_lock);
> -		read_unlock(&pag->pag_ici_lock);
> +		rcu_read_unlock();
>  
>  		error = -inode_init_always(mp->m_super, inode);
>  		if (error) {
> @@ -203,7 +227,7 @@ xfs_iget_cache_hit(
>  			 * Re-initializing the inode failed, and we are in deep
>  			 * trouble.  Try to re-add it to the reclaim list.
>  			 */
> -			read_lock(&pag->pag_ici_lock);
> +			rcu_read_lock();
>  			spin_lock(&ip->i_flags_lock);
>  
>  			ip->i_flags &= ~XFS_INEW;
> @@ -231,7 +255,7 @@ xfs_iget_cache_hit(
>  
>  		/* We've got a live one. */
>  		spin_unlock(&ip->i_flags_lock);
> -		read_unlock(&pag->pag_ici_lock);
> +		rcu_read_unlock();
>  		trace_xfs_iget_hit(ip);
>  	}
>  
> @@ -245,7 +269,7 @@ xfs_iget_cache_hit(
>  
>  out_error:
>  	spin_unlock(&ip->i_flags_lock);
> -	read_unlock(&pag->pag_ici_lock);
> +	rcu_read_unlock();
>  	return error;
>  }
>  
> @@ -376,7 +400,7 @@ xfs_iget(
>  
>  again:
>  	error = 0;
> -	read_lock(&pag->pag_ici_lock);
> +	rcu_read_lock();
>  	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
>  
>  	if (ip) {
> @@ -384,7 +408,7 @@ again:
>  		if (error)
>  			goto out_error_or_again;
>  	} else {
> -		read_unlock(&pag->pag_ici_lock);
> +		rcu_read_unlock();
>  		XFS_STATS_INC(xs_ig_missed);
>  
>  		error = xfs_iget_cache_miss(mp, pag, tp, ino, &ip,
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 108c7a0..25becb1 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2000,13 +2000,14 @@ xfs_ifree_cluster(
>  		 */
>  		for (i = 0; i < ninodes; i++) {
>  retry:
> -			read_lock(&pag->pag_ici_lock);
> +			rcu_read_lock();
>  			ip = radix_tree_lookup(&pag->pag_ici_root,
>  					XFS_INO_TO_AGINO(mp, (inum + i)));
>  
>  			/* Inode not in memory or stale, nothing to do */
> -			if (!ip || xfs_iflags_test(ip, XFS_ISTALE)) {
> -				read_unlock(&pag->pag_ici_lock);
> +			if (!ip || !ip->i_ino ||
> +			    xfs_iflags_test(ip, XFS_ISTALE)) {
> +				rcu_read_unlock();
>  				continue;
>  			}
>  
> @@ -2019,11 +2020,11 @@ retry:
>  			 */
>  			if (ip != free_ip &&
>  			    !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> -				read_unlock(&pag->pag_ici_lock);
> +				rcu_read_unlock();
>  				delay(1);
>  				goto retry;
>  			}
> -			read_unlock(&pag->pag_ici_lock);
> +			rcu_read_unlock();
>  
>  			xfs_iflock(ip);
>  			xfs_iflags_set(ip, XFS_ISTALE);
> @@ -2629,7 +2630,7 @@ xfs_iflush_cluster(
>  
>  	mask = ~(((XFS_INODE_CLUSTER_SIZE(mp) >> mp->m_sb.sb_inodelog)) - 1);
>  	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
> -	read_lock(&pag->pag_ici_lock);
> +	rcu_read_lock();
>  	/* really need a gang lookup range call here */
>  	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)ilist,
>  					first_index, inodes_per_cluster);
> @@ -2640,6 +2641,11 @@ xfs_iflush_cluster(
>  		iq = ilist[i];
>  		if (iq == ip)
>  			continue;
> +
> +		/* check we've got a valid inode */
> +		if (!iq->i_ino)
> +			continue;
> +
>  		/* if the inode lies outside this cluster, we're done. */
>  		if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index)
>  			break;
> @@ -2692,7 +2698,7 @@ xfs_iflush_cluster(
>  	}
>  
>  out_free:
> -	read_unlock(&pag->pag_ici_lock);
> +	rcu_read_unlock();
>  	kmem_free(ilist);
>  out_put:
>  	xfs_perag_put(pag);
> @@ -2704,7 +2710,7 @@ cluster_corrupt_out:
>  	 * Corruption detected in the clustering loop.  Invalidate the
>  	 * inode buffer and shut down the filesystem.
>  	 */
> -	read_unlock(&pag->pag_ici_lock);
> +	rcu_read_unlock();
>  	/*
>  	 * Clean up the buffer.  If it was B_DELWRI, just release it --
>  	 * brelse can handle it with no problems.  If not, shut down the
> -- 
> 1.7.2.3
> 

* Re: [PATCH 08/16] xfs: convert pag_ici_lock to a spin lock
  2010-11-08  8:55 ` [PATCH 08/16] xfs: convert pag_ici_lock to a spin lock Dave Chinner
@ 2010-11-08 23:10   ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2010-11-08 23:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Nov 08, 2010 at 07:55:11PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we are using RCU protection for the inode cache lookups,
> the lock is only needed on the modification side. Hence it is not
> necessary for the lock to be a rwlock as there are no read side
> holders anymore. Convert it to a spin lock to reflect its exclusive
> nature.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Alex Elder <aelder@sgi.com>

Looks good, and this will make Thomas happy given that XFS is now
rwlock_t-free.

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 10/16] xfs: add a lru to the XFS buffer cache
  2010-11-08  8:55 ` [PATCH 10/16] xfs: add a lru to the XFS buffer cache Dave Chinner
@ 2010-11-08 23:19   ` Christoph Hellwig
  2010-11-08 23:45     ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2010-11-08 23:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

> @@ -471,6 +546,8 @@ _xfs_buf_find(
>  		/* the buffer keeps the perag reference until it is freed */
>  		new_bp->b_pag = pag;
>  		spin_unlock(&pag->pag_buf_lock);
> +
> +		xfs_buf_lru_add(new_bp);

Why do we add the buffer to the lru when we find it?  Normally we
would remove it here (unless we want a lazy lru scheme), and potentially
increment b_lru_ref - although that seems to be done by the callers
in the next patch.


* Re: [PATCH 10/16] xfs: add a lru to the XFS buffer cache
  2010-11-08 23:19   ` Christoph Hellwig
@ 2010-11-08 23:45     ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-08 23:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Nov 08, 2010 at 06:19:28PM -0500, Christoph Hellwig wrote:
> > @@ -471,6 +546,8 @@ _xfs_buf_find(
> >  		/* the buffer keeps the perag reference until it is freed */
> >  		new_bp->b_pag = pag;
> >  		spin_unlock(&pag->pag_buf_lock);
> > +
> > +		xfs_buf_lru_add(new_bp);
> 
> Why do we add the buffer to the lru when we find it?  Normally we
> would remove it here (unless we want a lazy lru scheme),

Oh, I forgot to remove that from the patch when rewriting it to use
lazy updates. Good catch! (*)

>
> and potentially increment b_lru_ref - although that seems to be
> done by the callers in the next patch.

b_lru_ref is initialised to 1 when the buffer is first initialised,
so it doesn't need to be done here. And yes, the next patch allows
the users of the buffers to set the reclaim reference count
themselves when the buffer is read for prioritisation. Ideally I
don't want to have to touch the reclaim state of the buffer during
xfs_buf_find....

Cheers,

Dave.

(*) Catching this sort of bug is exactly why I posted the series
early. Little details are easy to miss in a forest of changes this
large....
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 11/16] xfs: connect up buffer reclaim priority hooks
  2010-11-08 11:25   ` Christoph Hellwig
@ 2010-11-08 23:50     ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-08 23:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Nov 08, 2010 at 06:25:09AM -0500, Christoph Hellwig wrote:
> > +/*
> > + * buffer types 
> > + */
> > +#define	B_FS_DQUOT	1
> > +#define	B_FS_AGFL	2
> > +#define	B_FS_AGF	3
> > +#define	B_FS_ATTR_BTREE	4
> > +#define	B_FS_DIR_BTREE	5
> > +#define	B_FS_MAP	6
> > +#define	B_FS_INOMAP	7
> > +#define	B_FS_AGI	8
> > +#define	B_FS_INO	9
> 
> Is there any good reason to keep/reintroduce the buffer types?  In this
> series we're only using the refcounts, and I can't see any good use for
> the types either.

I just wanted to use the existing hooks without modifying them to be
plain "set reference" hooks. I'm not really fussed - this was just
the simplest way to get the hooks working.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 13/16] xfs: reduce the number of AIL push wakeups
  2010-11-08 11:32   ` Christoph Hellwig
@ 2010-11-08 23:51     ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-08 23:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Nov 08, 2010 at 06:32:04AM -0500, Christoph Hellwig wrote:
> >  STATIC int
> > @@ -850,8 +853,17 @@ xfsaild(
> >  	long		tout = 0; /* milliseconds */
> >  
> >  	while (!kthread_should_stop()) {
> > -		schedule_timeout_interruptible(tout ?
> > +		/*
> > +		 * for short sleeps indicating congestion, don't allow us to
> > +		 * get woken early. Otherwise all we do is bang on the AIL lock
> > +		 * without making progress.
> > +		 */
> > +		if (tout && tout <= 20) {
> > +			schedule_timeout_uninterruptible(msecs_to_jiffies(tout));
> > +		} else {
> > +			schedule_timeout_interruptible(tout ?
> >  				msecs_to_jiffies(tout) : MAX_SCHEDULE_TIMEOUT);
> > +		}
> 
> How about just setting the state ourselves and calling schedule_timeout?
> That seems a lot more readable to me.  Also we can switch to
> TASK_KILLABLE for the short sleeps, just to not introduce any delay
> in shutting down aild when kthread_stop is called.  It would look
> something like this:
> 
> 		if (tout && tout <= 20)
> 			__set_current_state(TASK_KILLABLE);
> 		else
> 			__set_current_state(TASK_UNINTERRUPTIBLE);
> 		schedule_timeout(tout ?
> 				 msecs_to_jiffies(tout) : MAX_SCHEDULE_TIMEOUT);

Yes, seems reasonable. I'll convert it to do this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 05/16] xfs: don't truncate prealloc from frequently accessed inodes
  2010-11-08 11:36   ` Christoph Hellwig
@ 2010-11-08 23:56     ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-08 23:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Nov 08, 2010 at 06:36:45AM -0500, Christoph Hellwig wrote:
> I'd be much more happy about fixing this properly in nfsd. 

So would I, but we've been saying that for years and it still ain't
done....

> But I guess
> the fix is simple enough that we can put it into XFS for now.  Any
> reason you use up a whole int in the inode instead of using a flag in
> i_flags?

I wasn't sure how many dirty releases we wanted before triggering
the change of behaviour. A single dirty release seems to be fine in
my testing so far, and if that continues then I think that, like
you suggest, changing it to a flag in i_flags is the right thing to
do.

> > -
> > -		ASSERT(ip->i_delayed_blks == 0);
> > +		/*
> > +		 * even after flushing the inode, there can still be delalloc
> > +		 * blocks on the inode beyond EOF due to speculative
> > +		 * preallocation. These are not removed until the release
> > +		 * function is called or the inode is inactivated. Hence we
> > +		 * cannot assert here that ip->i_delayed_blks == 0.
> > +		 */
> 
> Shouldn't this be in a separate patch given that we can fail the flush
> due to iolock contention?  I think this and the swapext fix are .37
> material in fact.

Agreed. I should have noted in the series preamble that I thought these
probably need splitting out into separate bug fixing patches rather
than being lumped in here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 04/16] xfs: dynamic speculative EOF preallocation
  2010-11-08 11:43   ` Christoph Hellwig
@ 2010-11-09  0:08     ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-09  0:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Nov 08, 2010 at 06:43:25AM -0500, Christoph Hellwig wrote:
> > For default settings, ???e size and the initial extents is determined
> 
> weird character.
> 
> > The allocsize mount option still controls the minimum preallocation size, so
> > the smallest extent size can stil be bound in situations where this behaviour
> > is not sufficient.
> 
> Do we also need a way to keep an upper boundary?  Think lots of slowly
> growing log files on a filesystem not having tons of free space.

Perhaps - it's one of the things I've been debating backwards and
forwards and done nothing about yet.

It's hard to trim back preallocation before we hit ENOSPC via a
static threshold (e.g. 1% free space could be terabytes of space), but
once ENOSPC is hit we drop new preallocation completely. Perhaps a
gradual decrease in the maximum prealloc size based on freespace
remaining? e.g.

freespace	max prealloc size
  >5%		  full extent (8GB)
  4-5%		   2GB (8GB >> 2)
  3-4%		   1GB (8GB >> 3)
  2-3%		 512MB (8GB >> 4)
  1-2%		 256MB (8GB >> 5)
  <1%		 128MB (8GB >> 6)

I'm open to other ideas on what to do here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 03/16] [RFC] xfs: use generic per-cpu counter infrastructure
  2010-11-08 12:13   ` Christoph Hellwig
@ 2010-11-09  0:20     ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-09  0:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Nov 08, 2010 at 07:13:22AM -0500, Christoph Hellwig wrote:
> On Mon, Nov 08, 2010 at 07:55:06PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > XFS has a per-cpu counter implementation for in-core superblock
> > counters that pre-dated the generic implementation. It is complex
> > and baroque as it is tailored directly to the needs of ENOSPC
> > detection. Implement the complex accurate-compare-and-add
> > calculation in the generic per-cpu counter code and convert the
> > XFS counters to use the much simpler generic counter code.
> > 
> > Passes xfsqa on SMP system.
> 
> Some mostly cosmetic comments below.  I haven't looked at the more
> hairy bits like the changes to the generic percpu code and the
> reservation handling yet.
> 
> > 	1. kill the no-per-cpu-counter mode?
> 
> already done.
> 
> > 	3. do we need to factor xfs_mod_sb_incore()?
> 
> Doesn't exist anymore. 

Ah, forgot to update the commit message ;)

> > -	xfs_icsb_sync_counters(mp, XFS_ICSB_LAZY_COUNT);
> > +	xfs_icsb_sync_counters(mp);
> >  	spin_lock(&mp->m_sb_lock);
> 
> Can be moved inside the lock and use the unlocked version, too.

OK, I just went for the straight transformation approach.

> > +static inline int
> > +xfs_icsb_add(
> > +	struct xfs_mount	*mp,
> > +	int			counter,
> > +	int64_t			delta,
> > +	int64_t			threshold)
> > +{
> > +	int			ret;
> > +
> > +	ret = percpu_counter_add_unless_lt(&mp->m_icsb[counter], delta,
> > +								threshold);
> > +	if (ret < 0)
> > +		return -ENOSPC;
> > +	return 0;
> > +}
> > +
> > +static inline void
> > +xfs_icsb_set(
> > +	struct xfs_mount	*mp,
> > +	int			counter,
> > +	int64_t			value)
> > +{
> > +	percpu_counter_set(&mp->m_icsb[counter], value);
> > +}
> > +
> > +static inline int64_t
> > +xfs_icsb_sum(
> > +	struct xfs_mount	*mp,
> > +	int			counter)
> > +{
> > +	return percpu_counter_sum_positive(&mp->m_icsb[counter]);
> > +}
> > +
> > +static inline int64_t
> > +xfs_icsb_read(
> > +	struct xfs_mount	*mp,
> > +	int			counter)
> > +{
> > +	return percpu_counter_read_positive(&mp->m_icsb[counter]);
> > +}
> 
> I would just opencode all these helpers in their callers.  There's
> generally just one caller of each, which iterates over the three
> counters anyway.

That seems reasonable, but I had a good reason for adding the
wrappers: I'm not sure that the fixed percpu counter batch
size (32) scales well enough for large systems. In the bdi code, a
custom batch size that is logarithmically scaled with the number of
CPUs is used, and I suspect we'll need to do this here, too. Hence
I'd like to keep the wrappers to minimise the number of places we'd
need to modify to handle customised batch sizes.
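
To illustrate - the bdi code sizes its batch as BDI_STAT_BATCH,
i.e. 8 * (1 + ilog2(nr_cpu_ids)). A userspace sketch of a similarly
scaled batch for the icsb counters (icsb_batch_size() is a made-up
name for illustration):

```c
#include <assert.h>

/* integer log2 for runtime values, like the kernel's ilog2() */
static int ilog2(unsigned int n)
{
	int r = -1;

	while (n) {
		n >>= 1;
		r++;
	}
	return r;
}

/*
 * Batch scaled like BDI_STAT_BATCH: grows logarithmically with the
 * CPU count instead of the fixed percpu_counter default of 32.
 */
static int icsb_batch_size(unsigned int nr_cpus)
{
	return 8 * (1 + ilog2(nr_cpus));
}
```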

> > +int
> > +xfs_icsb_modify_counters(
> > +	xfs_mount_t	*mp,
> > +	xfs_sb_field_t	field,
> > +	int64_t		delta,
> > +	int		rsvd)
> 
> I can't see the point of keeping this multiplexer.  The inode counts
> are handled entirely different from the block count, so they should
> have separate functions.

I just went for the simple approach - I wanted to get it working
without having to modify lots of other code. Now that it is working,
I can see why getting rid of the wrapper altogether would be good.

> 
> > +{
> > +	int64_t		lcounter;
> > +	int64_t		res_used;
> > +	int		ret = 0;
> > +
> > +
> > +	switch (field) {
> > +	case XFS_SBS_ICOUNT:
> > +		ret = xfs_icsb_add(mp, XFS_ICSB_ICOUNT, delta, 0);
> > +		if (ret < 0) {
> > +			ASSERT(0);
> > +			return XFS_ERROR(EINVAL);
> > +		}
> > +		return 0;
> > +
> > +	case XFS_SBS_IFREE:
> > +		ret = xfs_icsb_add(mp, XFS_ICSB_IFREE, delta, 0);
> > +		if (ret < 0) {
> > +			ASSERT(0);
> > +			return XFS_ERROR(EINVAL);
> > +		}
> > +		return 0;
> 
> If you're keeping a common helper for both inode counts this can be
> simplified by sharing the code and just passing on the field instead
> of having two cases.
> 
> > +	struct percpu_counter	m_icsb[XFS_ICSB_MAX];
> 
> I wonder if there's all that much of a point in keeping the array.
> We basically only use the fact it's an array for the init/destroy
> code.  Maybe it would be a tad cleaner to just have three separate
> percpu counters.

Not sure - I'd like to extend the per-cpu counters to more fields in
the superblock (e.g. the rt extent counter), and having an array
makes that pretty simple...


> > +++ b/include/linux/percpu_counter.h
> > @@ -41,6 +41,8 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
> >  void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
> >  s64 __percpu_counter_sum(struct percpu_counter *fbc);
> >  int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs);
> > +int percpu_counter_add_unless_lt(struct percpu_counter *fbc, s64 amount,
> > +							s64 threshold);
> >  
> >  static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
> >  {
> > @@ -153,6 +155,20 @@ static inline int percpu_counter_initialized(struct percpu_counter *fbc)
> >  	return 1;
> >  }
> >  
> > +static inline int percpu_counter_test_and_add_delta(struct percpu_counter *fbc, s64 delta)
> 
> This doesn't match the function provided for CONFIG_SMP.
> 

Doh - I hadn't retested UP since I renamed the function that did all
the work.

And I just realised that with UP using the icsb functions, I
can kill all the cases in the locked variant....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 00/16] xfs: current patch stack for 2.6.38 window
  2010-11-08 14:17 ` [PATCH 00/16] xfs: current patch stack for 2.6.38 window Christoph Hellwig
@ 2010-11-09  0:21   ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-09  0:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Nov 08, 2010 at 09:17:46AM -0500, Christoph Hellwig wrote:
> On Mon, Nov 08, 2010 at 07:55:03PM +1100, Dave Chinner wrote:
> > My tree is currently based on the VFS locking changes I have out for review,
> > so there's a couple of patches that won't apply sanely to a mainline or OSS xfs
> > dev tree. See below for a pointer to a git tree with all the patches in it.
> 
> The only thing that should depend on it is the inode hash changes.  I
> suspect it might be a better idea if we feed those via Al together with
> the VFS scalability bits, and only feed the rest through the XFS tree to
> avoid having nasty dependencies.

Yes, sounds reasonable.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 07/16] xfs: convert inode cache lookups to use RCU locking
  2010-11-08 23:09   ` Christoph Hellwig
@ 2010-11-09  0:24     ` Dave Chinner
  2010-11-09  3:36     ` Paul E. McKenney
  1 sibling, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-09  0:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: paulmck, eric.dumazet, xfs

On Mon, Nov 08, 2010 at 06:09:29PM -0500, Christoph Hellwig wrote:
> This patch generally looks good to me, but with so much RCU magic I'd prefer
> if Paul & Eric could look over it.
> 
> On Mon, Nov 08, 2010 at 07:55:10PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > With delayed logging greatly increasing the sustained parallelism of inode
> > operations, the inode cache locking is showing significant read vs write
> > contention when inode reclaim runs at the same time as lookups. There is
> > also a lot more write lock acquisitions than there are read locks (4:1 ratio)
> > so the read locking is not really buying us much in the way of parallelism.
> > 
> > To avoid the read vs write contention, change the cache to use RCU locking on
> > the read side. To avoid needing to RCU free every single inode, use the built
> > in slab RCU freeing mechanism. This requires us to be able to detect lookups of
> > freed inodes, so ensure that every freed inode has an inode number of zero and
> > the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in the cache
> > hit lookup path, but also add a check for a zero inode number as well.
> > 
> > We can then convert all the read locking lookups to use RCU read side locking
> > and hence remove all read side locking.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Alex Elder <aelder@sgi.com>
> > ---
> >  fs/xfs/linux-2.6/xfs_iops.c    |    7 +++++-
> >  fs/xfs/linux-2.6/xfs_sync.c    |   13 +++++++++--
> >  fs/xfs/quota/xfs_qm_syscalls.c |    3 ++
> >  fs/xfs/xfs_iget.c              |   44 ++++++++++++++++++++++++++++++---------
> >  fs/xfs/xfs_inode.c             |   22 ++++++++++++-------
> >  5 files changed, 67 insertions(+), 22 deletions(-)
> > 
> > diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
> > index 8b46867..909bd9c 100644
> > --- a/fs/xfs/linux-2.6/xfs_iops.c
> > +++ b/fs/xfs/linux-2.6/xfs_iops.c
> > @@ -757,6 +757,8 @@ xfs_diflags_to_iflags(
> >   * We don't use the VFS inode hash for lookups anymore, so make the inode look
> >   * hashed to the VFS by faking it. This avoids needing to touch inode hash
> >   * locks in this path, but makes the VFS believe the inode is validly hashed.
> > + * We initialise i_state and i_hash under the i_lock so that we follow the same
> > + * setup rules that the rest of the VFS follows.
> >   */
> >  void
> >  xfs_setup_inode(
> > @@ -765,10 +767,13 @@ xfs_setup_inode(
> >  	struct inode		*inode = &ip->i_vnode;
> >  
> >  	inode->i_ino = ip->i_ino;
> > +
> > +	spin_lock(&inode->i_lock);
> >  	inode->i_state = I_NEW;
> > +	hlist_nulls_add_fake(&inode->i_hash);
> > +	spin_unlock(&inode->i_lock);
> 
> This screams for another VFS helper, even if it's XFS-specific for now.
> Having to duplicate inode.c-private locking rules in XFS seems a bit
> nasty to me.

Agreed. I was thinking that it would be a good idea to do this, but
I hadn't decided on how to do it yet....

> > diff --git a/fs/xfs/quota/xfs_qm_syscalls.c b/fs/xfs/quota/xfs_qm_syscalls.c
> > index bdebc18..8b207fc 100644
> > --- a/fs/xfs/quota/xfs_qm_syscalls.c
> > +++ b/fs/xfs/quota/xfs_qm_syscalls.c
> > @@ -875,6 +875,9 @@ xfs_dqrele_inode(
> >  	struct xfs_perag	*pag,
> >  	int			flags)
> >  {
> > +	if (!ip->i_ino)
> > +		return ENOENT;
> > +
> 
> Why do we need the check here again?  Having it in
> xfs_inode_ag_walk_grab should be enough.

Yes, you are right. I'll fix that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 07/16] xfs: convert inode cache lookups to use RCU locking
  2010-11-08 23:09   ` Christoph Hellwig
  2010-11-09  0:24     ` Dave Chinner
@ 2010-11-09  3:36     ` Paul E. McKenney
  2010-11-09  5:04       ` Dave Chinner
  1 sibling, 1 reply; 42+ messages in thread
From: Paul E. McKenney @ 2010-11-09  3:36 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: eric.dumazet, xfs

On Mon, Nov 08, 2010 at 06:09:29PM -0500, Christoph Hellwig wrote:
> This patch generally looks good to me, but with so much RCU magic I'd prefer
> if Paul & Eric could look over it.

Is there a git tree, tarball, or whatever?  For example, I don't see
how this patch handles the case of an inode being freed just as an RCU
reader gains a reference to it, but then reallocated as some other inode
(so that ->ino is nonzero) before the RCU reader gets a chance to actually
look at the inode.  But such a check might well be in the code that this
patch didn't change...

							Thanx, Paul

> On Mon, Nov 08, 2010 at 07:55:10PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > With delayed logging greatly increasing the sustained parallelism of inode
> > operations, the inode cache locking is showing significant read vs write
> > contention when inode reclaim runs at the same time as lookups. There is
> > also a lot more write lock acquisitions than there are read locks (4:1 ratio)
> > so the read locking is not really buying us much in the way of parallelism.
> > 
> > To avoid the read vs write contention, change the cache to use RCU locking on
> > the read side. To avoid needing to RCU free every single inode, use the built
> > in slab RCU freeing mechanism. This requires us to be able to detect lookups of
> > freed inodes, so ensure that every freed inode has an inode number of zero and
> > the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in the cache
> > hit lookup path, but also add a check for a zero inode number as well.
> > 
> > We can then convert all the read locking lookups to use RCU read side locking
> > and hence remove all read side locking.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Alex Elder <aelder@sgi.com>
> > ---
> >  fs/xfs/linux-2.6/xfs_iops.c    |    7 +++++-
> >  fs/xfs/linux-2.6/xfs_sync.c    |   13 +++++++++--
> >  fs/xfs/quota/xfs_qm_syscalls.c |    3 ++
> >  fs/xfs/xfs_iget.c              |   44 ++++++++++++++++++++++++++++++---------
> >  fs/xfs/xfs_inode.c             |   22 ++++++++++++-------
> >  5 files changed, 67 insertions(+), 22 deletions(-)
> > 
> > diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
> > index 8b46867..909bd9c 100644
> > --- a/fs/xfs/linux-2.6/xfs_iops.c
> > +++ b/fs/xfs/linux-2.6/xfs_iops.c
> > @@ -757,6 +757,8 @@ xfs_diflags_to_iflags(
> >   * We don't use the VFS inode hash for lookups anymore, so make the inode look
> >   * hashed to the VFS by faking it. This avoids needing to touch inode hash
> >   * locks in this path, but makes the VFS believe the inode is validly hashed.
> > + * We initialise i_state and i_hash under the i_lock so that we follow the same
> > + * setup rules that the rest of the VFS follows.
> >   */
> >  void
> >  xfs_setup_inode(
> > @@ -765,10 +767,13 @@ xfs_setup_inode(
> >  	struct inode		*inode = &ip->i_vnode;
> >  
> >  	inode->i_ino = ip->i_ino;
> > +
> > +	spin_lock(&inode->i_lock);
> >  	inode->i_state = I_NEW;
> > +	hlist_nulls_add_fake(&inode->i_hash);
> > +	spin_unlock(&inode->i_lock);
> 
> This screams for another VFS helper, even if it's XFS-specific for now.
> Having to duplicate inode.c-private locking rules in XFS seems a bit
> nasty to me.
> 
> >  
> >  	inode_sb_list_add(inode);
> > -	hlist_nulls_add_fake(&inode->i_hash);
> >  
> >  	inode->i_mode	= ip->i_d.di_mode;
> >  	inode->i_nlink	= ip->i_d.di_nlink;
> > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> > index afb0d7c..9a53cc9 100644
> > --- a/fs/xfs/linux-2.6/xfs_sync.c
> > +++ b/fs/xfs/linux-2.6/xfs_sync.c
> > @@ -53,6 +53,10 @@ xfs_inode_ag_walk_grab(
> >  {
> >  	struct inode		*inode = VFS_I(ip);
> >  
> > +	/* check for stale RCU freed inode */
> > +	if (!ip->i_ino)
> > +		return ENOENT;
> 
> Assuming i_ino is never 0 is fine for XFS, unlike for the generic VFS
> code, so ACK.
> 
> >  	/* nothing to sync during shutdown */
> >  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> >  		return EFSCORRUPTED;
> > @@ -98,12 +102,12 @@ restart:
> >  		int		error = 0;
> >  		int		i;
> >  
> > -		read_lock(&pag->pag_ici_lock);
> > +		rcu_read_lock();
> >  		nr_found = radix_tree_gang_lookup(&pag->pag_ici_root,
> >  					(void **)batch, first_index,
> >  					XFS_LOOKUP_BATCH);
> >  		if (!nr_found) {
> > -			read_unlock(&pag->pag_ici_lock);
> > +			rcu_read_unlock();
> >  			break;
> >  		}
> >  
> > @@ -129,7 +133,7 @@ restart:
> >  		}
> >  
> >  		/* unlock now we've grabbed the inodes. */
> > -		read_unlock(&pag->pag_ici_lock);
> > +		rcu_read_unlock();
> >  
> >  		for (i = 0; i < nr_found; i++) {
> >  			if (!batch[i])
> > @@ -639,6 +643,9 @@ xfs_reclaim_inode_grab(
> >  	struct xfs_inode	*ip,
> >  	int			flags)
> >  {
> > +	/* check for stale RCU freed inode */
> > +	if (!ip->i_ino)
> > +		return 1;
> >  
> >  	/*
> >  	 * do some unlocked checks first to avoid unnecceary lock traffic.
> > diff --git a/fs/xfs/quota/xfs_qm_syscalls.c b/fs/xfs/quota/xfs_qm_syscalls.c
> > index bdebc18..8b207fc 100644
> > --- a/fs/xfs/quota/xfs_qm_syscalls.c
> > +++ b/fs/xfs/quota/xfs_qm_syscalls.c
> > @@ -875,6 +875,9 @@ xfs_dqrele_inode(
> >  	struct xfs_perag	*pag,
> >  	int			flags)
> >  {
> > +	if (!ip->i_ino)
> > +		return ENOENT;
> > +
> 
> Why do we need the check here again?  Having it in
> xfs_inode_ag_walk_grab should be enough.
> 
> >  	/* skip quota inodes */
> >  	if (ip == ip->i_mount->m_quotainfo->qi_uquotaip ||
> >  	    ip == ip->i_mount->m_quotainfo->qi_gquotaip) {
> > diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> > index 18991a9..edeb918 100644
> > --- a/fs/xfs/xfs_iget.c
> > +++ b/fs/xfs/xfs_iget.c
> > @@ -69,6 +69,7 @@ xfs_inode_alloc(
> >  	ASSERT(atomic_read(&ip->i_pincount) == 0);
> >  	ASSERT(!spin_is_locked(&ip->i_flags_lock));
> >  	ASSERT(completion_done(&ip->i_flush));
> > +	ASSERT(ip->i_ino == 0);
> >  
> >  	mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
> >  
> > @@ -86,9 +87,6 @@ xfs_inode_alloc(
> >  	ip->i_new_size = 0;
> >  	ip->i_dirty_releases = 0;
> >  
> > -	/* prevent anyone from using this yet */
> > -	VFS_I(ip)->i_state = I_NEW;
> > -
> >  	return ip;
> >  }
> >  
> > @@ -135,6 +133,16 @@ xfs_inode_free(
> >  	ASSERT(!spin_is_locked(&ip->i_flags_lock));
> >  	ASSERT(completion_done(&ip->i_flush));
> >  
> > +	/*
> > +	 * because we use SLAB_DESTROY_BY_RCU freeing, ensure the inode
> > +	 * always appears to be reclaimed with an invalid inode number
> > +	 * when in the free state. The ip->i_flags_lock provides the barrier
> > +	 * against lookup races.
> > +	 */
> > +	spin_lock(&ip->i_flags_lock);
> > +	ip->i_flags = XFS_IRECLAIM;
> > +	ip->i_ino = 0;
> > +	spin_unlock(&ip->i_flags_lock);
> >  	kmem_zone_free(xfs_inode_zone, ip);
> >  }
> >  
> > @@ -146,12 +154,28 @@ xfs_iget_cache_hit(
> >  	struct xfs_perag	*pag,
> >  	struct xfs_inode	*ip,
> >  	int			flags,
> > -	int			lock_flags) __releases(pag->pag_ici_lock)
> > +	int			lock_flags) __releases(RCU)
> >  {
> >  	struct inode		*inode = VFS_I(ip);
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	int			error;
> >  
> > +	/*
> > +	 * check for re-use of an inode within an RCU grace period due to the
> > +	 * radix tree nodes not being updated yet. We monitor for this by
> > +	 * setting the inode number to zero before freeing the inode structure.
> > +	 * We don't need to recheck this after taking the i_flags_lock because
> > +	 * the check against XFS_IRECLAIM will catch a freed inode.
> > +	 */
> > +	if (ip->i_ino == 0) {
> > +		trace_xfs_iget_skip(ip);
> > +		XFS_STATS_INC(xs_ig_frecycle);
> > +		rcu_read_unlock();
> > +		/* Expire the grace period so we don't trip over it again. */
> > +		synchronize_rcu();
> > +		return EAGAIN;
> > +	}
> > +
> >  	spin_lock(&ip->i_flags_lock);
> >  
> >  	/*
> > @@ -195,7 +219,7 @@ xfs_iget_cache_hit(
> >  		ip->i_flags |= XFS_IRECLAIM;
> >  
> >  		spin_unlock(&ip->i_flags_lock);
> > -		read_unlock(&pag->pag_ici_lock);
> > +		rcu_read_unlock();
> >  
> >  		error = -inode_init_always(mp->m_super, inode);
> >  		if (error) {
> > @@ -203,7 +227,7 @@ xfs_iget_cache_hit(
> >  			 * Re-initializing the inode failed, and we are in deep
> >  			 * trouble.  Try to re-add it to the reclaim list.
> >  			 */
> > -			read_lock(&pag->pag_ici_lock);
> > +			rcu_read_lock();
> >  			spin_lock(&ip->i_flags_lock);
> >  
> >  			ip->i_flags &= ~XFS_INEW;
> > @@ -231,7 +255,7 @@ xfs_iget_cache_hit(
> >  
> >  		/* We've got a live one. */
> >  		spin_unlock(&ip->i_flags_lock);
> > -		read_unlock(&pag->pag_ici_lock);
> > +		rcu_read_unlock();
> >  		trace_xfs_iget_hit(ip);
> >  	}
> >  
> > @@ -245,7 +269,7 @@ xfs_iget_cache_hit(
> >  
> >  out_error:
> >  	spin_unlock(&ip->i_flags_lock);
> > -	read_unlock(&pag->pag_ici_lock);
> > +	rcu_read_unlock();
> >  	return error;
> >  }
> >  
> > @@ -376,7 +400,7 @@ xfs_iget(
> >  
> >  again:
> >  	error = 0;
> > -	read_lock(&pag->pag_ici_lock);
> > +	rcu_read_lock();
> >  	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
> >  
> >  	if (ip) {
> > @@ -384,7 +408,7 @@ again:
> >  		if (error)
> >  			goto out_error_or_again;
> >  	} else {
> > -		read_unlock(&pag->pag_ici_lock);
> > +		rcu_read_unlock();
> >  		XFS_STATS_INC(xs_ig_missed);
> >  
> >  		error = xfs_iget_cache_miss(mp, pag, tp, ino, &ip,
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 108c7a0..25becb1 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -2000,13 +2000,14 @@ xfs_ifree_cluster(
> >  		 */
> >  		for (i = 0; i < ninodes; i++) {
> >  retry:
> > -			read_lock(&pag->pag_ici_lock);
> > +			rcu_read_lock();
> >  			ip = radix_tree_lookup(&pag->pag_ici_root,
> >  					XFS_INO_TO_AGINO(mp, (inum + i)));
> >  
> >  			/* Inode not in memory or stale, nothing to do */
> > -			if (!ip || xfs_iflags_test(ip, XFS_ISTALE)) {
> > -				read_unlock(&pag->pag_ici_lock);
> > +			if (!ip || !ip->i_ino ||
> > +			    xfs_iflags_test(ip, XFS_ISTALE)) {
> > +				rcu_read_unlock();
> >  				continue;
> >  			}
> >  
> > @@ -2019,11 +2020,11 @@ retry:
> >  			 */
> >  			if (ip != free_ip &&
> >  			    !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> > -				read_unlock(&pag->pag_ici_lock);
> > +				rcu_read_unlock();
> >  				delay(1);
> >  				goto retry;
> >  			}
> > -			read_unlock(&pag->pag_ici_lock);
> > +			rcu_read_unlock();
> >  
> >  			xfs_iflock(ip);
> >  			xfs_iflags_set(ip, XFS_ISTALE);
> > @@ -2629,7 +2630,7 @@ xfs_iflush_cluster(
> >  
> >  	mask = ~(((XFS_INODE_CLUSTER_SIZE(mp) >> mp->m_sb.sb_inodelog)) - 1);
> >  	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
> > -	read_lock(&pag->pag_ici_lock);
> > +	rcu_read_lock();
> >  	/* really need a gang lookup range call here */
> >  	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)ilist,
> >  					first_index, inodes_per_cluster);
> > @@ -2640,6 +2641,11 @@ xfs_iflush_cluster(
> >  		iq = ilist[i];
> >  		if (iq == ip)
> >  			continue;
> > +
> > +		/* check we've got a valid inode */
> > +		if (!iq->i_ino)
> > +			continue;
> > +
> >  		/* if the inode lies outside this cluster, we're done. */
> >  		if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index)
> >  			break;
> > @@ -2692,7 +2698,7 @@ xfs_iflush_cluster(
> >  	}
> >  
> >  out_free:
> > -	read_unlock(&pag->pag_ici_lock);
> > +	rcu_read_unlock();
> >  	kmem_free(ilist);
> >  out_put:
> >  	xfs_perag_put(pag);
> > @@ -2704,7 +2710,7 @@ cluster_corrupt_out:
> >  	 * Corruption detected in the clustering loop.  Invalidate the
> >  	 * inode buffer and shut down the filesystem.
> >  	 */
> > -	read_unlock(&pag->pag_ici_lock);
> > +	rcu_read_unlock();
> >  	/*
> >  	 * Clean up the buffer.  If it was B_DELWRI, just release it --
> >  	 * brelse can handle it with no problems.  If not, shut down the
> > -- 
> > 1.7.2.3
> > 
> ---end quoted text---


* Re: [PATCH 07/16] xfs: convert inode cache lookups to use RCU locking
  2010-11-09  3:36     ` Paul E. McKenney
@ 2010-11-09  5:04       ` Dave Chinner
  2010-11-10  5:12         ` Paul E. McKenney
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2010-11-09  5:04 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Christoph Hellwig, eric.dumazet, xfs

On Mon, Nov 08, 2010 at 07:36:28PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 08, 2010 at 06:09:29PM -0500, Christoph Hellwig wrote:
> > This patch generally looks good to me, but with so much RCU magic I'd prefer
> > if Paul & Eric could look over it.
> 
> Is there a git tree, tarball, or whatever? 

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git working

contains the series that this patch is in.

> For example, I don't see
> how this patch handles the case of an inode being freed just as an RCU
> reader gains a reference to it,

The XFS_IRECLAIM flag is set on inodes as they transition into the
reclaim state, long before they are freed, and it is left set once
the inode is freed. Hence lookups in xfs_iget_cache_hit() will see
it.

If the inode has been reallocated, the inode number will not yet be
set, or the inode state will have changed to XFS_INEW, both of which
xfs_iget_cache_hit() will also reject.

> but then reallocated as some other inode
> (so that ->ino is nonzero) before the RCU reader gets a chance to actually
> look at the inode.

XFS_INEW is not cleared until well after a new ->i_ino is set, so
the lookup should trip over XFS_INEW in that case. I think that
I may need to move the inode number check under the i_flags_lock
after validating the flags - more to check that we've got the
correct inode than to validate we have a freed inode.

> But such a check might well be in the code that this
> patch didn't change...

Yeah, most of the XFS code is already in a form compatible with such
RCU use because inodes have always had a quiescent "reclaimable"
state between active and reclaim (XFS_INEW -> active ->
XFS_IRECLAIMABLE -> XFS_IRECLAIM) where the inode can be reused
before being freed. The result is that lookups have always had to
handle races with inodes that have just transitioned into the
XFS_IRECLAIM state and hence cannot be immediately reused...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 07/16] xfs: convert inode cache lookups to use RCU locking
  2010-11-09  5:04       ` Dave Chinner
@ 2010-11-10  5:12         ` Paul E. McKenney
  2010-11-10  6:20           ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Paul E. McKenney @ 2010-11-10  5:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, eric.dumazet, xfs

On Tue, Nov 09, 2010 at 04:04:17PM +1100, Dave Chinner wrote:
> On Mon, Nov 08, 2010 at 07:36:28PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 08, 2010 at 06:09:29PM -0500, Christoph Hellwig wrote:
> > > This patch generally looks good to me, but with so much RCU magic I'd prefer
> > > if Paul & Eric could look over it.
> > 
> > Is there a git tree, tarball, or whatever? 
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git working

Thank you -- I have downloaded this and will look it over.

Once the C++ guys get done grilling me on memory-model issues...

> contains the series that this patch is in.
> 
> > For example, I don't see
> > how this patch handles the case of an inode being freed just as an RCU
> > reader gains a reference to it,
> 
> XFS_IRECLAIM flag is set on inodes as they transition into the
> reclaim state long before they are freed. The XFS_IRECLAIM flag is left there once
> freed. Hence lookups in xfs_iget_cache_hit() will see this.
> 
> If the inode has been reallocated, the inode number will not yet be
> set, or the inode state will have changed to XFS_INEW, both of which
> xfs_iget_cache_hit() will also reject.
> 
> > but then reallocated as some other inode
> > (so that ->ino is nonzero) before the RCU reader gets a chance to actually
> > look at the inode.
> 
> XFS_INEW is not cleared until well after a new ->i_ino is set, so
> the lookup should trip over XFS_INEW in that case. I think that
> I may need to move the inode number check under the i_flags_lock
> after validating the flags - more to check that we've got the
> correct inode than to validate we have a freed inode.

OK, this sounds promising.  Of course, the next question is "how quickly
can the inode number be available for reuse?"

> > But such a check might well be in the code that this
> > patch didn't change...
> 
> Yeah, most of the XFS code is already in a form compatible with such
> RCU use because inodes have always had a quiescent "reclaimable"
> state between active and reclaim (XFS_INEW -> active ->
> XFS_IRECLAIMABLE -> XFS_IRECLAIM) where the inode can be reused
> before being freed. The result is that lookups have always had to
> handle races with inodes that have just transitioned into the
> XFS_IRECLAIM state and hence cannot be immediately reused...

Cool!!!

							Thanx, Paul


* Re: [PATCH 07/16] xfs: convert inode cache lookups to use RCU locking
  2010-11-10  5:12         ` Paul E. McKenney
@ 2010-11-10  6:20           ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2010-11-10  6:20 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Christoph Hellwig, eric.dumazet, xfs

On Tue, Nov 09, 2010 at 09:12:42PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 09, 2010 at 04:04:17PM +1100, Dave Chinner wrote:
> > On Mon, Nov 08, 2010 at 07:36:28PM -0800, Paul E. McKenney wrote:
> > > On Mon, Nov 08, 2010 at 06:09:29PM -0500, Christoph Hellwig wrote:
> > > > This patch generally looks good to me, but with so much RCU magic I'd prefer
> > > > if Paul & Eric could look over it.
> > > 
> > > Is there a git tree, tarball, or whatever? 
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git working
> 
> Thank you -- I have downloaded this and will look it over.

fs/xfs/xfs_iget.c is the place to start - that's where the inode
cache lookups occur...

> Once the C++ guys get done grilling me on memory-model issues...
> 
> > contains the series that this patch is in.
> > 
> > > For example, I don't see
> > > how this patch handles the case of an inode being freed just as an RCU
> > > reader gains a reference to it,
> > 
> > XFS_IRECLAIM flag is set on inodes as they transition into the
> > reclaim state long before they are freed. The XFS_IRECLAIM flag is left there once
> > freed. Hence lookups in xfs_iget_cache_hit() will see this.
> > 
> > If the inode has been reallocated, the inode number will not yet be
> > set, or the inode state will have changed to XFS_INEW, both of which
> > xfs_iget_cache_hit() will also reject.
> > 
> > > but then reallocated as some other inode
> > > (so that ->ino is nonzero) before the RCU reader gets a chance to actually
> > > look at the inode.
> > 
> > XFS_INEW is not cleared until well after a new ->i_ino is set, so
> > the lookup should trip over XFS_INEW in that case. I think that
> > I may need to move the inode number check under the i_flags_lock
> > after validating the flags - more to check that we've got the
> > correct inode than to validate we have a freed inode.
> 
> OK, this sounds promising.  Of course, the next question is "how quickly
> can the inode number be available for reuse?"

Immediately. Indeed, an inode number can be reused even before the
inode is reclaimed.  However, looking at the case of having already
freed the inode when the new lookup comes in, I think checking
everything under the i_flags_lock is safe.

That is, if we've freed inode #X (@ &A) and find &A during the RCU
protected lookup for inode #X, the only way the inode number in the
structure at &A would match #X is if the new #X was reallocated
at &A again.  In that case, if the inode wasn't fully set up, we'd
find either XFS_INEW|XFS_IRECLAIM still set on it and we'd back off
and try the lookup again. However, if inode #X was reallocated at
address &B then the inode at &A would not match #X regardless of
whether &A had been reallocated or not.

Hence I think checking the inode number under the i_flags_lock after
checking XFS_INEW|XFS_IRECLAIM are not set is sufficient to validate
we have both an active inode and the correct inode.
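
To make that ordering concrete, here's a userspace sketch of the
check (the flag values and the lookup_valid() name are invented for
illustration - the real logic lives in xfs_iget_cache_hit(), run
under ip->i_flags_lock):

```c
#include <assert.h>
#include <stdint.h>

#define XFS_INEW	0x1
#define XFS_IRECLAIM	0x2

struct fake_inode {
	uint64_t	i_ino;
	unsigned int	i_flags;
	/* ip->i_flags_lock would be held around the checks below */
};

/*
 * Validate a candidate found by RCU lookup. With SLAB_DESTROY_BY_RCU
 * the memory may have been freed and reused within a grace period, so
 * under i_flags_lock we must check:
 *   1. the inode is not in a transient state (XFS_INEW | XFS_IRECLAIM)
 *   2. the inode number still matches what we looked up
 */
static int lookup_valid(struct fake_inode *ip, uint64_t ino)
{
	if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM))
		return 0;	/* back off and retry the lookup */
	if (ip->i_ino != ino)
		return 0;	/* memory reused for a different inode */
	return 1;
}
```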

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2010-11-10  6:19 UTC | newest]

Thread overview: 42+ messages
2010-11-08  8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
2010-11-08  8:55 ` [PATCH 01/16] xfs: fix per-ag reference counting in inode reclaim tree walking Dave Chinner
2010-11-08  9:23   ` Christoph Hellwig
2010-11-08  8:55 ` [PATCH 02/16] xfs: move delayed write buffer trace Dave Chinner
2010-11-08  9:24   ` Christoph Hellwig
2010-11-08  8:55 ` [PATCH 03/16] [RFC] xfs: use generic per-cpu counter infrastructure Dave Chinner
2010-11-08 12:13   ` Christoph Hellwig
2010-11-09  0:20     ` Dave Chinner
2010-11-08  8:55 ` [PATCH 04/16] xfs: dynamic speculative EOF preallocation Dave Chinner
2010-11-08 11:43   ` Christoph Hellwig
2010-11-09  0:08     ` Dave Chinner
2010-11-08  8:55 ` [PATCH 05/16] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
2010-11-08 11:36   ` Christoph Hellwig
2010-11-08 23:56     ` Dave Chinner
2010-11-08  8:55 ` [PATCH 06/16] patch xfs-inode-hash-fake Dave Chinner
2010-11-08  9:19   ` Christoph Hellwig
2010-11-08  8:55 ` [PATCH 07/16] xfs: convert inode cache lookups to use RCU locking Dave Chinner
2010-11-08 23:09   ` Christoph Hellwig
2010-11-09  0:24     ` Dave Chinner
2010-11-09  3:36     ` Paul E. McKenney
2010-11-09  5:04       ` Dave Chinner
2010-11-10  5:12         ` Paul E. McKenney
2010-11-10  6:20           ` Dave Chinner
2010-11-08  8:55 ` [PATCH 08/16] xfs: convert pag_ici_lock to a spin lock Dave Chinner
2010-11-08 23:10   ` Christoph Hellwig
2010-11-08  8:55 ` [PATCH 09/16] xfs: convert xfsbud shrinker to a per-buftarg shrinker Dave Chinner
2010-11-08  8:55 ` [PATCH 10/16] xfs: add a lru to the XFS buffer cache Dave Chinner
2010-11-08 23:19   ` Christoph Hellwig
2010-11-08 23:45     ` Dave Chinner
2010-11-08  8:55 ` [PATCH 11/16] xfs: connect up buffer reclaim priority hooks Dave Chinner
2010-11-08 11:25   ` Christoph Hellwig
2010-11-08 23:50     ` Dave Chinner
2010-11-08  8:55 ` [PATCH 12/16] xfs: bulk AIL insertion during transaction commit Dave Chinner
2010-11-08  8:55 ` [PATCH 13/16] xfs: reduce the number of AIL push wakeups Dave Chinner
2010-11-08 11:32   ` Christoph Hellwig
2010-11-08 23:51     ` Dave Chinner
2010-11-08  8:55 ` [PATCH 14/16] xfs: remove all the inodes on a buffer from the AIL in bulk Dave Chinner
2010-11-08  8:55 ` [PATCH 15/16] xfs: only run xfs_error_test if error injection is active Dave Chinner
2010-11-08 11:33   ` Christoph Hellwig
2010-11-08  8:55 ` [PATCH 16/16] xfs: make xlog_space_left() independent of the grant lock Dave Chinner
2010-11-08 14:17 ` [PATCH 00/16] xfs: current patch stack for 2.6.38 window Christoph Hellwig
2010-11-09  0:21   ` Dave Chinner