* [PATCH 01/18] kernel: add bl_list
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 8:18 ` Andi Kleen
2010-10-08 5:21 ` [PATCH 02/18] fs: Convert nr_inodes and nr_unused to per-cpu counters Dave Chinner
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Introduce a type of hlist that can support the use of the lowest bit
in the hlist_head pointer. This will subsequently be used to implement
a per-bucket bit spinlock for the inode hashes.
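As a rough sketch of the intended usage (the lock/unlock wrappers below
are illustrative only and not part of this patch), the per-bucket lock is
simply bit 0 of the head pointer, taken with the existing bit_spinlock
primitives before any list modification:

	/* illustrative wrappers, not added by this patch */
	static inline void hlist_bl_lock(struct hlist_bl_head *b)
	{
		bit_spin_lock(0, (unsigned long *)&b->first);
	}

	static inline void hlist_bl_unlock(struct hlist_bl_head *b)
	{
		bit_spin_unlock(0, (unsigned long *)&b->first);
	}

	hlist_bl_lock(bucket);
	hlist_bl_add_head(node, bucket);	/* lock bit must be set here */
	hlist_bl_unlock(bucket);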
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
include/linux/list_bl.h | 127 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/poison.h | 2 +
2 files changed, 129 insertions(+), 0 deletions(-)
create mode 100644 include/linux/list_bl.h
diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
new file mode 100644
index 0000000..961bc89
--- /dev/null
+++ b/include/linux/list_bl.h
@@ -0,0 +1,127 @@
+#ifndef _LINUX_LIST_BL_H
+#define _LINUX_LIST_BL_H
+
+#include <linux/list.h>
+#include <linux/bit_spinlock.h>
+
+/*
+ * Special version of lists, where head of the list has a bit spinlock
+ * in the lowest bit. This is useful for scalable hash tables without
+ * increasing memory footprint overhead.
+ *
+ * For modification operations, the 0 bit of hlist_bl_head->first
+ * pointer must be set.
+ */
+
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define LIST_BL_LOCKMASK 1UL
+#else
+#define LIST_BL_LOCKMASK 0UL
+#endif
+
+#ifdef CONFIG_DEBUG_LIST
+#define LIST_BL_BUG_ON(x) BUG_ON(x)
+#else
+#define LIST_BL_BUG_ON(x)
+#endif
+
+
+struct hlist_bl_head {
+ struct hlist_bl_node *first;
+};
+
+struct hlist_bl_node {
+ struct hlist_bl_node *next, **pprev;
+};
+#define INIT_HLIST_BL_HEAD(ptr) \
+ ((ptr)->first = NULL)
+
+static inline void init_hlist_bl_node(struct hlist_bl_node *h)
+{
+ h->next = NULL;
+ h->pprev = NULL;
+}
+
+#define hlist_bl_entry(ptr, type, member) container_of(ptr, type, member)
+
+static inline int hlist_bl_unhashed(const struct hlist_bl_node *h)
+{
+ return !h->pprev;
+}
+
+static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)
+ ((unsigned long)h->first & ~LIST_BL_LOCKMASK);
+}
+
+static inline void hlist_bl_set_first(struct hlist_bl_head *h,
+ struct hlist_bl_node *n)
+{
+ LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+ LIST_BL_BUG_ON(!bit_spin_is_locked(0, (unsigned long *)&h->first));
+ h->first = (struct hlist_bl_node *)((unsigned long)n | LIST_BL_LOCKMASK);
+}
+
+static inline int hlist_bl_empty(const struct hlist_bl_head *h)
+{
+ return !((unsigned long)h->first & ~LIST_BL_LOCKMASK);
+}
+
+static inline void hlist_bl_add_head(struct hlist_bl_node *n,
+ struct hlist_bl_head *h)
+{
+ struct hlist_bl_node *first = hlist_bl_first(h);
+
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ n->pprev = &h->first;
+ hlist_bl_set_first(h, n);
+}
+
+static inline void __hlist_bl_del(struct hlist_bl_node *n)
+{
+ struct hlist_bl_node *next = n->next;
+ struct hlist_bl_node **pprev = n->pprev;
+
+ LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+
+ /* pprev may be `first`, so be careful not to lose the lock bit */
+ *pprev = (struct hlist_bl_node *)
+ ((unsigned long)next |
+ ((unsigned long)*pprev & LIST_BL_LOCKMASK));
+ if (next)
+ next->pprev = pprev;
+}
+
+static inline void hlist_bl_del(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->next = BL_LIST_POISON1;
+ n->pprev = BL_LIST_POISON2;
+}
+
+static inline void hlist_bl_del_init(struct hlist_bl_node *n)
+{
+ if (!hlist_bl_unhashed(n)) {
+ __hlist_bl_del(n);
+ init_hlist_bl_node(n);
+ }
+}
+
+/**
+ * hlist_bl_for_each_entry - iterate over list of given type
+ * @tpos: the type * to use as a loop cursor.
+ * @pos: the &struct hlist_bl_node to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the hlist_bl_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry(tpos, pos, head, member) \
+ for (pos = hlist_bl_first(head); \
+ pos && \
+ ({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
+ pos = pos->next)
+
+#endif
diff --git a/include/linux/poison.h b/include/linux/poison.h
index 2110a81..d367d39 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -22,6 +22,8 @@
#define LIST_POISON1 ((void *) 0x00100100 + POISON_POINTER_DELTA)
#define LIST_POISON2 ((void *) 0x00200200 + POISON_POINTER_DELTA)
+#define BL_LIST_POISON1 ((void *) 0x00300300 + POISON_POINTER_DELTA)
+#define BL_LIST_POISON2 ((void *) 0x00400400 + POISON_POINTER_DELTA)
/********** include/linux/timer.h **********/
/*
* Magic number "tsta" to indicate a static timer initializer
--
1.7.1
* Re: [PATCH 01/18] kernel: add bl_list
2010-10-08 5:21 ` [PATCH 01/18] kernel: add bl_list Dave Chinner
@ 2010-10-08 8:18 ` Andi Kleen
2010-10-08 10:33 ` Dave Chinner
From: Andi Kleen @ 2010-10-08 8:18 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
Dave Chinner <david@fromorbit.com> writes:
> +static inline void __hlist_bl_del(struct hlist_bl_node *n)
> +{
> + struct hlist_bl_node *next = n->next;
> + struct hlist_bl_node **pprev = n->pprev;
> +
> + LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
> +
> + /* pprev may be `first`, so be careful not to lose the lock bit */
> + *pprev = (struct hlist_bl_node *)
> + ((unsigned long)next |
> + ((unsigned long)*pprev & LIST_BL_LOCKMASK));
> + if (next)
> + next->pprev = pprev;
> +}
Should this set n->pprev to NULL so that unhashed returns true
afterwards?
> +
> +static inline void hlist_bl_del(struct hlist_bl_node *n)
> +{
> + __hlist_bl_del(n);
> + n->next = BL_LIST_POISON1;
> + n->pprev = BL_LIST_POISON2;
> +}
Ok so unhashed only works once. Seems unsymmetric.
Other than that looks good to me.
Reviewed-by: Andi Kleen <ak@linux.intel.com>
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH 01/18] kernel: add bl_list
2010-10-08 8:18 ` Andi Kleen
@ 2010-10-08 10:33 ` Dave Chinner
From: Dave Chinner @ 2010-10-08 10:33 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 10:18:40AM +0200, Andi Kleen wrote:
> Dave Chinner <david@fromorbit.com> writes:
>
> > +static inline void __hlist_bl_del(struct hlist_bl_node *n)
> > +{
> > + struct hlist_bl_node *next = n->next;
> > + struct hlist_bl_node **pprev = n->pprev;
> > +
> > + LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
> > +
> > + /* pprev may be `first`, so be careful not to lose the lock bit */
> > + *pprev = (struct hlist_bl_node *)
> > + ((unsigned long)next |
> > + ((unsigned long)*pprev & LIST_BL_LOCKMASK));
> > + if (next)
> > + next->pprev = pprev;
> > +}
>
> Should this set n->pprev to NULL so that unhashed returns true
> afterwards?
No, I think the callers set that appropriately.
> > +
> > +static inline void hlist_bl_del(struct hlist_bl_node *n)
> > +{
> > + __hlist_bl_del(n);
> > + n->next = BL_LIST_POISON1;
> > + n->pprev = BL_LIST_POISON2;
> > +}
>
> Ok so unhashed only works once. Seems unsymmetric.
Exactly the same behaviour as hlist_del(). If you want
hlist_bl_unhashed() to work, you need to call hlist_bl_del_init().
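i.e., as a sketch of the difference (not quoting the patch):

	hlist_bl_del_init(n);
	/* n->pprev is NULL again, hlist_bl_unhashed(n) returns true,
	 * and the node may be re-added */

	hlist_bl_del(n);
	/* n->next/n->pprev now hold BL_LIST_POISON1/2, so
	 * hlist_bl_unhashed(n) stays false and reuse will oops */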
/me makes a note to check all the inode hash code uses hlist_bl_del_init()
as there are unhashed checks in many places.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* [PATCH 02/18] fs: Convert nr_inodes and nr_unused to per-cpu counters
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
2010-10-08 5:21 ` [PATCH 01/18] kernel: add bl_list Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:01 ` Christoph Hellwig
2010-10-08 5:21 ` [PATCH 03/18] fs: keep inode with backing-dev Dave Chinner
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Dave Chinner <dchinner@redhat.com>
The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.
Based on a patch originally from Eric Dumazet.
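In outline the counters follow the stock percpu_counter pattern; a sketch
of the calls this patch adds (the real hunks are below):

	static struct percpu_counter nr_inodes;

	percpu_counter_init(&nr_inodes, 0);		/* inode_init() */
	percpu_counter_inc(&nr_inodes);			/* inode_init_always(): per-cpu, no global lock */
	percpu_counter_dec(&nr_inodes);			/* __destroy_inode() */
	nr = percpu_counter_sum_positive(&nr_inodes);	/* readers pay the summing cost */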
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/fs-writeback.c | 5 +--
fs/inode.c | 65 ++++++++++++++++++++++++++++++++++++---------------
include/linux/fs.h | 4 ++-
kernel/sysctl.c | 4 +-
4 files changed, 53 insertions(+), 25 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ab38fef..58a95b7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -723,7 +723,7 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
wb->last_old_flush = jiffies;
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused);
+ get_nr_dirty_inodes();
if (nr_pages) {
struct wb_writeback_work work = {
@@ -1090,8 +1090,7 @@ void writeback_inodes_sb(struct super_block *sb)
WARN_ON(!rwsem_is_locked(&sb->s_umount));
- work.nr_pages = nr_dirty + nr_unstable +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused);
+ work.nr_pages = nr_dirty + nr_unstable + get_nr_dirty_inodes();
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);
diff --git a/fs/inode.c b/fs/inode.c
index 8646433..f04d501 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -103,8 +103,41 @@ static DECLARE_RWSEM(iprune_sem);
*/
struct inodes_stat_t inodes_stat;
+static struct percpu_counter nr_inodes __cacheline_aligned_in_smp;
+static struct percpu_counter nr_inodes_unused __cacheline_aligned_in_smp;
+
static struct kmem_cache *inode_cachep __read_mostly;
+static inline int get_nr_inodes(void)
+{
+ return percpu_counter_sum_positive(&nr_inodes);
+}
+
+static inline int get_nr_inodes_unused(void)
+{
+ return percpu_counter_sum_positive(&nr_inodes_unused);
+}
+
+int get_nr_dirty_inodes(void)
+{
+ int nr_dirty = get_nr_inodes() - get_nr_inodes_unused();
+ return nr_dirty > 0 ? nr_dirty : 0;
+
+}
+
+/*
+ * Handle nr_inode sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_inodes(ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ inodes_stat.nr_inodes = get_nr_inodes();
+ inodes_stat.nr_unused = get_nr_inodes_unused();
+ return proc_dointvec(table, write, buffer, lenp, ppos);
+}
+#endif
+
static void wake_up_inode(struct inode *inode)
{
/*
@@ -192,6 +225,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
inode->i_fsnotify_mask = 0;
#endif
+ percpu_counter_inc(&nr_inodes);
+
return 0;
out:
return -ENOMEM;
@@ -232,6 +267,7 @@ void __destroy_inode(struct inode *inode)
if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
posix_acl_release(inode->i_default_acl);
#endif
+ percpu_counter_dec(&nr_inodes);
}
EXPORT_SYMBOL(__destroy_inode);
@@ -286,7 +322,7 @@ void __iget(struct inode *inode)
if (!(inode->i_state & (I_DIRTY|I_SYNC)))
list_move(&inode->i_list, &inode_in_use);
- inodes_stat.nr_unused--;
+ percpu_counter_dec(&nr_inodes_unused);
}
void end_writeback(struct inode *inode)
@@ -327,8 +363,6 @@ static void evict(struct inode *inode)
*/
static void dispose_list(struct list_head *head)
{
- int nr_disposed = 0;
-
while (!list_empty(head)) {
struct inode *inode;
@@ -344,11 +378,7 @@ static void dispose_list(struct list_head *head)
wake_up_inode(inode);
destroy_inode(inode);
- nr_disposed++;
}
- spin_lock(&inode_lock);
- inodes_stat.nr_inodes -= nr_disposed;
- spin_unlock(&inode_lock);
}
/*
@@ -357,7 +387,7 @@ static void dispose_list(struct list_head *head)
static int invalidate_list(struct list_head *head, struct list_head *dispose)
{
struct list_head *next;
- int busy = 0, count = 0;
+ int busy = 0;
next = head->next;
for (;;) {
@@ -383,13 +413,11 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
list_move(&inode->i_list, dispose);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
- count++;
+ percpu_counter_dec(&nr_inodes_unused);
continue;
}
busy = 1;
}
- /* only unused inodes may be cached with i_count zero */
- inodes_stat.nr_unused -= count;
return busy;
}
@@ -448,7 +476,6 @@ static int can_unuse(struct inode *inode)
static void prune_icache(int nr_to_scan)
{
LIST_HEAD(freeable);
- int nr_pruned = 0;
int nr_scanned;
unsigned long reap = 0;
@@ -484,9 +511,8 @@ static void prune_icache(int nr_to_scan)
list_move(&inode->i_list, &freeable);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
- nr_pruned++;
+ percpu_counter_dec(&nr_inodes_unused);
}
- inodes_stat.nr_unused -= nr_pruned;
if (current_is_kswapd())
__count_vm_events(KSWAPD_INODESTEAL, reap);
else
@@ -518,7 +544,7 @@ static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
return -1;
prune_icache(nr);
}
- return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
+ return (get_nr_inodes_unused() / 100) * sysctl_vfs_cache_pressure;
}
static struct shrinker icache_shrinker = {
@@ -595,7 +621,6 @@ static inline void
__inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
struct inode *inode)
{
- inodes_stat.nr_inodes++;
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
if (head)
@@ -1215,7 +1240,7 @@ static void iput_final(struct inode *inode)
if (!drop) {
if (!(inode->i_state & (I_DIRTY|I_SYNC)))
list_move(&inode->i_list, &inode_unused);
- inodes_stat.nr_unused++;
+ percpu_counter_inc(&nr_inodes_unused);
if (sb->s_flags & MS_ACTIVE) {
spin_unlock(&inode_lock);
return;
@@ -1227,14 +1252,13 @@ static void iput_final(struct inode *inode)
spin_lock(&inode_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
- inodes_stat.nr_unused--;
+ percpu_counter_dec(&nr_inodes_unused);
hlist_del_init(&inode->i_hash);
}
list_del_init(&inode->i_list);
list_del_init(&inode->i_sb_list);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
- inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
evict(inode);
spin_lock(&inode_lock);
@@ -1489,6 +1513,7 @@ void __init inode_init_early(void)
for (loop = 0; loop < (1 << i_hash_shift); loop++)
INIT_HLIST_HEAD(&inode_hashtable[loop]);
+
}
void __init inode_init(void)
@@ -1503,6 +1528,8 @@ void __init inode_init(void)
SLAB_MEM_SPREAD),
init_once);
register_shrinker(&icache_shrinker);
+ percpu_counter_init(&nr_inodes, 0);
+ percpu_counter_init(&nr_inodes_unused, 0);
/* Hash may have been set up in inode_init_early */
if (!hashdist)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63d069b..1fb92f9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -407,6 +407,7 @@ extern struct files_stat_struct files_stat;
extern int get_max_files(void);
extern int sysctl_nr_open;
extern struct inodes_stat_t inodes_stat;
+extern int get_nr_dirty_inodes(void);
extern int leases_enable, lease_break_time;
struct buffer_head;
@@ -2474,7 +2475,8 @@ ssize_t simple_attr_write(struct file *file, const char __user *buf,
struct ctl_table;
int proc_nr_files(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);
-
+int proc_nr_inodes(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
int __init get_filesystem_list(char *buf);
#define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f88552c..33d1733 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1340,14 +1340,14 @@ static struct ctl_table fs_table[] = {
.data = &inodes_stat,
.maxlen = 2*sizeof(int),
.mode = 0444,
- .proc_handler = proc_dointvec,
+ .proc_handler = proc_nr_inodes,
},
{
.procname = "inode-state",
.data = &inodes_stat,
.maxlen = 7*sizeof(int),
.mode = 0444,
- .proc_handler = proc_dointvec,
+ .proc_handler = proc_nr_inodes,
},
{
.procname = "file-nr",
--
1.7.1
* Re: [PATCH 02/18] fs: Convert nr_inodes and nr_unused to per-cpu counters
2010-10-08 5:21 ` [PATCH 02/18] fs: Convert nr_inodes and nr_unused to per-cpu counters Dave Chinner
@ 2010-10-08 7:01 ` Christoph Hellwig
From: Christoph Hellwig @ 2010-10-08 7:01 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:16PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The number of inodes allocated does not need to be tied to the
> addition or removal of an inode to/from a list. If we are not tied
> to a list lock, we could update the counters when inodes are
> initialised or destroyed, but to do that we need to convert the
> counters to be per-cpu (i.e. independent of a lock). This means that
> we have the freedom to change the list/locking implementation
> without needing to care about the counters.
>
> Based on a patch originally from Eric Dumazet.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Looks good except for a spurious whitespace change in inode_init_early.
Reviewed-by: Christoph Hellwig <hch@lst.de>
* [PATCH 03/18] fs: keep inode with backing-dev
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
2010-10-08 5:21 ` [PATCH 01/18] kernel: add bl_list Dave Chinner
2010-10-08 5:21 ` [PATCH 02/18] fs: Convert nr_inodes and nr_unused to per-cpu counters Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:01 ` Christoph Hellwig
2010-10-08 5:21 ` [PATCH 04/18] fs: Implement lazy LRU updates for inodes Dave Chinner
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Having inode on writeback lists of a different bdi than
inode->i_mapping->backing_dev_info makes it very difficult to do
per-bdi locking of the writeback lists. Add functions to move these
inodes over when the mapping backing dev is changed.
Also, rename i_mapping.backing_dev_info to i_mapping.a_bdi while we're
here. Succinct is nice, and it catches conversion errors.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
drivers/char/mem.c | 2 +-
drivers/char/raw.c | 2 +-
drivers/mtd/mtdchar.c | 2 +-
fs/afs/write.c | 6 ++--
fs/block_dev.c | 13 +++++----
fs/btrfs/disk-io.c | 2 +-
fs/btrfs/file.c | 2 +-
fs/btrfs/inode.c | 10 +++---
fs/buffer.c | 2 +-
fs/ceph/addr.c | 2 +-
fs/ceph/inode.c | 4 +-
fs/cifs/file.c | 2 +-
fs/cifs/inode.c | 2 +-
fs/configfs/inode.c | 3 +-
fs/ext2/ialloc.c | 2 +-
fs/fs-writeback.c | 2 +-
fs/fuse/file.c | 6 ++--
fs/fuse/inode.c | 2 +-
fs/gfs2/glock.c | 3 +-
fs/hugetlbfs/inode.c | 3 +-
fs/inode.c | 6 ++--
fs/nfs/inode.c | 3 +-
fs/nfs/write.c | 7 ++---
fs/nilfs2/btnode.c | 2 +-
fs/nilfs2/mdt.c | 2 +-
fs/nilfs2/the_nilfs.c | 2 +-
fs/ntfs/file.c | 2 +-
fs/ocfs2/dlmfs/dlmfs.c | 4 +-
fs/ocfs2/file.c | 2 +-
fs/ramfs/inode.c | 2 +-
fs/romfs/super.c | 4 +-
fs/sysfs/inode.c | 2 +-
fs/ubifs/dir.c | 2 +-
fs/ubifs/super.c | 2 +-
fs/xfs/linux-2.6/xfs_buf.c | 4 +-
fs/xfs/linux-2.6/xfs_file.c | 2 +-
include/linux/backing-dev.h | 16 ++++++++---
include/linux/fs.h | 2 +-
kernel/cgroup.c | 2 +-
mm/backing-dev.c | 61 ++++++++++++++++++++++++++++++++++++++++--
mm/fadvise.c | 4 +-
mm/filemap.c | 4 +-
mm/filemap_xip.c | 2 +-
mm/page-writeback.c | 15 +++++-----
mm/readahead.c | 6 ++--
mm/shmem.c | 2 +-
mm/swap.c | 2 +-
mm/swap_state.c | 2 +-
mm/swapfile.c | 2 +-
mm/truncate.c | 3 +-
mm/vmscan.c | 2 +-
51 files changed, 155 insertions(+), 90 deletions(-)
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 1f528fa..2285c1e 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -872,7 +872,7 @@ static int memory_open(struct inode *inode, struct file *filp)
filp->f_op = dev->fops;
if (dev->dev_info)
- filp->f_mapping->backing_dev_info = dev->dev_info;
+ mapping_set_bdi(filp->f_mapping, dev->dev_info);
if (dev->fops->open)
return dev->fops->open(inode, filp);
diff --git a/drivers/char/raw.c b/drivers/char/raw.c
index b38942f..5baa83f 100644
--- a/drivers/char/raw.c
+++ b/drivers/char/raw.c
@@ -109,7 +109,7 @@ static int raw_release(struct inode *inode, struct file *filp)
if (--raw_devices[minor].inuse == 0) {
/* Here inode->i_mapping == bdev->bd_inode->i_mapping */
inode->i_mapping = &inode->i_data;
- inode->i_mapping->backing_dev_info = &default_backing_dev_info;
+ mapping_set_bdi(inode->i_mapping, &default_backing_dev_info);
}
mutex_unlock(&raw_mutex);
diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index a825002..26af8b1 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -113,7 +113,7 @@ static int mtd_open(struct inode *inode, struct file *file)
if (mtd_ino->i_state & I_NEW) {
mtd_ino->i_private = mtd;
mtd_ino->i_mode = S_IFCHR;
- mtd_ino->i_data.backing_dev_info = mtd->backing_dev_info;
+ mapping_new_set_bdi(&mtd_ino->i_data, mtd->backing_dev_info);
unlock_new_inode(mtd_ino);
}
file->f_mapping = mtd_ino->i_mapping;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 722743b..b321bfc 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -438,7 +438,7 @@ no_more:
*/
int afs_writepage(struct page *page, struct writeback_control *wbc)
{
- struct backing_dev_info *bdi = page->mapping->backing_dev_info;
+ struct backing_dev_info *bdi = page->mapping->a_bdi;
struct afs_writeback *wb;
int ret;
@@ -469,7 +469,7 @@ static int afs_writepages_region(struct address_space *mapping,
struct writeback_control *wbc,
pgoff_t index, pgoff_t end, pgoff_t *_next)
{
- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;
struct afs_writeback *wb;
struct page *page;
int ret, n;
@@ -548,7 +548,7 @@ static int afs_writepages_region(struct address_space *mapping,
int afs_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;
pgoff_t start, end, next;
int ret;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 50e8c85..ac070d7 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -533,7 +533,7 @@ struct block_device *bdget(dev_t dev)
inode->i_bdev = bdev;
inode->i_data.a_ops = &def_blk_aops;
mapping_set_gfp_mask(&inode->i_data, GFP_USER);
- inode->i_data.backing_dev_info = &default_backing_dev_info;
+ mapping_new_set_bdi(&inode->i_data, &default_backing_dev_info);
spin_lock(&bdev_lock);
list_add(&bdev->bd_list, &all_bdevs);
spin_unlock(&bdev_lock);
@@ -1390,7 +1390,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
bdi = blk_get_backing_dev_info(bdev);
if (bdi == NULL)
bdi = &default_backing_dev_info;
- bdev->bd_inode->i_data.backing_dev_info = bdi;
+ mapping_set_bdi(&bdev->bd_inode->i_data, bdi);
}
if (bdev->bd_invalidated)
rescan_partitions(disk, bdev);
@@ -1405,8 +1405,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
if (ret)
goto out_clear;
bdev->bd_contains = whole;
- bdev->bd_inode->i_data.backing_dev_info =
- whole->bd_inode->i_data.backing_dev_info;
+ mapping_set_bdi(&bdev->bd_inode->i_data,
+ whole->bd_inode->i_data.a_bdi);
bdev->bd_part = disk_get_part(disk, partno);
if (!(disk->flags & GENHD_FL_UP) ||
!bdev->bd_part || !bdev->bd_part->nr_sects) {
@@ -1439,7 +1439,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
disk_put_part(bdev->bd_part);
bdev->bd_disk = NULL;
bdev->bd_part = NULL;
- bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
+ mapping_set_bdi(&bdev->bd_inode->i_data, &default_backing_dev_info);
if (bdev != bdev->bd_contains)
__blkdev_put(bdev->bd_contains, mode, 1);
bdev->bd_contains = NULL;
@@ -1533,7 +1533,8 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
disk_put_part(bdev->bd_part);
bdev->bd_part = NULL;
bdev->bd_disk = NULL;
- bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
+ mapping_set_bdi(&bdev->bd_inode->i_data,
+ &default_backing_dev_info);
if (bdev != bdev->bd_contains)
victim = bdev->bd_contains;
bdev->bd_contains = NULL;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 64f1008..05c3fc7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1636,7 +1636,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
*/
fs_info->btree_inode->i_size = OFFSET_MAX;
fs_info->btree_inode->i_mapping->a_ops = &btree_aops;
- fs_info->btree_inode->i_mapping->backing_dev_info = &fs_info->bdi;
+ mapping_new_set_bdi(fs_info->btree_inode->i_mapping, &fs_info->bdi);
RB_CLEAR_NODE(&BTRFS_I(fs_info->btree_inode)->rb_node);
extent_io_tree_init(&BTRFS_I(fs_info->btree_inode)->io_tree,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e354c33..96e3883 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -872,7 +872,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
goto out;
count = ocount;
- current->backing_dev_info = inode->i_mapping->backing_dev_info;
+ current->backing_dev_info = inode->i_mapping->a_bdi;
err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
if (err)
goto out;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c038644..c646c0c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2475,7 +2475,7 @@ static void btrfs_read_locked_inode(struct inode *inode)
switch (inode->i_mode & S_IFMT) {
case S_IFREG:
inode->i_mapping->a_ops = &btrfs_aops;
- inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
inode->i_fop = &btrfs_file_operations;
inode->i_op = &btrfs_file_inode_operations;
@@ -2490,7 +2490,7 @@ static void btrfs_read_locked_inode(struct inode *inode)
case S_IFLNK:
inode->i_op = &btrfs_symlink_inode_operations;
inode->i_mapping->a_ops = &btrfs_symlink_aops;
- inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
break;
default:
inode->i_op = &btrfs_special_inode_operations;
@@ -4705,7 +4705,7 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry,
drop_inode = 1;
else {
inode->i_mapping->a_ops = &btrfs_aops;
- inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
inode->i_fop = &btrfs_file_operations;
inode->i_op = &btrfs_file_inode_operations;
BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
@@ -6699,7 +6699,7 @@ static int btrfs_symlink(struct inode *dir, struct dentry *dentry,
drop_inode = 1;
else {
inode->i_mapping->a_ops = &btrfs_aops;
- inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
inode->i_fop = &btrfs_file_operations;
inode->i_op = &btrfs_file_inode_operations;
BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
@@ -6739,7 +6739,7 @@ static int btrfs_symlink(struct inode *dir, struct dentry *dentry,
inode->i_op = &btrfs_symlink_inode_operations;
inode->i_mapping->a_ops = &btrfs_symlink_aops;
- inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
inode_set_bytes(inode, name_len);
btrfs_i_size_write(inode, name_len - 1);
err = btrfs_update_inode(trans, root, inode);
diff --git a/fs/buffer.c b/fs/buffer.c
index 3e7dca2..b5c4153 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3161,7 +3161,7 @@ void block_sync_page(struct page *page)
smp_mb();
mapping = page_mapping(page);
if (mapping)
- blk_run_backing_dev(mapping->backing_dev_info, page);
+ blk_run_backing_dev(mapping->a_bdi, page);
}
EXPORT_SYMBOL(block_sync_page);
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index efbc604..448400a 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -588,7 +588,7 @@ static int ceph_writepages_start(struct address_space *mapping,
struct writeback_control *wbc)
{
struct inode *inode = mapping->host;
- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;
struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_client *client;
pgoff_t index, start, end;
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 62377ec..e427082 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -624,8 +624,8 @@ static int fill_inode(struct inode *inode,
}
inode->i_mapping->a_ops = &ceph_aops;
- inode->i_mapping->backing_dev_info =
- &ceph_sb_to_client(inode->i_sb)->backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping,
+ &ceph_sb_to_client(inode->i_sb)->backing_dev_info);
switch (inode->i_mode & S_IFMT) {
case S_IFIFO:
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index de748c6..3673e66 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1337,7 +1337,7 @@ static int cifs_partialpagewrite(struct page *page, unsigned from, unsigned to)
static int cifs_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;
unsigned int bytes_to_write;
unsigned int bytes_written;
struct cifs_sb_info *cifs_sb;
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 53cce8c..63a0bdb 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -802,7 +802,7 @@ retry_iget5_locked:
if (inode->i_state & I_NEW) {
inode->i_ino = hash;
if (S_ISREG(inode->i_mode))
- inode->i_data.backing_dev_info = sb->s_bdi;
+ inode->i_data.a_bdi = sb->s_bdi;
#ifdef CONFIG_CIFS_FSCACHE
/* initialize per-inode cache cookie pointer */
CIFS_I(inode)->fscache = NULL;
diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
index cf78d44..40b2bec 100644
--- a/fs/configfs/inode.c
+++ b/fs/configfs/inode.c
@@ -136,7 +136,8 @@ struct inode * configfs_new_inode(mode_t mode, struct configfs_dirent * sd)
struct inode * inode = new_inode(configfs_sb);
if (inode) {
inode->i_mapping->a_ops = &configfs_aops;
- inode->i_mapping->backing_dev_info = &configfs_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping,
+ &configfs_backing_dev_info);
inode->i_op = &configfs_inode_operations;
if (sd->s_iattr) {
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index ad70479..29942f0 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -172,7 +172,7 @@ static void ext2_preread_inode(struct inode *inode)
struct ext2_group_desc * gdp;
struct backing_dev_info *bdi;
- bdi = inode->i_mapping->backing_dev_info;
+ bdi = inode->i_mapping->a_bdi;
if (bdi_read_congested(bdi))
return;
if (bdi_write_congested(bdi))
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 58a95b7..3209aff 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -74,7 +74,7 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
struct super_block *sb = inode->i_sb;
if (strcmp(sb->s_type->name, "bdev") == 0)
- return inode->i_mapping->backing_dev_info;
+ return inode->i_mapping->a_bdi;
return sb->s_bdi;
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c822458..193a0d1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -945,7 +945,7 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
/* We can write back this queue in page reclaim */
- current->backing_dev_info = mapping->backing_dev_info;
+ current->backing_dev_info = mapping->a_bdi;
err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
if (err)
@@ -1133,7 +1133,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
{
struct inode *inode = req->inode;
struct fuse_inode *fi = get_fuse_inode(inode);
- struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+ struct backing_dev_info *bdi = inode->i_mapping->a_bdi;
list_del(&req->writepages_entry);
dec_bdi_stat(bdi, BDI_WRITEBACK);
@@ -1247,7 +1247,7 @@ static int fuse_writepage_locked(struct page *page)
req->end = fuse_writepage_end;
req->inode = inode;
- inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+ inc_bdi_stat(mapping->a_bdi, BDI_WRITEBACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
end_page_writeback(page);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index da9e6e1..5cf105c 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -256,7 +256,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
if ((inode->i_state & I_NEW)) {
inode->i_flags |= S_NOATIME|S_NOCMTIME;
inode->i_generation = generation;
- inode->i_data.backing_dev_info = &fc->bdi;
+ mapping_new_set_bdi(&inode->i_data, &fc->bdi);
fuse_init_inode(inode, attr);
unlock_new_inode(inode);
} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 9adf8f9..c8f4c50 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -8,6 +8,7 @@
*/
#include <linux/sched.h>
+#include <linux/backing-dev.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/buffer_head.h>
@@ -797,7 +798,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_NOFS);
mapping->assoc_mapping = NULL;
- mapping->backing_dev_info = s->s_bdi;
+ mapping_new_set_bdi(mapping, s->s_bdi);
mapping->writeback_index = 0;
}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 6e5bd42..a37920a 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -459,7 +459,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid,
inode->i_uid = uid;
inode->i_gid = gid;
inode->i_mapping->a_ops = &hugetlbfs_aops;
- inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping,
+ &hugetlbfs_backing_dev_info);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
INIT_LIST_HEAD(&inode->i_mapping->private_list);
info = HUGETLBFS_I(inode);
diff --git a/fs/inode.c b/fs/inode.c
index f04d501..22ef3f1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -201,7 +201,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
mapping->assoc_mapping = NULL;
- mapping->backing_dev_info = &default_backing_dev_info;
+ mapping_new_set_bdi(mapping, &default_backing_dev_info);
mapping->writeback_index = 0;
/*
@@ -212,8 +212,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
if (sb->s_bdev) {
struct backing_dev_info *bdi;
- bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
- mapping->backing_dev_info = bdi;
+ bdi = sb->s_bdev->bd_inode->i_mapping->a_bdi;
+ mapping_new_set_bdi(mapping, bdi);
}
inode->i_private = NULL;
inode->i_mapping = mapping;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 7d2d6c7..886be68 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -287,7 +287,8 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
if (S_ISREG(inode->i_mode)) {
inode->i_fop = &nfs_file_operations;
inode->i_data.a_ops = &nfs_file_aops;
- inode->i_data.backing_dev_info = &NFS_SB(sb)->backing_dev_info;
+ mapping_new_set_bdi(&inode->i_data,
+ &NFS_SB(sb)->backing_dev_info);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops;
inode->i_fop = &nfs_dir_operations;
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 874972d..a8baf4b 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -455,7 +455,7 @@ nfs_mark_request_commit(struct nfs_page *req)
nfsi->ncommit++;
spin_unlock(&inode->i_lock);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
- inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
+ inc_bdi_stat(req->wb_page->mapping->a_bdi, BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
}
@@ -466,7 +466,7 @@ nfs_clear_request_commit(struct nfs_page *req)
if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
dec_zone_page_state(page, NR_UNSTABLE_NFS);
- dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
+ dec_bdi_stat(page->mapping->a_bdi, BDI_RECLAIMABLE);
return 1;
}
return 0;
@@ -1321,8 +1321,7 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
- dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
- BDI_RECLAIMABLE);
+ dec_bdi_stat(req->wb_page->mapping->a_bdi, BDI_RECLAIMABLE);
nfs_clear_page_tag_locked(req);
}
nfs_commit_clear_lock(NFS_I(inode));
diff --git a/fs/nilfs2/btnode.c b/fs/nilfs2/btnode.c
index f78ab10..d74ed8f 100644
--- a/fs/nilfs2/btnode.c
+++ b/fs/nilfs2/btnode.c
@@ -59,7 +59,7 @@ void nilfs_btnode_cache_init(struct address_space *btnc,
btnc->flags = 0;
mapping_set_gfp_mask(btnc, GFP_NOFS);
btnc->assoc_mapping = NULL;
- btnc->backing_dev_info = bdi;
+ mapping_new_set_bdi(btnc, bdi);
btnc->a_ops = &def_btnode_aops;
}
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index d01aff4..7713861 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -517,7 +517,7 @@ nilfs_mdt_new_common(struct the_nilfs *nilfs, struct super_block *sb,
mapping->flags = 0;
mapping_set_gfp_mask(mapping, gfp_mask);
mapping->assoc_mapping = NULL;
- mapping->backing_dev_info = nilfs->ns_bdi;
+ mapping_new_set_bdi(mapping, nilfs->ns_bdi);
inode->i_mapping = mapping;
}
diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c
index ba7c10c..cb81695 100644
--- a/fs/nilfs2/the_nilfs.c
+++ b/fs/nilfs2/the_nilfs.c
@@ -729,7 +729,7 @@ int init_nilfs(struct the_nilfs *nilfs, struct nilfs_sb_info *sbi, char *data)
nilfs->ns_mount_state = le16_to_cpu(sbp->s_state);
- bdi = nilfs->ns_bdev->bd_inode->i_mapping->backing_dev_info;
+ bdi = nilfs->ns_bdev->bd_inode->i_mapping->a_bdi;
nilfs->ns_bdi = bdi ? : &default_backing_dev_info;
err = nilfs_store_log_cursor(nilfs, sbp);
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 113ebd9..19f9447 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2088,7 +2088,7 @@ static ssize_t ntfs_file_aio_write_nolock(struct kiocb *iocb,
pos = *ppos;
vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
/* We can write back this queue in page reclaim. */
- current->backing_dev_info = mapping->backing_dev_info;
+ current->backing_dev_info = mapping->a_bdi;
written = 0;
err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
if (err)
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index c2903b8..6b931db 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -403,7 +403,7 @@ static struct inode *dlmfs_get_root_inode(struct super_block *sb)
inode->i_mode = mode;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
- inode->i_mapping->backing_dev_info = &dlmfs_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping, &dlmfs_backing_dev_info);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
inc_nlink(inode);
@@ -428,7 +428,7 @@ static struct inode *dlmfs_get_inode(struct inode *parent,
inode->i_mode = mode;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
- inode->i_mapping->backing_dev_info = &dlmfs_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping, &dlmfs_backing_dev_info);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
ip = DLMFS_I(inode);
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 9a03c15..863e016 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2327,7 +2327,7 @@ relock:
goto out_dio;
}
} else {
- current->backing_dev_info = file->f_mapping->backing_dev_info;
+ current->backing_dev_info = file->f_mapping->a_bdi;
written = generic_file_buffered_write(iocb, iov, nr_segs, *ppos,
ppos, count, 0);
current->backing_dev_info = NULL;
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index a5ebae7..02d8ffb 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -60,7 +60,7 @@ struct inode *ramfs_get_inode(struct super_block *sb,
if (inode) {
inode_init_owner(inode, dir, mode);
inode->i_mapping->a_ops = &ramfs_aops;
- inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping, &ramfs_backing_dev_info);
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
mapping_set_unevictable(inode->i_mapping);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 42d2135..bb4b195 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -356,8 +356,8 @@ static struct inode *romfs_iget(struct super_block *sb, unsigned long pos)
i->i_fop = &romfs_ro_fops;
i->i_data.a_ops = &romfs_aops;
if (i->i_sb->s_mtd)
- i->i_data.backing_dev_info =
- i->i_sb->s_mtd->backing_dev_info;
+ mapping_new_set_bdi(&i->i_data,
+ i->i_sb->s_mtd->backing_dev_info);
if (nextfh & ROMFH_EXEC)
mode |= S_IXUGO;
break;
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index cffb1fd..3d049e5 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -251,7 +251,7 @@ static void sysfs_init_inode(struct sysfs_dirent *sd, struct inode *inode)
inode->i_private = sysfs_get(sd);
inode->i_mapping->a_ops = &sysfs_aops;
- inode->i_mapping->backing_dev_info = &sysfs_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping, &sysfs_backing_dev_info);
inode->i_op = &sysfs_inode_operations;
set_default_inode_attr(inode, sd->s_mode);
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 87ebcce..d669260 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -109,7 +109,7 @@ struct inode *ubifs_new_inode(struct ubifs_info *c, const struct inode *dir,
ubifs_current_time(inode);
inode->i_mapping->nrpages = 0;
/* Disable readahead */
- inode->i_mapping->backing_dev_info = &c->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &c->bdi);
switch (mode & S_IFMT) {
case S_IFREG:
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index cd5900b..45888fb 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -157,7 +157,7 @@ struct inode *ubifs_iget(struct super_block *sb, unsigned long inum)
goto out_invalid;
/* Disable read-ahead */
- inode->i_mapping->backing_dev_info = &c->bdi;
+ mapping_new_set_bdi(inode->i_mapping, &c->bdi);
switch (inode->i_mode & S_IFMT) {
case S_IFREG:
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 286e36e..7038d77 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -630,7 +630,7 @@ xfs_buf_readahead(
{
struct backing_dev_info *bdi;
- bdi = target->bt_mapping->backing_dev_info;
+ bdi = target->bt_mapping->a_bdi;
if (bdi_read_congested(bdi))
return;
@@ -1580,7 +1580,7 @@ xfs_mapping_buftarg(
bdi = &default_backing_dev_info;
mapping = &inode->i_data;
mapping->a_ops = &mapping_aops;
- mapping->backing_dev_info = bdi;
+ mapping_new_set_bdi(mapping, bdi);
mapping_set_gfp_mask(mapping, GFP_NOFS);
btp->bt_mapping = mapping;
return 0;
diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index ba8ad42..94cf85b 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -679,7 +679,7 @@ start:
goto out_unlock_internal;
/* We can write back this queue in page reclaim */
- current->backing_dev_info = mapping->backing_dev_info;
+ current->backing_dev_info = mapping->a_bdi;
if ((ioflags & IO_ISDIRECT)) {
if (mapping->nrpages) {
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 35b0074..31e1346 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -314,19 +314,27 @@ static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi)
return bdi == &default_backing_dev_info;
}
+void mapping_set_bdi(struct address_space *mapping,
+ struct backing_dev_info *bdi);
+static inline void mapping_new_set_bdi(struct address_space *mapping,
+ struct backing_dev_info *bdi)
+{
+ mapping->a_bdi = bdi;
+}
+
static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
{
- return bdi_cap_writeback_dirty(mapping->backing_dev_info);
+ return bdi_cap_writeback_dirty(mapping->a_bdi);
}
static inline bool mapping_cap_account_dirty(struct address_space *mapping)
{
- return bdi_cap_account_dirty(mapping->backing_dev_info);
+ return bdi_cap_account_dirty(mapping->a_bdi);
}
static inline bool mapping_cap_swap_backed(struct address_space *mapping)
{
- return bdi_cap_swap_backed(mapping->backing_dev_info);
+ return bdi_cap_swap_backed(mapping->a_bdi);
}
static inline int bdi_sched_wait(void *word)
@@ -345,7 +353,7 @@ static inline void blk_run_backing_dev(struct backing_dev_info *bdi,
static inline void blk_run_address_space(struct address_space *mapping)
{
if (mapping)
- blk_run_backing_dev(mapping->backing_dev_info, NULL);
+ blk_run_backing_dev(mapping->a_bdi, NULL);
}
#endif /* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1fb92f9..6f0b07f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -633,7 +633,7 @@ struct address_space {
pgoff_t writeback_index;/* writeback starts here */
const struct address_space_operations *a_ops; /* methods */
unsigned long flags; /* error bits/gfp mask */
- struct backing_dev_info *backing_dev_info; /* device readahead, etc */
+ struct backing_dev_info *a_bdi; /* device readahead, etc */
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c9483d8..8f1952b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -782,7 +782,7 @@ static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
- inode->i_mapping->backing_dev_info = &cgroup_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping, &cgroup_backing_dev_info);
}
return inode;
}
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 65d4204..0188d99 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -671,6 +671,48 @@ err:
}
EXPORT_SYMBOL(bdi_init);
+void mapping_set_bdi(struct address_space *mapping,
+ struct backing_dev_info *bdi)
+{
+ struct inode *inode = mapping->host;
+ struct backing_dev_info *old = mapping->a_bdi;
+
+ if (unlikely(old == bdi))
+ return;
+
+ spin_lock(&inode_lock);
+ if (!list_empty(&inode->i_list)) {
+ struct inode *i;
+
+ list_for_each_entry(i, &old->wb.b_dirty, i_list) {
+ if (inode == i) {
+ list_del(&inode->i_list);
+ list_add(&inode->i_list, &bdi->wb.b_dirty);
+ goto found;
+ }
+ }
+ list_for_each_entry(i, &old->wb.b_io, i_list) {
+ if (inode == i) {
+ list_del(&inode->i_list);
+ list_add(&inode->i_list, &bdi->wb.b_io);
+ goto found;
+ }
+ }
+ list_for_each_entry(i, &old->wb.b_more_io, i_list) {
+ if (inode == i) {
+ list_del(&inode->i_list);
+ list_add(&inode->i_list, &bdi->wb.b_more_io);
+ goto found;
+ }
+ }
+ BUG();
+ }
+found:
+ mapping->a_bdi = bdi;
+ spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL(mapping_set_bdi);
+
void bdi_destroy(struct backing_dev_info *bdi)
{
int i;
@@ -681,11 +723,24 @@ void bdi_destroy(struct backing_dev_info *bdi)
*/
if (bdi_has_dirty_io(bdi)) {
struct bdi_writeback *dst = &default_backing_dev_info.wb;
+ struct inode *i, *tmp;
spin_lock(&inode_lock);
- list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
- list_splice(&bdi->wb.b_io, &dst->b_io);
- list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+ list_for_each_entry_safe(i, tmp, &bdi->wb.b_dirty, i_list) {
+ list_del(&i->i_list);
+ list_add_tail(&i->i_list, &dst->b_dirty);
+ i->i_mapping->a_bdi = bdi;
+ }
+ list_for_each_entry_safe(i, tmp, &bdi->wb.b_io, i_list) {
+ list_del(&i->i_list);
+ list_add_tail(&i->i_list, &dst->b_io);
+ i->i_mapping->a_bdi = bdi;
+ }
+ list_for_each_entry_safe(i, tmp, &bdi->wb.b_more_io, i_list) {
+ list_del(&i->i_list);
+ list_add_tail(&i->i_list, &dst->b_more_io);
+ i->i_mapping->a_bdi = bdi;
+ }
spin_unlock(&inode_lock);
}
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 8d723c9..72e3ac5 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -72,7 +72,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
else
endbyte--; /* inclusive */
- bdi = mapping->backing_dev_info;
+ bdi = mapping->a_bdi;
switch (advice) {
case POSIX_FADV_NORMAL:
@@ -116,7 +116,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
case POSIX_FADV_NOREUSE:
break;
case POSIX_FADV_DONTNEED:
- if (!bdi_write_congested(mapping->backing_dev_info))
+ if (!bdi_write_congested(mapping->a_bdi))
filemap_flush(mapping);
/* First and last FULL page! */
diff --git a/mm/filemap.c b/mm/filemap.c
index 3d4df44..454d5ec 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -136,7 +136,7 @@ void __remove_from_page_cache(struct page *page)
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+ dec_bdi_stat(mapping->a_bdi, BDI_RECLAIMABLE);
}
}
@@ -2373,7 +2373,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
/* We can write back this queue in page reclaim */
- current->backing_dev_info = mapping->backing_dev_info;
+ current->backing_dev_info = mapping->a_bdi;
written = 0;
err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 83364df..cdca914 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -409,7 +409,7 @@ xip_file_write(struct file *filp, const char __user *buf, size_t len,
vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
/* We can write back this queue in page reclaim */
- current->backing_dev_info = mapping->backing_dev_info;
+ current->backing_dev_info = mapping->a_bdi;
ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
if (ret)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e3bccac..e2d50b1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -489,7 +489,7 @@ static void balance_dirty_pages(struct address_space *mapping,
unsigned long pages_written = 0;
unsigned long pause = 1;
bool dirty_exceeded = false;
- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;
for (;;) {
struct writeback_control wbc = {
@@ -633,7 +633,7 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
unsigned long *p;
ratelimit = ratelimit_pages;
- if (mapping->backing_dev_info->dirty_exceeded)
+ if (mapping->a_bdi->dirty_exceeded)
ratelimit = 8;
/*
@@ -964,7 +964,7 @@ continue_unlock:
if (!clear_page_dirty_for_io(page))
goto continue_unlock;
- trace_wbc_writepage(wbc, mapping->backing_dev_info);
+ trace_wbc_writepage(wbc, mapping->a_bdi);
ret = (*writepage)(page, wbc, data);
if (unlikely(ret)) {
if (ret == AOP_WRITEPAGE_ACTIVATE) {
@@ -1121,7 +1121,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
{
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
- __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+ __inc_bdi_stat(mapping->a_bdi, BDI_RECLAIMABLE);
task_dirty_inc(current);
task_io_account_write(PAGE_CACHE_SIZE);
}
@@ -1297,8 +1297,7 @@ int clear_page_dirty_for_io(struct page *page)
*/
if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info,
- BDI_RECLAIMABLE);
+ dec_bdi_stat(mapping->a_bdi, BDI_RECLAIMABLE);
return 1;
}
return 0;
@@ -1313,7 +1312,7 @@ int test_clear_page_writeback(struct page *page)
int ret;
if (mapping) {
- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;
unsigned long flags;
spin_lock_irqsave(&mapping->tree_lock, flags);
@@ -1342,7 +1341,7 @@ int test_set_page_writeback(struct page *page)
int ret;
if (mapping) {
- struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct backing_dev_info *bdi = mapping->a_bdi;
unsigned long flags;
spin_lock_irqsave(&mapping->tree_lock, flags);
diff --git a/mm/readahead.c b/mm/readahead.c
index 77506a2..831b927 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -25,7 +25,7 @@
void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
{
- ra->ra_pages = mapping->backing_dev_info->ra_pages;
+ ra->ra_pages = mapping->a_bdi->ra_pages;
ra->prev_pos = -1;
}
EXPORT_SYMBOL_GPL(file_ra_state_init);
@@ -549,7 +549,7 @@ page_cache_async_readahead(struct address_space *mapping,
/*
* Defer asynchronous read-ahead on IO congestion.
*/
- if (bdi_read_congested(mapping->backing_dev_info))
+ if (bdi_read_congested(mapping->a_bdi))
return;
/* do read-ahead */
@@ -564,7 +564,7 @@ page_cache_async_readahead(struct address_space *mapping,
* explicitly kick off the IO.
*/
if (PageUptodate(page))
- blk_run_backing_dev(mapping->backing_dev_info, NULL);
+ blk_run_backing_dev(mapping->a_bdi, NULL);
#endif
}
EXPORT_SYMBOL_GPL(page_cache_async_readahead);
diff --git a/mm/shmem.c b/mm/shmem.c
index 080b09a..fbee46d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1588,7 +1588,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
if (inode) {
inode_init_owner(inode, dir, mode);
inode->i_blocks = 0;
- inode->i_mapping->backing_dev_info = &shmem_backing_dev_info;
+ mapping_new_set_bdi(inode->i_mapping, &shmem_backing_dev_info);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
inode->i_generation = get_seconds();
info = SHMEM_I(inode);
diff --git a/mm/swap.c b/mm/swap.c
index 3ce7bc3..9352a37 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -501,7 +501,7 @@ void __init swap_setup(void)
unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT);
#ifdef CONFIG_SWAP
- bdi_init(swapper_space.backing_dev_info);
+ bdi_init(swapper_space.a_bdi);
#endif
/* Use a smaller cluster for small-memory machines */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e10f583..6496074 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -45,7 +45,7 @@ struct address_space swapper_space = {
.tree_lock = __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock),
.a_ops = &swap_aops,
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
- .backing_dev_info = &swap_backing_dev_info,
+ .a_bdi = &swap_backing_dev_info,
};
#define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7c703ff..c14b755 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -116,7 +116,7 @@ void swap_unplug_io_fn(struct backing_dev_info *unused_bdi, struct page *page)
*/
WARN_ON(page_count(page) <= 1);
- bdi = bdev->bd_inode->i_mapping->backing_dev_info;
+ bdi = bdev->bd_inode->i_mapping->a_bdi;
blk_run_backing_dev(bdi, page);
}
up_read(&swap_unplug_sem);
diff --git a/mm/truncate.c b/mm/truncate.c
index ba887bf..bb79cef 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -75,8 +75,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info,
- BDI_RECLAIMABLE);
+ dec_bdi_stat(mapping->a_bdi, BDI_RECLAIMABLE);
if (account_size)
task_io_account_cancelled_write(account_size);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5dfabf..8f58773 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -366,7 +366,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
}
if (mapping->a_ops->writepage == NULL)
return PAGE_ACTIVATE;
- if (!may_write_to_queue(mapping->backing_dev_info))
+ if (!may_write_to_queue(mapping->a_bdi))
return PAGE_KEEP;
if (clear_page_dirty_for_io(page)) {
--
1.7.1
* Re: [PATCH 03/18] fs: keep inode with backing-dev
2010-10-08 5:21 ` [PATCH 03/18] fs: keep inode with backing-dev Dave Chinner
@ 2010-10-08 7:01 ` Christoph Hellwig
2010-10-08 7:27 ` Dave Chinner
From: Christoph Hellwig @ 2010-10-08 7:01 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:17PM +1100, Dave Chinner wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> Having inode on writeback lists of a different bdi than
> inode->i_mapping->backing_dev_info makes it very difficult to do
> per-bdi locking of the writeback lists. Add functions to move these
> inodes over when the mapping backing dev is changed.
>
> Also, rename i_mapping.backing_dev_info to i_mapping.a_bdi while we're
> here. Succinct is nice, and it catches conversion errors.
NAK. This is fixed by my patch to always use s_bdi for writeback
purposes that hit Linus' tree two days ago.
* Re: [PATCH 03/18] fs: keep inode with backing-dev
2010-10-08 7:01 ` Christoph Hellwig
@ 2010-10-08 7:27 ` Dave Chinner
0 siblings, 0 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 7:27 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 03:01:54AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 08, 2010 at 04:21:17PM +1100, Dave Chinner wrote:
> > From: Nick Piggin <npiggin@suse.de>
> >
> > Having inode on writeback lists of a different bdi than
> > inode->i_mapping->backing_dev_info makes it very difficult to do
> > per-bdi locking of the writeback lists. Add functions to move these
> > inodes over when the mapping backing dev is changed.
> >
> > Also, rename i_mapping.backing_dev_info to i_mapping.a_bdi while we're
> > here. Succinct is nice, and it catches conversion errors.
>
> NAK. This is fixed by my patch to always use s_bdi for writeback
> purposes that hit Linus' tree two days ago.
Great. I'll drop it from the series then.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH 04/18] fs: Implement lazy LRU updates for inodes.
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (2 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 03/18] fs: keep inode with backing-dev Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:08 ` Christoph Hellwig
2010-10-08 9:08 ` Al Viro
2010-10-08 5:21 ` [PATCH 05/18] fs: inode split IO and LRU lists Dave Chinner
` (15 subsequent siblings)
19 siblings, 2 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic. We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount of contention
on LRU updates.
This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under its own lock.
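As a rough sketch of the resulting behaviour (illustration only - the
real code is in the fs/inode.c hunks below; locking and the other
i_state checks are elided):

	/* iput_final(): last reference dropped, inode_lock held */
	inode->i_state |= I_REFERENCED;
	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
		list_move(&inode->i_list, &inode_unused);

	/* prune_icache(): reclaim walks the LRU tail, inode_lock held */
	if (inode->i_state & I_REFERENCED) {
		/* used since it was queued - give it another pass */
		inode->i_state &= ~I_REFERENCED;
		list_move(&inode->i_list, &inode_unused);
		continue;
	}

so iget()/iput() no longer touch the LRU on every reference count
change, and reclaim discovers recently used inodes via the I_REFERENCED
flag instead.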
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/fs-writeback.c | 9 +------
fs/inode.c | 56 +++++++++++++++++++++++++++++----------------
include/linux/fs.h | 13 +++++-----
include/linux/writeback.h | 1 -
4 files changed, 44 insertions(+), 35 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3209aff..2a61300 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -408,15 +408,8 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* completion.
*/
redirty_tail(inode);
- } else if (atomic_read(&inode->i_count)) {
- /*
- * The inode is clean, inuse
- */
- list_move(&inode->i_list, &inode_in_use);
} else {
- /*
- * The inode is clean, unused
- */
+ /* The inode is clean */
list_move(&inode->i_list, &inode_unused);
}
}
diff --git a/fs/inode.c b/fs/inode.c
index 22ef3f1..e76d398 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -72,7 +72,6 @@ static unsigned int i_hash_shift __read_mostly;
* allowing for low-overhead inode sync() operations.
*/
-LIST_HEAD(inode_in_use);
LIST_HEAD(inode_unused);
static struct hlist_head *inode_hashtable __read_mostly;
@@ -291,6 +290,7 @@ void inode_init_once(struct inode *inode)
INIT_HLIST_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_devices);
+ INIT_LIST_HEAD(&inode->i_list);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
spin_lock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
@@ -317,12 +317,7 @@ static void init_once(void *foo)
*/
void __iget(struct inode *inode)
{
- if (atomic_inc_return(&inode->i_count) != 1)
- return;
-
- if (!(inode->i_state & (I_DIRTY|I_SYNC)))
- list_move(&inode->i_list, &inode_in_use);
- percpu_counter_dec(&nr_inodes_unused);
+ atomic_inc(&inode->i_count);
}
void end_writeback(struct inode *inode)
@@ -367,7 +362,7 @@ static void dispose_list(struct list_head *head)
struct inode *inode;
inode = list_first_entry(head, struct inode, i_list);
- list_del(&inode->i_list);
+ list_del_init(&inode->i_list);
evict(inode);
@@ -489,8 +484,15 @@ static void prune_icache(int nr_to_scan)
inode = list_entry(inode_unused.prev, struct inode, i_list);
- if (inode->i_state || atomic_read(&inode->i_count)) {
+ if (atomic_read(&inode->i_count) ||
+ (inode->i_state & ~I_REFERENCED)) {
+ list_del_init(&inode->i_list);
+ percpu_counter_dec(&nr_inodes_unused);
+ continue;
+ }
+ if (inode->i_state & I_REFERENCED) {
list_move(&inode->i_list, &inode_unused);
+ inode->i_state &= ~I_REFERENCED;
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
@@ -502,11 +504,15 @@ static void prune_icache(int nr_to_scan)
iput(inode);
spin_lock(&inode_lock);
- if (inode != list_entry(inode_unused.next,
- struct inode, i_list))
- continue; /* wrong inode or list_empty */
- if (!can_unuse(inode))
+ /*
+ * if we can't reclaim this inod immediately, give it
+ * another pass through the free list so we don't spin
+ * on it.
+ */
+ if (!can_unuse(inode)) {
+ list_move(&inode->i_list, &inode_unused);
continue;
+ }
}
list_move(&inode->i_list, &freeable);
WARN_ON(inode->i_state & I_NEW);
@@ -621,7 +627,6 @@ static inline void
__inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
struct inode *inode)
{
- list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
if (head)
hlist_add_head(&inode->i_hash, head);
@@ -1238,10 +1243,12 @@ static void iput_final(struct inode *inode)
drop = generic_drop_inode(inode);
if (!drop) {
- if (!(inode->i_state & (I_DIRTY|I_SYNC)))
- list_move(&inode->i_list, &inode_unused);
- percpu_counter_inc(&nr_inodes_unused);
if (sb->s_flags & MS_ACTIVE) {
+ inode->i_state |= I_REFERENCED;
+ if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
+ list_move(&inode->i_list, &inode_unused);
+ percpu_counter_inc(&nr_inodes_unused);
+ }
spin_unlock(&inode_lock);
return;
}
@@ -1252,13 +1259,22 @@ static void iput_final(struct inode *inode)
spin_lock(&inode_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
- percpu_counter_dec(&nr_inodes_unused);
hlist_del_init(&inode->i_hash);
}
- list_del_init(&inode->i_list);
- list_del_init(&inode->i_sb_list);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
+
+ /*
+ * We avoid moving dirty inodes back onto the LRU now because I_FREEING
+ * is set and hence writeback_single_inode() won't move the inode
+ * around.
+ */
+ if (!list_empty(&inode->i_list)) {
+ list_del_init(&inode->i_list);
+ percpu_counter_dec(&nr_inodes_unused);
+ }
+
+ list_del_init(&inode->i_sb_list);
spin_unlock(&inode_lock);
evict(inode);
spin_lock(&inode_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6f0b07f..8ff7b6b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1632,16 +1632,17 @@ struct super_operations {
*
* Q: What is the difference between I_WILL_FREE and I_FREEING?
*/
-#define I_DIRTY_SYNC 1
-#define I_DIRTY_DATASYNC 2
-#define I_DIRTY_PAGES 4
+#define I_DIRTY_SYNC 0x01
+#define I_DIRTY_DATASYNC 0x02
+#define I_DIRTY_PAGES 0x04
#define __I_NEW 3
#define I_NEW (1 << __I_NEW)
-#define I_WILL_FREE 16
-#define I_FREEING 32
-#define I_CLEAR 64
+#define I_WILL_FREE 0x10
+#define I_FREEING 0x20
+#define I_CLEAR 0x40
#define __I_SYNC 7
#define I_SYNC (1 << __I_SYNC)
+#define I_REFERENCED 0x100
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 72a5d64..f956b66 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -10,7 +10,6 @@
struct backing_dev_info;
extern spinlock_t inode_lock;
-extern struct list_head inode_in_use;
extern struct list_head inode_unused;
/*
--
1.7.1
^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 04/18] fs: Implement lazy LRU updates for inodes.
2010-10-08 5:21 ` [PATCH 04/18] fs: Implement lazy LRU updates for inodes Dave Chinner
@ 2010-10-08 7:08 ` Christoph Hellwig
2010-10-08 7:31 ` Dave Chinner
2010-10-08 9:08 ` Al Viro
1 sibling, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 7:08 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
A few nitpicks on the comments or lack thereof below:
> @@ -489,8 +484,15 @@ static void prune_icache(int nr_to_scan)
>
> inode = list_entry(inode_unused.prev, struct inode, i_list);
>
> - if (inode->i_state || atomic_read(&inode->i_count)) {
> + if (atomic_read(&inode->i_count) ||
> + (inode->i_state & ~I_REFERENCED)) {
> + list_del_init(&inode->i_list);
> + percpu_counter_dec(&nr_inodes_unused);
> + continue;
> + }
> + if (inode->i_state & I_REFERENCED) {
> list_move(&inode->i_list, &inode_unused);
> + inode->i_state &= ~I_REFERENCED;
> continue;
I think this code could use some comments explaining the lazy LRU
scheme.
> - if (inode != list_entry(inode_unused.next,
> - struct inode, i_list))
> - continue; /* wrong inode or list_empty */
> - if (!can_unuse(inode))
> + /*
> + * if we can't reclaim this inod immediately, give it
> + * another pass through the free list so we don't spin
> + * on it.
s/inod/inode/
> +
> + /*
> + * We avoid moving dirty inodes back onto the LRU now because I_FREEING
> + * is set and hence writeback_single_inode() won't move the inode
> + * around.
> + */
> + if (!list_empty(&inode->i_list)) {
> + list_del_init(&inode->i_list);
> + percpu_counter_dec(&nr_inodes_unused);
> + }
> +
The comment is a bit misleading. We do not only avoid moving it to the
LRU, but actively delete the inode from the LRU here. I don't think the
I_FREEING check is the only reason - the LRU code traditionally
couldn't deal with unlinked inodes at all, although the switch to
->evict_inode probably has fixed that.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 04/18] fs: Implement lazy LRU updates for inodes.
2010-10-08 7:08 ` Christoph Hellwig
@ 2010-10-08 7:31 ` Dave Chinner
0 siblings, 0 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 7:31 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 03:08:56AM -0400, Christoph Hellwig wrote:
> Looks good,
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> A few nitpicks on the comments or lack thereof below:
>
> > @@ -489,8 +484,15 @@ static void prune_icache(int nr_to_scan)
> >
> > inode = list_entry(inode_unused.prev, struct inode, i_list);
> >
> > - if (inode->i_state || atomic_read(&inode->i_count)) {
> > + if (atomic_read(&inode->i_count) ||
> > + (inode->i_state & ~I_REFERENCED)) {
> > + list_del_init(&inode->i_list);
> > + percpu_counter_dec(&nr_inodes_unused);
> > + continue;
> > + }
> > + if (inode->i_state & I_REFERENCED) {
> > list_move(&inode->i_list, &inode_unused);
> > + inode->i_state &= ~I_REFERENCED;
> > continue;
>
> I think this code could use some comments explaining the lazy LRU
> scheme.
Ok. I'll add some to it.
> > - if (inode != list_entry(inode_unused.next,
> > - struct inode, i_list))
> > - continue; /* wrong inode or list_empty */
> > - if (!can_unuse(inode))
> > + /*
> > + * if we can't reclaim this inod immediately, give it
> > + * another pass through the free list so we don't spin
> > + * on it.
>
> s/inod/inode/
Woops, missed that one again.
> > +
> > + /*
> > + * We avoid moving dirty inodes back onto the LRU now because I_FREEING
> > + * is set and hence writeback_single_inode() won't move the inode
> > + * around.
> > + */
> > + if (!list_empty(&inode->i_list)) {
> > + list_del_init(&inode->i_list);
> > + percpu_counter_dec(&nr_inodes_unused);
> > + }
> > +
>
> The comment is a bit misleading. We do not only avoid moving it to the
> LRU, but actively delete the inode from the LRU here.
Right. My intent was that "after the inode is deleted from the LRU,
we avoid moving dirty inodes....". I'll add that clarification to
the comment.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 04/18] fs: Implement lazy LRU updates for inodes.
2010-10-08 5:21 ` [PATCH 04/18] fs: Implement lazy LRU updates for inodes Dave Chinner
2010-10-08 7:08 ` Christoph Hellwig
@ 2010-10-08 9:08 ` Al Viro
2010-10-08 9:51 ` Dave Chinner
1 sibling, 1 reply; 162+ messages in thread
From: Al Viro @ 2010-10-08 9:08 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:18PM +1100, Dave Chinner wrote:
> void __iget(struct inode *inode)
> {
> - if (atomic_inc_return(&inode->i_count) != 1)
> - return;
> -
> - if (!(inode->i_state & (I_DIRTY|I_SYNC)))
> - list_move(&inode->i_list, &inode_in_use);
> - percpu_counter_dec(&nr_inodes_unused);
> + atomic_inc(&inode->i_count);
> }
Umm... Are you sure we don't rely on implicit barriers present in the current
version?
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 04/18] fs: Implement lazy LRU updates for inodes.
2010-10-08 9:08 ` Al Viro
@ 2010-10-08 9:51 ` Dave Chinner
0 siblings, 0 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 9:51 UTC (permalink / raw)
To: Al Viro; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 10:08:02AM +0100, Al Viro wrote:
> On Fri, Oct 08, 2010 at 04:21:18PM +1100, Dave Chinner wrote:
>
> > void __iget(struct inode *inode)
> > {
> > - if (atomic_inc_return(&inode->i_count) != 1)
> > - return;
> > -
> > - if (!(inode->i_state & (I_DIRTY|I_SYNC)))
> > - list_move(&inode->i_list, &inode_in_use);
> > - percpu_counter_dec(&nr_inodes_unused);
> > + atomic_inc(&inode->i_count);
> > }
>
> Umm... Are you sure we don't rely on implicit barriers present in the current
> version?
I'll confess that I have no idea what you are talking about, Al.
Instead, I'll ask whether the conversion later on, where all accesses
and modifications to the reference count are moved under the
inode->i_lock, is sufficient to provide the necessary memory
barriers?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH 05/18] fs: inode split IO and LRU lists
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (3 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 04/18] fs: Implement lazy LRU updates for inodes Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:14 ` Christoph Hellwig
2010-10-08 9:16 ` Al Viro
2010-10-08 5:21 ` [PATCH 06/18] fs: Clean up inode reference counting Dave Chinner
` (14 subsequent siblings)
19 siblings, 2 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.
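After this patch the inode carries two separate list heads - roughly
(fragment only; the full change is in the include/linux/fs.h hunk
below):

	struct list_head	i_io;	/* writeback lists: b_dirty/b_io/b_more_io */
	struct list_head	i_lru;	/* reclaim list: inode_unused */

The bdi writeback queues are now threaded through i_io, and the
unused-inode LRU (inode_unused) through i_lru, which is what later
allows the writeback and LRU list operations to be locked separately.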
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/fs-writeback.c | 30 +++++++++++++++++-------------
fs/inode.c | 36 +++++++++++++++++++++---------------
fs/nilfs2/mdt.c | 3 ++-
include/linux/fs.h | 3 ++-
include/linux/writeback.h | 3 +++
mm/backing-dev.c | 44 ++++++++++++++++++++++----------------------
6 files changed, 67 insertions(+), 52 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2a61300..78aaaa8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -172,11 +172,11 @@ static void redirty_tail(struct inode *inode)
if (!list_empty(&wb->b_dirty)) {
struct inode *tail;
- tail = list_entry(wb->b_dirty.next, struct inode, i_list);
+ tail = list_entry(wb->b_dirty.next, struct inode, i_io);
if (time_before(inode->dirtied_when, tail->dirtied_when))
inode->dirtied_when = jiffies;
}
- list_move(&inode->i_list, &wb->b_dirty);
+ list_move(&inode->i_io, &wb->b_dirty);
}
/*
@@ -186,7 +186,7 @@ static void requeue_io(struct inode *inode)
{
struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
- list_move(&inode->i_list, &wb->b_more_io);
+ list_move(&inode->i_io, &wb->b_more_io);
}
static void inode_sync_complete(struct inode *inode)
@@ -227,14 +227,14 @@ static void move_expired_inodes(struct list_head *delaying_queue,
int do_sb_sort = 0;
while (!list_empty(delaying_queue)) {
- inode = list_entry(delaying_queue->prev, struct inode, i_list);
+ inode = list_entry(delaying_queue->prev, struct inode, i_io);
if (older_than_this &&
inode_dirtied_after(inode, *older_than_this))
break;
if (sb && sb != inode->i_sb)
do_sb_sort = 1;
sb = inode->i_sb;
- list_move(&inode->i_list, &tmp);
+ list_move(&inode->i_io, &tmp);
}
/* just one sb in list, splice to dispatch_queue and we're done */
@@ -245,12 +245,12 @@ static void move_expired_inodes(struct list_head *delaying_queue,
/* Move inodes from one superblock together */
while (!list_empty(&tmp)) {
- inode = list_entry(tmp.prev, struct inode, i_list);
+ inode = list_entry(tmp.prev, struct inode, i_io);
sb = inode->i_sb;
list_for_each_prev_safe(pos, node, &tmp) {
- inode = list_entry(pos, struct inode, i_list);
+ inode = list_entry(pos, struct inode, i_io);
if (inode->i_sb == sb)
- list_move(&inode->i_list, dispatch_queue);
+ list_move(&inode->i_io, dispatch_queue);
}
}
}
@@ -410,7 +410,11 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
redirty_tail(inode);
} else {
/* The inode is clean */
- list_move(&inode->i_list, &inode_unused);
+ list_del_init(&inode->i_io);
+ if (list_empty(&inode->i_lru)) {
+ list_add(&inode->i_lru, &inode_unused);
+ percpu_counter_inc(&nr_inodes_unused);
+ }
}
}
inode_sync_complete(inode);
@@ -459,7 +463,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
while (!list_empty(&wb->b_io)) {
long pages_skipped;
struct inode *inode = list_entry(wb->b_io.prev,
- struct inode, i_list);
+ struct inode, i_io);
if (inode->i_sb != sb) {
if (only_this_sb) {
@@ -530,7 +534,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
while (!list_empty(&wb->b_io)) {
struct inode *inode = list_entry(wb->b_io.prev,
- struct inode, i_list);
+ struct inode, i_io);
struct super_block *sb = inode->i_sb;
if (!pin_sb_for_writeback(sb)) {
@@ -669,7 +673,7 @@ static long wb_writeback(struct bdi_writeback *wb,
spin_lock(&inode_lock);
if (!list_empty(&wb->b_more_io)) {
inode = list_entry(wb->b_more_io.prev,
- struct inode, i_list);
+ struct inode, i_io);
trace_wbc_writeback_wait(&wbc, wb->bdi);
inode_wait_for_writeback(inode);
}
@@ -983,7 +987,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
}
inode->dirtied_when = jiffies;
- list_move(&inode->i_list, &bdi->wb.b_dirty);
+ list_move(&inode->i_io, &bdi->wb.b_dirty);
}
}
out:
diff --git a/fs/inode.c b/fs/inode.c
index e76d398..98f8963 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -102,8 +102,8 @@ static DECLARE_RWSEM(iprune_sem);
*/
struct inodes_stat_t inodes_stat;
-static struct percpu_counter nr_inodes __cacheline_aligned_in_smp;
-static struct percpu_counter nr_inodes_unused __cacheline_aligned_in_smp;
+struct percpu_counter nr_inodes __cacheline_aligned_in_smp;
+struct percpu_counter nr_inodes_unused __cacheline_aligned_in_smp;
static struct kmem_cache *inode_cachep __read_mostly;
@@ -272,6 +272,7 @@ EXPORT_SYMBOL(__destroy_inode);
void destroy_inode(struct inode *inode)
{
+ BUG_ON(!list_empty(&inode->i_lru));
__destroy_inode(inode);
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
@@ -290,7 +291,8 @@ void inode_init_once(struct inode *inode)
INIT_HLIST_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_devices);
- INIT_LIST_HEAD(&inode->i_list);
+ INIT_LIST_HEAD(&inode->i_io);
+ INIT_LIST_HEAD(&inode->i_lru);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
spin_lock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
@@ -361,8 +363,8 @@ static void dispose_list(struct list_head *head)
while (!list_empty(head)) {
struct inode *inode;
- inode = list_first_entry(head, struct inode, i_list);
- list_del_init(&inode->i_list);
+ inode = list_first_entry(head, struct inode, i_lru);
+ list_del_init(&inode->i_lru);
evict(inode);
@@ -405,7 +407,8 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
continue;
invalidate_inode_buffers(inode);
if (!atomic_read(&inode->i_count)) {
- list_move(&inode->i_list, dispose);
+ list_move(&inode->i_lru, dispose);
+ list_del_init(&inode->i_io);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
percpu_counter_dec(&nr_inodes_unused);
@@ -482,16 +485,16 @@ static void prune_icache(int nr_to_scan)
if (list_empty(&inode_unused))
break;
- inode = list_entry(inode_unused.prev, struct inode, i_list);
+ inode = list_entry(inode_unused.prev, struct inode, i_lru);
if (atomic_read(&inode->i_count) ||
(inode->i_state & ~I_REFERENCED)) {
- list_del_init(&inode->i_list);
+ list_del_init(&inode->i_lru);
percpu_counter_dec(&nr_inodes_unused);
continue;
}
if (inode->i_state & I_REFERENCED) {
- list_move(&inode->i_list, &inode_unused);
+ list_move(&inode->i_lru, &inode_unused);
inode->i_state &= ~I_REFERENCED;
continue;
}
@@ -510,11 +513,12 @@ static void prune_icache(int nr_to_scan)
* on it.
*/
if (!can_unuse(inode)) {
- list_move(&inode->i_list, &inode_unused);
+ list_move(&inode->i_lru, &inode_unused);
continue;
}
}
- list_move(&inode->i_list, &freeable);
+ list_move(&inode->i_lru, &freeable);
+ list_del_init(&inode->i_io);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
percpu_counter_dec(&nr_inodes_unused);
@@ -1245,8 +1249,9 @@ static void iput_final(struct inode *inode)
if (!drop) {
if (sb->s_flags & MS_ACTIVE) {
inode->i_state |= I_REFERENCED;
- if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
- list_move(&inode->i_list, &inode_unused);
+ if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
+ list_empty(&inode->i_lru)) {
+ list_add(&inode->i_lru, &inode_unused);
percpu_counter_inc(&nr_inodes_unused);
}
spin_unlock(&inode_lock);
@@ -1261,6 +1266,7 @@ static void iput_final(struct inode *inode)
inode->i_state &= ~I_WILL_FREE;
hlist_del_init(&inode->i_hash);
}
+ list_del_init(&inode->i_io);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
@@ -1269,8 +1275,8 @@ static void iput_final(struct inode *inode)
* is set and hence writeback_single_inode() won't move the inode
* around.
*/
- if (!list_empty(&inode->i_list)) {
- list_del_init(&inode->i_list);
+ if (!list_empty(&inode->i_lru)) {
+ list_del_init(&inode->i_lru);
percpu_counter_dec(&nr_inodes_unused);
}
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index 7713861..2ee524f 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -504,7 +504,8 @@ nilfs_mdt_new_common(struct the_nilfs *nilfs, struct super_block *sb,
#endif
inode->dirtied_when = 0;
- INIT_LIST_HEAD(&inode->i_list);
+ INIT_LIST_HEAD(&inode->i_io);
+ INIT_LIST_HEAD(&inode->i_lru);
INIT_LIST_HEAD(&inode->i_sb_list);
inode->i_state = 0;
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8ff7b6b..11c7ad4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -725,7 +725,8 @@ struct posix_acl;
struct inode {
struct hlist_node i_hash;
- struct list_head i_list; /* backing dev IO list */
+ struct list_head i_io; /* backing dev IO list */
+ struct list_head i_lru; /* backing dev IO list */
struct list_head i_sb_list;
struct list_head i_dentry;
unsigned long i_ino;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index f956b66..f7ed2a0 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -12,6 +12,9 @@ struct backing_dev_info;
extern spinlock_t inode_lock;
extern struct list_head inode_unused;
+extern struct percpu_counter nr_inodes;
+extern struct percpu_counter nr_inodes_unused;
+
/*
* fs/fs-writeback.c
*/
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0188d99..a124991 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -74,11 +74,11 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
nr_wb = nr_dirty = nr_io = nr_more_io = 0;
spin_lock(&inode_lock);
- list_for_each_entry(inode, &wb->b_dirty, i_list)
+ list_for_each_entry(inode, &wb->b_dirty, i_io)
nr_dirty++;
- list_for_each_entry(inode, &wb->b_io, i_list)
+ list_for_each_entry(inode, &wb->b_io, i_io)
nr_io++;
- list_for_each_entry(inode, &wb->b_more_io, i_list)
+ list_for_each_entry(inode, &wb->b_more_io, i_io)
nr_more_io++;
spin_unlock(&inode_lock);
@@ -681,27 +681,27 @@ void mapping_set_bdi(struct address_space *mapping,
return;
spin_lock(&inode_lock);
- if (!list_empty(&inode->i_list)) {
+ if (!list_empty(&inode->i_io)) {
struct inode *i;
- list_for_each_entry(i, &old->wb.b_dirty, i_list) {
+ list_for_each_entry(i, &old->wb.b_dirty, i_io) {
if (inode == i) {
- list_del(&inode->i_list);
- list_add(&inode->i_list, &bdi->wb.b_dirty);
+ list_del(&inode->i_io);
+ list_add(&inode->i_io, &bdi->wb.b_dirty);
goto found;
}
}
- list_for_each_entry(i, &old->wb.b_io, i_list) {
+ list_for_each_entry(i, &old->wb.b_io, i_io) {
if (inode == i) {
- list_del(&inode->i_list);
- list_add(&inode->i_list, &bdi->wb.b_io);
+ list_del(&inode->i_io);
+ list_add(&inode->i_io, &bdi->wb.b_io);
goto found;
}
}
- list_for_each_entry(i, &old->wb.b_more_io, i_list) {
+ list_for_each_entry(i, &old->wb.b_more_io, i_io) {
if (inode == i) {
- list_del(&inode->i_list);
- list_add(&inode->i_list, &bdi->wb.b_more_io);
+ list_del(&inode->i_io);
+ list_add(&inode->i_io, &bdi->wb.b_more_io);
goto found;
}
}
@@ -726,19 +726,19 @@ void bdi_destroy(struct backing_dev_info *bdi)
struct inode *i, *tmp;
spin_lock(&inode_lock);
- list_for_each_entry_safe(i, tmp, &bdi->wb.b_dirty, i_list) {
- list_del(&i->i_list);
- list_add_tail(&i->i_list, &dst->b_dirty);
+ list_for_each_entry_safe(i, tmp, &bdi->wb.b_dirty, i_io) {
+ list_del(&i->i_io);
+ list_add_tail(&i->i_io, &dst->b_dirty);
i->i_mapping->a_bdi = bdi;
}
- list_for_each_entry_safe(i, tmp, &bdi->wb.b_io, i_list) {
- list_del(&i->i_list);
- list_add_tail(&i->i_list, &dst->b_io);
+ list_for_each_entry_safe(i, tmp, &bdi->wb.b_io, i_io) {
+ list_del(&i->i_io);
+ list_add_tail(&i->i_io, &dst->b_io);
i->i_mapping->a_bdi = bdi;
}
- list_for_each_entry_safe(i, tmp, &bdi->wb.b_more_io, i_list) {
- list_del(&i->i_list);
- list_add_tail(&i->i_list, &dst->b_more_io);
+ list_for_each_entry_safe(i, tmp, &bdi->wb.b_more_io, i_io) {
+ list_del(&i->i_io);
+ list_add_tail(&i->i_io, &dst->b_more_io);
i->i_mapping->a_bdi = bdi;
}
spin_unlock(&inode_lock);
--
1.7.1
^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 05/18] fs: inode split IO and LRU lists
2010-10-08 5:21 ` [PATCH 05/18] fs: inode split IO and LRU lists Dave Chinner
@ 2010-10-08 7:14 ` Christoph Hellwig
2010-10-08 7:38 ` Dave Chinner
2010-10-08 9:16 ` Al Viro
1 sibling, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 7:14 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:19PM +1100, Dave Chinner wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> The use of the same inode list structure (inode->i_list) for two
> different list constructs with different lifecycles and purposes
> makes it impossible to separate the locking of the different
> operations. Therefore, to enable the separation of the locking of
> the writeback and reclaim lists, split the inode->i_list into two
> separate lists dedicated to their specific tracking functions.
> @@ -410,7 +410,11 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> redirty_tail(inode);
> } else {
> /* The inode is clean */
> - list_move(&inode->i_list, &inode_unused);
> + list_del_init(&inode->i_io);
> + if (list_empty(&inode->i_lru)) {
> + list_add(&inode->i_lru, &inode_unused);
> + percpu_counter_inc(&nr_inodes_unused);
> + }
This looks like it belongs into the earlier patch. Also, instead of
making nr_inodes_unused non-static, a helper to manipulate it might
be a better idea.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 05/18] fs: inode split IO and LRU lists
2010-10-08 7:14 ` Christoph Hellwig
@ 2010-10-08 7:38 ` Dave Chinner
0 siblings, 0 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 7:38 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 03:14:17AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 08, 2010 at 04:21:19PM +1100, Dave Chinner wrote:
> > From: Nick Piggin <npiggin@suse.de>
> >
> > The use of the same inode list structure (inode->i_list) for two
> > different list constructs with different lifecycles and purposes
> > makes it impossible to separate the locking of the different
> > operations. Therefore, to enable the separation of the locking of
> > the writeback and reclaim lists, split the inode->i_list into two
> > separate lists dedicated to their specific tracking functions.
>
> > @@ -410,7 +410,11 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> > redirty_tail(inode);
> > } else {
> > /* The inode is clean */
> > - list_move(&inode->i_list, &inode_unused);
> > + list_del_init(&inode->i_io);
> > + if (list_empty(&inode->i_lru)) {
> > + list_add(&inode->i_lru, &inode_unused);
> > + percpu_counter_inc(&nr_inodes_unused);
> > + }
>
> This looks like it belongs into the earlier patch.
I'm not sure it can be moved to an earlier patch. Until the LRU is
separated, we cannot tell what list the inode is on when we get
here. Yes, it means that the nr_inodes_unused counter is probably
broken for a couple of patches in this series. I'll look at it a bit
more, but I don't think it's a huge deal....
> Also instead of
> making nr_inodes_unused non-static a helper to manipulate it might
> be a better idea.
That happens later in the series as more code gets converted to be
identical.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 05/18] fs: inode split IO and LRU lists
2010-10-08 5:21 ` [PATCH 05/18] fs: inode split IO and LRU lists Dave Chinner
2010-10-08 7:14 ` Christoph Hellwig
@ 2010-10-08 9:16 ` Al Viro
2010-10-08 9:58 ` Dave Chinner
1 sibling, 1 reply; 162+ messages in thread
From: Al Viro @ 2010-10-08 9:16 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
> - struct list_head i_list; /* backing dev IO list */
> + struct list_head i_io; /* backing dev IO list */
> + struct list_head i_lru; /* backing dev IO list */
a) that pair of comments would be disqualified in IOCCC ;-)
b) have pity on folks who will have to talk about the code. I mean,
how would you say that? Ai-Ai-Oh?
> +extern struct percpu_counter nr_inodes;
> +extern struct percpu_counter nr_inodes_unused;
Ehh... At least take that to fs/internal.h. Preferably don't expose at
all.
> - list_del(&inode->i_list);
> - list_add(&inode->i_list, &bdi->wb.b_dirty);
> + list_del(&inode->i_io);
> + list_add(&inode->i_io, &bdi->wb.b_dirty);
list_move()? Ditto for the next few. And, while that's not directed
at you, this kind of loop is Not Nice(tm)...
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 05/18] fs: inode split IO and LRU lists
2010-10-08 9:16 ` Al Viro
@ 2010-10-08 9:58 ` Dave Chinner
0 siblings, 0 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 9:58 UTC (permalink / raw)
To: Al Viro; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 10:16:17AM +0100, Al Viro wrote:
> > - struct list_head i_list; /* backing dev IO list */
> > + struct list_head i_io; /* backing dev IO list */
> > + struct list_head i_lru; /* backing dev IO list */
>
> a) that pair of comments would be disqualified in IOCCC ;-)
Oops.
> b) have a pity on folks who will have to talk about the code. I mean,
> how would you say that? Ai-Ai-Oh?
Fair call. How about i_wb_list?
> > +extern struct percpu_counter nr_inodes;
> > +extern struct percpu_counter nr_inodes_unused;
>
> Ehh... At least take that to fs/internal.h. Preferably don't expose at
> all.
That gets cleaned up later with helpers. As Christoph suggested, I need
to move the helpers forward in the series.
>
> > - list_del(&inode->i_list);
> > - list_add(&inode->i_list, &bdi->wb.b_dirty);
> > + list_del(&inode->i_io);
> > + list_add(&inode->i_io, &bdi->wb.b_dirty);
>
> list_move()? Ditto for the next few. And, while that's not directed
> at you, this kind of loop is Not Nice(tm)...
Not a great fan of them myself, but Christoph pointed out that the
inode <-> bdi fix of his that just landed in mainline should remove
the need for these loops.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH 06/18] fs: Clean up inode reference counting
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (4 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 05/18] fs: inode split IO and LRU lists Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:20 ` Christoph Hellwig
2010-10-08 5:21 ` [PATCH 07/18] exofs: use iput() for inode reference count decrements Dave Chinner
` (13 subsequent siblings)
19 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Dave Chinner <dchinner@redhat.com>
Lots of filesystem code open codes the act of getting a reference to
an inode. Factor the open-coded inode lock, increment, unlock sequence
into a function iref(). Then rename __iget to iref_locked so that nothing
is directly incrementing the inode reference count for trivial
operations.
Originally based on a patch from Nick Piggin.
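The helpers themselves are trivial - roughly (the real definitions are
in the fs/inode.c hunk below):

	void iref_locked(struct inode *inode)	/* caller holds inode_lock */
	{
		atomic_inc(&inode->i_count);
	}

	void iref(struct inode *inode)		/* takes inode_lock itself */
	{
		spin_lock(&inode_lock);
		iref_locked(inode);
		spin_unlock(&inode_lock);
	}

so a typical filesystem call site simply changes from
atomic_inc(&inode->i_count) to iref(inode).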
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/9p/vfs_inode.c | 5 +++--
fs/affs/inode.c | 2 +-
fs/afs/dir.c | 2 +-
fs/anon_inodes.c | 5 ++---
fs/bfs/dir.c | 2 +-
fs/block_dev.c | 13 ++++++-------
fs/btrfs/inode.c | 2 +-
fs/coda/dir.c | 2 +-
fs/drop_caches.c | 2 +-
fs/exofs/inode.c | 2 +-
fs/exofs/namei.c | 2 +-
fs/ext2/namei.c | 2 +-
fs/ext3/namei.c | 2 +-
fs/ext4/namei.c | 2 +-
fs/fs-writeback.c | 6 +++---
fs/gfs2/ops_inode.c | 2 +-
fs/hfsplus/dir.c | 2 +-
fs/inode.c | 29 +++++++++++++++++++----------
fs/jffs2/dir.c | 4 ++--
fs/jfs/jfs_txnmgr.c | 2 +-
fs/jfs/namei.c | 2 +-
fs/libfs.c | 2 +-
fs/logfs/dir.c | 2 +-
fs/minix/namei.c | 2 +-
fs/namei.c | 2 +-
fs/nfs/dir.c | 2 +-
fs/nfs/getroot.c | 2 +-
fs/nfs/write.c | 2 +-
fs/nilfs2/namei.c | 2 +-
fs/notify/inode_mark.c | 8 ++++----
fs/ntfs/super.c | 4 ++--
fs/ocfs2/namei.c | 2 +-
fs/quota/dquot.c | 2 +-
fs/reiserfs/namei.c | 2 +-
fs/sysv/namei.c | 2 +-
fs/ubifs/dir.c | 2 +-
fs/udf/namei.c | 2 +-
fs/ufs/namei.c | 2 +-
fs/xfs/linux-2.6/xfs_iops.c | 2 +-
fs/xfs/xfs_inode.h | 2 +-
include/linux/fs.h | 3 ++-
ipc/mqueue.c | 2 +-
kernel/futex.c | 2 +-
mm/shmem.c | 2 +-
net/socket.c | 2 +-
45 files changed, 79 insertions(+), 70 deletions(-)
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 9e670d5..1f76624 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1789,9 +1789,10 @@ v9fs_vfs_link_dotl(struct dentry *old_dentry, struct inode *dir,
kfree(st);
} else {
/* Caching disabled. No need to get upto date stat info.
- * This dentry will be released immediately. So, just i_count++
+ * This dentry will be released immediately. So, just take
+ * a reference.
*/
- atomic_inc(&old_dentry->d_inode->i_count);
+ iref(old_dentry->d_inode);
}
dentry->d_op = old_dentry->d_op;
diff --git a/fs/affs/inode.c b/fs/affs/inode.c
index 3a0fdec..2100852 100644
--- a/fs/affs/inode.c
+++ b/fs/affs/inode.c
@@ -388,7 +388,7 @@ affs_add_entry(struct inode *dir, struct inode *inode, struct dentry *dentry, s3
affs_adjust_checksum(inode_bh, block - be32_to_cpu(chain));
mark_buffer_dirty_inode(inode_bh, inode);
inode->i_nlink = 2;
- atomic_inc(&inode->i_count);
+ iref(inode);
}
affs_fix_checksum(sb, bh);
mark_buffer_dirty_inode(bh, inode);
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 0d38c09..87d8c03 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -1045,7 +1045,7 @@ static int afs_link(struct dentry *from, struct inode *dir,
if (ret < 0)
goto link_error;
- atomic_inc(&vnode->vfs_inode.i_count);
+ iref(&vnode->vfs_inode);
d_instantiate(dentry, &vnode->vfs_inode);
key_put(key);
_leave(" = 0");
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index e4b75d6..55a825f 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -111,10 +111,9 @@ struct file *anon_inode_getfile(const char *name,
path.mnt = mntget(anon_inode_mnt);
/*
* We know the anon_inode inode count is always greater than zero,
- * so we can avoid doing an igrab() and we can use an open-coded
- * atomic_inc().
+ * so we can avoid doing an igrab() by using iref().
*/
- atomic_inc(&anon_inode_inode->i_count);
+ iref(anon_inode_inode);
path.dentry->d_op = &anon_inodefs_dentry_operations;
d_instantiate(path.dentry, anon_inode_inode);
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index d967e05..6e93a37 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -176,7 +176,7 @@ static int bfs_link(struct dentry *old, struct inode *dir,
inc_nlink(inode);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
d_instantiate(new, inode);
mutex_unlock(&info->bfs_lock);
return 0;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index ac070d7..b7d1534 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -550,7 +550,7 @@ EXPORT_SYMBOL(bdget);
*/
struct block_device *bdgrab(struct block_device *bdev)
{
- atomic_inc(&bdev->bd_inode->i_count);
+ iref(bdev->bd_inode);
return bdev;
}
@@ -580,7 +580,7 @@ static struct block_device *bd_acquire(struct inode *inode)
spin_lock(&bdev_lock);
bdev = inode->i_bdev;
if (bdev) {
- atomic_inc(&bdev->bd_inode->i_count);
+ bdgrab(bdev);
spin_unlock(&bdev_lock);
return bdev;
}
@@ -591,12 +591,11 @@ static struct block_device *bd_acquire(struct inode *inode)
spin_lock(&bdev_lock);
if (!inode->i_bdev) {
/*
- * We take an additional bd_inode->i_count for inode,
- * and it's released in clear_inode() of inode.
- * So, we can access it via ->i_mapping always
- * without igrab().
+ * We take an additional bdev reference here so
+ * we can access it via ->i_mapping always
+ * without first needing to grab a reference.
*/
- atomic_inc(&bdev->bd_inode->i_count);
+ bdgrab(bdev);
inode->i_bdev = bdev;
inode->i_mapping = bdev->bd_inode->i_mapping;
list_add(&inode->i_devices, &bdev->bd_inodes);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c646c0c..0c3a35b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4758,7 +4758,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
}
btrfs_set_trans_block_group(trans, dir);
- atomic_inc(&inode->i_count);
+ iref(inode);
err = btrfs_add_nondir(trans, dentry, inode, 1, index);
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index ccd98b0..ac8b913 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -303,7 +303,7 @@ static int coda_link(struct dentry *source_de, struct inode *dir_inode,
}
coda_dir_update_mtime(dir_inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
d_instantiate(de, inode);
inc_nlink(inode);
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 2195c21..c4f3e06 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -22,7 +22,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
continue;
if (inode->i_mapping->nrpages == 0)
continue;
- __iget(inode);
+ iref_locked(inode);
spin_unlock(&inode_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
iput(toput_inode);
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index eb7368e..b631ff3 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1154,7 +1154,7 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)
/* increment the refcount so that the inode will still be around when we
* reach the callback
*/
- atomic_inc(&inode->i_count);
+ iref(inode);
ios->done = create_done;
ios->private = inode;
diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
index b7dd0c2..f2a30a0 100644
--- a/fs/exofs/namei.c
+++ b/fs/exofs/namei.c
@@ -153,7 +153,7 @@ static int exofs_link(struct dentry *old_dentry, struct inode *dir,
inode->i_ctime = CURRENT_TIME;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
return exofs_add_nondir(dentry, inode);
}
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 71efb0e..b15435f 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -206,7 +206,7 @@ static int ext2_link (struct dentry * old_dentry, struct inode * dir,
inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
err = ext2_add_link(dentry, inode);
if (!err) {
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 2b35ddb..6c7a5d6 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -2260,7 +2260,7 @@ retry:
inode->i_ctime = CURRENT_TIME_SEC;
inc_nlink(inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
err = ext3_add_entry(handle, dentry, inode);
if (!err) {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 314c0d3..a406a85 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2312,7 +2312,7 @@ retry:
inode->i_ctime = ext4_current_time(inode);
ext4_inc_count(handle, inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
err = ext4_add_entry(handle, dentry, inode);
if (!err) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 78aaaa8..1bf8a28 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -297,7 +297,7 @@ static void inode_wait_for_writeback(struct inode *inode)
/*
* Write out an inode's dirty pages. Called under inode_lock. Either the
- * caller has ref on the inode (either via __iget or via syscall against an fd)
+ * caller has ref on the inode (either via iref_locked or via syscall against an fd)
* or the inode has I_WILL_FREE set (via generic_forget_inode)
*
* If `wait' is set, wait on the writeout.
@@ -496,7 +496,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
return 1;
BUG_ON(inode->i_state & I_FREEING);
- __iget(inode);
+ iref_locked(inode);
pages_skipped = wbc->pages_skipped;
writeback_single_inode(inode, wbc);
if (wbc->pages_skipped != pages_skipped) {
@@ -1042,7 +1042,7 @@ static void wait_sb_inodes(struct super_block *sb)
mapping = inode->i_mapping;
if (mapping->nrpages == 0)
continue;
- __iget(inode);
+ iref_locked(inode);
spin_unlock(&inode_lock);
/*
* We hold a reference to 'inode' so it couldn't have
diff --git a/fs/gfs2/ops_inode.c b/fs/gfs2/ops_inode.c
index 1009be2..508407d 100644
--- a/fs/gfs2/ops_inode.c
+++ b/fs/gfs2/ops_inode.c
@@ -253,7 +253,7 @@ out_parent:
gfs2_holder_uninit(ghs);
gfs2_holder_uninit(ghs + 1);
if (!error) {
- atomic_inc(&inode->i_count);
+ iref(inode);
d_instantiate(dentry, inode);
mark_inode_dirty(inode);
}
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 764fd1b..e2ce54d 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -301,7 +301,7 @@ static int hfsplus_link(struct dentry *src_dentry, struct inode *dst_dir,
inc_nlink(inode);
hfsplus_instantiate(dst_dentry, inode, cnid);
- atomic_inc(&inode->i_count);
+ iref(inode);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
HFSPLUS_SB(sb).file_count++;
diff --git a/fs/inode.c b/fs/inode.c
index 98f8963..aa66e07 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -313,11 +313,20 @@ static void init_once(void *foo)
inode_init_once(inode);
}
+EXPORT_SYMBOL_GPL(iref_locked);
+
+void iref(struct inode *inode)
+{
+ spin_lock(&inode_lock);
+ iref_locked(inode);
+ spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL_GPL(iref);
/*
* inode_lock must be held
*/
-void __iget(struct inode *inode)
+void iref_locked(struct inode *inode)
{
atomic_inc(&inode->i_count);
}
@@ -499,7 +508,7 @@ static void prune_icache(int nr_to_scan)
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
- __iget(inode);
+ iref_locked(inode);
spin_unlock(&inode_lock);
if (remove_inode_buffers(inode))
reap += invalidate_mapping_pages(&inode->i_data,
@@ -565,7 +574,7 @@ static struct shrinker icache_shrinker = {
static void __wait_on_freeing_inode(struct inode *inode);
/*
* Called with the inode lock held.
- * NOTE: we are not increasing the inode-refcount, you must call __iget()
+ * NOTE: we are not increasing the inode-refcount, you must call iref_locked()
* by hand after calling find_inode now! This simplifies iunique and won't
* add any additional branch in the common code.
*/
@@ -769,7 +778,7 @@ static struct inode *get_new_inode(struct super_block *sb,
* us. Use the old inode instead of the one we just
* allocated.
*/
- __iget(old);
+ iref_locked(old);
spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
@@ -816,7 +825,7 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
* us. Use the old inode instead of the one we just
* allocated.
*/
- __iget(old);
+ iref_locked(old);
spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
@@ -869,7 +878,7 @@ struct inode *igrab(struct inode *inode)
{
spin_lock(&inode_lock);
if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
- __iget(inode);
+ iref_locked(inode);
else
/*
* Handle the case where s_op->clear_inode is not been
@@ -910,7 +919,7 @@ static struct inode *ifind(struct super_block *sb,
spin_lock(&inode_lock);
inode = find_inode(sb, head, test, data);
if (inode) {
- __iget(inode);
+ iref_locked(inode);
spin_unlock(&inode_lock);
if (likely(wait))
wait_on_inode(inode);
@@ -943,7 +952,7 @@ static struct inode *ifind_fast(struct super_block *sb,
spin_lock(&inode_lock);
inode = find_inode_fast(sb, head, ino);
if (inode) {
- __iget(inode);
+ iref_locked(inode);
spin_unlock(&inode_lock);
wait_on_inode(inode);
return inode;
@@ -1126,7 +1135,7 @@ int insert_inode_locked(struct inode *inode)
spin_unlock(&inode_lock);
return 0;
}
- __iget(old);
+ iref_locked(old);
spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1165,7 +1174,7 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
spin_unlock(&inode_lock);
return 0;
}
- __iget(old);
+ iref_locked(old);
spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_unhashed(&old->i_hash))) {
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index ed78a3c..797a034 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -289,7 +289,7 @@ static int jffs2_link (struct dentry *old_dentry, struct inode *dir_i, struct de
mutex_unlock(&f->sem);
d_instantiate(dentry, old_dentry->d_inode);
dir_i->i_mtime = dir_i->i_ctime = ITIME(now);
- atomic_inc(&old_dentry->d_inode->i_count);
+ iref(old_dentry->d_inode);
}
return ret;
}
@@ -864,7 +864,7 @@ static int jffs2_rename (struct inode *old_dir_i, struct dentry *old_dentry,
printk(KERN_NOTICE "jffs2_rename(): Link succeeded, unlink failed (err %d). You now have a hard link\n", ret);
/* Might as well let the VFS know */
d_instantiate(new_dentry, old_dentry->d_inode);
- atomic_inc(&old_dentry->d_inode->i_count);
+ iref(old_dentry->d_inode);
new_dir_i->i_mtime = new_dir_i->i_ctime = ITIME(now);
return ret;
}
diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c
index d945ea7..3e6dd08 100644
--- a/fs/jfs/jfs_txnmgr.c
+++ b/fs/jfs/jfs_txnmgr.c
@@ -1279,7 +1279,7 @@ int txCommit(tid_t tid, /* transaction identifier */
* lazy commit thread finishes processing
*/
if (tblk->xflag & COMMIT_DELETE) {
- atomic_inc(&tblk->u.ip->i_count);
+ iref(tblk->u.ip);
/*
* Avoid a rare deadlock
*
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index a9cf8e8..3d3566e 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -839,7 +839,7 @@ static int jfs_link(struct dentry *old_dentry,
ip->i_ctime = CURRENT_TIME;
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
mark_inode_dirty(dir);
- atomic_inc(&ip->i_count);
+ iref(ip);
iplist[0] = ip;
iplist[1] = dir;
diff --git a/fs/libfs.c b/fs/libfs.c
index 0a9da95..f190d73 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -255,7 +255,7 @@ int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *den
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
inc_nlink(inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
dget(dentry);
d_instantiate(dentry, inode);
return 0;
diff --git a/fs/logfs/dir.c b/fs/logfs/dir.c
index 9777eb5..8522edc 100644
--- a/fs/logfs/dir.c
+++ b/fs/logfs/dir.c
@@ -569,7 +569,7 @@ static int logfs_link(struct dentry *old_dentry, struct inode *dir,
return -EMLINK;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- atomic_inc(&inode->i_count);
+ iref(inode);
inode->i_nlink++;
mark_inode_dirty_sync(inode);
diff --git a/fs/minix/namei.c b/fs/minix/namei.c
index f3f3578..7563a82 100644
--- a/fs/minix/namei.c
+++ b/fs/minix/namei.c
@@ -101,7 +101,7 @@ static int minix_link(struct dentry * old_dentry, struct inode * dir,
inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
return add_nondir(dentry, inode);
}
diff --git a/fs/namei.c b/fs/namei.c
index 24896e8..5fb93f3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2291,7 +2291,7 @@ static long do_unlinkat(int dfd, const char __user *pathname)
goto slashes;
inode = dentry->d_inode;
if (inode)
- atomic_inc(&inode->i_count);
+ iref(inode);
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit2;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index e257172..5482ede 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1580,7 +1580,7 @@ nfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
d_drop(dentry);
error = NFS_PROTO(dir)->link(inode, dir, &dentry->d_name);
if (error == 0) {
- atomic_inc(&inode->i_count);
+ iref(inode);
d_add(dentry, inode);
}
return error;
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index a70e446..5aaa2be 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -55,7 +55,7 @@ static int nfs_superblock_set_dummy_root(struct super_block *sb, struct inode *i
return -ENOMEM;
}
/* Circumvent igrab(): we know the inode is not being freed */
- atomic_inc(&inode->i_count);
+ iref(inode);
/*
* Ensure that this dentry is invisible to d_find_alias().
* Otherwise, it may be spliced into the tree by
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index a8baf4b..75bc1a3 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -390,7 +390,7 @@ static int nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
error = radix_tree_insert(&nfsi->nfs_page_tree, req->wb_index, req);
BUG_ON(error);
if (!nfsi->npages) {
- igrab(inode);
+ iref(inode);
if (nfs_have_delegation(inode, FMODE_WRITE))
nfsi->change_attr++;
}
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index ad6ed2c..fbd3348 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -219,7 +219,7 @@ static int nilfs_link(struct dentry *old_dentry, struct inode *dir,
inode->i_ctime = CURRENT_TIME;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
err = nilfs_add_nondir(dentry, inode);
if (!err)
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 33297c0..8096a9e 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -244,7 +244,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
struct inode *need_iput_tmp;
/*
- * We cannot __iget() an inode in state I_FREEING,
+ * We cannot iref() an inode in state I_FREEING,
* I_WILL_FREE, or I_NEW which is fine because by that point
* the inode cannot have any associated watches.
*/
@@ -253,7 +253,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
/*
* If i_count is zero, the inode cannot have any watches and
- * doing an __iget/iput with MS_ACTIVE clear would actually
+ * doing an iref/iput with MS_ACTIVE clear would actually
* evict all inodes with zero i_count from icache which is
* unnecessarily violent and may in fact be illegal to do.
*/
@@ -265,7 +265,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
/* In case fsnotify_inode_delete() drops a reference. */
if (inode != need_iput_tmp)
- __iget(inode);
+ iref_locked(inode);
else
need_iput_tmp = NULL;
@@ -273,7 +273,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
if ((&next_i->i_sb_list != list) &&
atomic_read(&next_i->i_count) &&
!(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
- __iget(next_i);
+ iref_locked(next_i);
need_iput = next_i;
}
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index 5128061..52b48e3 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -2929,8 +2929,8 @@ static int ntfs_fill_super(struct super_block *sb, void *opt, const int silent)
goto unl_upcase_iput_tmp_ino_err_out_now;
}
if ((sb->s_root = d_alloc_root(vol->root_ino))) {
- /* We increment i_count simulating an ntfs_iget(). */
- atomic_inc(&vol->root_ino->i_count);
+ /* Simulate an ntfs_iget() call */
+ iref(vol->root_ino);
ntfs_debug("Exiting, status successful.");
/* Release the default upcase if it has no users. */
mutex_lock(&ntfs_lock);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index a00dda2..0e002f6 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -741,7 +741,7 @@ static int ocfs2_link(struct dentry *old_dentry,
goto out_commit;
}
- atomic_inc(&inode->i_count);
+ iref(inode);
dentry->d_op = &ocfs2_dentry_ops;
d_instantiate(dentry, inode);
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index aad1316..5199418 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -909,7 +909,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
if (!dqinit_needed(inode, type))
continue;
- __iget(inode);
+ iref_locked(inode);
spin_unlock(&inode_lock);
iput(old_inode);
diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c
index ee78d4a..f19bb3d 100644
--- a/fs/reiserfs/namei.c
+++ b/fs/reiserfs/namei.c
@@ -1156,7 +1156,7 @@ static int reiserfs_link(struct dentry *old_dentry, struct inode *dir,
inode->i_ctime = CURRENT_TIME_SEC;
reiserfs_update_sd(&th, inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
d_instantiate(dentry, inode);
retval = journal_end(&th, dir->i_sb, jbegin_count);
reiserfs_write_unlock(dir->i_sb);
diff --git a/fs/sysv/namei.c b/fs/sysv/namei.c
index 33e047b..765974f 100644
--- a/fs/sysv/namei.c
+++ b/fs/sysv/namei.c
@@ -126,7 +126,7 @@ static int sysv_link(struct dentry * old_dentry, struct inode * dir,
inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
return add_nondir(dentry, inode);
}
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index d669260..6a6393b 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -550,7 +550,7 @@ static int ubifs_link(struct dentry *old_dentry, struct inode *dir,
lock_2_inodes(dir, inode);
inc_nlink(inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
inode->i_ctime = ubifs_current_time(inode);
dir->i_size += sz_change;
dir_ui->ui_size = dir->i_size;
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index bf5fc67..f6e232a 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -1101,7 +1101,7 @@ static int udf_link(struct dentry *old_dentry, struct inode *dir,
inc_nlink(inode);
inode->i_ctime = current_fs_time(inode->i_sb);
mark_inode_dirty(inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
d_instantiate(dentry, inode);
unlock_kernel();
diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
index b056f02..2a598eb 100644
--- a/fs/ufs/namei.c
+++ b/fs/ufs/namei.c
@@ -180,7 +180,7 @@ static int ufs_link (struct dentry * old_dentry, struct inode * dir,
inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ iref(inode);
error = ufs_add_nondir(dentry, inode);
unlock_kernel();
diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index b1fc2a6..b7ec465 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -352,7 +352,7 @@ xfs_vn_link(
if (unlikely(error))
return -error;
- atomic_inc(&inode->i_count);
+ iref(inode);
d_instantiate(dentry, inode);
return 0;
}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0898c54..cbb4791 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -482,7 +482,7 @@ void xfs_mark_inode_dirty_sync(xfs_inode_t *);
#define IHOLD(ip) \
do { \
ASSERT(atomic_read(&VFS_I(ip)->i_count) > 0) ; \
- atomic_inc(&(VFS_I(ip)->i_count)); \
+ iref(VFS_I(ip)); \
trace_xfs_ihold(ip, _THIS_IP_); \
} while (0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11c7ad4..2e971f2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2184,7 +2184,8 @@ extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struc
extern int insert_inode_locked(struct inode *);
extern void unlock_new_inode(struct inode *);
-extern void __iget(struct inode * inode);
+extern void iref(struct inode *inode);
+extern void iref_locked(struct inode *inode);
extern void iget_failed(struct inode *);
extern void end_writeback(struct inode *);
extern void destroy_inode(struct inode *);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c60e519..d53a2c1 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -769,7 +769,7 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
inode = dentry->d_inode;
if (inode)
- atomic_inc(&inode->i_count);
+ iref(inode);
err = mnt_want_write(ipc_ns->mq_mnt);
if (err)
goto out_err;
diff --git a/kernel/futex.c b/kernel/futex.c
index 6a3a5fa..3bb418c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -168,7 +168,7 @@ static void get_futex_key_refs(union futex_key *key)
switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
case FUT_OFF_INODE:
- atomic_inc(&key->shared.inode->i_count);
+ iref(key->shared.inode);
break;
case FUT_OFF_MMSHARED:
atomic_inc(&key->private.mm->mm_count);
diff --git a/mm/shmem.c b/mm/shmem.c
index fbee46d..4daaa24 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1903,7 +1903,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
dir->i_size += BOGO_DIRENT_SIZE;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
inc_nlink(inode);
- atomic_inc(&inode->i_count); /* New dentry reference */
+ iref(inode);
dget(dentry); /* Extra pinning count for the created dentry */
d_instantiate(dentry, inode);
out:
diff --git a/net/socket.c b/net/socket.c
index 2270b94..715ca57 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -377,7 +377,7 @@ static int sock_alloc_file(struct socket *sock, struct file **f, int flags)
&socket_file_ops);
if (unlikely(!file)) {
/* drop dentry, keep inode */
- atomic_inc(&path.dentry->d_inode->i_count);
+ iref(path.dentry->d_inode);
path_put(&path);
put_unused_fd(fd);
return -ENFILE;
--
1.7.1
* Re: [PATCH 06/18] fs: Clean up inode reference counting
2010-10-08 5:21 ` [PATCH 06/18] fs: Clean up inode reference counting Dave Chinner
@ 2010-10-08 7:20 ` Christoph Hellwig
2010-10-08 7:46 ` Dave Chinner
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 7:20 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:20PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Lots of filesystem code open codes the act of getting a reference to
> an inode. Factor the open coded inode lock, increment, unlock into
> a function iref(). Then rename __iget to iref_locked so that nothing
> is directly incrementing the inode reference count for trivial
> operations.
>
> Originally based on a patch from Nick Piggin.
> +++ b/fs/anon_inodes.c
> @@ -111,10 +111,9 @@ struct file *anon_inode_getfile(const char *name,
> path.mnt = mntget(anon_inode_mnt);
> /*
> * We know the anon_inode inode count is always greater than zero,
> - * so we can avoid doing an igrab() and we can use an open-coded
> - * atomic_inc().
> + * so we can avoid doing an igrab() by using iref().
I don't think there's a point keeping this comment.
> @@ -297,7 +297,7 @@ static void inode_wait_for_writeback(struct inode *inode)
>
> /*
> * Write out an inode's dirty pages. Called under inode_lock. Either the
> - * caller has ref on the inode (either via __iget or via syscall against an fd)
> + * caller has ref on the inode (either via iref_locked or via syscall against an fd)
I'd say just drop the mentioning of how we got a reference to the inode,
it's just too confusing in this context.
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -313,11 +313,20 @@ static void init_once(void *foo)
>
> inode_init_once(inode);
> }
> +EXPORT_SYMBOL_GPL(iref_locked);
I think the export is placed incorrectly here.
> +
> +void iref(struct inode *inode)
> +{
> + spin_lock(&inode_lock);
> + iref_locked(inode);
> + spin_unlock(&inode_lock);
> +}
> +EXPORT_SYMBOL_GPL(iref);
> +void iref_locked(struct inode *inode)
> {
> atomic_inc(&inode->i_count);
> }
Please add a kerneldoc comment for both exported functions.
Also what's the point of taking inode_lock in iref when the only thing
we do is an atomic_inc? It's probably better only having iref for now
and only introducing iref_locked once the non-atomic increment needs
i_lock.
Also any chance to get an assert under a debug option that the reference
count really is non-zero?
* Re: [PATCH 06/18] fs: Clean up inode reference counting
2010-10-08 7:20 ` Christoph Hellwig
@ 2010-10-08 7:46 ` Dave Chinner
2010-10-08 8:15 ` Christoph Hellwig
0 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 7:46 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 03:20:51AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 08, 2010 at 04:21:20PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Lots of filesystem code open codes the act of getting a reference to
> > an inode. Factor the open coded inode lock, increment, unlock into
> > a function iref(). Then rename __iget to iref_locked so that nothing
> > is directly incrementing the inode reference count for trivial
> > operations.
> >
> > Originally based on a patch from Nick Piggin.
>
> > +++ b/fs/anon_inodes.c
> > @@ -111,10 +111,9 @@ struct file *anon_inode_getfile(const char *name,
> > path.mnt = mntget(anon_inode_mnt);
> > /*
> > * We know the anon_inode inode count is always greater than zero,
> > - * so we can avoid doing an igrab() and we can use an open-coded
> > - * atomic_inc().
> > + * so we can avoid doing an igrab() by using iref().
>
> I don't think there's a point keeping this comment.
OK.
>
> > @@ -297,7 +297,7 @@ static void inode_wait_for_writeback(struct inode *inode)
> >
> > /*
> > * Write out an inode's dirty pages. Called under inode_lock. Either the
> > - * caller has ref on the inode (either via __iget or via syscall against an fd)
> > + * caller has ref on the inode (either via iref_locked or via syscall against an fd)
>
> I'd say just drop the mentioning of how we got a reference to the inode,
OK.
> it's just too confusing in this context.
>
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -313,11 +313,20 @@ static void init_once(void *foo)
> >
> > inode_init_once(inode);
> > }
> > +EXPORT_SYMBOL_GPL(iref_locked);
>
> I think the export is placed incorrectly here.
Fmeh - guilt has an annoying habit of applying patches silently
when there are context mismatches. I've fixed this mismatch about 5
times in the past 2 days, and it keeps creeping back in as I update
patches earlier in the series. I'll fix it up in the next pass.
> > +
> > +void iref(struct inode *inode)
> > +{
> > + spin_lock(&inode_lock);
> > + iref_locked(inode);
> > + spin_unlock(&inode_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(iref);
>
>
> > +void iref_locked(struct inode *inode)
> > {
> > atomic_inc(&inode->i_count);
> > }
>
> Please add a kerneldoc comment for both exported functions.
OK.
> Also what's the point of taking inode_lock in iref when the only thing
> we do is an atomic_inc? It's probably better only having iref for now
> and only introducing iref_locked once the non-atomic increment needs
> i_lock.
Because in the next couple of patches the atomic-ness goes away, and
the inode lock keeps everything "sane" until all the locking
conversion is completed.
> Also any chance to get an assert under a debug option that the reference
> count really is non-zero?
For iref()? Sure, though I think WARN_ON_ONCE() is better for the
moment.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 06/18] fs: Clean up inode reference counting
2010-10-08 7:46 ` Dave Chinner
@ 2010-10-08 8:15 ` Christoph Hellwig
0 siblings, 0 replies; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 8:15 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 06:46:43PM +1100, Dave Chinner wrote:
> > Also any chance to get an assert under a debug option that the reference
> > count really is non-zero?
>
> For iref()? Sure, though I think WARN_ON_ONCE() is better for the
> moment.
I don't think a WARN_ON_ONCE is too helpful - there could be all kinds
of different filesystems having that issue. Also I think a plain
WARN_ON is cheaper than a WARN_ON_ONCE and this is a rather hot
codepath.
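Putting the requests in this exchange together, a minimal sketch of what
iref() might look like at this stage of the series, with a kerneldoc
comment and a cheap refcount sanity check, is below. The body mirrors the
patch above; the WARN_ON placement and whether it should hide behind a
debug config are illustrative only, not what was actually merged:

/**
 * iref - take an extra reference to an inode
 * @inode:	inode to take a reference to
 *
 * The caller must already hold a reference to the inode, so the
 * reference count can never be seen at zero here.
 */
void iref(struct inode *inode)
{
	WARN_ON(atomic_read(&inode->i_count) < 1);

	spin_lock(&inode_lock);
	iref_locked(inode);
	spin_unlock(&inode_lock);
}
EXPORT_SYMBOL_GPL(iref);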
* [PATCH 07/18] exofs: use iput() for inode reference count decrements
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (5 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 06/18] fs: Clean up inode reference counting Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:21 ` Christoph Hellwig
2010-10-16 7:56 ` Nick Piggin
2010-10-08 5:21 ` [PATCH 08/18] fs: add inode reference count read accessor Dave Chinner
` (12 subsequent siblings)
19 siblings, 2 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Dave Chinner <dchinner@redhat.com>
Direct modification of the inode reference count is a no-no. Convert
the exofs decrements to call iput() instead of acting directly on
i_count.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/exofs/inode.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index b631ff3..0fb4d4c 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1101,7 +1101,7 @@ static void create_done(struct exofs_io_state *ios, void *p)
set_obj_created(oi);
- atomic_dec(&inode->i_count);
+ iput(inode);
wake_up(&oi->i_wq);
}
@@ -1161,7 +1161,7 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)
ios->cred = oi->i_cred;
ret = exofs_sbi_create(ios);
if (ret) {
- atomic_dec(&inode->i_count);
+ iput(inode);
exofs_put_io_state(ios);
return ERR_PTR(ret);
}
--
1.7.1
* Re: [PATCH 07/18] exofs: use iput() for inode reference count decrements
2010-10-08 5:21 ` [PATCH 07/18] exofs: use iput() for inode reference count decrements Dave Chinner
@ 2010-10-08 7:21 ` Christoph Hellwig
2010-10-16 7:56 ` Nick Piggin
1 sibling, 0 replies; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 7:21 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:21PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Direct modification of the inode reference count is a no-no. Convert
> the exofs decrements to call iput() instead of acting directly on
> i_count.
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
The real question is why exofs got away with this for so long.
* Re: [PATCH 07/18] exofs: use iput() for inode reference count decrements
2010-10-08 5:21 ` [PATCH 07/18] exofs: use iput() for inode reference count decrements Dave Chinner
2010-10-08 7:21 ` Christoph Hellwig
@ 2010-10-16 7:56 ` Nick Piggin
2010-10-16 16:29 ` Christoph Hellwig
1 sibling, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-16 7:56 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:21PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Direct modification of the inode reference count is a no-no. Convert
> the exofs decrements to call iput() instead of acting directly on
> i_count.
Could this go to exofs maintainer and get merged as a bugfix?
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/exofs/inode.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
> index b631ff3..0fb4d4c 100644
> --- a/fs/exofs/inode.c
> +++ b/fs/exofs/inode.c
> @@ -1101,7 +1101,7 @@ static void create_done(struct exofs_io_state *ios, void *p)
>
> set_obj_created(oi);
>
> - atomic_dec(&inode->i_count);
> + iput(inode);
> wake_up(&oi->i_wq);
> }
>
* Re: [PATCH 07/18] exofs: use iput() for inode reference count decrements
2010-10-16 7:56 ` Nick Piggin
@ 2010-10-16 16:29 ` Christoph Hellwig
2010-10-17 15:41 ` Boaz Harrosh
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-16 16:29 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 06:56:50PM +1100, Nick Piggin wrote:
> On Fri, Oct 08, 2010 at 04:21:21PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Direct modification of the inode reference count is a no-no. Convert
> > the exofs decrements to call iput() instead of acting directly on
> > i_count.
>
> Could this go to exofs maintainer and get merged as a bugfix?
I already pinged Boaz.
* Re: [PATCH 07/18] exofs: use iput() for inode reference count decrements
2010-10-16 16:29 ` Christoph Hellwig
@ 2010-10-17 15:41 ` Boaz Harrosh
0 siblings, 0 replies; 162+ messages in thread
From: Boaz Harrosh @ 2010-10-17 15:41 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On 10/16/2010 06:29 PM, Christoph Hellwig wrote:
> On Sat, Oct 16, 2010 at 06:56:50PM +1100, Nick Piggin wrote:
>> On Fri, Oct 08, 2010 at 04:21:21PM +1100, Dave Chinner wrote:
>>> From: Dave Chinner <dchinner@redhat.com>
>>>
>>> Direct modification of the inode reference count is a no-no. Convert
>>> the exofs decrements to call iput() instead of acting directly on
>>> i_count.
>>
>> Could this go to exofs maintainer and get merged as a bugfix?
>
> I already pinged Boaz.
>
Thanks guys
It'll be in linux-next in a few minutes, for the 2.6.37 merge window.
I'm also testing if it could be removed altogether. If I find
that it is needed after all, I'll also CC stable@.
Boaz
* [PATCH 08/18] fs: add inode reference count read accessor
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (6 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 07/18] exofs: use iput() for inode reference count decrements Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:24 ` Christoph Hellwig
2010-10-08 5:21 ` [PATCH 09/18] fs: rework icount to be a locked variable Dave Chinner
` (11 subsequent siblings)
19 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Dave Chinner <dchinner@redhat.com>
To remove most of the remaining direct references to the inode
reference count, add an iref_read() accessor function to read the
current reference count. New users of this function should be
frowned upon, as there is rarely a good reason for looking at the
current reference count.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
arch/powerpc/platforms/cell/spufs/file.c | 2 +-
drivers/staging/pohmelfs/inode.c | 10 +++++-----
fs/btrfs/inode.c | 6 +++---
fs/ceph/mds_client.c | 2 +-
fs/cifs/inode.c | 2 +-
fs/ext3/ialloc.c | 4 ++--
fs/ext4/ialloc.c | 4 ++--
fs/fs-writeback.c | 2 +-
fs/hpfs/inode.c | 2 +-
fs/inode.c | 10 ++++++++++
fs/locks.c | 2 +-
fs/logfs/readwrite.c | 2 +-
fs/nfs/inode.c | 4 ++--
fs/notify/inode_mark.c | 11 +++++------
fs/reiserfs/stree.c | 2 +-
fs/smbfs/inode.c | 2 +-
fs/ubifs/super.c | 2 +-
fs/xfs/linux-2.6/xfs_trace.h | 2 +-
fs/xfs/xfs_inode.h | 2 +-
include/linux/fs.h | 1 +
20 files changed, 42 insertions(+), 32 deletions(-)
diff --git a/arch/powerpc/platforms/cell/spufs/file.c b/arch/powerpc/platforms/cell/spufs/file.c
index 1a40da9..2e4263c 100644
--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -1549,7 +1549,7 @@ static int spufs_mfc_open(struct inode *inode, struct file *file)
if (ctx->owner != current->mm)
return -EINVAL;
- if (atomic_read(&inode->i_count) != 1)
+ if (iref_read(inode) != 1)
return -EBUSY;
mutex_lock(&ctx->mapping_lock);
diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 97dae29..d8a308d 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -1289,11 +1289,11 @@ static void pohmelfs_put_super(struct super_block *sb)
dprintk("%s: ino: %llu, pi: %p, inode: %p, count: %u.\n",
__func__, pi->ino, pi, inode, count);
- if (atomic_read(&inode->i_count) != count) {
+ if (iref_read(inode) != count) {
printk("%s: ino: %llu, pi: %p, inode: %p, count: %u, i_count: %d.\n",
__func__, pi->ino, pi, inode, count,
- atomic_read(&inode->i_count));
- count = atomic_read(&inode->i_count);
+ iref_read(inode));
+ count = iref_read(inode);
in_drop_list++;
}
@@ -1305,7 +1305,7 @@ static void pohmelfs_put_super(struct super_block *sb)
pi = POHMELFS_I(inode);
dprintk("%s: ino: %llu, pi: %p, inode: %p, i_count: %u.\n",
- __func__, pi->ino, pi, inode, atomic_read(&inode->i_count));
+ __func__, pi->ino, pi, inode, iref_read(inode));
/*
* These are special inodes, they were created during
@@ -1313,7 +1313,7 @@ static void pohmelfs_put_super(struct super_block *sb)
* so they live here with reference counter being 1 and prevent
* umount from succeed since it believes that they are busy.
*/
- count = atomic_read(&inode->i_count);
+ count = iref_read(inode);
if (count) {
list_del_init(&inode->i_sb_list);
while (count--)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0c3a35b..2953e9f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2718,10 +2718,10 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir,
return ERR_PTR(-ENOSPC);
/* check if there is someone else holds reference */
- if (S_ISDIR(inode->i_mode) && atomic_read(&inode->i_count) > 1)
+ if (S_ISDIR(inode->i_mode) && iref_read(inode) > 1)
return ERR_PTR(-ENOSPC);
- if (atomic_read(&inode->i_count) > 2)
+ if (iref_read(inode) > 2)
return ERR_PTR(-ENOSPC);
if (xchg(&root->fs_info->enospc_unlink, 1))
@@ -3939,7 +3939,7 @@ again:
inode = igrab(&entry->vfs_inode);
if (inode) {
spin_unlock(&root->inode_lock);
- if (atomic_read(&inode->i_count) > 1)
+ if (iref_read(inode) > 1)
d_prune_aliases(inode);
/*
* btrfs_drop_inode will have it removed from
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index fad95f8..b6d0ef1 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1102,7 +1102,7 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
spin_unlock(&inode->i_lock);
d_prune_aliases(inode);
dout("trim_caps_cb %p cap %p pruned, count now %d\n",
- inode, cap, atomic_read(&inode->i_count));
+ inode, cap, iref_read(inode));
return 0;
}
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 63a0bdb..74cb762 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1641,7 +1641,7 @@ int cifs_revalidate_dentry(struct dentry *dentry)
}
cFYI(1, "Revalidate: %s inode 0x%p count %d dentry: 0x%p d_time %ld "
- "jiffies %ld", full_path, inode, inode->i_count.counter,
+ "jiffies %ld", full_path, inode, iref_read(inode),
dentry, dentry->d_time, jiffies);
if (CIFS_SB(sb)->tcon->unix_ext)
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index 4ab72db..64669aa 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -100,9 +100,9 @@ void ext3_free_inode (handle_t *handle, struct inode * inode)
struct ext3_sb_info *sbi;
int fatal = 0, err;
- if (atomic_read(&inode->i_count) > 1) {
+ if (iref_read(inode) > 1) {
printk ("ext3_free_inode: inode has count=%d\n",
- atomic_read(&inode->i_count));
+ iref_read(inode));
return;
}
if (inode->i_nlink) {
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 45853e0..38ac6e5 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -189,9 +189,9 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
struct ext4_sb_info *sbi;
int fatal = 0, err, count, cleared;
- if (atomic_read(&inode->i_count) > 1) {
+ if (iref_read(inode) > 1) {
printk(KERN_ERR "ext4_free_inode: inode has count=%d\n",
- atomic_read(&inode->i_count));
+ iref_read(inode));
return;
}
if (inode->i_nlink) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1bf8a28..ec7a689 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -315,7 +315,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
unsigned dirty;
int ret;
- if (!atomic_read(&inode->i_count))
+ if (!iref_read(inode))
WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
else
WARN_ON(inode->i_state & I_WILL_FREE);
diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c
index 56f0da1..05b5d79 100644
--- a/fs/hpfs/inode.c
+++ b/fs/hpfs/inode.c
@@ -183,7 +183,7 @@ void hpfs_write_inode(struct inode *i)
struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
struct inode *parent;
if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
- if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) {
+ if (hpfs_inode->i_rddir_off && !iref_read(i)) {
if (*hpfs_inode->i_rddir_off) printk("HPFS: write_inode: some position still there\n");
kfree(hpfs_inode->i_rddir_off);
hpfs_inode->i_rddir_off = NULL;
diff --git a/fs/inode.c b/fs/inode.c
index aa66e07..b1dc6dc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -331,6 +331,16 @@ void iref_locked(struct inode *inode)
atomic_inc(&inode->i_count);
}
+/*
+ * Nobody outside of core code should really be looking at the inode reference
+ * count. Please don't add new users of this function.
+ */
+int iref_read(struct inode *inode)
+{
+ return atomic_read(&inode->i_count);
+}
+EXPORT_SYMBOL_GPL(iref_read);
+
void end_writeback(struct inode *inode)
{
might_sleep();
diff --git a/fs/locks.c b/fs/locks.c
index ab24d49..cbf3114 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1376,7 +1376,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
goto out;
if ((arg == F_WRLCK)
&& ((atomic_read(&dentry->d_count) > 1)
- || (atomic_read(&inode->i_count) > 1)))
+ || (iref_read(inode) > 1)))
goto out;
}
diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
index 6127baf..8beb842 100644
--- a/fs/logfs/readwrite.c
+++ b/fs/logfs/readwrite.c
@@ -1002,7 +1002,7 @@ static int __logfs_is_valid_block(struct inode *inode, u64 bix, u64 ofs)
{
struct logfs_inode *li = logfs_inode(inode);
- if ((inode->i_nlink == 0) && atomic_read(&inode->i_count) == 1)
+ if ((inode->i_nlink == 0) && iref_read(inode) == 1)
return 0;
if (bix < I0_BLOCKS)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 886be68..387f4dc 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -385,7 +385,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
dprintk("NFS: nfs_fhget(%s/%Ld ct=%d)\n",
inode->i_sb->s_id,
(long long)NFS_FILEID(inode),
- atomic_read(&inode->i_count));
+ iref_read(inode));
out:
return inode;
@@ -1191,7 +1191,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
dfprintk(VFS, "NFS: %s(%s/%ld ct=%d info=0x%x)\n",
__func__, inode->i_sb->s_id, inode->i_ino,
- atomic_read(&inode->i_count), fattr->valid);
+ iref_read(inode), fattr->valid);
if ((fattr->valid & NFS_ATTR_FATTR_FILEID) && nfsi->fileid != fattr->fileid)
goto out_fileid;
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 8096a9e..6c54e02 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -252,12 +252,12 @@ void fsnotify_unmount_inodes(struct list_head *list)
continue;
/*
- * If i_count is zero, the inode cannot have any watches and
- * doing an iref/iput with MS_ACTIVE clear would actually
- * evict all inodes with zero i_count from icache which is
+ * If the inode is not referenced, the inode cannot have any
+ * watches and doing an iref/iput with MS_ACTIVE clear would
+ * actually evict all unreferenced inodes from icache which is
* unnecessarily violent and may in fact be illegal to do.
*/
- if (!atomic_read(&inode->i_count))
+ if (!iref_read(inode))
continue;
need_iput_tmp = need_iput;
@@ -270,8 +270,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
need_iput_tmp = NULL;
/* In case the dropping of a reference would nuke next_i. */
- if ((&next_i->i_sb_list != list) &&
- atomic_read(&next_i->i_count) &&
+ if ((&next_i->i_sb_list != list) && iref_read(inode) &&
!(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
iref_locked(next_i);
need_iput = next_i;
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 313d39d..55c3ad3 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1477,7 +1477,7 @@ static int maybe_indirect_to_direct(struct reiserfs_transaction_handle *th,
** reading in the last block. The user will hit problems trying to
** read the file, but for now we just skip the indirect2direct
*/
- if (atomic_read(&inode->i_count) > 1 ||
+ if (iref_read(inode) > 1 ||
!tail_has_to_be_packed(inode) ||
!page || (REISERFS_I(inode)->i_flags & i_nopack_mask)) {
/* leave tail in an unformatted node */
diff --git a/fs/smbfs/inode.c b/fs/smbfs/inode.c
index 450c919..792593b 100644
--- a/fs/smbfs/inode.c
+++ b/fs/smbfs/inode.c
@@ -320,7 +320,7 @@ out:
}
/*
- * This routine is called when i_nlink == 0 and i_count goes to 0.
+ * This routine is called when i_nlink == 0 and the reference count goes to 0.
* All blocking cleanup operations need to go here to avoid races.
*/
static void
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 45888fb..a1b109c 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -342,7 +342,7 @@ static void ubifs_evict_inode(struct inode *inode)
goto out;
dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
- ubifs_assert(!atomic_read(&inode->i_count));
+ ubifs_assert(!iref_read(inode));
truncate_inode_pages(&inode->i_data, 0);
diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index be5dffd..c3940ab 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -599,7 +599,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
TP_fast_assign(
__entry->dev = VFS_I(ip)->i_sb->s_dev;
__entry->ino = ip->i_ino;
- __entry->count = atomic_read(&VFS_I(ip)->i_count);
+ __entry->count = iref_read(VFS_I(ip));
__entry->pincount = atomic_read(&ip->i_pincount);
__entry->caller_ip = caller_ip;
),
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index cbb4791..5000660 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -481,7 +481,7 @@ void xfs_mark_inode_dirty_sync(xfs_inode_t *);
#define IHOLD(ip) \
do { \
- ASSERT(atomic_read(&VFS_I(ip)->i_count) > 0) ; \
+ ASSERT(iref_read(VFS_I(ip)) > 0) ; \
iref(VFS_I(ip)); \
trace_xfs_ihold(ip, _THIS_IP_); \
} while (0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2e971f2..6f0df2a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2186,6 +2186,7 @@ extern void unlock_new_inode(struct inode *);
extern void iref(struct inode *inode);
extern void iref_locked(struct inode *inode);
+extern int iref_read(struct inode *inode);
extern void iget_failed(struct inode *);
extern void end_writeback(struct inode *);
extern void destroy_inode(struct inode *);
--
1.7.1
* [PATCH 09/18] fs: rework icount to be a locked variable
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (7 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 08/18] fs: add inode reference count read accessor Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:27 ` Christoph Hellwig
2010-10-08 9:32 ` Al Viro
2010-10-08 5:21 ` [PATCH 10/18] fs: Factor inode hash operations into functions Dave Chinner
` (10 subsequent siblings)
19 siblings, 2 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Dave Chinner <dchinner@redhat.com>
The inode reference count is currently an atomic variable so that it can be
sampled/modified outside the inode_lock. However, the inode_lock is still
needed to synchronise the final reference count and checks against the inode
state.
To avoid needing the protection of the inode lock, protect the inode reference
count with the per-inode i_lock and convert it to a normal variable. To avoid
existing out-of-tree code accidentally compiling against the new method, rename
the i_count field to i_ref. This is relatively straightforward as there
are limited external references to the i_count field remaining.
Based on work originally from Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/btrfs/inode.c | 8 ++++-
fs/inode.c | 83 ++++++++++++++++++++++++++++++++++++-----------
fs/nfs/nfs4state.c | 2 +-
fs/nilfs2/mdt.c | 2 +-
fs/notify/inode_mark.c | 16 ++++++---
include/linux/fs.h | 2 +-
6 files changed, 84 insertions(+), 29 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2953e9f..9f04478 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1964,8 +1964,14 @@ void btrfs_add_delayed_iput(struct inode *inode)
struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
struct delayed_iput *delayed;
- if (atomic_add_unless(&inode->i_count, -1, 1))
+ /* XXX: filesystems should not play refcount games like this */
+ spin_lock(&inode->i_lock);
+ if (inode->i_ref > 1) {
+ inode->i_ref--;
+ spin_unlock(&inode->i_lock);
return;
+ }
+ spin_unlock(&inode->i_lock);
delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
delayed->inode = inode;
diff --git a/fs/inode.c b/fs/inode.c
index b1dc6dc..5c8a3ea 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -26,6 +26,13 @@
#include <linux/posix_acl.h>
/*
+ * Locking rules.
+ *
+ * inode->i_lock protects:
+ * i_ref
+ */
+
+/*
* This is needed for the following functions:
* - inode_has_buffers
* - invalidate_inode_buffers
@@ -64,9 +71,9 @@ static unsigned int i_hash_shift __read_mostly;
* Each inode can be on two separate lists. One is
* the hash list of the inode, used for lookups. The
* other linked list is the "type" list:
- * "in_use" - valid inode, i_count > 0, i_nlink > 0
+ * "in_use" - valid inode, i_ref > 0, i_nlink > 0
* "dirty" - as "in_use" but also dirty
- * "unused" - valid inode, i_count = 0
+ * "unused" - valid inode, i_ref = 0
*
* A "dirty" list is maintained for each super block,
* allowing for low-overhead inode sync() operations.
@@ -164,7 +171,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
inode->i_sb = sb;
inode->i_blkbits = sb->s_blocksize_bits;
inode->i_flags = 0;
- atomic_set(&inode->i_count, 1);
+ inode->i_ref = 1;
inode->i_op = &empty_iops;
inode->i_fop = &empty_fops;
inode->i_nlink = 1;
@@ -313,31 +320,38 @@ static void init_once(void *foo)
inode_init_once(inode);
}
+
+/*
+ * inode_lock must be held
+ */
+void iref_locked(struct inode *inode)
+{
+ inode->i_ref++;
+}
EXPORT_SYMBOL_GPL(iref_locked);
void iref(struct inode *inode)
{
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
iref_locked(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL_GPL(iref);
/*
- * inode_lock must be held
- */
-void iref_locked(struct inode *inode)
-{
- atomic_inc(&inode->i_count);
-}
-
-/*
* Nobody outside of core code should really be looking at the inode reference
* count. Please don't add new users of this function.
*/
int iref_read(struct inode *inode)
{
- return atomic_read(&inode->i_count);
+ int ref;
+
+ spin_lock(&inode->i_lock);
+ ref = inode->i_ref;
+ spin_unlock(&inode->i_lock);
+ return ref;
}
EXPORT_SYMBOL_GPL(iref_read);
@@ -425,7 +439,9 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
if (inode->i_state & I_NEW)
continue;
invalidate_inode_buffers(inode);
- if (!atomic_read(&inode->i_count)) {
+ spin_lock(&inode->i_lock);
+ if (!inode->i_ref) {
+ spin_unlock(&inode->i_lock);
list_move(&inode->i_lru, dispose);
list_del_init(&inode->i_io);
WARN_ON(inode->i_state & I_NEW);
@@ -433,6 +449,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
percpu_counter_dec(&nr_inodes_unused);
continue;
}
+ spin_unlock(&inode->i_lock);
busy = 1;
}
return busy;
@@ -470,7 +487,7 @@ static int can_unuse(struct inode *inode)
return 0;
if (inode_has_buffers(inode))
return 0;
- if (atomic_read(&inode->i_count))
+ if (iref_read(inode))
return 0;
if (inode->i_data.nrpages)
return 0;
@@ -506,19 +523,22 @@ static void prune_icache(int nr_to_scan)
inode = list_entry(inode_unused.prev, struct inode, i_lru);
- if (atomic_read(&inode->i_count) ||
- (inode->i_state & ~I_REFERENCED)) {
+ spin_lock(&inode->i_lock);
+ if (inode->i_ref || (inode->i_state & ~I_REFERENCED)) {
+ spin_unlock(&inode->i_lock);
list_del_init(&inode->i_lru);
percpu_counter_dec(&nr_inodes_unused);
continue;
}
if (inode->i_state & I_REFERENCED) {
+ spin_unlock(&inode->i_lock);
list_move(&inode->i_lru, &inode_unused);
inode->i_state &= ~I_REFERENCED;
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
iref_locked(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
if (remove_inode_buffers(inode))
reap += invalidate_mapping_pages(&inode->i_data,
@@ -535,7 +555,8 @@ static void prune_icache(int nr_to_scan)
list_move(&inode->i_lru, &inode_unused);
continue;
}
- }
+ } else
+ spin_unlock(&inode->i_lock);
list_move(&inode->i_lru, &freeable);
list_del_init(&inode->i_io);
WARN_ON(inode->i_state & I_NEW);
@@ -788,7 +809,9 @@ static struct inode *get_new_inode(struct super_block *sb,
* us. Use the old inode instead of the one we just
* allocated.
*/
+ spin_lock(&old->i_lock);
iref_locked(old);
+ spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
@@ -835,7 +858,9 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
* us. Use the old inode instead of the one we just
* allocated.
*/
+ spin_lock(&old->i_lock);
iref_locked(old);
+ spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
@@ -887,9 +912,11 @@ EXPORT_SYMBOL(iunique);
struct inode *igrab(struct inode *inode)
{
spin_lock(&inode_lock);
- if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
+ if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
+ spin_lock(&inode->i_lock);
iref_locked(inode);
- else
+ spin_unlock(&inode->i_lock);
+ } else
/*
* Handle the case where s_op->clear_inode is not been
* called yet, and somebody is calling igrab
@@ -929,7 +956,9 @@ static struct inode *ifind(struct super_block *sb,
spin_lock(&inode_lock);
inode = find_inode(sb, head, test, data);
if (inode) {
+ spin_lock(&inode->i_lock);
iref_locked(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
if (likely(wait))
wait_on_inode(inode);
@@ -962,7 +991,9 @@ static struct inode *ifind_fast(struct super_block *sb,
spin_lock(&inode_lock);
inode = find_inode_fast(sb, head, ino);
if (inode) {
+ spin_lock(&inode->i_lock);
iref_locked(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
wait_on_inode(inode);
return inode;
@@ -1145,7 +1176,9 @@ int insert_inode_locked(struct inode *inode)
spin_unlock(&inode_lock);
return 0;
}
+ spin_lock(&old->i_lock);
iref_locked(old);
+ spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1184,7 +1217,9 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
spin_unlock(&inode_lock);
return 0;
}
+ spin_lock(&old->i_lock);
iref_locked(old);
+ spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1324,8 +1359,16 @@ void iput(struct inode *inode)
if (inode) {
BUG_ON(inode->i_state & I_CLEAR);
- if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
+ spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
+ inode->i_ref--;
+ if (inode->i_ref == 0) {
+ spin_unlock(&inode->i_lock);
iput_final(inode);
+ return;
+ }
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&inode_lock);
}
}
EXPORT_SYMBOL(iput);
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 3e2f19b..d7fc5d0 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -506,8 +506,8 @@ nfs4_get_open_state(struct inode *inode, struct nfs4_state_owner *owner)
state->owner = owner;
atomic_inc(&owner->so_count);
list_add(&state->inode_states, &nfsi->open_states);
- state->inode = igrab(inode);
spin_unlock(&inode->i_lock);
+ state->inode = igrab(inode);
/* Note: The reclaim code dictates that we add stateless
* and read-only stateids to the end of the list */
list_add_tail(&state->open_states, &owner->so_states);
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index 2ee524f..435ba11 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -480,7 +480,7 @@ nilfs_mdt_new_common(struct the_nilfs *nilfs, struct super_block *sb,
inode->i_sb = sb; /* sb may be NULL for some meta data files */
inode->i_blkbits = nilfs->ns_blocksize_bits;
inode->i_flags = 0;
- atomic_set(&inode->i_count, 1);
+ inode->i_ref = 1;
inode->i_nlink = 1;
inode->i_ino = ino;
inode->i_mode = S_IFREG;
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 6c54e02..2fe319b 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -257,7 +257,8 @@ void fsnotify_unmount_inodes(struct list_head *list)
* actually evict all unreferenced inodes from icache which is
* unnecessarily violent and may in fact be illegal to do.
*/
- if (!iref_read(inode))
+ spin_lock(&inode->i_lock);
+ if (!inode->i_ref)
continue;
need_iput_tmp = need_iput;
@@ -268,12 +269,17 @@ void fsnotify_unmount_inodes(struct list_head *list)
iref_locked(inode);
else
need_iput_tmp = NULL;
+ spin_unlock(&inode->i_lock);
/* In case the dropping of a reference would nuke next_i. */
- if ((&next_i->i_sb_list != list) && iref_read(inode) &&
- !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
- iref_locked(next_i);
- need_iput = next_i;
+ if (&next_i->i_sb_list != list) {
+ spin_lock(&next_i->i_lock);
+ if (inode->i_ref &&
+ !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
+ iref_locked(next_i);
+ need_iput = next_i;
+ }
+ spin_unlock(&next_i->i_lock);
}
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6f0df2a..1162c10 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -730,7 +730,7 @@ struct inode {
struct list_head i_sb_list;
struct list_head i_dentry;
unsigned long i_ino;
- atomic_t i_count;
+ unsigned int i_ref;
unsigned int i_nlink;
uid_t i_uid;
gid_t i_gid;
--
1.7.1
* Re: [PATCH 09/18] fs: rework icount to be a locked variable
2010-10-08 5:21 ` [PATCH 09/18] fs: rework icount to be a locked variable Dave Chinner
@ 2010-10-08 7:27 ` Christoph Hellwig
2010-10-08 7:50 ` Dave Chinner
2010-10-08 9:32 ` Al Viro
1 sibling, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 7:27 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel, chris.mason, linux-btrfs
> index 2953e9f..9f04478 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1964,8 +1964,14 @@ void btrfs_add_delayed_iput(struct inode *inode)
> struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> struct delayed_iput *delayed;
>
> - if (atomic_add_unless(&inode->i_count, -1, 1))
> + /* XXX: filesystems should not play refcount games like this */
> + spin_lock(&inode->i_lock);
> + if (inode->i_ref > 1) {
> + inode->i_ref--;
> + spin_unlock(&inode->i_lock);
> return;
> + }
> + spin_unlock(&inode->i_lock);
Yeah, all that i_count/i_ref mess in btrfs needs some serious work.
Chris?
> +
> +/*
> + * inode_lock must be held
> + */
> +void iref_locked(struct inode *inode)
> +{
> + inode->i_ref++;
> +}
> EXPORT_SYMBOL_GPL(iref_locked);
I'm a big fan of _GPL exports, but adding this for a trivial counter
increment seems a bit weird.
> int iref_read(struct inode *inode)
> {
> - return atomic_read(&inode->i_count);
> + int ref;
> +
> + spin_lock(&inode->i_lock);
> + ref = inode->i_ref;
> + spin_unlock(&inode->i_lock);
> + return ref;
> }
There's no need to lock a normal 32-bit variable for readers.
> + inode->i_ref--;
> + if (inode->i_ref == 0) {
if (--inode->i_ref == 0) {
might be a bit more idiomatic.
* Re: [PATCH 09/18] fs: rework icount to be a locked variable
2010-10-08 7:27 ` Christoph Hellwig
@ 2010-10-08 7:50 ` Dave Chinner
2010-10-08 8:17 ` Christoph Hellwig
0 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 7:50 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel, chris.mason, linux-btrfs
On Fri, Oct 08, 2010 at 03:27:49AM -0400, Christoph Hellwig wrote:
> > index 2953e9f..9f04478 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -1964,8 +1964,14 @@ void btrfs_add_delayed_iput(struct inode *inode)
> > struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> > struct delayed_iput *delayed;
> >
> > - if (atomic_add_unless(&inode->i_count, -1, 1))
> > + /* XXX: filesystems should not play refcount games like this */
> > + spin_lock(&inode->i_lock);
> > + if (inode->i_ref > 1) {
> > + inode->i_ref--;
> > + spin_unlock(&inode->i_lock);
> > return;
> > + }
> > + spin_unlock(&inode->i_lock);
>
> Yeah, all that i_count/i_ref mess in btrfs needs some serious work.
> Chris?
>
> > +
> > +/*
> > + * inode_lock must be held
> > + */
> > +void iref_locked(struct inode *inode)
> > +{
> > + inode->i_ref++;
> > +}
> > EXPORT_SYMBOL_GPL(iref_locked);
>
> I'm a big fan of _GPL exports, but adding this for a trivial counter
> increment seems a bit weird.
OK, will drop the _GPL.
>
> > int iref_read(struct inode *inode)
> > {
> > - return atomic_read(&inode->i_count);
> > + int ref;
> > +
> > + spin_lock(&inode->i_lock);
> > + ref = inode->i_ref;
> > + spin_unlock(&inode->i_lock);
> > + return ref;
> > }
>
> There's no need to lock a normal 32-bit variable for readers.
Ok, but will it need a memory barrier instead?
>
> > + inode->i_ref--;
> > + if (inode->i_ref == 0) {
>
> if (--inode->i_ref == 0) {
>
> might be a bit more idiomatic.
OK.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 09/18] fs: rework icount to be a locked variable
2010-10-08 7:50 ` Dave Chinner
@ 2010-10-08 8:17 ` Christoph Hellwig
2010-10-08 13:16 ` Chris Mason
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 8:17 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, chris.mason,
linux-btrfs
On Fri, Oct 08, 2010 at 06:50:01PM +1100, Dave Chinner wrote:
> > There's no need to lock a normal 32-bit variable for readers.
>
> Ok, but will it need a memory barrier instead?
Isn't spin_unlock supposed to be one? I'll need some of the locking
experts to chime in.
* Re: [PATCH 09/18] fs: rework icount to be a locked variable
2010-10-08 8:17 ` Christoph Hellwig
@ 2010-10-08 13:16 ` Chris Mason
0 siblings, 0 replies; 162+ messages in thread
From: Chris Mason @ 2010-10-08 13:16 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel, linux-btrfs
On Fri, Oct 08, 2010 at 10:17:14AM +0200, Christoph Hellwig wrote:
> On Fri, Oct 08, 2010 at 06:50:01PM +1100, Dave Chinner wrote:
> > > There's no need to lock a normal 32-bit variable for readers.
> >
> > Ok, but will it need a memory barrier instead?
>
> Isn't spin_unlock supposed to be one? I'll need some of the locking
> experts to chime in.
Not really a locking expert, but the locking operations are supposed to
have an implicit barrier.
-chris
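For what it's worth, the unlocked read under discussion would be
something like the sketch below. ACCESS_ONCE() only guards against torn
or re-fetched loads; as noted above, the lock/unlock pairs on the update
side already provide the ordering, and the value is only a point-in-time
snapshot either way. This is a sketch of the idea, not necessarily the
form the series ends up with:

/*
 * Lockless snapshot of the reference count.  i_ref is a plain int
 * protected by i_lock for writers; a reader only ever gets a
 * point-in-time value, so taking the lock buys it nothing extra.
 */
int iref_read(struct inode *inode)
{
	return ACCESS_ONCE(inode->i_ref);
}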
* Re: [PATCH 09/18] fs: rework icount to be a locked variable
2010-10-08 5:21 ` [PATCH 09/18] fs: rework icount to be a locked variable Dave Chinner
2010-10-08 7:27 ` Christoph Hellwig
@ 2010-10-08 9:32 ` Al Viro
2010-10-08 10:15 ` Dave Chinner
1 sibling, 1 reply; 162+ messages in thread
From: Al Viro @ 2010-10-08 9:32 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:23PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The inode reference count is currently an atomic variable so that it can be
> sampled/modified outside the inode_lock. However, the inode_lock is still
> needed to synchronise the final reference count and checks against the inode
> state.
>
> To avoid needing the protection of the inode lock, protect the inode reference
> count with the per-inode i_lock and convert it to a normal variable. To avoid
> existing out-of-tree code accidentally compiling against the new method, rename
> the i_count field to i_ref. This is relatively straightforward as there
> are limited external references to the i_count field remaining.
You are overdoing the information hiding here; _way_ too many small
functions that don't buy you anything so far, AFAICS. Moreover, why
the hell not make them static inlines and get rid of the exports?
> - if (atomic_add_unless(&inode->i_count, -1, 1))
> + /* XXX: filesystems should not play refcount games like this */
> + spin_lock(&inode->i_lock);
> + if (inode->i_ref > 1) {
> + inode->i_ref--;
> + spin_unlock(&inode->i_lock);
> return;
> + }
> + spin_unlock(&inode->i_lock);
... or, perhaps, they need a helper along the lines of "try to do iput()
if it's known to hit the easy case".
I really don't like the look of code around -ENOSPC returns, though.
What exactly is going on there? Can it e.g. interfere with that
delayed iput stuff?
> void iref(struct inode *inode)
> {
> spin_lock(&inode_lock);
> + spin_lock(&inode->i_lock);
> iref_locked(inode);
> + spin_unlock(&inode->i_lock);
> spin_unlock(&inode_lock);
> }
*cringe*
> int iref_read(struct inode *inode)
> {
> - return atomic_read(&inode->i_count);
> + int ref;
> +
> + spin_lock(&inode->i_lock);
> + ref = inode->i_ref;
> + spin_unlock(&inode->i_lock);
> + return ref;
What's the point of locking here?
> @@ -1324,8 +1359,16 @@ void iput(struct inode *inode)
> if (inode) {
> BUG_ON(inode->i_state & I_CLEAR);
>
> - if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
> + spin_lock(&inode_lock);
> + spin_lock(&inode->i_lock);
> + inode->i_ref--;
> + if (inode->i_ref == 0) {
> + spin_unlock(&inode->i_lock);
> iput_final(inode);
> + return;
> + }
*UGH* So you take inode_lock on every damn iput()?
> state->owner = owner;
> atomic_inc(&owner->so_count);
> list_add(&state->inode_states, &nfsi->open_states);
> - state->inode = igrab(inode);
> spin_unlock(&inode->i_lock);
> + state->inode = igrab(inode);
Why is that safe?
> --- a/fs/notify/inode_mark.c
> +++ b/fs/notify/inode_mark.c
> @@ -257,7 +257,8 @@ void fsnotify_unmount_inodes(struct list_head *list)
> * actually evict all unreferenced inodes from icache which is
> * unnecessarily violent and may in fact be illegal to do.
> */
> - if (!iref_read(inode))
> + spin_lock(&inode->i_lock);
> + if (!inode->i_ref)
> continue;
Really?
* Re: [PATCH 09/18] fs: rework icount to be a locked variable
2010-10-08 9:32 ` Al Viro
@ 2010-10-08 10:15 ` Dave Chinner
2010-10-08 13:14 ` Chris Mason
2010-10-08 13:53 ` Christoph Hellwig
0 siblings, 2 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 10:15 UTC (permalink / raw)
To: Al Viro; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 10:32:02AM +0100, Al Viro wrote:
> On Fri, Oct 08, 2010 at 04:21:23PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > The inode reference count is currently an atomic variable so that it can be
> > sampled/modified outside the inode_lock. However, the inode_lock is still
> > needed to synchronise the final reference count and checks against the inode
> > state.
> >
> > To avoid needing the protection of the inode lock, protect the inode reference
> > count with the per-inode i_lock and convert it to a normal variable. To avoid
> > existing out-of-tree code accidentally compiling against the new method, rename
> > the i_count field to i_ref. This is relatively straightforward as there
> > are limited external references to the i_count field remaining.
>
> You are overdoing the information hiding here; _way_ too many small
> functions that don't buy you anything so far, AFAICS.
See akpm's comments on the previous version of the series.
> Moreover, why
> the hell not make them static inlines and get rid of the exports?
Yes, that is probably sensible.
>
> > - if (atomic_add_unless(&inode->i_count, -1, 1))
> > + /* XXX: filesystems should not play refcount games like this */
> > + spin_lock(&inode->i_lock);
> > + if (inode->i_ref > 1) {
> > + inode->i_ref--;
> > + spin_unlock(&inode->i_lock);
> > return;
> > + }
> > + spin_unlock(&inode->i_lock);
>
> ... or, perhaps, they need a helper along the lines of "try to do iput()
> if it's known to hit the easy case".
>
> I really don't like the look of code around -ENOSPC returns, though.
> What exactly is going on there? Can it e.g. interfere with that
> delayed iput stuff?
I have no idea what the btrfs code is doing, hence I haven't tried
to clean it up or provide any helpers for it. It looks like a hack
around a problem in the btrfs reference counting model to me...
>
> > void iref(struct inode *inode)
> > {
> > spin_lock(&inode_lock);
> > + spin_lock(&inode->i_lock);
> > iref_locked(inode);
> > + spin_unlock(&inode->i_lock);
> > spin_unlock(&inode_lock);
> > }
>
> *cringe*
>
> > int iref_read(struct inode *inode)
> > {
> > - return atomic_read(&inode->i_count);
> > + int ref;
> > +
> > + spin_lock(&inode->i_lock);
> > + ref = inode->i_ref;
> > + spin_unlock(&inode->i_lock);
> > + return ref;
>
> What's the point of locking here?
It can be replaced with a memory barrier, right?
> > @@ -1324,8 +1359,16 @@ void iput(struct inode *inode)
> > if (inode) {
> > BUG_ON(inode->i_state & I_CLEAR);
> >
> > - if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
> > + spin_lock(&inode_lock);
> > + spin_lock(&inode->i_lock);
> > + inode->i_ref--;
> > + if (inode->i_ref == 0) {
> > + spin_unlock(&inode->i_lock);
> > iput_final(inode);
> > + return;
> > + }
>
> *UGH* So you take inode_lock on every damn iput()?
Only until the inode_lock is removed completely.
> > state->owner = owner;
> > atomic_inc(&owner->so_count);
> > list_add(&state->inode_states, &nfsi->open_states);
> > - state->inode = igrab(inode);
> > spin_unlock(&inode->i_lock);
> > + state->inode = igrab(inode);
>
> Why is that safe?
Why wouldn't it be? This is code inherited from Nick's patches, so
I haven't looked at this particular hunk in great detail. I've made the
assumption that if the inode passed in doesn't already have a
reference, then that code is already broken.
It probably should be converted to an iref_locked() call
instead of igrab().
>
> > --- a/fs/notify/inode_mark.c
> > +++ b/fs/notify/inode_mark.c
> > @@ -257,7 +257,8 @@ void fsnotify_unmount_inodes(struct list_head *list)
> > * actually evict all unreferenced inodes from icache which is
> > * unnecessarily violent and may in fact be illegal to do.
> > */
> > - if (!iref_read(inode))
> > + spin_lock(&inode->i_lock);
> > + if (!inode->i_ref)
> > continue;
>
> Really?
Good catch. It looks like a change split across 2 patches - it is correct when
all patches are applied. Will fix.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 09/18] fs: rework icount to be a locked variable
2010-10-08 10:15 ` Dave Chinner
@ 2010-10-08 13:14 ` Chris Mason
2010-10-08 13:53 ` Christoph Hellwig
1 sibling, 0 replies; 162+ messages in thread
From: Chris Mason @ 2010-10-08 13:14 UTC (permalink / raw)
To: Dave Chinner; +Cc: Al Viro, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 09:15:49PM +1100, Dave Chinner wrote:
> On Fri, Oct 08, 2010 at 10:32:02AM +0100, Al Viro wrote:
> > On Fri, Oct 08, 2010 at 04:21:23PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > The inode reference count is currently an atomic variable so that it can be
> > > sampled/modified outside the inode_lock. However, the inode_lock is still
> > > needed to synchronise the final reference count and checks against the inode
> > > state.
> > >
> > > To avoid needing the protection of the inode lock, protect the inode reference
> > > count with the per-inode i_lock and convert it to a normal variable. To avoid
> > > existing out-of-tree code accidentally compiling against the new method, rename
> > > the i_count field to i_ref. This is relatively straightforward as there
> > > are limited external references to the i_count field remaining.
> >
> > You are overdoing the information hiding here; _way_ too many small
> > functions that don't buy you anything so far, AFAICS.
>
> See akpm's comments on the previous version of the series.
>
> > Moreover, why
> > the hell not make them static inlines and get rid of the exports?
>
> Yes, that is probably sensible.
>
> >
> > > - if (atomic_add_unless(&inode->i_count, -1, 1))
> > > + /* XXX: filesystems should not play refcount games like this */
> > > + spin_lock(&inode->i_lock);
> > > + if (inode->i_ref > 1) {
> > > + inode->i_ref--;
> > > + spin_unlock(&inode->i_lock);
> > > return;
> > > + }
> > > + spin_unlock(&inode->i_lock);
> >
> > ... or, perhaps, they need a helper along the lines of "try to do iput()
> > if it's known to hit the easy case".
> >
> > I really don't like the look of code around -ENOSPC returns, though.
> > What exactly is going on there? Can it e.g. interfere with that
> > delayed iput stuff?
>
> I have no idea what the btrfs code is doing, hence I haven't tried
> to clean it up or provide any helpers for it. It looks like a hack
> around a problem in the btrfs reference counting model to me...
The problem is that we're not allowed to do the final iput for one
specific caller because it can deadlock on inode deletion. That one
specific caller doesn't happen very often.
For the deadlock avoidance case, we do the fast atomic_dec as long as we
aren't the last holder and the slow iput-by-a-thread-at-a-safe-time if
we are. Lots of different filesystem code dances around avoiding inode
deletion at iput time; should we make this a more generic setup?
-chris
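A generic version of the dance btrfs open codes above might look roughly
like the sketch below; the name iput_not_last() and where it would live
are made up purely for illustration, and it assumes the i_lock-protected
i_ref from this patch:

/*
 * Drop a reference only if it is known not to be the last one, so that
 * callers which must not run inode eviction in their current context
 * (e.g. the btrfs delayed iput path) can defer the final iput() to a
 * safe place.  Returns true if the reference was dropped here.
 */
static inline bool iput_not_last(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	if (inode->i_ref > 1) {
		inode->i_ref--;
		spin_unlock(&inode->i_lock);
		return true;
	}
	spin_unlock(&inode->i_lock);
	return false;		/* caller must arrange a deferred iput() */
}

With something like that, btrfs_add_delayed_iput() would reduce to
"if (iput_not_last(inode)) return;" followed by queueing the delayed
iput as it already does.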
* Re: [PATCH 09/18] fs: rework icount to be a locked variable
2010-10-08 10:15 ` Dave Chinner
2010-10-08 13:14 ` Chris Mason
@ 2010-10-08 13:53 ` Christoph Hellwig
2010-10-08 14:09 ` Dave Chinner
1 sibling, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 13:53 UTC (permalink / raw)
To: Dave Chinner; +Cc: Al Viro, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 09:15:49PM +1100, Dave Chinner wrote:
> > You are overdoing the information hiding here; _way_ too many small
> > functions that don't buy you anything so far, AFAICS.
>
> See akpm's comments on the previous version of the series.
It's one person's opinion. I tend to disagree with lots of it. iref
is a good new helper for filesystems to use, but for the unlocked
read it's rather pointless. iref_locked is even more pointless -
it's only used in core fs code (fs/inode.c, fs/fs-writeback.c,
fs/drop_caches.c, fs/notify/inode_mark.c and fs/quota/dquot.c) and
an opencoded increment would be a lot more readable.
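For concreteness, the static-inline shape being suggested (which would
also make the export question moot) might look roughly like the sketch
below, assuming the i_lock-protected counter from this patch; whether to
keep the helper at all or simply open code the increment in the handful
of core callers is exactly the question being raised here:

/* include/linux/fs.h -- sketch only */
static inline void iref_locked(struct inode *inode)
{
	assert_spin_locked(&inode->i_lock);
	inode->i_ref++;
}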
* Re: [PATCH 09/18] fs: rework icount to be a locked variable
2010-10-08 13:53 ` Christoph Hellwig
@ 2010-10-08 14:09 ` Dave Chinner
0 siblings, 0 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 14:09 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Al Viro, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 09:53:04AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 08, 2010 at 09:15:49PM +1100, Dave Chinner wrote:
> > > You are overdoing the information hiding here; _way_ too many small
> > > functions that don't buy you anything so far, AFAICS.
> >
> > See akpm's comments on the previous version of the series.
>
> It's one persons opinion. I tend to disagree with lots of it. iref
> is a good new helper for filesystems to use, but for the unlocked
> read it's reather pointless. iref_locked is even more pointless -
> it's only used in core fs code (fs/inode.c, fs/fs-writeback.c,
> fs/drop_caches.c, fs/notify/inode_mark.c and fs/quota/dquot.c) and
> an opencoded increment would be a lot more readable.
I don't care one way or the other and certainly not enough to argue
about it. I'll change the code to match whatever is decreed as the
Right Way To Do Stuff.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* [PATCH 10/18] fs: Factor inode hash operations into functions
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (8 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 09/18] fs: rework icount to be a locked variable Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:29 ` Christoph Hellwig
2010-10-08 5:21 ` [PATCH 11/18] fs: Introduce per-bucket inode hash locks Dave Chinner
` (9 subsequent siblings)
19 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Dave Chinner <dchinner@redhat.com>
Before replacing the inode hash locking with a more scalable
mechanism, factor the removal of the inode from the hashes rather
than open coding it in several places.
Based on a patch originally from Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 100 +++++++++++++++++++++++++++++++++--------------------------
1 files changed, 56 insertions(+), 44 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 5c8a3ea..32da15e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -355,6 +355,59 @@ int iref_read(struct inode *inode)
}
EXPORT_SYMBOL_GPL(iref_read);
+static unsigned long hash(struct super_block *sb, unsigned long hashval)
+{
+ unsigned long tmp;
+
+ tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
+ L1_CACHE_BYTES;
+ tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> I_HASHBITS);
+ return tmp & I_HASHMASK;
+}
+
+/**
+ * __insert_inode_hash - hash an inode
+ * @inode: unhashed inode
+ * @hashval: unsigned long value used to locate this object in the
+ * inode_hashtable.
+ *
+ * Add an inode to the inode hash for this superblock.
+ */
+void __insert_inode_hash(struct inode *inode, unsigned long hashval)
+{
+ struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+ spin_lock(&inode_lock);
+ hlist_add_head(&inode->i_hash, head);
+ spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL(__insert_inode_hash);
+
+/**
+ * __remove_inode_hash - remove an inode from the hash
+ * @inode: inode to unhash
+ *
+ * Remove an inode from the superblock. inode->i_lock must be
+ * held.
+ */
+static void __remove_inode_hash(struct inode *inode)
+{
+ hlist_del_init(&inode->i_hash);
+}
+
+/**
+ * remove_inode_hash - remove an inode from the hash
+ * @inode: inode to unhash
+ *
+ * Remove an inode from the superblock.
+ */
+void remove_inode_hash(struct inode *inode)
+{
+ spin_lock(&inode_lock);
+ hlist_del_init(&inode->i_hash);
+ spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL(remove_inode_hash);
+
void end_writeback(struct inode *inode)
{
might_sleep();
@@ -402,7 +455,7 @@ static void dispose_list(struct list_head *head)
evict(inode);
spin_lock(&inode_lock);
- hlist_del_init(&inode->i_hash);
+ __remove_inode_hash(inode);
list_del_init(&inode->i_sb_list);
spin_unlock(&inode_lock);
@@ -657,16 +710,6 @@ repeat:
return node ? inode : NULL;
}
-static unsigned long hash(struct super_block *sb, unsigned long hashval)
-{
- unsigned long tmp;
-
- tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
- L1_CACHE_BYTES;
- tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> I_HASHBITS);
- return tmp & I_HASHMASK;
-}
-
static inline void
__inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
struct inode *inode)
@@ -1231,36 +1274,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
}
EXPORT_SYMBOL(insert_inode_locked4);
-/**
- * __insert_inode_hash - hash an inode
- * @inode: unhashed inode
- * @hashval: unsigned long value used to locate this object in the
- * inode_hashtable.
- *
- * Add an inode to the inode hash for this superblock.
- */
-void __insert_inode_hash(struct inode *inode, unsigned long hashval)
-{
- struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
- spin_lock(&inode_lock);
- hlist_add_head(&inode->i_hash, head);
- spin_unlock(&inode_lock);
-}
-EXPORT_SYMBOL(__insert_inode_hash);
-
-/**
- * remove_inode_hash - remove an inode from the hash
- * @inode: inode to unhash
- *
- * Remove an inode from the superblock.
- */
-void remove_inode_hash(struct inode *inode)
-{
- spin_lock(&inode_lock);
- hlist_del_init(&inode->i_hash);
- spin_unlock(&inode_lock);
-}
-EXPORT_SYMBOL(remove_inode_hash);
int generic_delete_inode(struct inode *inode)
{
@@ -1319,6 +1332,7 @@ static void iput_final(struct inode *inode)
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
hlist_del_init(&inode->i_hash);
+ __remove_inode_hash(inode);
}
list_del_init(&inode->i_io);
WARN_ON(inode->i_state & I_NEW);
@@ -1337,9 +1351,7 @@ static void iput_final(struct inode *inode)
list_del_init(&inode->i_sb_list);
spin_unlock(&inode_lock);
evict(inode);
- spin_lock(&inode_lock);
- hlist_del_init(&inode->i_hash);
- spin_unlock(&inode_lock);
+ remove_inode_hash(inode);
wake_up_inode(inode);
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
destroy_inode(inode);
--
1.7.1
^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 10/18] fs: Factor inode hash operations into functions
2010-10-08 5:21 ` [PATCH 10/18] fs: Factor inode hash operations into functions Dave Chinner
@ 2010-10-08 7:29 ` Christoph Hellwig
2010-10-08 9:41 ` Al Viro
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 7:29 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:24PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Before replacing the inode hash locking with a more scalable
> mechanism, factor the removal of the inode from the hashes rather
> than open coding it in several places.
>
> Based on a patch originally from Nick Piggin.
Looks good as an equal transformation, but what coda is doing with
remove_inode_hash looks really buggy. It's doing a re-hash of a live
inode which is probably causing enough problems by itself, but should
at least have locks for it. Anyway, that's something for the coda folks
to sort out.
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 10/18] fs: Factor inode hash operations into functions
2010-10-08 7:29 ` Christoph Hellwig
@ 2010-10-08 9:41 ` Al Viro
0 siblings, 0 replies; 162+ messages in thread
From: Al Viro @ 2010-10-08 9:41 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 03:29:47AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 08, 2010 at 04:21:24PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Before replacing the inode hash locking with a more scalable
> > mechanism, factor the removal of the inode from the hashes rather
> > than open coding it in several places.
> >
> > Based on a patch originally from Nick Piggin.
>
> Looks good as an equal transformation, but what coda is doing with
> remove_inode_hash looks really buggy. It's doing a re-hash of a live
> inode which is probably causing enough problems by itself, but should
> at least have locks for it. Anyway, that's something for the coda folks
> to sort out.
Known problem; nobody got around to fixing it. But if that's going where
I think it's going, the problem has just got nastier... We do need locking
around there anyway; current use of BKL to protect lists in psdev stuff
is not good.
^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (9 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 10/18] fs: Factor inode hash operations into functions Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:33 ` Christoph Hellwig
` (3 more replies)
2010-10-08 5:21 ` [PATCH 12/18] fs: add a per-superblock lock for the inode list Dave Chinner
` (8 subsequent siblings)
19 siblings, 4 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Protect the inod hash with a single lock is not scalable. Convert
the inode hash to use the new bit-locked hash list implementation
that allows per-bucket locks to be used. This allows us to replace
the global inode_lock with finer grained locking without increasing
the size of the hash table.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/btrfs/inode.c | 2 +-
fs/fs-writeback.c | 2 +-
fs/hfs/hfs_fs.h | 2 +-
fs/hfs/inode.c | 2 +-
fs/hfsplus/hfsplus_fs.h | 2 +-
fs/hfsplus/inode.c | 2 +-
fs/inode.c | 165 ++++++++++++++++++++++++++++++----------------
fs/nilfs2/gcinode.c | 22 ++++---
fs/nilfs2/segment.c | 2 +-
fs/nilfs2/the_nilfs.h | 2 +-
fs/reiserfs/xattr.c | 2 +-
include/linux/fs.h | 3 +-
mm/shmem.c | 4 +-
13 files changed, 132 insertions(+), 80 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9f04478..f908a12 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3855,7 +3855,7 @@ again:
p = &root->inode_tree.rb_node;
parent = NULL;
- if (hlist_unhashed(&inode->i_hash))
+ if (hlist_bl_unhashed(&inode->i_hash))
return;
spin_lock(&root->inode_lock);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ec7a689..d63ab47 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -959,7 +959,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* dirty list. Add blockdev inodes as well.
*/
if (!S_ISBLK(inode->i_mode)) {
- if (hlist_unhashed(&inode->i_hash))
+ if (hlist_bl_unhashed(&inode->i_hash))
goto out;
}
if (inode->i_state & I_FREEING)
diff --git a/fs/hfs/hfs_fs.h b/fs/hfs/hfs_fs.h
index 4f55651..24591be 100644
--- a/fs/hfs/hfs_fs.h
+++ b/fs/hfs/hfs_fs.h
@@ -148,7 +148,7 @@ struct hfs_sb_info {
int fs_div;
- struct hlist_head rsrc_inodes;
+ struct hlist_bl_head rsrc_inodes;
};
#define HFS_FLG_BITMAP_DIRTY 0
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 397b7ad..7778298 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -524,7 +524,7 @@ static struct dentry *hfs_file_lookup(struct inode *dir, struct dentry *dentry,
HFS_I(inode)->rsrc_inode = dir;
HFS_I(dir)->rsrc_inode = inode;
igrab(dir);
- hlist_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
+ hlist_bl_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
mark_inode_dirty(inode);
out:
d_add(dentry, inode);
diff --git a/fs/hfsplus/hfsplus_fs.h b/fs/hfsplus/hfsplus_fs.h
index dc856be..499f5a5 100644
--- a/fs/hfsplus/hfsplus_fs.h
+++ b/fs/hfsplus/hfsplus_fs.h
@@ -144,7 +144,7 @@ struct hfsplus_sb_info {
unsigned long flags;
- struct hlist_head rsrc_inodes;
+ struct hlist_bl_head rsrc_inodes;
};
#define HFSPLUS_SB_WRITEBACKUP 0x0001
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index c5a979d..b755cf0 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -202,7 +202,7 @@ static struct dentry *hfsplus_file_lookup(struct inode *dir, struct dentry *dent
HFSPLUS_I(inode).rsrc_inode = dir;
HFSPLUS_I(dir).rsrc_inode = inode;
igrab(dir);
- hlist_add_head(&inode->i_hash, &HFSPLUS_SB(sb).rsrc_inodes);
+ hlist_bl_add_head(&inode->i_hash, &HFSPLUS_SB(sb).rsrc_inodes);
mark_inode_dirty(inode);
out:
d_add(dentry, inode);
diff --git a/fs/inode.c b/fs/inode.c
index 32da15e..3c07719 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -24,12 +24,20 @@
#include <linux/mount.h>
#include <linux/async.h>
#include <linux/posix_acl.h>
+#include <linux/bit_spinlock.h>
/*
* Locking rules.
*
* inode->i_lock protects:
* i_ref
+ * inode_hash_bucket lock protects:
+ * inode hash table, i_hash
+ *
+ * Lock orders
+ * inode_lock
+ * inode hash bucket lock
+ * inode->i_lock
*/
/*
@@ -80,7 +88,22 @@ static unsigned int i_hash_shift __read_mostly;
*/
LIST_HEAD(inode_unused);
-static struct hlist_head *inode_hashtable __read_mostly;
+
+struct inode_hash_bucket {
+ struct hlist_bl_head head;
+};
+
+static inline void spin_lock_bucket(struct inode_hash_bucket *b)
+{
+ bit_spin_lock(0, (unsigned long *)b);
+}
+
+static inline void spin_unlock_bucket(struct inode_hash_bucket *b)
+{
+ __bit_spin_unlock(0, (unsigned long *)b);
+}
+
+static struct inode_hash_bucket *inode_hashtable __read_mostly;
/*
* A simple spinlock to protect the list manipulations.
@@ -295,7 +318,7 @@ void destroy_inode(struct inode *inode)
void inode_init_once(struct inode *inode)
{
memset(inode, 0, sizeof(*inode));
- INIT_HLIST_NODE(&inode->i_hash);
+ init_hlist_bl_node(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_devices);
INIT_LIST_HEAD(&inode->i_io);
@@ -375,9 +398,13 @@ static unsigned long hash(struct super_block *sb, unsigned long hashval)
*/
void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
- struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+ struct inode_hash_bucket *b;
+
+ b = inode_hashtable + hash(inode->i_sb, hashval);
spin_lock(&inode_lock);
- hlist_add_head(&inode->i_hash, head);
+ spin_lock_bucket(b);
+ hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(__insert_inode_hash);
@@ -391,7 +418,12 @@ EXPORT_SYMBOL(__insert_inode_hash);
*/
static void __remove_inode_hash(struct inode *inode)
{
- hlist_del_init(&inode->i_hash);
+ struct inode_hash_bucket *b;
+
+ b = inode_hashtable + hash(inode->i_sb, inode->i_ino);
+ spin_lock_bucket(b);
+ hlist_bl_del_init(&inode->i_hash);
+ spin_unlock_bucket(b);
}
/**
@@ -403,7 +435,7 @@ static void __remove_inode_hash(struct inode *inode)
void remove_inode_hash(struct inode *inode)
{
spin_lock(&inode_lock);
- hlist_del_init(&inode->i_hash);
+ __remove_inode_hash(inode);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(remove_inode_hash);
@@ -663,25 +695,28 @@ static void __wait_on_freeing_inode(struct inode *inode);
* add any additional branch in the common code.
*/
static struct inode *find_inode(struct super_block *sb,
- struct hlist_head *head,
+ struct inode_hash_bucket *b,
int (*test)(struct inode *, void *),
void *data)
{
- struct hlist_node *node;
+ struct hlist_bl_node *node;
struct inode *inode = NULL;
repeat:
- hlist_for_each_entry(inode, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
if (inode->i_sb != sb)
continue;
if (!test(inode, data))
continue;
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+ spin_unlock_bucket(b);
__wait_on_freeing_inode(inode);
goto repeat;
}
break;
}
+ spin_unlock_bucket(b);
return node ? inode : NULL;
}
@@ -690,33 +725,40 @@ repeat:
* iget_locked for details.
*/
static struct inode *find_inode_fast(struct super_block *sb,
- struct hlist_head *head, unsigned long ino)
+ struct inode_hash_bucket *b,
+ unsigned long ino)
{
- struct hlist_node *node;
+ struct hlist_bl_node *node;
struct inode *inode = NULL;
repeat:
- hlist_for_each_entry(inode, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
if (inode->i_ino != ino)
continue;
if (inode->i_sb != sb)
continue;
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+ spin_unlock_bucket(b);
__wait_on_freeing_inode(inode);
goto repeat;
}
break;
}
+ spin_unlock_bucket(b);
return node ? inode : NULL;
}
static inline void
-__inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
+__inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
struct inode *inode)
{
list_add(&inode->i_sb_list, &sb->s_inodes);
- if (head)
- hlist_add_head(&inode->i_hash, head);
+ if (b) {
+ spin_lock_bucket(b);
+ hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
+ }
}
/**
@@ -733,10 +775,10 @@ __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
*/
void inode_add_to_lists(struct super_block *sb, struct inode *inode)
{
- struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, inode->i_ino);
spin_lock(&inode_lock);
- __inode_add_to_lists(sb, head, inode);
+ __inode_add_to_lists(sb, b, inode);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL_GPL(inode_add_to_lists);
@@ -819,7 +861,7 @@ EXPORT_SYMBOL(unlock_new_inode);
* -- rmk@arm.uk.linux.org
*/
static struct inode *get_new_inode(struct super_block *sb,
- struct hlist_head *head,
+ struct inode_hash_bucket *b,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *),
void *data)
@@ -832,12 +874,12 @@ static struct inode *get_new_inode(struct super_block *sb,
spin_lock(&inode_lock);
/* We released the lock, so.. */
- old = find_inode(sb, head, test, data);
+ old = find_inode(sb, b, test, data);
if (!old) {
if (set(inode, data))
goto set_failed;
- __inode_add_to_lists(sb, head, inode);
+ __inode_add_to_lists(sb, b, inode);
inode->i_state = I_NEW;
spin_unlock(&inode_lock);
@@ -873,7 +915,7 @@ set_failed:
* comment at iget_locked for details.
*/
static struct inode *get_new_inode_fast(struct super_block *sb,
- struct hlist_head *head, unsigned long ino)
+ struct inode_hash_bucket *b, unsigned long ino)
{
struct inode *inode;
@@ -883,10 +925,10 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
spin_lock(&inode_lock);
/* We released the lock, so.. */
- old = find_inode_fast(sb, head, ino);
+ old = find_inode_fast(sb, b, ino);
if (!old) {
inode->i_ino = ino;
- __inode_add_to_lists(sb, head, inode);
+ __inode_add_to_lists(sb, b, inode);
inode->i_state = I_NEW;
spin_unlock(&inode_lock);
@@ -935,7 +977,7 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
*/
static unsigned int counter;
struct inode *inode;
- struct hlist_head *head;
+ struct inode_hash_bucket *b;
ino_t res;
spin_lock(&inode_lock);
@@ -943,8 +985,8 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
if (counter <= max_reserved)
counter = max_reserved + 1;
res = counter++;
- head = inode_hashtable + hash(sb, res);
- inode = find_inode_fast(sb, head, res);
+ b = inode_hashtable + hash(sb, res);
+ inode = find_inode_fast(sb, b, res);
} while (inode != NULL);
spin_unlock(&inode_lock);
@@ -991,13 +1033,14 @@ EXPORT_SYMBOL(igrab);
* Note, @test is called with the inode_lock held, so can't sleep.
*/
static struct inode *ifind(struct super_block *sb,
- struct hlist_head *head, int (*test)(struct inode *, void *),
+ struct inode_hash_bucket *b,
+ int (*test)(struct inode *, void *),
void *data, const int wait)
{
struct inode *inode;
spin_lock(&inode_lock);
- inode = find_inode(sb, head, test, data);
+ inode = find_inode(sb, b, test, data);
if (inode) {
spin_lock(&inode->i_lock);
iref_locked(inode);
@@ -1027,12 +1070,13 @@ static struct inode *ifind(struct super_block *sb,
* Otherwise NULL is returned.
*/
static struct inode *ifind_fast(struct super_block *sb,
- struct hlist_head *head, unsigned long ino)
+ struct inode_hash_bucket *b,
+ unsigned long ino)
{
struct inode *inode;
spin_lock(&inode_lock);
- inode = find_inode_fast(sb, head, ino);
+ inode = find_inode_fast(sb, b, ino);
if (inode) {
spin_lock(&inode->i_lock);
iref_locked(inode);
@@ -1069,9 +1113,9 @@ static struct inode *ifind_fast(struct super_block *sb,
struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
- return ifind(sb, head, test, data, 0);
+ return ifind(sb, b, test, data, 0);
}
EXPORT_SYMBOL(ilookup5_nowait);
@@ -1097,9 +1141,9 @@ EXPORT_SYMBOL(ilookup5_nowait);
struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
- return ifind(sb, head, test, data, 1);
+ return ifind(sb, b, test, data, 1);
}
EXPORT_SYMBOL(ilookup5);
@@ -1119,9 +1163,9 @@ EXPORT_SYMBOL(ilookup5);
*/
struct inode *ilookup(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
- return ifind_fast(sb, head, ino);
+ return ifind_fast(sb, b, ino);
}
EXPORT_SYMBOL(ilookup);
@@ -1149,17 +1193,17 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *), void *data)
{
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
struct inode *inode;
- inode = ifind(sb, head, test, data, 1);
+ inode = ifind(sb, b, test, data, 1);
if (inode)
return inode;
/*
* get_new_inode() will do the right thing, re-trying the search
* in case it had to block at any point.
*/
- return get_new_inode(sb, head, test, set, data);
+ return get_new_inode(sb, b, test, set, data);
}
EXPORT_SYMBOL(iget5_locked);
@@ -1180,17 +1224,17 @@ EXPORT_SYMBOL(iget5_locked);
*/
struct inode *iget_locked(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
struct inode *inode;
- inode = ifind_fast(sb, head, ino);
+ inode = ifind_fast(sb, b, ino);
if (inode)
return inode;
/*
* get_new_inode_fast() will do the right thing, re-trying the search
* in case it had to block at any point.
*/
- return get_new_inode_fast(sb, head, ino);
+ return get_new_inode_fast(sb, b, ino);
}
EXPORT_SYMBOL(iget_locked);
@@ -1198,14 +1242,15 @@ int insert_inode_locked(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
ino_t ino = inode->i_ino;
- struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
inode->i_state |= I_NEW;
while (1) {
- struct hlist_node *node;
+ struct hlist_bl_node *node;
struct inode *old = NULL;
spin_lock(&inode_lock);
- hlist_for_each_entry(old, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
if (old->i_ino != ino)
continue;
if (old->i_sb != sb)
@@ -1215,16 +1260,18 @@ int insert_inode_locked(struct inode *inode)
break;
}
if (likely(!node)) {
- hlist_add_head(&inode->i_hash, head);
+ hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
spin_unlock(&inode_lock);
return 0;
}
spin_lock(&old->i_lock);
iref_locked(old);
spin_unlock(&old->i_lock);
+ spin_unlock_bucket(b);
spin_unlock(&inode_lock);
wait_on_inode(old);
- if (unlikely(!hlist_unhashed(&old->i_hash))) {
+ if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
iput(old);
return -EBUSY;
}
@@ -1237,16 +1284,17 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
struct super_block *sb = inode->i_sb;
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
inode->i_state |= I_NEW;
while (1) {
- struct hlist_node *node;
+ struct hlist_bl_node *node;
struct inode *old = NULL;
spin_lock(&inode_lock);
- hlist_for_each_entry(old, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
if (old->i_sb != sb)
continue;
if (!test(old, data))
@@ -1256,16 +1304,18 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
break;
}
if (likely(!node)) {
- hlist_add_head(&inode->i_hash, head);
+ hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
spin_unlock(&inode_lock);
return 0;
}
spin_lock(&old->i_lock);
iref_locked(old);
spin_unlock(&old->i_lock);
+ spin_unlock_bucket(b);
spin_unlock(&inode_lock);
wait_on_inode(old);
- if (unlikely(!hlist_unhashed(&old->i_hash))) {
+ if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
iput(old);
return -EBUSY;
}
@@ -1288,7 +1338,7 @@ EXPORT_SYMBOL(generic_delete_inode);
*/
int generic_drop_inode(struct inode *inode)
{
- return !inode->i_nlink || hlist_unhashed(&inode->i_hash);
+ return !inode->i_nlink || hlist_bl_unhashed(&inode->i_hash);
}
EXPORT_SYMBOL_GPL(generic_drop_inode);
@@ -1331,7 +1381,6 @@ static void iput_final(struct inode *inode)
spin_lock(&inode_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
- hlist_del_init(&inode->i_hash);
__remove_inode_hash(inode);
}
list_del_init(&inode->i_io);
@@ -1599,7 +1648,7 @@ void __init inode_init_early(void)
inode_hashtable =
alloc_large_system_hash("Inode-cache",
- sizeof(struct hlist_head),
+ sizeof(struct inode_hash_bucket),
ihash_entries,
14,
HASH_EARLY,
@@ -1608,7 +1657,7 @@ void __init inode_init_early(void)
0);
for (loop = 0; loop < (1 << i_hash_shift); loop++)
- INIT_HLIST_HEAD(&inode_hashtable[loop]);
+ INIT_HLIST_BL_HEAD(&inode_hashtable[loop].head);
}
@@ -1633,7 +1682,7 @@ void __init inode_init(void)
inode_hashtable =
alloc_large_system_hash("Inode-cache",
- sizeof(struct hlist_head),
+ sizeof(struct inode_hash_bucket),
ihash_entries,
14,
0,
@@ -1642,7 +1691,7 @@ void __init inode_init(void)
0);
for (loop = 0; loop < (1 << i_hash_shift); loop++)
- INIT_HLIST_HEAD(&inode_hashtable[loop]);
+ INIT_HLIST_BL_HEAD(&inode_hashtable[loop].head);
}
void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
diff --git a/fs/nilfs2/gcinode.c b/fs/nilfs2/gcinode.c
index bed3a78..ce7344e 100644
--- a/fs/nilfs2/gcinode.c
+++ b/fs/nilfs2/gcinode.c
@@ -196,13 +196,13 @@ int nilfs_init_gccache(struct the_nilfs *nilfs)
INIT_LIST_HEAD(&nilfs->ns_gc_inodes);
nilfs->ns_gc_inodes_h =
- kmalloc(sizeof(struct hlist_head) * NILFS_GCINODE_HASH_SIZE,
+ kmalloc(sizeof(struct hlist_bl_head) * NILFS_GCINODE_HASH_SIZE,
GFP_NOFS);
if (nilfs->ns_gc_inodes_h == NULL)
return -ENOMEM;
for (loop = 0; loop < NILFS_GCINODE_HASH_SIZE; loop++)
- INIT_HLIST_HEAD(&nilfs->ns_gc_inodes_h[loop]);
+ INIT_HLIST_BL_HEAD(&nilfs->ns_gc_inodes_h[loop]);
return 0;
}
@@ -254,18 +254,18 @@ static unsigned long ihash(ino_t ino, __u64 cno)
*/
struct inode *nilfs_gc_iget(struct the_nilfs *nilfs, ino_t ino, __u64 cno)
{
- struct hlist_head *head = nilfs->ns_gc_inodes_h + ihash(ino, cno);
- struct hlist_node *node;
+ struct hlist_bl_head *head = nilfs->ns_gc_inodes_h + ihash(ino, cno);
+ struct hlist_bl_node *node;
struct inode *inode;
- hlist_for_each_entry(inode, node, head, i_hash) {
+ hlist_bl_for_each_entry(inode, node, head, i_hash) {
if (inode->i_ino == ino && NILFS_I(inode)->i_cno == cno)
return inode;
}
inode = alloc_gcinode(nilfs, ino, cno);
if (likely(inode)) {
- hlist_add_head(&inode->i_hash, head);
+ hlist_bl_add_head(&inode->i_hash, head);
list_add(&NILFS_I(inode)->i_dirty, &nilfs->ns_gc_inodes);
}
return inode;
@@ -284,16 +284,18 @@ void nilfs_clear_gcinode(struct inode *inode)
*/
void nilfs_remove_all_gcinode(struct the_nilfs *nilfs)
{
- struct hlist_head *head = nilfs->ns_gc_inodes_h;
- struct hlist_node *node, *n;
+ struct hlist_bl_head *head = nilfs->ns_gc_inodes_h;
+ struct hlist_bl_node *node;
struct inode *inode;
int loop;
for (loop = 0; loop < NILFS_GCINODE_HASH_SIZE; loop++, head++) {
- hlist_for_each_entry_safe(inode, node, n, head, i_hash) {
- hlist_del_init(&inode->i_hash);
+restart:
+ hlist_bl_for_each_entry(inode, node, head, i_hash) {
+ hlist_bl_del_init(&inode->i_hash);
list_del_init(&NILFS_I(inode)->i_dirty);
nilfs_clear_gcinode(inode); /* might sleep */
+ goto restart;
}
}
}
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 9fd051a..038251c 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -2452,7 +2452,7 @@ nilfs_remove_written_gcinodes(struct the_nilfs *nilfs, struct list_head *head)
list_for_each_entry_safe(ii, n, head, i_dirty) {
if (!test_bit(NILFS_I_UPDATED, &ii->i_state))
continue;
- hlist_del_init(&ii->vfs_inode.i_hash);
+ hlist_bl_del_init(&ii->vfs_inode.i_hash);
list_del_init(&ii->i_dirty);
nilfs_clear_gcinode(&ii->vfs_inode);
}
diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index f785a7b..1ab441a 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -167,7 +167,7 @@ struct the_nilfs {
/* GC inode list and hash table head */
struct list_head ns_gc_inodes;
- struct hlist_head *ns_gc_inodes_h;
+ struct hlist_bl_head *ns_gc_inodes_h;
/* Disk layout information (static) */
unsigned int ns_blocksize_bits;
diff --git a/fs/reiserfs/xattr.c b/fs/reiserfs/xattr.c
index 8c4cf27..ea2f55c 100644
--- a/fs/reiserfs/xattr.c
+++ b/fs/reiserfs/xattr.c
@@ -424,7 +424,7 @@ int reiserfs_prepare_write(struct file *f, struct page *page,
static void update_ctime(struct inode *inode)
{
struct timespec now = current_fs_time(inode->i_sb);
- if (hlist_unhashed(&inode->i_hash) || !inode->i_nlink ||
+ if (hlist_bl_unhashed(&inode->i_hash) || !inode->i_nlink ||
timespec_equal(&inode->i_ctime, &now))
return;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1162c10..34f983f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -383,6 +383,7 @@ struct inodes_stat_t {
#include <linux/capability.h>
#include <linux/semaphore.h>
#include <linux/fiemap.h>
+#include <linux/rculist_bl.h>
#include <asm/atomic.h>
#include <asm/byteorder.h>
@@ -724,7 +725,7 @@ struct posix_acl;
#define ACL_NOT_CACHED ((void *)(-1))
struct inode {
- struct hlist_node i_hash;
+ struct hlist_bl_node i_hash;
struct list_head i_io; /* backing dev IO list */
struct list_head i_lru; /* backing dev IO list */
struct list_head i_sb_list;
diff --git a/mm/shmem.c b/mm/shmem.c
index 4daaa24..7a2a5de 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2146,7 +2146,7 @@ static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
if (*len < 3)
return 255;
- if (hlist_unhashed(&inode->i_hash)) {
+ if (hlist_bl_unhashed(&inode->i_hash)) {
/* Unfortunately insert_inode_hash is not idempotent,
* so as we hash inodes here rather than at creation
* time, we need a lock to ensure we only try
@@ -2154,7 +2154,7 @@ static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
*/
static DEFINE_SPINLOCK(lock);
spin_lock(&lock);
- if (hlist_unhashed(&inode->i_hash))
+ if (hlist_bl_unhashed(&inode->i_hash))
__insert_inode_hash(inode,
inode->i_ino + inode->i_generation);
spin_unlock(&lock);
--
1.7.1
^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-08 5:21 ` [PATCH 11/18] fs: Introduce per-bucket inode hash locks Dave Chinner
@ 2010-10-08 7:33 ` Christoph Hellwig
2010-10-08 7:51 ` Dave Chinner
2010-10-08 9:49 ` Al Viro
` (2 subsequent siblings)
3 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 7:33 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:25PM +1100, Dave Chinner wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> Protect the inod hash with a single lock is not scalable. Convert
s/inod/inode/
> p = &root->inode_tree.rb_node;
> parent = NULL;
>
> - if (hlist_unhashed(&inode->i_hash))
> + if (hlist_bl_unhashed(&inode->i_hash))
Maybe introduce an inode_unhashed helper for this check which we're
doing in quite a lot of places?
Otherwise looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
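[A minimal sketch of the suggested helper, assuming it simply wraps the
hlist_bl test on i_hash; the name and final placement (e.g. include/linux/fs.h)
are whatever gets decided, not something settled in this thread:]

static inline int inode_unhashed(struct inode *inode)
{
	return hlist_bl_unhashed(&inode->i_hash);
}

[Callers such as __mark_inode_dirty() and generic_drop_inode() would then
read "if (inode_unhashed(inode))" instead of open-coding the list primitive.]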
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-08 7:33 ` Christoph Hellwig
@ 2010-10-08 7:51 ` Dave Chinner
0 siblings, 0 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 7:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 03:33:06AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 08, 2010 at 04:21:25PM +1100, Dave Chinner wrote:
> > From: Nick Piggin <npiggin@suse.de>
> >
> > Protect the inod hash with a single lock is not scalable. Convert
>
> s/inod/inode/
>
> > p = &root->inode_tree.rb_node;
> > parent = NULL;
> >
> > - if (hlist_unhashed(&inode->i_hash))
> > + if (hlist_bl_unhashed(&inode->i_hash))
>
> Maybe introduce an inode_unhashed helper for this check which we're
> doing in quite a lot of places?
Ok, makes sense.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-08 5:21 ` [PATCH 11/18] fs: Introduce per-bucket inode hash locks Dave Chinner
2010-10-08 7:33 ` Christoph Hellwig
@ 2010-10-08 9:49 ` Al Viro
2010-10-08 9:51 ` Christoph Hellwig
2010-10-08 13:43 ` Christoph Hellwig
2010-10-08 18:54 ` Christoph Hellwig
3 siblings, 1 reply; 162+ messages in thread
From: Al Viro @ 2010-10-08 9:49 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:25PM +1100, Dave Chinner wrote:
> - if (hlist_unhashed(&inode->i_hash))
> + if (hlist_bl_unhashed(&inode->i_hash))
That, OTOH, begs for an (inlined) helper with a more readable name.
> HFS_I(inode)->rsrc_inode = dir;
> HFS_I(dir)->rsrc_inode = inode;
> igrab(dir);
> - hlist_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
> + hlist_bl_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
> mark_inode_dirty(inode);
Lovely. What protects that list? Same question for hfsplus...
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-08 9:49 ` Al Viro
@ 2010-10-08 9:51 ` Christoph Hellwig
0 siblings, 0 replies; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 9:51 UTC (permalink / raw)
To: Al Viro; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 10:49:15AM +0100, Al Viro wrote:
> > HFS_I(inode)->rsrc_inode = dir;
> > HFS_I(dir)->rsrc_inode = inode;
> > igrab(dir);
> > - hlist_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
> > + hlist_bl_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
> > mark_inode_dirty(inode);
>
> Lovely. What protects that list? Same question for hfsplus...
Nothing. It's also never actually read. For hfsplus all that is fixed
in my hfsplus tree, but I'll still need to find a sucker to backport
all this to hfs.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-08 5:21 ` [PATCH 11/18] fs: Introduce per-bucket inode hash locks Dave Chinner
2010-10-08 7:33 ` Christoph Hellwig
2010-10-08 9:49 ` Al Viro
@ 2010-10-08 13:43 ` Christoph Hellwig
2010-10-08 14:17 ` Dave Chinner
2010-10-08 18:54 ` Christoph Hellwig
3 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 13:43 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1162c10..34f983f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -383,6 +383,7 @@ struct inodes_stat_t {
> #include <linux/capability.h>
> #include <linux/semaphore.h>
> #include <linux/fiemap.h>
> +#include <linux/rculist_bl.h>
rculist_bl.h doesn't actually exist anymore in your tree, so this needs
to be list_bl.h to actually compile.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-08 13:43 ` Christoph Hellwig
@ 2010-10-08 14:17 ` Dave Chinner
0 siblings, 0 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 14:17 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 09:43:15AM -0400, Christoph Hellwig wrote:
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 1162c10..34f983f 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -383,6 +383,7 @@ struct inodes_stat_t {
> > #include <linux/capability.h>
> > #include <linux/semaphore.h>
> > #include <linux/fiemap.h>
> > +#include <linux/rculist_bl.h>
>
> rculist_bl.h doesn't actually exist anymore in your tree, so this needs
> to be list_bl.h to actually compile.
/me is confused.
rculist_bl.h doesn't exist in my tree, but that include does, yet
everything compiles.
<shrug>
Fixed.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-08 5:21 ` [PATCH 11/18] fs: Introduce per-bucket inode hash locks Dave Chinner
` (2 preceding siblings ...)
2010-10-08 13:43 ` Christoph Hellwig
@ 2010-10-08 18:54 ` Christoph Hellwig
2010-10-16 7:57 ` Nick Piggin
3 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 18:54 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
> +struct inode_hash_bucket {
> + struct hlist_bl_head head;
> +};
> +
> +static inline void spin_lock_bucket(struct inode_hash_bucket *b)
> +{
> + bit_spin_lock(0, (unsigned long *)b);
> +}
> +
> +static inline void spin_unlock_bucket(struct inode_hash_bucket *b)
> +{
> + __bit_spin_unlock(0, (unsigned long *)b);
> +}
I've looked at the dcache version of this again, and I really hate
duplicating these helpers in the dcache code as well. IMHO they
should simply operate directly on the hlist_bl_head, as that's
what it was designed for. I also don't really see any point in
wrapping the hlist_bl_head as inode_hash_bucket. If the bucket naming
is important we could rename the hlist_bl stuff to bl_hash, and the
hlist_bl_head could become bl_hash_bucket.
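[A sketch of what moving the helpers next to the list implementation might
look like; the names here are only suggestions, not what was merged in this
thread:]

/* in include/linux/list_bl.h, alongside the other hlist_bl helpers */
static inline void hlist_bl_lock(struct hlist_bl_head *b)
{
	bit_spin_lock(0, (unsigned long *)&b->first);
}

static inline void hlist_bl_unlock(struct hlist_bl_head *b)
{
	__bit_spin_unlock(0, (unsigned long *)&b->first);
}

[The inode and dcache hashes could then call these directly, with no
per-user cast of the head to an unsigned long.]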
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-08 18:54 ` Christoph Hellwig
@ 2010-10-16 7:57 ` Nick Piggin
2010-10-16 16:16 ` Christoph Hellwig
0 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-16 7:57 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 02:54:09PM -0400, Christoph Hellwig wrote:
> > +struct inode_hash_bucket {
> > + struct hlist_bl_head head;
> > +};
> > +
> > +static inline void spin_lock_bucket(struct inode_hash_bucket *b)
> > +{
> > + bit_spin_lock(0, (unsigned long *)b);
> > +}
> > +
> > +static inline void spin_unlock_bucket(struct inode_hash_bucket *b)
> > +{
> > + __bit_spin_unlock(0, (unsigned long *)b);
> > +}
>
> I've looked at the dcache version of this again, and I really hate
> duplicating these helpers in the dcache code as well. IMHO they
> should simply operate directly on the hlist_bl_head, as that's
> what it was designed for. I also don't really see any point in
> wrapping the hlist_bl_head as inode_hash_bucket. If the bucket naming
> is important we could rename the hlist_bl stuff to bl_hash, and the
> hlist_bl_head could become bl_hash_bucket.
It was done because someone, like -rt, might want more than one bit of
memory to implement a lock. They would have to make a few other
changes, granted, but this helps reduce a lot of churn.
I didn't see the point of a layer of dumb wrappers for hlist_bl_head
locking. Just reproducing bit spin and wait locks in wrappers when we
already have good functions for them.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-16 7:57 ` Nick Piggin
@ 2010-10-16 16:16 ` Christoph Hellwig
2010-10-16 17:12 ` Nick Piggin
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-16 16:16 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 06:57:03PM +1100, Nick Piggin wrote:
> > duplicating these helpers in the dcache code as well. IMHO they
> > should simply operate directly on the hlist_bl_head, as that's
> > what it was designed for. I also don't really see any point in
> > wrapping the hlist_bl_head as inode_hash_bucket. If the bucket naming
> > is important we could rename the hlist_bl stuff to bl_hash, and the
> > hlist_bl_head could become bl_hash_bucket.
>
> It was done because someone, like -rt, might want more than one bit of
> memory to implement a lock. They would have to make a few other
> changes, granted, but this helps reduce a lot of churn.
>
> I didn't see the point of a layer of dumb wrappers for hlist_bl_head
> locking. Just reproducing bit spin and wait locks in wrappers when we
> already have good functions for them.
With the changes Dave implemented based on my suggestions we now have
an abstract locked hash list data type. It has the normal hash list
operations plus lock/unlock operations. So if e.g. the -rt folks need
real locks in there there is one single place they need to touch
instead of every user. Similarly if we want to add lockdep support
there is just one place to touch.
Never mind that the cast of the list pointer to unsigned long is ugly enough
that it really needs to be hidden in a helper belonging to the list
implementation that is properly commented.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-16 16:16 ` Christoph Hellwig
@ 2010-10-16 17:12 ` Nick Piggin
2010-10-17 0:45 ` Christoph Hellwig
2010-10-17 0:46 ` Dave Chinner
0 siblings, 2 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-16 17:12 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 12:16:42PM -0400, Christoph Hellwig wrote:
> On Sat, Oct 16, 2010 at 06:57:03PM +1100, Nick Piggin wrote:
> > > duplicating these helpers in the dcache code as well. IMHO they
> > > should simply operate directly on the hlist_bl_head, as that's
> > > what it was designed for. I also don't really see any point in
> > > wrapping the hlist_bl_head as inode_hash_bucket. If the bucket naming
> > > is important we could rename the hlist_bl stuff to bl_hash, and the
> > > hlist_bl_head could become bl_hash_bucket.
> >
> > It was done because someone, like -rt, might want more than one bit of
> > memory to implement a lock. They would have to make a few other
> > changes, granted, but this helps reduce a lot of churn.
> >
> > I didn't see the point of a layer of dumb wrappers for hlist_bl_head
> > locking. Just reproducing bit spin and wait locks in wrappers when we
> > already have good functions for them.
>
> With the changes Dave implemented based on my suggestions we now have
> an abstract locked hash list data type. It has the normal hash list
> operations plus lock/unlock operations.
That's ugly. It just hides the locking. If a bit of casting bothers
you then put it in a function where it is used like I did.
> So if e.g. the -rt folks need
> real locks in there there is one single place they need to touch
> instead of every user. Similarly if we want to add lockdep support
> there is just one place to touch.
It's unnecessary.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-16 17:12 ` Nick Piggin
@ 2010-10-17 0:45 ` Christoph Hellwig
2010-10-17 2:06 ` Nick Piggin
2010-10-17 0:46 ` Dave Chinner
1 sibling, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-17 0:45 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel
On Sun, Oct 17, 2010 at 04:12:13AM +1100, Nick Piggin wrote:
> > With the changes Dave implemented based on my suggestions we now have
> > an abstract locked hash list data type. It has the normal hash list
> > operations plus lock/unlock operations.
>
> That's ugly. It just hides the locking. If a bit of casting bothers
> you then put it in a function where it is used like I did.
Exposing the implementation details of which bit of a pointer can
be used as lock when cast to an unsigned long to every user of an
abstract type is what I would consider ugly, and on similar issues
I've certainly not been the only one.
> > So if e.g. the -rt folks need
> > real locks in there there is one single place they need to touch
> > instead of every user. Similarly if we want to add lockdep support
> > there is just one place to touch.
>
> It's unnecessary.
What, lockdep support?
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-17 0:45 ` Christoph Hellwig
@ 2010-10-17 2:06 ` Nick Piggin
0 siblings, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-17 2:06 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 08:45:15PM -0400, Christoph Hellwig wrote:
> On Sun, Oct 17, 2010 at 04:12:13AM +1100, Nick Piggin wrote:
> > > With the changes Dave implemented based on my suggestions we now have
> > > an abstract locked hash list data type. It has the normal hash list
> > > operations plus lock/unlock operations.
> >
> > That's ugly. It just hides the locking. If a bit of casting bothers
> > you then put it in a function where it is used like I did.
>
> Exposing the implementation details of which bit of a pointer can
> be used as lock when cast to an unsigned long to every user of an
> abstract type is what I would consider ugly, and on similar issues
> I've certainly not been the only one.
The low bit.
> > > So if e.g. the -rt folks need
> > > real locks in there there is one single place they need to touch
> > > instead of every user. Similarly if we want to add lockdep support
> > > there is just one place to touch.
> >
> > It's unnecessary.
>
> What, lockdep support?
Yes. It would be stupid to do lockdep support for bit spinlocks in all
places where they are used. What should be done (and there is work
towards it) is to be able to change the bit spinlock API (or add a new one)
so that an external lockdep data structure can be given.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-16 17:12 ` Nick Piggin
2010-10-17 0:45 ` Christoph Hellwig
@ 2010-10-17 0:46 ` Dave Chinner
2010-10-17 2:25 ` Nick Piggin
1 sibling, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-17 0:46 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
On Sun, Oct 17, 2010 at 04:12:13AM +1100, Nick Piggin wrote:
> On Sat, Oct 16, 2010 at 12:16:42PM -0400, Christoph Hellwig wrote:
> > On Sat, Oct 16, 2010 at 06:57:03PM +1100, Nick Piggin wrote:
> > > > duplicating these helpers in the dcache code as well. IMHO they
> > > > should simply operate directly on the hlist_bl_head, as that's
> > > > what it was designed for. I also don't really see any point in
> > > > wrapping the hlist_bl_head as inode_hash_bucket. If the bucket naming
> > > > is important we could rename the hlist_bl stuff to bl_hash, and the
> > > > hlist_bl_head could become bl_hash_bucket.
> > >
> > > It was done because someone, like -rt, might want more than one bit of
> > > memory to implement a lock. They would have to make a few other
> > > changes, granted, but this helps reduce a lot of churn.
> > >
> > > I didn't see the point of a layer of dumb wrappers for hlist_bl_head
> > > locking. Just reproducing bit spin and wait locks in wrappers when we
> > > already have good functions for them.
> >
> > With the changes Dave implemented based on my suggestions we now have
> > an abstract locked hash list data type. It has the normal hash list
> > operations plus lock/unlock operations.
>
> That's ugly. It just hides the locking. If a bit of casting bothers
> you then put it in a function where it is used like I did.
I much prefer the abstraction from an end-user point of view. Asking
every developer to understand the intricacies of locking the
hlist_bl structures is asking them to get it wrong. Providing
locking wrappers that are exactly what users need so they don't have
to care about it is, IMO, the right thing to do.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-17 0:46 ` Dave Chinner
@ 2010-10-17 2:25 ` Nick Piggin
2010-10-18 16:16 ` Andi Kleen
0 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-17 2:25 UTC (permalink / raw)
To: Dave Chinner; +Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel
On Sun, Oct 17, 2010 at 11:46:10AM +1100, Dave Chinner wrote:
> On Sun, Oct 17, 2010 at 04:12:13AM +1100, Nick Piggin wrote:
> > On Sat, Oct 16, 2010 at 12:16:42PM -0400, Christoph Hellwig wrote:
> > > On Sat, Oct 16, 2010 at 06:57:03PM +1100, Nick Piggin wrote:
> > > > > duplicating these helpers in the dcache code as well. IMHO they
> > > > > should simply operate directly on the hlist_bl_head, as that's
> > > > > what it was designed for. I also don't really see any point in
> > > > > wrapping the hlist_bl_head as inode_hash_bucket. If the bucket naming
> > > > > is important we could rename the hlist_bl stuff to bl_hash, and the
> > > > > hlist_bl_head could become bl_hash_bucket.
> > > >
> > > > It was done because someone, like -rt, might want more than one bit of
> > > > memory to implement a lock. They would have to make a few other
> > > > changes, granted, but this helps reduce a lot of churn.
> > > >
> > > > I didn't see the point of a layer of dumb wrappers for hlist_bl_head
> > > > locking. Just reproducing bit spin and wait locks in wrappers when we
> > > > already have good functions for them.
> > >
> > > With the changes Dave implemented based on my suggestions we now have
> > > an abstract locked hash list data type. It has the normal hash list
> > > operations plus lock/unlock operations.
> >
> > That's ugly. It just hides the locking. If a bit of casting bothers
> > you then put it in a function where it is used like I did.
>
> I much prefer the abstraction from an end-user point of view. Asking
> every developer to understand the intricacies of locking the
> hlist_bl structures is asking them to get it wrong.
Asking them to understand that 0 bit is to be used for locking? If
they don't understand that they shouldn't be doing kernel programming,
let alone trying to implement advanced scalability with bl lists.
Really.
> Providing
> locking wrappers that are exactly what users need so they don't have
> to care about it is, IMO, the right thing to do.
Hiding the type of lock, and hiding the fact that it sets the low bit?
I don't agree. We don't have synchronization in our data structures,
where possible, because it is just restrictive or goes wrong when people
don't think enough about the locking.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-17 2:25 ` Nick Piggin
@ 2010-10-18 16:16 ` Andi Kleen
2010-10-18 16:21 ` Christoph Hellwig
0 siblings, 1 reply; 162+ messages in thread
From: Andi Kleen @ 2010-10-18 16:16 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-kernel
Nick Piggin <npiggin@kernel.dk> writes:
>
>> Providing
>> locking wrappers that are exactly what users need so they don't have
>> to care about it is, IMO, the right thing to do.
>
> Hiding the type of lock, and hiding the fact that it sets the low bit?
> I don't agree. We don't have synchronization in our data structures,
> where possible, because it is just restrictive or goes wrong when people
> don't think enough about the locking.
I fully agree. The old skb lists in networking made this mistake
long ago and it was a big problem, until people essentially stopped
using it (always using __ variants) and it was eventually removed.
Magic locking in data structures is usually a bad idea.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-18 16:16 ` Andi Kleen
@ 2010-10-18 16:21 ` Christoph Hellwig
2010-10-19 7:00 ` Nick Piggin
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-18 16:21 UTC (permalink / raw)
To: Andi Kleen
Cc: Nick Piggin, Dave Chinner, Christoph Hellwig, linux-fsdevel,
linux-kernel
On Mon, Oct 18, 2010 at 06:16:50PM +0200, Andi Kleen wrote:
> > Hiding the type of lock, and hiding the fact that it sets the low bit?
> > I don't agree. We don't have synchronization in our data structures,
> > where possible, because it is just restrictive or goes wrong when people
> > don't think enough about the locking.
>
> I fully agree. The old skb lists in networking made this mistake
> long ago and it was a big problem, until people essentially stopped
> using it (always using __ variants) and it was eventually removed.
>
> Magic locking in data structures is usually a bad idea.
Err, there is no implicit locking in the calls to hlist_*. There
are just two small wrappers for the bit-lock/unlock so that the callers
don't have to know how the lock is overloaded onto the pointer in the
list head.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-18 16:21 ` Christoph Hellwig
@ 2010-10-19 7:00 ` Nick Piggin
2010-10-19 16:50 ` Christoph Hellwig
0 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-19 7:00 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andi Kleen, Nick Piggin, Dave Chinner, linux-fsdevel,
linux-kernel
On Mon, Oct 18, 2010 at 12:21:05PM -0400, Christoph Hellwig wrote:
> On Mon, Oct 18, 2010 at 06:16:50PM +0200, Andi Kleen wrote:
> > > Hiding the type of lock, and hiding the fact that it sets the low bit?
> > > I don't agree. We don't have synchronization in our data structures,
> > > where possible, because it is just restrictive or goes wrong when people
> > > don't think enough about the locking.
> >
> > I fully agree. The old skb lists in networking made this mistake
> > long ago and it was a big problem, until people essentially stopped
> > using it (always using __ variants) and it was eventually removed.
> >
> > Magic locking in data structures is usually a bad idea.
>
> Err, there is no implicit locking in the calls to hlist_*. There
> are just two small wrappers for the bit-lock/unlock so that the callers
> don't have to know how the lock is overloaded onto the pointer in the
> list head.
But it is still "magic". Because you don't even know whether it
is a spin or sleeping lock, let alone whether it is irq or bh safe.
You get far more information seeing a bit_spin_lock(0, &hlist) call
than hlist_lock().
Even if you do rename them to hlist_bit_spin_lock, etc. Then you need
to add variants for each type of locking a caller wants to do on it.
Ask Linus what he thinks about that.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-19 7:00 ` Nick Piggin
@ 2010-10-19 16:50 ` Christoph Hellwig
2010-10-20 3:11 ` Nick Piggin
2010-10-24 15:44 ` Thomas Gleixner
0 siblings, 2 replies; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-19 16:50 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Hellwig, Andi Kleen, Dave Chinner, linux-fsdevel,
linux-kernel, tglx
On Tue, Oct 19, 2010 at 06:00:57PM +1100, Nick Piggin wrote:
> But it is still "magic". Because you don't even know whether it
> is a spin or sleeping lock, let alone whether it is irq or bh safe.
> You get far more information seeing a bit_spin_lock(0, &hlist) call
> than hlist_lock().
>
> Even if you do rename them to hlist_bit_spin_lock, etc. Then you need
> to add variants for each type of locking a caller wants to do on it.
> Ask Linus what he thinks about that.
And why do we need all these versions? We never had irqsave or bhsave
versions of bitlocks. The only things we might eventually want are
trylock and is_locked operations once we grow users of it.
To get back a bit to the point:
- we have a new bl_hlist structure which combines a hash list and a
lock embedded into the head
- the reason why we do it is to be able to use a bitlock
Now your initial version exposed the ugly defaults of that to the user
which is required to cast the hash head to an unsigned long and use
bit_spin_lock on bit 0 of it. There's zero abstraction and a lot of
internal gutting required there.
The version I suggest and that Dave implemented instead adds wrappers
to call bit_spin_lock/unlock with those conventions as helpers that
operate on the abstract type. This frees the caller from reinventing
just those same wrappers, as done in your demo tree for the inode and
dentry hashes.
Furthermore it allows the RT people to simply throw a mutex into the
head and everything keeps working without touching a single line of
code outside of hlist_bl.h.
To me that's very clearly preferable, and I'd really like to see
a clearly reasoned argument against it.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-19 16:50 ` Christoph Hellwig
@ 2010-10-20 3:11 ` Nick Piggin
2010-10-24 15:44 ` Thomas Gleixner
1 sibling, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-20 3:11 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Nick Piggin, Andi Kleen, Dave Chinner, linux-fsdevel,
linux-kernel, tglx
On Tue, Oct 19, 2010 at 12:50:44PM -0400, Christoph Hellwig wrote:
> On Tue, Oct 19, 2010 at 06:00:57PM +1100, Nick Piggin wrote:
> > But it is still "magic". Because you don't even know whether it
> > is a spin or sleeping lock, let alone whether it is irq or bh safe.
> > You get far more information seeing a bit_spin_lock(0, &hlist) call
> > than hlist_lock().
> >
> > Even if you do rename them to hlist_bit_spin_lock, etc. Then you need
> > to add variants for each type of locking a caller wants to do on it.
> > Ask Linus what he thinks about that.
>
> And why do we need all these versions? We never had irqsave or bhsave
> versions of bitlocks. The only things we might eventually want are
> trylock and is_locked operations once we grow users of it.
Or sleeping locks. And there is no reason not to have irqsave or
bhsave versions of bitlocks (which could enable interrupts etc on
a contended lock) but they just haven't been put in yet. If someone
needs them, and implements them in the bitlock code, everybody can
use them.
> To get back a bit to the point:
>
> - we have a new bl_hlist structure which combines a hash list and a
> lock embedded into the head
> - the reason why we do it is to be able to use a bitlock
To use bit 0 as a lock bit, yes. But if another use for the API were
required, like storing some other OOB state (not a lock), then some
assertions could come out of the hlist_bl code quite easily and it
could be extended to have an arbitrary bit there (or several arbitrary
bits if the pointer is aligned correctly).
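Purely as an illustration of that possibility (the bit number and helper
names are hypothetical), a second low bit could be exposed along these
lines, provided the nodes are suitably aligned:
#include <linux/list_bl.h>

#define LIST_BL_TAG_BIT	1	/* bit 0 is the lock, bit 1 spare OOB state */

static inline bool hlist_bl_tagged(struct hlist_bl_head *h)
{
	return (unsigned long)h->first & (1UL << LIST_BL_TAG_BIT);
}

static inline void hlist_bl_set_tag(struct hlist_bl_head *h)
{
	/* caller must hold the bucket lock so this RMW on ->first is safe */
	h->first = (struct hlist_bl_node *)
		((unsigned long)h->first | (1UL << LIST_BL_TAG_BIT));
}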
> Now your initial version exposed the ugly defaults of that to the user
> which is required to cast the hash head to an unsigned long and use
> bit_spin_lock on bit 0 of it. There's zero abstraction and a lot of
> internal gutting required there.
What do you mean zero abstraction? It changes all the hlist operations
to handle bit 0 as lock bit.
Locking and unlocking a bit, we already have abstractions for. They work
really well, and nobody has to check the implementation to know what
they do.
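For comparison, the open-coded style being defended here reads roughly as
follows at a call site (the function is made up for illustration; the
bucket type is the one used by the inode hash patches in this series):
#include <linux/list_bl.h>

/* illustrative only */
static void insert_into_bucket(struct inode_hash_bucket *b,
			       struct hlist_bl_node *n)
{
	/* the kind of lock and the bit used are visible right here */
	bit_spin_lock(0, (unsigned long *)&b->head.first);
	hlist_bl_add_head(n, &b->head);
	bit_spin_unlock(0, (unsigned long *)&b->head.first);
}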
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-19 16:50 ` Christoph Hellwig
2010-10-20 3:11 ` Nick Piggin
@ 2010-10-24 15:44 ` Thomas Gleixner
2010-10-24 21:17 ` Nick Piggin
1 sibling, 1 reply; 162+ messages in thread
From: Thomas Gleixner @ 2010-10-24 15:44 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Nick Piggin, Andi Kleen, Dave Chinner, linux-fsdevel,
linux-kernel
On Tue, 19 Oct 2010, Christoph Hellwig wrote:
> On Tue, Oct 19, 2010 at 06:00:57PM +1100, Nick Piggin wrote:
> > But it is still "magic". Because you don't even know whether it
> > is a spin or sleeping lock, let alone whether it is irq or bh safe.
> > You get far more information seeing a bit_spin_lock(0, &hlist) call
> > than hlist_lock().
Errm, when hlist_lock() has proper documentation then it should not be
rocket science to figure out what it does.
And if you use bit 0 of hlist then you better have helper functions to
access it anyway. We do that with other data types which (ab)use the
lower two bits of pointers.
> To get back a bit to the point:
>
> - we have a new bl_hlist structure which combines a hash list and a
> lock embedded into the head
> - the reason why we do it is to be able to use a bitlock
And if you design that structure clever, then simple dereferencing of
it (w/o casting magic) should make the compiler barf. So you are
forced to use the helper functions.
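One hypothetical way to get that property (illustrative only, not what the
posted patches do) is to store the tagged pointer as an integer, so a raw
dereference does not even compile and the helpers are the only way in:
#include <linux/bit_spinlock.h>

/* hypothetical layout, not the in-tree hlist_bl_head */
struct tagged_bucket {
	unsigned long first;	/* hlist_bl_node pointer, bit 0 is the lock */
};

static inline struct hlist_bl_node *tagged_bucket_first(struct tagged_bucket *b)
{
	return (struct hlist_bl_node *)(b->first & ~1UL);
}

static inline void tagged_bucket_lock(struct tagged_bucket *b)
{
	bit_spin_lock(0, &b->first);
}
Writing b->first->next is a compile error with this layout, which forces
every user through the accessors.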
> Furthermore it allows the RT people to simply throw a mutex into the
> head and everything keeps working without touching a single line of
> code outside of hlist_bl.h.
Yes, please use proper helper functions. Having to change code is a
horror for RT, when we can get away with a single change in a header
file.
Aside of RT there is another advantage of being able to change the
lock implementation at a single place: you can change it to a real
spinlock and have lockdep coverage of that code. I fundamentally hate
bit_spin_locks for sneaking around lockdep.
Thanks,
tglx
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-24 15:44 ` Thomas Gleixner
@ 2010-10-24 21:17 ` Nick Piggin
2010-10-25 4:41 ` Thomas Gleixner
0 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-24 21:17 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Christoph Hellwig, Nick Piggin, Andi Kleen, Dave Chinner,
linux-fsdevel, linux-kernel
On Sun, Oct 24, 2010 at 05:44:24PM +0200, Thomas Gleixner wrote:
> On Tue, 19 Oct 2010, Christoph Hellwig wrote:
>
> > On Tue, Oct 19, 2010 at 06:00:57PM +1100, Nick Piggin wrote:
> > > But it is still "magic". Because you don't even know whether it
> > > is a spin or sleeping lock, let alone whether it is irq or bh safe.
> > > You get far more information seeing a bit_spin_lock(0, &hlist) call
> > > than hlist_lock().
>
> Errm, when hlist_lock() has proper documentation then it should not be
> rocket science to figure out what it does.
Right, a look at the documentation and another layer of indirection
for a reader.
And it's not exactly "properly" documented. It doesn't say if it may
turn into a sleeping lock or is allowed to be used from irq or bh
context.
> And if you use bit 0 of hlist then you better have helper functions to
> access it anyway. We do that with other data types which (ab)use the
> lower two bits of pointers.
>
> > To get back a bit to the point:
> >
> > - we have a new bl_hlist structure which combines a hash list and a
> > lock embedded into the head
> > - the reason why we do it is to be able to use a bitlock
>
> And if you design that structure clever, then simple dereferencing of
> it (w/o casting magic) should make the compiler barf. So you are
> forced to use the helper functions.
>
> > Furthermore it allows the RT people to simply throw a mutex into the
> > head and everything keeps working without touching a single line of
> > code outside of hlist_bl.h.
>
> Yes, please use proper helper functions. Having to change code is a
> horror for RT, when we can get away with a single change in a header
> file.
>
> Aside of RT there is another advantage of being able to change the
> lock implementation at a single place: you can change it to a real
> spinlock and have lockdep coverage of that code. I fundamentally hate
> bit_spin_locks for sneaking around lockdep.
You do not want to add a bloated mutex to each inode hash bucket and
think you can just dust off your hands and walk away. You would
probably make a smaller auxiliary hash of locks, sanely sized, and
protect it with that.
So it would be wrong to just bloat hlist_bl by a factor of several times
(how big is a mutex in -rt?) without doing anything else.
Although a sane locking macro and structure like I had, would perfectly
allow you to switch locks in a single place just the same.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-24 21:17 ` Nick Piggin
@ 2010-10-25 4:41 ` Thomas Gleixner
2010-10-25 7:04 ` Thomas Gleixner
2010-10-26 0:06 ` Nick Piggin
0 siblings, 2 replies; 162+ messages in thread
From: Thomas Gleixner @ 2010-10-25 4:41 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Hellwig, Andi Kleen, Dave Chinner, linux-fsdevel,
linux-kernel
On Mon, 25 Oct 2010, Nick Piggin wrote:
> On Sun, Oct 24, 2010 at 05:44:24PM +0200, Thomas Gleixner wrote:
> > On Tue, 19 Oct 2010, Christoph Hellwig wrote:
> >
> > > On Tue, Oct 19, 2010 at 06:00:57PM +1100, Nick Piggin wrote:
> > > > But it is still "magic". Because you don't even know whether it
> > > > is a spin or sleeping lock, let alone whether it is irq or bh safe.
> > > > You get far more information seeing a bit_spin_lock(0, &hlist) call
> > > > than hlist_lock().
> >
> > Errm, when hlist_lock() has proper documentation then it should not be
> > rocket science to figure out what it does.
>
> Right, a look at the documentation and another layer of indirection
> for a reader.
>
> And it's not exactly "properly" documented. It doesn't say if it may
> turn into a sleeping lock or is allowed to be used from irq or bh
> context.
Oh well, that's nickpicking :)
> > And if you use bit 0 of hlist then you better have helper functions to
> > access it anyway. We do that with other data types which (ab)use the
> > lower two bits of pointers.
> >
> > > To get back a bit to the point:
> > >
> > > - we have a new bl_hlist structure which combines a hash list and a
> > > lock embedded into the head
> > > - the reason why we do it is to be able to use a bitlock
> >
> > And if you design that structure clever, then simple dereferencing of
> > it (w/o casting magic) should make the compiler barf. So you are
> > forced to use the helper functions.
> >
> > > Furthermore it allows the RT people to simply throw a mutex into the
> > > head and everything keeps working without touching a single line of
> > > code outside of hlist_bl.h.
> >
> > Yes, please use proper helper functions. Having to change code is a
> > horror for RT, when we can get away with a single change in a header
> > file.
> >
> > Aside of RT there is another advantage of being able to change the
> > lock implementation at a single place: you can change it to a real
> > spinlock and have lockdep coverage of that code. I fundamentally hate
> > bit_spin_locks for sneaking around lockdep.
>
> You do not want to add a bloated mutex to each inode hash bucket and
> think you can just dust off your hands and walk away. You would
> probably make a smaller auxiliary hash of locks, sanely sized, and
> protect it with that.
>
> So it would be wrong to just bloat hlist_bl by a factor of several times
> (how big is a mutex in -rt?) without doing anything else.
Let me worry about it.
> Although a sane locking macro and structure like I had, would perfectly
> allow you to switch locks in a single place just the same.
And a locking macro/structure is better in self documenting than a
helper function which was proposed by Christoph?
Thanks,
tglx
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-25 4:41 ` Thomas Gleixner
@ 2010-10-25 7:04 ` Thomas Gleixner
2010-10-26 0:12 ` Nick Piggin
2010-10-26 0:06 ` Nick Piggin
1 sibling, 1 reply; 162+ messages in thread
From: Thomas Gleixner @ 2010-10-25 7:04 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Hellwig, Andi Kleen, Dave Chinner, linux-fsdevel,
linux-kernel
On Mon, 25 Oct 2010, Thomas Gleixner wrote:
> On Mon, 25 Oct 2010, Nick Piggin wrote:
> > Although a sane locking macro and structure like I had, would perfectly
> > allow you to switch locks in a single place just the same.
>
> And a locking macro/structure is better in self documenting than a
> helper function which was proposed by Christoph?
Independently of what data structure you folks agree on, we really do
_NOT_ want to have open coded bit_spin_*lock() anywhere in the code.
As I said before, aside of RT it's a basic requirement to switch bit
spinlocks to real ones for lockdep debugging.
Thanks,
tglx
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-25 7:04 ` Thomas Gleixner
@ 2010-10-26 0:12 ` Nick Piggin
0 siblings, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-26 0:12 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Nick Piggin, Christoph Hellwig, Andi Kleen, Dave Chinner,
linux-fsdevel, linux-kernel
On Mon, Oct 25, 2010 at 1:04 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, 25 Oct 2010, Thomas Gleixner wrote:
>> On Mon, 25 Oct 2010, Nick Piggin wrote:
>> > Although a sane locking macro and structure like I had, would perfectly
>> > allow you to switch locks in a single place just the same.
>>
>> And a locking macro/structure is better in self documenting than a
>> helper function which was proposed by Christoph?
>
> Independently of what data structure you folks agree on, we really do
> _NOT_ want to have open coded bit_spin_*lock() anywhere in the code.
>
> As I said before, aside of RT it's a basic requirement to switch bit
> spinlocks to real ones for lockdep debugging.
Putting it in an hlist_bl locking function doesn't do much to help --
putting mutexes or spinlocks into hlist hashes is insane.
What might be good is to have a bit spinlock structure which is 0 size
in a normal config, but it can hold things like lockdep data. Someone
posted a patch maybe a year ago to do that, which I thought was good
but I don't know why it didn't go anywhere.
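A sketch of that idea (the names are hypothetical, and the lockdep hooks
shown are just one plausible way to wire it up) might be:
#include <linux/bit_spinlock.h>
#include <linux/lockdep.h>

struct bit_spinlock {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map dep_map;	/* only debug builds pay for this */
#endif
	/* zero size in a production config */
};

struct locked_bucket {
	struct hlist_bl_node *first;	/* bit 0 doubles as the lock bit */
	struct bit_spinlock lock;	/* free when debugging is off */
};

static inline void locked_bucket_lock(struct locked_bucket *b)
{
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	lock_map_acquire(&b->lock.dep_map);	/* let lockdep see the bit lock */
#endif
	bit_spin_lock(0, (unsigned long *)&b->first);
}

static inline void locked_bucket_unlock(struct locked_bucket *b)
{
	bit_spin_unlock(0, (unsigned long *)&b->first);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	lock_map_release(&b->lock.dep_map);
#endif
}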
It still doesn't solve your -rt problem really, because on a production
rt build like I say, you can't blindly just replace bit spinlocks with
mutexes. But it makes lockdep work and could take care of *some* bit
spinlocks for -rt.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 11/18] fs: Introduce per-bucket inode hash locks
2010-10-25 4:41 ` Thomas Gleixner
2010-10-25 7:04 ` Thomas Gleixner
@ 2010-10-26 0:06 ` Nick Piggin
1 sibling, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-26 0:06 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Nick Piggin, Christoph Hellwig, Andi Kleen, Dave Chinner,
linux-fsdevel, linux-kernel
On Sun, Oct 24, 2010 at 10:41 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, 25 Oct 2010, Nick Piggin wrote:
>> You do not want to add a bloated mutex to each inode hash bucket and
>> think you can just dust off your hands and walk away. You would
>> probably make a smaller auxiliary hash of locks, sanely sized, and
>> protect it with that.
>>
>> So it would be wrong to just bloat hlist_bl by a factor of several times
>> (how big is a mutex in -rt?) without doing anything else.
>
> Let me worry about it.
No, because you simply should almost never turn the hlist locking into
anything big and bloated, whether it is for -rt or anything else. It is
most likely going to be used as a per-bucket hash lock (or bit of
metadata), so anything larger than a bit (which is essentially free) is
way overkill.
You would instead have an auxiliary small hash of locks, not tens of
megs of mutexes.
>> Although a sane locking macro and structure like I had, would perfectly
>> allow you to switch locks in a single place just the same.
>
> And a locking macro/structure is better in self documenting than a
> helper function which was proposed by Christoph?
Yes, because you still have the problem that you need to go through and fix up
all call sites.
With my abstraction, there is a small inode function for locking an inode hash
bucket. You have to change 2 places (lock and unlock) to look up an aux hash
of locks and you're done.
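A sketch of that arrangement (sizes and names are illustrative): only the
helpers below would need to change for -rt or for a lockdep build, not
the call sites.
#include <linux/hash.h>
#include <linux/spinlock.h>

#define INODE_HASH_LOCK_BITS	8	/* 256 locks, picked for illustration */

/* each lock is spin_lock_init()ed at boot (not shown) */
static spinlock_t inode_hash_locks[1 << INODE_HASH_LOCK_BITS];

static inline spinlock_t *inode_hash_lock(struct inode_hash_bucket *b)
{
	return &inode_hash_locks[hash_ptr(b, INODE_HASH_LOCK_BITS)];
}

static inline void lock_bucket(struct inode_hash_bucket *b)
{
	spin_lock(inode_hash_lock(b));
}

static inline void unlock_bucket(struct inode_hash_bucket *b)
{
	spin_unlock(inode_hash_lock(b));
}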
^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH 12/18] fs: add a per-superblock lock for the inode list
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (10 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 11/18] fs: Introduce per-bucket inode hash locks Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:35 ` Christoph Hellwig
2010-10-08 5:21 ` [PATCH 13/18] fs: split locking of inode writeback and LRU lists Dave Chinner
` (7 subsequent siblings)
19 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Dave Chinner <dchinner@redhat.com>
To allow removal of the inode_lock, we first need to protect the
superblock inode list with its own lock instead of using the
inode_lock. Add a lock to the superblock to protect this list and
nest the new lock inside the inode_lock around the list operations
it needs to protect.
Based on a patch originally from Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/drop_caches.c | 4 ++++
fs/fs-writeback.c | 4 ++++
fs/inode.c | 22 +++++++++++++++++++---
fs/notify/inode_mark.c | 3 +++
fs/quota/dquot.c | 6 ++++++
fs/super.c | 1 +
include/linux/fs.h | 1 +
7 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index c4f3e06..c808ca8 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,18 +17,22 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
struct inode *inode, *toput_inode = NULL;
spin_lock(&inode_lock);
+ spin_lock(&sb->s_inodes_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
continue;
if (inode->i_mapping->nrpages == 0)
continue;
iref_locked(inode);
+ spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
iput(toput_inode);
toput_inode = inode;
spin_lock(&inode_lock);
+ spin_lock(&sb->s_inodes_lock);
}
+ spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
iput(toput_inode);
}
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d63ab47..29f8032 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1026,6 +1026,7 @@ static void wait_sb_inodes(struct super_block *sb)
WARN_ON(!rwsem_is_locked(&sb->s_umount));
spin_lock(&inode_lock);
+ spin_lock(&sb->s_inodes_lock);
/*
* Data integrity sync. Must wait for all pages under writeback,
@@ -1043,6 +1044,7 @@ static void wait_sb_inodes(struct super_block *sb)
if (mapping->nrpages == 0)
continue;
iref_locked(inode);
+ spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
/*
* We hold a reference to 'inode' so it couldn't have
@@ -1060,7 +1062,9 @@ static void wait_sb_inodes(struct super_block *sb)
cond_resched();
spin_lock(&inode_lock);
+ spin_lock(&sb->s_inodes_lock);
}
+ spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
iput(old_inode);
}
diff --git a/fs/inode.c b/fs/inode.c
index 3c07719..e6bb36d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -33,13 +33,18 @@
* i_ref
* inode_hash_bucket lock protects:
* inode hash table, i_hash
+ * sb inode lock protects:
+ * s_inodes, i_sb_list
*
* Lock orders
* inode_lock
* inode hash bucket lock
* inode->i_lock
+ *
+ * inode_lock
+ * sb inode lock
+ * inode->i_lock
*/
-
/*
* This is needed for the following functions:
* - inode_has_buffers
@@ -488,7 +493,9 @@ static void dispose_list(struct list_head *head)
spin_lock(&inode_lock);
__remove_inode_hash(inode);
+ spin_lock(&inode->i_sb->s_inodes_lock);
list_del_init(&inode->i_sb_list);
+ spin_unlock(&inode->i_sb->s_inodes_lock);
spin_unlock(&inode_lock);
wake_up_inode(inode);
@@ -499,7 +506,8 @@ static void dispose_list(struct list_head *head)
/*
* Invalidate all inodes for a device.
*/
-static int invalidate_list(struct list_head *head, struct list_head *dispose)
+static int invalidate_list(struct super_block *sb, struct list_head *head,
+ struct list_head *dispose)
{
struct list_head *next;
int busy = 0;
@@ -516,6 +524,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
* shrink_icache_memory() away.
*/
cond_resched_lock(&inode_lock);
+ cond_resched_lock(&sb->s_inodes_lock);
next = next->next;
if (tmp == head)
@@ -555,8 +564,10 @@ int invalidate_inodes(struct super_block *sb)
down_write(&iprune_sem);
spin_lock(&inode_lock);
+ spin_lock(&sb->s_inodes_lock);
fsnotify_unmount_inodes(&sb->s_inodes);
- busy = invalidate_list(&sb->s_inodes, &throw_away);
+ busy = invalidate_list(sb, &sb->s_inodes, &throw_away);
+ spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
dispose_list(&throw_away);
@@ -753,7 +764,9 @@ static inline void
__inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
struct inode *inode)
{
+ spin_lock(&sb->s_inodes_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
+ spin_unlock(&sb->s_inodes_lock);
if (b) {
spin_lock_bucket(b);
hlist_bl_add_head(&inode->i_hash, &b->head);
@@ -1397,7 +1410,10 @@ static void iput_final(struct inode *inode)
percpu_counter_dec(&nr_inodes_unused);
}
+ spin_lock(&sb->s_inodes_lock);
list_del_init(&inode->i_sb_list);
+ spin_unlock(&sb->s_inodes_lock);
+
spin_unlock(&inode_lock);
evict(inode);
remove_inode_hash(inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 2fe319b..3389ff0 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -242,6 +242,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
struct inode *need_iput_tmp;
+ struct super_block *sb = inode->i_sb;
/*
* We cannot iref() an inode in state I_FREEING,
@@ -288,6 +289,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
* will be added since the umount has begun. Finally,
* iprune_mutex keeps shrink_icache_memory() away.
*/
+ spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
if (need_iput_tmp)
@@ -301,5 +303,6 @@ void fsnotify_unmount_inodes(struct list_head *list)
iput(inode);
spin_lock(&inode_lock);
+ spin_lock(&sb->s_inodes_lock);
}
}
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 5199418..b7cbc41 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -897,6 +897,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
#endif
spin_lock(&inode_lock);
+ spin_lock(&sb->s_inodes_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
continue;
@@ -910,6 +911,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
continue;
iref_locked(inode);
+ spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
iput(old_inode);
@@ -921,7 +923,9 @@ static void add_dquot_ref(struct super_block *sb, int type)
* keep the reference and iput it later. */
old_inode = inode;
spin_lock(&inode_lock);
+ spin_lock(&sb->s_inodes_lock);
}
+ spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
iput(old_inode);
@@ -1004,6 +1008,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
int reserved = 0;
spin_lock(&inode_lock);
+ spin_lock(&sb->s_inodes_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
/*
* We have to scan also I_NEW inodes because they can already
@@ -1017,6 +1022,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
remove_inode_dquot_ref(inode, type, tofree_head);
}
}
+ spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
#ifdef CONFIG_QUOTA_DEBUG
if (reserved) {
diff --git a/fs/super.c b/fs/super.c
index 8819e3a..d826214 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -76,6 +76,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
INIT_LIST_HEAD(&s->s_dentry_lru);
init_rwsem(&s->s_umount);
mutex_init(&s->s_lock);
+ spin_lock_init(&(s->s_inodes_lock);
lockdep_set_class(&s->s_umount, &type->s_umount_key);
/*
* The locking rules for s_lock are up to the
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 34f983f..54c4e86 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1342,6 +1342,7 @@ struct super_block {
#endif
const struct xattr_handler **s_xattr;
+ spinlock_t s_inodes_lock; /* lock for s_inodes */
struct list_head s_inodes; /* all inodes */
struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
#ifdef CONFIG_SMP
--
1.7.1
^ permalink raw reply related [flat|nested] 162+ messages in thread
* [PATCH 13/18] fs: split locking of inode writeback and LRU lists
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (11 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 12/18] fs: add a per-superblock lock for the inode list Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:42 ` Christoph Hellwig
2010-10-08 5:21 ` [PATCH 14/18] fs: Protect inode->i_state with the inode->i_lock Dave Chinner
` (6 subsequent siblings)
19 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Dave Chinner <dchinner@redhat.com>
Now that the inode LRU and IO lists are split apart, we can separate
the locking for them. The IO lists are only ever accessed in the
context of writeback, so a per-BDI lock for those lists separates
them out nicely.
For the inode LRU, introduce a simple global lock to protect it.
While this could be made per-sb, it is unclear yet what the next
steps for optimising/parallelising reclaim of inodes will be. Rather
than optimise now, leave it as a global list and lock until further
analysis can be done.
Based on a patch originally from Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/fs-writeback.c | 48 +++++++++++++-------
fs/inode.c | 101 ++++++++++++++++++++++++++++++++++--------
fs/internal.h | 6 +++
fs/super.c | 2 +-
include/linux/backing-dev.h | 1 +
include/linux/writeback.h | 12 ++++-
mm/backing-dev.c | 21 +++++++++
7 files changed, 150 insertions(+), 41 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 29f8032..49d44cc 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -69,16 +69,6 @@ int writeback_in_progress(struct backing_dev_info *bdi)
return test_bit(BDI_writeback_running, &bdi->state);
}
-static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
-{
- struct super_block *sb = inode->i_sb;
-
- if (strcmp(sb->s_type->name, "bdev") == 0)
- return inode->i_mapping->a_bdi;
-
- return sb->s_bdi;
-}
-
static void bdi_queue_work(struct backing_dev_info *bdi,
struct wb_writeback_work *work)
{
@@ -169,6 +159,7 @@ static void redirty_tail(struct inode *inode)
{
struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
+ assert_spin_locked(&wb->b_lock);
if (!list_empty(&wb->b_dirty)) {
struct inode *tail;
@@ -186,6 +177,7 @@ static void requeue_io(struct inode *inode)
{
struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
+ assert_spin_locked(&wb->b_lock);
list_move(&inode->i_io, &wb->b_more_io);
}
@@ -268,6 +260,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
*/
static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
{
+ assert_spin_locked(&wb->b_lock);
list_splice_init(&wb->b_more_io, &wb->b_io);
move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
}
@@ -311,6 +304,7 @@ static void inode_wait_for_writeback(struct inode *inode)
static int
writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
{
+ struct backing_dev_info *bdi = inode_to_bdi(inode);
struct address_space *mapping = inode->i_mapping;
unsigned dirty;
int ret;
@@ -330,7 +324,9 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* completed a full scan of b_io.
*/
if (wbc->sync_mode != WB_SYNC_ALL) {
+ spin_lock(&bdi->wb.b_lock);
requeue_io(inode);
+ spin_unlock(&bdi->wb.b_lock);
return 0;
}
@@ -385,6 +381,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* sometimes bales out without doing anything.
*/
inode->i_state |= I_DIRTY_PAGES;
+ spin_lock(&bdi->wb.b_lock);
if (wbc->nr_to_write <= 0) {
/*
* slice used up: queue for next turn
@@ -400,6 +397,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
*/
redirty_tail(inode);
}
+ spin_unlock(&bdi->wb.b_lock);
} else if (inode->i_state & I_DIRTY) {
/*
* Filesystems can dirty the inode during writeback
@@ -407,14 +405,15 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* submission or metadata updates after data IO
* completion.
*/
+ spin_lock(&bdi->wb.b_lock);
redirty_tail(inode);
+ spin_unlock(&bdi->wb.b_lock);
} else {
/* The inode is clean */
+ spin_lock(&bdi->wb.b_lock);
list_del_init(&inode->i_io);
- if (list_empty(&inode->i_lru)) {
- list_add(&inode->i_lru, &inode_unused);
- percpu_counter_inc(&nr_inodes_unused);
- }
+ spin_unlock(&bdi->wb.b_lock);
+ inode_lru_list_add(inode);
}
}
inode_sync_complete(inode);
@@ -460,6 +459,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
struct writeback_control *wbc, bool only_this_sb)
{
+ assert_spin_locked(&wb->b_lock);
while (!list_empty(&wb->b_io)) {
long pages_skipped;
struct inode *inode = list_entry(wb->b_io.prev,
@@ -475,7 +475,6 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
redirty_tail(inode);
continue;
}
-
/*
* The inode belongs to a different superblock.
* Bounce back to the caller to unpin this and
@@ -484,7 +483,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
return 0;
}
- if (inode->i_state & (I_NEW | I_WILL_FREE)) {
+ if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
requeue_io(inode);
continue;
}
@@ -495,8 +494,11 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
if (inode_dirtied_after(inode, wbc->wb_start))
return 1;
- BUG_ON(inode->i_state & I_FREEING);
+ spin_lock(&inode->i_lock);
iref_locked(inode);
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&wb->b_lock);
+
pages_skipped = wbc->pages_skipped;
writeback_single_inode(inode, wbc);
if (wbc->pages_skipped != pages_skipped) {
@@ -504,12 +506,15 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
* writeback is not making progress due to locked
* buffers. Skip this inode for now.
*/
+ spin_lock(&wb->b_lock);
redirty_tail(inode);
+ spin_unlock(&wb->b_lock);
}
spin_unlock(&inode_lock);
iput(inode);
cond_resched();
spin_lock(&inode_lock);
+ spin_lock(&wb->b_lock);
if (wbc->nr_to_write <= 0) {
wbc->more_io = 1;
return 1;
@@ -529,6 +534,8 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
if (!wbc->wb_start)
wbc->wb_start = jiffies; /* livelock avoidance */
spin_lock(&inode_lock);
+ spin_lock(&wb->b_lock);
+
if (!wbc->for_kupdate || list_empty(&wb->b_io))
queue_io(wb, wbc->older_than_this);
@@ -547,6 +554,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
if (ret)
break;
}
+ spin_unlock(&wb->b_lock);
spin_unlock(&inode_lock);
/* Leave any unwritten inodes on b_io */
}
@@ -557,9 +565,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
WARN_ON(!rwsem_is_locked(&sb->s_umount));
spin_lock(&inode_lock);
+ spin_lock(&wb->b_lock);
if (!wbc->for_kupdate || list_empty(&wb->b_io))
queue_io(wb, wbc->older_than_this);
writeback_sb_inodes(sb, wb, wbc, true);
+ spin_unlock(&wb->b_lock);
spin_unlock(&inode_lock);
}
@@ -672,8 +682,10 @@ static long wb_writeback(struct bdi_writeback *wb,
*/
spin_lock(&inode_lock);
if (!list_empty(&wb->b_more_io)) {
+ spin_lock(&wb->b_lock);
inode = list_entry(wb->b_more_io.prev,
struct inode, i_io);
+ spin_unlock(&wb->b_lock);
trace_wbc_writeback_wait(&wbc, wb->bdi);
inode_wait_for_writeback(inode);
}
@@ -986,8 +998,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
wakeup_bdi = true;
}
+ spin_lock(&bdi->wb.b_lock);
inode->dirtied_when = jiffies;
list_move(&inode->i_io, &bdi->wb.b_dirty);
+ spin_unlock(&bdi->wb.b_lock);
}
}
out:
diff --git a/fs/inode.c b/fs/inode.c
index e6bb36d..4ad7900 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -35,6 +35,10 @@
* inode hash table, i_hash
* sb inode lock protects:
* s_inodes, i_sb_list
+ * bdi writeback lock protects:
+ * b_io, b_more_io, b_dirty, i_io
+ * inode_lru_lock protects:
+ * inode_lru, i_lru
*
* Lock orders
* inode_lock
@@ -43,7 +47,9 @@
*
* inode_lock
* sb inode lock
- * inode->i_lock
+ * inode_lru_lock
+ * wb->b_lock
+ * inode->i_lock
*/
/*
* This is needed for the following functions:
@@ -92,7 +98,8 @@ static unsigned int i_hash_shift __read_mostly;
* allowing for low-overhead inode sync() operations.
*/
-LIST_HEAD(inode_unused);
+static LIST_HEAD(inode_lru);
+static DEFINE_SPINLOCK(inode_lru_lock);
struct inode_hash_bucket {
struct hlist_bl_head head;
@@ -383,6 +390,30 @@ int iref_read(struct inode *inode)
}
EXPORT_SYMBOL_GPL(iref_read);
+/*
+ * check against I_FREEING as inode writeback completion could race with
+ * setting the I_FREEING and removing the inode from the LRU.
+ */
+void inode_lru_list_add(struct inode *inode)
+{
+ spin_lock(&inode_lru_lock);
+ if (list_empty(&inode->i_lru) && !(inode->i_state & I_FREEING)) {
+ list_add(&inode->i_lru, &inode_lru);
+ percpu_counter_inc(&nr_inodes_unused);
+ }
+ spin_unlock(&inode_lru_lock);
+}
+
+void inode_lru_list_del(struct inode *inode)
+{
+ spin_lock(&inode_lru_lock);
+ if (!list_empty(&inode->i_lru)) {
+ list_del_init(&inode->i_lru);
+ percpu_counter_dec(&nr_inodes_unused);
+ }
+ spin_unlock(&inode_lru_lock);
+}
+
static unsigned long hash(struct super_block *sb, unsigned long hashval)
{
unsigned long tmp;
@@ -535,11 +566,26 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
invalidate_inode_buffers(inode);
spin_lock(&inode->i_lock);
if (!inode->i_ref) {
+ struct backing_dev_info *bdi = inode_to_bdi(inode);
+
spin_unlock(&inode->i_lock);
- list_move(&inode->i_lru, dispose);
- list_del_init(&inode->i_io);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
+
+
+ /*
+ * move the inode off the IO lists and LRU once
+ * I_FREEING is set so that it won't get moved back on
+ * there if it is dirty.
+ */
+ spin_lock(&bdi->wb.b_lock);
+ list_del_init(&inode->i_io);
+ spin_unlock(&bdi->wb.b_lock);
+
+ spin_lock(&inode_lru_lock);
+ list_move(&inode->i_lru, dispose);
+ spin_unlock(&inode_lru_lock);
+
percpu_counter_dec(&nr_inodes_unused);
continue;
}
@@ -596,7 +642,7 @@ static int can_unuse(struct inode *inode)
*
* Any inodes which are pinned purely because of attached pagecache have their
* pagecache removed. We expect the final iput() on that inode to add it to
- * the front of the inode_unused list. So look for it there and if the
+ * the front of the inode_lru list. So look for it there and if the
* inode is still freeable, proceed. The right inode is found 99.9% of the
* time in testing on a 4-way.
*
@@ -611,13 +657,15 @@ static void prune_icache(int nr_to_scan)
down_read(&iprune_sem);
spin_lock(&inode_lock);
+ spin_lock(&inode_lru_lock);
for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
struct inode *inode;
+ struct backing_dev_info *bdi;
- if (list_empty(&inode_unused))
+ if (list_empty(&inode_lru))
break;
- inode = list_entry(inode_unused.prev, struct inode, i_lru);
+ inode = list_entry(inode_lru.prev, struct inode, i_lru);
spin_lock(&inode->i_lock);
if (inode->i_ref || (inode->i_state & ~I_REFERENCED)) {
@@ -628,19 +676,21 @@ static void prune_icache(int nr_to_scan)
}
if (inode->i_state & I_REFERENCED) {
spin_unlock(&inode->i_lock);
- list_move(&inode->i_lru, &inode_unused);
+ list_move(&inode->i_lru, &inode_lru);
inode->i_state &= ~I_REFERENCED;
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
iref_locked(inode);
spin_unlock(&inode->i_lock);
+ spin_unlock(&inode_lru_lock);
spin_unlock(&inode_lock);
if (remove_inode_buffers(inode))
reap += invalidate_mapping_pages(&inode->i_data,
0, -1);
iput(inode);
spin_lock(&inode_lock);
+ spin_lock(&inode_lru_lock);
/*
* if we can't reclaim this inod immediately, give it
@@ -648,21 +698,32 @@ static void prune_icache(int nr_to_scan)
* on it.
*/
if (!can_unuse(inode)) {
- list_move(&inode->i_lru, &inode_unused);
+ list_move(&inode->i_lru, &inode_lru);
continue;
}
} else
spin_unlock(&inode->i_lock);
- list_move(&inode->i_lru, &freeable);
- list_del_init(&inode->i_io);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
+
+ /*
+ * move the inode off the IO lists and LRU once
+ * I_FREEING is set so that it won't get moved back on
+ * there if it is dirty.
+ */
+ bdi = inode_to_bdi(inode);
+ spin_lock(&bdi->wb.b_lock);
+ list_del_init(&inode->i_io);
+ spin_unlock(&bdi->wb.b_lock);
+
+ list_move(&inode->i_lru, &freeable);
percpu_counter_dec(&nr_inodes_unused);
}
if (current_is_kswapd())
__count_vm_events(KSWAPD_INODESTEAL, reap);
else
__count_vm_events(PGINODESTEAL, reap);
+ spin_unlock(&inode_lru_lock);
spin_unlock(&inode_lock);
dispose_list(&freeable);
@@ -1369,6 +1430,7 @@ static void iput_final(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
const struct super_operations *op = inode->i_sb->s_op;
+ struct backing_dev_info *bdi = inode_to_bdi(inode);
int drop;
if (op && op->drop_inode)
@@ -1381,8 +1443,7 @@ static void iput_final(struct inode *inode)
inode->i_state |= I_REFERENCED;
if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
list_empty(&inode->i_lru)) {
- list_add(&inode->i_lru, &inode_unused);
- percpu_counter_inc(&nr_inodes_unused);
+ inode_lru_list_add(inode);
}
spin_unlock(&inode_lock);
return;
@@ -1396,19 +1457,19 @@ static void iput_final(struct inode *inode)
inode->i_state &= ~I_WILL_FREE;
__remove_inode_hash(inode);
}
- list_del_init(&inode->i_io);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
/*
- * We avoid moving dirty inodes back onto the LRU now because I_FREEING
- * is set and hence writeback_single_inode() won't move the inode
+ * move the inode off the IO lists and LRU once I_FREEING is set so
+ * that it won't get moved back on there if it is dirty.
* around.
*/
- if (!list_empty(&inode->i_lru)) {
- list_del_init(&inode->i_lru);
- percpu_counter_dec(&nr_inodes_unused);
- }
+ spin_lock(&bdi->wb.b_lock);
+ list_del_init(&inode->i_io);
+ spin_unlock(&bdi->wb.b_lock);
+
+ inode_lru_list_del(inode);
spin_lock(&sb->s_inodes_lock);
list_del_init(&inode->i_sb_list);
diff --git a/fs/internal.h b/fs/internal.h
index a6910e9..ece3565 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -101,3 +101,9 @@ extern void put_super(struct super_block *sb);
struct nameidata;
extern struct file *nameidata_to_filp(struct nameidata *);
extern void release_open_intent(struct nameidata *);
+
+/*
+ * inode.c
+ */
+extern void inode_lru_list_add(struct inode *inode);
+extern void inode_lru_list_del(struct inode *inode);
diff --git a/fs/super.c b/fs/super.c
index d826214..c5332e5 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -76,7 +76,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
INIT_LIST_HEAD(&s->s_dentry_lru);
init_rwsem(&s->s_umount);
mutex_init(&s->s_lock);
- spin_lock_init(&(s->s_inodes_lock);
+ spin_lock_init(&s->s_inodes_lock);
lockdep_set_class(&s->s_umount, &type->s_umount_key);
/*
* The locking rules for s_lock are up to the
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 31e1346..5106fc4 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -57,6 +57,7 @@ struct bdi_writeback {
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
+ spinlock_t b_lock; /* writeback lists lock */
};
struct backing_dev_info {
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index f7ed2a0..b182ccc 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -10,10 +10,7 @@
struct backing_dev_info;
extern spinlock_t inode_lock;
-extern struct list_head inode_unused;
-extern struct percpu_counter nr_inodes;
-extern struct percpu_counter nr_inodes_unused;
/*
* fs/fs-writeback.c
@@ -82,6 +79,15 @@ static inline void inode_sync_wait(struct inode *inode)
TASK_UNINTERRUPTIBLE);
}
+static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
+{
+ struct super_block *sb = inode->i_sb;
+
+ if (strcmp(sb->s_type->name, "bdev") == 0)
+ return inode->i_mapping->a_bdi;
+
+ return sb->s_bdi;
+}
/*
* mm/page-writeback.c
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index a124991..74e8269 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -74,12 +74,14 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
nr_wb = nr_dirty = nr_io = nr_more_io = 0;
spin_lock(&inode_lock);
+ spin_lock(&wb->b_lock);
list_for_each_entry(inode, &wb->b_dirty, i_io)
nr_dirty++;
list_for_each_entry(inode, &wb->b_io, i_io)
nr_io++;
list_for_each_entry(inode, &wb->b_more_io, i_io)
nr_more_io++;
+ spin_unlock(&wb->b_lock);
spin_unlock(&inode_lock);
global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -634,6 +636,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
INIT_LIST_HEAD(&wb->b_dirty);
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
+ spin_lock_init(&wb->b_lock);
setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
}
@@ -671,6 +674,18 @@ err:
}
EXPORT_SYMBOL(bdi_init);
+static void bdi_lock_two(struct backing_dev_info *bdi1,
+ struct backing_dev_info *bdi2)
+{
+ if (bdi1 < bdi2) {
+ spin_lock(&bdi1->wb.b_lock);
+ spin_lock_nested(&bdi2->wb.b_lock, 1);
+ } else {
+ spin_lock(&bdi2->wb.b_lock);
+ spin_lock_nested(&bdi1->wb.b_lock, 1);
+ }
+}
+
void mapping_set_bdi(struct address_space *mapping,
struct backing_dev_info *bdi)
{
@@ -681,6 +696,7 @@ void mapping_set_bdi(struct address_space *mapping,
return;
spin_lock(&inode_lock);
+ bdi_lock_two(bdi, old);
if (!list_empty(&inode->i_io)) {
struct inode *i;
@@ -709,6 +725,8 @@ void mapping_set_bdi(struct address_space *mapping,
}
found:
mapping->a_bdi = bdi;
+ spin_unlock(&bdi->wb.b_lock);
+ spin_unlock(&old->wb.b_lock);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(mapping_set_bdi);
@@ -726,6 +744,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
struct inode *i, *tmp;
spin_lock(&inode_lock);
+ bdi_lock_two(bdi, &default_backing_dev_info);
list_for_each_entry_safe(i, tmp, &bdi->wb.b_dirty, i_io) {
list_del(&i->i_io);
list_add_tail(&i->i_io, &dst->b_dirty);
@@ -741,6 +760,8 @@ void bdi_destroy(struct backing_dev_info *bdi)
list_add_tail(&i->i_io, &dst->b_more_io);
i->i_mapping->a_bdi = bdi;
}
+ spin_unlock(&bdi->wb.b_lock);
+ spin_unlock(&dst->b_lock);
spin_unlock(&inode_lock);
}
--
1.7.1
^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 13/18] fs: split locking of inode writeback and LRU lists
2010-10-08 5:21 ` [PATCH 13/18] fs: split locking of inode writeback and LRU lists Dave Chinner
@ 2010-10-08 7:42 ` Christoph Hellwig
2010-10-08 8:00 ` Dave Chinner
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 7:42 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:27PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Now that the inode LRU and IO lists are split apart, we can separate
> the locking for them. The IO lists are only ever accessed in the
> context of writeback, so a per-BDI lock for those lists separates
> them out nicely.
I think this description needs some updates. It seems like it's from
Nick's original patch that splits the lock, but at this point we still
have inode_lock anyway.
>
> -static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
> -{
> - struct super_block *sb = inode->i_sb;
> -
> - if (strcmp(sb->s_type->name, "bdev") == 0)
> - return inode->i_mapping->a_bdi;
> -
> - return sb->s_bdi;
> -}
Please don't extend the scope of this one. Just add a new inode_wb_del
or similar helper to remove an inode from the writeback list.
> struct inode *inode = list_entry(wb->b_io.prev,
> @@ -475,7 +475,6 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
> redirty_tail(inode);
> continue;
> }
> -
> /*
> * The inode belongs to a different superblock.
> * Bounce back to the caller to unpin this and
spurious whitespace change.
> @@ -484,7 +483,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
> return 0;
> }
>
> - if (inode->i_state & (I_NEW | I_WILL_FREE)) {
> + if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
> requeue_io(inode);
> continue;
> }
What does this have to do with the rest of the patch?
> @@ -495,8 +494,11 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
> if (inode_dirtied_after(inode, wbc->wb_start))
> return 1;
>
> - BUG_ON(inode->i_state & I_FREEING);
> + spin_lock(&inode->i_lock);
> iref_locked(inode);
> + spin_unlock(&inode->i_lock);
Shouldn't this become a plain iref now?
> +/*
> + * check against I_FREEING as inode writeback completion could race with
> + * setting the I_FREEING and removing the inode from the LRU.
> + */
> +void inode_lru_list_add(struct inode *inode)
> +{
> + spin_lock(&inode_lru_lock);
> + if (list_empty(&inode->i_lru) && !(inode->i_state & I_FREEING)) {
> + list_add(&inode->i_lru, &inode_lru);
> + percpu_counter_inc(&nr_inodes_unused);
> + }
> + spin_unlock(&inode_lru_lock);
> +}
Ah, here you introduce the lru list helpers I suggested earlier. Moving
them earlier in the series probably is a good idea to avoid exporting
nr_inodes_unused, even if the locking for the helpers will change in
this patch.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 13/18] fs: split locking of inode writeback and LRU lists
2010-10-08 7:42 ` Christoph Hellwig
@ 2010-10-08 8:00 ` Dave Chinner
2010-10-08 8:18 ` Christoph Hellwig
0 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 8:00 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 03:42:43AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 08, 2010 at 04:21:27PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Now that the inode LRU and IO lists are split apart, we can separate
> > the locking for them. The IO lists are only ever accessed in the
> > context of writeback, so a per-BDI lock for those lists separates
> > them out nicely.
>
> I think this description needs some updates. It seems like it's from
> Nick's original patch that splits the lock, but at this point we still
> have inode_lock anyway.
Ok, will do.
> > -static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
> > -{
> > - struct super_block *sb = inode->i_sb;
> > -
> > - if (strcmp(sb->s_type->name, "bdev") == 0)
> > - return inode->i_mapping->a_bdi;
> > -
> > - return sb->s_bdi;
> > -}
>
> Please don't extend the scope of this one. Just add a new inode_wb_del
> or similar helper to remove an inode from the writeback list.
OK, will do.
> > @@ -484,7 +483,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
> > return 0;
> > }
> >
> > - if (inode->i_state & (I_NEW | I_WILL_FREE)) {
> > + if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
> > requeue_io(inode);
> > continue;
> > }
>
> What does this have to do with the rest of the patch?
That's because there's now a window between setting I_FREEING and taking
the inode off the writeback list which means that we can see inodes
in that state here. Generally it means that the code setting
I_FREEING is spinning waiting for the wb->b_lock that this thread
currently holds so it can be removed from the list.. Hence the requeue
to move the inode out of the way and keep processing inodes for
writeback.
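For what it's worth, the hunk with the comment being asked for might read
roughly like this (the wording is illustrative only):
		if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
			/*
			 * I_FREEING can be seen here because there is now a
			 * window between setting it and removing the inode
			 * from the writeback list; the freeing path may be
			 * spinning on wb->b_lock, which we hold.  Requeue
			 * the inode and keep scanning.
			 */
			requeue_io(inode);
			continue;
		}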
> > @@ -495,8 +494,11 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
> > if (inode_dirtied_after(inode, wbc->wb_start))
> > return 1;
> >
> > - BUG_ON(inode->i_state & I_FREEING);
> > + spin_lock(&inode->i_lock);
> > iref_locked(inode);
> > + spin_unlock(&inode->i_lock);
>
> Shouldn't this become a plain iref now?
No, because we're holding the inode_lock here. And later, the lock
gets moved to protect the i_state fields as well, so it would end up
this way, anyway.
> > +/*
> > + * check against I_FREEING as inode writeback completion could race with
> > + * setting the I_FREEING and removing the inode from the LRU.
> > + */
> > +void inode_lru_list_add(struct inode *inode)
> > +{
> > + spin_lock(&inode_lru_lock);
> > + if (list_empty(&inode->i_lru) && !(inode->i_state & I_FREEING)) {
> > + list_add(&inode->i_lru, &inode_lru);
> > + percpu_counter_inc(&nr_inodes_unused);
> > + }
> > + spin_unlock(&inode_lru_lock);
> > +}
>
> Ah, here you introduce the lru list helpers I suggested earlier. Moving
> them earlier in the series probably is a good idea to avoid exporting
> nr_inodes_unused, even if the locking for the helpers will change in
> this patch.
I'll see what I can do.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 13/18] fs: split locking of inode writeback and LRU lists
2010-10-08 8:00 ` Dave Chinner
@ 2010-10-08 8:18 ` Christoph Hellwig
2010-10-16 7:57 ` Nick Piggin
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 8:18 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 07:00:18PM +1100, Dave Chinner wrote:
> > >
> > > - if (inode->i_state & (I_NEW | I_WILL_FREE)) {
> > > + if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
> > > requeue_io(inode);
> > > continue;
> > > }
> >
> > What does this have to do with the rest of the patch?
>
> That's because there's now a window between setting I_FREEING and taking
> the inode off the writeback list which means that we can see inodes
> in that state here. Generally it means that the code setting
> I_FREEING is spinning waiting for the wb->b_lock that this thread
> currently holds so it can be removed from the list.. Hence the requeue
> to move the inode out of the way and keep processing inodes for
> writeback.
That needs some documentation both in the changelog and in the code
I think.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 13/18] fs: split locking of inode writeback and LRU lists
2010-10-08 8:18 ` Christoph Hellwig
@ 2010-10-16 7:57 ` Nick Piggin
2010-10-16 16:20 ` Christoph Hellwig
0 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-16 7:57 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:18:16AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 08, 2010 at 07:00:18PM +1100, Dave Chinner wrote:
> > > >
> > > > - if (inode->i_state & (I_NEW | I_WILL_FREE)) {
> > > > + if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
> > > > requeue_io(inode);
> > > > continue;
> > > > }
> > >
> > > What does this have to do with the rest of the patch?
> >
> > That's because there's now a window between setting I_FREEING and taking
> > the inode off the writeback list which means that we can see inodes
> > in that state here. Generally it means that the code setting
> > I_FREEING is spinning waiting for the wb->b_lock that this thread
> > currently holds so it can be removed from the list.. Hence the requeue
> > to move the inode out of the way and keep processing inodes for
> > writeback.
>
> That needs some documentation both in the changelog and in the code
> I think.
This is another instance where the irregular i_lock locking is making
these little subtleties to the locking. I think that is actually much
worse for maintenance/complexity than a few trylocks which can be
mostly removed with rcu anyway (which are obvious because of the well
documented lock order).
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 13/18] fs: split locking of inode writeback and LRU lists
2010-10-16 7:57 ` Nick Piggin
@ 2010-10-16 16:20 ` Christoph Hellwig
2010-10-16 17:19 ` Nick Piggin
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-16 16:20 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 06:57:13PM +1100, Nick Piggin wrote:
> > That needs some documentation both in the changelog and in the code
> > I think.
>
> This is another instance where the irregular i_lock locking is making
> these little subtleties to the locking. I think that is actually much
> worse for maintenance/complexity than a few trylocks which can be
> mostly removed with rcu anyway (which are obvious because of the well
> documented lock order).
Care to explain why? The I_FREEING and co checks are how we do things
all over the icache for a long time. They are perfectly easy to
understand concept. What I asked Dave about is documenting why he
changed things here.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 13/18] fs: split locking of inode writeback and LRU lists
2010-10-16 16:20 ` Christoph Hellwig
@ 2010-10-16 17:19 ` Nick Piggin
2010-10-17 1:00 ` Dave Chinner
0 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-16 17:19 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 12:20:21PM -0400, Christoph Hellwig wrote:
> On Sat, Oct 16, 2010 at 06:57:13PM +1100, Nick Piggin wrote:
> > > That needs some documentation both in the changelog and in the code
> > > I think.
> >
> > This is another instance where the irregular i_lock locking is making
> > these little subtleties to the locking. I think that is actually much
> > worse for maintenance/complexity than a few trylocks which can be
> > mostly removed with rcu anyway (which are obvious because of the well
> > documented lock order).
>
> Care to explain why?
OK.
> The I_FREEING and co checks are how we do things
> all over the icache for a long time.
That's missing my point. My point is that the semantics of icache
concurrency here are changed from the old inode_lock model. With
my design, holding i_lock on an inode is equivalent (stronger,
actually) to holding inode_lock which is an important part of
making small correct steps.
> They are perfectly easy to
> understand concept.
The concept is easy to understand, but catching all the changes
are not necessarily so easy. You said it yourself
"What does this have to do with the rest of the patch?"
So, I repeat, I never objected to narrowing i_lock widths, but I
would like to do it as very small patches that would move i_lock
out of some data structure manipulation and add the required additions
like this hunk. I want to see what we gain by making the locking model
less regular.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 13/18] fs: split locking of inode writeback and LRU lists
2010-10-16 17:19 ` Nick Piggin
@ 2010-10-17 1:00 ` Dave Chinner
2010-10-17 2:20 ` Nick Piggin
0 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-17 1:00 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
On Sun, Oct 17, 2010 at 04:19:48AM +1100, Nick Piggin wrote:
> On Sat, Oct 16, 2010 at 12:20:21PM -0400, Christoph Hellwig wrote:
> > On Sat, Oct 16, 2010 at 06:57:13PM +1100, Nick Piggin wrote:
> > > > That needs some documentation both in the changelog and in the code
> > > > I think.
> > >
> > > This is another instance where the irregular i_lock locking is making
> > > these little subtleties to the locking. I think that is actually much
> > > worse for maintenance/complexity than a few trylocks which can be
> > > mostly removed with rcu anyway (which are obvious because of the well
> > > documented lock order).
> >
> > Care to explain why?
>
> OK.
>
>
> > The I_FREEING and co checks are how we do things
> > all over the icache for a long time.
>
> That's missing my point. My point is that the semantics of icache
> concurrency here are changed from the old inode_lock model. With
> my design, holding i_lock on an inode is equivalent (stronger,
> actually) to holding inode_lock which is an important part of
> making small correct steps.
That doesn't necessarily make it better, Nick.
The existing of I_FREEING checks in the writeback code is an
exception rather than the rule - it was the only list traversal
where an inode in the I_FREEING state was unacceptable and it
special cased that with an undocumented BUG_ON(inode->i_state &
I_FREEING). I only found this and understood it as a result of
tripping over it while testing this patch.
The change I made to allow handling the I_FREEING case in this code
in exactly the same way as every other inode list traversal is a
significant improvement. It also greatly simplified the i_state
locking patches in this area. And by the end of the series, the
behaviour of setting I_FREEING before disposing of the inode is well
documented, consistently implemented, and protected by a commented
BUG_ON to ensure the rule is always followed in future.
IMO, removing an undocumented special case landmine is a much
better solution than continuing to hide it and hoping no-one treads on
it....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 13/18] fs: split locking of inode writeback and LRU lists
2010-10-17 1:00 ` Dave Chinner
@ 2010-10-17 2:20 ` Nick Piggin
0 siblings, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-17 2:20 UTC (permalink / raw)
To: Dave Chinner; +Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel
On Sun, Oct 17, 2010 at 12:00:23PM +1100, Dave Chinner wrote:
> On Sun, Oct 17, 2010 at 04:19:48AM +1100, Nick Piggin wrote:
> > On Sat, Oct 16, 2010 at 12:20:21PM -0400, Christoph Hellwig wrote:
> > > On Sat, Oct 16, 2010 at 06:57:13PM +1100, Nick Piggin wrote:
> > > > > That needs some documentation both in the changelog and in the code
> > > > > I think.
> > > >
> > > > This is another instance where the irregular i_lock locking is making
> > > > these little subtleties to the locking. I think that is actually much
> > > > worse for maintenance/complexity than a few trylocks which can be
> > > > mostly removed with rcu anyway (which are obvious because of the well
> > > > documented lock order).
> > >
> > > Care to explain why?
> >
> > OK.
> >
> >
> > > The I_FREEING and co checks are how we do things
> > > all over the icache for a long time.
> >
> > That's missing my point. My point is that the semantics of icache
> > concurrency here are changed from the old inode_lock model. With
> > my design, holding i_lock on an inode is equivalent (stronger,
> > actually) to holding inode_lock which is an important part of
> > making small correct steps.
>
> That doesn't necessarily make it better, Nick.
I argue that it does. Today you protect the icache state of an
inode with inode_lock; in my model you can do it with i_lock.
A single, particular inode is most often what you are interested in,
so once you hold its i_lock you know it is safe to lift
inode_lock away.
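Roughly, as a sketch of the idea (not code from either series;
update_icache_state() is just a stand-in for whatever per-inode state
manipulation is being done):

	/* before: the global lock stabilises any inode's icache state */
	spin_lock(&inode_lock);
	if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
		update_icache_state(inode);
	spin_unlock(&inode_lock);

	/* after: the per-inode lock alone is enough for that inode */
	spin_lock(&inode->i_lock);
	if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
		update_icache_state(inode);
	spin_unlock(&inode->i_lock);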
> The existing of I_FREEING checks in the writeback code is an
> exception rather than the rule - it was the only list traversal
> where an inode in the I_FREEING state was unacceptable and it
> special cased that with an undocumented BUG_ON(inode->i_state &
> I_FREEING). i only found this and understood it as a result of
> tripping over it while testing this patch.
>
> The change I made to allow handling the I_FREEING case in this code
> in exactly the same way as every other inode list traversal is a
> significant improvement. it also greatly simplified the i_state
> locking patches in this area. Any by the end of the series, the
> behaviour of setting I_FREEING before disposing of the inode is well
> documented, consistently implemented, and protected by a commented
> BUG_ON to ensure the rule is always followed in future.
>
> IMO, removing an undocumented special case landmine is a much
> better solution than continuing to hide it and hoping no-one treads on
> it....
You blew up said land mine because your locking model did not provide
the same coverage as the inode_lock that was lifted away.
I'm not saying that particular "landmine" could not go away, I'm saying
that it was set off by the irregular locking that your patches
introduce. That is what I don't like.
Now I repeat again. There are some possible upsides of reducing i_lock
coverage (like improving lock ordering). I _never_ said I was totally
against them. I said I want to wait in the series until the inode_lock
is lifted, and they can be proposed as individual, bisectable small
changes rather than being lumped in with lock splitting.
In fact, before I realised that I could make use of RCU-inodes due to
the rcu-walk work going on, I was experimenting exactly with reducing
i_lock widths like you have been doing to reduce the ordering
complexity. You can do it in a few patches of a couple of dozen lines
each. But I decided the complexity and potential for undetected problems
from making the locking less regular are just not worth it when you can
use RCU for mostly the same job.
I would be willing to be proven wrong on that, but I would like to be
proven wrong with a 40 line patch that does nothing but move i_lock
out of a particular data structure manipulation, and then adds in these
required hunks like the above. That way it is far easier to bisect,
review, and audit the rest of the code for other exposed landmines.
^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH 14/18] fs: Protect inode->i_state with the inode->i_lock
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (12 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 13/18] fs: split locking of inode writeback and LRU lists Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:49 ` Christoph Hellwig
2010-10-08 5:21 ` [PATCH 15/18] fs: introduce a per-cpu last_ino allocator Dave Chinner
` (5 subsequent siblings)
19 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Dave Chinner <dchinner@redhat.com>
We currently protect the per-inode state flags with the inode_lock.
Using a global lock to protect per-object state is overkill when we
could use a per-inode lock to protect the state. Use the
inode->i_lock for this, and wrap all the state changes and checks
with the inode->i_lock.
Based on work originally written by Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/drop_caches.c | 9 +++--
fs/fs-writeback.c | 49 ++++++++++++++++++++++------
fs/inode.c | 83 ++++++++++++++++++++++++++++++++---------------
fs/nilfs2/gcdat.c | 1 +
fs/notify/inode_mark.c | 10 ++++--
fs/quota/dquot.c | 12 ++++---
6 files changed, 115 insertions(+), 49 deletions(-)
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index c808ca8..00180dc 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -19,11 +19,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
spin_lock(&inode_lock);
spin_lock(&sb->s_inodes_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
- continue;
- if (inode->i_mapping->nrpages == 0)
+ spin_lock(&inode->i_lock);
+ if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+ (inode->i_mapping->nrpages == 0)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
iref_locked(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 49d44cc..404d449 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -281,10 +281,12 @@ static void inode_wait_for_writeback(struct inode *inode)
wait_queue_head_t *wqh;
wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
- while (inode->i_state & I_SYNC) {
+ while (inode->i_state & I_SYNC) {
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
}
}
@@ -309,7 +311,8 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
unsigned dirty;
int ret;
- if (!iref_read(inode))
+ spin_lock(&inode->i_lock);
+ if (!inode->i_ref)
WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
else
WARN_ON(inode->i_state & I_WILL_FREE);
@@ -324,6 +327,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* completed a full scan of b_io.
*/
if (wbc->sync_mode != WB_SYNC_ALL) {
+ spin_unlock(&inode->i_lock);
spin_lock(&bdi->wb.b_lock);
requeue_io(inode);
spin_unlock(&bdi->wb.b_lock);
@@ -341,6 +345,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
/* Set I_SYNC, reset I_DIRTY_PAGES */
inode->i_state |= I_SYNC;
inode->i_state &= ~I_DIRTY_PAGES;
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
ret = do_writepages(mapping, wbc);
@@ -362,8 +367,10 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* write_inode()
*/
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
dirty = inode->i_state & I_DIRTY;
inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -373,6 +380,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
}
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
inode->i_state &= ~I_SYNC;
if (!(inode->i_state & I_FREEING)) {
if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
@@ -381,6 +389,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* sometimes bales out without doing anything.
*/
inode->i_state |= I_DIRTY_PAGES;
+ spin_unlock(&inode->i_lock);
spin_lock(&bdi->wb.b_lock);
if (wbc->nr_to_write <= 0) {
/*
@@ -405,16 +414,21 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* submission or metadata updates after data IO
* completion.
*/
+ spin_unlock(&inode->i_lock);
spin_lock(&bdi->wb.b_lock);
redirty_tail(inode);
spin_unlock(&bdi->wb.b_lock);
} else {
/* The inode is clean */
+ spin_unlock(&inode->i_lock);
spin_lock(&bdi->wb.b_lock);
list_del_init(&inode->i_io);
spin_unlock(&bdi->wb.b_lock);
inode_lru_list_add(inode);
}
+ } else {
+ /* freer will clean up */
+ spin_unlock(&inode->i_lock);
}
inode_sync_complete(inode);
return ret;
@@ -483,7 +497,9 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
return 0;
}
+ spin_lock(&inode->i_lock);
if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
+ spin_unlock(&inode->i_lock);
requeue_io(inode);
continue;
}
@@ -491,10 +507,11 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
* Was this inode dirtied after sync_sb_inodes was called?
* This keeps sync from extra jobs and livelock.
*/
- if (inode_dirtied_after(inode, wbc->wb_start))
+ if (inode_dirtied_after(inode, wbc->wb_start)) {
+ spin_unlock(&inode->i_lock);
return 1;
+ }
- spin_lock(&inode->i_lock);
iref_locked(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&wb->b_lock);
@@ -687,7 +704,9 @@ static long wb_writeback(struct bdi_writeback *wb,
struct inode, i_io);
spin_unlock(&wb->b_lock);
trace_wbc_writeback_wait(&wbc, wb->bdi);
+ spin_lock(&inode->i_lock);
inode_wait_for_writeback(inode);
+ spin_unlock(&inode->i_lock);
}
spin_unlock(&inode_lock);
}
@@ -953,6 +972,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
block_dump___mark_inode_dirty(inode);
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;
@@ -964,7 +984,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* superblock list, based upon its state.
*/
if (inode->i_state & I_SYNC)
- goto out;
+ goto out_unlock;
/*
* Only add valid (hashed) inodes to the superblock's
@@ -972,10 +992,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
if (!S_ISBLK(inode->i_mode)) {
if (hlist_bl_unhashed(&inode->i_hash))
- goto out;
+ goto out_unlock;
}
if (inode->i_state & I_FREEING)
- goto out;
+ goto out_unlock;
/*
* If the inode was already on b_dirty/b_io/b_more_io, don't
@@ -998,12 +1018,16 @@ void __mark_inode_dirty(struct inode *inode, int flags)
wakeup_bdi = true;
}
- spin_lock(&bdi->wb.b_lock);
inode->dirtied_when = jiffies;
+ spin_unlock(&inode->i_lock);
+ spin_lock(&bdi->wb.b_lock);
list_move(&inode->i_io, &bdi->wb.b_dirty);
spin_unlock(&bdi->wb.b_lock);
+ goto out;
}
}
+out_unlock:
+ spin_unlock(&inode->i_lock);
out:
spin_unlock(&inode_lock);
@@ -1052,12 +1076,15 @@ static void wait_sb_inodes(struct super_block *sb)
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
struct address_space *mapping;
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
- continue;
+ spin_lock(&inode->i_lock);
mapping = inode->i_mapping;
- if (mapping->nrpages == 0)
+ if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+ (mapping->nrpages == 0)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
iref_locked(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
/*
diff --git a/fs/inode.c b/fs/inode.c
index 4ad7900..d3bd08a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -30,7 +30,7 @@
* Locking rules.
*
* inode->i_lock protects:
- * i_ref
+ * i_ref i_state
* inode_hash_bucket lock protects:
* inode hash table, i_hash
* sb inode lock protects:
@@ -182,7 +182,7 @@ int proc_nr_inodes(ctl_table *table, int write,
static void wake_up_inode(struct inode *inode)
{
/*
- * Prevent speculative execution through spin_unlock(&inode_lock);
+ * Prevent speculative execution through spin_unlock(&inode->i_lock);
*/
smp_mb();
wake_up_bit(&inode->i_state, __I_NEW);
@@ -361,6 +361,8 @@ static void init_once(void *foo)
*/
void iref_locked(struct inode *inode)
{
+ assert_spin_locked(&inode->i_lock);
+
inode->i_ref++;
}
EXPORT_SYMBOL_GPL(iref_locked);
@@ -484,7 +486,9 @@ void end_writeback(struct inode *inode)
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(inode->i_state & I_CLEAR);
inode_sync_wait(inode);
+ spin_lock(&inode->i_lock);
inode->i_state = I_FREEING | I_CLEAR;
+ spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(end_writeback);
@@ -561,17 +565,18 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
if (tmp == head)
break;
inode = list_entry(tmp, struct inode, i_sb_list);
- if (inode->i_state & I_NEW)
+ spin_lock(&inode->i_lock);
+ if (inode->i_state & I_NEW) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
invalidate_inode_buffers(inode);
- spin_lock(&inode->i_lock);
if (!inode->i_ref) {
struct backing_dev_info *bdi = inode_to_bdi(inode);
- spin_unlock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
-
+ spin_unlock(&inode->i_lock);
/*
* move the inode off the IO lists and LRU once
@@ -625,11 +630,12 @@ EXPORT_SYMBOL(invalidate_inodes);
static int can_unuse(struct inode *inode)
{
+ assert_spin_locked(&inode->i_lock);
if (inode->i_state)
return 0;
if (inode_has_buffers(inode))
return 0;
- if (iref_read(inode))
+ if (inode->i_ref)
return 0;
if (inode->i_data.nrpages)
return 0;
@@ -675,9 +681,9 @@ static void prune_icache(int nr_to_scan)
continue;
}
if (inode->i_state & I_REFERENCED) {
+ inode->i_state &= ~I_REFERENCED;
spin_unlock(&inode->i_lock);
list_move(&inode->i_lru, &inode_lru);
- inode->i_state &= ~I_REFERENCED;
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
@@ -691,6 +697,7 @@ static void prune_icache(int nr_to_scan)
iput(inode);
spin_lock(&inode_lock);
spin_lock(&inode_lru_lock);
+ spin_lock(&inode->i_lock);
/*
* if we can't reclaim this inode immediately, give it
@@ -699,12 +706,14 @@ static void prune_icache(int nr_to_scan)
*/
if (!can_unuse(inode)) {
list_move(&inode->i_lru, &inode_lru);
+ spin_unlock(&inode->i_lock);
continue;
}
- } else
- spin_unlock(&inode->i_lock);
+ }
+
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
+ spin_unlock(&inode->i_lock);
/*
* move the inode off the IO lists and LRU once
@@ -761,7 +770,7 @@ static struct shrinker icache_shrinker = {
static void __wait_on_freeing_inode(struct inode *inode);
/*
- * Called with the inode lock held.
+ * Returns with inode->i_lock held.
* NOTE: we are not increasing the inode-refcount, you must call iref_locked()
* by hand after calling find_inode now! This simplifies iunique and won't
* add any additional branch in the common code.
@@ -779,8 +788,11 @@ repeat:
hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
if (inode->i_sb != sb)
continue;
- if (!test(inode, data))
+ spin_lock(&inode->i_lock);
+ if (!test(inode, data)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
spin_unlock_bucket(b);
__wait_on_freeing_inode(inode);
@@ -810,6 +822,7 @@ repeat:
continue;
if (inode->i_sb != sb)
continue;
+ spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
spin_unlock_bucket(b);
__wait_on_freeing_inode(inode);
@@ -884,9 +897,9 @@ struct inode *new_inode(struct super_block *sb)
inode = alloc_inode(sb);
if (inode) {
spin_lock(&inode_lock);
- __inode_add_to_lists(sb, NULL, inode);
inode->i_ino = ++last_ino;
inode->i_state = 0;
+ __inode_add_to_lists(sb, NULL, inode);
spin_unlock(&inode_lock);
}
return inode;
@@ -953,8 +966,8 @@ static struct inode *get_new_inode(struct super_block *sb,
if (set(inode, data))
goto set_failed;
- __inode_add_to_lists(sb, b, inode);
inode->i_state = I_NEW;
+ __inode_add_to_lists(sb, b, inode);
spin_unlock(&inode_lock);
/* Return the locked inode with I_NEW set, the
@@ -968,7 +981,6 @@ static struct inode *get_new_inode(struct super_block *sb,
* us. Use the old inode instead of the one we just
* allocated.
*/
- spin_lock(&old->i_lock);
iref_locked(old);
spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
@@ -1017,7 +1029,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
* us. Use the old inode instead of the one we just
* allocated.
*/
- spin_lock(&old->i_lock);
iref_locked(old);
spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
@@ -1071,17 +1082,19 @@ EXPORT_SYMBOL(iunique);
struct inode *igrab(struct inode *inode)
{
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
- spin_lock(&inode->i_lock);
iref_locked(inode);
spin_unlock(&inode->i_lock);
- } else
+ } else {
+ spin_unlock(&inode->i_lock);
/*
* Handle the case where s_op->clear_inode is not been
* called yet, and somebody is calling igrab
* while the inode is getting freed.
*/
inode = NULL;
+ }
spin_unlock(&inode_lock);
return inode;
}
@@ -1116,7 +1129,6 @@ static struct inode *ifind(struct super_block *sb,
spin_lock(&inode_lock);
inode = find_inode(sb, b, test, data);
if (inode) {
- spin_lock(&inode->i_lock);
iref_locked(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
@@ -1152,7 +1164,6 @@ static struct inode *ifind_fast(struct super_block *sb,
spin_lock(&inode_lock);
inode = find_inode_fast(sb, b, ino);
if (inode) {
- spin_lock(&inode->i_lock);
iref_locked(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
@@ -1318,6 +1329,10 @@ int insert_inode_locked(struct inode *inode)
ino_t ino = inode->i_ino;
struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
+ /*
+ * Nobody else can see the new inode yet, so it is safe to set flags
+ * without locking here.
+ */
inode->i_state |= I_NEW;
while (1) {
struct hlist_bl_node *node;
@@ -1329,8 +1344,11 @@ int insert_inode_locked(struct inode *inode)
continue;
if (old->i_sb != sb)
continue;
- if (old->i_state & (I_FREEING|I_WILL_FREE))
+ spin_lock(&old->i_lock);
+ if (old->i_state & (I_FREEING|I_WILL_FREE)) {
+ spin_unlock(&old->i_lock);
continue;
+ }
break;
}
if (likely(!node)) {
@@ -1339,7 +1357,6 @@ int insert_inode_locked(struct inode *inode)
spin_unlock(&inode_lock);
return 0;
}
- spin_lock(&old->i_lock);
iref_locked(old);
spin_unlock(&old->i_lock);
spin_unlock_bucket(b);
@@ -1373,8 +1390,11 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
continue;
if (!test(old, data))
continue;
- if (old->i_state & (I_FREEING|I_WILL_FREE))
+ spin_lock(&old->i_lock);
+ if (old->i_state & (I_FREEING|I_WILL_FREE)) {
+ spin_unlock(&old->i_lock);
continue;
+ }
break;
}
if (likely(!node)) {
@@ -1383,7 +1403,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
spin_unlock(&inode_lock);
return 0;
}
- spin_lock(&old->i_lock);
iref_locked(old);
spin_unlock(&old->i_lock);
spin_unlock_bucket(b);
@@ -1433,6 +1452,8 @@ static void iput_final(struct inode *inode)
struct backing_dev_info *bdi = inode_to_bdi(inode);
int drop;
+ assert_spin_locked(&inode->i_lock);
+
if (op && op->drop_inode)
drop = op->drop_inode(inode);
else
@@ -1443,22 +1464,28 @@ static void iput_final(struct inode *inode)
inode->i_state |= I_REFERENCED;
if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
list_empty(&inode->i_lru)) {
+ spin_unlock(&inode->i_lock);
inode_lru_list_add(inode);
+ return;
}
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
return;
}
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_WILL_FREE;
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
write_inode_now(inode, 1);
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
__remove_inode_hash(inode);
}
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
+ spin_unlock(&inode->i_lock);
/*
* move the inode off the IO lists and LRU once I_FREEING is set so
@@ -1495,13 +1522,12 @@ static void iput_final(struct inode *inode)
void iput(struct inode *inode)
{
if (inode) {
- BUG_ON(inode->i_state & I_CLEAR);
-
spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
+ BUG_ON(inode->i_state & I_CLEAR);
+
inode->i_ref--;
if (inode->i_ref == 0) {
- spin_unlock(&inode->i_lock);
iput_final(inode);
return;
}
@@ -1687,6 +1713,8 @@ EXPORT_SYMBOL(inode_wait);
* wake_up_inode() after removing from the hash list will DTRT.
*
* This is called with inode_lock held.
+ *
+ * Called with i_lock held and returns with it dropped.
*/
static void __wait_on_freeing_inode(struct inode *inode)
{
@@ -1694,6 +1722,7 @@ static void __wait_on_freeing_inode(struct inode *inode)
DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
wq = bit_waitqueue(&inode->i_state, __I_NEW);
prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
schedule();
finish_wait(wq, &wait.wait);
diff --git a/fs/nilfs2/gcdat.c b/fs/nilfs2/gcdat.c
index 84a45d1..c51f0e8 100644
--- a/fs/nilfs2/gcdat.c
+++ b/fs/nilfs2/gcdat.c
@@ -27,6 +27,7 @@
#include "page.h"
#include "mdt.h"
+/* XXX: what protects i_state? */
int nilfs_init_gcdat_inode(struct the_nilfs *nilfs)
{
struct inode *dat = nilfs->ns_dat, *gcdat = nilfs->ns_gc_dat;
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 3389ff0..8a05213 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -249,8 +249,11 @@ void fsnotify_unmount_inodes(struct list_head *list)
* I_WILL_FREE, or I_NEW which is fine because by that point
* the inode cannot have any associated watches.
*/
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+ spin_lock(&inode->i_lock);
+ if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
/*
* If the inode is not referenced, the inode cannot have any
@@ -258,9 +261,10 @@ void fsnotify_unmount_inodes(struct list_head *list)
* actually evict all unreferenced inodes from icache which is
* unnecessarily violent and may in fact be illegal to do.
*/
- spin_lock(&inode->i_lock);
- if (!inode->i_ref)
+ if (!inode->i_ref) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
need_iput_tmp = need_iput;
need_iput = NULL;
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index b7cbc41..c7b5fc6 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -899,18 +899,20 @@ static void add_dquot_ref(struct super_block *sb, int type)
spin_lock(&inode_lock);
spin_lock(&sb->s_inodes_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+ spin_lock(&inode->i_lock);
+ if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+ !atomic_read(&inode->i_writecount) ||
+ !dqinit_needed(inode, type)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
#ifdef CONFIG_QUOTA_DEBUG
if (unlikely(inode_get_rsv_space(inode) > 0))
reserved = 1;
#endif
- if (!atomic_read(&inode->i_writecount))
- continue;
- if (!dqinit_needed(inode, type))
- continue;
iref_locked(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
--
1.7.1
^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 14/18] fs: Protect inode->i_state with the inode->i_lock
2010-10-08 5:21 ` [PATCH 14/18] fs: Protect inode->i_state with the inode->i_lock Dave Chinner
@ 2010-10-08 7:49 ` Christoph Hellwig
2010-10-08 8:04 ` Dave Chinner
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 7:49 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:28PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> We currently protect the per-inode state flags with the inode_lock.
> Using a global lock to protect per-object state is overkill when we
> coul duse a per-inode lock to protect the state. Use the
> inode->i_lock for this, and wrap all the state changes and checks
> with the inode->i_lock.
>
> Based on work originally written by Nick Piggin.
> @@ -884,9 +897,9 @@ struct inode *new_inode(struct super_block *sb)
> inode = alloc_inode(sb);
> if (inode) {
> spin_lock(&inode_lock);
> - __inode_add_to_lists(sb, NULL, inode);
> inode->i_ino = ++last_ino;
> inode->i_state = 0;
> + __inode_add_to_lists(sb, NULL, inode);
> spin_unlock(&inode_lock);
> }
> return inode;
What's the point in doing this move?
> @@ -953,8 +966,8 @@ static struct inode *get_new_inode(struct super_block *sb,
> if (set(inode, data))
> goto set_failed;
>
> - __inode_add_to_lists(sb, b, inode);
> inode->i_state = I_NEW;
> + __inode_add_to_lists(sb, b, inode);
Same here.
Otherwise it looks good. But all this moving around of i_lock really
hurts my brain. I guess I'll need to review the placement on a tree
with the fully applied series again.
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 14/18] fs: Protect inode->i_state with the inode->i_lock
2010-10-08 7:49 ` Christoph Hellwig
@ 2010-10-08 8:04 ` Dave Chinner
2010-10-08 8:18 ` Christoph Hellwig
` (2 more replies)
0 siblings, 3 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 8:04 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 03:49:32AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 08, 2010 at 04:21:28PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > We currently protect the per-inode state flags with the inode_lock.
> > Using a global lock to protect per-object state is overkill when we
> > coul duse a per-inode lock to protect the state. Use the
> > inode->i_lock for this, and wrap all the state changes and checks
> > with the inode->i_lock.
> >
> > Based on work originally written by Nick Piggin.
>
> > @@ -884,9 +897,9 @@ struct inode *new_inode(struct super_block *sb)
> > inode = alloc_inode(sb);
> > if (inode) {
> > spin_lock(&inode_lock);
> > - __inode_add_to_lists(sb, NULL, inode);
> > inode->i_ino = ++last_ino;
> > inode->i_state = 0;
> > + __inode_add_to_lists(sb, NULL, inode);
> > spin_unlock(&inode_lock);
> > }
> > return inode;
>
> What's the point in doing this move?
hmmmm, let me think on that....
>
> > @@ -953,8 +966,8 @@ static struct inode *get_new_inode(struct super_block *sb,
> > if (set(inode, data))
> > goto set_failed;
> >
> > - __inode_add_to_lists(sb, b, inode);
> > inode->i_state = I_NEW;
> > + __inode_add_to_lists(sb, b, inode);
>
> Same here.
Ah, done thinking now! I did it so that the i_state field was set
before the inode was added to various lists and potentially made
accessible to other threads. I should probably add a comment to that
effect, right?
> Otherwise it looks good. But all this moving around of i_lock really
> hurts my brain. I guess I'll need to review the placement on a tree
> with the fully applied series again.
Probably best - I didn't get it right the first time, either, when
doing it patch by patch. I had to take that step back to analyse
where I'd screwed it up....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 14/18] fs: Protect inode->i_state with the inode->i_lock
2010-10-08 8:04 ` Dave Chinner
@ 2010-10-08 8:18 ` Christoph Hellwig
2010-10-16 7:57 ` Nick Piggin
2010-10-09 8:05 ` Christoph Hellwig
2010-10-09 14:52 ` Matthew Wilcox
2 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 8:18 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
>
> Ah, done thinking now! I was so the i_state field had been set
> before the inode was added to various lists and potentially
> accessable to other threads. I should probably add a comment to that
> effect, right?
Yes, please.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 14/18] fs: Protect inode->i_state with the inode->i_lock
2010-10-08 8:18 ` Christoph Hellwig
@ 2010-10-16 7:57 ` Nick Piggin
2010-10-16 16:19 ` Christoph Hellwig
0 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-16 7:57 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:18:43AM -0400, Christoph Hellwig wrote:
> >
> > Ah, done thinking now! I was so the i_state field had been set
> > before the inode was added to various lists and potentially
> > accessable to other threads. I should probably add a comment to that
> > effect, right?
>
> Yes, please.
This is due to i_lock not covering all the icache state of the inode,
so you have to make these synchronisation changes like this.
I much prefer such proposals to go at the end of my series, where I
will probably nack them (and use rcu instead if the remaining trylocks
are such a big issue).
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 14/18] fs: Protect inode->i_state with the inode->i_lock
2010-10-16 7:57 ` Nick Piggin
@ 2010-10-16 16:19 ` Christoph Hellwig
0 siblings, 0 replies; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-16 16:19 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 06:57:09PM +1100, Nick Piggin wrote:
> > > Ah, done thinking now! I was so the i_state field had been set
> > > before the inode was added to various lists and potentially
> > > accessable to other threads. I should probably add a comment to that
> > > effect, right?
> >
> > Yes, please.
>
> This is due to i_lock not covering all the icache state of the inode,
> so you have to make these synchronisation changes like this.
>
> I much prefer such proposals to go at the end of my series, where I
> will probably nack them (and use rcu instead if the remaining trylocks
> are such a big issue).
To get back to the context - what it changes is setting up i_state =
I_NEW before adding the inode to the sb-list and the hash. Making
sure objects are fully set up before adding to a list is always a good
idea, and really has nothing to do with RCU.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 14/18] fs: Protect inode->i_state with the inode->i_lock
2010-10-08 8:04 ` Dave Chinner
2010-10-08 8:18 ` Christoph Hellwig
@ 2010-10-09 8:05 ` Christoph Hellwig
2010-10-09 14:52 ` Matthew Wilcox
2 siblings, 0 replies; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-09 8:05 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 07:04:28PM +1100, Dave Chinner wrote:
> > > inode->i_ino = ++last_ino;
> > > inode->i_state = 0;
> > > + __inode_add_to_lists(sb, NULL, inode);
> > > spin_unlock(&inode_lock);
> > > }
> > > return inode;
> >
> > What's the point in doing this move?
>
> hmmmm, let me think on that....
>
> >
> > > @@ -953,8 +966,8 @@ static struct inode *get_new_inode(struct super_block *sb,
> > > if (set(inode, data))
> > > goto set_failed;
> > >
> > > - __inode_add_to_lists(sb, b, inode);
> > > inode->i_state = I_NEW;
> > > + __inode_add_to_lists(sb, b, inode);
> >
> > Same here.
>
> Ah, done thinking now! I was so the i_state field had been set
> before the inode was added to various lists and potentially
> accessable to other threads. I should probably add a comment to that
> effect, right?
In addition to the comment, get_new_inode_fast also needs the same
treatment. I also wonder if we need to set I_NEW in new_inode and
then later call unlock_new_inode on it. It's not on the hash at that
point, but it is on the per-sb inode list which we use for a few things.
With current callers it seems safe, but the whole thing also is rather
fragile. Better left for another patch, though.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 14/18] fs: Protect inode->i_state with the inode->i_lock
2010-10-08 8:04 ` Dave Chinner
2010-10-08 8:18 ` Christoph Hellwig
2010-10-09 8:05 ` Christoph Hellwig
@ 2010-10-09 14:52 ` Matthew Wilcox
2010-10-10 2:01 ` Dave Chinner
2 siblings, 1 reply; 162+ messages in thread
From: Matthew Wilcox @ 2010-10-09 14:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 07:04:28PM +1100, Dave Chinner wrote:
> > > @@ -884,9 +897,9 @@ struct inode *new_inode(struct super_block *sb)
> > > inode = alloc_inode(sb);
> > > if (inode) {
> > > spin_lock(&inode_lock);
> > > - __inode_add_to_lists(sb, NULL, inode);
> > > inode->i_ino = ++last_ino;
> > > inode->i_state = 0;
> > > + __inode_add_to_lists(sb, NULL, inode);
> > > spin_unlock(&inode_lock);
> > > }
> > > return inode;
> >
> > What's the point in doing this move?
>
> hmmmm, let me think on that....
>
> >
> > > @@ -953,8 +966,8 @@ static struct inode *get_new_inode(struct super_block *sb,
> > > if (set(inode, data))
> > > goto set_failed;
> > >
> > > - __inode_add_to_lists(sb, b, inode);
> > > inode->i_state = I_NEW;
> > > + __inode_add_to_lists(sb, b, inode);
> >
> > Same here.
>
> Ah, done thinking now! I was so the i_state field had been set
> before the inode was added to various lists and potentially
> accessable to other threads. I should probably add a comment to that
> effect, right?
If that can happen, don't we need a wmb() between the assignment to
i_state and the list_add too? If so, that's a good comment :-)
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 14/18] fs: Protect inode->i_state with the inode->i_lock
2010-10-09 14:52 ` Matthew Wilcox
@ 2010-10-10 2:01 ` Dave Chinner
0 siblings, 0 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-10 2:01 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
On Sat, Oct 09, 2010 at 08:52:27AM -0600, Matthew Wilcox wrote:
> On Fri, Oct 08, 2010 at 07:04:28PM +1100, Dave Chinner wrote:
> > > > @@ -884,9 +897,9 @@ struct inode *new_inode(struct super_block *sb)
> > > > inode = alloc_inode(sb);
> > > > if (inode) {
> > > > spin_lock(&inode_lock);
> > > > - __inode_add_to_lists(sb, NULL, inode);
> > > > inode->i_ino = ++last_ino;
> > > > inode->i_state = 0;
> > > > + __inode_add_to_lists(sb, NULL, inode);
> > > > spin_unlock(&inode_lock);
> > > > }
> > > > return inode;
> > >
> > > What's the point in doing this move?
> >
> > hmmmm, let me think on that....
> >
> > >
> > > > @@ -953,8 +966,8 @@ static struct inode *get_new_inode(struct super_block *sb,
> > > > if (set(inode, data))
> > > > goto set_failed;
> > > >
> > > > - __inode_add_to_lists(sb, b, inode);
> > > > inode->i_state = I_NEW;
> > > > + __inode_add_to_lists(sb, b, inode);
> > >
> > > Same here.
> >
> > Ah, done thinking now! I was so the i_state field had been set
> > before the inode was added to various lists and potentially
> > accessable to other threads. I should probably add a comment to that
> > effect, right?
>
> If that can happen, don't we need a wmb() between the assignment to
> i_state and the list_add too? If so, that's a good comment :-)
No, because the locking on the lists will provide the memory
barrier.
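Sketched out (not from the patch itself, and simplified to a single list and
lock), the reasoning is that the i_state store completes before the list lock
is released, and any walker takes the same lock, so the unlock/lock pair
already gives the required ordering:

	inode->i_state = I_NEW;				/* plain store */
	spin_lock(&sb->s_inodes_lock);
	list_add(&inode->i_sb_list, &sb->s_inodes);	/* publish */
	spin_unlock(&sb->s_inodes_lock);	/* release: orders both stores */

	/* elsewhere, a walker: */
	spin_lock(&sb->s_inodes_lock);		/* acquire pairs with that unlock */
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		/* any inode found here has its i_state store visible */
	}
	spin_unlock(&sb->s_inodes_lock);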
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (13 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 14/18] fs: Protect inode->i_state with the inode->i_lock Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:53 ` Christoph Hellwig
` (2 more replies)
2010-10-08 5:21 ` [PATCH 16/18] fs: Make iunique independent of inode_lock Dave Chinner
` (4 subsequent siblings)
19 siblings, 3 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Eric Dumazet <eric.dumazet@gmail.com>
new_inode() dirties a contended cache line to get increasing
inode numbers. This limits performance on workloads that cause
significant parallel inode allocation.
Solve this problem by using a per_cpu variable fed by the shared
last_ino in batches of 1024 allocations. This reduces contention on
the shared last_ino, and gives the same spread of ino numbers as before
(i.e. the same wraparound after 2^32 allocations).
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 45 ++++++++++++++++++++++++++++++++++++++-------
1 files changed, 38 insertions(+), 7 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index d3bd08a..13e1325 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -870,6 +870,43 @@ void inode_add_to_lists(struct super_block *sb, struct inode *inode)
}
EXPORT_SYMBOL_GPL(inode_add_to_lists);
+/*
+ * Each cpu owns a range of LAST_INO_BATCH numbers.
+ * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
+ * to renew the exhausted range.
+ *
+ * This does not significantly increase overflow rate because every CPU can
+ * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
+ * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
+ * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
+ * overflow rate by 2x, which does not seem too significant.
+ *
+ * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
+ * error if st_ino won't fit in target struct field. Use 32bit counter
+ * here to attempt to avoid that.
+ */
+#define LAST_INO_BATCH 1024
+static DEFINE_PER_CPU(unsigned int, last_ino);
+
+static unsigned int last_ino_get(void)
+{
+ unsigned int *p = &get_cpu_var(last_ino);
+ unsigned int res = *p;
+
+#ifdef CONFIG_SMP
+ if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
+ static atomic_t shared_last_ino;
+ int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
+
+ res = next - LAST_INO_BATCH;
+ }
+#endif
+
+ *p = ++res;
+ put_cpu_var(last_ino);
+ return res;
+}
+
/**
* new_inode - obtain an inode
* @sb: superblock
@@ -884,12 +921,6 @@ EXPORT_SYMBOL_GPL(inode_add_to_lists);
*/
struct inode *new_inode(struct super_block *sb)
{
- /*
- * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
- * error if st_ino won't fit in target struct field. Use 32bit counter
- * here to attempt to avoid that.
- */
- static unsigned int last_ino;
struct inode *inode;
spin_lock_prefetch(&inode_lock);
@@ -897,7 +928,7 @@ struct inode *new_inode(struct super_block *sb)
inode = alloc_inode(sb);
if (inode) {
spin_lock(&inode_lock);
- inode->i_ino = ++last_ino;
+ inode->i_ino = last_ino_get();
inode->i_state = 0;
__inode_add_to_lists(sb, NULL, inode);
spin_unlock(&inode_lock);
--
1.7.1
^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 5:21 ` [PATCH 15/18] fs: introduce a per-cpu last_ino allocator Dave Chinner
@ 2010-10-08 7:53 ` Christoph Hellwig
2010-10-08 8:05 ` Dave Chinner
2010-10-08 8:22 ` Andi Kleen
2010-10-08 9:56 ` Al Viro
2 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 7:53 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
> +static unsigned int last_ino_get(void)
Shouldn't this be get_next_ino?
Otherwise looks okay for now.
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 5:21 ` [PATCH 15/18] fs: introduce a per-cpu last_ino allocator Dave Chinner
2010-10-08 7:53 ` Christoph Hellwig
@ 2010-10-08 8:22 ` Andi Kleen
2010-10-08 8:44 ` Christoph Hellwig
2010-10-08 9:58 ` Al Viro
2010-10-08 9:56 ` Al Viro
2 siblings, 2 replies; 162+ messages in thread
From: Andi Kleen @ 2010-10-08 8:22 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
Dave Chinner <david@fromorbit.com> writes:
> From: Eric Dumazet <eric.dumazet@gmail.com>
>
> new_inode() dirties a contended cache line to get increasing
> inode numbers. This limits performance on workloads that cause
> significant parallel inode allocation.
>
> Solve this problem by using a per_cpu variable fed by the shared
> last_ino in batches of 1024 allocations. This reduces contention on
> the shared last_ino, and give same spreading ino numbers than before
> (i.e. same wraparound after 2^32 allocations).
This doesn't help for Unix disk file systems, so not fully sure why you
need it for XFS.
But looks reasonable, although it would be better to simply fix
sockets/pipes/etc. to not allocate inode numbers.
Acked-by: Andi Kleen <ak@linux.intel.com>
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 8:22 ` Andi Kleen
@ 2010-10-08 8:44 ` Christoph Hellwig
2010-10-08 9:58 ` Al Viro
1 sibling, 0 replies; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 8:44 UTC (permalink / raw)
To: Andi Kleen; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 10:22:34AM +0200, Andi Kleen wrote:
> Dave Chinner <david@fromorbit.com> writes:
>
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> >
> > new_inode() dirties a contended cache line to get increasing
> > inode numbers. This limits performance on workloads that cause
> > significant parallel inode allocation.
> >
> > Solve this problem by using a per_cpu variable fed by the shared
> > last_ino in batches of 1024 allocations. This reduces contention on
> > the shared last_ino, and give same spreading ino numbers than before
> > (i.e. same wraparound after 2^32 allocations).
>
> This doesn't help for Unix disk file systems, so not fully sure why you
> need it for XFS.
Currently i_ino is assigned for every inode allocated using new_inode.
It's pretty stupid as most callers simply don't need it. But Dave
didn't want to make this series even more complicated than necessary
and left sorting this out for later.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 8:22 ` Andi Kleen
2010-10-08 8:44 ` Christoph Hellwig
@ 2010-10-08 9:58 ` Al Viro
2010-10-08 10:09 ` Andi Kleen
1 sibling, 1 reply; 162+ messages in thread
From: Al Viro @ 2010-10-08 9:58 UTC (permalink / raw)
To: Andi Kleen; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 10:22:34AM +0200, Andi Kleen wrote:
> Dave Chinner <david@fromorbit.com> writes:
>
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> >
> > new_inode() dirties a contended cache line to get increasing
> > inode numbers. This limits performance on workloads that cause
> > significant parallel inode allocation.
> >
> > Solve this problem by using a per_cpu variable fed by the shared
> > last_ino in batches of 1024 allocations. This reduces contention on
> > the shared last_ino, and give same spreading ino numbers than before
> > (i.e. same wraparound after 2^32 allocations).
>
> This doesn't help for Unix disk file systems, so not fully sure why you
> need it for XFS.
>
> But looks reasonable, although it would be better to simply fix
> sockets/pipes/etc. to not allocate an inode numbers.
Can be done if you bother to add ->getattr() for those, but you'll need
to do some kind of lazy allocation of inumbers for those; fstat() _will_
want st_ino.
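A hedged sketch of what that lazy scheme could look like (not an actual
patch; the ->getattr prototype is the 2010-era one, get_next_ino() is the
helper name floated elsewhere in this thread, and a real version would need
to handle two stat() callers racing to assign the number):

static int sock_getattr(struct vfsmount *mnt, struct dentry *dentry,
			struct kstat *stat)
{
	struct inode *inode = dentry->d_inode;

	/* invent an inumber the first time anybody actually asks for one */
	if (unlikely(!inode->i_ino))
		inode->i_ino = get_next_ino();
	generic_fillattr(inode, stat);
	return 0;
}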
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 9:58 ` Al Viro
@ 2010-10-08 10:09 ` Andi Kleen
2010-10-08 10:19 ` Al Viro
0 siblings, 1 reply; 162+ messages in thread
From: Andi Kleen @ 2010-10-08 10:09 UTC (permalink / raw)
To: Al Viro; +Cc: Andi Kleen, Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 10:58:24AM +0100, Al Viro wrote:
> On Fri, Oct 08, 2010 at 10:22:34AM +0200, Andi Kleen wrote:
> > Dave Chinner <david@fromorbit.com> writes:
> >
> > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > >
> > > new_inode() dirties a contended cache line to get increasing
> > > inode numbers. This limits performance on workloads that cause
> > > significant parallel inode allocation.
> > >
> > > Solve this problem by using a per_cpu variable fed by the shared
> > > last_ino in batches of 1024 allocations. This reduces contention on
> > > the shared last_ino, and give same spreading ino numbers than before
> > > (i.e. same wraparound after 2^32 allocations).
> >
> > This doesn't help for Unix disk file systems, so not fully sure why you
> > need it for XFS.
> >
> > But looks reasonable, although it would be better to simply fix
> > sockets/pipes/etc. to not allocate an inode numbers.
>
> Can be done if you bother to add ->getattr() for those, but you'll need
> to do some kind of lazy allocation of inumbers for those; fstat() _will_
> want st_ino.
Why not just put 0 in st_ino for sockets/pipes/etc. ?
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 10:09 ` Andi Kleen
@ 2010-10-08 10:19 ` Al Viro
2010-10-08 10:20 ` Eric Dumazet
0 siblings, 1 reply; 162+ messages in thread
From: Al Viro @ 2010-10-08 10:19 UTC (permalink / raw)
To: Andi Kleen; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 12:09:44PM +0200, Andi Kleen wrote:
> > Can be done if you bother to add ->getattr() for those, but you'll need
> > to do some kind of lazy allocation of inumbers for those; fstat() _will_
> > want st_ino.
>
> Why not just put 0 in st_ino for sockets/pipes/etc. ?
Because it's a userland-visible change; right now you can compare
i_ino for sockets...
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 10:19 ` Al Viro
@ 2010-10-08 10:20 ` Eric Dumazet
0 siblings, 0 replies; 162+ messages in thread
From: Eric Dumazet @ 2010-10-08 10:20 UTC (permalink / raw)
To: Al Viro; +Cc: Andi Kleen, Dave Chinner, linux-fsdevel, linux-kernel
Le vendredi 08 octobre 2010 à 11:19 +0100, Al Viro a écrit :
> On Fri, Oct 08, 2010 at 12:09:44PM +0200, Andi Kleen wrote:
>
> > > Can be done if you bother to add ->getattr() for those, but you'll need
> > > to do some kind of lazy allocation of inumbers for those; fstat() _will_
> > > want st_ino.
> >
> > Why not just put 0 in st_ino for sockets/pipes/etc. ?
>
> Because it's a userland-visible change; right now you can compare
> i_ino for sockets...
lsof for example uses i_ino
--
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 5:21 ` [PATCH 15/18] fs: introduce a per-cpu last_ino allocator Dave Chinner
2010-10-08 7:53 ` Christoph Hellwig
2010-10-08 8:22 ` Andi Kleen
@ 2010-10-08 9:56 ` Al Viro
2010-10-08 10:03 ` Christoph Hellwig
2 siblings, 1 reply; 162+ messages in thread
From: Al Viro @ 2010-10-08 9:56 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:29PM +1100, Dave Chinner wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
>
> new_inode() dirties a contended cache line to get increasing
> inode numbers. This limits performance on workloads that cause
> significant parallel inode allocation.
>
> Solve this problem by using a per_cpu variable fed by the shared
> last_ino in batches of 1024 allocations. This reduces contention on
> the shared last_ino, and give same spreading ino numbers than before
> (i.e. same wraparound after 2^32 allocations).
FWIW, that one is begging to be split; what I mean is that there are
two classes of callers; ones that will set i_ino themselves anyway
and ones that really want i_ino invented. Two functions?
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 9:56 ` Al Viro
@ 2010-10-08 10:03 ` Christoph Hellwig
2010-10-08 10:20 ` Eric Dumazet
2010-10-16 7:57 ` Nick Piggin
0 siblings, 2 replies; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 10:03 UTC (permalink / raw)
To: Al Viro; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 10:56:58AM +0100, Al Viro wrote:
> FWIW, that one is begging to be split; what I mean is that there are
> two classes of callers; ones that will set i_ino themselves anyway
> and ones that really want i_ino invented. Two functions?
There's no reason to add i_ino before adding it to the per-sb list,
we don't do so either for inodes acquired via iget. The fix is simply
to stop assigning i_ino in new_inode and call the helper to get it in
the places that need it after the call to new_inode. Later we can
even move to a lazy assignment scheme where needed. I'd also really
like to get a grip on why the simple counters are fine for some
filesystems while we need iunique() for others.
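As a sketch, a converted caller would then look roughly like this (assuming
the helper ends up exported under the name suggested elsewhere in this
thread, get_next_ino(); somefs_get_inode() is a made-up example):

static struct inode *somefs_get_inode(struct super_block *sb)
{
	struct inode *inode = new_inode(sb);	/* no longer assigns i_ino */

	if (inode)
		inode->i_ino = get_next_ino();	/* only callers that want one pay for it */
	return inode;
}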
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 10:03 ` Christoph Hellwig
@ 2010-10-08 10:20 ` Eric Dumazet
2010-10-08 13:48 ` Christoph Hellwig
2010-10-16 7:57 ` Nick Piggin
1 sibling, 1 reply; 162+ messages in thread
From: Eric Dumazet @ 2010-10-08 10:20 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Al Viro, Dave Chinner, linux-fsdevel, linux-kernel
Le vendredi 08 octobre 2010 à 06:03 -0400, Christoph Hellwig a écrit :
> On Fri, Oct 08, 2010 at 10:56:58AM +0100, Al Viro wrote:
> > FWIW, that one is begging to be split; what I mean is that there are
> > two classes of callers; ones that will set i_ino themselves anyway
> > and ones that really want i_ino invented. Two functions?
>
> There's no reason to add i_ino before adding it to the per-sb list,
> we don't do so either for inodes acquired via iget. The fix is simply
> to stop assigning i_ino in new_inode and call the helper to get it in
> the place that need it after the call to new_inode. Later we can
> even move to a lazy assignment scheme where needed. I'd also really
> like to get a grip on why the simple counters if fine for some
> filesystems while we need iunique() for others.
If iunique() were scalable, sockets could use it, so that we could have
a hard guarantee that two sockets on a machine don't have the same inum.
A reasonable compromise here is to use a simple and scalable allocator,
and take the risk that two sockets have the same inum.
While it might break some applications playing fstat() games on
sockets, the current scheme is vastly faster.
I worked with machines with millions of open sockets concurrently;
iunique() was not an option, and the applications didn't care about a
possible inum clash.
For disk files, the inum _must_ be unique per fs; for sockets, it only
matters if you want strict compliance with some standards.
--
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 10:20 ` Eric Dumazet
@ 2010-10-08 13:48 ` Christoph Hellwig
2010-10-08 14:06 ` Eric Dumazet
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 13:48 UTC (permalink / raw)
To: Eric Dumazet
Cc: Christoph Hellwig, Al Viro, Dave Chinner, linux-fsdevel,
linux-kernel
On Fri, Oct 08, 2010 at 12:20:19PM +0200, Eric Dumazet wrote:
> If iunique() was scalable, sockets could use it, so that we can have
> hard guarantee two sockets on machine dont have same inum.
>
> A reasonable compromise here is to use a simple and scalable allocator,
> and take the risk two sockets have same inum.
>
> While it might break some applications playing fstats() games, on
> sockets, current schem is vastly faster.
>
> I worked with machines with millions of opened socket concurrently,
> iunique() was not an option, and application didnt care of possible inum
> clash.
The current version of iunique() is indeed rather suboptimal. As is
the pure counter approach. I think the right way to deal with it
is to use an idr allocator. This means the filesystem needs to
explicitly free the inode number when the inode is gone, but that
just makes the usage more clear. Together with the lazy assignment
scheme for synthetic filesystems that should give us both speed and
correctness.
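A hedged sketch of the idr approach (not part of the series; written against
the later idr_alloc()/idr_remove() interface for brevity, with a made-up
"foofs" and its own lock, and with the explicit free on teardown mentioned
above):

static DEFINE_IDR(foofs_ino_idr);
static DEFINE_SPINLOCK(foofs_ino_lock);

static int foofs_assign_ino(struct inode *inode)
{
	int ino;

	idr_preload(GFP_KERNEL);
	spin_lock(&foofs_ino_lock);
	ino = idr_alloc(&foofs_ino_idr, inode, 1, 0, GFP_NOWAIT);
	spin_unlock(&foofs_ino_lock);
	idr_preload_end();
	if (ino < 0)
		return ino;
	inode->i_ino = ino;
	return 0;
}

static void foofs_release_ino(struct inode *inode)
{
	/* the filesystem must free the number when the inode goes away */
	spin_lock(&foofs_ino_lock);
	idr_remove(&foofs_ino_idr, inode->i_ino);
	spin_unlock(&foofs_ino_lock);
}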
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 13:48 ` Christoph Hellwig
@ 2010-10-08 14:06 ` Eric Dumazet
2010-10-08 19:10 ` Christoph Hellwig
2010-10-09 17:14 ` Matthew Wilcox
0 siblings, 2 replies; 162+ messages in thread
From: Eric Dumazet @ 2010-10-08 14:06 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Al Viro, Dave Chinner, linux-fsdevel, linux-kernel
Le vendredi 08 octobre 2010 à 09:48 -0400, Christoph Hellwig a écrit :
> On Fri, Oct 08, 2010 at 12:20:19PM +0200, Eric Dumazet wrote:
> > If iunique() was scalable, sockets could use it, so that we can have
> > hard guarantee two sockets on machine dont have same inum.
> >
> > A reasonable compromise here is to use a simple and scalable allocator,
> > and take the risk two sockets have same inum.
> >
> > While it might break some applications playing fstats() games, on
> > sockets, current schem is vastly faster.
> >
> > I worked with machines with millions of opened socket concurrently,
> > iunique() was not an option, and application didnt care of possible inum
> > clash.
>
> The current version of iuniqueue is indeed rather suboptimal. As is
> the pure counter approach. I think the right way to deal with it
> is to use an idr allocator. This means the filesystem needs to
> explicitly free the inode number when the inode is gone, but that
> just makes the usage more clear. Together with the lazy assignment
> scheme for synthetic filesystems that should give us both speed and
> correctness.
>
On 32bit arches, inum generation for sockets/pipes could be pretty fast:
u32 rnd_val __read_mostly; /* seeded at boot time */
u32 get_inum(struct inode *ino, size_t size)
{
	return rnd_val ^ ((unsigned long)ino + random32() % size);
}
(I.e., use the fact that an inode is a kernel object with a given address
and a given size, so two inodes cannot overlap.)
I have no idea how scalable an idr allocator is, but it probably uses
one big lock.
Maybe finally generate 64bit inum on 64bit arches...
--
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 14:06 ` Eric Dumazet
@ 2010-10-08 19:10 ` Christoph Hellwig
2010-10-09 17:14 ` Matthew Wilcox
1 sibling, 0 replies; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 19:10 UTC (permalink / raw)
To: Eric Dumazet
Cc: Christoph Hellwig, Al Viro, Dave Chinner, linux-fsdevel,
linux-kernel
On Fri, Oct 08, 2010 at 04:06:12PM +0200, Eric Dumazet wrote:
> On 32bit arches, inum for sockets/pipes could be pretty fast
>
> unsigned u32 rnd_val __read_mostly; /* seeded at boot time */
>
> unsigned u32 get_inum(struct inode *ino, size_t size)
> {
> return rnd_val ^ ((long)ino + random32() % size);
> }
>
> (Ie , use fact that an inode is a kernel object, with a given address
> and a given size, two inodes cannot overlap)
Yeah, we could probably do this. From looking at the remaining users
of the last_ino replacement this could probably work for them.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 14:06 ` Eric Dumazet
2010-10-08 19:10 ` Christoph Hellwig
@ 2010-10-09 17:14 ` Matthew Wilcox
1 sibling, 0 replies; 162+ messages in thread
From: Matthew Wilcox @ 2010-10-09 17:14 UTC (permalink / raw)
To: Eric Dumazet
Cc: Christoph Hellwig, Al Viro, Dave Chinner, linux-fsdevel,
linux-kernel
On Fri, Oct 08, 2010 at 04:06:12PM +0200, Eric Dumazet wrote:
> Maybe finally generate 64bit inum on 64bit arches...
Last I heard, Eric Sandeen took a look at this, and it'd break something
like 30-40% of the userspace shipped in RHEL 5 if run as 32-bit on a
64-bit kernel. IOW, keep inode numbers to 32-bit for now.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-08 10:03 ` Christoph Hellwig
2010-10-08 10:20 ` Eric Dumazet
@ 2010-10-16 7:57 ` Nick Piggin
2010-10-16 16:22 ` Christoph Hellwig
1 sibling, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-16 7:57 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Al Viro, Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 06:03:46AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 08, 2010 at 10:56:58AM +0100, Al Viro wrote:
> > FWIW, that one is begging to be split; what I mean is that there are
> > two classes of callers; ones that will set i_ino themselves anyway
> > and ones that really want i_ino invented. Two functions?
>
> There's no reason to add i_ino before adding it to the per-sb list,
> we don't do so either for inodes acquired via iget. The fix is simply
> to stop assigning i_ino in new_inode and call the helper to get it in
> the place that need it after the call to new_inode. Later we can
My approach in my tree is a new function like Al suggests, which
simply doesn't assign the ino. That keeps compatibility backward.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-16 7:57 ` Nick Piggin
@ 2010-10-16 16:22 ` Christoph Hellwig
2010-10-16 17:21 ` Nick Piggin
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-16 16:22 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Hellwig, Al Viro, Dave Chinner, linux-fsdevel,
linux-kernel
On Sat, Oct 16, 2010 at 06:57:21PM +1100, Nick Piggin wrote:
> My approach in my tree is a new function like Al suggests, which
> simply doesn't assign the ino. That keeps backwards compatibility.
There's really no point. The concept of creating a new inode has
absolutely nothing to do with i_ino. We'll just need i_ino before
adding an inode to the hash. The only reason it's been done by
new_inode is historic coincidence - cleaning this mess up is a good
thing independent of making the fake inode number generation scale
better. As you can see in my patch moving it out there's actually
only very few filesystems that need it.
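With that, a pseudo filesystem that still wants a fake inum would just assign
it at the call site, e.g. (sketch only; examplefs is hypothetical, and
last_ino_get() is the per-cpu allocator from patch 15):

	static struct inode *examplefs_get_inode(struct super_block *sb, int mode)
	{
		struct inode *inode = new_inode(sb);	/* no longer sets i_ino */

		if (!inode)
			return NULL;
		inode->i_ino = last_ino_get();	/* fake inum only where it is needed */
		inode->i_mode = mode;
		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
		return inode;
	}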
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 15/18] fs: introduce a per-cpu last_ino allocator
2010-10-16 16:22 ` Christoph Hellwig
@ 2010-10-16 17:21 ` Nick Piggin
0 siblings, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-16 17:21 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Nick Piggin, Al Viro, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 12:22:01PM -0400, Christoph Hellwig wrote:
> On Sat, Oct 16, 2010 at 06:57:21PM +1100, Nick Piggin wrote:
> > My approach in my tree is a new function like Al suggests, which
> > simply doesn't assign the ino. That keeps backwards compatibility.
>
> There's really no point.
There is a point: backwards compatibility and churn. It's like a
single function call and a load from cache in the inode creation
path -- a drop in the ocean. So the churn isn't worth my time.
> The concept of creating a new inode has
> absolutely nothing to do with i_ino. We'll just need i_ino before
> adding an inode to the hash. The only reason it's been done by
> new_inode is historic coincidence - cleaning this mess up is a good
> thing independent of making the fake inode number generation scale
> better. As you can see in my patch moving it out there's actually
> only very few filesystems that need it.
Easy to just have a new name, IMO. But I won't get hung up arguing
the point.
^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH 16/18] fs: Make iunique independent of inode_lock
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (14 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 15/18] fs: introduce a per-cpu last_ino allocator Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 7:55 ` Christoph Hellwig
2010-10-08 5:21 ` [PATCH 17/18] fs: icache remove inode_lock Dave Chinner
` (3 subsequent siblings)
19 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Before removing the inode_lock, the iunique counter needs to be made
independent of the inode_lock. Add a new lock to protect the iunique
counter and nest it inside the inode_lock to provide the same
protection that the inode_lock currently provides.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 33 ++++++++++++++++++++++++++++-----
1 files changed, 28 insertions(+), 5 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 13e1325..4ec360e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1070,6 +1070,30 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
return inode;
}
+/*
+ * search the inode cache for a matching inode number.
+ * If we find one, then the inode number we are trying to
+ * allocate is not unique and so we should not use it.
+ *
+ * Returns 1 if the inode number is unique, 0 if it is not.
+ */
+static int test_inode_iunique(struct super_block * sb, unsigned long ino)
+{
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
+ struct hlist_bl_node *node;
+ struct inode *inode;
+
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
+ if (inode->i_ino == ino && inode->i_sb == sb) {
+ spin_unlock_bucket(b);
+ return 0;
+ }
+ }
+ spin_unlock_bucket(b);
+ return 1;
+}
+
/**
* iunique - get a unique inode number
* @sb: superblock
@@ -1091,19 +1115,18 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
* error if st_ino won't fit in target struct field. Use 32bit counter
* here to attempt to avoid that.
*/
+ static DEFINE_SPINLOCK(unique_lock);
static unsigned int counter;
- struct inode *inode;
- struct inode_hash_bucket *b;
ino_t res;
spin_lock(&inode_lock);
+ spin_lock(&unique_lock);
do {
if (counter <= max_reserved)
counter = max_reserved + 1;
res = counter++;
- b = inode_hashtable + hash(sb, res);
- inode = find_inode_fast(sb, b, res);
- } while (inode != NULL);
+ } while (!test_inode_iunique(sb, res));
+ spin_unlock(&unique_lock);
spin_unlock(&inode_lock);
return res;
--
1.7.1
^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 16/18] fs: Make iunique independent of inode_lock
2010-10-08 5:21 ` [PATCH 16/18] fs: Make iunique independent of inode_lock Dave Chinner
@ 2010-10-08 7:55 ` Christoph Hellwig
2010-10-08 8:06 ` Dave Chinner
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 7:55 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
> + hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
> + if (inode->i_ino == ino && inode->i_sb == sb) {
wouldn't it be more natural to test the sb first here?
Otherwise looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 16/18] fs: Make iunique independent of inode_lock
2010-10-08 7:55 ` Christoph Hellwig
@ 2010-10-08 8:06 ` Dave Chinner
2010-10-08 8:19 ` Christoph Hellwig
0 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 8:06 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 03:55:24AM -0400, Christoph Hellwig wrote:
> > + hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
> > + if (inode->i_ino == ino && inode->i_sb == sb) {
>
> wouldn't it be more natural to test the sb first here?
Maybe, but I think an inode number match is less likely, so the
order it currently does the check results in less code being
executed on misses. I'll have a look at what the rest of the code
does and use the same order.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 16/18] fs: Make iunique independent of inode_lock
2010-10-08 8:06 ` Dave Chinner
@ 2010-10-08 8:19 ` Christoph Hellwig
0 siblings, 0 replies; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 8:19 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 07:06:55PM +1100, Dave Chinner wrote:
> On Fri, Oct 08, 2010 at 03:55:24AM -0400, Christoph Hellwig wrote:
> > > + hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
> > > + if (inode->i_ino == ino && inode->i_sb == sb) {
> >
> > wouldn't it be more natural to test the sb first here?
>
> Maybe, but I think an inode number match is less likely, so the
> order it currently does the check results in less code being
> executed on misses.
Ok, sounds fine.
^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH 17/18] fs: icache remove inode_lock
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (15 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 16/18] fs: Make iunique independent of inode_lock Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 8:03 ` Christoph Hellwig
` (2 more replies)
2010-10-08 5:21 ` [PATCH 18/18] fs: Reduce inode I_FREEING and factor inode disposal Dave Chinner
` (2 subsequent siblings)
19 siblings, 3 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Dave Chinner <dchinner@redhat.com>
All the functionality that the inode_lock protected has now been
wrapped up in new independent locks and/or functionality. Hence the
inode_lock does not serve a purpose any longer and hence can now be
removed.
Based on work originally done by Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
Documentation/filesystems/Locking | 2 +-
Documentation/filesystems/porting | 10 ++++-
Documentation/filesystems/vfs.txt | 2 +-
fs/buffer.c | 2 +-
fs/drop_caches.c | 4 --
fs/fs-writeback.c | 47 ++++-----------------
fs/inode.c | 82 ++++---------------------------------
fs/logfs/inode.c | 2 +-
fs/notify/inode_mark.c | 11 ++---
fs/notify/mark.c | 1 -
fs/notify/vfsmount_mark.c | 1 -
fs/ntfs/inode.c | 4 +-
fs/ocfs2/inode.c | 2 +-
fs/quota/dquot.c | 12 +----
include/linux/fs.h | 2 +-
include/linux/writeback.h | 3 -
mm/backing-dev.c | 6 ---
mm/filemap.c | 6 +-
mm/rmap.c | 6 +-
19 files changed, 48 insertions(+), 157 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 2db4283..e92dad2 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -114,7 +114,7 @@ alloc_inode:
destroy_inode:
dirty_inode: (must not sleep)
write_inode:
-drop_inode: !!!inode_lock!!!
+drop_inode: !!!i_lock, sb_inode_list_lock!!!
evict_inode:
put_super: write
write_super: read
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index b12c895..ab07213 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -299,7 +299,7 @@ be used instead. It gets called whenever the inode is evicted, whether it has
remaining links or not. Caller does *not* evict the pagecache or inode-associated
metadata buffers; getting rid of those is responsibility of method, as it had
been for ->delete_inode().
- ->drop_inode() returns int now; it's called on final iput() with inode_lock
+ ->drop_inode() returns int now; it's called on final iput() with i_lock
held and it returns true if filesystems wants the inode to be dropped. As before,
generic_drop_inode() is still the default and it's been updated appropriately.
generic_delete_inode() is also alive and it consists simply of return 1. Note that
@@ -318,3 +318,11 @@ if it's zero is not *and* *never* *had* *been* enough. Final unlink() and iput(
may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
free the on-disk inode, you may end up doing that while ->write_inode() is writing
to it.
+
+
+[mandatory]
+ inode_lock is gone, replaced by fine grained locks. See fs/inode.c
+for details of what locks to replace inode_lock with in order to protect
+particular things. Most of the time, a filesystem only needs ->i_lock, which
+protects *all* the inode state and its membership on lists that was
+previously protected with inode_lock.
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index ed7e5ef..405beb2 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -246,7 +246,7 @@ or bottom half).
should be synchronous or not, not all filesystems check this flag.
drop_inode: called when the last access to the inode is dropped,
- with the inode_lock spinlock held.
+ with the i_lock and sb_inode_list_lock spinlock held.
This method should be either NULL (normal UNIX filesystem
semantics) or "generic_delete_inode" (for filesystems that do not
diff --git a/fs/buffer.c b/fs/buffer.c
index b5c4153..99a9f8d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1145,7 +1145,7 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
* inode list.
*
* mark_buffer_dirty() is atomic. It takes bh->b_page->mapping->private_lock,
- * mapping->tree_lock and the global inode_lock.
+ * and mapping->tree_lock.
*/
void mark_buffer_dirty(struct buffer_head *bh)
{
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 00180dc..2105713 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,7 +16,6 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
{
struct inode *inode, *toput_inode = NULL;
- spin_lock(&inode_lock);
spin_lock(&sb->s_inodes_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
@@ -28,15 +27,12 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
iref_locked(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb->s_inodes_lock);
- spin_unlock(&inode_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
iput(toput_inode);
toput_inode = inode;
- spin_lock(&inode_lock);
spin_lock(&sb->s_inodes_lock);
}
spin_unlock(&sb->s_inodes_lock);
- spin_unlock(&inode_lock);
iput(toput_inode);
}
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 404d449..f8eb27c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -184,7 +184,7 @@ static void requeue_io(struct inode *inode)
static void inode_sync_complete(struct inode *inode)
{
/*
- * Prevent speculative execution through spin_unlock(&inode_lock);
+ * Prevent speculative execution through spin_unlock(&inode->i_lock);
*/
smp_mb();
wake_up_bit(&inode->i_state, __I_SYNC);
@@ -283,25 +283,21 @@ static void inode_wait_for_writeback(struct inode *inode)
wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
while (inode->i_state & I_SYNC) {
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
}
}
/*
- * Write out an inode's dirty pages. Called under inode_lock. Either the
- * caller has ref on the inode (either via iref_locked or via syscall against an fd)
- * or the inode has I_WILL_FREE set (via generic_forget_inode)
+ * Write out an inode's dirty pages. Either the caller has ref on the inode
+ * (either via iref_locked or via syscall against an fd) or the inode has
+ * I_WILL_FREE set (via generic_forget_inode)
*
* If `wait' is set, wait on the writeout.
*
* The whole writeout design is quite complex and fragile. We want to avoid
* starvation of particular inodes when others are being redirtied, prevent
* livelocks, etc.
- *
- * Called under inode_lock.
*/
static int
writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
@@ -346,7 +342,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
inode->i_state |= I_SYNC;
inode->i_state &= ~I_DIRTY_PAGES;
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
ret = do_writepages(mapping, wbc);
@@ -366,12 +361,10 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* due to delalloc, clear dirty metadata flags right before
* write_inode()
*/
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
dirty = inode->i_state & I_DIRTY;
inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
int err = write_inode(inode, wbc);
@@ -379,7 +372,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
ret = err;
}
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
inode->i_state &= ~I_SYNC;
if (!(inode->i_state & I_FREEING)) {
@@ -527,10 +519,8 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
redirty_tail(inode);
spin_unlock(&wb->b_lock);
}
- spin_unlock(&inode_lock);
iput(inode);
cond_resched();
- spin_lock(&inode_lock);
spin_lock(&wb->b_lock);
if (wbc->nr_to_write <= 0) {
wbc->more_io = 1;
@@ -550,9 +540,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
if (!wbc->wb_start)
wbc->wb_start = jiffies; /* livelock avoidance */
- spin_lock(&inode_lock);
spin_lock(&wb->b_lock);
-
if (!wbc->for_kupdate || list_empty(&wb->b_io))
queue_io(wb, wbc->older_than_this);
@@ -572,7 +560,6 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
break;
}
spin_unlock(&wb->b_lock);
- spin_unlock(&inode_lock);
/* Leave any unwritten inodes on b_io */
}
@@ -581,13 +568,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
{
WARN_ON(!rwsem_is_locked(&sb->s_umount));
- spin_lock(&inode_lock);
spin_lock(&wb->b_lock);
if (!wbc->for_kupdate || list_empty(&wb->b_io))
queue_io(wb, wbc->older_than_this);
writeback_sb_inodes(sb, wb, wbc, true);
spin_unlock(&wb->b_lock);
- spin_unlock(&inode_lock);
}
/*
@@ -697,7 +682,6 @@ static long wb_writeback(struct bdi_writeback *wb,
* become available for writeback. Otherwise
* we'll just busyloop.
*/
- spin_lock(&inode_lock);
if (!list_empty(&wb->b_more_io)) {
spin_lock(&wb->b_lock);
inode = list_entry(wb->b_more_io.prev,
@@ -708,7 +692,6 @@ static long wb_writeback(struct bdi_writeback *wb,
inode_wait_for_writeback(inode);
spin_unlock(&inode->i_lock);
}
- spin_unlock(&inode_lock);
}
return wrote;
@@ -971,7 +954,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (unlikely(block_dump))
block_dump___mark_inode_dirty(inode);
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;
@@ -1029,8 +1011,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
out_unlock:
spin_unlock(&inode->i_lock);
out:
- spin_unlock(&inode_lock);
-
if (wakeup_bdi)
bdi_wakeup_thread_delayed(bdi);
}
@@ -1063,7 +1043,6 @@ static void wait_sb_inodes(struct super_block *sb)
*/
WARN_ON(!rwsem_is_locked(&sb->s_umount));
- spin_lock(&inode_lock);
spin_lock(&sb->s_inodes_lock);
/*
@@ -1086,14 +1065,12 @@ static void wait_sb_inodes(struct super_block *sb)
iref_locked(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb->s_inodes_lock);
- spin_unlock(&inode_lock);
/*
- * We hold a reference to 'inode' so it couldn't have
- * been removed from s_inodes list while we dropped the
- * inode_lock. We cannot iput the inode now as we can
- * be holding the last reference and we cannot iput it
- * under inode_lock. So we keep the reference and iput
- * it later.
+ * We hold a reference to 'inode' so it couldn't have been
+ * removed from s_inodes list while we dropped the
+ * s_inodes_lock. We cannot iput the inode now as we can be
+ * holding the last reference and we cannot iput it under
+ * s_inodes_lock. So we keep the reference and iput it later.
*/
iput(old_inode);
old_inode = inode;
@@ -1102,11 +1079,9 @@ static void wait_sb_inodes(struct super_block *sb)
cond_resched();
- spin_lock(&inode_lock);
spin_lock(&sb->s_inodes_lock);
}
spin_unlock(&sb->s_inodes_lock);
- spin_unlock(&inode_lock);
iput(old_inode);
}
@@ -1209,9 +1184,7 @@ int write_inode_now(struct inode *inode, int sync)
wbc.nr_to_write = 0;
might_sleep();
- spin_lock(&inode_lock);
ret = writeback_single_inode(inode, &wbc);
- spin_unlock(&inode_lock);
if (sync)
inode_sync_wait(inode);
return ret;
@@ -1233,9 +1206,7 @@ int sync_inode(struct inode *inode, struct writeback_control *wbc)
{
int ret;
- spin_lock(&inode_lock);
ret = writeback_single_inode(inode, wbc);
- spin_unlock(&inode_lock);
return ret;
}
EXPORT_SYMBOL(sync_inode);
diff --git a/fs/inode.c b/fs/inode.c
index 4ec360e..c778ec4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -41,11 +41,9 @@
* inode_lru, i_lru
*
* Lock orders
- * inode_lock
* inode hash bucket lock
* inode->i_lock
*
- * inode_lock
* sb inode lock
* inode_lru_lock
* wb->b_lock
@@ -118,14 +116,6 @@ static inline void spin_unlock_bucket(struct inode_hash_bucket *b)
static struct inode_hash_bucket *inode_hashtable __read_mostly;
/*
- * A simple spinlock to protect the list manipulations.
- *
- * NOTE! You also have to own the lock if you change
- * the i_state of an inode while it is in use..
- */
-DEFINE_SPINLOCK(inode_lock);
-
-/*
* iprune_sem provides exclusion between the kswapd or try_to_free_pages
* icache shrinking path, and the umount path. Without this exclusion,
* by the time prune_icache calls iput for the inode whose pages it has
@@ -357,7 +347,7 @@ static void init_once(void *foo)
}
/*
- * inode_lock must be held
+ * i_lock must be held
*/
void iref_locked(struct inode *inode)
{
@@ -369,11 +359,9 @@ EXPORT_SYMBOL_GPL(iref_locked);
void iref(struct inode *inode)
{
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
iref_locked(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
}
EXPORT_SYMBOL_GPL(iref);
@@ -439,11 +427,9 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
struct inode_hash_bucket *b;
b = inode_hashtable + hash(inode->i_sb, hashval);
- spin_lock(&inode_lock);
spin_lock_bucket(b);
hlist_bl_add_head(&inode->i_hash, &b->head);
spin_unlock_bucket(b);
- spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(__insert_inode_hash);
@@ -472,9 +458,7 @@ static void __remove_inode_hash(struct inode *inode)
*/
void remove_inode_hash(struct inode *inode)
{
- spin_lock(&inode_lock);
__remove_inode_hash(inode);
- spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(remove_inode_hash);
@@ -526,12 +510,10 @@ static void dispose_list(struct list_head *head)
evict(inode);
- spin_lock(&inode_lock);
__remove_inode_hash(inode);
spin_lock(&inode->i_sb->s_inodes_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&inode->i_sb->s_inodes_lock);
- spin_unlock(&inode_lock);
wake_up_inode(inode);
destroy_inode(inode);
@@ -558,7 +540,6 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
* change during umount anymore, and because iprune_sem keeps
* shrink_icache_memory() away.
*/
- cond_resched_lock(&inode_lock);
cond_resched_lock(&sb->s_inodes_lock);
next = next->next;
@@ -614,12 +595,10 @@ int invalidate_inodes(struct super_block *sb)
LIST_HEAD(throw_away);
down_write(&iprune_sem);
- spin_lock(&inode_lock);
spin_lock(&sb->s_inodes_lock);
fsnotify_unmount_inodes(&sb->s_inodes);
busy = invalidate_list(sb, &sb->s_inodes, &throw_away);
spin_unlock(&sb->s_inodes_lock);
- spin_unlock(&inode_lock);
dispose_list(&throw_away);
up_write(&iprune_sem);
@@ -644,7 +623,7 @@ static int can_unuse(struct inode *inode)
/*
* Scan `goal' inodes on the unused list for freeable ones. They are moved to
- * a temporary list and then are freed outside inode_lock by dispose_list().
+ * a temporary list and then are freed outside LRU lock by dispose_list().
*
* Any inodes which are pinned purely because of attached pagecache have their
* pagecache removed. We expect the final iput() on that inode to add it to
@@ -662,7 +641,6 @@ static void prune_icache(int nr_to_scan)
unsigned long reap = 0;
down_read(&iprune_sem);
- spin_lock(&inode_lock);
spin_lock(&inode_lru_lock);
for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
struct inode *inode;
@@ -690,12 +668,10 @@ static void prune_icache(int nr_to_scan)
iref_locked(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lru_lock);
- spin_unlock(&inode_lock);
if (remove_inode_buffers(inode))
reap += invalidate_mapping_pages(&inode->i_data,
0, -1);
iput(inode);
- spin_lock(&inode_lock);
spin_lock(&inode_lru_lock);
spin_lock(&inode->i_lock);
@@ -733,7 +709,6 @@ static void prune_icache(int nr_to_scan)
else
__count_vm_events(PGINODESTEAL, reap);
spin_unlock(&inode_lru_lock);
- spin_unlock(&inode_lock);
dispose_list(&freeable);
up_read(&iprune_sem);
@@ -854,9 +829,9 @@ __inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
* @inode: inode to mark in use
*
* When an inode is allocated it needs to be accounted for, added to the in use
- * list, the owning superblock and the inode hash. This needs to be done under
- * the inode_lock, so export a function to do this rather than the inode lock
- * itself. We calculate the hash list to add to here so it is all internal
+ * list, the owning superblock and the inode hash.
+ *
+ * We calculate the hash list to add to here so it is all internal
* which requires the caller to have already set up the inode number in the
* inode to add.
*/
@@ -864,9 +839,7 @@ void inode_add_to_lists(struct super_block *sb, struct inode *inode)
{
struct inode_hash_bucket *b = inode_hashtable + hash(sb, inode->i_ino);
- spin_lock(&inode_lock);
__inode_add_to_lists(sb, b, inode);
- spin_unlock(&inode_lock);
}
EXPORT_SYMBOL_GPL(inode_add_to_lists);
@@ -923,15 +896,11 @@ struct inode *new_inode(struct super_block *sb)
{
struct inode *inode;
- spin_lock_prefetch(&inode_lock);
-
inode = alloc_inode(sb);
if (inode) {
- spin_lock(&inode_lock);
inode->i_ino = last_ino_get();
inode->i_state = 0;
__inode_add_to_lists(sb, NULL, inode);
- spin_unlock(&inode_lock);
}
return inode;
}
@@ -990,7 +959,6 @@ static struct inode *get_new_inode(struct super_block *sb,
if (inode) {
struct inode *old;
- spin_lock(&inode_lock);
/* We released the lock, so.. */
old = find_inode(sb, b, test, data);
if (!old) {
@@ -999,7 +967,6 @@ static struct inode *get_new_inode(struct super_block *sb,
inode->i_state = I_NEW;
__inode_add_to_lists(sb, b, inode);
- spin_unlock(&inode_lock);
/* Return the locked inode with I_NEW set, the
* caller is responsible for filling in the contents
@@ -1014,7 +981,6 @@ static struct inode *get_new_inode(struct super_block *sb,
*/
iref_locked(old);
spin_unlock(&old->i_lock);
- spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
wait_on_inode(inode);
@@ -1022,7 +988,6 @@ static struct inode *get_new_inode(struct super_block *sb,
return inode;
set_failed:
- spin_unlock(&inode_lock);
destroy_inode(inode);
return NULL;
}
@@ -1040,14 +1005,12 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
if (inode) {
struct inode *old;
- spin_lock(&inode_lock);
/* We released the lock, so.. */
old = find_inode_fast(sb, b, ino);
if (!old) {
inode->i_ino = ino;
__inode_add_to_lists(sb, b, inode);
inode->i_state = I_NEW;
- spin_unlock(&inode_lock);
/* Return the locked inode with I_NEW set, the
* caller is responsible for filling in the contents
@@ -1062,7 +1025,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
*/
iref_locked(old);
spin_unlock(&old->i_lock);
- spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
wait_on_inode(inode);
@@ -1119,7 +1081,6 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
static unsigned int counter;
ino_t res;
- spin_lock(&inode_lock);
spin_lock(&unique_lock);
do {
if (counter <= max_reserved)
@@ -1127,7 +1088,6 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
res = counter++;
} while (!test_inode_iunique(sb, res));
spin_unlock(&unique_lock);
- spin_unlock(&inode_lock);
return res;
}
@@ -1135,7 +1095,6 @@ EXPORT_SYMBOL(iunique);
struct inode *igrab(struct inode *inode)
{
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
iref_locked(inode);
@@ -1149,7 +1108,6 @@ struct inode *igrab(struct inode *inode)
*/
inode = NULL;
}
- spin_unlock(&inode_lock);
return inode;
}
EXPORT_SYMBOL(igrab);
@@ -1171,7 +1129,7 @@ EXPORT_SYMBOL(igrab);
*
* Otherwise NULL is returned.
*
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
*/
static struct inode *ifind(struct super_block *sb,
struct inode_hash_bucket *b,
@@ -1180,17 +1138,14 @@ static struct inode *ifind(struct super_block *sb,
{
struct inode *inode;
- spin_lock(&inode_lock);
inode = find_inode(sb, b, test, data);
if (inode) {
iref_locked(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
if (likely(wait))
wait_on_inode(inode);
return inode;
}
- spin_unlock(&inode_lock);
return NULL;
}
@@ -1215,16 +1170,13 @@ static struct inode *ifind_fast(struct super_block *sb,
{
struct inode *inode;
- spin_lock(&inode_lock);
inode = find_inode_fast(sb, b, ino);
if (inode) {
iref_locked(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
wait_on_inode(inode);
return inode;
}
- spin_unlock(&inode_lock);
return NULL;
}
@@ -1247,7 +1199,7 @@ static struct inode *ifind_fast(struct super_block *sb,
*
* Otherwise NULL is returned.
*
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
*/
struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
@@ -1275,7 +1227,7 @@ EXPORT_SYMBOL(ilookup5_nowait);
*
* Otherwise NULL is returned.
*
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
*/
struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
@@ -1326,7 +1278,7 @@ EXPORT_SYMBOL(ilookup);
* inode and this is returned locked, hashed, and with the I_NEW flag set. The
* file system gets to fill it in before unlocking it via unlock_new_inode().
*
- * Note both @test and @set are called with the inode_lock held, so can't sleep.
+ * Note both @test and @set are called with the i_lock held, so can't sleep.
*/
struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *),
@@ -1391,7 +1343,6 @@ int insert_inode_locked(struct inode *inode)
while (1) {
struct hlist_bl_node *node;
struct inode *old = NULL;
- spin_lock(&inode_lock);
spin_lock_bucket(b);
hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
if (old->i_ino != ino)
@@ -1408,13 +1359,11 @@ int insert_inode_locked(struct inode *inode)
if (likely(!node)) {
hlist_bl_add_head(&inode->i_hash, &b->head);
spin_unlock_bucket(b);
- spin_unlock(&inode_lock);
return 0;
}
iref_locked(old);
spin_unlock(&old->i_lock);
spin_unlock_bucket(b);
- spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
iput(old);
@@ -1437,7 +1386,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
struct hlist_bl_node *node;
struct inode *old = NULL;
- spin_lock(&inode_lock);
spin_lock_bucket(b);
hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
if (old->i_sb != sb)
@@ -1454,13 +1402,11 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
if (likely(!node)) {
hlist_bl_add_head(&inode->i_hash, &b->head);
spin_unlock_bucket(b);
- spin_unlock(&inode_lock);
return 0;
}
iref_locked(old);
spin_unlock(&old->i_lock);
spin_unlock_bucket(b);
- spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
iput(old);
@@ -1523,15 +1469,12 @@ static void iput_final(struct inode *inode)
return;
}
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
return;
}
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_WILL_FREE;
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
write_inode_now(inode, 1);
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
@@ -1556,7 +1499,6 @@ static void iput_final(struct inode *inode)
list_del_init(&inode->i_sb_list);
spin_unlock(&sb->s_inodes_lock);
- spin_unlock(&inode_lock);
evict(inode);
remove_inode_hash(inode);
wake_up_inode(inode);
@@ -1576,7 +1518,6 @@ static void iput_final(struct inode *inode)
void iput(struct inode *inode)
{
if (inode) {
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
BUG_ON(inode->i_state & I_CLEAR);
@@ -1586,7 +1527,6 @@ void iput(struct inode *inode)
return;
}
spin_unlock(&inode->i_lock);
- spin_lock(&inode_lock);
}
}
EXPORT_SYMBOL(iput);
@@ -1766,8 +1706,6 @@ EXPORT_SYMBOL(inode_wait);
* It doesn't matter if I_NEW is not set initially, a call to
* wake_up_inode() after removing from the hash list will DTRT.
*
- * This is called with inode_lock held.
- *
* Called with i_lock held and returns with it dropped.
*/
static void __wait_on_freeing_inode(struct inode *inode)
@@ -1777,10 +1715,8 @@ static void __wait_on_freeing_inode(struct inode *inode)
wq = bit_waitqueue(&inode->i_state, __I_NEW);
prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
schedule();
finish_wait(wq, &wait.wait);
- spin_lock(&inode_lock);
}
static __initdata unsigned long ihash_entries;
diff --git a/fs/logfs/inode.c b/fs/logfs/inode.c
index d8c71ec..a67b607 100644
--- a/fs/logfs/inode.c
+++ b/fs/logfs/inode.c
@@ -286,7 +286,7 @@ static int logfs_write_inode(struct inode *inode, struct writeback_control *wbc)
return ret;
}
-/* called with inode_lock held */
+/* called with i_lock held */
static int logfs_drop_inode(struct inode *inode)
{
struct logfs_super *super = logfs_super(inode->i_sb);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 8a05213..57c28ae 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -22,7 +22,7 @@
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/spinlock.h>
-#include <linux/writeback.h> /* for inode_lock */
+#include <linux/writeback.h>
#include <asm/atomic.h>
@@ -232,9 +232,8 @@ out:
* fsnotify_unmount_inodes - an sb is unmounting. handle any watched inodes.
* @list: list of inodes being unmounted (sb->s_inodes)
*
- * Called with inode_lock held, protecting the unmounting super block's list
- * of inodes, and with iprune_mutex held, keeping shrink_icache_memory() at bay.
- * We temporarily drop inode_lock, however, and CAN block.
+ * Called with iprune_mutex held, keeping shrink_icache_memory() at bay.
+ * sb_inode_list_lock to protect the super block's list of inodes.
*/
void fsnotify_unmount_inodes(struct list_head *list)
{
@@ -288,13 +287,12 @@ void fsnotify_unmount_inodes(struct list_head *list)
}
/*
- * We can safely drop inode_lock here because we hold
+ * We can safely drop sb->s_inodes_lock here because we hold
* references on both inode and next_i. Also no new inodes
* will be added since the umount has begun. Finally,
* iprune_mutex keeps shrink_icache_memory() away.
*/
spin_unlock(&sb->s_inodes_lock);
- spin_unlock(&inode_lock);
if (need_iput_tmp)
iput(need_iput_tmp);
@@ -306,7 +304,6 @@ void fsnotify_unmount_inodes(struct list_head *list)
iput(inode);
- spin_lock(&inode_lock);
spin_lock(&sb->s_inodes_lock);
}
}
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 325185e..50c0085 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -91,7 +91,6 @@
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/srcu.h>
-#include <linux/writeback.h> /* for inode_lock */
#include <asm/atomic.h>
diff --git a/fs/notify/vfsmount_mark.c b/fs/notify/vfsmount_mark.c
index 56772b5..6f8eefe 100644
--- a/fs/notify/vfsmount_mark.c
+++ b/fs/notify/vfsmount_mark.c
@@ -23,7 +23,6 @@
#include <linux/mount.h>
#include <linux/mutex.h>
#include <linux/spinlock.h>
-#include <linux/writeback.h> /* for inode_lock */
#include <asm/atomic.h>
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index 93622b1..7c530f3 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -54,7 +54,7 @@
*
* Return 1 if the attributes match and 0 if not.
*
- * NOTE: This function runs with the inode_lock spin lock held so it is not
+ * NOTE: This function runs with the i_lock spin lock held so it is not
* allowed to sleep.
*/
int ntfs_test_inode(struct inode *vi, ntfs_attr *na)
@@ -98,7 +98,7 @@ int ntfs_test_inode(struct inode *vi, ntfs_attr *na)
*
* Return 0 on success and -errno on error.
*
- * NOTE: This function runs with the inode_lock spin lock held so it is not
+ * NOTE: This function runs with the i_lock spin lock held so it is not
* allowed to sleep. (Hence the GFP_ATOMIC allocation.)
*/
static int ntfs_init_locked_inode(struct inode *vi, ntfs_attr *na)
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index eece3e0..65c61e2 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1195,7 +1195,7 @@ void ocfs2_evict_inode(struct inode *inode)
ocfs2_clear_inode(inode);
}
-/* Called under inode_lock, with no more references on the
+/* Called under i_lock, with no more references on the
* struct inode, so it's safe here to check the flags field
* and to manipulate i_nlink without any other locks. */
int ocfs2_drop_inode(struct inode *inode)
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index c7b5fc6..533cd95 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -76,7 +76,7 @@
#include <linux/buffer_head.h>
#include <linux/capability.h>
#include <linux/quotaops.h>
-#include <linux/writeback.h> /* for inode_lock, oddly enough.. */
+#include <linux/writeback.h>
#include <asm/uaccess.h>
@@ -896,7 +896,6 @@ static void add_dquot_ref(struct super_block *sb, int type)
int reserved = 0;
#endif
- spin_lock(&inode_lock);
spin_lock(&sb->s_inodes_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
@@ -914,21 +913,18 @@ static void add_dquot_ref(struct super_block *sb, int type)
iref_locked(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb->s_inodes_lock);
- spin_unlock(&inode_lock);
iput(old_inode);
__dquot_initialize(inode, type);
/* We hold a reference to 'inode' so it couldn't have been
- * removed from s_inodes list while we dropped the inode_lock.
+ * removed from s_inodes list while we dropped the lock.
* We cannot iput the inode now as we can be holding the last
- * reference and we cannot iput it under inode_lock. So we
+ * reference and we cannot iput it under the lock. So we
* keep the reference and iput it later. */
old_inode = inode;
- spin_lock(&inode_lock);
spin_lock(&sb->s_inodes_lock);
}
spin_unlock(&sb->s_inodes_lock);
- spin_unlock(&inode_lock);
iput(old_inode);
#ifdef CONFIG_QUOTA_DEBUG
@@ -1009,7 +1005,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
struct inode *inode;
int reserved = 0;
- spin_lock(&inode_lock);
spin_lock(&sb->s_inodes_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
/*
@@ -1025,7 +1020,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
}
}
spin_unlock(&sb->s_inodes_lock);
- spin_unlock(&inode_lock);
#ifdef CONFIG_QUOTA_DEBUG
if (reserved) {
printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 54c4e86..453e0b4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1588,7 +1588,7 @@ struct super_operations {
};
/*
- * Inode state bits. Protected by inode_lock.
+ * Inode state bits. Protected by i_lock.
*
* Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
* I_DIRTY_DATASYNC and I_DIRTY_PAGES.
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index b182ccc..67be7a2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -9,9 +9,6 @@
struct backing_dev_info;
-extern spinlock_t inode_lock;
-
-
/*
* fs/fs-writeback.c
*/
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 74e8269..0c0586b 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -73,7 +73,6 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
struct inode *inode;
nr_wb = nr_dirty = nr_io = nr_more_io = 0;
- spin_lock(&inode_lock);
spin_lock(&wb->b_lock);
list_for_each_entry(inode, &wb->b_dirty, i_io)
nr_dirty++;
@@ -82,7 +81,6 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
list_for_each_entry(inode, &wb->b_more_io, i_io)
nr_more_io++;
spin_unlock(&wb->b_lock);
- spin_unlock(&inode_lock);
global_dirty_limits(&background_thresh, &dirty_thresh);
bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -695,7 +693,6 @@ void mapping_set_bdi(struct address_space *mapping,
if (unlikely(old == bdi))
return;
- spin_lock(&inode_lock);
bdi_lock_two(bdi, old);
if (!list_empty(&inode->i_io)) {
struct inode *i;
@@ -727,7 +724,6 @@ found:
mapping->a_bdi = bdi;
spin_unlock(&bdi->wb.b_lock);
spin_unlock(&old->wb.b_lock);
- spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(mapping_set_bdi);
@@ -743,7 +739,6 @@ void bdi_destroy(struct backing_dev_info *bdi)
struct bdi_writeback *dst = &default_backing_dev_info.wb;
struct inode *i, *tmp;
- spin_lock(&inode_lock);
bdi_lock_two(bdi, &default_backing_dev_info);
list_for_each_entry_safe(i, tmp, &bdi->wb.b_dirty, i_io) {
list_del(&i->i_io);
@@ -762,7 +757,6 @@ void bdi_destroy(struct backing_dev_info *bdi)
}
spin_unlock(&bdi->wb.b_lock);
spin_unlock(&dst->b_lock);
- spin_unlock(&inode_lock);
}
bdi_unregister(bdi);
diff --git a/mm/filemap.c b/mm/filemap.c
index 454d5ec..857fb34 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -80,7 +80,7 @@
* ->i_mutex
* ->i_alloc_sem (various)
*
- * ->inode_lock
+ * ->i_lock
* ->sb_lock (fs/fs-writeback.c)
* ->mapping->tree_lock (__sync_single_inode)
*
@@ -98,8 +98,8 @@
* ->zone.lru_lock (check_pte_range->isolate_lru_page)
* ->private_lock (page_remove_rmap->set_page_dirty)
* ->tree_lock (page_remove_rmap->set_page_dirty)
- * ->inode_lock (page_remove_rmap->set_page_dirty)
- * ->inode_lock (zap_pte_range->set_page_dirty)
+ * ->i_lock (page_remove_rmap->set_page_dirty)
+ * ->i_lock (zap_pte_range->set_page_dirty)
* ->private_lock (zap_pte_range->__set_page_dirty_buffers)
*
* ->task->proc_lock
diff --git a/mm/rmap.c b/mm/rmap.c
index 92e6757..dbfccae 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -31,11 +31,11 @@
* swap_lock (in swap_duplicate, swap_info_get)
* mmlist_lock (in mmput, drain_mmlist and others)
* mapping->private_lock (in __set_page_dirty_buffers)
- * inode_lock (in set_page_dirty's __mark_inode_dirty)
- * sb_lock (within inode_lock in fs/fs-writeback.c)
+ * i_lock (in set_page_dirty's __mark_inode_dirty)
+ * sb_lock (within i_lock in fs/fs-writeback.c)
* mapping->tree_lock (widely used, in set_page_dirty,
* in arch-dependent flush_dcache_mmap_lock,
- * within inode_lock in __sync_single_inode)
+ * within i_lock in __sync_single_inode)
*
* (code doesn't rely on that order so it could be switched around)
* ->tasklist_lock
--
1.7.1
^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-08 5:21 ` [PATCH 17/18] fs: icache remove inode_lock Dave Chinner
@ 2010-10-08 8:03 ` Christoph Hellwig
2010-10-08 8:09 ` Dave Chinner
2010-10-13 7:20 ` Nick Piggin
2010-10-16 7:57 ` Nick Piggin
2 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 8:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:31PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> All the functionality that the inode_lock protected has now been
> wrapped up in new independent locks and/or functionality. Hence the
> inode_lock does not serve a purpose any longer and hence can now be
> removed.
Might be worth mentioning this also updates the locking / lock order
documentation all over the place.
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -114,7 +114,7 @@ alloc_inode:
> destroy_inode:
> dirty_inode: (must not sleep)
> write_inode:
> -drop_inode: !!!inode_lock!!!
> +drop_inode: !!!i_lock, sb_inode_list_lock!!!
sb_inode_list_lock now is sb->s_inodes_lock, this also applies in a few
other places.
> +[mandatory]
> + inode_lock is gone, replaced by fine grained locks. See fs/inode.c
> +for details of what locks to replace inode_lock with in order to protect
> +particular things. Most of the time, a filesystem only needs ->i_lock, which
> +protects *all* the inode state and its membership on lists that was
> +previously protected with inode_lock.
Which list membership does i_lock protect?
> --- a/fs/notify/inode_mark.c
> +++ b/fs/notify/inode_mark.c
> @@ -22,7 +22,7 @@
> #include <linux/module.h>
> #include <linux/mutex.h>
> #include <linux/spinlock.h>
> -#include <linux/writeback.h> /* for inode_lock */
> +#include <linux/writeback.h>
Do we still need writeback.h here?
> @@ -76,7 +76,7 @@
> #include <linux/buffer_head.h>
> #include <linux/capability.h>
> #include <linux/quotaops.h>
> -#include <linux/writeback.h> /* for inode_lock, oddly enough.. */
> +#include <linux/writeback.h>
Same here.
Otherwise looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-08 8:03 ` Christoph Hellwig
@ 2010-10-08 8:09 ` Dave Chinner
0 siblings, 0 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 8:09 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:03:19AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 08, 2010 at 04:21:31PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > All the functionality that the inode_lock protected has now been
> > wrapped up in new independent locks and/or functionality. Hence the
> > inode_lock does not serve a purpose any longer and hence can now be
> > removed.
>
> Might be worth mentioning this also updates the locking / lock order
> documentation all over the place.
Ok.
>
> > --- a/Documentation/filesystems/Locking
> > +++ b/Documentation/filesystems/Locking
> > @@ -114,7 +114,7 @@ alloc_inode:
> > destroy_inode:
> > dirty_inode: (must not sleep)
> > write_inode:
> > -drop_inode: !!!inode_lock!!!
> > +drop_inode: !!!i_lock, sb_inode_list_lock!!!
>
> sb_inode_list_lock now is sb->s_inodes_lock, this also applies in a few
> other places.
>
>
> > +[mandatory]
> > + inode_lock is gone, replaced by fine grained locks. See fs/inode.c
> > +for details of what locks to replace inode_lock with in order to protect
> > +particular things. Most of the time, a filesystem only needs ->i_lock, which
> > +protects *all* the inode state and its membership on lists that was
> > +previously protected with inode_lock.
>
> Which list membership does i_lock protect?
Oops, I missed updating the documentation file. Will fix that up.
> > --- a/fs/notify/inode_mark.c
> > +++ b/fs/notify/inode_mark.c
> > @@ -22,7 +22,7 @@
> > #include <linux/module.h>
> > #include <linux/mutex.h>
> > #include <linux/spinlock.h>
> > -#include <linux/writeback.h> /* for inode_lock */
> > +#include <linux/writeback.h>
>
> Do we still need writeback.h here?
>
> > @@ -76,7 +76,7 @@
> > #include <linux/buffer_head.h>
> > #include <linux/capability.h>
> > #include <linux/quotaops.h>
> > -#include <linux/writeback.h> /* for inode_lock, oddly enough.. */
> > +#include <linux/writeback.h>
>
> Same here.
I'll check them and clean it up appropriately.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-08 5:21 ` [PATCH 17/18] fs: icache remove inode_lock Dave Chinner
2010-10-08 8:03 ` Christoph Hellwig
@ 2010-10-13 7:20 ` Nick Piggin
2010-10-13 7:27 ` Nick Piggin
` (2 more replies)
2010-10-16 7:57 ` Nick Piggin
2 siblings, 3 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-13 7:20 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:31PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> All the functionality that the inode_lock protected has now been
> wrapped up in new independent locks and/or functionality. Hence the
> inode_lock does not serve a purpose any longer and hence can now be
> removed.
>
> Based on work originally done by Nick Piggin.
Sorry about being offline for so long. I had some work finishing with
SUSE and then took some vacation without much net access for several
weeks :P
Unfortunate timing that everybody is suddenly interested in the
scalability work :) I didn't want to dump a lot of patches just
before I went and not be able to support them if they were merged /
respond to review in a timely way. But I still want to maintain my
vfs-scale stack. I'm glad to see lots of interest in it now.
So I would welcome criticism of that and hopefully fold improvements back.
The problem I guess with taking the patches and reworking them a bit is
just that I have lost a bit of context of what you're doing, and also
it loses its verification within the entire series (i.e. the end goal
of doing store-free path walking relies a bit on RCU inodes, for example),
and I've done a lot of microbenchmarking.
I don't see any radical changes that you've done yet, although it's
hard to tell exactly.
I'm not sure about trylocking. I don't think it is an unmaintainable
mess in the inode code, because it is confined to the core (fs/inode.c,
fs/fs-writeback.c etc.), and not visible outside it.
But let me get back up to speed and see what you've done here.
I might need a little more time than 2.6.37! But I'll try my best.
I don't think the patchset has suddenly become vastly more urgent
in the past month, so I think my approach of having it get a lot
of testing and go in Al's vfs tree for a while is best.
Thanks,
Nick
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-13 7:20 ` Nick Piggin
@ 2010-10-13 7:27 ` Nick Piggin
2010-10-13 11:28 ` Christoph Hellwig
2010-10-13 10:42 ` Eric Dumazet
2010-10-13 11:25 ` Christoph Hellwig
2 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-13 7:27 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Oct 13, 2010 at 06:20:58PM +1100, Nick Piggin wrote:
> On Fri, Oct 08, 2010 at 04:21:31PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > All the functionality that the inode_lock protected has now been
> > wrapped up in new independent locks and/or functionality. Hence the
> > inode_lock does not serve a purpose any longer and hence can now be
> > removed.
> >
> > Based on work originally done by Nick Piggin.
>
> Sorry about being offline for so long. I had some work finishing with
> SUSE and then took some vacation without much net access for several
> weeks :P
>
> Unfortunate timing that everybody is suddenly interested in the
> scalability work :) I didn't want to dump a lot of patches just
> before I went and not be able to support them if they were merged /
> respond to review in a timely way. But I still want to maintain my
> vfs-scale stack. I'm glad to see lots of interest in it now.
>
> So I would welcome criticism of that and hopefully fold improvements back.
> The problem I guess with taking the patches and reworking them a bit is
> just that I have lost a bit of context of what you're doing, and also
> it loses its verification within the entire series (i.e. the end goal
> of doing store-free path walking relies a bit on RCU inodes, for example),
> and I've done a lot of microbenchmarking.
I guess my point here is not that the inode scaling series must be
set in stone or that I'm not happy to debate changes. But what I would
like is to have that series working within the full vfs-scale branch
before merging pieces of it.
This is basically because there are some dependencies with the rcu walk,
for example, and also the need to measure the end-game performance of
the whole series and ensure that there are no showstopper regressions or
nasty surprises.
I'm happy to help to port the top of the patch set onto changes in
earlier parts of it, but I would like the chance to do this really. I'm
back in action now, so I can spend a lot of time catching up.
As I said, I'll continue catching up with all the threads and see where
it's up to.
Thanks,
Nick
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-13 7:27 ` Nick Piggin
@ 2010-10-13 11:28 ` Christoph Hellwig
2010-10-13 12:03 ` Nick Piggin
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-13 11:28 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Oct 13, 2010 at 06:27:01PM +1100, Nick Piggin wrote:
> I'm happy to help to port the top of the patch set onto changes in
> earlier parts of it, but I would like the chance to do this really. I'm
> back in action now, so I can spend a lot of time catching up.
That's all good and fine, but it's really no reason for delaying getting
the most important bits in. RCU path walk is all good and fine, and
I'm really looking forward to eventually seeing it. But the basic
inode_lock and dcache_lock splits are fundamental work we need rather
sooner than later. Additional candy on top of that is fine, but we'll
need a solid base. Also note that having the locking split up, and
proper exported APIs instead of groveling around in dcache internals in
various filesystems means that we can start to look into replacing
the global inode and dcache hashes much more easily, and having
global data structures at least for the dcache is almost as bad
as having global locks.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-13 11:28 ` Christoph Hellwig
@ 2010-10-13 12:03 ` Nick Piggin
2010-10-13 12:20 ` Christoph Hellwig
0 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-13 12:03 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Oct 13, 2010 at 07:28:27AM -0400, Christoph Hellwig wrote:
> On Wed, Oct 13, 2010 at 06:27:01PM +1100, Nick Piggin wrote:
> > I'm happy to help to port the top of the patch set onto changes in
> > earlier parts of it, but I would like the chance to do this really. I'm
> > back in action now, so I can spend a lot of time catching up.
>
> That's all good and fine, but it's really no reason for delaying getting
> the most important bits in.
Really? I *really* think I can be given the chance to review what's
happened, catch up, and make sure it's foward compatible with the
rest of my tree. The most important bits are, in fact, mostly my
patches anyway unless there is a fundamentally different approach to
take. And so either way I don't think it is ready for 2.6.37 if it
hasn't been in vfs for testing and review by fs people -- that's what
we agreed I thought for the inode and dcache lock splitups.
> RCU path walk is all good and fine, and
> I'm really looking forward to eventually seeing it. But the basic
> inode_lock and dcache_lock splits are fundamental work we need rather
> sooner than later.
Sure, and I'm glad you're agreeing with that now, I'm just saying I
need to catch up with it after taking a few weeks off. OK?
> Additional candy on top of that is fine, but we'll
> need a solid base. Also note that having the locking split up, and
> proper exported APIs instead of groveling around in dcache internals in
It's not really additional candy, but fundamental work for the
whole series. Linus agreed with that, so I need to ensure everything
will work properly.
> various filesystems means that we can start to look into replacing
> the global inode and dcache hashes much more easily, and having
> global data structures at least for the dcache is almost as bad
> as having global locks.
As I've repeated, I don't know if your assertion is true, and
definitely any fine-grained type of data structure will need to
show it is competitive with a fine-grained locked hash.
I would be very interested if there is a better data structure,
but it is hard to know, actually. I think it is a topic best
explored after the vfs-scale series goes in, but I think any kind
of per-directory tree might suffer too many cache misses on large
directory lookups; I don't know if marginal locality improvements
in dir-local workloads would outweigh that. A per-directory hash would
have to be resized a lot. Any other ideas?
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-13 12:03 ` Nick Piggin
@ 2010-10-13 12:20 ` Christoph Hellwig
2010-10-13 12:25 ` Nick Piggin
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-13 12:20 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Oct 13, 2010 at 11:03:07PM +1100, Nick Piggin wrote:
> On Wed, Oct 13, 2010 at 07:28:27AM -0400, Christoph Hellwig wrote:
> > On Wed, Oct 13, 2010 at 06:27:01PM +1100, Nick Piggin wrote:
> > > I'm happy to help to port the top of the patch set onto changes in
> > > earlier parts of it, but I would like the chance to do this really. I'm
> > > back in action now, so I can spend a lot of time catching up.
> >
> > That's all good and fine, but it's really no reason for delaying getting
> > the most important bits in.
>
> Really? I *really* think I can be given the chance to review what's
> happened, catch up, and make sure it's forward compatible with the
> rest of my tree.
Please go and review it, the more eyes core code gets the better. But
don't assume you have carte blanche to delay things again just for the
fun of it. Other patches in your tree will need at least as many
changes as the inode bits did, so they will need some major work anyway.
Holding the splitup now for things that will take at least another half
a year to hit the tree is rather pointless.
> The most important bits are, in fact, mostly my
> patches anyway unless there is a fundamentally different approach to
> take. And so either way I don't think it is ready for 2.6.37 if it
> hasn't been in vfs for testing and review by fs people -- that's what
> we agreed I thought for the inode and dcache lock splitups.
fs and vfs people have been reviewing the code for the last couple of
weeks, and we're almost done. Unless we find another issue it
should go into the vfs tree. Given that we don't even have a vfs
tree for .37 yet, there's no way we could have put it in earlier anyway.
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-13 12:20 ` Christoph Hellwig
@ 2010-10-13 12:25 ` Nick Piggin
0 siblings, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-13 12:25 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Oct 13, 2010 at 08:20:02AM -0400, Christoph Hellwig wrote:
> On Wed, Oct 13, 2010 at 11:03:07PM +1100, Nick Piggin wrote:
> > On Wed, Oct 13, 2010 at 07:28:27AM -0400, Christoph Hellwig wrote:
> > > On Wed, Oct 13, 2010 at 06:27:01PM +1100, Nick Piggin wrote:
> > > > I'm happy to help to port the top of the patch set onto changes in
> > > > earlier parts of it, but I would like the chance to do this really. I'm
> > > > back in action now, so I can spend a lot of time catching up.
> > >
> > > That's all good and fine, but it's really no reason for delaying getting
> > > the most important bits in.
> >
> > Really? I *really* think I can be given the chance to review what's
> > happened, catch up, and make sure it's forward compatible with the
> > rest of my tree.
>
> Please go and review it; the more eyes core code gets, the better. But
> don't assume you have carte blanche to delay things again just for the
> fun of it.
Heh, I'm not; I've been the one driving most of this, so it's not a big deal. I'm not interested in trying to delay it, trust me, but I think I
can be given some time to review it. It's taken much more than a couple
of weeks for me to get serious reviews...
> Other patches in your tree will need at least as many
> changes as the inode bits did, so they will need some major work anyway.
> Holding back the splitup now for things that will take at least another half
> a year to hit the tree is rather pointless.
>
> > The most important bits are, in fact, mostly my
> > patches anyway unless there is a fundamentally different approach to
> > take. And so either way I don't think it is ready for 2.6.37 if it
> > hasn't been in vfs for testing and review by fs people -- that's what I thought we agreed for the inode and dcache lock splitups.
>
> fs and vfs people have been reviewing the code for the last couple of
> weeks, and we're almost done.
Thanks very much, it looks productive. I'm not sure if I agree exactly
with everything but I'll catch up and get back to it.
> Unless we find another issue it
> should go into the vfs tree. Given that we don't even have a vfs
> tree for .37 yet, there's no way we could have put it in earlier anyway.
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-13 7:20 ` Nick Piggin
2010-10-13 7:27 ` Nick Piggin
@ 2010-10-13 10:42 ` Eric Dumazet
2010-10-13 12:07 ` Nick Piggin
2010-10-13 11:25 ` Christoph Hellwig
2 siblings, 1 reply; 162+ messages in thread
From: Eric Dumazet @ 2010-10-13 10:42 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Wednesday, 13 October 2010 at 18:20 +1100, Nick Piggin wrote:
> I don't think the patchset has suddenly become vastly more urgent
> in the past month, so I think my approach of having it get a lot
> of testing and go in Al's vfs tree for a while is best.
>
Hi Nick
Not vastly urgent, but highly wanted on many workloads, even ones not
really related to 'fs'...
In the current tree, a "close(socket())" takes 31 us on a 2x4x2 machine, versus 1.45 us single-threaded.
But yes, I agree a lot of testing is needed :)
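For reference, the shape of the microbenchmark being described is roughly the following (a hedged sketch of the test loop only; the actual harness, thread setup and timing are not shown and the details here are assumed):

	#include <sys/socket.h>
	#include <unistd.h>

	/* each thread runs this loop; the per-iteration cost is what grows
	 * from ~1.45 us single-threaded to ~31 us when all 16 CPUs hammer
	 * the global icache/dcache locks on socket setup and teardown */
	static void hammer(long iterations)
	{
		long i;

		for (i = 0; i < iterations; i++)
			close(socket(AF_INET, SOCK_STREAM, 0));
	}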
--------------------------------------------------------------------------------------------
PerfTop: 16584 irqs/sec kernel:99.8% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)
--------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ________________________ _______
101217.00 87.8% _raw_spin_lock vmlinux
1365.00 1.2% kmem_cache_alloc vmlinux
1220.00 1.1% d_alloc vmlinux
1159.00 1.0% sock_alloc_file vmlinux
881.00 0.8% __d_instantiate vmlinux
657.00 0.6% memset vmlinux
516.00 0.4% dput vmlinux
423.00 0.4% new_inode vmlinux
401.00 0.3% _atomic_dec_and_lock vmlinux
397.00 0.3% iput vmlinux
358.00 0.3% dentry_iput vmlinux
320.00 0.3% kmem_cache_free vmlinux
309.00 0.3% __slab_free vmlinux
297.00 0.3% __sk_free vmlinux
264.00 0.2% kmem_cache_alloc_notrace vmlinux
263.00 0.2% __call_rcu vmlinux
Thanks
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-13 10:42 ` Eric Dumazet
@ 2010-10-13 12:07 ` Nick Piggin
0 siblings, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-13 12:07 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Oct 13, 2010 at 12:42:36PM +0200, Eric Dumazet wrote:
> On Wednesday, 13 October 2010 at 18:20 +1100, Nick Piggin wrote:
>
> > I don't think the patchset has suddenly become vastly more urgent
> > in the past month, so I think my approach of having it get a lot
> > of testing and go in Al's vfs tree for a while is best.
> >
>
> Hi Nick
>
> Not vastly urgent, but highly wanted on many workloads, even ones not
> really related to 'fs'...
>
> In the current tree, a "close(socket())" takes 31 us on a 2x4x2 machine, versus 1.45 us single-threaded.
>
> But yes, I agree a lot of testing is needed :)
Hi Eric,
Yes of course I know you know about this :) And Google knows about it
too -- they of course posted the batched iput/dput patches a couple of
years back when they noticed it on their socket workloads. I've
extensively tested the socket paths during development of the patches,
and on a POWER7 system with many hundreds of threads, it scales
completely linearly!
I acknowledge that the vfs scale work is actually quite urgent, and probably at least a year overdue (2.6.32 would have been a nice target for distros). I just mean that it hasn't suddenly become so much more important that it must be pushed in now, before I review it or before it has had a chance in the vfs tree.
Thanks,
Nick
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-13 7:20 ` Nick Piggin
2010-10-13 7:27 ` Nick Piggin
2010-10-13 10:42 ` Eric Dumazet
@ 2010-10-13 11:25 ` Christoph Hellwig
2010-10-13 12:30 ` Nick Piggin
2 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-13 11:25 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Oct 13, 2010 at 06:20:58PM +1100, Nick Piggin wrote:
> Unfortunate timing that everybody is suddenly interested in the
> scalability work :)
People have been interested for a long time. It's just that we finally
made forward progress to get parts of it into shape for merging, which should have been done a long time ago.
> I might need a little more time than 2.6.37! But I'll try my best.
> I don't think the patchset has suddenly become vastly more urgent
> in the past month, so I think my approach of having it get a lot
> of testing and go in Al's vfs tree for a while is best.
It's always been quite urgent. Both inode_lock and dcache_lock
contention have been hurting us for a while.
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-13 11:25 ` Christoph Hellwig
@ 2010-10-13 12:30 ` Nick Piggin
2010-10-13 23:23 ` Dave Chinner
0 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-13 12:30 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Oct 13, 2010 at 07:25:52AM -0400, Christoph Hellwig wrote:
> On Wed, Oct 13, 2010 at 06:20:58PM +1100, Nick Piggin wrote:
> > Unfortunate timing that everybody is suddenly interested in the
> > scalability work :)
>
> People have been interested for a long time. It's just that we finally
> made forward progress to get parts of it into shape for merging, which should have been done a long time ago.
It has been pretty close, IMO; it just needed some more reviewers, which it has only just got, really.
> > I might need a little more time than 2.6.37! But I'll try my best.
> > I don't think the patchset has suddenly become vastly more urgent
> > in the past month, so I think my approach of having it get a lot
> > of testing and go in Al's vfs tree for a while is best.
>
> It's always been quite urgent. Both inode_lock and dcache_lock
> contention have been hurting us for a while.
Yes I know it has been urgent for several years, but it hasn't been
treated that way by vfs people until really so late that it is going to
hurt a lot for people deploying Linux now or using enterprise distros
etc. I had to really learn most of the code from scratch to get this far
and got quite little constructive review really until recently. Anyway, there's nothing to be done about that now. I don't think there should be
any significant delay in me catching up, really.
Thanks,
Nick
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-13 12:30 ` Nick Piggin
@ 2010-10-13 23:23 ` Dave Chinner
2010-10-14 9:06 ` Nick Piggin
0 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-13 23:23 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
On Wed, Oct 13, 2010 at 11:30:08PM +1100, Nick Piggin wrote:
> On Wed, Oct 13, 2010 at 07:25:52AM -0400, Christoph Hellwig wrote:
> > On Wed, Oct 13, 2010 at 06:20:58PM +1100, Nick Piggin wrote:
> > > Unfortunate timing that everybody is suddenly interested in the
> > > scalability work :)
> >
> > People have been interested for a long time. It's just that we finally
> > made forward progress to get parts of it into shape for merging, which should have been done a long time ago.
>
> It has been pretty close, IMO; it just needed some more reviewers, which it has only just got, really.
Going back a couple of weeks, it seemed as far away from inclusion
as when you first posted the series - there had been no substantial review and nobody wanting to review it in its current form. Then I'd heard you were travelling indefinitely...
There is stuff in the vfs-scale tree that is somewhat controversial
and had not been discussed satisfactorily - the lock ordering
(resulting in trylocks everywhere), the shrinker API change, the
writeback LRU changes, the zone reclaim changes, etc - and some of
them even have alternative proposals for fixing the algorithmic
deficiencies. Nobody was going to review or accept that as one big
lump.
There's been review now because I went and did what the potential
reviewers were asking for - break the series into smaller, more
easily reviewable and verifiable chunks. As a result, I think we're
close to the end of the review cycle for the inode_lock breakup now.
I think the code is now much cleaner and more maintainable than what I
originally pulled from the vfs-scale tree, and it still provides the
same gains and ability to be converted to RCU algorithms in the
future.
Hence, IMO, the current vfs-scale tree needs to be treated as a
prototype, not as a finished product. It demonstrates the path
we need to follow to move forward, as well as the gains we will get
as we move in that direction, but the code in that tree is not
guaranteed a free pass into the mainline tree.
> etc. I had to really learn most of the code from scratch to get this far
> and got quite little constructive review really until recently.
Which seems to be another good reason for treating the tree as prototype
code.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-13 23:23 ` Dave Chinner
@ 2010-10-14 9:06 ` Nick Piggin
2010-10-14 9:13 ` Nick Piggin
2010-10-14 14:41 ` Christoph Hellwig
0 siblings, 2 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-14 9:06 UTC (permalink / raw)
To: Dave Chinner; +Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel
On Thu, Oct 14, 2010 at 10:23:19AM +1100, Dave Chinner wrote:
> On Wed, Oct 13, 2010 at 11:30:08PM +1100, Nick Piggin wrote:
> > On Wed, Oct 13, 2010 at 07:25:52AM -0400, Christoph Hellwig wrote:
> > > On Wed, Oct 13, 2010 at 06:20:58PM +1100, Nick Piggin wrote:
> > > > Unfortunate timing that everybody is suddenly interested in the
> > > > scalability work :)
> > >
> > > People have been interested for a long time. It's just that we finally
> > > made forward progress to get parts of it into shape for merging, which should have been done a long time ago.
> >
> > It has been pretty close, IMO; it just needed some more reviewers, which it has only just got, really.
>
> Going back a couple of weeks, it seemed as far away from inclusion
> as when you first posted the series - there had been no substantial review and nobody wanting to review it in its current form. Then I'd heard you were travelling indefinitely...
Well it was a few weeks. Not great timing, but as I said, I didn't
want to dump the latest patches and then disappear.
There were actually several (Google and Intel) people testing things.
Last I heard from you (from the first constructive review I had), you were skeptical about the performance and scalability required and thought it would be a better idea to reduce lock widths and things incrementally (which I believe will lead to much more overall churn and confusion when maintaining between different kernel versions).
Anyway, no real harm done and I thank you for picking things up, so
I'm still working through what you've done atm.
>
> There is stuff in the vfs-scale tree that is somewhat controversial
> and had not been discussed satisfactorily - the lock ordering
> (resulting in trylocks everywhere), the shrinker API change, the
> writeback LRU changes, the zone reclaim changes, etc - and some of
> them even have alternative proposals for fixing the algorithmic
> deficiencies. Nobody was going to review or accept that as one big
> lump.
I think what is absolutely needed is a final(ish) "lump", so we can
actually see what we're working towards and whether it is the right
approach.
Shrinker and zone reclaim is definitely needed. It is needed for NUMA
scalability and locality of reclaim, and also for container and directed
dentry/inode reclaim. Google have a very similar patch and they've said
this is needed (and I already know it is needed for scalability on
large NUMA -- SGI were complaining about this nearly 5 years ago IIRC).
So that is _definitely_ going to be needed.
Store-free path walking is definitely needed, so we need to do RCU inodes.
With RCU inodes, the optimal locking protocols change quite a bit --
trylocks are a side effect of my conservative approach to establishing
a lock order, and making incremental changes in locking protocol which
are supposed to be easy to follow and verify. Then comes optimisation
of the locking after the "correctness" part of it. It can be taken even
further and *most* of the icache trylocks removed using RCU on a few
more of the lists.
Now I don't want to necessarily merge it all at once, but I do need an
overview of it, most certainly.
A few trylocks confined to core inode handling (and the other inode
locks are not exposed to filesystems at all -- unlike inode_lock today)
is not what I'd call a maintenance nightmare. In fact, IMO it is
cleaner locking especially for filesystems than today.
With a structure like the icache, which can be accessed downwards, via the cache management structures, or upwards, via references to inodes, trylocks are not unusual unless RCU is extensively used. As long as
I'm careful, I don't see a problem at all -- XFS similarly has some
trylocks.
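To make the trade-off concrete, the kind of trylock being talked about shows up when a list walker holds a list lock that normally nests inside i_lock, roughly as below (a hedged sketch of the general pattern, not code lifted from the series; the lock, list and field names such as inode_lru and i_lru are illustrative):

	#include <linux/fs.h>

	static LIST_HEAD(inode_lru);
	static DEFINE_SPINLOCK(inode_lru_lock);

	static void prune_icache_example(int nr_to_scan)
	{
		struct inode *inode, *next;

		/*
		 * The list lock is held here, but i_lock is normally taken
		 * first on the reference ("up") side, so taking it now is out
		 * of order and must be a trylock that backs off.
		 */
		spin_lock(&inode_lru_lock);
		list_for_each_entry_safe(inode, next, &inode_lru, i_lru) {
			if (!spin_trylock(&inode->i_lock)) {
				/* contended out of order: rotate it and move on */
				list_move_tail(&inode->i_lru, &inode_lru);
				continue;
			}
			/* ... check i_state/refcount, maybe move to a dispose list ... */
			spin_unlock(&inode->i_lock);
			if (--nr_to_scan == 0)
				break;
		}
		spin_unlock(&inode_lru_lock);
	}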
> There's been review now because I went and did what the potential
> reviewers were asking for - break the series into smaller, more
> easily reviewable and verifiable chunks. As a result, I think we're
> close to the end of the review cycle for the inode_lock breakup now.
> I think the code is now much cleaner and more maintainable than what I
> originally pulled from the vfs-scale tree, and it still provides the
> same gains and ability to be converted to RCU algorithms in the
> future.
>
> Hence, IMO, the current vfs-scale tree needs to be treated as a
> prototype, not as a finished product. It demonstrates the path
> we need to follow to move forward, as well as the gains we will get
> as we move in that direction, but the code in that tree is not
> guaranteed a free pass into the mainline tree.
It is really pretty close, and while *you* have some disagreements,
it has had some reviews from other people (including Linus) who actually
agree with most of it and agree that scalability is needed.
I am fine with continuing to take suggestions that I agree with, and
integrating them into the tree and begin to start pushing chunks out
the bottom of the stack, but I would like to keep things together in
my upstream tree.
> > etc. I had to really learn most of the code from scratch to get this far
> > and got quite little constructive review really until recently.
>
> Which seems to be another good reason for treating the tree as prototype
> code.
It's much past a prototype. While the patches need some more cleanup
and review still, the final end result gives a tree with almost no
global cachelines in the entire vfs, including path walking. Things
like path walks are nearly 50% faster single threaded, and perfectly
scalable. Linus actually wants the store-free path walk stuff
_before_ any of the other things, if that gives you an idea of where
other people are putting the priority of the patches.
Thanks,
Nick
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-14 9:06 ` Nick Piggin
@ 2010-10-14 9:13 ` Nick Piggin
2010-10-14 14:41 ` Christoph Hellwig
1 sibling, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-14 9:13 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-kernel
On Thu, Oct 14, 2010 at 08:06:09PM +1100, Nick Piggin wrote:
> On Thu, Oct 14, 2010 at 10:23:19AM +1100, Dave Chinner wrote:
> Shrinker and zone reclaim is definitely needed. It is needed for NUMA
> scalability and locality of reclaim, and also for container and directed
> dentry/inode reclaim. Google have a very similar patch and they've said
> this is needed (and I already know it is needed for scalability on
> large NUMA -- SGI were complaining about this nearly 5 years ago IIRC).
> So that is _definitely_ going to be needed.
>
> Store-free path walking is definitely needed, so we need to do RCU inodes.
> With RCU inodes, the optimal locking protocols change quite a bit --
[...]
> It's much past a prototype. While the patches need some more cleanup
> and review still, the final end result gives a tree with almost no
> global cachelines in the entire vfs, including path walking. Things
> like path walks are nearly 50% faster single threaded, and perfectly
> scalable. Linus actually wants the store-free path walk stuff
> _before_ any of the other things, if that gives you an idea of where
> other people are putting the priority of the patches.
With this said, I think you're probably not quite aware of the bigger picture with the vfs-scale series. Yes, it will be important to help your XFS inode contention, but there are many other people with other problems, and the series contains other big improvements that will benefit desktops and more common workloads than big-IO ones.
So yes I'll definitely keep the vfs-scale series together. Most of the
inode scaling work is at the bottom of it and should be able to go in
soon. But for example, the inode RCU work is going to go in -- Linus
has acked my strategy for it (and plan for mitigating/avoiding possible
regressions if needed). So with that, it makes more sense to design
the locking with the RCU available.
If we _know_ it will be needed in future anyway, it doesn't make sense to take a different non-RCU approach, and then rework that again down the line IMO. That just gives a larger burden of locking models that need
to be supported/debugged.
Ditto for other things like per zone locking.
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-14 9:06 ` Nick Piggin
2010-10-14 9:13 ` Nick Piggin
@ 2010-10-14 14:41 ` Christoph Hellwig
2010-10-15 0:14 ` Nick Piggin
2010-10-15 4:04 ` Nick Piggin
1 sibling, 2 replies; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-14 14:41 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-kernel
On Thu, Oct 14, 2010 at 08:06:09PM +1100, Nick Piggin wrote:
> Shrinker and zone reclaim is definitely needed. It is needed for NUMA
> scalability and locality of reclaim, and also for container and directed
> dentry/inode reclaim. Google have a very similar patch and they've said
> this is needed (and I already know it is needed for scalability on
> large NUMA -- SGI were complaining about this nearly 5 years ago IIRC).
> So that is _definitely_ going to be needed.
I'm still not sold on the per-zone shrinkers. For one, per-zone is a really weird concept. Per-node might make a lot more sense, but what we really need for doing things usefully is per-sb. If that's not scalable we might have to go for sb x zone.
Either way it's not needed for a lot of workloads, and it's very controversial. Trying to push it through with a take-all-or-nothing attitude is not helpful, and your constant insistence on it is probably the biggest factor in delaying all this work for so long.
Someone is going to do VFS scaling in pieces, and if you're not willing to help it's going to be someone else. We'll still build on top of your great initial work, though.
> Store-free path walking is definitely needed, so we need to do RCU inodes.
> With RCU inodes, the optimal locking protocols change quite a bit --
I don't think anyone disagrees with that. How we do the RCU locking
in detail is however still open. I'd for example really like to see
inodes use slab rcu freeing from the beginning.
> It is really pretty close, and while *you* have some disagreements,
> it has had some reviews from other people (including Linus) who actually
> agree with most of it and agree that scalability is needed.
Again, I've not seen anyone arguing against the scalability. But as
you might have noticed there's some very different opinions on how
to go there.
> It's much past a prototype. While the patches need some more cleanup
> and review still, the final end result gives a tree with almost no
> global cachelines in the entire vfs, including path walking.
It's a nice prototype, no disagreement. But we'll need to change a lot of the VFS things as we go to do things properly.
> Things
> like path walks are nearly 50% faster single threaded, and perfectly
> scalable. Linus actually wants the store-free path walk stuff
> _before_ any of the other things, if that gives you an idea of where
> other people are putting the priority of the patches.
Different people have different priorities. In the end the person
doing the work of actually getting it in a mergeable shape is setting
the pace. If you had started splitting out the RCU pathwalk bits half a
year ago we'd already have it in by now. But that's not how it
worked.
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-14 14:41 ` Christoph Hellwig
@ 2010-10-15 0:14 ` Nick Piggin
2010-10-15 3:13 ` Dave Chinner
2010-10-15 4:04 ` Nick Piggin
1 sibling, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 0:14 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Thu, Oct 14, 2010 at 10:41:59AM -0400, Christoph Hellwig wrote:
> On Thu, Oct 14, 2010 at 08:06:09PM +1100, Nick Piggin wrote:
> > Shrinker and zone reclaim is definitely needed. It is needed for NUMA
> > scalability and locality of reclaim, and also for container and directed
> > dentry/inode reclaim. Google have a very similar patch and they've said
> > this is needed (and I already know it is needed for scalability on
> > large NUMA -- SGI were complaining about this nearly 5 years ago IIRC).
> > So that is _definitely_ going to be needed.
>
> I'm still not sold on the per-zone shrinkers. For one, per-zone is a really weird concept. Per-node might make a lot more sense, but what we really need for doing things usefully is per-sb. If that's not scalable we might have to go for sb x zone.
Well I don't know what it means that you're "not sold" on them, and
then come up with ridiculous things like per-node might make a lot
more sense, or per-sb; and that per-zone is a really weird concept.
Per-zone is the right way to drive reclaim, and it will allow locality
to work properly, as well as zone reclaim and zone targeted shortages
and policies, and it will also give good scalability. People need it for
all these reasons.
If you're not "sold" on something, you don't have carte blanche power
to obstruct and delay it either, you have to voice reasonable objections
and be prepared to be shown you're wrong. Saying they're "weird" is just
not productive.
> Either way it's not needed for a lot of workloads, and it's very
> controversial. Trying to beat it though with a take all or everthing
> attitude is not helpful, and your constant insistance on it is probably
> the biggest factor for delaying all this work so long.
How is it very controversial? The only controversy came from Dave, when
he a) totally misunderstood how zone based reclaim works, and b)
objected about lazy LRU.
As far as b) went, I said it was a valid point and that I can change a
few lines so it is no longer lazy for the first merge, or we can just
merge it and change it back if needed. I didn't hear back about that at
the time when I replied, so it appeared that the answer satisfied him.
>
> Someone is going to do VFS scaling in pieces and if you're not willing
> to help it's going to be someone else. We'll still build ontop of your
> great initial work, though.
>
> > Store-free path walking is definitely needed, so we need to do RCU inodes.
> > With RCU inodes, the optimal locking protocols change quite a bit --
>
> I don't think anyone disagrees with that. How we do the RCU locking
> in detail is however still open.
Well I didn't think it was, because I have the _entire_ stack here and
working and so we can see exactly how it is working and how it all fits
together. Surely you agree it is better to have the end goal visible,
testable, and reviewable, even if pieces are merged one at a time?
> I'd for example really like to see
> inodes use slab rcu freeing from the beginning.
I have gone over this in a couple of threads (with Linus and Dave
actually). Outside a microbenchmark, I haven't yet found
a real workload that suffered from it; due to how it works, it hopefully
won't be too common to find one; and I also sketched a design for slab
RCU freeing that adds a little more complexity but can be used to fix
regressions if it is needed.
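For anyone who missed those threads, the two freeing schemes being compared contrast roughly as below (an illustrative sketch only, not code from either tree; the i_rcu field is an assumed addition to struct inode, and inode_cachep/init_once mirror the names used in fs/inode.c):

	#include <linux/fs.h>
	#include <linux/slab.h>

	static struct kmem_cache *inode_cachep;

	/*
	 * Plain RCU freeing: defer the kmem_cache free until a grace period
	 * has elapsed, so lock-free lookups never see the memory reused.
	 */
	static void inode_free_rcu(struct rcu_head *head)
	{
		struct inode *inode = container_of(head, struct inode, i_rcu);

		kmem_cache_free(inode_cachep, inode);
	}
	/* the destroy path then does: call_rcu(&inode->i_rcu, inode_free_rcu); */

	/*
	 * Slab RCU freeing instead: create the cache with SLAB_DESTROY_BY_RCU,
	 * e.g.  kmem_cache_create("inode_cache", sizeof(struct inode), 0,
	 *                         SLAB_DESTROY_BY_RCU, init_once);
	 * Objects may then be reused immediately (only whole slab pages are
	 * RCU-deferred), so lock-free lookups must re-validate the inode
	 * (re-check the hash key under i_lock) after finding it.
	 */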
We have gone over all this and it is not "controversial" any more,
unless you actually have a reasonable objection other than handwaving
or bringing up red herrings as you keep doing (like replacing the global
hashes).
>
> > It is really pretty close, and while *you* have some disagreements,
> > it has had some reviews from other people (including Linus) who actually
> > agree with most of it and agree that scalability is needed.
>
> Again, I've not seen anyone arguing against the scalability. But as
Well you have been.
"Either way [per zone shrinker patch] is not needed for a lot of workloads,
and it's very controversial."
If you spent 5 minutes looking at the VM reclaim design, you would know
that per-zone reclaim is actually the natural and correct way to do it,
for a large range of reasons. In fact, a global shrinker or a per-node
shrinker is far weirder by comparison.
> you might have noticed there's some very different opinions on how
> to go there.
Very different opinions? Outside obstructionist handwaving? Well then
don't you agree we should sort out those opinions and get a good idea
of the overall picture of where we will end up?
> > It's much past a prototype. While the patches need some more cleanup
> > and review still, the final end result gives a tree with almost no
> > global cachelines in the entire vfs, including path walking.
>
> It's a nice prototype, no disagreement. But we'll need to change a lot of the VFS things as we go to do things properly.
I don't see that much has changed so far, just omitted. What was omitted will need to be changed again, doubly, to get where we need to be.
> > Things
> > like path walks are nearly 50% faster single threaded, and perfectly
> > scalable. Linus actually wants the store-free path walk stuff
> > _before_ any of the other things, if that gives you an idea of where
> > other people are putting the priority of the patches.
>
> Different people have different priorities. In the end the person
> doing the work of actually getting it in a mergeable shape is setting
> the pace. If you had started splitting out the RCU pathwalk bits half a
> year ago we'd already have it in by now. But that's not how it
> worked.
No way, it would not have. It wasn't even finished and tested to a degree
that it could be posted for serious review by someone like Al and Linus
until a few months ago.
But you missed my point about that. My point is that we _know_ that
store free path walks are going to be merged, it is one of the more
desirable pieces of the series. So we _know_ RCU inodes are needed, so
we can happily use RCU work earlier in the series to make locking
better in the icache.
If you handwave and say "Oh but RCU is controversial and it's disputed
blah blah so let's not do it yet", then you are being obstructionist,
or haven't kept up with the big picture.
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 0:14 ` Nick Piggin
@ 2010-10-15 3:13 ` Dave Chinner
2010-10-15 3:30 ` Nick Piggin
0 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-15 3:13 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 15, 2010 at 11:14:43AM +1100, Nick Piggin wrote:
> On Thu, Oct 14, 2010 at 10:41:59AM -0400, Christoph Hellwig wrote:
> > On Thu, Oct 14, 2010 at 08:06:09PM +1100, Nick Piggin wrote:
> > > Shrinker and zone reclaim is definitely needed. It is needed for NUMA
> > > scalability and locality of reclaim, and also for container and directed
> > > dentry/inode reclaim. Google have a very similar patch and they've said
> > > this is needed (and I already know it is needed for scalability on
> > > large NUMA -- SGI were complaining about this nearly 5 years ago IIRC).
> > > So that is _definitely_ going to be needed.
> >
> > I'm still not sold on the per-zone shrinkers. For one, per-zone is a really weird concept. Per-node might make a lot more sense, but what we really need for doing things usefully is per-sb. If that's not scalable we might have to go for sb x zone.
>
> Well I don't know what it means that you're "not sold" on them, and
> then come up with ridiculous things like per-node might make a lot
> more sense, or per-sb; and that per-zone is a really weird concept.
>
> Per-zone is the right way to drive reclaim, and it will allow locality
> to work properly, as well as zone reclaim and zone targeted shortages
> and policies, and it will also give good scalability. People need it for
> all these reasons.
I don't have enough information to be able to say what is the
correct way to improve the shrinkers, but I do have plenty of
information on how the current unbound reclaim parallelism is an
utter PITA to handle. Indeed, it was partially responsible for the
recent kernel.org outage. Hence I really don't like anything that
potentially increases reclaim parallelism until there's been
discussion and work towards fixing these problems first.
Besides, IMO, we don't need to rework shrinkers, add zone-based
reclaim, use per-cpu inode lists, etc. to enable store-free path
walking. You've shown it can be done, and that's great - it shows
us the impact of making those changes, but they need to be analysed
separately and treated on their own merits, not lumped with core
locking changes necessary for store-free path walking.
We know what you think, but you have to let everyone else form their
own opinions and then be convinced by code or discussion that your
way is the right way to do it. This requires your tree to be broken
down into its component pieces so that we mere mortals can understand and test the impact that each separate set of changes has
on the system. It makes it easier to review, identify regressions,
etc but it should not prevent us from reaching the end goal.
> But you missed my point about that. My point is that we _know_ that
> store free path walks are going to be merged, it is one of the more
> desirable pieces of the series. So we _know_ RCU inodes are needed, so
> we can happily use RCU work earlier in the series to make locking
> better in the icache.
We've still got to do all the lock splitting work so we can update
everything without contention. It doesn't matter if that is done
before or after adding RCU - the end result _should_ be the same.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 3:13 ` Dave Chinner
@ 2010-10-15 3:30 ` Nick Piggin
2010-10-15 3:44 ` Nick Piggin
0 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 3:30 UTC (permalink / raw)
To: Dave Chinner; +Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 15, 2010 at 02:13:43PM +1100, Dave Chinner wrote:
> On Fri, Oct 15, 2010 at 11:14:43AM +1100, Nick Piggin wrote:
> > On Thu, Oct 14, 2010 at 10:41:59AM -0400, Christoph Hellwig wrote:
> > > On Thu, Oct 14, 2010 at 08:06:09PM +1100, Nick Piggin wrote:
> > > > Shrinker and zone reclaim is definitely needed. It is needed for NUMA
> > > > scalability and locality of reclaim, and also for container and directed
> > > > dentry/inode reclaim. Google have a very similar patch and they've said
> > > > this is needed (and I already know it is needed for scalability on
> > > > large NUMA -- SGI were complaining about this nearly 5 years ago IIRC).
> > > > So that is _definitely_ going to be needed.
> > >
> > > I'm still not sold on the per-zone shrinkers. For one, per-zone is a really weird concept. Per-node might make a lot more sense, but what we really need for doing things usefully is per-sb. If that's not scalable we might have to go for sb x zone.
> >
> > Well I don't know what it means that you're "not sold" on them, and
> > then come up with ridiculous things like per-node might make a lot
> > more sense, or per-sb; and that per-zone is a really weird concept.
> >
> > Per-zone is the right way to drive reclaim, and it will allow locality
> > to work properly, as well as zone reclaim and zone targeted shortages
> > and policies, and it will also give good scalability. People need it for
> > all these reasons.
>
> I don't have enough information to be able to say what is the
> correct way to improve the shrinkers, but I do have plenty of
I am a "VM guy", I'm telling you, this is the right way to do shrinkers.
> information on how the current unbound reclaim parallelism is an
> utter PITA to handle. Indeed, it was partially responsible for the
> recent kernel.org outage. Hence I really don't like anything that
> potentially increases reclaim parallelism until there's been
> discussion and work towards fixing these problems first.
There are two different issues. The first is unbounded entry to the reclaim paths by threads. The second is the scalability of the reclaim paths.
The second problem is really the problem -- if it were totally solved,
then the first would not be a problem by definition. It's unlikely to
be totally solved, so one way to improve things is to limit
parallelism in reclaim, but that's a totally different side of the
coin and doesn't affect what is the right thing to do with shrinkers.
But in fact, if we improve the second problem, then the first becomes
less of a problem as well. Having a global spinlock protecting _all_
reclaim for each cache makes the lock contention and costs for reclaim
much worse, so more threads get piled up in reclaim.
I _never_ would have thought I would hear that we should be careful
about making things more scalable because of all the scary threads
trying to run the code :)
Yes, it is likely that problems will often just get pushed into the filesystems, but at least now they will be visible there and be able to
be solved incrementally.
> Besides, IMO, we don't need to rework shrinkers, add zone-based
> reclaim, use per-cpu inode lists, etc. to enable store-free path
> walking.
But we need it for proper zone aware reclaim, improving efficiency
and scalability of reclaim on NUMA systems, and as a step towards being able to control the memory properly. As I said, Google have
a very similar patch (minus the fine grained locking) that they
need to make reclaim work properly.
"In your opinion" is fine, but please be prepared to change your
opinion when I tell you that it is needed.
> You've shown it can be done, and that's great - it shows
> us the impact of making those changes, but they need to be analysed
> separately and treated on their own merits, not lumped with core
> locking changes necessary for store-free path walking.
Actually I didn't see anyone else object to doing this. Everybody
else it seems acknowledges that it needs to be done, and it gets
done naturally as a side effect of fine grained locking.
> We know what you think, but you have to let everyone else form their
> own opinions and then be convinced by code or discussion that your
> way is the right way to do it. This requires your tree to be broken
> down into its component pieces so that we mere mortals can
> understand and test the impact that each separate set of changes has
> on the system. It makes it easier to review, identify regressions,
> etc but it should not prevent us from reaching the end goal.
Of course, and I never dispute that it should be. It is actually in
my tree in pretty well reviewable pieces. The _really_ important part
is that there is also an end-goal if anybody is concerned about the
direction it is going in or the performance at the end of the day.
>
> > But you missed my point about that. My point is that we _know_ that
> > store free path walks are going to be merged, it is one of the more
> > desirable pieces of the series. So we _know_ RCU inodes are needed, so
> > we can happily use RCU work earlier in the series to make locking
> > better in the icache.
>
> We've still got to do all the lock splitting work so we can update
> everything without contention. It doesn't matter if that is done
> before or after adding RCU - the end result _should_ be the same.
Right, but doing it with an eye to the end result gives less churn,
fewer releases with different locking models that have to be supported
and maintained, and my tree doesn't get wrecked.
If you agree that it doesn't matter whether it is done before or after,
then I prefer to keep my tree intact. Thanks.
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 3:30 ` Nick Piggin
@ 2010-10-15 3:44 ` Nick Piggin
2010-10-15 6:41 ` Nick Piggin
0 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 3:44 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 15, 2010 at 02:30:17PM +1100, Nick Piggin wrote:
> On Fri, Oct 15, 2010 at 02:13:43PM +1100, Dave Chinner wrote:
> > You've shown it can be done, and that's great - it shows
> > us the impact of making those changes, but they need to be analysed
> > separately and treated on their own merits, not lumped with core
> > locking changes necessary for store-free path walking.
>
> Actually I didn't see anyone else object to doing this. Everybody
> else it seems acknowledges that it needs to be done, and it gets
> done naturally as a side effect of fine grained locking.
Let's just get back to this part, which seems to be the one you have the most issues with, maybe?
You're objecting to per-zone locks and per-zone LRUs for inode and
dcache?
Well I have told you why per-zone LRUs are needed, I can expand on
any of the reasons if that is unclear. Per-zone locks I think come
naturally at the same time and they will expose some fs bottlenecks,
but that is simply how scalability development works.
So, do you object to per-zone LRUs in particular, or per-zone locks?
(Ie. the potentially changed reclaim pattern, or the increased
parallelism).
When you looked at this initially, you didn't understand how
reclaim works. It will not fill up a zone with inodes and then start
reclaiming all those inodes, leaving other nodes empty (unless that
is how you configure the machine, but it isn't the default). It
fills up inodes from all nodes (same as today) and it will start
reclaiming from all nodes at about the same pressure when there is
a shortage.
Reclaim basically approximates LRU by scanning a little from the top
of each LRU. When you have many thousands of objects, and reclaim is
a really fallible and dumb process anyway, then the perturbation of
the reclaim pattern doesn't matter much. Our zone based page reclaim
works exactly the same way.
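Concretely, the arrangement being argued for amounts to something like the following (an assumed sketch of the shape only, not the actual patch; structure, field names and placement are illustrative):

	#include <linux/spinlock.h>
	#include <linux/list.h>

	/* one of these per zone instead of a single global LRU + lock */
	struct inode_lru {
		spinlock_t		lock;		/* per-zone, not global */
		struct list_head	list;		/* unused inodes in this zone */
		long			nr_items;	/* feeds the shrinker's scan count */
	};
	/*
	 * The shrinker is then invoked for a particular zone under that
	 * zone's memory pressure, and only scans a little from the top of
	 * that zone's list, under that zone's lock.
	 */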
I don't think you can possibly be arguing against more scalable
locking in reclaim, so perhaps you are also worried about increased
parallelism in the filesystem callbacks from reclaim? I really can't
see this being a big problem, any more than any other increased
parallelism on fses or other subsystems caused by scaling vfs.
There might be some interesting issues with different locking
designs being hit in different ways, but really we can't stop
progress and test all loads on all filesystems. The way forward is
to fix the bottleneck in the filesystem, or, if the filesystem sucks so badly it can't handle it, just put a lock in there and not penalise others.
It's not like I haven't tested it, I've spent the better part of
the past year testing things. The I_FREEING batching stuff is one
example where I found and fixed a small problem exposed by the
reclaim changes.
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 3:44 ` Nick Piggin
@ 2010-10-15 6:41 ` Nick Piggin
2010-10-15 10:59 ` Dave Chinner
0 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 6:41 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 15, 2010 at 02:44:51PM +1100, Nick Piggin wrote:
> On Fri, Oct 15, 2010 at 02:30:17PM +1100, Nick Piggin wrote:
> > On Fri, Oct 15, 2010 at 02:13:43PM +1100, Dave Chinner wrote:
> > > You've shown it can be done, and that's great - it shows
> > > us the impact of making those changes, but they need to be analysed
> > > separately and treated on their own merits, not lumped with core
> > > locking changes necessary for store-free path walking.
> >
> > Actually I didn't see anyone else object to doing this. Everybody
> > else it seems acknowledges that it needs to be done, and it gets
> > done naturally as a side effect of fine grained locking.
>
> Let's just get back to this part, which seems to be the one you have the most issues with, maybe?
[snip per-zone lru/locking]
And as far as the rest of the work goes, I much prefer to come
to a basic consensus about the overall design of the entire vfs
scale work, and then focus on the exact implementation and patch
series details. When there is a consensus, I think it makes much
more sense to merge it in quite large chunks.
Ie. all of the inode locking, then all of the dcache locking.
I do not want to just cherry pick things here and there and
leave the others because your particular workload doesn't care
about them, you haven't reviewed them yet, etc. Because that
just gets my plan into a mess.
I'm perfectly fine to change the design, drop some aspects of
it, etc. _if_ it is decided that we don't want them with reasonable
arguments and agreement among everybody. On the other hand I prefer
not to just merge a few bits and leave others out because we
_don't_ have a consensus about one aspect or another.
So if you don't agree with something, let's work out why not and
try to come to an agreement, rather than pushing bits and pieces
that you do happen to agree with.
You're worried about mere mortals reviewing and understanding it...
I don't really know. If you understand inode locking today, you
can understand the inode scaling series quite easily. Ditto for
dcache. rcu-walk path walking is trickier, but it is described in
detail in documentation and changelog.
And you can understand the high level approach without exactly
digesting every detail at once. The inode locking work goes to
break up all global locks:
- a single inode object is protected (to the same level as
inode_lock) with i_lock. This makes it really trivial for
filesystems to lock down the object without taking a global
lock.
- inode hash rcuified and insertion/removal made per-bucket
- inode lru lists and locking made per-zone
- inode sb list made per-sb, per-cpu
- inode counters made per-cpu
- inode io lists and locking made per-bdi
So from the highest level snapshot, this is not rocket science.
And the way I've structured the patches, you can take almost
any of the above points and go look in the patch series to see
how it is implemented.
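As one concrete example of how small each of those pieces is, the counter conversion amounts to roughly this (a hedged sketch using the generic percpu_counter API; the actual patches may structure it differently):

	#include <linux/percpu_counter.h>

	static struct percpu_counter nr_inodes;	/* percpu_counter_init() at icache init */

	/* cheap, contention-free updates on the inode alloc/free paths */
	static inline void inodes_stat_inc(void)
	{
		percpu_counter_inc(&nr_inodes);
	}

	static inline void inodes_stat_dec(void)
	{
		percpu_counter_dec(&nr_inodes);
	}

	/* the global value is only folded up when somebody reads it,
	 * e.g. for /proc/sys/fs/inode-nr */
	static inline s64 get_nr_inodes(void)
	{
		return percpu_counter_sum_positive(&nr_inodes);
	}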
Is this where we want to go? My argument is yes, and I have
been gradually gathering real results and agreement from others.
I've demonstrated performance improvements, although many of them can
only be actually achieved when dcache, vfsmount, files lock etc scaling
is also implemented, which is another reason why it is so important to
keep everything together. And it's actually not always trivial to just
take a single change and document a performance improvement in
isolation.
But you can use your brain with scalability work, and if you're not
convinced about a particular patch, you can now actually take the
full series and revert a single patch (or add an artificial lock in
there to demonstrate the scalability overhead).
What I have done in the series is required to get almost linear
scalability up to the largest POWER7 system IBM has internally on
almost all important basic vfs operations. It should scale to the
largest UV systems from SGI. And it should scale on -rt. Put a
global lock in the inode LRU creation/destruction/touch/reclaim path,
and scalability is going to go to hell on these workloads on large
systems again.
And large isn't even that large these days. You can see these
problems clear as day on 2 and 4 socket small servers and we know
it is going to get worse for at least a few more doublings of
core count.
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 6:41 ` Nick Piggin
@ 2010-10-15 10:59 ` Dave Chinner
2010-10-15 13:03 ` Nick Piggin
2010-10-15 20:50 ` Nick Piggin
0 siblings, 2 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-15 10:59 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 15, 2010 at 05:41:50PM +1100, Nick Piggin wrote:
> You're worried about mere mortals reviewing and understanding it...
> I don't really know. If you understand inode locking today, you
> can understand the inode scaling series quite easily. Ditto for
> dcache. rcu-walk path walking is trickier, but it is described in
> detail in documentation and changelog.
>
> And you can understand the high level approach without exactly
> digesting every detail at once. The inode locking work goes to
> break up all global locks:
>
> - a single inode object is protected (to the same level as
> inode_lock) with i_lock. This makes it really trivial for
> filesystems to lock down the object without taking a global
> lock.
Which is unnecessarily wide, and results in i_lock having to have
list locks nested inside it, and that leads to the lock
inversion try-lock mess that several people have complained about.
My series uses i_lock only to protect i_state and i_ref. It does not
need to protect any more of the inode than that as other locks
protect the other list fields. As a result, it's still the innermost
lock and there are no trylocks in the code at all.
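In other words, with i_lock narrowed like that the list locks sit outside it and the same LRU walk needs no trylocks at all, roughly as below (an assumed sketch of the ordering being described, not code from the posted series; i_ref and i_lru are the field names implied above, and the list/lock names are illustrative):

	#include <linux/fs.h>

	static LIST_HEAD(inode_lru);
	static DEFINE_SPINLOCK(inode_lru_lock);

	static void prune_one_pass_example(struct list_head *dispose)
	{
		struct inode *inode, *next;

		/* list lock first, i_lock innermost and covering only i_state/i_ref */
		spin_lock(&inode_lru_lock);
		list_for_each_entry_safe(inode, next, &inode_lru, i_lru) {
			spin_lock(&inode->i_lock);
			if (inode->i_ref || (inode->i_state & (I_DIRTY | I_SYNC))) {
				spin_unlock(&inode->i_lock);
				continue;
			}
			inode->i_state |= I_FREEING;
			spin_unlock(&inode->i_lock);
			list_move(&inode->i_lru, dispose);
		}
		spin_unlock(&inode_lru_lock);
	}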
> - inode hash rcuified and insertion/removal made per-bucket
My series goes to per-bucket locks, and can easily be converted to
rcu lookups in the next series that introduces RCU inode freeing.
> - inode lru lists and locking made per-zone
Per-zone is problematic. The second hottest lock on my single node
8p VM right now is the inode_lru_lock. A per-zone lock for the LRU
on such a machine is still a global lock. Even on large NUMA
machines we'll end up with this as a hot lock, especially as we're
looking at 12 core CPUs in a few months time and 16c/32t in a year
or so.
As it is, the cause of the lock contention I'm seeing is unbound
direct reclaim parallelism - every CPU on the node running the inode
cache shrinker at the same time. This behaviour will not change by making the locking per-zone; it will just cause hot nodes to occur.
One potential solution to this is that only kswapd runs shrinkers to
limit parallelism, and this would match up with per-node LRU list
and locking infrastructure. And to tie that back to another thread,
you can probably see some of the reasoning behind Christoph's
suggestions that per-zone shrinkers, LRUs and locking may not be the
best way to scale cache reclaim.
IMO, this is definitely a change that needs further discussion and
is one of the reasons why I haven't pushed any further down this
path - there are unresolved issues with this approach. It is also a completely separable piece of work and does not need to be solved to implement store-free path walking...
> - inode sb list made per-sb, per-cpu
I've gone as far as per-sb, so it still has a global lock. This is
enough to move the lock contention out of the profiles at 8p, and does not prevent a different method from being added later. It's good enough for an average sized server - so holding off on fine-grained lists and locking for a release or two, while everything else is sorted out, is fine because the inode_lru_lock is hotter. Also it's
not necessary to solve the problem to implement store-free path
walking.
> - inode counters made per-cpu
I reworked this to make both the nr_inodes and nr_unused counters
per-cpu and did it in one patch up front instead of spread across
three separate patches.
> - inode io lists and locking made per-bdi
And I've pulled that in and done that too, while dropping all the messy list manipulation loops, as the wrong-bdi problem is fixed upstream now.
Nothing I've done prevents RCU-ising the inode cache, but I've
discovered some issues that you've missed in moving straight to
fine-grained per-zone LRU lists and locking. I think the code is
cleaner (no trylocks or loops to get locks out of order), the series
is cleaner and it has gone through a couple of rounds of review
already. This is why I'd like you to try rebasing your tree on top
of it to determine if my assertions that there are no inherent
problems are correct....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 10:59 ` Dave Chinner
@ 2010-10-15 13:03 ` Nick Piggin
2010-10-15 13:29 ` Nick Piggin
2010-10-15 14:11 ` Nick Piggin
2010-10-15 20:50 ` Nick Piggin
1 sibling, 2 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 13:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 15, 2010 at 09:59:43PM +1100, Dave Chinner wrote:
> On Fri, Oct 15, 2010 at 05:41:50PM +1100, Nick Piggin wrote:
> > You're worried about mere mortals reviewing and understanding it...
> > I don't really know. If you understand inode locking today, you
> > can understand the inode scaling series quite easily. Ditto for
> > dcache. rcu-walk path walking is trickier, but it is described in
> > detail in documentation and changelog.
> >
> > And you can understand the high level approach without exactly
> > digesting every detail at once. The inode locking work goes to
> > break up all global locks:
> >
> > - a single inode object is protected (to the same level as
> > inode_lock) with i_lock. This makes it really trivial for
> > filesystems to lock down the object without taking a global
> > lock.
>
> Which is unnecessarily wide, and results in i_lock having to have
> list locks nested inside it, and that leads to the lock
> inversion try-lock mess that several people have complained about.
That's how I've done it. When you objected to it, I pointed out
that I can use RCU _or_ lift out i_lock selectively from some of
the protected members like the lists (and I've pointed this out
in changelogs too).
I haven't yet had the time to decide which approach I want to take there, but seeing as it is all confined to core
icache code, it wasn't as urgent (as, say, store-free path walking).
But it was always the intention to continue streamlining things.
> My series uses i_lock only to protect i_state and i_ref. It does not
> need to protect any more of the inode than that as other locks
> protect the other list fields. As a result, it's still the innermost
> lock and there are no trylocks in the code at all.
See, if you actually understand how my patch series is structured, and how I want to get a series of reviewable, bisectable, small, easy-to-understand transitions, then you know I don't
agree with doing it this way.
I am building up locks underneath inode_lock/dcache_lock in
steps until they can be taken out. After that, there is a series
to incrementally fine grain and streamline the locking.
Seeing as you objected strongly to the trylocks, and we discussed
this, it would have been productive for _you_ to build on top of
_my_ patch set and incrementally lift more i_locks away.
Your patch would then contain the justification for why it is correct, and be easily bisectable.
As for your insistence on it being the "innermost lock", I'm baffled. Firstly, it is not: dcache_lock nests inside it, for Christ's sake, and lots of locks nest inside dcache_lock. Secondly, lock ordering is not defined by some external entity; it is defined by the code.
We discussed it and I didn't think latencies would be any worse
a problem than they are today. I agreed it may be an issue and
pointed out there are ways forward to fix it.
> > - inode hash rcuified and insertion/removal made per-bucket
>
> My series goes to per-bucket locks, and can easily be converted to
> rcu lookups in the next series that introduces RCU inode freeing.
>
> > - inode lru lists and locking made per-zone
>
> Per-zone is problematic. The second hottest lock on my single node
Can you respond to my other email where I go into detail and try to
address your misconceptions and concerns? And actually come up with
a real objection.
> 8p VM right now is the inode_lru_lock. A per-zone lock for the LRU
> on such a machine is still a global lock. Even on large NUMA
> machines we'll end up with this as a hot lock, especially as we're
> looking at 12 core CPUs in a few months time and 16c/32t in a year
> or so.
>
> As it is, the cause of the lock contention I'm seeing is unbound
> direct reclaim parallelism - every CPU on the node running the inode
> cache shrinker at the same time. This behaviour will not change by making the locking per-zone; it will just cause hot nodes to occur.
>
> One potential solution to this is that only kswapd runs shrinkers to
> limit parallelism, and this would match up with per-node LRU list
> and locking infrastructure. And to tie that back to another thread,
> you can probably see some of the reasoning behind Christoph's
> suggestions that per-zone shrinkers, LRUs and locking may not be the
> best way to scale cache reclaim.
>
> IMO, this is definitely a change that needs further discussion and
> is one of the reasons why I haven't pushed any further down this
> path - there's unresolved issues with this approach. It is also a
> completely separable piece of work and does not need to be solved to
> implement to store-free path walking...
I've pushed further down this path, I re-explained it to you
again in my other mail, and I've had reviews from VM guys, including
people at google who actually have similar patches and problems. So
at this point please either catch up or stop handwaving objections.
No, it is not infinitely scalable with respect to multi-core nodes,
but that whole issue is a VM-wide problem. The right thing to do
with the shrinkers is to get them in line with the rest of VM
reclaim, so that subsequent scalability improvements there can be
shared without really any involvement from the vfs.
So that's all I have to say about per-zone lrus/locks unless you
actually have something constructive to say. All I hear is handwaving
and "oh we could do this or that or I don't know if it is the right
way to go and we should get VM guys to look at it". I've had enough
of that, it's not going anywhere.
I'm telling you for the last time, there are people that need per-zone
scanning, and there are people that need per-zone locking.
> > - inode sb list made per-sb, per-cpu
>
> I've gone as far as per-sb, so it still has a global lock. This is
> enough to move the lock contention out of the profiles at 8p,
> and does not prevent a different method from being added later.
> It's good enough for an average-sized server - holding off
> fine-grained lists and locking for a release or two while everything
> else is sorted out because the inode_lru_lock is hotter. Also it's
> not necessary to solve the problem to implement store-free path
> walking.
So that's another deficiency. I very quickly found on large machines
that any such locking like this would quickly cause big problems (and
limit parallelism when testing other improvements) so I fixed it.
And no, I don't want to just wait a release or two, change a bit
more, wait another release or two, and change more again, when we
know all along what needs to be changed.
You're trying to paint my 3 patches to transform counters to per-cpu
as a negative, and yet you want to have the subsystem locking in
flux and in dispute over multiple kernel releases? Come on, are you
joking?
> > - inode counters made per-cpu
>
> I reworked this to make both the nr_inodes and nr_unused counters
> per-cpu and did it in one patch up front instead of spread across
> three separate patches.
>
> > - inode io lists and locking made per-bdi
>
> And I've pulled that in and done that, too, while dropping all the messy
> list manipulation loops as the wrong bdi problem is fixed upstream now.
>
> Nothing I've done prevents RCU-ising the inode cache, but I've
> discovered some issues that you've missed in moving straight to
> fine-grained per-zone LRU lists and locking. I think the code is
> cleaner (no trylocks or loops to get locks out of order), the series
> is cleaner and it has gone through a couple of rounds of review
> already. This is why I'd like you to try rebasing your tree on top
> of it to determine if my assertions that there are no inherent
> problems are correct....
You have it upside down, I think. I've got way more work sitting
here that has been tested and in my opinion has the better set
of steps to transform the locking.
I am looking at your changes to see if any might apply to my tree.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 13:03 ` Nick Piggin
@ 2010-10-15 13:29 ` Nick Piggin
2010-10-15 17:33 ` Nick Piggin
2010-10-15 14:11 ` Nick Piggin
1 sibling, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 13:29 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 12:03:00AM +1100, Nick Piggin wrote:
> On Fri, Oct 15, 2010 at 09:59:43PM +1100, Dave Chinner wrote:
> > My series uses i_lock only to protect i_state and i_ref. It does not
> > need to protect any more of the inode than that as other locks
> > protect the other list fields. As a result, it's still the innermost
> > lock and there are no trylocks in the code at all.
> We discussed it and I didn't think latencies would be any worse
> a problem than they are today. I agreed it may be an issue and
> pointed out there are ways forward to fix it.
BTW. if a few trylocks are your biggest issue, this is a joke. I told
you how they can be fixed with incremental patches on top of the series
(which basically whittle down the lock coverage of the old inode_lock,
and so IMO need to be done in small chunks well bisectable and with
good rationale). So why you didn't submit a couple of incremental
patches to do just that is beyond me.
I've had prototypes in my tree actually to do that from a while back,
but actually I'm thinking that using RCU may be a better way to go
now that Linus has agreed on it and we have sketched a design to do
slab-free-RCU.
Either way, it's much easier to compare pros and cons of each, when
they are done incrementally on top of the existing base.
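A minimal sketch, assuming SLAB_DESTROY_BY_RCU is the mechanism we end up
using (not necessarily the final design): the inode slab is created with
the RCU flag, which defers freeing of the pages across a grace period but
still allows objects to be reused, so lockless lookups must re-validate
the inode's identity under i_lock after finding it.

	/* sketch: RCU-safe inode slab */
	inode_cachep = kmem_cache_create("inode_cache",
					 sizeof(struct inode), 0,
					 (SLAB_RECLAIM_ACCOUNT | SLAB_PANIC |
					  SLAB_MEM_SPREAD | SLAB_DESTROY_BY_RCU),
					 init_once);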
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 13:29 ` Nick Piggin
@ 2010-10-15 17:33 ` Nick Piggin
2010-10-15 17:52 ` Christoph Hellwig
0 siblings, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 17:33 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 12:29:38AM +1100, Nick Piggin wrote:
> On Sat, Oct 16, 2010 at 12:03:00AM +1100, Nick Piggin wrote:
> > On Fri, Oct 15, 2010 at 09:59:43PM +1100, Dave Chinner wrote:
> > > My series uses i_lock only to protect i_state and i_ref. It does not
> > > need to protect any more of the inode than that as other locks
> > > protect the other list fields. As a result, it's still the innermost
> > > lock and there are no trylocks in the code at all.
> > We discussed it and I didn't think latencies would be any worse
> > a problem than they are today. I agreed it may be an issue and
> > pointed out there are ways forward to fix it.
>
> BTW. if a few trylocks are your biggest issue, this is a joke. I told
> you how they can be fixed with incremental patches on top of the series
> (which basically whittle down the lock coverage of the old inode_lock,
> and so IMO need to be done in small chunks well bisectable and with
> good rationale). So why you didn't submit a couple of incremental
> patches to do just that is beyond me.
>
> I've had prototypes in my tree actually to do that from a while back,
> but actually I'm thinking that using RCU may be a better way to go
> now that Linus has agreed on it and we have sketched a design to do
> slab-free-RCU.
>
> Either way, it's much easier to compare pros and cons of each, when
> they are done incrementally on top of the existing base.
To illustrate my point, if it needs to be spelled out: if you RCU-lock a
list, then the locking can change quite a bit. Starting with my basic
first transform, which has the inode protected by i_lock:
/* pass1, dumb */
spin_lock(&inode->i_lock);
spin_lock(&list_lock);
list_del(&inode->list);
...
spin_lock(&list_lock);
list_for_each_entry(inode, list) {
	if (inode->blah ...) {
		if (!spin_trylock(&inode->i_lock))
			goto repeat; /* oh no, it's unmaintainable */
		do_something(inode);
	}
}
Common pattern, used by several of the inode lists, OK?
/* pass2a, RCU optimised */
rcu_read_lock();
list_for_each_entry(inode, list) {
	if (inode->blah ...) {
		spin_lock(&inode->i_lock); /* no trylock. easy. */
		if (unlikely(list_empty(&inode->i_list)))
			continue;
		do_something(inode);
	}
}
/* pass2b, reduced i_lock width */
/* no i_lock */
spin_lock(&list_lock);
list_del(&inode->list);
...
spin_lock(&list_lock);
list_for_each_entry(inode, list) {
	if (inode->blah ...) {
		do_something(inode);
	}
}
OK? Do you see a problem yet? Try turning it into RCU.
rcu_read_lock();
list_for_each_entry(inode, list) {
	if (inode->blah ...) {
		spin_lock(&list_lock);
		if (unlikely(list_empty(&inode->i_list)))
			continue;
		do_something(inode);
	}
}
So you've saved one (extremely fine-grained, cache hot) lock in the
slowpath, but you've exchanged a fine grained, cache hot lock in the
fastpath for a more expensive lock. You've also lost the really nice
(for maintainability and simplicity) ability to freeze all modifications
to the inode with the single i_lock.
So... you just happened to decide for yourself that your approach is
better, just because. All your extensive trials and experiments with
different locking orders led you to believe that my preferred RCU
approach is wrong and yours is right? Can you please share your working
out with the rest of the class?
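For completeness, a fully balanced version of the pass2a pattern would look
something like this (still a sketch; inode->blah, i_list and do_something()
are the same placeholders as above):

	rcu_read_lock();
	list_for_each_entry_rcu(inode, list, i_list) {
		if (inode->blah ...) {
			spin_lock(&inode->i_lock);	/* no trylock needed */
			if (unlikely(list_empty(&inode->i_list))) {
				/* unlinked by a concurrent writer, skip it */
				spin_unlock(&inode->i_lock);
				continue;
			}
			do_something(inode);
			spin_unlock(&inode->i_lock);
		}
	}
	rcu_read_unlock();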
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 17:33 ` Nick Piggin
@ 2010-10-15 17:52 ` Christoph Hellwig
2010-10-15 18:02 ` Nick Piggin
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-15 17:52 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-kernel
Thanks for trying to get back to a technical discussion. Maybe we
can just move the technical comments to direct replies to the patches
and leave this not very helpful subthread behind?
> rcu_read_lock();
> list_for_each_entry(inode, list) {
> 	if (inode->blah ...) {
> 		spin_lock(&list_lock);
> 		if (unlikely(list_empty(&inode->i_list)))
> 			continue;
> 		do_something(inode);
> 	}
> }
But that's not what we do for icache. For the validity checking during
lookup we check the I_FREEING bit, which is modified under i_lock
and can be read without any locking. So everything is just fine
when moving on to RCU locking.
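As a sketch of what such a lookup then looks like (assuming an RCU-safe
bucket iterator, here called hlist_bl_for_each_entry_rcu for illustration):

	rcu_read_lock();
	hlist_bl_for_each_entry_rcu(inode, node, &b->head, i_hash) {
		if (inode->i_ino != ino || inode->i_sb != sb)
			continue;
		if (inode->i_state & (I_FREEING|I_WILL_FREE))
			continue;		/* being torn down, skip */
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			continue;		/* raced with eviction */
		}
		/* live inode: take a reference under i_lock and return it */
		...
	}
	rcu_read_unlock();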
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 17:52 ` Christoph Hellwig
@ 2010-10-15 18:02 ` Nick Piggin
2010-10-15 18:14 ` Nick Piggin
2010-10-16 2:09 ` Nick Piggin
0 siblings, 2 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 18:02 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 15, 2010 at 01:52:47PM -0400, Christoph Hellwig wrote:
> Thanks for trying to get back to a technical discussion. Maybe we
> can just move the technical comments to direct replies to the patches
> and leave this not very helpful subthread behind?
I had plenty of other technical comments which you ignored about
per-zone shrinkers and strategy for merging the overall series. If you
are going to keep claiming these are show stoppers, please address
my comments there too.
> > rcu_read_lock();
> > list_for_each_entry(inode, list) {
> > 	if (inode->blah ...) {
> > 		spin_lock(&list_lock);
> > 		if (unlikely(list_empty(&inode->i_list)))
> > 			continue;
> > 		do_something(inode);
> > 	}
> > }
>
> But that't not what we do for icache. For the validity checking
> during lookup we check the I_FREEING bit, which is modified under
> i_lock and can be read without any locking. So everything is just
> fine when moving on to RCU locking.
For those lookups where you are taking the i_lock anyway, they
will look the same, except the i_lock lock width reduction
loses the ability to lock all icache state of the inode (like
we can practically do today with inode_lock).
This was a key consideration for maintainability for me. I also
spelled out quite clearly the i_lock reduction approach for
future speedup if the single atomic in the slowpath is really
that important. But those kinds of complications to the
locking rules I want to add as individual small patches at the
end of the series, where the streamlining work is happening. If
they are not worth the gains, I much prefer to keep the locking
more regular.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 18:02 ` Nick Piggin
@ 2010-10-15 18:14 ` Nick Piggin
2010-10-16 2:09 ` Nick Piggin
1 sibling, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 18:14 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 05:02:19AM +1100, Nick Piggin wrote:
> For those lookups where you are taking the i_lock anyway, they
> will look the same, except the i_lock lock width reduction
> loses the ability to lock all icache state of the inode (like
> we can practically do today with inode_lock).
>
> This was a key consideration for maintainability for me.
Maybe you've overlooked this point. It is, in fact, very important
in my opinion. With my locking approach, everywhere where today
we have:
spin_lock(&inode_lock);
do_something(inode);
spin_unlock(&inode_lock);
it can be replaced with
spin_lock(&inode->i_lock);
do_something(inode);
spin_unlock(&inode->i_lock);
Without worrying about the lock coverage. In fact, it is a tiny bit
stronger because you also get to hold the refcount at the same time
(doesn't really matter outside core icache though).
Ditto for my dcache_lock approach (it's far more important there, being
much more visible to filesystems IMO, but icache is still important).
I never totally objected to reductions in i_lock lock width if they
really are required for that last bit of performance, but I have always
maintained that I want these kinds of locking irregularities merged on
their own, on top of the base code. Especially with RCU inodes, I'm not
sure if they'll be needed, however.
Most of the slowpaths where that happens, the i_lock needs to be taken
somewhere anyway, so you probably don't really save anything.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 18:02 ` Nick Piggin
2010-10-15 18:14 ` Nick Piggin
@ 2010-10-16 2:09 ` Nick Piggin
1 sibling, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-16 2:09 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 05:02:19AM +1100, Nick Piggin wrote:
> On Fri, Oct 15, 2010 at 01:52:47PM -0400, Christoph Hellwig wrote:
> > Thanks for trying to get back to a technical discussion. Maybe we
> > can just move the technical comments to direct replies to the patches
> > and leave this not very helpful subthread behind?
>
> I had plenty of other technical comments which you ignored about
> per-zone shrinkers and strategy for merging the overall series. If you
> are going to keep claiming these are show stoppers, please address
> my comments there too.
Ping?
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 13:03 ` Nick Piggin
2010-10-15 13:29 ` Nick Piggin
@ 2010-10-15 14:11 ` Nick Piggin
1 sibling, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 14:11 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 12:03:00AM +1100, Nick Piggin wrote:
> On Fri, Oct 15, 2010 at 09:59:43PM +1100, Dave Chinner wrote:
> > On Fri, Oct 15, 2010 at 05:41:50PM +1100, Nick Piggin wrote:
> > > - inode io lists and locking made per-bdi
> >
> > And I've pulled that in and done that, too, while dropping all the messy
> > list manipulation loops as the wrong bdi problem is fixed upstream now.
> >
> > Nothing I've done prevents RCU-ising the inode cache, but I've
> > discovered some issues that you've missed in moving straight to
> > fine-grained per-zone LRU lists and locking. I think the code is
> > cleaner (no trylocks or loops to get locks out of order), the series
> > is cleaner and it has gone through a couple of rounds of review
> > already. This is why I'd like you to try rebasing your tree on top
> > of it to determine if my assertions that there are no inherent
> > problems are correct....
>
> You have it upside down, I think. I've got way more work sitting
> here that has been tested and in my opinion has the better set
> of steps to transform the locking.
I mean, come on. You've done 2 _whole_ weeks of work on this
(how many hours of testing, how many workloads, what sizes of
machines? google? intel? people who are actually reporting real
world problems?) and you want _me_ to rebase my entire tree on
some bits of yours that I don't even entirely agree with, and you
want me to tell you if you've stuffed anything up and that I
should go away and retest the whole stack and ask everyone else
to redo their testing results again?
What do you want here? Do you think it's actually going to give
some results more quickly to go through with that?
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 10:59 ` Dave Chinner
2010-10-15 13:03 ` Nick Piggin
@ 2010-10-15 20:50 ` Nick Piggin
2010-10-15 20:56 ` Nick Piggin
1 sibling, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 20:50 UTC (permalink / raw)
To: Dave Chinner; +Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 15, 2010 at 09:59:43PM +1100, Dave Chinner wrote:
> On Fri, Oct 15, 2010 at 05:41:50PM +1100, Nick Piggin wrote:
> > You're worried about mere mortals reviewing and understanding it...
> > I don't really know. If you understand inode locking today, you
> > can understand the inode scaling series quite easily. Ditto for
> > dcache. rcu-walk path walking is trickier, but it is described in
> > detail in documentation and changelog.
> >
> > And you can understand the high level approach without exactly
> > digesting every detail at once. The inode locking work goes to
> > break up all global locks:
> >
> > - a single inode object is protected (to the same level as
> > inode_lock) with i_lock. This makes it really trivial for
> > filesystems to lock down the object without taking a global
> > lock.
>
> Which is unnecessarily wide, and results in i_lock having to have
> list locks nested inside it, and that leads to the lock
> inversion try-lock mess that several people have complained about.
Gee, you keep repeating this so often that you have me doubting
myself a tiny bit, so I have to check.
$ grep spin_trylock fs/inode.c fs/fs-writeback.c
fs/inode.c: if (!spin_trylock(&inode->i_lock)) {
fs/inode.c: if (!spin_trylock(&old->i_lock)) {
fs/inode.c: if (!spin_trylock(&old->i_lock)) {
fs/fs-writeback.c: if (!spin_trylock(&inode->i_lock)) {
fs/fs-writeback.c: if (!spin_trylock(&inode->i_lock)) {
fs/fs-writeback.c: if (!spin_trylock(&inode->i_lock)) {
This is your unmaintainable mess? You decided to rewrite your own
vfs-scale tree because I wanted i_lock to protect the icache state (and
offered you very good reasons for it)?
Well, surely they must be horrible complex and unmaintainable
beasts....
repeat:
	/*
	 * Don't use RCU walks, common case of no old inode
	 * found requires hash lock.
	 */
	spin_lock_bucket(b);
	hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
		if (old->i_ino != ino)
			continue;
		if (old->i_sb != sb)
			continue;
		if (old->i_state & (I_FREEING|I_WILL_FREE))
			continue;
		if (!spin_trylock(&old->i_lock)) {
			spin_unlock_bucket(b);
			cpu_relax();
			goto repeat;
		}
Nope, no big deal. The rest are much the same. So thanks for the
repeated suggestion, but I actually prefer to keep my regular i_lock
locking scheme, where you don't need to look up the documentation and
think hard about coherency between protected and unprotected parts of
the inode whenever you use it. I didn't stumble upon my locking design
by chance.
If you think a few trylocks == impending doom, then xfs is looking
pretty poorly at this point. So I would ask that you stop making things
up about my patch series. If you dislike the trylocks so much that you
think it is worth breaking the i_lock regularity or using RCU or
whatever, then please propose them as incremental patches to the end
of my series where you can see they logically will fit. You know I will
argue that locking consistency is more important for maintainability
than these few trylocks, however.
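For the record, the ordering behind those trylocks, in rough form (a
sketch of the shape, not the exact call sites): one path holds i_lock and
then takes the hash bucket lock, the lookup path walks the bucket and then
needs i_lock, so the latter backs off with a trylock instead of blocking.

	/* path A: inode already locked, now (re)hash it */
	spin_lock(&inode->i_lock);
	spin_lock_bucket(b);
	...

	/* path B: bucket walk that also needs i_lock */
	spin_lock_bucket(b);
	...
	if (!spin_trylock(&old->i_lock)) {
		spin_unlock_bucket(b);
		cpu_relax();
		goto repeat;		/* back off rather than deadlock */
	}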
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 20:50 ` Nick Piggin
@ 2010-10-15 20:56 ` Nick Piggin
0 siblings, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 20:56 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 07:50:45AM +1100, Nick Piggin wrote:
> On Fri, Oct 15, 2010 at 09:59:43PM +1100, Dave Chinner wrote:
> > On Fri, Oct 15, 2010 at 05:41:50PM +1100, Nick Piggin wrote:
> > > You're worried about mere mortals reviewing and understanding it...
> > > I don't really know. If you understand inode locking today, you
> > > can understand the inode scaling series quite easily. Ditto for
> > > dcache. rcu-walk path walking is trickier, but it is described in
> > > detail in documentation and changelog.
> > >
> > > And you can understand the high level approach without exactly
> > > digesting every detail at once. The inode locking work goes to
> > > break up all global locks:
> > >
> > > - a single inode object is protected (to the same level as
> > > inode_lock) with i_lock. This makes it really trivial for
> > > filesystems to lock down the object without taking a global
> > > lock.
> >
> > Which is unnecessarily wide, and results in i_lock having to have
> > list locks nested inside it, and that leads to the lock
> > inversion try-lock mess that several people have complained about.
>
> Gee, you keep repeating this so often that you have me doubting
> myself a tiny bit, so I have to check.
>
> $ grep spin_trylock fs/inode.c fs/fs-writeback.c
> fs/inode.c: if (!spin_trylock(&inode->i_lock)) {
> fs/inode.c: if (!spin_trylock(&old->i_lock)) {
> fs/inode.c: if (!spin_trylock(&old->i_lock)) {
> fs/fs-writeback.c: if (!spin_trylock(&inode->i_lock)) {
> fs/fs-writeback.c: if (!spin_trylock(&inode->i_lock)) {
> fs/fs-writeback.c: if (!spin_trylock(&inode->i_lock)) {
>
> This is your unmaintainable mess? You decided to rewrite your own
The other thing that strikes me is that maybe you're looking
at intermediate steps in the series, which admittedly grow ugly
before they start to get clean.
This was a deliberate choice in how I set out the patch series, to
show very clearly every small step where and how locking was being
changed along the way. It is made so it can be followed (from the
patch series) by fs developers or almost anyone porting code to it.
It is absolutely not supposed to look pretty or be run with only
the first steps done. It is supposed to be verifiable, auditable,
and bisectable.
I've split it up like this since I first posted anything to any
list and asked for comments. If you object to that now, I find it
a bit rich. I will change the way it is done if Al tells me to,
but ugliness in the intermediate steps doesn't mean a thing if you're
ignoring the end result.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-14 14:41 ` Christoph Hellwig
2010-10-15 0:14 ` Nick Piggin
@ 2010-10-15 4:04 ` Nick Piggin
2010-10-15 11:33 ` Dave Chinner
1 sibling, 1 reply; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 4:04 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Thu, Oct 14, 2010 at 10:41:59AM -0400, Christoph Hellwig wrote:
> > Things
> > like path walks are nearly 50% faster single threaded, and perfectly
> > scalable. Linus actually wants the store-free path walk stuff
> > _before_ any of the other things, if that gives you an idea of where
> > other people are putting the priority of the patches.
>
> Different people have different priorities. In the end the person
> doing the work of actually getting it in a mergeable shape is setting
> the pace. If you had started splitting out the RCU pathwalk bits half a
> year ago we'd already have it in now. But that's not how it
> worked.
Also, I appreciate that you have run into some lock scaling on these XFS
workloads. But the right way to do it is not to quickly solve your own
issues and then sort out the rest, leaving my tree in wreckage.
Yes it will take a bit longer to actually solve *everyone*'s problems
(that is most definitely including NUMA reclaim, and path walk ping pong
and performance). But we can do it in a coherent way and we can do so
looking at and testing the _end_ result.
Once everyone is happy with where we want to go, it is a matter of
making nice mergable pieces, I agree my patchset still has some work to
do here. But this process is not going to "drag out", by doing it this
way. It can probably be done in 2 releases (first inode, then dcache).
In fact on the contrary I think it is much better to get the whole
group of changes merged at once.
If it drags out and we squabble and don't agree on what locking is
required _before_ we start merging things, then it will end up with a
half finished mess of slowly changing locking over many kernel releases.
So taking just a few bits that you particularly want solved right
now, and not bothering to look at the rest because you claim they're
weird or not needed or controversial, is really not helping the way I
want to merge this.
And really, blaming me for a few weeks vacation and a few other weeks
on work related stuff for causing all these delays is ridiculous. I've
been posting bits and pieces and ideas and rfcs for a long time without
any real interest from vfs people at all. The only real time I heard
from you about anything is a couple of times when I actually posted the
full patchset, you'd whinge about it was unreviewable (disregarding that
it was split into individually reviewable pieces and provided an overall
view of where I was going).
_I_ have actually been talking to people, running tests on big machines,
working with -rt guys, socket/networking people, etc. I've worked
through the store-free lookup design with Linus, we've agreed on RCU
inodes and contingency to manage unexpected regressions.
So when you just handwave away "little problems" like proper per-zone
reclaim, rcu-walk path lookup, or scaling the hash manipulations as
"controversial, weird, I'm not sold" it's really frustrating. Especially
when you turn around and accuse me of continually delaying things in the
same email.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 4:04 ` Nick Piggin
@ 2010-10-15 11:33 ` Dave Chinner
2010-10-15 13:14 ` Nick Piggin
2010-10-15 15:38 ` Nick Piggin
0 siblings, 2 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-15 11:33 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 15, 2010 at 03:04:06PM +1100, Nick Piggin wrote:
> On Thu, Oct 14, 2010 at 10:41:59AM -0400, Christoph Hellwig wrote:
> > > Things
> > > like path walks are nearly 50% faster single threaded, and perfectly
> > > scalable. Linus actually wants the store-free path walk stuff
> > > _before_ any of the other things, if that gives you an idea of where
> > > other people are putting the priority of the patches.
> >
> > Different people have different priorities. In the end the person
> > doing the work of actually getting it in a mergeable shape is setting
> > the pace. If you had started splitting out the RCU pathwalk bits half a
> > year ago we'd already have it in now. But that's not how it
> > worked.
>
> Also, I appreciate that you have run into some lock scaling on these XFS
> workloads. But the right way to do it is not to quickly solve your own
> issues and then sort out the rest, leaving my tree in wreckage.
I think you're misrepresenting both my motives and intentions with
the work I've been doing. I've already told you that I have no
intention of just breaking up the inode_lock, but that I intend to
work towards getting all of your work merged into mainline. That
is, I do not intend to leave your tree in a state of wreckage - I
intend to make it redundant.
I started with breaking up the inode_lock because it was the easiest
to tackle and it was what the VFS people indicated they wanted to be
tackled first. I didn't make that decision myself - I accepted
guidance that it was the most appropriate way to proceed.
That doesn't mean that I don't consider the dcache_lock to be a
problem, nor does it mean that it's not a problem for the workloads
I'm running. In fact, it's more heavily contended than the
inode_lock, but I can only tackle one at a time.
It also does not mean that I'm only pulling the bits appropriate for
XFS - if that was the case then I wouldn't have pulled in the
last_ino changes nor the per-cpu inode counters, nor the iunique
changes, nor the iget cleanups, nor Christoph's cleanups, nor added
cleanups of my own. I've taken this on with the expectation I would
be following it through to the end, wherever that may be.
A couple of weeks after I started on the inode_lock, Christoph
started looking at the dcache_lock part of your series and started
to clean it up. That's one less thing for me to worry about, and so I
will concentrate next on the inode cache RCU changes.
IOWs, there's more than one person actively trying to get your code
into mergable shape, and we are not stepping on each other's toes
because we know which bits each other is trying to address. That
will get the work done faster, which is a good thing.
You have been gone quite a long while, and right now you're the only
one arguing and stepping on the toes of those progressing the code
towards a mainline merge. Please stop doing that so we can keep
making progress.
> So taking just a few bits that you particularly want solved right
> now, and not bothering to look at the rest because you claim they're
> weird or not needed or controversial, is really not helping the way I
> want to merge this.
"the way I want to merge this"
This process is not about how you want to merge the code - it's
about how the VFS maintainers want to review and merge the code. You
want the code merged - you jump through their hoops and _you like
it_. I _had_ been enjoying jumping through those hoops over the past
few weeks. I don't enjoy writing emails like this.
> _I_ have actually been talking to people, running tests on big machines,
> working with -rt guys, socket/networking people, etc. I've worked
> through the store-free lookup design with Linus, we've agreed on RCU
> inodes and contingency to manage unexpected regressions.
Yes, that's all great work and everyone recognises that you deserve
the credit for it. You've blasted a path through the wilds and
demonstrated to us how to fix this longstanding problem. Nobody can
take that from you - all I ever hear about is "Nick Piggin's VFS
scalability patch set" and I don't want to change that. Nobody
cares that it's you or me or Christoph or someone else that is doing
the grunt work to get it merged - it is still your work and
inspiration that has got us to this point and that is what people
will remember.
Still, you've done everything _except_ what the VFS maintainers have
asked you to do to get it reviewed and merged. From the start
they've been asking you to split up the patch series into smaller
chunks for review and stage the inclusion over a couple of kernel
releases, and you have not yet shown any sign you are willing to do
that.
Now other people have stepped up to do what the VFS maintainers have
asked. Consider this an acknowledgement of the great work you have
done - there is enough desire to get your work completed that people
are willing to do it in your absence. If nobody valued your work,
then it would still be sitting there in a cold, decrepit git tree
growing cobwebs full of dead flies.
So, Nick, please, either help us get the code into a form acceptable
to the VFS maintainers and into mainline, or stay out of the way
while we go through the process as quickly as we can. This requires
a co-ordinated group effort, so arguing is really quite
counter-productive. I'd prefer that you help us get through the
process by being constructive and reviewing and testing and keeping
us honest.
IMO, right now the best thing you could do is rebase your tree on my work
and Christoph's prep dcache_lock patchset and tell us if we've
screwed anything up so far. I don't think we have, but that's no
guarantee.
I'm not going to try to convince you about the validity of what we
have been doing in your absence anymore; I've got better things to
do with my time. Instead I'm just going to continue to refine the
inode-scale series I have and address review comments until it is
acceptable for inclusion and then move on to the next piece of the
puzzle.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 11:33 ` Dave Chinner
@ 2010-10-15 13:14 ` Nick Piggin
2010-10-15 15:38 ` Nick Piggin
1 sibling, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 13:14 UTC (permalink / raw)
To: Dave Chinner; +Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 15, 2010 at 10:33:36PM +1100, Dave Chinner wrote:
> On Fri, Oct 15, 2010 at 03:04:06PM +1100, Nick Piggin wrote:
> > On Thu, Oct 14, 2010 at 10:41:59AM -0400, Christoph Hellwig wrote:
> > > > Things
> > > > like path walks are nearly 50% faster single threaded, and perfectly
> > > > scalable. Linus actually wants the store-free path walk stuff
> > > > _before_ any of the other things, if that gives you an idea of where
> > > > other people are putting the priority of the patches.
> > >
> > > Different people have different priorities. In the end the person
> > > doing the work of actually getting it in a mergeable shape is setting
> > > the pace. If you had started splitting out the RCU pathwalk bits half a
> > > year ago we'd already have it in now. But that's not how it
> > > worked.
> >
> > Also, I appreciate that you have run into some lock scaling on these XFS
> > workloads. But the right way to do it is not to quickly solve your own
> > issues and then sort out the rest, leaving my tree in wreckage.
>
> I think you're misrepresenting both my motives and intentions with
> the work I've been doing. I've already told you that I have no
> intention of just breaking up the inode_lock, but that I intend to
> work towards getting all of your work merged into mainline. That
> is, I do not intend to leave your tree in a state of wreckage - I
> intend to make it redundant.
I think you're misunderstanding my intentions. I have an "end result"
tree. And I plan to continue maintaining it and I'm getting it updated
and planning to ask for reviews to start merging things.
Yes, it needs some tweaks and some reviews, but I have agreement in
principle from the vfs maintainer; we hashed out the high-level changes
to inode locking at LSF.
And I have had the kernel maintainer look over the series and ack it
in principle too.
You're acting like I'm ignoring everybody. Actually I have replied
to everybody and every comment until a few weeks ago. I often didn't
agree with you, but I explained why in every step.
> > So taking just a few bits that you particularly want solved right
> > now, and not bothering to look at the rest because you claim they're
> > weird or not needed or controversial, is really not helping the way I
> > want to merge this.
>
> "the way I want to merge this"
>
> This process is not about how you want to merge the code - it's
> about how the VFS maintainers want to review and merge the code. You
> want the code merged - you jump through their hoops and _you like
> it_. I _had_ been enjoying jumping through those hoops over the past
> few weeks. I don't enjoy writing emails like this.
I have a say in it too, actually. And it would seem I have a much better
overview of the series than you or Christoph, given your replies in the
last couple of days.
I definitely will accept directions from Al and Linus and suggestions
from everyone else about the series. I will also try to proceed with how I
have structured the series, which I think is the right way to go.
You haven't answered my basic question: is it better to have
agreement on all lock order changes, have a reviewable and testable
end-goal on locking, and merge the changes in large chunks so there
is minimal locking churn over released trees? This is "the way I want
to merge it".
> > _I_ have actually been talking to people, running tests on big machines,
> > working with -rt guys, socket/networking people, etc. I've worked
> > through the store-free lookup design with Linus, we've agreed on RCU
> > inodes and contingency to manage unexpected regressions.
>
> Yes, that's all great work and everyone recognises that you deserve
> the credit for it. You've blasted a path through the wilds and
> demonstrated to us how to fix this longstanding problem. Nobody can
It's not a demonstration or a prototype, it is a (pretty damn stable,
considering the scale of it, and demonstrably performant) design and
implementation for a scalable vfs.
> take that from you - all I ever hear about is "Nick Piggin's VFS
> scalability patch set" and I don't want to change that. Nobody
> cares that it's you or me or Christoph or someone else that is doing
> the grunt work to get it merged - it is still your work and
> inspiration that has got us to this point and that is what people
> will remember.
It's not about that. It is that I don't agree with the way you're
going and I question your understanding of the bigger picture, and
that I'm still happy to maintain the tree and I am planning to get
it merged as soon as possible. I've gone as far as walking over each
step of the inode series with Al and asking Linus if he agrees with
the overall design of the series and the end results.
I would welcome help sending patches or suggesting different ways
to do things, definitely.
> Still, you've done everything _except_ what the VFS maintainers have
> asked you to do to get it reviewed and merged. From the start
> they've been asking you to split up the patch series into smaller
> chunks for review and stage the inclusion over a couple of kernel
> releases, and you have not yet shown any sign you are willing to do
> that.
Bullshit David. I'm quite willing to do that, and I started doing
that before I got a bit sidetracked with changing jobs and going
on vacation for a few weeks. I got as far as files lock and vfsmount
lock and didn't want to post my inode batch then disappear.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-15 11:33 ` Dave Chinner
2010-10-15 13:14 ` Nick Piggin
@ 2010-10-15 15:38 ` Nick Piggin
1 sibling, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-15 15:38 UTC (permalink / raw)
To: Dave Chinner; +Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel
On Fri, Oct 15, 2010 at 10:33:36PM +1100, Dave Chinner wrote:
> So, Nick, please, either help us get the code into a form acceptable
> to the VFS maintainers and into mainline, or stay out of the way
> while we go through the process as quickly as we can. This requires
> a co-ordinated group effort, so arguing is really quite
> counter-productive. I'd prefer that you help us get through the
> process by being constructive and reviewing and testing and keeping
> us honest.
This exact same statement applies to you, from me. We all know
how to cooperate, and forking a new tree because the maintainer
goes away for a few weeks (and you suddenly get interested after
all this time) is not the right way to do it.
Frankly, if your attitude now is to demand that I cooperate with
you, then I'm actually quite happy to just go over your head.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 17/18] fs: icache remove inode_lock
2010-10-08 5:21 ` [PATCH 17/18] fs: icache remove inode_lock Dave Chinner
2010-10-08 8:03 ` Christoph Hellwig
2010-10-13 7:20 ` Nick Piggin
@ 2010-10-16 7:57 ` Nick Piggin
2 siblings, 0 replies; 162+ messages in thread
From: Nick Piggin @ 2010-10-16 7:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:31PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> All the functionality that the inode_lock protected has now been
> wrapped up in new independent locks and/or functionality. Hence the
> inode_lock does not serve a purpose any longer and hence can now be
> removed.
>
> Based on work originally done by Nick Piggin.
I don't really agree with how this is structured, doing lock splitups
and changes and things under the inode_lock and then lifting it here.
If nothing else, there is the practical implication that any locking
problem in the patchset is going to bisect down to this patch.
My approach of being ultra conservative in the initial breaking steps,
then removing inode_lock, then implementing more aggressive locking
and less trivial transforms IMO is the better way to go. It has worked
very well for me doing both the inode and the dcache work, and it has
made it easy to understand and debug.
^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH 18/18] fs: Reduce inode I_FREEING and factor inode disposal
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (16 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 17/18] fs: icache remove inode_lock Dave Chinner
@ 2010-10-08 5:21 ` Dave Chinner
2010-10-08 8:11 ` Christoph Hellwig
` (2 more replies)
2010-10-09 8:08 ` [PATCH 19/18] fs: split __inode_add_to_list Christoph Hellwig
2010-10-09 11:18 ` [PATCH 20/18] fs: do not assign default i_ino in new_inode Christoph Hellwig
19 siblings, 3 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 5:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Dave Chinner <dchinner@redhat.com>
Inode reclaim can push many inodes into the I_FREEING state before
it actually frees them. During the time it gathers these inodes, it
can call iput(), invalidate_mapping_pages, be preempted, etc. As a
result, holding inodes in I_FREEING can cause pauses.
After the inode scalability work, there is not a big reason to batch
up inodes to reclaim them, so we can dispose them as they are found
from the LRU. With similar reasoning, we can do the same during
unmount, completely removing the need for the dispose_list()
function.
Further, iput_final() does the same inode cleanup as reclaim and
unmount, so convert them all to use a single function for destroying
inodes. This is written such that the callers can optimise list
removals to avoid unnecessary lock round trips when removing inodes
from lists.
Based on a patch originally from Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 150 +++++++++++++++++++++++++-----------------------------------
1 files changed, 63 insertions(+), 87 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index c778ec4..03ddd19 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -29,6 +29,8 @@
/*
* Locking rules.
*
+ * inode->i_lock is *always* the innermost lock.
+ *
* inode->i_lock protects:
* i_ref i_state
* inode_hash_bucket lock protects:
@@ -46,8 +48,15 @@
*
* sb inode lock
* inode_lru_lock
- * wb->b_lock
- * inode->i_lock
+ * wb->b_lock
+ * inode->i_lock
+ *
+ * wb->b_lock
+ * sb_lock (pin sb for writeback)
+ * inode->i_lock
+ *
+ * inode_lru
+ * inode->i_lock
*/
/*
* This is needed for the following functions:
@@ -434,13 +443,12 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
EXPORT_SYMBOL(__insert_inode_hash);
/**
- * __remove_inode_hash - remove an inode from the hash
+ * remove_inode_hash - remove an inode from the hash
* @inode: inode to unhash
*
- * Remove an inode from the superblock. inode->i_lock must be
- * held.
+ * Remove an inode from the superblock.
*/
-static void __remove_inode_hash(struct inode *inode)
+void remove_inode_hash(struct inode *inode)
{
struct inode_hash_bucket *b;
@@ -449,17 +457,6 @@ static void __remove_inode_hash(struct inode *inode)
hlist_bl_del_init(&inode->i_hash);
spin_unlock_bucket(b);
}
-
-/**
- * remove_inode_hash - remove an inode from the hash
- * @inode: inode to unhash
- *
- * Remove an inode from the superblock.
- */
-void remove_inode_hash(struct inode *inode)
-{
- __remove_inode_hash(inode);
-}
EXPORT_SYMBOL(remove_inode_hash);
void end_writeback(struct inode *inode)
@@ -494,37 +491,53 @@ static void evict(struct inode *inode)
}
/*
- * dispose_list - dispose of the contents of a local list
- * @head: the head of the list to free
+ * Free the inode passed in, removing it from the lists it is still connected
+ * to but avoiding unnecessary lock round-trips for the lists it is no longer
+ * on.
*
- * Dispose-list gets a local list with local inodes in it, so it doesn't
- * need to worry about list corruption and SMP locks.
+ * An inode must already be marked I_FREEING so that we avoid the inode being
+ * moved back onto lists if we race with other code that manipulates the lists
+ * (e.g. writeback_single_inode).
*/
-static void dispose_list(struct list_head *head)
+static void dispose_one_inode(struct inode *inode)
{
- while (!list_empty(head)) {
- struct inode *inode;
+ BUG_ON(!(inode->i_state & I_FREEING));
- inode = list_first_entry(head, struct inode, i_lru);
- list_del_init(&inode->i_lru);
+ /*
+ * move the inode off the IO lists and LRU once
+ * I_FREEING is set so that it won't get moved back on
+ * there if it is dirty.
+ */
+ if (!list_empty(&inode->i_io)) {
+ struct backing_dev_info *bdi = inode_to_bdi(inode);
- evict(inode);
+ spin_lock(&bdi->wb.b_lock);
+ list_del_init(&inode->i_io);
+ spin_unlock(&bdi->wb.b_lock);
+ }
+
+ if (!list_empty(&inode->i_lru))
+ inode_lru_list_del(inode);
- __remove_inode_hash(inode);
+ if (!list_empty(&inode->i_sb_list)) {
spin_lock(&inode->i_sb->s_inodes_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&inode->i_sb->s_inodes_lock);
-
- wake_up_inode(inode);
- destroy_inode(inode);
}
+
+ evict(inode);
+
+ remove_inode_hash(inode);
+ wake_up_inode(inode);
+ BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
+ destroy_inode(inode);
}
+
/*
* Invalidate all inodes for a device.
*/
-static int invalidate_list(struct super_block *sb, struct list_head *head,
- struct list_head *dispose)
+static int invalidate_list(struct super_block *sb, struct list_head *head)
{
struct list_head *next;
int busy = 0;
@@ -553,30 +566,22 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
}
invalidate_inode_buffers(inode);
if (!inode->i_ref) {
- struct backing_dev_info *bdi = inode_to_bdi(inode);
-
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- /*
- * move the inode off the IO lists and LRU once
- * I_FREEING is set so that it won't get moved back on
- * there if it is dirty.
- */
- spin_lock(&bdi->wb.b_lock);
- list_del_init(&inode->i_io);
- spin_unlock(&bdi->wb.b_lock);
+ /* save a lock round trip by removing the inode here. */
+ list_del_init(&inode->i_sb_list);
+ spin_unlock(&sb->s_inodes_lock);
- spin_lock(&inode_lru_lock);
- list_move(&inode->i_lru, dispose);
- spin_unlock(&inode_lru_lock);
+ dispose_one_inode(inode);
- percpu_counter_dec(&nr_inodes_unused);
+ spin_lock(&sb->s_inodes_lock);
continue;
}
spin_unlock(&inode->i_lock);
busy = 1;
+
}
return busy;
}
@@ -592,15 +597,12 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
int invalidate_inodes(struct super_block *sb)
{
int busy;
- LIST_HEAD(throw_away);
down_write(&iprune_sem);
spin_lock(&sb->s_inodes_lock);
fsnotify_unmount_inodes(&sb->s_inodes);
- busy = invalidate_list(sb, &sb->s_inodes, &throw_away);
+ busy = invalidate_list(sb, &sb->s_inodes);
spin_unlock(&sb->s_inodes_lock);
-
- dispose_list(&throw_away);
up_write(&iprune_sem);
return busy;
@@ -636,7 +638,6 @@ static int can_unuse(struct inode *inode)
*/
static void prune_icache(int nr_to_scan)
{
- LIST_HEAD(freeable);
int nr_scanned;
unsigned long reap = 0;
@@ -644,7 +645,6 @@ static void prune_icache(int nr_to_scan)
spin_lock(&inode_lru_lock);
for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
struct inode *inode;
- struct backing_dev_info *bdi;
if (list_empty(&inode_lru))
break;
@@ -691,18 +691,15 @@ static void prune_icache(int nr_to_scan)
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- /*
- * move the inode off the IO lists and LRU once
- * I_FREEING is set so that it won't get moved back on
- * there if it is dirty.
- */
- bdi = inode_to_bdi(inode);
- spin_lock(&bdi->wb.b_lock);
- list_del_init(&inode->i_io);
- spin_unlock(&bdi->wb.b_lock);
-
- list_move(&inode->i_lru, &freeable);
+ /* save a lock round trip by removing the inode here. */
+ list_del_init(&inode->i_lru);
percpu_counter_dec(&nr_inodes_unused);
+ spin_unlock(&inode_lru_lock);
+
+ dispose_one_inode(inode);
+ cond_resched();
+
+ spin_lock(&inode_lru_lock);
}
if (current_is_kswapd())
__count_vm_events(KSWAPD_INODESTEAL, reap);
@@ -710,7 +707,6 @@ static void prune_icache(int nr_to_scan)
__count_vm_events(PGINODESTEAL, reap);
spin_unlock(&inode_lru_lock);
- dispose_list(&freeable);
up_read(&iprune_sem);
}
@@ -1449,7 +1445,6 @@ static void iput_final(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
const struct super_operations *op = inode->i_sb->s_op;
- struct backing_dev_info *bdi = inode_to_bdi(inode);
int drop;
assert_spin_locked(&inode->i_lock);
@@ -1475,35 +1470,16 @@ static void iput_final(struct inode *inode)
inode->i_state |= I_WILL_FREE;
spin_unlock(&inode->i_lock);
write_inode_now(inode, 1);
+ remove_inode_hash(inode);
spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
- __remove_inode_hash(inode);
}
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- /*
- * move the inode off the IO lists and LRU once I_FREEING is set so
- * that it won't get moved back on there if it is dirty.
- * around.
- */
- spin_lock(&bdi->wb.b_lock);
- list_del_init(&inode->i_io);
- spin_unlock(&bdi->wb.b_lock);
-
- inode_lru_list_del(inode);
-
- spin_lock(&sb->s_inodes_lock);
- list_del_init(&inode->i_sb_list);
- spin_unlock(&sb->s_inodes_lock);
-
- evict(inode);
- remove_inode_hash(inode);
- wake_up_inode(inode);
- BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
- destroy_inode(inode);
+ dispose_one_inode(inode);
}
/**
--
1.7.1
^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 18/18] fs: Reduce inode I_FREEING and factor inode disposal
2010-10-08 5:21 ` [PATCH 18/18] fs: Reduce inode I_FREEING and factor inode disposal Dave Chinner
@ 2010-10-08 8:11 ` Christoph Hellwig
2010-10-08 10:18 ` Al Viro
2010-10-09 17:22 ` Matthew Wilcox
2 siblings, 0 replies; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-08 8:11 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
> After the inode scalability work, there is not a big reason to batch
> up inodes to reclaim them, so we can dispose them as they are found
> from the LRU. With similar reasoning, we can do the same during
> unmount, completely removing the need for the dispose_list()
> function.
Given that two of the three callers already remove the inode from the
per-sb list, what about doing that in the third as well and not bothering
with it in dispose_one_inode?
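Something like the following, sketched against the dispose_one_inode()
above (hypothetical, just to illustrate the suggestion):

	/* caller, e.g. iput_final(), drops the inode from the per-sb list */
	spin_lock(&sb->s_inodes_lock);
	list_del_init(&inode->i_sb_list);
	spin_unlock(&sb->s_inodes_lock);

	dispose_one_inode(inode);	/* no longer touches s_inodes_lock */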
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 18/18] fs: Reduce inode I_FREEING and factor inode disposal
2010-10-08 5:21 ` [PATCH 18/18] fs: Reduce inode I_FREEING and factor inode disposal Dave Chinner
2010-10-08 8:11 ` Christoph Hellwig
@ 2010-10-08 10:18 ` Al Viro
2010-10-08 10:52 ` Dave Chinner
2010-10-09 17:22 ` Matthew Wilcox
2 siblings, 1 reply; 162+ messages in thread
From: Al Viro @ 2010-10-08 10:18 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:32PM +1100, Dave Chinner wrote:
> + spin_unlock(&sb->s_inodes_lock);
>
> - spin_lock(&inode_lru_lock);
> - list_move(&inode->i_lru, dispose);
> - spin_unlock(&inode_lru_lock);
> + dispose_one_inode(inode);
>
> - percpu_counter_dec(&nr_inodes_unused);
> + spin_lock(&sb->s_inodes_lock);
And now you've unlocked the list and even blocked. What's going to
keep next valid through that fun?
> + spin_unlock(&inode_lru_lock);
> +
> + dispose_one_inode(inode);
> + cond_resched();
> +
> + spin_lock(&inode_lru_lock);
Same, only worse - in the previous you might hope for lack of activity
on fs, in this one you really can't.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 18/18] fs: Reduce inode I_FREEING and factor inode disposal
2010-10-08 10:18 ` Al Viro
@ 2010-10-08 10:52 ` Dave Chinner
2010-10-08 12:10 ` Al Viro
0 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 10:52 UTC (permalink / raw)
To: Al Viro; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 11:18:19AM +0100, Al Viro wrote:
> On Fri, Oct 08, 2010 at 04:21:32PM +1100, Dave Chinner wrote:
>
> > + spin_unlock(&sb->s_inodes_lock);
> >
> > - spin_lock(&inode_lru_lock);
> > - list_move(&inode->i_lru, dispose);
> > - spin_unlock(&inode_lru_lock);
> > + dispose_one_inode(inode);
> >
> > - percpu_counter_dec(&nr_inodes_unused);
> > + spin_lock(&sb->s_inodes_lock);
>
> And now you've unlocked the list and even blocked. What's going to
> keep next valid through that fun?
See the comment at the start of the loop in invalidate_list():
	/*
	 * We can reschedule here without worrying about the list's
	 * consistency because the per-sb list of inodes must not
	 * change during umount anymore, and because iprune_sem keeps
	 * shrink_icache_memory() away.
	 */
	cond_resched_lock(&sb->s_inodes_lock);
Hence I've assumed it's OK to add another point that drops locks and blocks
inside the loop, and that next will still be valid.
>
> > + spin_unlock(&inode_lru_lock);
> > +
> > + dispose_one_inode(inode);
> > + cond_resched();
> > +
> > + spin_lock(&inode_lru_lock);
>
> Same, only worse - in the previous you might hope for lack of activity
> on fs, in this one you really can't.
That one in prune_icache() is safe because the loop always gets the
first inode on the list:
	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
		struct inode *inode;

		if (list_empty(&inode_lru))
			break;

		inode = list_entry(inode_lru.prev, struct inode, i_lru);
		.....
because there is a pre-existing branch in the loop that drops all the
locks.
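To paraphrase the loop from the patch (trimmed down to the shape that
matters; nothing new here):

	spin_lock(&inode_lru_lock);
	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
		...
		inode = list_entry(inode_lru.prev, struct inode, i_lru);
		...
		list_del_init(&inode->i_lru);	/* off the LRU before unlocking */
		spin_unlock(&inode_lru_lock);

		dispose_one_inode(inode);	/* may block */
		cond_resched();

		spin_lock(&inode_lru_lock);	/* next pass re-reads the tail */
	}
	spin_unlock(&inode_lru_lock);

so no list cursor is carried across the window where the lock is dropped.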
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 18/18] fs: Reduce inode I_FREEING and factor inode disposal
2010-10-08 10:52 ` Dave Chinner
@ 2010-10-08 12:10 ` Al Viro
2010-10-08 13:55 ` Dave Chinner
0 siblings, 1 reply; 162+ messages in thread
From: Al Viro @ 2010-10-08 12:10 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 09:52:49PM +1100, Dave Chinner wrote:
> On Fri, Oct 08, 2010 at 11:18:19AM +0100, Al Viro wrote:
> > On Fri, Oct 08, 2010 at 04:21:32PM +1100, Dave Chinner wrote:
> >
> > > + spin_unlock(&sb->s_inodes_lock);
> > >
> > > - spin_lock(&inode_lru_lock);
> > > - list_move(&inode->i_lru, dispose);
> > > - spin_unlock(&inode_lru_lock);
> > > + dispose_one_inode(inode);
> > >
> > > - percpu_counter_dec(&nr_inodes_unused);
> > > + spin_lock(&sb->s_inodes_lock);
> >
> > And now you've unlocked the list and even blocked. What's going to
> > keep next valid through that fun?
>
> See the comment at the start of the loop in invalidate_list():
>
> /*
> * We can reschedule here without worrying about the list's
> * consistency because the per-sb list of inodes must not
> * change during umount anymore, and because iprune_sem keeps
> * shrink_icache_memory() away.
> */
> cond_resched_lock(&sb->s_inodes_lock);
>
> Hence I've assumed it's ok to add another point that drops locks and blocks
> inside the loop and next will still be valid.
I'm not convinced, TBH; IOW, the original might have been broken by that.
The trouble is, this function is called not only on umount(). Block device
invalidation paths can also lead to it. Moreover, even for the umount-only
side of things, remember that there's fsnotify as well. Original code
did _everything_ except the actual dropping of inodes without releasing
inode_lock. I'm not saying that change is broken (or, in case of
non-umount paths, makes breakage worse), but I'd like to see more analysis
of that area.
Umount races that hit only when you have the right subset of inodes with
idiotify watches on those are really not fun to debug post-factum...
> > > + spin_unlock(&inode_lru_lock);
> > > +
> > > + dispose_one_inode(inode);
> > > + cond_resched();
> > > +
> > > + spin_lock(&inode_lru_lock);
> >
> > Same, only worse - in the previous you might hope for lack of activity
> > on fs, in this one you really can't.
>
> That one in prune_icache() is safe because the loop always gets the
> > first inode on the list:
>
> for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
> struct inode *inode;
>
> if (list_empty(&inode_lru))
> break;
>
> inode = list_entry(inode_lru.prev, struct inode, i_lru);
> .....
D'oh. OK, that one looks all right.
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 18/18] fs: Reduce inode I_FREEING and factor inode disposal
2010-10-08 12:10 ` Al Viro
@ 2010-10-08 13:55 ` Dave Chinner
0 siblings, 0 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-08 13:55 UTC (permalink / raw)
To: Al Viro; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 01:10:52PM +0100, Al Viro wrote:
> On Fri, Oct 08, 2010 at 09:52:49PM +1100, Dave Chinner wrote:
> > On Fri, Oct 08, 2010 at 11:18:19AM +0100, Al Viro wrote:
> > > On Fri, Oct 08, 2010 at 04:21:32PM +1100, Dave Chinner wrote:
> > >
> > > > + spin_unlock(&sb->s_inodes_lock);
> > > >
> > > > - spin_lock(&inode_lru_lock);
> > > > - list_move(&inode->i_lru, dispose);
> > > > - spin_unlock(&inode_lru_lock);
> > > > + dispose_one_inode(inode);
> > > >
> > > > - percpu_counter_dec(&nr_inodes_unused);
> > > > + spin_lock(&sb->s_inodes_lock);
> > >
> > > And now you've unlocked the list and even blocked. What's going to
> > > keep next valid through that fun?
> >
> > See the comment at the start of the loop in invalidate_list():
> >
> > /*
> > * We can reschedule here without worrying about the list's
> > * consistency because the per-sb list of inodes must not
> > * change during umount anymore, and because iprune_sem keeps
> > * shrink_icache_memory() away.
> > */
> > cond_resched_lock(&sb->s_inodes_lock);
> >
> > Hence I've assumed it's ok to add another point that drops locks and blocks
> > inside the loop and next will still be valid.
>
> I'm not convinced, TBH; IOW, the original might have been broken by that.
> The trouble is, this function is called not only on umount(). Block device
> invalidation paths also can lead to it.
Yeah, I see that now. Thanks for pointing it out.
> Moreover, even for umount-only
> side of things, remember that there's fsnotify as well.
I thought that fsnotify_unmount_inodes() cleaned everything up
before we called invalidate_list().
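At least that's how I read the current invalidate_inodes() - roughly
(sketch from memory, locking details elided):

	down_write(&iprune_sem);
	/* ... take the list lock ... */
	fsnotify_unmount_inodes(&sb->s_inodes);	/* watches torn down first */
	busy = invalidate_list(&sb->s_inodes, &throw_away);
	/* ... drop the list lock ... */
	dispose_list(&throw_away);
	up_write(&iprune_sem);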
> Original code
> did _everything_ except the actual dropping inodes without releasing
> inode_lock. I'm not saying that change is broken (or, in case of
> non-umount paths, makes breakage worse), but I'd like to see more analysis
> of that area.
I think I'll avoid the whole issue right now by not making this
change to invalidate_list()....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 18/18] fs: Reduce inode I_FREEING and factor inode disposal
2010-10-08 5:21 ` [PATCH 18/18] fs: Reduce inode I_FREEING and factor inode disposal Dave Chinner
2010-10-08 8:11 ` Christoph Hellwig
2010-10-08 10:18 ` Al Viro
@ 2010-10-09 17:22 ` Matthew Wilcox
2 siblings, 0 replies; 162+ messages in thread
From: Matthew Wilcox @ 2010-10-09 17:22 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 08, 2010 at 04:21:32PM +1100, Dave Chinner wrote:
> /*
> - * dispose_list - dispose of the contents of a local list
> - * @head: the head of the list to free
> + * Free the inode passed in, removing it from the lists it is still connected
> + * to but avoiding unnecessary lock round-trips for the lists it is no longer
> + * on.
> *
> - * Dispose-list gets a local list with local inodes in it, so it doesn't
> - * need to worry about list corruption and SMP locks.
> + * An inode must already be marked I_FREEING so that we avoid the inode being
> + * moved back onto lists if we race with other code that manipulates the lists
> + * (e.g. writeback_single_inode).
> */
> -static void dispose_list(struct list_head *head)
> +static void dispose_one_inode(struct inode *inode)
> {
> - while (!list_empty(head)) {
> - struct inode *inode;
> + BUG_ON(!(inode->i_state & I_FREEING));
I think this may be my favourite comment in this whole patch set. It's a
real pain to hit a BUG_ON and not know why it's there. Thank you.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH 19/18] fs: split __inode_add_to_list
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (17 preceding siblings ...)
2010-10-08 5:21 ` [PATCH 18/18] fs: Reduce inode I_FREEING and factor inode disposal Dave Chinner
@ 2010-10-09 8:08 ` Christoph Hellwig
2010-10-12 10:47 ` Dave Chinner
2010-10-09 11:18 ` [PATCH 20/18] fs: do not assign default i_ino in new_inode Christoph Hellwig
19 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-09 8:08 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
__inode_add_to_lists does two things that aren't related. First it adds
the inode to the s_inodes list in the superblock, and second it optionally
adds the inode to the inode hash. Now that these don't even share the
same lock there is no need to keep this functionality together. Split
out an add_to_inode_hash helper from __insert_inode_hash to add an inode
to a pre-calculated hash bucket for use by the various iget versions, and
an inode_add_to_sb_list helper from __inode_add_to_lists to just add an
inode to the per-sb list. The inode.c-internal callers of
__inode_add_to_lists are converted to a sequence of inode_add_to_sb_list
and add_to_inode_hash (if needed), and the only use of inode_add_to_lists
in XFS is replaced with a call to inode_add_to_sb_list and insert_inode_hash.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c 2010-10-08 19:53:56.287013554 +0200
+++ linux-2.6/fs/inode.c 2010-10-08 19:54:50.717254091 +0200
@@ -423,6 +423,13 @@ static unsigned long hash(struct super_b
return tmp & I_HASHMASK;
}
+static void add_to_inode_hash(struct inode_hash_bucket *b, struct inode *inode)
+{
+ spin_lock_bucket(b);
+ hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
+}
+
/**
* __insert_inode_hash - hash an inode
* @inode: unhashed inode
@@ -433,12 +440,7 @@ static unsigned long hash(struct super_b
*/
void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
- struct inode_hash_bucket *b;
-
- b = inode_hashtable + hash(inode->i_sb, hashval);
- spin_lock_bucket(b);
- hlist_bl_add_head(&inode->i_hash, &b->head);
- spin_unlock_bucket(b);
+ add_to_inode_hash(inode_hashtable + hash(inode->i_sb, hashval), inode);
}
EXPORT_SYMBOL(__insert_inode_hash);
@@ -805,39 +807,19 @@ repeat:
return node ? inode : NULL;
}
-static inline void
-__inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
- struct inode *inode)
-{
- spin_lock(&sb->s_inodes_lock);
- list_add(&inode->i_sb_list, &sb->s_inodes);
- spin_unlock(&sb->s_inodes_lock);
- if (b) {
- spin_lock_bucket(b);
- hlist_bl_add_head(&inode->i_hash, &b->head);
- spin_unlock_bucket(b);
- }
-}
-
/**
- * inode_add_to_lists - add a new inode to relevant lists
- * @sb: superblock inode belongs to
- * @inode: inode to mark in use
- *
- * When an inode is allocated it needs to be accounted for, added to the in use
- * list, the owning superblock and the inode hash.
- *
- * We calculate the hash list to add to here so it is all internal
- * which requires the caller to have already set up the inode number in the
- * inode to add.
+ * inode_add_to_sb_list - add inode to the superblock list of inodes
+ * @inode: inode to add
*/
-void inode_add_to_lists(struct super_block *sb, struct inode *inode)
+void inode_add_to_sb_list(struct inode *inode)
{
- struct inode_hash_bucket *b = inode_hashtable + hash(sb, inode->i_ino);
+ struct super_block *sb = inode->i_sb;
- __inode_add_to_lists(sb, b, inode);
+ spin_lock(&sb->s_inodes_lock);
+ list_add(&inode->i_sb_list, &sb->s_inodes);
+ spin_unlock(&sb->s_inodes_lock);
}
-EXPORT_SYMBOL_GPL(inode_add_to_lists);
+EXPORT_SYMBOL_GPL(inode_add_to_sb_list);
/*
* Each cpu owns a range of LAST_INO_BATCH numbers.
@@ -896,7 +878,7 @@ struct inode *new_inode(struct super_blo
if (inode) {
inode->i_ino = last_ino_get();
inode->i_state = 0;
- __inode_add_to_lists(sb, NULL, inode);
+ inode_add_to_sb_list(inode);
}
return inode;
}
@@ -962,7 +944,8 @@ static struct inode *get_new_inode(struc
goto set_failed;
inode->i_state = I_NEW;
- __inode_add_to_lists(sb, b, inode);
+ inode_add_to_sb_list(inode);
+ add_to_inode_hash(b, inode);
/* Return the locked inode with I_NEW set, the
* caller is responsible for filling in the contents
@@ -1005,7 +988,8 @@ static struct inode *get_new_inode_fast(
old = find_inode_fast(sb, b, ino);
if (!old) {
inode->i_ino = ino;
- __inode_add_to_lists(sb, b, inode);
+ inode_add_to_sb_list(inode);
+ add_to_inode_hash(b, inode);
inode->i_state = I_NEW;
/* Return the locked inode with I_NEW set, the
Index: linux-2.6/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_iops.c 2010-10-08 19:53:56.299004405 +0200
+++ linux-2.6/fs/xfs/linux-2.6/xfs_iops.c 2010-10-08 19:54:31.258003987 +0200
@@ -795,7 +795,9 @@ xfs_setup_inode(
inode->i_ino = ip->i_ino;
inode->i_state = I_NEW;
- inode_add_to_lists(ip->i_mount->m_super, inode);
+
+ inode_add_to_sb_list(inode);
+ insert_inode_hash(inode);
inode->i_mode = ip->i_d.di_mode;
inode->i_nlink = ip->i_d.di_nlink;
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2010-10-08 19:53:56.310012018 +0200
+++ linux-2.6/include/linux/fs.h 2010-10-08 19:54:50.871004265 +0200
@@ -2165,7 +2165,6 @@ extern loff_t vfs_llseek(struct file *fi
extern int inode_init_always(struct super_block *, struct inode *);
extern void inode_init_once(struct inode *);
-extern void inode_add_to_lists(struct super_block *, struct inode *);
extern void iput(struct inode *);
extern struct inode * igrab(struct inode *);
extern ino_t iunique(struct super_block *, ino_t);
@@ -2199,9 +2198,11 @@ extern int file_remove_suid(struct file
extern void __insert_inode_hash(struct inode *, unsigned long hashval);
extern void remove_inode_hash(struct inode *);
-static inline void insert_inode_hash(struct inode *inode) {
+static inline void insert_inode_hash(struct inode *inode)
+{
__insert_inode_hash(inode, inode->i_ino);
}
+extern void inode_add_to_sb_list(struct inode *inode);
#ifdef CONFIG_BLOCK
extern void submit_bio(int, struct bio *);
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 19/18] fs: split __inode_add_to_list
2010-10-09 8:08 ` [PATCH 19/18] fs: split __inode_add_to_list Christoph Hellwig
@ 2010-10-12 10:47 ` Dave Chinner
2010-10-12 11:31 ` Christoph Hellwig
0 siblings, 1 reply; 162+ messages in thread
From: Dave Chinner @ 2010-10-12 10:47 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Sat, Oct 09, 2010 at 04:08:54AM -0400, Christoph Hellwig wrote:
> __inode_add_to_lists does two things that aren't related. First it adds
> the inode to the s_inodes list in the superblock, and second it optionally
> adds the inode to the inode hash. Now that these don't even share the
> same lock there is no need to keep this functionality together. Split
> out an add_to_inode_hash helper from __insert_inode_hash to add an inode
> to a pre-calculated hash bucket for use by the various iget versions, and
> an inode_add_to_sb_list helper from __inode_add_to_lists to just add an
> inode to the per-sb list. The inode.c-internal callers of
> __inode_add_to_lists are converted to a sequence of inode_add_to_sb_list
> and add_to_inode_hash (if needed), and the only use of inode_add_to_lists
> in XFS is replaced with a call to inode_add_to_sb_list and insert_inode_hash.
The only reason XFS hashed the inodes was to avoid problems in the
generic code that checked for unhashed inodes during clear_inode(). The
evict() changeover moved that unhashed check into
generic_drop_inode(), which the filesystem can override. Hence if
you add a ->drop_inode() method for XFS that just checks the link
count, we can avoid hashing the inodes altogether for XFS.
I can add another patch on top of this one to do that if you want...
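Something like this (completely untested sketch, function name made up)
is what I have in mind:

	/*
	 * XFS never uses the VFS inode hash, so the only part of
	 * generic_drop_inode() we care about is the link count check.
	 */
	STATIC int
	xfs_fs_drop_inode(
		struct inode	*inode)
	{
		return !inode->i_nlink;
	}

and wire it up via .drop_inode in xfs_super_operations.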
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 19/18] fs: split __inode_add_to_list
2010-10-12 10:47 ` Dave Chinner
@ 2010-10-12 11:31 ` Christoph Hellwig
2010-10-12 12:05 ` Dave Chinner
0 siblings, 1 reply; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-12 11:31 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel
On Tue, Oct 12, 2010 at 09:47:27PM +1100, Dave Chinner wrote:
> The only reason XFS hashed the inodes was to avoid problems in the
> generic code that checked for unhashed inodes during clear_inode(). The
> evict() changeover moved that unhashed check into
> generic_drop_inode(), which the filesystem can override. Hence if
> you add a ->drop_inode() method for XFS that just checks the link
> count, we can avoid hashing the inodes altogether for XFS.
>
> I can add another patch on top of this one to do that if you want...
It's unfortunately not that simple. Take a look at the unhashed check
in __mark_inode_dirty. The drop_inode check could have been avoided for
quite a long time now. What we could do, however, is the same hack as
JFS does in diReadSpecial().
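IIRC that hack boils down to making the special inode look hashed
without ever putting it on a hash chain, i.e. something like this
(sketch from memory, modulo the hlist_bl conversion in this series):

	/*
	 * Looks hashed, but is not on any chain; a later hlist_del()
	 * only touches the inode itself, so no hash locking is needed.
	 */
	inode->i_hash.pprev = &inode->i_hash.next;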
^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 19/18] fs: split __inode_add_to_list
2010-10-12 11:31 ` Christoph Hellwig
@ 2010-10-12 12:05 ` Dave Chinner
0 siblings, 0 replies; 162+ messages in thread
From: Dave Chinner @ 2010-10-12 12:05 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Tue, Oct 12, 2010 at 07:31:30AM -0400, Christoph Hellwig wrote:
> On Tue, Oct 12, 2010 at 09:47:27PM +1100, Dave Chinner wrote:
> > The only reason XFS hashed the inodes was to avoid problems in the
> > generic code that checked for unhashed inodes during clear_inode(). The
> > evict() changeover moved that unhashed check into
> > generic_drop_inode(), which the filesystem can override. Hence if
> > you add a ->drop_inode() method for XFS that just checks the link
> > count, we can avoid hashing the inodes altogether for XFS.
> >
> > I can add another patch on top of this one to do that if you want...
>
> It's unfortunately not that simple. Take a look at the unhashed check
> in __mark_inode_dirty.
Damn - I forgot about that one. Does anyone know why that check is
there?
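For anyone following along, the check in question is roughly this (from
the current tree, modulo the hash changes in this series):

	/*
	 * Only add valid (hashed) inodes to the superblock's
	 * dirty list.  Add blockdev inodes as well.
	 */
	if (!S_ISBLK(inode->i_mode)) {
		if (hlist_unhashed(&inode->i_hash))
			goto out;
	}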
> The drop_inode check could have been avoided for
> quite a long time now. What we could do, however, is the same hack as
> JFS does in diReadSpecial().
Nasty, but effective. Worth considering, I think.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH 20/18] fs: do not assign default i_ino in new_inode
2010-10-08 5:21 fs: Inode cache scalability V2 Dave Chinner
` (18 preceding siblings ...)
2010-10-09 8:08 ` [PATCH 19/18] fs: split __inode_add_to_list Christoph Hellwig
@ 2010-10-09 11:18 ` Christoph Hellwig
19 siblings, 0 replies; 162+ messages in thread
From: Christoph Hellwig @ 2010-10-09 11:18 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now the set of callers that need it is estimated conservatively, that is,
the call is added to all filesystems that do not assign an i_ino
by themselves. For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c 2010-10-08 19:54:31.257003917 +0200
+++ linux-2.6/fs/inode.c 2010-10-08 19:54:31.825011877 +0200
@@ -839,7 +839,7 @@ EXPORT_SYMBOL_GPL(inode_add_to_sb_list);
#define LAST_INO_BATCH 1024
static DEFINE_PER_CPU(unsigned int, last_ino);
-static unsigned int last_ino_get(void)
+unsigned int last_ino_get(void)
{
unsigned int *p = &get_cpu_var(last_ino);
unsigned int res = *p;
@@ -857,6 +857,7 @@ static unsigned int last_ino_get(void)
put_cpu_var(last_ino);
return res;
}
+EXPORT_SYMBOL(last_ino_get);
/**
* new_inode - obtain an inode
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2010-10-08 19:54:31.262004266 +0200
+++ linux-2.6/include/linux/fs.h 2010-10-08 19:54:31.826022563 +0200
@@ -2184,6 +2184,7 @@ extern struct inode * iget_locked(struct
extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
extern int insert_inode_locked(struct inode *);
extern void unlock_new_inode(struct inode *);
+extern unsigned int last_ino_get(void);
extern void iref(struct inode *inode);
extern void iref_locked(struct inode *inode);
Index: linux-2.6/drivers/infiniband/hw/ipath/ipath_fs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ipath/ipath_fs.c 2010-10-08 19:53:46.539005524 +0200
+++ linux-2.6/drivers/infiniband/hw/ipath/ipath_fs.c 2010-10-08 19:54:36.795003567 +0200
@@ -57,6 +57,7 @@ static int ipathfs_mknod(struct inode *d
goto bail;
}
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
inode->i_private = data;
Index: linux-2.6/drivers/infiniband/hw/qib/qib_fs.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/qib/qib_fs.c 2010-10-08 19:53:46.540005523 +0200
+++ linux-2.6/drivers/infiniband/hw/qib/qib_fs.c 2010-10-08 19:54:47.625206039 +0200
@@ -58,6 +58,7 @@ static int qibfs_mknod(struct inode *dir
goto bail;
}
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_uid = 0;
inode->i_gid = 0;
Index: linux-2.6/drivers/misc/ibmasm/ibmasmfs.c
===================================================================
--- linux-2.6.orig/drivers/misc/ibmasm/ibmasmfs.c 2010-10-08 19:53:46.550006500 +0200
+++ linux-2.6/drivers/misc/ibmasm/ibmasmfs.c 2010-10-08 19:54:36.804260795 +0200
@@ -146,6 +147,7 @@ static struct inode *ibmasmfs_make_inode
struct inode *ret = new_inode(sb);
if (ret) {
+ ret->i_ino = last_ino_get();
ret->i_mode = mode;
ret->i_atime = ret->i_mtime = ret->i_ctime = CURRENT_TIME;
}
Index: linux-2.6/drivers/oprofile/oprofilefs.c
===================================================================
--- linux-2.6.orig/drivers/oprofile/oprofilefs.c 2010-10-08 19:53:46.552018094 +0200
+++ linux-2.6/drivers/oprofile/oprofilefs.c 2010-10-08 19:54:36.811254300 +0200
@@ -28,6 +28,7 @@ static struct inode *oprofilefs_get_inod
struct inode *inode = new_inode(sb);
if (inode) {
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
}
Index: linux-2.6/drivers/usb/core/inode.c
===================================================================
--- linux-2.6.orig/drivers/usb/core/inode.c 2010-10-08 19:53:46.560005872 +0200
+++ linux-2.6/drivers/usb/core/inode.c 2010-10-08 19:54:36.818261564 +0200
@@ -276,6 +276,7 @@ static struct inode *usbfs_get_inode (st
struct inode *inode = new_inode(sb);
if (inode) {
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
Index: linux-2.6/drivers/usb/gadget/f_fs.c
===================================================================
--- linux-2.6.orig/drivers/usb/gadget/f_fs.c 2010-10-08 19:53:46.565007478 +0200
+++ linux-2.6/drivers/usb/gadget/f_fs.c 2010-10-08 19:54:36.828006430 +0200
@@ -980,6 +980,7 @@ ffs_sb_make_inode(struct super_block *sb
if (likely(inode)) {
struct timespec current_time = CURRENT_TIME;
+ inode->i_ino = last_ino_get();
inode->i_mode = perms->mode;
inode->i_uid = perms->uid;
inode->i_gid = perms->gid;
Index: linux-2.6/drivers/usb/gadget/inode.c
===================================================================
--- linux-2.6.orig/drivers/usb/gadget/inode.c 2010-10-08 19:53:46.569006151 +0200
+++ linux-2.6/drivers/usb/gadget/inode.c 2010-10-08 19:54:36.836003916 +0200
@@ -1994,6 +1994,7 @@ gadgetfs_make_inode (struct super_block
struct inode *inode = new_inode (sb);
if (inode) {
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_uid = default_uid;
inode->i_gid = default_gid;
Index: linux-2.6/fs/anon_inodes.c
===================================================================
--- linux-2.6.orig/fs/anon_inodes.c 2010-10-08 19:53:46.579023472 +0200
+++ linux-2.6/fs/anon_inodes.c 2010-10-08 19:54:36.841260935 +0200
@@ -193,6 +193,7 @@ static struct inode *anon_inode_mkinode(
if (!inode)
return ERR_PTR(-ENOMEM);
+ inode->i_ino = last_ino_get();
inode->i_fop = &anon_inode_fops;
inode->i_mapping->a_ops = &anon_aops;
Index: linux-2.6/fs/autofs4/inode.c
===================================================================
--- linux-2.6.orig/fs/autofs4/inode.c 2010-10-08 19:53:46.584004056 +0200
+++ linux-2.6/fs/autofs4/inode.c 2010-10-08 19:54:36.844257862 +0200
@@ -398,6 +398,7 @@ struct inode *autofs4_get_inode(struct s
inode->i_gid = sb->s_root->d_inode->i_gid;
}
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+ inode->i_ino = last_ino_get();
if (S_ISDIR(inf->mode)) {
inode->i_nlink = 2;
Index: linux-2.6/fs/binfmt_misc.c
===================================================================
--- linux-2.6.orig/fs/binfmt_misc.c 2010-10-08 19:53:46.590254230 +0200
+++ linux-2.6/fs/binfmt_misc.c 2010-10-08 19:54:36.848004265 +0200
@@ -495,6 +495,7 @@ static struct inode *bm_get_inode(struct
struct inode * inode = new_inode(sb);
if (inode) {
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_atime = inode->i_mtime = inode->i_ctime =
current_fs_time(inode->i_sb);
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c 2010-10-08 19:53:46.596006221 +0200
+++ linux-2.6/fs/configfs/inode.c 2010-10-08 19:54:36.852004405 +0200
@@ -135,6 +135,7 @@ struct inode * configfs_new_inode(mode_t
{
struct inode * inode = new_inode(configfs_sb);
if (inode) {
+ inode->i_ino = last_ino_get();
inode->i_mapping->a_ops = &configfs_aops;
mapping_new_set_bdi(inode->i_mapping,
&configfs_backing_dev_info);
Index: linux-2.6/fs/debugfs/inode.c
===================================================================
--- linux-2.6.orig/fs/debugfs/inode.c 2010-10-08 19:53:46.601005522 +0200
+++ linux-2.6/fs/debugfs/inode.c 2010-10-08 19:54:36.856005522 +0200
@@ -40,6 +40,7 @@ static struct inode *debugfs_get_inode(s
struct inode *inode = new_inode(sb);
if (inode) {
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
Index: linux-2.6/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.orig/fs/ext4/mballoc.c 2010-10-08 19:53:46.608006221 +0200
+++ linux-2.6/fs/ext4/mballoc.c 2010-10-08 19:54:36.863255767 +0200
@@ -2373,6 +2373,7 @@ static int ext4_mb_init_backend(struct s
printk(KERN_ERR "EXT4-fs: can't get new inode\n");
goto err_freesgi;
}
+ sbi->s_buddy_cache->i_ino = last_ino_get();
EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
for (i = 0; i < ngroups; i++) {
desc = ext4_get_group_desc(sb, i, NULL);
Index: linux-2.6/fs/freevxfs/vxfs_inode.c
===================================================================
--- linux-2.6.orig/fs/freevxfs/vxfs_inode.c 2010-10-08 19:53:46.613006291 +0200
+++ linux-2.6/fs/freevxfs/vxfs_inode.c 2010-10-08 19:54:36.868255697 +0200
@@ -260,6 +260,7 @@ vxfs_get_fake_inode(struct super_block *
struct inode *ip = NULL;
if ((ip = new_inode(sbp))) {
+ ip->i_ino = last_ino_get();
vxfs_iinit(ip, vip);
ip->i_mapping->a_ops = &vxfs_aops;
}
Index: linux-2.6/fs/fuse/control.c
===================================================================
--- linux-2.6.orig/fs/fuse/control.c 2010-10-08 19:53:46.615011669 +0200
+++ linux-2.6/fs/fuse/control.c 2010-10-08 19:54:36.872256884 +0200
@@ -218,6 +218,7 @@ static struct dentry *fuse_ctl_add_dentr
if (!inode)
return NULL;
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_uid = fc->user_id;
inode->i_gid = fc->group_id;
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c 2010-10-08 19:53:46.619033250 +0200
+++ linux-2.6/fs/hugetlbfs/inode.c 2010-10-08 19:54:36.882256326 +0200
@@ -455,6 +455,7 @@ static struct inode *hugetlbfs_get_inode
inode = new_inode(sb);
if (inode) {
struct hugetlbfs_inode_info *info;
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_uid = uid;
inode->i_gid = gid;
Index: linux-2.6/fs/ocfs2/dlmfs/dlmfs.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dlmfs/dlmfs.c 2010-10-08 19:53:46.623010272 +0200
+++ linux-2.6/fs/ocfs2/dlmfs/dlmfs.c 2010-10-08 19:54:36.886255906 +0200
@@ -400,6 +400,7 @@ static struct inode *dlmfs_get_root_inod
if (inode) {
ip = DLMFS_I(inode);
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
@@ -425,6 +426,7 @@ static struct inode *dlmfs_get_inode(str
if (!inode)
return NULL;
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
Index: linux-2.6/fs/pipe.c
===================================================================
--- linux-2.6.orig/fs/pipe.c 2010-10-08 19:53:46.629004475 +0200
+++ linux-2.6/fs/pipe.c 2010-10-08 19:54:36.893255906 +0200
@@ -954,6 +954,8 @@ static struct inode * get_pipe_inode(voi
if (!inode)
goto fail_inode;
+ inode->i_ino = last_ino_get();
+
pipe = alloc_pipe_info(inode);
if (!pipe)
goto fail_iput;
Index: linux-2.6/fs/proc/base.c
===================================================================
--- linux-2.6.orig/fs/proc/base.c 2010-10-08 19:53:46.634003986 +0200
+++ linux-2.6/fs/proc/base.c 2010-10-08 19:54:36.898253951 +0200
@@ -1600,6 +1600,7 @@ static struct inode *proc_pid_make_inode
/* Common stuff */
ei = PROC_I(inode);
+ inode->i_ino = last_ino_get();
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
inode->i_op = &proc_def_inode_operations;
@@ -2542,6 +2543,7 @@ static struct dentry *proc_base_instanti
/* Initialize the inode */
ei = PROC_I(inode);
+ inode->i_ino = last_ino_get();
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
/*
Index: linux-2.6/fs/proc/proc_sysctl.c
===================================================================
--- linux-2.6.orig/fs/proc/proc_sysctl.c 2010-10-08 19:53:46.639011878 +0200
+++ linux-2.6/fs/proc/proc_sysctl.c 2010-10-08 19:54:36.902270154 +0200
@@ -23,6 +23,8 @@ static struct inode *proc_sys_make_inode
if (!inode)
goto out;
+ inode->i_ino = last_ino_get();
+
sysctl_head_get(head);
ei = PROC_I(inode);
ei->sysctl = head;
Index: linux-2.6/fs/ramfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ramfs/inode.c 2010-10-08 19:53:46.645254091 +0200
+++ linux-2.6/fs/ramfs/inode.c 2010-10-08 19:54:36.911256675 +0200
@@ -58,6 +58,7 @@ struct inode *ramfs_get_inode(struct sup
struct inode * inode = new_inode(sb);
if (inode) {
+ inode->i_ino = last_ino_get();
inode_init_owner(inode, dir, mode);
inode->i_mapping->a_ops = &ramfs_aops;
mapping_new_set_bdi(inode->i_mapping, &ramfs_backing_dev_info);
Index: linux-2.6/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_buf.c 2010-10-08 19:53:46.652123695 +0200
+++ linux-2.6/fs/xfs/linux-2.6/xfs_buf.c 2010-10-08 19:54:36.918005662 +0200
@@ -1572,6 +1572,7 @@ xfs_mapping_buftarg(
XFS_BUFTARG_NAME(btp));
return ENOMEM;
}
+ inode->i_ino = last_ino_get();
inode->i_mode = S_IFBLK;
inode->i_bdev = bdev;
inode->i_rdev = bdev->bd_dev;
Index: linux-2.6/ipc/mqueue.c
===================================================================
--- linux-2.6.orig/ipc/mqueue.c 2010-10-08 19:53:46.656255837 +0200
+++ linux-2.6/ipc/mqueue.c 2010-10-08 19:54:36.923005872 +0200
@@ -116,6 +116,7 @@ static struct inode *mqueue_get_inode(st
inode = new_inode(sb);
if (inode) {
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
Index: linux-2.6/kernel/cgroup.c
===================================================================
--- linux-2.6.orig/kernel/cgroup.c 2010-10-08 19:53:46.662006011 +0200
+++ linux-2.6/kernel/cgroup.c 2010-10-08 19:54:36.930005802 +0200
@@ -778,6 +778,7 @@ static struct inode *cgroup_new_inode(mo
struct inode *inode = new_inode(sb);
if (inode) {
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c 2010-10-08 19:53:46.665005872 +0200
+++ linux-2.6/mm/shmem.c 2010-10-08 19:54:36.934005453 +0200
@@ -1586,6 +1586,7 @@ static struct inode *shmem_get_inode(str
inode = new_inode(sb);
if (inode) {
+ inode->i_ino = last_ino_get();
inode_init_owner(inode, dir, mode);
inode->i_blocks = 0;
mapping_new_set_bdi(inode->i_mapping, &shmem_backing_dev_info);
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c 2010-10-08 19:53:46.673283354 +0200
+++ linux-2.6/net/socket.c 2010-10-08 19:54:36.943255418 +0200
@@ -480,6 +480,7 @@ static struct socket *sock_alloc(void)
sock = SOCKET_I(inode);
kmemcheck_annotate_bitfield(sock, type);
+ inode->i_ino = last_ino_get();
inode->i_mode = S_IFSOCK | S_IRWXUGO;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
Index: linux-2.6/net/sunrpc/rpc_pipe.c
===================================================================
--- linux-2.6.orig/net/sunrpc/rpc_pipe.c 2010-10-08 19:53:46.678255208 +0200
+++ linux-2.6/net/sunrpc/rpc_pipe.c 2010-10-08 19:54:36.953256186 +0200
@@ -453,6 +453,7 @@ rpc_get_inode(struct super_block *sb, um
struct inode *inode = new_inode(sb);
if (!inode)
return NULL;
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch(mode & S_IFMT) {
Index: linux-2.6/security/inode.c
===================================================================
--- linux-2.6.orig/security/inode.c 2010-10-08 19:53:46.685256326 +0200
+++ linux-2.6/security/inode.c 2010-10-08 19:54:36.958024869 +0200
@@ -60,6 +60,7 @@ static struct inode *get_inode(struct su
struct inode *inode = new_inode(sb);
if (inode) {
+ inode->i_ino = last_ino_get();
inode->i_mode = mode;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
Index: linux-2.6/security/selinux/selinuxfs.c
===================================================================
--- linux-2.6.orig/security/selinux/selinuxfs.c 2010-10-08 19:53:46.691005662 +0200
+++ linux-2.6/security/selinux/selinuxfs.c 2010-10-08 19:54:36.965015091 +0200
@@ -785,6 +785,7 @@ static struct inode *sel_make_inode(stru
struct inode *ret = new_inode(sb);
if (ret) {
+ ret->i_ino = last_ino_get();
ret->i_mode = mode;
ret->i_atime = ret->i_mtime = ret->i_ctime = CURRENT_TIME;
}
^ permalink raw reply [flat|nested] 162+ messages in thread