linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Inode Lock Scalability V4
@ 2010-10-16  8:13 Dave Chinner
  2010-10-16  8:13 ` [PATCH 01/19] fs: switch bdev inode bdi's correctly Dave Chinner
                   ` (19 more replies)
  0 siblings, 20 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

This patch set is derived from Nick Piggin's VFS scalability tree.
there doesn't appear to be any push to get that tree into shape for
.37, so this is an attempt to start the process of finer grained
review of the series for upstream inclusion. I'm hitting VFS lock
contention problems with XFS on 8-16p machines now, so I need to get
this stuff moving.

This patch set is just the basic inode_lock breakup patches plus a
few more simple changes to the inode code. It stops short of
introducing RCU inode freeing because those changes are not
completely baked yet.

As a result, the full inode handling improvements of Nick's patch
set are not realised with this short series. However, my own testing
indicates that the amount of lock traffic and contention is down by
an order of magnitude on an 8-way box for parallel inode create and
unlink workloads, so there is still significant improvements from
just this patch set.

Version 2 of this series is a complete rework of the original patch
series.  Nick's original code nested list locks inside the the
inode->i_lock, resulting in a large mess of trylock operations to
get locks out of order all over the place. In many cases, the reason
fo this lock ordering is removed later on in Nick's series as
cleanups are introduced.

As a result I've pulled in several of the cleanups and re-ordered
the series such that cleanups, factoring and list splitting are done
before any of the locking changes. Instead of converting the inode
state flags first, I've converted them last, ensuring that
manipulations are kept inside other locks rather than outside them.

The series is made up of the following steps:

	- inode counters are made per-cpu
	- inode LRU manipulations are made lazy
	- i_list is split into two lists (grows inode by 2
	  pointers), one for tracking lru status, one for writeback
	  status
	- reference counting is factored, then renamed and locked
	  differently
	- inode hash operations are factored, then locked per bucket
	- superblock inode listis locked per-superblock
	- inode LRU is locked via a global lock
		- unclear what the best way to split this up from
		  here is, so no attempt is made to optimise
		  further.
	- inode IO list are locked via a per-BDI lock
		- further analysis needed to determine the next step
		  in optimising this list. It is extremely contended
		  under parallel workloads because foreground
		  throttling (balance_dirty_pages) causes unbound
		  writeback parallelism and contention. Fixing the
		  unbound parallelism, I think, is a more important
		  first optimisation step than making the list
		  per-cpu.
	- lock i_state operations with i_lock
	- convert last_ino allocation to a percpu counter
	- protect iunique counter with it's own lock
	- remove inode_lock
	- factor destroying an inode into dispose_one_inode() which
	  is called from reclaim, dispose_list and iput_final.

None of the patcheѕ are unchanged, and several of them are new or
completely rewritten, so any previous testing is completely
invalidated. I have not tried to optimise locking by using trylock
loops - anywhere that requires out-of-order locking drops locks and
regains the locks needed for the next operation. This approach
simplified the code and lead to several improvments in the patch
series (e.g. moving inode->i_lock inside writeback_single_inode(),
and the dispose_one_inode factoring) that would have gone unnoticed
if I'd gone down the same trylock loop path that Nick used.

I've done some testing so far on ext3, ext4 and XFS (mostly sanity
and lock_stat profile testing), but I have not tested any other
filesystems. IOWs, it is light on testing at this point. I'm sending
out for review now that it passes basic sanity tests so that
comments on the reworked approach can be made.

Version 4:
- re-added inode reference count check in writeback_single_inode()
  when the inode is clean and only attempt to add the inode to the
  LRU if the inodis unreferenced.
- moved hash_bl_[un]lock into hlist_bl.h introductory patch.
- updated documentation and comments still referencing i_count
- updated documentation and comments still referencing inode_lock
- removed a couple of unneeded include files.
- writeback_single_inode() and sync_inode are now the same, so fold
  writeback_single_inode() into sync_inode.
- moved lock ordering comments around into the patches that
  introduce the locks or change the ordering.
- cleaned up dispose_one_inode comments and layout.
- added patch to start of series to move bdev inodes around bdi's
  as they change the bdi in the inode mapping during the final put
  of the bdev. Changes to this new code propagate throw the subsequent
  scalability patches.

Version 3:
- whitespace fix in inode_init_early.
- dropped patch that moves inodes around bdi lists as problem is now
  fixed in mainline.
- added comments explaining lazy inode LRU manipulations.
- added inode_lru_list_{add,del} helpers much earlier to avoid
  needing to export then unexport inode counters.
- renamed i_io to i_wb_list.
- removed iref_locked and just open code internal inode reference
  increments.
- added a WARN_ON() condition to detect iref() being called without
  a pre-existing reference count.
- added kerneldoc comment to iref().
- dropped iref_read() wrapper function patch
- killed the inode_hash_bucket wrapper, use hlist_bl_head directly
- moved spin_[un]lock_bucket wrappers to list_bl.h, and renamed them
  hlist_bl_[un]lock()
- added inode_unhashed() helper function.
- documented use of I_FREEING to ensure removal from inode lru and
  writeback lists is kept sane when the inode is being freed.
- added inode_wb_list_del() helper to avoid exporting the
  inode_to_bdi() function.
- added comments to explain why we need to set the i_state field
  before adding new inodes to various lists
- renamed last_ino_get() to get_next_ino().
- kept invalidate_list/dispose_list pairing for invalidate_inodes(),
  but changed the dispose list to use the i_sb_list pointer in the
  inode instead of the i_lru to avoid needing to take the
  inode_lru_lock for every inode on the superblock list.
- added patch from Christoph Hellwig to spilt up inode_add_to_lists.
  Modified the new function names to match the naming convention
  used by all the other list helpers in inode.c, and added a
  matching inode_sb_list_del() function for symmetry.
- added patch from Christoph Hellwig to move inode number assignment
  in get_new_inode() to the callers that don't directly assign an
  inode number.

Version 2:
- complete rework of series

----

The following changes since commit cb655d0f3d57c23db51b981648e452988c0223f9:

  Linux 2.6.36-rc7 (2010-10-06 13:39:52 -0700)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git inode-scale

Christoph Hellwig (2):
      fs: split __inode_add_to_list
      fs: do not assign default i_ino in new_inode

Dave Chinner (12):
      fs: switch bdev inode bdi's correctly
      fs: Convert nr_inodes and nr_unused to per-cpu counters
      fs: Clean up inode reference counting
      exofs: use iput() for inode reference count decrements
      fs: rework icount to be a locked variable
      fs: Factor inode hash operations into functions
      fs: Introduce per-bucket inode hash locks
      fs: add a per-superblock lock for the inode list
      fs: split locking of inode writeback and LRU lists
      fs: Protect inode->i_state with the inode->i_lock
      fs: icache remove inode_lock
      fs: Reduce inode I_FREEING and factor inode disposal

Eric Dumazet (1):
      fs: introduce a per-cpu last_ino allocator

Nick Piggin (4):
      kernel: add bl_list
      fs: Implement lazy LRU updates for inodes.
      fs: inode split IO and LRU lists
      fs: Make iunique independent of inode_lock

 Documentation/filesystems/Locking        |    2 +-
 Documentation/filesystems/porting        |    8 +-
 Documentation/filesystems/vfs.txt        |   16 +-
 arch/powerpc/platforms/cell/spufs/file.c |    2 +-
 drivers/infiniband/hw/ipath/ipath_fs.c   |    1 +
 drivers/infiniband/hw/qib/qib_fs.c       |    1 +
 drivers/misc/ibmasm/ibmasmfs.c           |    1 +
 drivers/oprofile/oprofilefs.c            |    1 +
 drivers/usb/core/inode.c                 |    1 +
 drivers/usb/gadget/f_fs.c                |    1 +
 drivers/usb/gadget/inode.c               |    1 +
 fs/9p/vfs_inode.c                        |    5 +-
 fs/affs/inode.c                          |    2 +-
 fs/afs/dir.c                             |    2 +-
 fs/anon_inodes.c                         |    8 +-
 fs/autofs4/inode.c                       |    1 +
 fs/bfs/dir.c                             |    2 +-
 fs/binfmt_misc.c                         |    1 +
 fs/block_dev.c                           |   42 ++-
 fs/btrfs/inode.c                         |   18 +-
 fs/buffer.c                              |    2 +-
 fs/ceph/mds_client.c                     |    2 +-
 fs/cifs/inode.c                          |    2 +-
 fs/coda/dir.c                            |    2 +-
 fs/configfs/inode.c                      |    1 +
 fs/debugfs/inode.c                       |    1 +
 fs/drop_caches.c                         |   19 +-
 fs/exofs/inode.c                         |    6 +-
 fs/exofs/namei.c                         |    2 +-
 fs/ext2/namei.c                          |    2 +-
 fs/ext3/ialloc.c                         |    4 +-
 fs/ext3/namei.c                          |    2 +-
 fs/ext4/ialloc.c                         |    4 +-
 fs/ext4/mballoc.c                        |    1 +
 fs/ext4/namei.c                          |    2 +-
 fs/freevxfs/vxfs_inode.c                 |    1 +
 fs/fs-writeback.c                        |  234 +++++----
 fs/fuse/control.c                        |    1 +
 fs/gfs2/ops_inode.c                      |    2 +-
 fs/hfs/hfs_fs.h                          |    2 +-
 fs/hfs/inode.c                           |    2 +-
 fs/hfsplus/dir.c                         |    2 +-
 fs/hfsplus/hfsplus_fs.h                  |    2 +-
 fs/hfsplus/inode.c                       |    2 +-
 fs/hpfs/inode.c                          |    2 +-
 fs/hugetlbfs/inode.c                     |    1 +
 fs/inode.c                               |  781 +++++++++++++++++++-----------
 fs/internal.h                            |   11 +
 fs/jffs2/dir.c                           |    4 +-
 fs/jfs/jfs_txnmgr.c                      |    2 +-
 fs/jfs/namei.c                           |    2 +-
 fs/libfs.c                               |    2 +-
 fs/locks.c                               |    2 +-
 fs/logfs/dir.c                           |    2 +-
 fs/logfs/inode.c                         |    2 +-
 fs/logfs/readwrite.c                     |    2 +-
 fs/minix/namei.c                         |    2 +-
 fs/namei.c                               |    2 +-
 fs/nfs/dir.c                             |    2 +-
 fs/nfs/getroot.c                         |    2 +-
 fs/nfs/inode.c                           |    4 +-
 fs/nfs/nfs4state.c                       |    2 +-
 fs/nfs/write.c                           |    2 +-
 fs/nilfs2/gcdat.c                        |    1 +
 fs/nilfs2/gcinode.c                      |   22 +-
 fs/nilfs2/mdt.c                          |    5 +-
 fs/nilfs2/namei.c                        |    2 +-
 fs/nilfs2/segment.c                      |    2 +-
 fs/nilfs2/the_nilfs.h                    |    2 +-
 fs/notify/inode_mark.c                   |   46 +-
 fs/notify/mark.c                         |    1 -
 fs/notify/vfsmount_mark.c                |    1 -
 fs/ntfs/inode.c                          |   10 +-
 fs/ntfs/super.c                          |    6 +-
 fs/ocfs2/dlmfs/dlmfs.c                   |    2 +
 fs/ocfs2/inode.c                         |    2 +-
 fs/ocfs2/namei.c                         |    2 +-
 fs/pipe.c                                |    2 +
 fs/proc/base.c                           |    2 +
 fs/proc/proc_sysctl.c                    |    2 +
 fs/quota/dquot.c                         |   32 +-
 fs/ramfs/inode.c                         |    1 +
 fs/reiserfs/namei.c                      |    2 +-
 fs/reiserfs/stree.c                      |    2 +-
 fs/reiserfs/xattr.c                      |    2 +-
 fs/smbfs/inode.c                         |    2 +-
 fs/super.c                               |    1 +
 fs/sysv/namei.c                          |    2 +-
 fs/ubifs/dir.c                           |    2 +-
 fs/ubifs/super.c                         |    2 +-
 fs/udf/inode.c                           |    2 +-
 fs/udf/namei.c                           |    2 +-
 fs/ufs/namei.c                           |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c               |    1 +
 fs/xfs/linux-2.6/xfs_iops.c              |    6 +-
 fs/xfs/linux-2.6/xfs_trace.h             |    2 +-
 fs/xfs/xfs_inode.h                       |    3 +-
 include/linux/backing-dev.h              |    3 +
 include/linux/fs.h                       |   43 ++-
 include/linux/list_bl.h                  |  146 ++++++
 include/linux/poison.h                   |    2 +
 include/linux/writeback.h                |    4 -
 ipc/mqueue.c                             |    3 +-
 kernel/cgroup.c                          |    1 +
 kernel/futex.c                           |    2 +-
 kernel/sysctl.c                          |    4 +-
 mm/backing-dev.c                         |   29 +-
 mm/filemap.c                             |    6 +-
 mm/rmap.c                                |    6 +-
 mm/shmem.c                               |    7 +-
 net/socket.c                             |    3 +-
 net/sunrpc/rpc_pipe.c                    |    1 +
 security/inode.c                         |    1 +
 security/selinux/selinuxfs.c             |    1 +
 114 files changed, 1108 insertions(+), 575 deletions(-)
 create mode 100644 include/linux/list_bl.h

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 01/19] fs: switch bdev inode bdi's correctly
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
@ 2010-10-16  8:13 ` Dave Chinner
  2010-10-16  9:30   ` Nick Piggin
  2010-10-16 16:31   ` Christoph Hellwig
  2010-10-16  8:13 ` [PATCH 02/19] kernel: add bl_list Dave Chinner
                   ` (18 subsequent siblings)
  19 siblings, 2 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

bdev inodes can remain dirty even after their last close. Hence the
BDI associated with the bdev->inode gets modified duringthe last
close to point to the default BDI. However, the bdev inode still
needs to be moved to the dirty lists of the new BDI, otherwise it
will corrupt the writeback list is was left on.

Add a new function bdev_inode_switch_bdi() to move all the bdi state
from the old bdi to the new one safely. This is only a temporary
measure until the bdev inode<->bdi lifecycle problems are sorted
out.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/block_dev.c |   26 +++++++++++++++++++++-----
 1 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 50e8c85..501eab5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -48,6 +48,21 @@ inline struct block_device *I_BDEV(struct inode *inode)
 
 EXPORT_SYMBOL(I_BDEV);
 
+/*
+ * move the inode from it's current bdi to the a new bdi. if the inode is dirty
+ * we need to move it onto the dirty list of @dst so that the inode is always
+ * on the right list.
+ */
+static void bdev_inode_switch_bdi(struct inode *inode,
+			struct backing_dev_info *dst)
+{
+	spin_lock(&inode_lock);
+	inode->i_data.backing_dev_info = dst;
+	if (inode->i_state & I_DIRTY)
+		list_move(&inode->i_list, &dst->wb.b_dirty);
+	spin_unlock(&inode_lock);
+}
+
 static sector_t max_block(struct block_device *bdev)
 {
 	sector_t retval = ~((sector_t)0);
@@ -1390,7 +1405,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 				bdi = blk_get_backing_dev_info(bdev);
 				if (bdi == NULL)
 					bdi = &default_backing_dev_info;
-				bdev->bd_inode->i_data.backing_dev_info = bdi;
+				bdev_inode_switch_bdi(bdev->bd_inode, bdi);
 			}
 			if (bdev->bd_invalidated)
 				rescan_partitions(disk, bdev);
@@ -1405,8 +1420,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 			if (ret)
 				goto out_clear;
 			bdev->bd_contains = whole;
-			bdev->bd_inode->i_data.backing_dev_info =
-			   whole->bd_inode->i_data.backing_dev_info;
+			bdev_inode_switch_bdi(bdev->bd_inode,
+				whole->bd_inode->i_data.backing_dev_info);
 			bdev->bd_part = disk_get_part(disk, partno);
 			if (!(disk->flags & GENHD_FL_UP) ||
 			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
@@ -1439,7 +1454,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 	disk_put_part(bdev->bd_part);
 	bdev->bd_disk = NULL;
 	bdev->bd_part = NULL;
-	bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
+	bdev_inode_switch_bdi(bdev->bd_inode, &default_backing_dev_info);
 	if (bdev != bdev->bd_contains)
 		__blkdev_put(bdev->bd_contains, mode, 1);
 	bdev->bd_contains = NULL;
@@ -1533,7 +1548,8 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		disk_put_part(bdev->bd_part);
 		bdev->bd_part = NULL;
 		bdev->bd_disk = NULL;
-		bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
+		bdev_inode_switch_bdi(bdev->bd_inode,
+					&default_backing_dev_info);
 		if (bdev != bdev->bd_contains)
 			victim = bdev->bd_contains;
 		bdev->bd_contains = NULL;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 02/19] kernel: add bl_list
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
  2010-10-16  8:13 ` [PATCH 01/19] fs: switch bdev inode bdi's correctly Dave Chinner
@ 2010-10-16  8:13 ` Dave Chinner
  2010-10-16  9:51   ` Nick Piggin
  2010-10-16  8:13 ` [PATCH 03/19] fs: Convert nr_inodes and nr_unused to per-cpu counters Dave Chinner
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Nick Piggin <npiggin@suse.de>

Introduce a type of hlist that can support the use of the lowest bit
in the hlist_head. This will be subsequently used to implement
per-bucket bit spinlock for inode hashes.

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/list_bl.h |  145 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/poison.h  |    2 +
 2 files changed, 147 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/list_bl.h

diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
new file mode 100644
index 0000000..0d791ff
--- /dev/null
+++ b/include/linux/list_bl.h
@@ -0,0 +1,145 @@
+#ifndef _LINUX_LIST_BL_H
+#define _LINUX_LIST_BL_H
+
+#include <linux/list.h>
+#include <linux/bit_spinlock.h>
+
+/*
+ * Special version of lists, where head of the list has a bit spinlock
+ * in the lowest bit. This is useful for scalable hash tables without
+ * increasing memory footprint overhead.
+ *
+ * For modification operations, the 0 bit of hlist_bl_head->first
+ * pointer must be set.
+ */
+
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define LIST_BL_LOCKMASK	1UL
+#else
+#define LIST_BL_LOCKMASK	0UL
+#endif
+
+#ifdef CONFIG_DEBUG_LIST
+#define LIST_BL_BUG_ON(x) BUG_ON(x)
+#else
+#define LIST_BL_BUG_ON(x)
+#endif
+
+
+struct hlist_bl_head {
+	struct hlist_bl_node *first;
+};
+
+struct hlist_bl_node {
+	struct hlist_bl_node *next, **pprev;
+};
+#define INIT_HLIST_BL_HEAD(ptr) \
+	((ptr)->first = NULL)
+
+static inline void init_hlist_bl_node(struct hlist_bl_node *h)
+{
+	h->next = NULL;
+	h->pprev = NULL;
+}
+
+#define hlist_bl_entry(ptr, type, member) container_of(ptr, type, member)
+
+static inline int hlist_bl_unhashed(const struct hlist_bl_node *h)
+{
+	return !h->pprev;
+}
+
+static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
+{
+	return (struct hlist_bl_node *)
+		((unsigned long)h->first & ~LIST_BL_LOCKMASK);
+}
+
+static inline void hlist_bl_set_first(struct hlist_bl_head *h,
+					struct hlist_bl_node *n)
+{
+	LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+	LIST_BL_BUG_ON(!bit_spin_is_locked(0, (unsigned long *)&h->first));
+	h->first = (struct hlist_bl_node *)((unsigned long)n | LIST_BL_LOCKMASK);
+}
+
+static inline int hlist_bl_empty(const struct hlist_bl_head *h)
+{
+	return !((unsigned long)h->first & ~LIST_BL_LOCKMASK);
+}
+
+static inline void hlist_bl_add_head(struct hlist_bl_node *n,
+					struct hlist_bl_head *h)
+{
+	struct hlist_bl_node *first = hlist_bl_first(h);
+
+	n->next = first;
+	if (first)
+		first->pprev = &n->next;
+	n->pprev = &h->first;
+	hlist_bl_set_first(h, n);
+}
+
+static inline void __hlist_bl_del(struct hlist_bl_node *n)
+{
+	struct hlist_bl_node *next = n->next;
+	struct hlist_bl_node **pprev = n->pprev;
+
+	LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+
+	/* pprev may be `first`, so be careful not to lose the lock bit */
+	*pprev = (struct hlist_bl_node *)
+			((unsigned long)next |
+			 ((unsigned long)*pprev & LIST_BL_LOCKMASK));
+	if (next)
+		next->pprev = pprev;
+}
+
+static inline void hlist_bl_del(struct hlist_bl_node *n)
+{
+	__hlist_bl_del(n);
+	n->next = BL_LIST_POISON1;
+	n->pprev = BL_LIST_POISON2;
+}
+
+static inline void hlist_bl_del_init(struct hlist_bl_node *n)
+{
+	if (!hlist_bl_unhashed(n)) {
+		__hlist_bl_del(n);
+		init_hlist_bl_node(n);
+	}
+}
+
+/**
+ * hlist_bl_for_each_entry	- iterate over list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry(tpos, pos, head, member)		\
+	for (pos = hlist_bl_first(head);				\
+	     pos &&							\
+		({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
+	     pos = pos->next)
+
+#endif
+
+/**
+ * hlist_bl_lock	- lock a hash list
+ * @h:	hash list head to lock
+ */
+static inline void hlist_bl_lock(struct hlist_bl_head *h)
+{
+	bit_spin_lock(0, (unsigned long *)h);
+}
+
+/**
+ * hlist_bl_unlock	- unlock a hash list
+ * @h:	hash list head to unlock
+ */
+static inline void hlist_bl_unlock(struct hlist_bl_head *h)
+{
+	__bit_spin_unlock(0, (unsigned long *)h);
+}
diff --git a/include/linux/poison.h b/include/linux/poison.h
index 2110a81..d367d39 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -22,6 +22,8 @@
 #define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
 #define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)
 
+#define BL_LIST_POISON1  ((void *) 0x00300300 + POISON_POINTER_DELTA)
+#define BL_LIST_POISON2  ((void *) 0x00400400 + POISON_POINTER_DELTA)
 /********** include/linux/timer.h **********/
 /*
  * Magic number "tsta" to indicate a static timer initializer
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 03/19] fs: Convert nr_inodes and nr_unused to per-cpu counters
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
  2010-10-16  8:13 ` [PATCH 01/19] fs: switch bdev inode bdi's correctly Dave Chinner
  2010-10-16  8:13 ` [PATCH 02/19] kernel: add bl_list Dave Chinner
@ 2010-10-16  8:13 ` Dave Chinner
  2010-10-16  8:29   ` Eric Dumazet
  2010-10-16  8:13 ` [PATCH 04/19] fs: Implement lazy LRU updates for inodes Dave Chinner
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.

Based on a patch originally from Eric Dumazet.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/fs-writeback.c  |    5 +--
 fs/inode.c         |   64 ++++++++++++++++++++++++++++++++++++---------------
 include/linux/fs.h |    4 ++-
 kernel/sysctl.c    |    4 +-
 4 files changed, 52 insertions(+), 25 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ab38fef..58a95b7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -723,7 +723,7 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
 	wb->last_old_flush = jiffies;
 	nr_pages = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+			get_nr_dirty_inodes();
 
 	if (nr_pages) {
 		struct wb_writeback_work work = {
@@ -1090,8 +1090,7 @@ void writeback_inodes_sb(struct super_block *sb)
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	work.nr_pages = nr_dirty + nr_unstable +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+	work.nr_pages = nr_dirty + nr_unstable + get_nr_dirty_inodes();
 
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
diff --git a/fs/inode.c b/fs/inode.c
index 8646433..b3b6a4b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -103,8 +103,41 @@ static DECLARE_RWSEM(iprune_sem);
  */
 struct inodes_stat_t inodes_stat;
 
+static struct percpu_counter nr_inodes __cacheline_aligned_in_smp;
+static struct percpu_counter nr_inodes_unused __cacheline_aligned_in_smp;
+
 static struct kmem_cache *inode_cachep __read_mostly;
 
+static inline int get_nr_inodes(void)
+{
+	return percpu_counter_sum_positive(&nr_inodes);
+}
+
+static inline int get_nr_inodes_unused(void)
+{
+	return percpu_counter_sum_positive(&nr_inodes_unused);
+}
+
+int get_nr_dirty_inodes(void)
+{
+	int nr_dirty = get_nr_inodes() - get_nr_inodes_unused();
+	return nr_dirty > 0 ? nr_dirty : 0;
+
+}
+
+/*
+ * Handle nr_inode sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_inodes(ctl_table *table, int write,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	inodes_stat.nr_inodes = get_nr_inodes();
+	inodes_stat.nr_unused = get_nr_inodes_unused();
+	return proc_dointvec(table, write, buffer, lenp, ppos);
+}
+#endif
+
 static void wake_up_inode(struct inode *inode)
 {
 	/*
@@ -192,6 +225,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_fsnotify_mask = 0;
 #endif
 
+	percpu_counter_inc(&nr_inodes);
+
 	return 0;
 out:
 	return -ENOMEM;
@@ -232,6 +267,7 @@ void __destroy_inode(struct inode *inode)
 	if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
 		posix_acl_release(inode->i_default_acl);
 #endif
+	percpu_counter_dec(&nr_inodes);
 }
 EXPORT_SYMBOL(__destroy_inode);
 
@@ -286,7 +322,7 @@ void __iget(struct inode *inode)
 
 	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 		list_move(&inode->i_list, &inode_in_use);
-	inodes_stat.nr_unused--;
+	percpu_counter_dec(&nr_inodes_unused);
 }
 
 void end_writeback(struct inode *inode)
@@ -327,8 +363,6 @@ static void evict(struct inode *inode)
  */
 static void dispose_list(struct list_head *head)
 {
-	int nr_disposed = 0;
-
 	while (!list_empty(head)) {
 		struct inode *inode;
 
@@ -344,11 +378,7 @@ static void dispose_list(struct list_head *head)
 
 		wake_up_inode(inode);
 		destroy_inode(inode);
-		nr_disposed++;
 	}
-	spin_lock(&inode_lock);
-	inodes_stat.nr_inodes -= nr_disposed;
-	spin_unlock(&inode_lock);
 }
 
 /*
@@ -357,7 +387,7 @@ static void dispose_list(struct list_head *head)
 static int invalidate_list(struct list_head *head, struct list_head *dispose)
 {
 	struct list_head *next;
-	int busy = 0, count = 0;
+	int busy = 0;
 
 	next = head->next;
 	for (;;) {
@@ -383,13 +413,11 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 			list_move(&inode->i_list, dispose);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
-			count++;
+			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
 		busy = 1;
 	}
-	/* only unused inodes may be cached with i_count zero */
-	inodes_stat.nr_unused -= count;
 	return busy;
 }
 
@@ -448,7 +476,6 @@ static int can_unuse(struct inode *inode)
 static void prune_icache(int nr_to_scan)
 {
 	LIST_HEAD(freeable);
-	int nr_pruned = 0;
 	int nr_scanned;
 	unsigned long reap = 0;
 
@@ -484,9 +511,8 @@ static void prune_icache(int nr_to_scan)
 		list_move(&inode->i_list, &freeable);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
-		nr_pruned++;
+		percpu_counter_dec(&nr_inodes_unused);
 	}
-	inodes_stat.nr_unused -= nr_pruned;
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
@@ -518,7 +544,7 @@ static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 			return -1;
 		prune_icache(nr);
 	}
-	return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
+	return (get_nr_inodes_unused() / 100) * sysctl_vfs_cache_pressure;
 }
 
 static struct shrinker icache_shrinker = {
@@ -595,7 +621,6 @@ static inline void
 __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
 			struct inode *inode)
 {
-	inodes_stat.nr_inodes++;
 	list_add(&inode->i_list, &inode_in_use);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	if (head)
@@ -1215,7 +1240,7 @@ static void iput_final(struct inode *inode)
 	if (!drop) {
 		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 			list_move(&inode->i_list, &inode_unused);
-		inodes_stat.nr_unused++;
+		percpu_counter_inc(&nr_inodes_unused);
 		if (sb->s_flags & MS_ACTIVE) {
 			spin_unlock(&inode_lock);
 			return;
@@ -1227,14 +1252,13 @@ static void iput_final(struct inode *inode)
 		spin_lock(&inode_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		inodes_stat.nr_unused--;
+		percpu_counter_dec(&nr_inodes_unused);
 		hlist_del_init(&inode->i_hash);
 	}
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
 	evict(inode);
 	spin_lock(&inode_lock);
@@ -1503,6 +1527,8 @@ void __init inode_init(void)
 					 SLAB_MEM_SPREAD),
 					 init_once);
 	register_shrinker(&icache_shrinker);
+	percpu_counter_init(&nr_inodes, 0);
+	percpu_counter_init(&nr_inodes_unused, 0);
 
 	/* Hash may have been set up in inode_init_early */
 	if (!hashdist)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63d069b..1fb92f9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -407,6 +407,7 @@ extern struct files_stat_struct files_stat;
 extern int get_max_files(void);
 extern int sysctl_nr_open;
 extern struct inodes_stat_t inodes_stat;
+extern int get_nr_dirty_inodes(void);
 extern int leases_enable, lease_break_time;
 
 struct buffer_head;
@@ -2474,7 +2475,8 @@ ssize_t simple_attr_write(struct file *file, const char __user *buf,
 struct ctl_table;
 int proc_nr_files(struct ctl_table *table, int write,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
-
+int proc_nr_inodes(struct ctl_table *table, int write,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);
 int __init get_filesystem_list(char *buf);
 
 #define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f88552c..33d1733 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1340,14 +1340,14 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 2*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_nr_inodes,
 	},
 	{
 		.procname	= "inode-state",
 		.data		= &inodes_stat,
 		.maxlen		= 7*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_nr_inodes,
 	},
 	{
 		.procname	= "file-nr",
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 04/19] fs: Implement lazy LRU updates for inodes.
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (2 preceding siblings ...)
  2010-10-16  8:13 ` [PATCH 03/19] fs: Convert nr_inodes and nr_unused to per-cpu counters Dave Chinner
@ 2010-10-16  8:13 ` Dave Chinner
  2010-10-16  9:29   ` Nick Piggin
  2010-10-16  8:13 ` [PATCH 05/19] fs: inode split IO and LRU lists Dave Chinner
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Nick Piggin <npiggin@suse.de>

Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic.  We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount contention
on LRU updates.

This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under it's own lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/fs-writeback.c         |   14 +++---
 fs/inode.c                |   92 +++++++++++++++++++++++++++++++--------------
 fs/internal.h             |    6 +++
 include/linux/fs.h        |   13 +++---
 include/linux/writeback.h |    1 -
 5 files changed, 84 insertions(+), 42 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 58a95b7..33e9857 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -408,16 +408,16 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * completion.
 			 */
 			redirty_tail(inode);
-		} else if (atomic_read(&inode->i_count)) {
-			/*
-			 * The inode is clean, inuse
-			 */
-			list_move(&inode->i_list, &inode_in_use);
 		} else {
 			/*
-			 * The inode is clean, unused
+			 * The inode is clean. If it is unused, then make sure
+			 * that it is put on the LRU correctly as iput_final()
+			 * does not move dirty inodes to the LRU and dirty
+			 * inodes are removed from the LRU during scanning.
 			 */
-			list_move(&inode->i_list, &inode_unused);
+			list_del_init(&inode->i_list);
+			if (!atomic_read(&inode->i_count))
+				inode_lru_list_add(inode);
 		}
 	}
 	inode_sync_complete(inode);
diff --git a/fs/inode.c b/fs/inode.c
index b3b6a4b..5dd74b4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -72,7 +72,6 @@ static unsigned int i_hash_shift __read_mostly;
  * allowing for low-overhead inode sync() operations.
  */
 
-LIST_HEAD(inode_in_use);
 LIST_HEAD(inode_unused);
 static struct hlist_head *inode_hashtable __read_mostly;
 
@@ -291,6 +290,7 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_dentry);
 	INIT_LIST_HEAD(&inode->i_devices);
+	INIT_LIST_HEAD(&inode->i_list);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
@@ -317,12 +317,21 @@ static void init_once(void *foo)
  */
 void __iget(struct inode *inode)
 {
-	if (atomic_inc_return(&inode->i_count) != 1)
-		return;
+	atomic_inc(&inode->i_count);
+}
 
-	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
-		list_move(&inode->i_list, &inode_in_use);
-	percpu_counter_dec(&nr_inodes_unused);
+void inode_lru_list_add(struct inode *inode)
+{
+	list_add(&inode->i_list, &inode_unused);
+	percpu_counter_inc(&nr_inodes_unused);
+}
+
+void inode_lru_list_del(struct inode *inode)
+{
+	if (!list_empty(&inode->i_list)) {
+		list_del_init(&inode->i_list);
+		percpu_counter_dec(&nr_inodes_unused);
+	}
 }
 
 void end_writeback(struct inode *inode)
@@ -367,7 +376,7 @@ static void dispose_list(struct list_head *head)
 		struct inode *inode;
 
 		inode = list_first_entry(head, struct inode, i_list);
-		list_del(&inode->i_list);
+		list_del_init(&inode->i_list);
 
 		evict(inode);
 
@@ -410,9 +419,9 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 			continue;
 		invalidate_inode_buffers(inode);
 		if (!atomic_read(&inode->i_count)) {
-			list_move(&inode->i_list, dispose);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
+			list_move(&inode->i_list, dispose);
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
@@ -461,17 +470,20 @@ static int can_unuse(struct inode *inode)
 }
 
 /*
- * Scan `goal' inodes on the unused list for freeable ones. They are moved to
- * a temporary list and then are freed outside inode_lock by dispose_list().
+ * Scan `goal' inodes on the unused list for freeable ones. They are moved to a
+ * temporary list and then are freed outside inode_lock by dispose_list().
  *
  * Any inodes which are pinned purely because of attached pagecache have their
- * pagecache removed.  We expect the final iput() on that inode to add it to
- * the front of the inode_unused list.  So look for it there and if the
- * inode is still freeable, proceed.  The right inode is found 99.9% of the
- * time in testing on a 4-way.
+ * pagecache removed.  If the inode has metadata buffers attached to
+ * mapping->private_list then try to remove them.
  *
- * If the inode has metadata buffers attached to mapping->private_list then
- * try to remove them.
+ * If the inode has the I_REFERENCED flag set, it means that it has been used
+ * recently - the flag is set in iput_final(). When we encounter such an inode,
+ * clear the flag and move it to the back of the LRU so it gets another pass
+ * through the LRU before it gets reclaimed. This is necessary because of the
+ * fact we are doing lazy LRU updates to minimise lock contention so the LRU
+ * does not have strict ordering. Hence we don't want to reclaim inodes with
+ * this flag set because they are the inodes that are out of order...
  */
 static void prune_icache(int nr_to_scan)
 {
@@ -489,8 +501,21 @@ static void prune_icache(int nr_to_scan)
 
 		inode = list_entry(inode_unused.prev, struct inode, i_list);
 
-		if (inode->i_state || atomic_read(&inode->i_count)) {
+		/*
+		 * Referenced or dirty inodes are still in use. Give them
+		 * another pass through the LRU as we canot reclaim them now.
+		 */
+		if (atomic_read(&inode->i_count) ||
+		    (inode->i_state & ~I_REFERENCED)) {
+			list_del_init(&inode->i_list);
+			percpu_counter_dec(&nr_inodes_unused);
+			continue;
+		}
+
+		/* recently referenced inodes get one more pass */
+		if (inode->i_state & I_REFERENCED) {
 			list_move(&inode->i_list, &inode_unused);
+			inode->i_state &= ~I_REFERENCED;
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
@@ -502,11 +527,15 @@ static void prune_icache(int nr_to_scan)
 			iput(inode);
 			spin_lock(&inode_lock);
 
-			if (inode != list_entry(inode_unused.next,
-						struct inode, i_list))
-				continue;	/* wrong inode or list_empty */
-			if (!can_unuse(inode))
+			/*
+			 * if we can't reclaim this inode immediately, give it
+			 * another pass through the free list so we don't spin
+			 * on it.
+			 */
+			if (!can_unuse(inode)) {
+				list_move(&inode->i_list, &inode_unused);
 				continue;
+			}
 		}
 		list_move(&inode->i_list, &freeable);
 		WARN_ON(inode->i_state & I_NEW);
@@ -621,7 +650,6 @@ static inline void
 __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
 			struct inode *inode)
 {
-	list_add(&inode->i_list, &inode_in_use);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	if (head)
 		hlist_add_head(&inode->i_hash, head);
@@ -1238,10 +1266,12 @@ static void iput_final(struct inode *inode)
 		drop = generic_drop_inode(inode);
 
 	if (!drop) {
-		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
-			list_move(&inode->i_list, &inode_unused);
-		percpu_counter_inc(&nr_inodes_unused);
 		if (sb->s_flags & MS_ACTIVE) {
+			inode->i_state |= I_REFERENCED;
+			if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
+				list_del_init(&inode->i_list);
+				inode_lru_list_add(inode);
+			}
 			spin_unlock(&inode_lock);
 			return;
 		}
@@ -1252,13 +1282,19 @@ static void iput_final(struct inode *inode)
 		spin_lock(&inode_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		percpu_counter_dec(&nr_inodes_unused);
 		hlist_del_init(&inode->i_hash);
 	}
-	list_del_init(&inode->i_list);
-	list_del_init(&inode->i_sb_list);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
+
+	/*
+	 * After we delete the inode from the LRU here, we avoid moving dirty
+	 * inodes back onto the LRU now because I_FREEING is set and hence
+	 * writeback_single_inode() won't move the inode around.
+	 */
+	inode_lru_list_del(inode);
+
+	list_del_init(&inode->i_sb_list);
 	spin_unlock(&inode_lock);
 	evict(inode);
 	spin_lock(&inode_lock);
diff --git a/fs/internal.h b/fs/internal.h
index a6910e9..ece3565 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -101,3 +101,9 @@ extern void put_super(struct super_block *sb);
 struct nameidata;
 extern struct file *nameidata_to_filp(struct nameidata *);
 extern void release_open_intent(struct nameidata *);
+
+/*
+ * inode.c
+ */
+extern void inode_lru_list_add(struct inode *inode);
+extern void inode_lru_list_del(struct inode *inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1fb92f9..af1d516 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1632,16 +1632,17 @@ struct super_operations {
  *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
-#define I_DIRTY_SYNC		1
-#define I_DIRTY_DATASYNC	2
-#define I_DIRTY_PAGES		4
+#define I_DIRTY_SYNC		0x01
+#define I_DIRTY_DATASYNC	0x02
+#define I_DIRTY_PAGES		0x04
 #define __I_NEW			3
 #define I_NEW			(1 << __I_NEW)
-#define I_WILL_FREE		16
-#define I_FREEING		32
-#define I_CLEAR			64
+#define I_WILL_FREE		0x10
+#define I_FREEING		0x20
+#define I_CLEAR			0x40
 #define __I_SYNC		7
 #define I_SYNC			(1 << __I_SYNC)
+#define I_REFERENCED		0x100
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 72a5d64..f956b66 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -10,7 +10,6 @@
 struct backing_dev_info;
 
 extern spinlock_t inode_lock;
-extern struct list_head inode_in_use;
 extern struct list_head inode_unused;
 
 /*
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 05/19] fs: inode split IO and LRU lists
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (3 preceding siblings ...)
  2010-10-16  8:13 ` [PATCH 04/19] fs: Implement lazy LRU updates for inodes Dave Chinner
@ 2010-10-16  8:13 ` Dave Chinner
  2010-10-16  8:14 ` [PATCH 06/19] fs: Clean up inode reference counting Dave Chinner
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Nick Piggin <npiggin@suse.de>

The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/block_dev.c            |    4 ++--
 fs/fs-writeback.c         |   27 ++++++++++++++-------------
 fs/inode.c                |   40 ++++++++++++++++++++++------------------
 fs/nilfs2/mdt.c           |    3 ++-
 include/linux/fs.h        |    3 ++-
 include/linux/writeback.h |    1 -
 mm/backing-dev.c          |    6 +++---
 7 files changed, 45 insertions(+), 39 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 501eab5..63b1c4c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -58,8 +58,8 @@ static void bdev_inode_switch_bdi(struct inode *inode,
 {
 	spin_lock(&inode_lock);
 	inode->i_data.backing_dev_info = dst;
-	if (inode->i_state & I_DIRTY)
-		list_move(&inode->i_list, &dst->wb.b_dirty);
+	if (!list_empty(&inode->i_wb_list))
+		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
 	spin_unlock(&inode_lock);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 33e9857..92d73b6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -172,11 +172,11 @@ static void redirty_tail(struct inode *inode)
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
-		tail = list_entry(wb->b_dirty.next, struct inode, i_list);
+		tail = list_entry(wb->b_dirty.next, struct inode, i_wb_list);
 		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_list, &wb->b_dirty);
+	list_move(&inode->i_wb_list, &wb->b_dirty);
 }
 
 /*
@@ -186,7 +186,7 @@ static void requeue_io(struct inode *inode)
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 
-	list_move(&inode->i_list, &wb->b_more_io);
+	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -227,14 +227,15 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 	int do_sb_sort = 0;
 
 	while (!list_empty(delaying_queue)) {
-		inode = list_entry(delaying_queue->prev, struct inode, i_list);
+		inode = list_entry(delaying_queue->prev,
+						struct inode, i_wb_list);
 		if (older_than_this &&
 		    inode_dirtied_after(inode, *older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
-		list_move(&inode->i_list, &tmp);
+		list_move(&inode->i_wb_list, &tmp);
 	}
 
 	/* just one sb in list, splice to dispatch_queue and we're done */
@@ -245,12 +246,12 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 
 	/* Move inodes from one superblock together */
 	while (!list_empty(&tmp)) {
-		inode = list_entry(tmp.prev, struct inode, i_list);
+		inode = list_entry(tmp.prev, struct inode, i_wb_list);
 		sb = inode->i_sb;
 		list_for_each_prev_safe(pos, node, &tmp) {
-			inode = list_entry(pos, struct inode, i_list);
+			inode = list_entry(pos, struct inode, i_wb_list);
 			if (inode->i_sb == sb)
-				list_move(&inode->i_list, dispatch_queue);
+				list_move(&inode->i_wb_list, dispatch_queue);
 		}
 	}
 }
@@ -415,7 +416,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * does not move dirty inodes to the LRU and dirty
 			 * inodes are removed from the LRU during scanning.
 			 */
-			list_del_init(&inode->i_list);
+			list_del_init(&inode->i_wb_list);
 			if (!atomic_read(&inode->i_count))
 				inode_lru_list_add(inode);
 		}
@@ -466,7 +467,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 	while (!list_empty(&wb->b_io)) {
 		long pages_skipped;
 		struct inode *inode = list_entry(wb->b_io.prev,
-						 struct inode, i_list);
+						 struct inode, i_wb_list);
 
 		if (inode->i_sb != sb) {
 			if (only_this_sb) {
@@ -537,7 +538,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = list_entry(wb->b_io.prev,
-						 struct inode, i_list);
+						 struct inode, i_wb_list);
 		struct super_block *sb = inode->i_sb;
 
 		if (!pin_sb_for_writeback(sb)) {
@@ -676,7 +677,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		spin_lock(&inode_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = list_entry(wb->b_more_io.prev,
-						struct inode, i_list);
+						struct inode, i_wb_list);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			inode_wait_for_writeback(inode);
 		}
@@ -990,7 +991,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 			}
 
 			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list, &bdi->wb.b_dirty);
+			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
 		}
 	}
 out:
diff --git a/fs/inode.c b/fs/inode.c
index 5dd74b4..fd65368 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -72,7 +72,7 @@ static unsigned int i_hash_shift __read_mostly;
  * allowing for low-overhead inode sync() operations.
  */
 
-LIST_HEAD(inode_unused);
+static LIST_HEAD(inode_lru);
 static struct hlist_head *inode_hashtable __read_mostly;
 
 /*
@@ -272,6 +272,7 @@ EXPORT_SYMBOL(__destroy_inode);
 
 void destroy_inode(struct inode *inode)
 {
+	BUG_ON(!list_empty(&inode->i_lru));
 	__destroy_inode(inode);
 	if (inode->i_sb->s_op->destroy_inode)
 		inode->i_sb->s_op->destroy_inode(inode);
@@ -290,7 +291,8 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_dentry);
 	INIT_LIST_HEAD(&inode->i_devices);
-	INIT_LIST_HEAD(&inode->i_list);
+	INIT_LIST_HEAD(&inode->i_wb_list);
+	INIT_LIST_HEAD(&inode->i_lru);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
@@ -322,14 +324,16 @@ void __iget(struct inode *inode)
 
 void inode_lru_list_add(struct inode *inode)
 {
-	list_add(&inode->i_list, &inode_unused);
-	percpu_counter_inc(&nr_inodes_unused);
+	if (list_empty(&inode->i_lru)) {
+		list_add(&inode->i_lru, &inode_lru);
+		percpu_counter_inc(&nr_inodes_unused);
+	}
 }
 
 void inode_lru_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_list)) {
-		list_del_init(&inode->i_list);
+	if (!list_empty(&inode->i_lru)) {
+		list_del_init(&inode->i_lru);
 		percpu_counter_dec(&nr_inodes_unused);
 	}
 }
@@ -375,8 +379,8 @@ static void dispose_list(struct list_head *head)
 	while (!list_empty(head)) {
 		struct inode *inode;
 
-		inode = list_first_entry(head, struct inode, i_list);
-		list_del_init(&inode->i_list);
+		inode = list_first_entry(head, struct inode, i_lru);
+		list_del_init(&inode->i_lru);
 
 		evict(inode);
 
@@ -421,7 +425,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 		if (!atomic_read(&inode->i_count)) {
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
-			list_move(&inode->i_list, dispose);
+			list_move(&inode->i_lru, dispose);
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
@@ -496,10 +500,10 @@ static void prune_icache(int nr_to_scan)
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
 
-		if (list_empty(&inode_unused))
+		if (list_empty(&inode_lru))
 			break;
 
-		inode = list_entry(inode_unused.prev, struct inode, i_list);
+		inode = list_entry(inode_lru.prev, struct inode, i_lru);
 
 		/*
 		 * Referenced or dirty inodes are still in use. Give them
@@ -507,14 +511,14 @@ static void prune_icache(int nr_to_scan)
 		 */
 		if (atomic_read(&inode->i_count) ||
 		    (inode->i_state & ~I_REFERENCED)) {
-			list_del_init(&inode->i_list);
+			list_del_init(&inode->i_lru);
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
 
 		/* recently referenced inodes get one more pass */
 		if (inode->i_state & I_REFERENCED) {
-			list_move(&inode->i_list, &inode_unused);
+			list_move(&inode->i_lru, &inode_lru);
 			inode->i_state &= ~I_REFERENCED;
 			continue;
 		}
@@ -533,11 +537,12 @@ static void prune_icache(int nr_to_scan)
 			 * on it.
 			 */
 			if (!can_unuse(inode)) {
-				list_move(&inode->i_list, &inode_unused);
+				list_move(&inode->i_lru, &inode_lru);
 				continue;
 			}
 		}
-		list_move(&inode->i_list, &freeable);
+		list_move(&inode->i_lru, &freeable);
+		list_del_init(&inode->i_wb_list);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
 		percpu_counter_dec(&nr_inodes_unused);
@@ -1268,10 +1273,8 @@ static void iput_final(struct inode *inode)
 	if (!drop) {
 		if (sb->s_flags & MS_ACTIVE) {
 			inode->i_state |= I_REFERENCED;
-			if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
-				list_del_init(&inode->i_list);
+			if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 				inode_lru_list_add(inode);
-			}
 			spin_unlock(&inode_lock);
 			return;
 		}
@@ -1284,6 +1287,7 @@ static void iput_final(struct inode *inode)
 		inode->i_state &= ~I_WILL_FREE;
 		hlist_del_init(&inode->i_hash);
 	}
+	list_del_init(&inode->i_wb_list);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index d01aff4..62756b4 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -504,7 +504,8 @@ nilfs_mdt_new_common(struct the_nilfs *nilfs, struct super_block *sb,
 #endif
 		inode->dirtied_when = 0;
 
-		INIT_LIST_HEAD(&inode->i_list);
+		INIT_LIST_HEAD(&inode->i_wb_list);
+		INIT_LIST_HEAD(&inode->i_lru);
 		INIT_LIST_HEAD(&inode->i_sb_list);
 		inode->i_state = 0;
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index af1d516..90d2b47 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -725,7 +725,8 @@ struct posix_acl;
 
 struct inode {
 	struct hlist_node	i_hash;
-	struct list_head	i_list;		/* backing dev IO list */
+	struct list_head	i_wb_list;	/* backing dev IO list */
+	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
 	struct list_head	i_dentry;
 	unsigned long		i_ino;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index f956b66..242b6f8 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -10,7 +10,6 @@
 struct backing_dev_info;
 
 extern spinlock_t inode_lock;
-extern struct list_head inode_unused;
 
 /*
  * fs/fs-writeback.c
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 65d4204..15d5097 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -74,11 +74,11 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
 	spin_lock(&inode_lock);
-	list_for_each_entry(inode, &wb->b_dirty, i_list)
+	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
-	list_for_each_entry(inode, &wb->b_io, i_list)
+	list_for_each_entry(inode, &wb->b_io, i_wb_list)
 		nr_io++;
-	list_for_each_entry(inode, &wb->b_more_io, i_list)
+	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
 	spin_unlock(&inode_lock);
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 06/19] fs: Clean up inode reference counting
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (4 preceding siblings ...)
  2010-10-16  8:13 ` [PATCH 05/19] fs: inode split IO and LRU lists Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-16  8:14 ` [PATCH 07/19] exofs: use iput() for inode reference count decrements Dave Chinner
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Lots of filesystem code open codes the act of getting a reference to
an inode.  Factor the open coded inode lock, increment, unlock into
a function iref(). This removes most direct external references to
the inode reference count.

Originally based on a patch from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/9p/vfs_inode.c           |    5 +++--
 fs/affs/inode.c             |    2 +-
 fs/afs/dir.c                |    2 +-
 fs/anon_inodes.c            |    7 +------
 fs/bfs/dir.c                |    2 +-
 fs/block_dev.c              |   13 ++++++-------
 fs/btrfs/inode.c            |    2 +-
 fs/coda/dir.c               |    2 +-
 fs/drop_caches.c            |    2 +-
 fs/exofs/inode.c            |    2 +-
 fs/exofs/namei.c            |    2 +-
 fs/ext2/namei.c             |    2 +-
 fs/ext3/namei.c             |    2 +-
 fs/ext4/namei.c             |    2 +-
 fs/fs-writeback.c           |    7 +++----
 fs/gfs2/ops_inode.c         |    2 +-
 fs/hfsplus/dir.c            |    2 +-
 fs/inode.c                  |   34 ++++++++++++++++++++++------------
 fs/jffs2/dir.c              |    4 ++--
 fs/jfs/jfs_txnmgr.c         |    2 +-
 fs/jfs/namei.c              |    2 +-
 fs/libfs.c                  |    2 +-
 fs/logfs/dir.c              |    2 +-
 fs/minix/namei.c            |    2 +-
 fs/namei.c                  |    2 +-
 fs/nfs/dir.c                |    2 +-
 fs/nfs/getroot.c            |    2 +-
 fs/nfs/write.c              |    2 +-
 fs/nilfs2/namei.c           |    2 +-
 fs/notify/inode_mark.c      |    8 ++++----
 fs/ntfs/super.c             |    4 ++--
 fs/ocfs2/namei.c            |    2 +-
 fs/quota/dquot.c            |    2 +-
 fs/reiserfs/namei.c         |    2 +-
 fs/sysv/namei.c             |    2 +-
 fs/ubifs/dir.c              |    2 +-
 fs/udf/namei.c              |    2 +-
 fs/ufs/namei.c              |    2 +-
 fs/xfs/linux-2.6/xfs_iops.c |    2 +-
 fs/xfs/xfs_inode.h          |    2 +-
 include/linux/fs.h          |    2 +-
 ipc/mqueue.c                |    2 +-
 kernel/futex.c              |    2 +-
 mm/shmem.c                  |    2 +-
 net/socket.c                |    2 +-
 45 files changed, 80 insertions(+), 76 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 9e670d5..1f76624 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1789,9 +1789,10 @@ v9fs_vfs_link_dotl(struct dentry *old_dentry, struct inode *dir,
 		kfree(st);
 	} else {
 		/* Caching disabled. No need to get upto date stat info.
-		 * This dentry will be released immediately. So, just i_count++
+		 * This dentry will be released immediately. So, just take
+		 * a reference.
 		 */
-		atomic_inc(&old_dentry->d_inode->i_count);
+		iref(old_dentry->d_inode);
 	}
 
 	dentry->d_op = old_dentry->d_op;
diff --git a/fs/affs/inode.c b/fs/affs/inode.c
index 3a0fdec..2100852 100644
--- a/fs/affs/inode.c
+++ b/fs/affs/inode.c
@@ -388,7 +388,7 @@ affs_add_entry(struct inode *dir, struct inode *inode, struct dentry *dentry, s3
 		affs_adjust_checksum(inode_bh, block - be32_to_cpu(chain));
 		mark_buffer_dirty_inode(inode_bh, inode);
 		inode->i_nlink = 2;
-		atomic_inc(&inode->i_count);
+		iref(inode);
 	}
 	affs_fix_checksum(sb, bh);
 	mark_buffer_dirty_inode(bh, inode);
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 0d38c09..87d8c03 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -1045,7 +1045,7 @@ static int afs_link(struct dentry *from, struct inode *dir,
 	if (ret < 0)
 		goto link_error;
 
-	atomic_inc(&vnode->vfs_inode.i_count);
+	iref(&vnode->vfs_inode);
 	d_instantiate(dentry, &vnode->vfs_inode);
 	key_put(key);
 	_leave(" = 0");
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index e4b75d6..451be78 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -109,12 +109,7 @@ struct file *anon_inode_getfile(const char *name,
 		goto err_module;
 
 	path.mnt = mntget(anon_inode_mnt);
-	/*
-	 * We know the anon_inode inode count is always greater than zero,
-	 * so we can avoid doing an igrab() and we can use an open-coded
-	 * atomic_inc().
-	 */
-	atomic_inc(&anon_inode_inode->i_count);
+	iref(anon_inode_inode);
 
 	path.dentry->d_op = &anon_inodefs_dentry_operations;
 	d_instantiate(path.dentry, anon_inode_inode);
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index d967e05..6e93a37 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -176,7 +176,7 @@ static int bfs_link(struct dentry *old, struct inode *dir,
 	inc_nlink(inode);
 	inode->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(new, inode);
 	mutex_unlock(&info->bfs_lock);
 	return 0;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 63b1c4c..11ad53d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -565,7 +565,7 @@ EXPORT_SYMBOL(bdget);
  */
 struct block_device *bdgrab(struct block_device *bdev)
 {
-	atomic_inc(&bdev->bd_inode->i_count);
+	iref(bdev->bd_inode);
 	return bdev;
 }
 
@@ -595,7 +595,7 @@ static struct block_device *bd_acquire(struct inode *inode)
 	spin_lock(&bdev_lock);
 	bdev = inode->i_bdev;
 	if (bdev) {
-		atomic_inc(&bdev->bd_inode->i_count);
+		bdgrab(bdev);
 		spin_unlock(&bdev_lock);
 		return bdev;
 	}
@@ -606,12 +606,11 @@ static struct block_device *bd_acquire(struct inode *inode)
 		spin_lock(&bdev_lock);
 		if (!inode->i_bdev) {
 			/*
-			 * We take an additional bd_inode->i_count for inode,
-			 * and it's released in clear_inode() of inode.
-			 * So, we can access it via ->i_mapping always
-			 * without igrab().
+			 * We take an additional bdev reference here so
+			 * we can access it via ->i_mapping always
+			 * without first needing to grab a reference.
 			 */
-			atomic_inc(&bdev->bd_inode->i_count);
+			bdgrab(bdev);
 			inode->i_bdev = bdev;
 			inode->i_mapping = bdev->bd_inode->i_mapping;
 			list_add(&inode->i_devices, &bdev->bd_inodes);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c038644..80e28bf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4758,7 +4758,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
 	}
 
 	btrfs_set_trans_block_group(trans, dir);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = btrfs_add_nondir(trans, dentry, inode, 1, index);
 
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index ccd98b0..ac8b913 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -303,7 +303,7 @@ static int coda_link(struct dentry *source_de, struct inode *dir_inode,
 	}
 
 	coda_dir_update_mtime(dir_inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(de, inode);
 	inc_nlink(inode);
 
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 2195c21..c2721fa 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -22,7 +22,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index eb7368e..b631ff3 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1154,7 +1154,7 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)
 	/* increment the refcount so that the inode will still be around when we
 	 * reach the callback
 	 */
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	ios->done = create_done;
 	ios->private = inode;
diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
index b7dd0c2..f2a30a0 100644
--- a/fs/exofs/namei.c
+++ b/fs/exofs/namei.c
@@ -153,7 +153,7 @@ static int exofs_link(struct dentry *old_dentry, struct inode *dir,
 
 	inode->i_ctime = CURRENT_TIME;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	return exofs_add_nondir(dentry, inode);
 }
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 71efb0e..b15435f 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -206,7 +206,7 @@ static int ext2_link (struct dentry * old_dentry, struct inode * dir,
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = ext2_add_link(dentry, inode);
 	if (!err) {
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 2b35ddb..6c7a5d6 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -2260,7 +2260,7 @@ retry:
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = ext3_add_entry(handle, dentry, inode);
 	if (!err) {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 314c0d3..a406a85 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2312,7 +2312,7 @@ retry:
 
 	inode->i_ctime = ext4_current_time(inode);
 	ext4_inc_count(handle, inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = ext4_add_entry(handle, dentry, inode);
 	if (!err) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 92d73b6..595dfc6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -298,8 +298,7 @@ static void inode_wait_for_writeback(struct inode *inode)
 
 /*
  * Write out an inode's dirty pages.  Called under inode_lock.  Either the
- * caller has ref on the inode (either via __iget or via syscall against an fd)
- * or the inode has I_WILL_FREE set (via generic_forget_inode)
+ * caller has a reference on the inode or the inode has I_WILL_FREE set.
  *
  * If `wait' is set, wait on the writeout.
  *
@@ -500,7 +499,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			return 1;
 
 		BUG_ON(inode->i_state & I_FREEING);
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 		pages_skipped = wbc->pages_skipped;
 		writeback_single_inode(inode, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
@@ -1046,7 +1045,7 @@ static void wait_sb_inodes(struct super_block *sb)
 		mapping = inode->i_mapping;
 		if (mapping->nrpages == 0)
 			continue;
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 		spin_unlock(&inode_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have
diff --git a/fs/gfs2/ops_inode.c b/fs/gfs2/ops_inode.c
index 1009be2..508407d 100644
--- a/fs/gfs2/ops_inode.c
+++ b/fs/gfs2/ops_inode.c
@@ -253,7 +253,7 @@ out_parent:
 	gfs2_holder_uninit(ghs);
 	gfs2_holder_uninit(ghs + 1);
 	if (!error) {
-		atomic_inc(&inode->i_count);
+		iref(inode);
 		d_instantiate(dentry, inode);
 		mark_inode_dirty(inode);
 	}
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 764fd1b..e2ce54d 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -301,7 +301,7 @@ static int hfsplus_link(struct dentry *src_dentry, struct inode *dst_dir,
 
 	inc_nlink(inode);
 	hfsplus_instantiate(dst_dentry, inode, cnid);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	inode->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
 	HFSPLUS_SB(sb).file_count++;
diff --git a/fs/inode.c b/fs/inode.c
index fd65368..ea4cb31 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -314,13 +314,23 @@ static void init_once(void *foo)
 	inode_init_once(inode);
 }
 
-/*
- * inode_lock must be held
+/**
+ * iref - increment the reference count on an inode
+ * @inode:	inode to take a reference on
+ *
+ * iref() should be called to take an extra reference to an inode. The inode
+ * must already have a reference count obtained via igrab() as iref() does not
+ * do checks for the inode being freed and hence cannot be used to initially
+ * obtain a reference to the inode.
  */
-void __iget(struct inode *inode)
+void iref(struct inode *inode)
 {
+	WARN_ON(atomic_read(&inode->i_count) < 1);
+	spin_lock(&inode_lock);
 	atomic_inc(&inode->i_count);
+	spin_unlock(&inode_lock);
 }
+EXPORT_SYMBOL_GPL(iref);
 
 void inode_lru_list_add(struct inode *inode)
 {
@@ -523,7 +533,7 @@ static void prune_icache(int nr_to_scan)
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
-			__iget(inode);
+			atomic_inc(&inode->i_count);
 			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
@@ -589,7 +599,7 @@ static struct shrinker icache_shrinker = {
 static void __wait_on_freeing_inode(struct inode *inode);
 /*
  * Called with the inode lock held.
- * NOTE: we are not increasing the inode-refcount, you must call __iget()
+ * NOTE: we are not increasing the inode-refcount, you must take a reference
  * by hand after calling find_inode now! This simplifies iunique and won't
  * add any additional branch in the common code.
  */
@@ -793,7 +803,7 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		__iget(old);
+		atomic_inc(&old->i_count);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -840,7 +850,7 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		__iget(old);
+		atomic_inc(&old->i_count);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -893,7 +903,7 @@ struct inode *igrab(struct inode *inode)
 {
 	spin_lock(&inode_lock);
 	if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 	else
 		/*
 		 * Handle the case where s_op->clear_inode is not been
@@ -934,7 +944,7 @@ static struct inode *ifind(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode(sb, head, test, data);
 	if (inode) {
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 		spin_unlock(&inode_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
@@ -967,7 +977,7 @@ static struct inode *ifind_fast(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode_fast(sb, head, ino);
 	if (inode) {
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 		spin_unlock(&inode_lock);
 		wait_on_inode(inode);
 		return inode;
@@ -1150,7 +1160,7 @@ int insert_inode_locked(struct inode *inode)
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		__iget(old);
+		atomic_inc(&old->i_count);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1189,7 +1199,7 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		__iget(old);
+		atomic_inc(&old->i_count);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index ed78a3c..797a034 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -289,7 +289,7 @@ static int jffs2_link (struct dentry *old_dentry, struct inode *dir_i, struct de
 		mutex_unlock(&f->sem);
 		d_instantiate(dentry, old_dentry->d_inode);
 		dir_i->i_mtime = dir_i->i_ctime = ITIME(now);
-		atomic_inc(&old_dentry->d_inode->i_count);
+		iref(old_dentry->d_inode);
 	}
 	return ret;
 }
@@ -864,7 +864,7 @@ static int jffs2_rename (struct inode *old_dir_i, struct dentry *old_dentry,
 		printk(KERN_NOTICE "jffs2_rename(): Link succeeded, unlink failed (err %d). You now have a hard link\n", ret);
 		/* Might as well let the VFS know */
 		d_instantiate(new_dentry, old_dentry->d_inode);
-		atomic_inc(&old_dentry->d_inode->i_count);
+		iref(old_dentry->d_inode);
 		new_dir_i->i_mtime = new_dir_i->i_ctime = ITIME(now);
 		return ret;
 	}
diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c
index d945ea7..3e6dd08 100644
--- a/fs/jfs/jfs_txnmgr.c
+++ b/fs/jfs/jfs_txnmgr.c
@@ -1279,7 +1279,7 @@ int txCommit(tid_t tid,		/* transaction identifier */
 	 * lazy commit thread finishes processing
 	 */
 	if (tblk->xflag & COMMIT_DELETE) {
-		atomic_inc(&tblk->u.ip->i_count);
+		iref(tblk->u.ip);
 		/*
 		 * Avoid a rare deadlock
 		 *
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index a9cf8e8..3d3566e 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -839,7 +839,7 @@ static int jfs_link(struct dentry *old_dentry,
 	ip->i_ctime = CURRENT_TIME;
 	dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	mark_inode_dirty(dir);
-	atomic_inc(&ip->i_count);
+	iref(ip);
 
 	iplist[0] = ip;
 	iplist[1] = dir;
diff --git a/fs/libfs.c b/fs/libfs.c
index 0a9da95..f190d73 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -255,7 +255,7 @@ int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *den
 
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	dget(dentry);
 	d_instantiate(dentry, inode);
 	return 0;
diff --git a/fs/logfs/dir.c b/fs/logfs/dir.c
index 9777eb5..8522edc 100644
--- a/fs/logfs/dir.c
+++ b/fs/logfs/dir.c
@@ -569,7 +569,7 @@ static int logfs_link(struct dentry *old_dentry, struct inode *dir,
 		return -EMLINK;
 
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	inode->i_nlink++;
 	mark_inode_dirty_sync(inode);
 
diff --git a/fs/minix/namei.c b/fs/minix/namei.c
index f3f3578..7563a82 100644
--- a/fs/minix/namei.c
+++ b/fs/minix/namei.c
@@ -101,7 +101,7 @@ static int minix_link(struct dentry * old_dentry, struct inode * dir,
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	return add_nondir(dentry, inode);
 }
 
diff --git a/fs/namei.c b/fs/namei.c
index 24896e8..5fb93f3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2291,7 +2291,7 @@ static long do_unlinkat(int dfd, const char __user *pathname)
 			goto slashes;
 		inode = dentry->d_inode;
 		if (inode)
-			atomic_inc(&inode->i_count);
+			iref(inode);
 		error = mnt_want_write(nd.path.mnt);
 		if (error)
 			goto exit2;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index e257172..5482ede 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1580,7 +1580,7 @@ nfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
 	d_drop(dentry);
 	error = NFS_PROTO(dir)->link(inode, dir, &dentry->d_name);
 	if (error == 0) {
-		atomic_inc(&inode->i_count);
+		iref(inode);
 		d_add(dentry, inode);
 	}
 	return error;
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index a70e446..5aaa2be 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -55,7 +55,7 @@ static int nfs_superblock_set_dummy_root(struct super_block *sb, struct inode *i
 			return -ENOMEM;
 		}
 		/* Circumvent igrab(): we know the inode is not being freed */
-		atomic_inc(&inode->i_count);
+		iref(inode);
 		/*
 		 * Ensure that this dentry is invisible to d_find_alias().
 		 * Otherwise, it may be spliced into the tree by
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 874972d..d1c2f08 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -390,7 +390,7 @@ static int nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
 	error = radix_tree_insert(&nfsi->nfs_page_tree, req->wb_index, req);
 	BUG_ON(error);
 	if (!nfsi->npages) {
-		igrab(inode);
+		iref(inode);
 		if (nfs_have_delegation(inode, FMODE_WRITE))
 			nfsi->change_attr++;
 	}
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index ad6ed2c..fbd3348 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -219,7 +219,7 @@ static int nilfs_link(struct dentry *old_dentry, struct inode *dir,
 
 	inode->i_ctime = CURRENT_TIME;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = nilfs_add_nondir(dentry, inode);
 	if (!err)
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 33297c0..fa7f3b8 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -244,7 +244,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		struct inode *need_iput_tmp;
 
 		/*
-		 * We cannot __iget() an inode in state I_FREEING,
+		 * We cannot iref() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
 		 * the inode cannot have any associated watches.
 		 */
@@ -253,7 +253,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 		/*
 		 * If i_count is zero, the inode cannot have any watches and
-		 * doing an __iget/iput with MS_ACTIVE clear would actually
+		 * doing an iref/iput with MS_ACTIVE clear would actually
 		 * evict all inodes with zero i_count from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
@@ -265,7 +265,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 		/* In case fsnotify_inode_delete() drops a reference. */
 		if (inode != need_iput_tmp)
-			__iget(inode);
+			atomic_inc(&inode->i_count);
 		else
 			need_iput_tmp = NULL;
 
@@ -273,7 +273,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		if ((&next_i->i_sb_list != list) &&
 		    atomic_read(&next_i->i_count) &&
 		    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
-			__iget(next_i);
+			atomic_inc(&next_i->i_count);
 			need_iput = next_i;
 		}
 
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index 5128061..52b48e3 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -2929,8 +2929,8 @@ static int ntfs_fill_super(struct super_block *sb, void *opt, const int silent)
 		goto unl_upcase_iput_tmp_ino_err_out_now;
 	}
 	if ((sb->s_root = d_alloc_root(vol->root_ino))) {
-		/* We increment i_count simulating an ntfs_iget(). */
-		atomic_inc(&vol->root_ino->i_count);
+		/* Simulate an ntfs_iget() call */
+		iref(vol->root_ino);
 		ntfs_debug("Exiting, status successful.");
 		/* Release the default upcase if it has no users. */
 		mutex_lock(&ntfs_lock);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index a00dda2..0e002f6 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -741,7 +741,7 @@ static int ocfs2_link(struct dentry *old_dentry,
 		goto out_commit;
 	}
 
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	dentry->d_op = &ocfs2_dentry_ops;
 	d_instantiate(dentry, inode);
 
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index aad1316..38d4304 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -909,7 +909,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		if (!dqinit_needed(inode, type))
 			continue;
 
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 		spin_unlock(&inode_lock);
 
 		iput(old_inode);
diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c
index ee78d4a..f19bb3d 100644
--- a/fs/reiserfs/namei.c
+++ b/fs/reiserfs/namei.c
@@ -1156,7 +1156,7 @@ static int reiserfs_link(struct dentry *old_dentry, struct inode *dir,
 	inode->i_ctime = CURRENT_TIME_SEC;
 	reiserfs_update_sd(&th, inode);
 
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(dentry, inode);
 	retval = journal_end(&th, dir->i_sb, jbegin_count);
 	reiserfs_write_unlock(dir->i_sb);
diff --git a/fs/sysv/namei.c b/fs/sysv/namei.c
index 33e047b..765974f 100644
--- a/fs/sysv/namei.c
+++ b/fs/sysv/namei.c
@@ -126,7 +126,7 @@ static int sysv_link(struct dentry * old_dentry, struct inode * dir,
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	return add_nondir(dentry, inode);
 }
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 87ebcce..9e8281f 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -550,7 +550,7 @@ static int ubifs_link(struct dentry *old_dentry, struct inode *dir,
 
 	lock_2_inodes(dir, inode);
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	inode->i_ctime = ubifs_current_time(inode);
 	dir->i_size += sz_change;
 	dir_ui->ui_size = dir->i_size;
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index bf5fc67..f6e232a 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -1101,7 +1101,7 @@ static int udf_link(struct dentry *old_dentry, struct inode *dir,
 	inc_nlink(inode);
 	inode->i_ctime = current_fs_time(inode->i_sb);
 	mark_inode_dirty(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(dentry, inode);
 	unlock_kernel();
 
diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
index b056f02..2a598eb 100644
--- a/fs/ufs/namei.c
+++ b/fs/ufs/namei.c
@@ -180,7 +180,7 @@ static int ufs_link (struct dentry * old_dentry, struct inode * dir,
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	error = ufs_add_nondir(dentry, inode);
 	unlock_kernel();
diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index b1fc2a6..b7ec465 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -352,7 +352,7 @@ xfs_vn_link(
 	if (unlikely(error))
 		return -error;
 
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(dentry, inode);
 	return 0;
 }
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0898c54..cbb4791 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -482,7 +482,7 @@ void		xfs_mark_inode_dirty_sync(xfs_inode_t *);
 #define IHOLD(ip) \
 do { \
 	ASSERT(atomic_read(&VFS_I(ip)->i_count) > 0) ; \
-	atomic_inc(&(VFS_I(ip)->i_count)); \
+	iref(VFS_I(ip)); \
 	trace_xfs_ihold(ip, _THIS_IP_); \
 } while (0)
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 90d2b47..6eb94b0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2184,7 +2184,7 @@ extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struc
 extern int insert_inode_locked(struct inode *);
 extern void unlock_new_inode(struct inode *);
 
-extern void __iget(struct inode * inode);
+extern void iref(struct inode *inode);
 extern void iget_failed(struct inode *);
 extern void end_writeback(struct inode *);
 extern void destroy_inode(struct inode *);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c60e519..d53a2c1 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -769,7 +769,7 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
 
 	inode = dentry->d_inode;
 	if (inode)
-		atomic_inc(&inode->i_count);
+		iref(inode);
 	err = mnt_want_write(ipc_ns->mq_mnt);
 	if (err)
 		goto out_err;
diff --git a/kernel/futex.c b/kernel/futex.c
index 6a3a5fa..3bb418c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -168,7 +168,7 @@ static void get_futex_key_refs(union futex_key *key)
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		atomic_inc(&key->shared.inode->i_count);
+		iref(key->shared.inode);
 		break;
 	case FUT_OFF_MMSHARED:
 		atomic_inc(&key->private.mm->mm_count);
diff --git a/mm/shmem.c b/mm/shmem.c
index 080b09a..7d0bc16 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1903,7 +1903,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
 	dir->i_size += BOGO_DIRENT_SIZE;
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);	/* New dentry reference */
+	iref(inode);
 	dget(dentry);		/* Extra pinning count for the created dentry */
 	d_instantiate(dentry, inode);
 out:
diff --git a/net/socket.c b/net/socket.c
index 2270b94..715ca57 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -377,7 +377,7 @@ static int sock_alloc_file(struct socket *sock, struct file **f, int flags)
 		  &socket_file_ops);
 	if (unlikely(!file)) {
 		/* drop dentry, keep inode */
-		atomic_inc(&path.dentry->d_inode->i_count);
+		iref(path.dentry->d_inode);
 		path_put(&path);
 		put_unused_fd(fd);
 		return -ENFILE;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 07/19] exofs: use iput() for inode reference count decrements
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (5 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 06/19] fs: Clean up inode reference counting Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-16  8:14 ` [PATCH 08/19] fs: rework icount to be a locked variable Dave Chinner
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Direct modification of the inode reference count is a no-no. Convert
the exofs decrements to call iput() instead of acting directly on
i_count.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/exofs/inode.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index b631ff3..0fb4d4c 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1101,7 +1101,7 @@ static void create_done(struct exofs_io_state *ios, void *p)
 
 	set_obj_created(oi);
 
-	atomic_dec(&inode->i_count);
+	iput(inode);
 	wake_up(&oi->i_wq);
 }
 
@@ -1161,7 +1161,7 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)
 	ios->cred = oi->i_cred;
 	ret = exofs_sbi_create(ios);
 	if (ret) {
-		atomic_dec(&inode->i_count);
+		iput(inode);
 		exofs_put_io_state(ios);
 		return ERR_PTR(ret);
 	}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 08/19] fs: rework icount to be a locked variable
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (6 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 07/19] exofs: use iput() for inode reference count decrements Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-16  8:14 ` [PATCH 09/19] fs: Factor inode hash operations into functions Dave Chinner
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

The inode reference count is currently an atomic variable so that it
can be sampled/modified outside the inode_lock. However, the
inode_lock is still needed to synchronise the final reference count
and checks against the inode state.

To avoid needing the protection of the inode lock, protect the inode
reference count with the per-inode i_lock and convert it to a normal
variable. To avoid existing out-of-tree code accidentally compiling
against the new method, rename the i_count field to i_ref. This is
relatively straight forward as there are limited external references
to the i_count field remaining.

Based on work originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/filesystems/vfs.txt        |   14 +++---
 arch/powerpc/platforms/cell/spufs/file.c |    2 +-
 fs/btrfs/inode.c                         |   14 ++++--
 fs/ceph/mds_client.c                     |    2 +-
 fs/cifs/inode.c                          |    2 +-
 fs/drop_caches.c                         |    4 +-
 fs/ext3/ialloc.c                         |    4 +-
 fs/ext4/ialloc.c                         |    4 +-
 fs/fs-writeback.c                        |   12 +++--
 fs/hpfs/inode.c                          |    2 +-
 fs/inode.c                               |   83 ++++++++++++++++++++++--------
 fs/locks.c                               |    2 +-
 fs/logfs/readwrite.c                     |    2 +-
 fs/nfs/inode.c                           |    4 +-
 fs/nfs/nfs4state.c                       |    2 +-
 fs/nilfs2/mdt.c                          |    2 +-
 fs/notify/inode_mark.c                   |   25 ++++++---
 fs/ntfs/inode.c                          |    6 +-
 fs/ntfs/super.c                          |    2 +-
 fs/quota/dquot.c                         |    4 +-
 fs/reiserfs/stree.c                      |    2 +-
 fs/smbfs/inode.c                         |    2 +-
 fs/ubifs/super.c                         |    2 +-
 fs/udf/inode.c                           |    2 +-
 fs/xfs/linux-2.6/xfs_trace.h             |    2 +-
 fs/xfs/xfs_inode.h                       |    1 -
 include/linux/fs.h                       |    4 +-
 27 files changed, 134 insertions(+), 73 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index ed7e5ef..0dbbbe4 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -347,8 +347,8 @@ otherwise noted.
   lookup: called when the VFS needs to look up an inode in a parent
 	directory. The name to look for is found in the dentry. This
 	method must call d_add() to insert the found inode into the
-	dentry. The "i_count" field in the inode structure should be
-	incremented. If the named inode does not exist a NULL inode
+	dentry. A reference to the inode should be taken via the
+	iref() function.  If the named inode does not exist a NULL inode
 	should be inserted into the dentry (this is called a negative
 	dentry). Returning an error code from this routine must only
 	be done on a real error, otherwise creating inodes with system
@@ -926,11 +926,11 @@ manipulate dentries:
 	d_instantiate()
 
   d_instantiate: add a dentry to the alias hash list for the inode and
-	updates the "d_inode" member. The "i_count" member in the
-	inode structure should be set/incremented. If the inode
-	pointer is NULL, the dentry is called a "negative
-	dentry". This function is commonly called when an inode is
-	created for an existing negative dentry
+	updates the "d_inode" member. A reference to the inode
+	should be taken via the iref() function.  If the inode
+	pointer is NULL, the dentry is called a "negative dentry".
+	This function is commonly called when an inode is created
+	for an existing negative dentry
 
   d_lookup: look up a dentry given its parent and path name component
 	It looks up the child of that given name from the dcache
diff --git a/arch/powerpc/platforms/cell/spufs/file.c b/arch/powerpc/platforms/cell/spufs/file.c
index 1a40da9..03d8ed3 100644
--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -1549,7 +1549,7 @@ static int spufs_mfc_open(struct inode *inode, struct file *file)
 	if (ctx->owner != current->mm)
 		return -EINVAL;
 
-	if (atomic_read(&inode->i_count) != 1)
+	if (inode->i_ref != 1)
 		return -EBUSY;
 
 	mutex_lock(&ctx->mapping_lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 80e28bf..7947bf0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1964,8 +1964,14 @@ void btrfs_add_delayed_iput(struct inode *inode)
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	struct delayed_iput *delayed;
 
-	if (atomic_add_unless(&inode->i_count, -1, 1))
+	/* XXX: filesystems should not play refcount games like this */
+	spin_lock(&inode->i_lock);
+	if (inode->i_ref > 1) {
+		inode->i_ref--;
+		spin_unlock(&inode->i_lock);
 		return;
+	}
+	spin_unlock(&inode->i_lock);
 
 	delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
 	delayed->inode = inode;
@@ -2718,10 +2724,10 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir,
 		return ERR_PTR(-ENOSPC);
 
 	/* check if there is someone else holds reference */
-	if (S_ISDIR(inode->i_mode) && atomic_read(&inode->i_count) > 1)
+	if (S_ISDIR(inode->i_mode) && inode->i_ref > 1)
 		return ERR_PTR(-ENOSPC);
 
-	if (atomic_read(&inode->i_count) > 2)
+	if (inode->i_ref > 2)
 		return ERR_PTR(-ENOSPC);
 
 	if (xchg(&root->fs_info->enospc_unlink, 1))
@@ -3939,7 +3945,7 @@ again:
 		inode = igrab(&entry->vfs_inode);
 		if (inode) {
 			spin_unlock(&root->inode_lock);
-			if (atomic_read(&inode->i_count) > 1)
+			if (inode->i_ref > 1)
 				d_prune_aliases(inode);
 			/*
 			 * btrfs_drop_inode will have it removed from
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index fad95f8..1217580 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1102,7 +1102,7 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
 		spin_unlock(&inode->i_lock);
 		d_prune_aliases(inode);
 		dout("trim_caps_cb %p cap %p  pruned, count now %d\n",
-		     inode, cap, atomic_read(&inode->i_count));
+		     inode, cap, inode->i_ref);
 		return 0;
 	}
 
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 53cce8c..f13f2d0 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1641,7 +1641,7 @@ int cifs_revalidate_dentry(struct dentry *dentry)
 	}
 
 	cFYI(1, "Revalidate: %s inode 0x%p count %d dentry: 0x%p d_time %ld "
-		 "jiffies %ld", full_path, inode, inode->i_count.counter,
+		 "jiffies %ld", full_path, inode, inode->i_ref,
 		 dentry, dentry->d_time, jiffies);
 
 	if (CIFS_SB(sb)->tcon->unix_ext)
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index c2721fa..10c8c5a 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -22,7 +22,9 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index 4ab72db..fb20ac7 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -100,9 +100,9 @@ void ext3_free_inode (handle_t *handle, struct inode * inode)
 	struct ext3_sb_info *sbi;
 	int fatal = 0, err;
 
-	if (atomic_read(&inode->i_count) > 1) {
+	if (inode->i_ref > 1) {
 		printk ("ext3_free_inode: inode has count=%d\n",
-					atomic_read(&inode->i_count));
+					inode->i_ref);
 		return;
 	}
 	if (inode->i_nlink) {
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 45853e0..56d0bb0 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -189,9 +189,9 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
 	struct ext4_sb_info *sbi;
 	int fatal = 0, err, count, cleared;
 
-	if (atomic_read(&inode->i_count) > 1) {
+	if (inode->i_ref > 1) {
 		printk(KERN_ERR "ext4_free_inode: inode has count=%d\n",
-		       atomic_read(&inode->i_count));
+		       inode->i_ref);
 		return;
 	}
 	if (inode->i_nlink) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 595dfc6..9832beb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -315,7 +315,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	unsigned dirty;
 	int ret;
 
-	if (!atomic_read(&inode->i_count))
+	if (!inode->i_ref)
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
 		WARN_ON(inode->i_state & I_WILL_FREE);
@@ -416,7 +416,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * inodes are removed from the LRU during scanning.
 			 */
 			list_del_init(&inode->i_wb_list);
-			if (!atomic_read(&inode->i_count))
+			if (!inode->i_ref)
 				inode_lru_list_add(inode);
 		}
 	}
@@ -499,7 +499,9 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			return 1;
 
 		BUG_ON(inode->i_state & I_FREEING);
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
 		pages_skipped = wbc->pages_skipped;
 		writeback_single_inode(inode, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
@@ -1045,7 +1047,9 @@ static void wait_sb_inodes(struct super_block *sb)
 		mapping = inode->i_mapping;
 		if (mapping->nrpages == 0)
 			continue;
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have
diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c
index 56f0da1..67147bf 100644
--- a/fs/hpfs/inode.c
+++ b/fs/hpfs/inode.c
@@ -183,7 +183,7 @@ void hpfs_write_inode(struct inode *i)
 	struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
 	struct inode *parent;
 	if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
-	if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) {
+	if (hpfs_inode->i_rddir_off && !i->i_ref) {
 		if (*hpfs_inode->i_rddir_off) printk("HPFS: write_inode: some position still there\n");
 		kfree(hpfs_inode->i_rddir_off);
 		hpfs_inode->i_rddir_off = NULL;
diff --git a/fs/inode.c b/fs/inode.c
index ea4cb31..8f73ea1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -26,6 +26,15 @@
 #include <linux/posix_acl.h>
 
 /*
+ * Locking rules.
+ *
+ * inode->i_lock is *always* the innermost lock.
+ *
+ * inode->i_lock protects:
+ *   i_ref
+ */
+
+/*
  * This is needed for the following functions:
  *  - inode_has_buffers
  *  - invalidate_inode_buffers
@@ -64,9 +73,9 @@ static unsigned int i_hash_shift __read_mostly;
  * Each inode can be on two separate lists. One is
  * the hash list of the inode, used for lookups. The
  * other linked list is the "type" list:
- *  "in_use" - valid inode, i_count > 0, i_nlink > 0
+ *  "in_use" - valid inode, i_ref > 0, i_nlink > 0
  *  "dirty"  - as "in_use" but also dirty
- *  "unused" - valid inode, i_count = 0
+ *  "unused" - valid inode, i_ref = 0
  *
  * A "dirty" list is maintained for each super block,
  * allowing for low-overhead inode sync() operations.
@@ -164,7 +173,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_sb = sb;
 	inode->i_blkbits = sb->s_blocksize_bits;
 	inode->i_flags = 0;
-	atomic_set(&inode->i_count, 1);
+	inode->i_ref = 1;
 	inode->i_op = &empty_iops;
 	inode->i_fop = &empty_fops;
 	inode->i_nlink = 1;
@@ -325,9 +334,11 @@ static void init_once(void *foo)
  */
 void iref(struct inode *inode)
 {
-	WARN_ON(atomic_read(&inode->i_count) < 1);
+	WARN_ON(inode->i_ref < 1);
 	spin_lock(&inode_lock);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_ref++;
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL_GPL(iref);
@@ -432,13 +443,16 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 		if (inode->i_state & I_NEW)
 			continue;
 		invalidate_inode_buffers(inode);
-		if (!atomic_read(&inode->i_count)) {
+		spin_lock(&inode->i_lock);
+		if (!inode->i_ref) {
+			spin_unlock(&inode->i_lock);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
 			list_move(&inode->i_lru, dispose);
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
+		spin_unlock(&inode->i_lock);
 		busy = 1;
 	}
 	return busy;
@@ -476,7 +490,7 @@ static int can_unuse(struct inode *inode)
 		return 0;
 	if (inode_has_buffers(inode))
 		return 0;
-	if (atomic_read(&inode->i_count))
+	if (inode->i_ref)
 		return 0;
 	if (inode->i_data.nrpages)
 		return 0;
@@ -519,8 +533,9 @@ static void prune_icache(int nr_to_scan)
 		 * Referenced or dirty inodes are still in use. Give them
 		 * another pass through the LRU as we canot reclaim them now.
 		 */
-		if (atomic_read(&inode->i_count) ||
-		    (inode->i_state & ~I_REFERENCED)) {
+		spin_lock(&inode->i_lock);
+		if (inode->i_ref || (inode->i_state & ~I_REFERENCED)) {
+			spin_unlock(&inode->i_lock);
 			list_del_init(&inode->i_lru);
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
@@ -528,12 +543,14 @@ static void prune_icache(int nr_to_scan)
 
 		/* recently referenced inodes get one more pass */
 		if (inode->i_state & I_REFERENCED) {
+			spin_unlock(&inode->i_lock);
 			list_move(&inode->i_lru, &inode_lru);
 			inode->i_state &= ~I_REFERENCED;
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
-			atomic_inc(&inode->i_count);
+			inode->i_ref++;
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
@@ -550,7 +567,8 @@ static void prune_icache(int nr_to_scan)
 				list_move(&inode->i_lru, &inode_lru);
 				continue;
 			}
-		}
+		} else
+			spin_unlock(&inode->i_lock);
 		list_move(&inode->i_lru, &freeable);
 		list_del_init(&inode->i_wb_list);
 		WARN_ON(inode->i_state & I_NEW);
@@ -803,7 +821,9 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		atomic_inc(&old->i_count);
+		spin_lock(&old->i_lock);
+		old->i_ref++;
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -850,7 +870,9 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		atomic_inc(&old->i_count);
+		spin_lock(&old->i_lock);
+		old->i_ref++;
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -902,15 +924,19 @@ EXPORT_SYMBOL(iunique);
 struct inode *igrab(struct inode *inode)
 {
 	spin_lock(&inode_lock);
-	if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
-		atomic_inc(&inode->i_count);
-	else
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
+	} else {
+		spin_unlock(&inode->i_lock);
 		/*
 		 * Handle the case where s_op->clear_inode is not been
 		 * called yet, and somebody is calling igrab
 		 * while the inode is getting freed.
 		 */
 		inode = NULL;
+	}
 	spin_unlock(&inode_lock);
 	return inode;
 }
@@ -944,7 +970,9 @@ static struct inode *ifind(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode(sb, head, test, data);
 	if (inode) {
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
@@ -977,7 +1005,9 @@ static struct inode *ifind_fast(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode_fast(sb, head, ino);
 	if (inode) {
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(inode);
 		return inode;
@@ -1160,7 +1190,9 @@ int insert_inode_locked(struct inode *inode)
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		atomic_inc(&old->i_count);
+		spin_lock(&old->i_lock);
+		old->i_ref++;
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1199,7 +1231,9 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		atomic_inc(&old->i_count);
+		spin_lock(&old->i_lock);
+		old->i_ref++;
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1333,8 +1367,15 @@ void iput(struct inode *inode)
 	if (inode) {
 		BUG_ON(inode->i_state & I_CLEAR);
 
-		if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
+		spin_lock(&inode_lock);
+		spin_lock(&inode->i_lock);
+		if (--inode->i_ref == 0) {
+			spin_unlock(&inode->i_lock);
 			iput_final(inode);
+			return;
+		}
+		spin_unlock(&inode->i_lock);
+		spin_lock(&inode_lock);
 	}
 }
 EXPORT_SYMBOL(iput);
diff --git a/fs/locks.c b/fs/locks.c
index ab24d49..4dec81a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1376,7 +1376,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
 			goto out;
 		if ((arg == F_WRLCK)
 		    && ((atomic_read(&dentry->d_count) > 1)
-			|| (atomic_read(&inode->i_count) > 1)))
+			|| inode->i_ref > 1))
 			goto out;
 	}
 
diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
index 6127baf..1b26a8d 100644
--- a/fs/logfs/readwrite.c
+++ b/fs/logfs/readwrite.c
@@ -1002,7 +1002,7 @@ static int __logfs_is_valid_block(struct inode *inode, u64 bix, u64 ofs)
 {
 	struct logfs_inode *li = logfs_inode(inode);
 
-	if ((inode->i_nlink == 0) && atomic_read(&inode->i_count) == 1)
+	if ((inode->i_nlink == 0) && inode->i_ref == 1)
 		return 0;
 
 	if (bix < I0_BLOCKS)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 7d2d6c7..32a9c69 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -384,7 +384,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
 	dprintk("NFS: nfs_fhget(%s/%Ld ct=%d)\n",
 		inode->i_sb->s_id,
 		(long long)NFS_FILEID(inode),
-		atomic_read(&inode->i_count));
+		inode->i_ref);
 
 out:
 	return inode;
@@ -1190,7 +1190,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 
 	dfprintk(VFS, "NFS: %s(%s/%ld ct=%d info=0x%x)\n",
 			__func__, inode->i_sb->s_id, inode->i_ino,
-			atomic_read(&inode->i_count), fattr->valid);
+			inode->i_ref, fattr->valid);
 
 	if ((fattr->valid & NFS_ATTR_FATTR_FILEID) && nfsi->fileid != fattr->fileid)
 		goto out_fileid;
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 3e2f19b..d7fc5d0 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -506,8 +506,8 @@ nfs4_get_open_state(struct inode *inode, struct nfs4_state_owner *owner)
 		state->owner = owner;
 		atomic_inc(&owner->so_count);
 		list_add(&state->inode_states, &nfsi->open_states);
-		state->inode = igrab(inode);
 		spin_unlock(&inode->i_lock);
+		state->inode = igrab(inode);
 		/* Note: The reclaim code dictates that we add stateless
 		 * and read-only stateids to the end of the list */
 		list_add_tail(&state->open_states, &owner->so_states);
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index 62756b4..939459d 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -480,7 +480,7 @@ nilfs_mdt_new_common(struct the_nilfs *nilfs, struct super_block *sb,
 		inode->i_sb = sb; /* sb may be NULL for some meta data files */
 		inode->i_blkbits = nilfs->ns_blocksize_bits;
 		inode->i_flags = 0;
-		atomic_set(&inode->i_count, 1);
+		inode->i_ref = 1;
 		inode->i_nlink = 1;
 		inode->i_ino = ino;
 		inode->i_mode = S_IFREG;
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index fa7f3b8..1a4c117 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -252,29 +252,36 @@ void fsnotify_unmount_inodes(struct list_head *list)
 			continue;
 
 		/*
-		 * If i_count is zero, the inode cannot have any watches and
+		 * If i_ref is zero, the inode cannot have any watches and
 		 * doing an iref/iput with MS_ACTIVE clear would actually
-		 * evict all inodes with zero i_count from icache which is
+		 * evict all inodes with zero i_ref from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
-		if (!atomic_read(&inode->i_count))
+		spin_lock(&inode->i_lock);
+		if (!inode->i_ref) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 
 		need_iput_tmp = need_iput;
 		need_iput = NULL;
 
 		/* In case fsnotify_inode_delete() drops a reference. */
 		if (inode != need_iput_tmp)
-			atomic_inc(&inode->i_count);
+			inode->i_ref++;
 		else
 			need_iput_tmp = NULL;
+		spin_unlock(&inode->i_lock);
 
 		/* In case the dropping of a reference would nuke next_i. */
-		if ((&next_i->i_sb_list != list) &&
-		    atomic_read(&next_i->i_count) &&
-		    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
-			atomic_inc(&next_i->i_count);
-			need_iput = next_i;
+		if (&next_i->i_sb_list != list) {
+			spin_lock(&next_i->i_lock);
+			if (inode->i_ref &&
+			    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
+				next_i->i_ref++;
+				need_iput = next_i;
+			}
+			spin_unlock(&next_i->i_lock);
 		}
 
 		/*
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index 93622b1..07fdef8 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -531,7 +531,7 @@ err_corrupt_attr:
  *
  * Q: What locks are held when the function is called?
  * A: i_state has I_NEW set, hence the inode is locked, also
- *    i_count is set to 1, so it is not going to go away
+ *    i_ref is set to 1, so it is not going to go away
  *    i_flags is set to 0 and we have no business touching it.  Only an ioctl()
  *    is allowed to write to them. We should of course be honouring them but
  *    we need to do that using the IS_* macros defined in include/linux/fs.h.
@@ -1208,7 +1208,7 @@ err_out:
  *
  * Q: What locks are held when the function is called?
  * A: i_state has I_NEW set, hence the inode is locked, also
- *    i_count is set to 1, so it is not going to go away
+ *    i_ref is set to 1, so it is not going to go away
  *
  * Return 0 on success and -errno on error.  In the error case, the inode will
  * have had make_bad_inode() executed on it.
@@ -1475,7 +1475,7 @@ err_out:
  *
  * Q: What locks are held when the function is called?
  * A: i_state has I_NEW set, hence the inode is locked, also
- *    i_count is set to 1, so it is not going to go away
+ *    i_ref is set to 1, so it is not going to go away
  *
  * Return 0 on success and -errno on error.  In the error case, the inode will
  * have had make_bad_inode() executed on it.
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index 52b48e3..181eddb 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -2689,7 +2689,7 @@ static const struct super_operations ntfs_sops = {
 	//					   held. See fs/inode.c::
 	//					   generic_drop_inode(). */
 	//.delete_inode	= NULL,			/* VFS: Delete inode from disk.
-	//					   Called when i_count becomes
+	//					   Called when i_ref becomes
 	//					   0 and i_nlink is also 0. */
 	//.write_super	= NULL,			/* Flush dirty super block to
 	//					   disk. */
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 38d4304..326df72 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -909,7 +909,9 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		if (!dqinit_needed(inode, type))
 			continue;
 
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 
 		iput(old_inode);
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 313d39d..42d3311 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1477,7 +1477,7 @@ static int maybe_indirect_to_direct(struct reiserfs_transaction_handle *th,
 	 ** reading in the last block.  The user will hit problems trying to
 	 ** read the file, but for now we just skip the indirect2direct
 	 */
-	if (atomic_read(&inode->i_count) > 1 ||
+	if (inode->i_ref > 1 ||
 	    !tail_has_to_be_packed(inode) ||
 	    !page || (REISERFS_I(inode)->i_flags & i_nopack_mask)) {
 		/* leave tail in an unformatted node */
diff --git a/fs/smbfs/inode.c b/fs/smbfs/inode.c
index 450c919..85ff606 100644
--- a/fs/smbfs/inode.c
+++ b/fs/smbfs/inode.c
@@ -320,7 +320,7 @@ out:
 }
 
 /*
- * This routine is called when i_nlink == 0 and i_count goes to 0.
+ * This routine is called when i_nlink == 0 and i_ref goes to 0.
  * All blocking cleanup operations need to go here to avoid races.
  */
 static void
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index cd5900b..ead1f89 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -342,7 +342,7 @@ static void ubifs_evict_inode(struct inode *inode)
 		goto out;
 
 	dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
-	ubifs_assert(!atomic_read(&inode->i_count));
+	ubifs_assert(!inode->i_ref);
 
 	truncate_inode_pages(&inode->i_data, 0);
 
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index fc48f37..05b0445 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -1071,7 +1071,7 @@ static void __udf_read_inode(struct inode *inode)
 	 *      i_flags = sb->s_flags
 	 *      i_state = 0
 	 * clean_inode(): zero fills and sets
-	 *      i_count = 1
+	 *      i_ref = 1
 	 *      i_nlink = 1
 	 *      i_op = NULL;
 	 */
diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index be5dffd..0428b06 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -599,7 +599,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
-		__entry->count = atomic_read(&VFS_I(ip)->i_count);
+		__entry->count = VFS_I(ip)->i_ref;
 		__entry->pincount = atomic_read(&ip->i_pincount);
 		__entry->caller_ip = caller_ip;
 	),
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index cbb4791..1e41fa8 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -481,7 +481,6 @@ void		xfs_mark_inode_dirty_sync(xfs_inode_t *);
 
 #define IHOLD(ip) \
 do { \
-	ASSERT(atomic_read(&VFS_I(ip)->i_count) > 0) ; \
 	iref(VFS_I(ip)); \
 	trace_xfs_ihold(ip, _THIS_IP_); \
 } while (0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6eb94b0..c720d65 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -730,7 +730,7 @@ struct inode {
 	struct list_head	i_sb_list;
 	struct list_head	i_dentry;
 	unsigned long		i_ino;
-	atomic_t		i_count;
+	unsigned int		i_ref;
 	unsigned int		i_nlink;
 	uid_t			i_uid;
 	gid_t			i_gid;
@@ -1612,7 +1612,7 @@ struct super_operations {
  *			also cause waiting on I_NEW, without I_NEW actually
  *			being set.  find_inode() uses this to prevent returning
  *			nearly-dead inodes.
- * I_WILL_FREE		Must be set when calling write_inode_now() if i_count
+ * I_WILL_FREE		Must be set when calling write_inode_now() if i_ref
  *			is zero.  I_FREEING must be set when I_WILL_FREE is
  *			cleared.
  * I_FREEING		Set when inode is about to be freed but still has dirty
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 09/19] fs: Factor inode hash operations into functions
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (7 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 08/19] fs: rework icount to be a locked variable Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-16  8:14 ` [PATCH 10/19] fs: Introduce per-bucket inode hash locks Dave Chinner
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Before replacing the inode hash locking with a more scalable
mechanism, factor the removal of the inode from the hashes rather
than open coding it in several places.

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/inode.c |  100 +++++++++++++++++++++++++++++++++--------------------------
 1 files changed, 56 insertions(+), 44 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 8f73ea1..077080c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -359,6 +359,59 @@ void inode_lru_list_del(struct inode *inode)
 	}
 }
 
+static unsigned long hash(struct super_block *sb, unsigned long hashval)
+{
+	unsigned long tmp;
+
+	tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
+			L1_CACHE_BYTES;
+	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> I_HASHBITS);
+	return tmp & I_HASHMASK;
+}
+
+/**
+ *	__insert_inode_hash - hash an inode
+ *	@inode: unhashed inode
+ *	@hashval: unsigned long value used to locate this object in the
+ *		inode_hashtable.
+ *
+ *	Add an inode to the inode hash for this superblock.
+ */
+void __insert_inode_hash(struct inode *inode, unsigned long hashval)
+{
+	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+	spin_lock(&inode_lock);
+	hlist_add_head(&inode->i_hash, head);
+	spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL(__insert_inode_hash);
+
+/**
+ *	__remove_inode_hash - remove an inode from the hash
+ *	@inode: inode to unhash
+ *
+ *	Remove an inode from the superblock. inode->i_lock must be
+ *	held.
+ */
+static void __remove_inode_hash(struct inode *inode)
+{
+	hlist_del_init(&inode->i_hash);
+}
+
+/**
+ *	remove_inode_hash - remove an inode from the hash
+ *	@inode: inode to unhash
+ *
+ *	Remove an inode from the superblock.
+ */
+void remove_inode_hash(struct inode *inode)
+{
+	spin_lock(&inode_lock);
+	hlist_del_init(&inode->i_hash);
+	spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL(remove_inode_hash);
+
 void end_writeback(struct inode *inode)
 {
 	might_sleep();
@@ -406,7 +459,7 @@ static void dispose_list(struct list_head *head)
 		evict(inode);
 
 		spin_lock(&inode_lock);
-		hlist_del_init(&inode->i_hash);
+		__remove_inode_hash(inode);
 		list_del_init(&inode->i_sb_list);
 		spin_unlock(&inode_lock);
 
@@ -669,16 +722,6 @@ repeat:
 	return node ? inode : NULL;
 }
 
-static unsigned long hash(struct super_block *sb, unsigned long hashval)
-{
-	unsigned long tmp;
-
-	tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
-			L1_CACHE_BYTES;
-	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> I_HASHBITS);
-	return tmp & I_HASHMASK;
-}
-
 static inline void
 __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
 			struct inode *inode)
@@ -1245,36 +1288,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 }
 EXPORT_SYMBOL(insert_inode_locked4);
 
-/**
- *	__insert_inode_hash - hash an inode
- *	@inode: unhashed inode
- *	@hashval: unsigned long value used to locate this object in the
- *		inode_hashtable.
- *
- *	Add an inode to the inode hash for this superblock.
- */
-void __insert_inode_hash(struct inode *inode, unsigned long hashval)
-{
-	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
-	spin_lock(&inode_lock);
-	hlist_add_head(&inode->i_hash, head);
-	spin_unlock(&inode_lock);
-}
-EXPORT_SYMBOL(__insert_inode_hash);
-
-/**
- *	remove_inode_hash - remove an inode from the hash
- *	@inode: inode to unhash
- *
- *	Remove an inode from the superblock.
- */
-void remove_inode_hash(struct inode *inode)
-{
-	spin_lock(&inode_lock);
-	hlist_del_init(&inode->i_hash);
-	spin_unlock(&inode_lock);
-}
-EXPORT_SYMBOL(remove_inode_hash);
 
 int generic_delete_inode(struct inode *inode)
 {
@@ -1330,6 +1343,7 @@ static void iput_final(struct inode *inode)
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
 		hlist_del_init(&inode->i_hash);
+		__remove_inode_hash(inode);
 	}
 	list_del_init(&inode->i_wb_list);
 	WARN_ON(inode->i_state & I_NEW);
@@ -1345,9 +1359,7 @@ static void iput_final(struct inode *inode)
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&inode_lock);
 	evict(inode);
-	spin_lock(&inode_lock);
-	hlist_del_init(&inode->i_hash);
-	spin_unlock(&inode_lock);
+	remove_inode_hash(inode);
 	wake_up_inode(inode);
 	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
 	destroy_inode(inode);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 10/19] fs: Introduce per-bucket inode hash locks
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (8 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 09/19] fs: Factor inode hash operations into functions Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-16  8:14 ` [PATCH 11/19] fs: add a per-superblock lock for the inode list Dave Chinner
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Protect the inode hash with a single lock is not scalable.  Convert
the inode hash to use the new bit-locked hash list implementation
that allows per-bucket locks to be used. This allows us to replace
the global inode_lock with finer grained locking without increasing
the size of the hash table.

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c        |    2 +-
 fs/fs-writeback.c       |    2 +-
 fs/hfs/hfs_fs.h         |    2 +-
 fs/hfs/inode.c          |    2 +-
 fs/hfsplus/hfsplus_fs.h |    2 +-
 fs/hfsplus/inode.c      |    2 +-
 fs/inode.c              |  148 ++++++++++++++++++++++++++++------------------
 fs/nilfs2/gcinode.c     |   22 ++++---
 fs/nilfs2/segment.c     |    2 +-
 fs/nilfs2/the_nilfs.h   |    2 +-
 fs/reiserfs/xattr.c     |    2 +-
 include/linux/fs.h      |    8 ++-
 include/linux/list_bl.h |    1 +
 mm/shmem.c              |    4 +-
 14 files changed, 121 insertions(+), 80 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7947bf0..c7a2bef 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3855,7 +3855,7 @@ again:
 	p = &root->inode_tree.rb_node;
 	parent = NULL;
 
-	if (hlist_unhashed(&inode->i_hash))
+	if (inode_unhashed(inode))
 		return;
 
 	spin_lock(&root->inode_lock);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9832beb..1fb5d95 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -964,7 +964,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 * dirty list.  Add blockdev inodes as well.
 		 */
 		if (!S_ISBLK(inode->i_mode)) {
-			if (hlist_unhashed(&inode->i_hash))
+			if (inode_unhashed(inode))
 				goto out;
 		}
 		if (inode->i_state & I_FREEING)
diff --git a/fs/hfs/hfs_fs.h b/fs/hfs/hfs_fs.h
index 4f55651..24591be 100644
--- a/fs/hfs/hfs_fs.h
+++ b/fs/hfs/hfs_fs.h
@@ -148,7 +148,7 @@ struct hfs_sb_info {
 
 	int fs_div;
 
-	struct hlist_head rsrc_inodes;
+	struct hlist_bl_head rsrc_inodes;
 };
 
 #define HFS_FLG_BITMAP_DIRTY	0
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 397b7ad..7778298 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -524,7 +524,7 @@ static struct dentry *hfs_file_lookup(struct inode *dir, struct dentry *dentry,
 	HFS_I(inode)->rsrc_inode = dir;
 	HFS_I(dir)->rsrc_inode = inode;
 	igrab(dir);
-	hlist_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
+	hlist_bl_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
 	mark_inode_dirty(inode);
 out:
 	d_add(dentry, inode);
diff --git a/fs/hfsplus/hfsplus_fs.h b/fs/hfsplus/hfsplus_fs.h
index dc856be..499f5a5 100644
--- a/fs/hfsplus/hfsplus_fs.h
+++ b/fs/hfsplus/hfsplus_fs.h
@@ -144,7 +144,7 @@ struct hfsplus_sb_info {
 
 	unsigned long flags;
 
-	struct hlist_head rsrc_inodes;
+	struct hlist_bl_head rsrc_inodes;
 };
 
 #define HFSPLUS_SB_WRITEBACKUP	0x0001
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index c5a979d..b755cf0 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -202,7 +202,7 @@ static struct dentry *hfsplus_file_lookup(struct inode *dir, struct dentry *dent
 	HFSPLUS_I(inode).rsrc_inode = dir;
 	HFSPLUS_I(dir).rsrc_inode = inode;
 	igrab(dir);
-	hlist_add_head(&inode->i_hash, &HFSPLUS_SB(sb).rsrc_inodes);
+	hlist_bl_add_head(&inode->i_hash, &HFSPLUS_SB(sb).rsrc_inodes);
 	mark_inode_dirty(inode);
 out:
 	d_add(dentry, inode);
diff --git a/fs/inode.c b/fs/inode.c
index 077080c..80692c5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -32,6 +32,13 @@
  *
  * inode->i_lock protects:
  *   i_ref
+ * inode hash lock protects:
+ *   inode hash table, i_hash
+ *
+ * Lock orders
+ * inode_lock
+ *   inode hash bucket lock
+ *     inode->i_lock
  */
 
 /*
@@ -68,6 +75,7 @@
 
 static unsigned int i_hash_mask __read_mostly;
 static unsigned int i_hash_shift __read_mostly;
+static struct hlist_bl_head *inode_hashtable __read_mostly;
 
 /*
  * Each inode can be on two separate lists. One is
@@ -80,9 +88,7 @@ static unsigned int i_hash_shift __read_mostly;
  * A "dirty" list is maintained for each super block,
  * allowing for low-overhead inode sync() operations.
  */
-
 static LIST_HEAD(inode_lru);
-static struct hlist_head *inode_hashtable __read_mostly;
 
 /*
  * A simple spinlock to protect the list manipulations.
@@ -297,7 +303,7 @@ void destroy_inode(struct inode *inode)
 void inode_init_once(struct inode *inode)
 {
 	memset(inode, 0, sizeof(*inode));
-	INIT_HLIST_NODE(&inode->i_hash);
+	init_hlist_bl_node(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_dentry);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_wb_list);
@@ -379,9 +385,13 @@ static unsigned long hash(struct super_block *sb, unsigned long hashval)
  */
 void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 {
-	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+	struct hlist_bl_head *b;
+
+	b = inode_hashtable + hash(inode->i_sb, hashval);
 	spin_lock(&inode_lock);
-	hlist_add_head(&inode->i_hash, head);
+	hlist_bl_lock(b);
+	hlist_bl_add_head(&inode->i_hash, b);
+	hlist_bl_unlock(b);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
@@ -395,7 +405,12 @@ EXPORT_SYMBOL(__insert_inode_hash);
  */
 static void __remove_inode_hash(struct inode *inode)
 {
-	hlist_del_init(&inode->i_hash);
+	struct hlist_bl_head *b;
+
+	b = inode_hashtable + hash(inode->i_sb, inode->i_ino);
+	hlist_bl_lock(b);
+	hlist_bl_del_init(&inode->i_hash);
+	hlist_bl_unlock(b);
 }
 
 /**
@@ -407,7 +422,7 @@ static void __remove_inode_hash(struct inode *inode)
 void remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_lock);
-	hlist_del_init(&inode->i_hash);
+	__remove_inode_hash(inode);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(remove_inode_hash);
@@ -675,25 +690,28 @@ static void __wait_on_freeing_inode(struct inode *inode);
  * add any additional branch in the common code.
  */
 static struct inode *find_inode(struct super_block *sb,
-				struct hlist_head *head,
+				struct hlist_bl_head *b,
 				int (*test)(struct inode *, void *),
 				void *data)
 {
-	struct hlist_node *node;
+	struct hlist_bl_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	hlist_bl_lock(b);
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
 		if (!test(inode, data))
 			continue;
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			hlist_bl_unlock(b);
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
 		break;
 	}
+	hlist_bl_unlock(b);
 	return node ? inode : NULL;
 }
 
@@ -702,33 +720,40 @@ repeat:
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				struct hlist_bl_head *b,
+				unsigned long ino)
 {
-	struct hlist_node *node;
+	struct hlist_bl_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	hlist_bl_lock(b);
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			hlist_bl_unlock(b);
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
 		break;
 	}
+	hlist_bl_unlock(b);
 	return node ? inode : NULL;
 }
 
 static inline void
-__inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
+__inode_add_to_lists(struct super_block *sb, struct hlist_bl_head *b,
 			struct inode *inode)
 {
 	list_add(&inode->i_sb_list, &sb->s_inodes);
-	if (head)
-		hlist_add_head(&inode->i_hash, head);
+	if (b) {
+		hlist_bl_lock(b);
+		hlist_bl_add_head(&inode->i_hash, b);
+		hlist_bl_unlock(b);
+	}
 }
 
 /**
@@ -745,10 +770,10 @@ __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
  */
 void inode_add_to_lists(struct super_block *sb, struct inode *inode)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, inode->i_ino);
 
 	spin_lock(&inode_lock);
-	__inode_add_to_lists(sb, head, inode);
+	__inode_add_to_lists(sb, b, inode);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL_GPL(inode_add_to_lists);
@@ -831,7 +856,7 @@ EXPORT_SYMBOL(unlock_new_inode);
  *	-- rmk@arm.uk.linux.org
  */
 static struct inode *get_new_inode(struct super_block *sb,
-				struct hlist_head *head,
+				struct hlist_bl_head *b,
 				int (*test)(struct inode *, void *),
 				int (*set)(struct inode *, void *),
 				void *data)
@@ -844,12 +869,12 @@ static struct inode *get_new_inode(struct super_block *sb,
 
 		spin_lock(&inode_lock);
 		/* We released the lock, so.. */
-		old = find_inode(sb, head, test, data);
+		old = find_inode(sb, b, test, data);
 		if (!old) {
 			if (set(inode, data))
 				goto set_failed;
 
-			__inode_add_to_lists(sb, head, inode);
+			__inode_add_to_lists(sb, b, inode);
 			inode->i_state = I_NEW;
 			spin_unlock(&inode_lock);
 
@@ -885,7 +910,7 @@ set_failed:
  * comment at iget_locked for details.
  */
 static struct inode *get_new_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				struct hlist_bl_head *b, unsigned long ino)
 {
 	struct inode *inode;
 
@@ -895,10 +920,10 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 
 		spin_lock(&inode_lock);
 		/* We released the lock, so.. */
-		old = find_inode_fast(sb, head, ino);
+		old = find_inode_fast(sb, b, ino);
 		if (!old) {
 			inode->i_ino = ino;
-			__inode_add_to_lists(sb, head, inode);
+			__inode_add_to_lists(sb, b, inode);
 			inode->i_state = I_NEW;
 			spin_unlock(&inode_lock);
 
@@ -947,7 +972,7 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
 	 */
 	static unsigned int counter;
 	struct inode *inode;
-	struct hlist_head *head;
+	struct hlist_bl_head *b;
 	ino_t res;
 
 	spin_lock(&inode_lock);
@@ -955,8 +980,8 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
 		if (counter <= max_reserved)
 			counter = max_reserved + 1;
 		res = counter++;
-		head = inode_hashtable + hash(sb, res);
-		inode = find_inode_fast(sb, head, res);
+		b = inode_hashtable + hash(sb, res);
+		inode = find_inode_fast(sb, b, res);
 	} while (inode != NULL);
 	spin_unlock(&inode_lock);
 
@@ -1005,13 +1030,14 @@ EXPORT_SYMBOL(igrab);
  * Note, @test is called with the inode_lock held, so can't sleep.
  */
 static struct inode *ifind(struct super_block *sb,
-		struct hlist_head *head, int (*test)(struct inode *, void *),
+		struct hlist_bl_head *b,
+		int (*test)(struct inode *, void *),
 		void *data, const int wait)
 {
 	struct inode *inode;
 
 	spin_lock(&inode_lock);
-	inode = find_inode(sb, head, test, data);
+	inode = find_inode(sb, b, test, data);
 	if (inode) {
 		spin_lock(&inode->i_lock);
 		inode->i_ref++;
@@ -1041,12 +1067,13 @@ static struct inode *ifind(struct super_block *sb,
  * Otherwise NULL is returned.
  */
 static struct inode *ifind_fast(struct super_block *sb,
-		struct hlist_head *head, unsigned long ino)
+		struct hlist_bl_head *b,
+		unsigned long ino)
 {
 	struct inode *inode;
 
 	spin_lock(&inode_lock);
-	inode = find_inode_fast(sb, head, ino);
+	inode = find_inode_fast(sb, b, ino);
 	if (inode) {
 		spin_lock(&inode->i_lock);
 		inode->i_ref++;
@@ -1083,9 +1110,9 @@ static struct inode *ifind_fast(struct super_block *sb,
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 
-	return ifind(sb, head, test, data, 0);
+	return ifind(sb, b, test, data, 0);
 }
 EXPORT_SYMBOL(ilookup5_nowait);
 
@@ -1111,9 +1138,9 @@ EXPORT_SYMBOL(ilookup5_nowait);
 struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 
-	return ifind(sb, head, test, data, 1);
+	return ifind(sb, b, test, data, 1);
 }
 EXPORT_SYMBOL(ilookup5);
 
@@ -1133,9 +1160,9 @@ EXPORT_SYMBOL(ilookup5);
  */
 struct inode *ilookup(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, ino);
 
-	return ifind_fast(sb, head, ino);
+	return ifind_fast(sb, b, ino);
 }
 EXPORT_SYMBOL(ilookup);
 
@@ -1163,17 +1190,17 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *),
 		int (*set)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 	struct inode *inode;
 
-	inode = ifind(sb, head, test, data, 1);
+	inode = ifind(sb, b, test, data, 1);
 	if (inode)
 		return inode;
 	/*
 	 * get_new_inode() will do the right thing, re-trying the search
 	 * in case it had to block at any point.
 	 */
-	return get_new_inode(sb, head, test, set, data);
+	return get_new_inode(sb, b, test, set, data);
 }
 EXPORT_SYMBOL(iget5_locked);
 
@@ -1194,17 +1221,17 @@ EXPORT_SYMBOL(iget5_locked);
  */
 struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
 
-	inode = ifind_fast(sb, head, ino);
+	inode = ifind_fast(sb, b, ino);
 	if (inode)
 		return inode;
 	/*
 	 * get_new_inode_fast() will do the right thing, re-trying the search
 	 * in case it had to block at any point.
 	 */
-	return get_new_inode_fast(sb, head, ino);
+	return get_new_inode_fast(sb, b, ino);
 }
 EXPORT_SYMBOL(iget_locked);
 
@@ -1212,14 +1239,15 @@ int insert_inode_locked(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 	ino_t ino = inode->i_ino;
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, ino);
 
 	inode->i_state |= I_NEW;
 	while (1) {
-		struct hlist_node *node;
+		struct hlist_bl_node *node;
 		struct inode *old = NULL;
 		spin_lock(&inode_lock);
-		hlist_for_each_entry(old, node, head, i_hash) {
+		hlist_bl_lock(b);
+		hlist_bl_for_each_entry(old, node, b, i_hash) {
 			if (old->i_ino != ino)
 				continue;
 			if (old->i_sb != sb)
@@ -1229,16 +1257,18 @@ int insert_inode_locked(struct inode *inode)
 			break;
 		}
 		if (likely(!node)) {
-			hlist_add_head(&inode->i_hash, head);
+			hlist_bl_add_head(&inode->i_hash, b);
+			hlist_bl_unlock(b);
 			spin_unlock(&inode_lock);
 			return 0;
 		}
 		spin_lock(&old->i_lock);
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
+		hlist_bl_unlock(b);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
-		if (unlikely(!hlist_unhashed(&old->i_hash))) {
+		if (unlikely(!inode_unhashed(old))) {
 			iput(old);
 			return -EBUSY;
 		}
@@ -1251,16 +1281,17 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
 	struct super_block *sb = inode->i_sb;
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 
 	inode->i_state |= I_NEW;
 
 	while (1) {
-		struct hlist_node *node;
+		struct hlist_bl_node *node;
 		struct inode *old = NULL;
 
 		spin_lock(&inode_lock);
-		hlist_for_each_entry(old, node, head, i_hash) {
+		hlist_bl_lock(b);
+		hlist_bl_for_each_entry(old, node, b, i_hash) {
 			if (old->i_sb != sb)
 				continue;
 			if (!test(old, data))
@@ -1270,16 +1301,18 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 			break;
 		}
 		if (likely(!node)) {
-			hlist_add_head(&inode->i_hash, head);
+			hlist_bl_add_head(&inode->i_hash, b);
+			hlist_bl_unlock(b);
 			spin_unlock(&inode_lock);
 			return 0;
 		}
 		spin_lock(&old->i_lock);
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
+		hlist_bl_unlock(b);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
-		if (unlikely(!hlist_unhashed(&old->i_hash))) {
+		if (unlikely(!inode_unhashed(old))) {
 			iput(old);
 			return -EBUSY;
 		}
@@ -1302,7 +1335,7 @@ EXPORT_SYMBOL(generic_delete_inode);
  */
 int generic_drop_inode(struct inode *inode)
 {
-	return !inode->i_nlink || hlist_unhashed(&inode->i_hash);
+	return !inode->i_nlink || inode_unhashed(inode);
 }
 EXPORT_SYMBOL_GPL(generic_drop_inode);
 
@@ -1342,7 +1375,6 @@ static void iput_final(struct inode *inode)
 		spin_lock(&inode_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		hlist_del_init(&inode->i_hash);
 		__remove_inode_hash(inode);
 	}
 	list_del_init(&inode->i_wb_list);
@@ -1606,7 +1638,7 @@ void __init inode_init_early(void)
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct hlist_head),
+					sizeof(struct hlist_bl_head),
 					ihash_entries,
 					14,
 					HASH_EARLY,
@@ -1639,7 +1671,7 @@ void __init inode_init(void)
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct hlist_head),
+					sizeof(struct hlist_bl_head),
 					ihash_entries,
 					14,
 					0,
@@ -1648,7 +1680,7 @@ void __init inode_init(void)
 					0);
 
 	for (loop = 0; loop < (1 << i_hash_shift); loop++)
-		INIT_HLIST_HEAD(&inode_hashtable[loop]);
+		INIT_HLIST_BL_HEAD(&inode_hashtable[loop]);
 }
 
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
diff --git a/fs/nilfs2/gcinode.c b/fs/nilfs2/gcinode.c
index bed3a78..ce7344e 100644
--- a/fs/nilfs2/gcinode.c
+++ b/fs/nilfs2/gcinode.c
@@ -196,13 +196,13 @@ int nilfs_init_gccache(struct the_nilfs *nilfs)
 	INIT_LIST_HEAD(&nilfs->ns_gc_inodes);
 
 	nilfs->ns_gc_inodes_h =
-		kmalloc(sizeof(struct hlist_head) * NILFS_GCINODE_HASH_SIZE,
+		kmalloc(sizeof(struct hlist_bl_head) * NILFS_GCINODE_HASH_SIZE,
 			GFP_NOFS);
 	if (nilfs->ns_gc_inodes_h == NULL)
 		return -ENOMEM;
 
 	for (loop = 0; loop < NILFS_GCINODE_HASH_SIZE; loop++)
-		INIT_HLIST_HEAD(&nilfs->ns_gc_inodes_h[loop]);
+		INIT_HLIST_BL_HEAD(&nilfs->ns_gc_inodes_h[loop]);
 	return 0;
 }
 
@@ -254,18 +254,18 @@ static unsigned long ihash(ino_t ino, __u64 cno)
  */
 struct inode *nilfs_gc_iget(struct the_nilfs *nilfs, ino_t ino, __u64 cno)
 {
-	struct hlist_head *head = nilfs->ns_gc_inodes_h + ihash(ino, cno);
-	struct hlist_node *node;
+	struct hlist_bl_head *head = nilfs->ns_gc_inodes_h + ihash(ino, cno);
+	struct hlist_bl_node *node;
 	struct inode *inode;
 
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	hlist_bl_for_each_entry(inode, node, head, i_hash) {
 		if (inode->i_ino == ino && NILFS_I(inode)->i_cno == cno)
 			return inode;
 	}
 
 	inode = alloc_gcinode(nilfs, ino, cno);
 	if (likely(inode)) {
-		hlist_add_head(&inode->i_hash, head);
+		hlist_bl_add_head(&inode->i_hash, head);
 		list_add(&NILFS_I(inode)->i_dirty, &nilfs->ns_gc_inodes);
 	}
 	return inode;
@@ -284,16 +284,18 @@ void nilfs_clear_gcinode(struct inode *inode)
  */
 void nilfs_remove_all_gcinode(struct the_nilfs *nilfs)
 {
-	struct hlist_head *head = nilfs->ns_gc_inodes_h;
-	struct hlist_node *node, *n;
+	struct hlist_bl_head *head = nilfs->ns_gc_inodes_h;
+	struct hlist_bl_node *node;
 	struct inode *inode;
 	int loop;
 
 	for (loop = 0; loop < NILFS_GCINODE_HASH_SIZE; loop++, head++) {
-		hlist_for_each_entry_safe(inode, node, n, head, i_hash) {
-			hlist_del_init(&inode->i_hash);
+restart:
+		hlist_bl_for_each_entry(inode, node, head, i_hash) {
+			hlist_bl_del_init(&inode->i_hash);
 			list_del_init(&NILFS_I(inode)->i_dirty);
 			nilfs_clear_gcinode(inode); /* might sleep */
+			goto restart;
 		}
 	}
 }
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 9fd051a..038251c 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -2452,7 +2452,7 @@ nilfs_remove_written_gcinodes(struct the_nilfs *nilfs, struct list_head *head)
 	list_for_each_entry_safe(ii, n, head, i_dirty) {
 		if (!test_bit(NILFS_I_UPDATED, &ii->i_state))
 			continue;
-		hlist_del_init(&ii->vfs_inode.i_hash);
+		hlist_bl_del_init(&ii->vfs_inode.i_hash);
 		list_del_init(&ii->i_dirty);
 		nilfs_clear_gcinode(&ii->vfs_inode);
 	}
diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index f785a7b..1ab441a 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -167,7 +167,7 @@ struct the_nilfs {
 
 	/* GC inode list and hash table head */
 	struct list_head	ns_gc_inodes;
-	struct hlist_head      *ns_gc_inodes_h;
+	struct hlist_bl_head      *ns_gc_inodes_h;
 
 	/* Disk layout information (static) */
 	unsigned int		ns_blocksize_bits;
diff --git a/fs/reiserfs/xattr.c b/fs/reiserfs/xattr.c
index 8c4cf27..b246e3c 100644
--- a/fs/reiserfs/xattr.c
+++ b/fs/reiserfs/xattr.c
@@ -424,7 +424,7 @@ int reiserfs_prepare_write(struct file *f, struct page *page,
 static void update_ctime(struct inode *inode)
 {
 	struct timespec now = current_fs_time(inode->i_sb);
-	if (hlist_unhashed(&inode->i_hash) || !inode->i_nlink ||
+	if (inode_unhashed(inode) || !inode->i_nlink ||
 	    timespec_equal(&inode->i_ctime, &now))
 		return;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c720d65..88e457f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -383,6 +383,7 @@ struct inodes_stat_t {
 #include <linux/capability.h>
 #include <linux/semaphore.h>
 #include <linux/fiemap.h>
+#include <linux/list_bl.h>
 
 #include <asm/atomic.h>
 #include <asm/byteorder.h>
@@ -724,7 +725,7 @@ struct posix_acl;
 #define ACL_NOT_CACHED ((void *)(-1))
 
 struct inode {
-	struct hlist_node	i_hash;
+	struct hlist_bl_node	i_hash;
 	struct list_head	i_wb_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
@@ -789,6 +790,11 @@ struct inode {
 	void			*i_private; /* fs or device private pointer */
 };
 
+static inline int inode_unhashed(struct inode *inode)
+{
+	return hlist_bl_unhashed(&inode->i_hash);
+}
+
 /*
  * inode->i_mutex nesting subclasses for the lock validator:
  *
diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
index 0d791ff..5bb2370 100644
--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -126,6 +126,7 @@ static inline void hlist_bl_del_init(struct hlist_bl_node *n)
 
 #endif
 
+
 /**
  * hlist_bl_lock	- lock a hash list
  * @h:	hash list head to lock
diff --git a/mm/shmem.c b/mm/shmem.c
index 7d0bc16..419de2c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2146,7 +2146,7 @@ static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
 	if (*len < 3)
 		return 255;
 
-	if (hlist_unhashed(&inode->i_hash)) {
+	if (inode_unhashed(inode)) {
 		/* Unfortunately insert_inode_hash is not idempotent,
 		 * so as we hash inodes here rather than at creation
 		 * time, we need a lock to ensure we only try
@@ -2154,7 +2154,7 @@ static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
 		 */
 		static DEFINE_SPINLOCK(lock);
 		spin_lock(&lock);
-		if (hlist_unhashed(&inode->i_hash))
+		if (inode_unhashed(inode))
 			__insert_inode_hash(inode,
 					    inode->i_ino + inode->i_generation);
 		spin_unlock(&lock);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 11/19] fs: add a per-superblock lock for the inode list
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (9 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 10/19] fs: Introduce per-bucket inode hash locks Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-16  8:14 ` [PATCH 12/19] fs: split locking of inode writeback and LRU lists Dave Chinner
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

To allow removal of the inode_lock, we first need to protect the
superblock inode list with its own lock instead of using the
inode_lock. Add a lock to the superblock to protect this list and
nest the new lock inside the inode_lock around the list operations
it needs to protect.

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/drop_caches.c       |    4 ++++
 fs/fs-writeback.c      |    4 ++++
 fs/inode.c             |   22 +++++++++++++++++++---
 fs/notify/inode_mark.c |    3 +++
 fs/quota/dquot.c       |    6 ++++++
 fs/super.c             |    1 +
 include/linux/fs.h     |    1 +
 7 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 10c8c5a..dfe8cb1 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,6 +17,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 	struct inode *inode, *toput_inode = NULL;
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
 			continue;
@@ -25,12 +26,15 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 		spin_lock(&inode_lock);
+		spin_lock(&sb->s_inodes_lock);
 	}
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 	iput(toput_inode);
 }
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1fb5d95..676e048 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1031,6 +1031,7 @@ static void wait_sb_inodes(struct super_block *sb)
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -1050,6 +1051,7 @@ static void wait_sb_inodes(struct super_block *sb)
 		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have
@@ -1067,7 +1069,9 @@ static void wait_sb_inodes(struct super_block *sb)
 		cond_resched();
 
 		spin_lock(&inode_lock);
+		spin_lock(&sb->s_inodes_lock);
 	}
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 	iput(old_inode);
 }
diff --git a/fs/inode.c b/fs/inode.c
index 80692c5..9f7bebd 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -34,13 +34,18 @@
  *   i_ref
  * inode hash lock protects:
  *   inode hash table, i_hash
+ * sb inode lock protects:
+ *   s_inodes, i_sb_list
  *
  * Lock orders
  * inode_lock
  *   inode hash bucket lock
  *     inode->i_lock
+ *
+ * inode_lock
+ *   sb inode lock
+ *     inode->i_lock
  */
-
 /*
  * This is needed for the following functions:
  *  - inode_has_buffers
@@ -475,7 +480,9 @@ static void dispose_list(struct list_head *head)
 
 		spin_lock(&inode_lock);
 		__remove_inode_hash(inode);
+		spin_lock(&inode->i_sb->s_inodes_lock);
 		list_del_init(&inode->i_sb_list);
+		spin_unlock(&inode->i_sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 
 		wake_up_inode(inode);
@@ -486,7 +493,8 @@ static void dispose_list(struct list_head *head)
 /*
  * Invalidate all inodes for a device.
  */
-static int invalidate_list(struct list_head *head, struct list_head *dispose)
+static int invalidate_list(struct super_block *sb, struct list_head *head,
+			struct list_head *dispose)
 {
 	struct list_head *next;
 	int busy = 0;
@@ -503,6 +511,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 		 * shrink_icache_memory() away.
 		 */
 		cond_resched_lock(&inode_lock);
+		cond_resched_lock(&sb->s_inodes_lock);
 
 		next = next->next;
 		if (tmp == head)
@@ -541,8 +550,10 @@ int invalidate_inodes(struct super_block *sb)
 
 	down_write(&iprune_sem);
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 	fsnotify_unmount_inodes(&sb->s_inodes);
-	busy = invalidate_list(&sb->s_inodes, &throw_away);
+	busy = invalidate_list(sb, &sb->s_inodes, &throw_away);
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 
 	dispose_list(&throw_away);
@@ -748,7 +759,9 @@ static inline void
 __inode_add_to_lists(struct super_block *sb, struct hlist_bl_head *b,
 			struct inode *inode)
 {
+	spin_lock(&sb->s_inodes_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
+	spin_unlock(&sb->s_inodes_lock);
 	if (b) {
 		hlist_bl_lock(b);
 		hlist_bl_add_head(&inode->i_hash, b);
@@ -1388,7 +1401,10 @@ static void iput_final(struct inode *inode)
 	 */
 	inode_lru_list_del(inode);
 
+	spin_lock(&sb->s_inodes_lock);
 	list_del_init(&inode->i_sb_list);
+	spin_unlock(&sb->s_inodes_lock);
+
 	spin_unlock(&inode_lock);
 	evict(inode);
 	remove_inode_hash(inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 1a4c117..4ed0e43 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -242,6 +242,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 	list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
 		struct inode *need_iput_tmp;
+		struct super_block *sb = inode->i_sb;
 
 		/*
 		 * We cannot iref() an inode in state I_FREEING,
@@ -290,6 +291,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		 * will be added since the umount has begun.  Finally,
 		 * iprune_mutex keeps shrink_icache_memory() away.
 		 */
+		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 
 		if (need_iput_tmp)
@@ -303,5 +305,6 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		iput(inode);
 
 		spin_lock(&inode_lock);
+		spin_lock(&sb->s_inodes_lock);
 	}
 }
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 326df72..7ef5411 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -897,6 +897,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 #endif
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
 			continue;
@@ -912,6 +913,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 
 		iput(old_inode);
@@ -923,7 +925,9 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		 * keep the reference and iput it later. */
 		old_inode = inode;
 		spin_lock(&inode_lock);
+		spin_lock(&sb->s_inodes_lock);
 	}
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 	iput(old_inode);
 
@@ -1006,6 +1010,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 	int reserved = 0;
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
@@ -1019,6 +1024,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 			remove_inode_dquot_ref(inode, type, tofree_head);
 		}
 	}
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
diff --git a/fs/super.c b/fs/super.c
index 8819e3a..c5332e5 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -76,6 +76,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		INIT_LIST_HEAD(&s->s_dentry_lru);
 		init_rwsem(&s->s_umount);
 		mutex_init(&s->s_lock);
+		spin_lock_init(&s->s_inodes_lock);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
 		/*
 		 * The locking rules for s_lock are up to the
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 88e457f..962a606 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1347,6 +1347,7 @@ struct super_block {
 #endif
 	const struct xattr_handler **s_xattr;
 
+	spinlock_t		s_inodes_lock;	/* lock for s_inodes */
 	struct list_head	s_inodes;	/* all inodes */
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 #ifdef CONFIG_SMP
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 12/19] fs: split locking of inode writeback and LRU lists
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (10 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 11/19] fs: add a per-superblock lock for the inode list Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-16  8:14 ` [PATCH 13/19] fs: Protect inode->i_state with the inode->i_lock Dave Chinner
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Given that the inode LRU and IO lists are split apart, they do not
need to be protected by the same lock. So in preparation for removal
of the inode_lock, add new locks for them. The writeback lists are
only ever accessed in the context of a bdi, so add a per-BDI lock to
protect manipulations of these lists.

For the inode LRU, introduce a simple global lock to protect it.
While this could be made per-sb, it is unclear yet as to what is the
next step for optimising/parallelising reclaim of inodes. Rather
than optimise now, leave it as a global list and lock until further
analysis can be done.

Because there will now be a situation where the inode is on
different lists protected by different locks during the freeing of
the inode (i.e. not an atomic state transition), we need to ensure
that we set the I_FREEING state flag before we start removing inodes
from the IO and LRU lists. This ensures that if we race with other
threads during freeing, they will notice the I_FREEING flag is set
and be able to take appropriate action to avoid problems.

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/block_dev.c              |    5 +++
 fs/fs-writeback.c           |   51 +++++++++++++++++++++++++++++++++---
 fs/inode.c                  |   61 ++++++++++++++++++++++++++++++++++++++-----
 fs/internal.h               |    5 +++
 include/linux/backing-dev.h |    3 ++
 mm/backing-dev.c            |   19 +++++++++++++
 6 files changed, 133 insertions(+), 11 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 11ad53d..7909775 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -56,10 +56,15 @@ EXPORT_SYMBOL(I_BDEV);
 static void bdev_inode_switch_bdi(struct inode *inode,
 			struct backing_dev_info *dst)
 {
+	struct backing_dev_info *old = inode->i_data.backing_dev_info;
+
 	spin_lock(&inode_lock);
+	bdi_lock_two(old, dst);
 	inode->i_data.backing_dev_info = dst;
 	if (!list_empty(&inode->i_wb_list))
 		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
+	spin_unlock(&old->wb.b_lock);
+	spin_unlock(&dst->wb.b_lock);
 	spin_unlock(&inode_lock);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 676e048..f159141 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -157,6 +157,18 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
 }
 
 /*
+ * Remove the inode from the writeback list it is on.
+ */
+void inode_wb_list_del(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.b_lock);
+	list_del_init(&inode->i_wb_list);
+	spin_unlock(&bdi->wb.b_lock);
+}
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -169,6 +181,7 @@ static void redirty_tail(struct inode *inode)
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 
+	assert_spin_locked(&wb->b_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
@@ -186,6 +199,7 @@ static void requeue_io(struct inode *inode)
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 
+	assert_spin_locked(&wb->b_lock);
 	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
@@ -269,6 +283,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
+	assert_spin_locked(&wb->b_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
 	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
@@ -311,6 +326,7 @@ static void inode_wait_for_writeback(struct inode *inode)
 static int
 writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 {
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct address_space *mapping = inode->i_mapping;
 	unsigned dirty;
 	int ret;
@@ -330,7 +346,9 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
+			spin_lock(&bdi->wb.b_lock);
 			requeue_io(inode);
+			spin_unlock(&bdi->wb.b_lock);
 			return 0;
 		}
 
@@ -385,6 +403,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * sometimes bales out without doing anything.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
+			spin_lock(&bdi->wb.b_lock);
 			if (wbc->nr_to_write <= 0) {
 				/*
 				 * slice used up: queue for next turn
@@ -400,6 +419,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 				 */
 				redirty_tail(inode);
 			}
+			spin_unlock(&bdi->wb.b_lock);
 		} else if (inode->i_state & I_DIRTY) {
 			/*
 			 * Filesystems can dirty the inode during writeback
@@ -407,7 +427,9 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * submission or metadata updates after data IO
 			 * completion.
 			 */
+			spin_lock(&bdi->wb.b_lock);
 			redirty_tail(inode);
+			spin_unlock(&bdi->wb.b_lock);
 		} else {
 			/*
 			 * The inode is clean. If it is unused, then make sure
@@ -415,7 +437,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * does not move dirty inodes to the LRU and dirty
 			 * inodes are removed from the LRU during scanning.
 			 */
-			list_del_init(&inode->i_wb_list);
+			inode_wb_list_del(inode);
 			if (!inode->i_ref)
 				inode_lru_list_add(inode);
 		}
@@ -463,6 +485,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
 static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		struct writeback_control *wbc, bool only_this_sb)
 {
+	assert_spin_locked(&wb->b_lock);
 	while (!list_empty(&wb->b_io)) {
 		long pages_skipped;
 		struct inode *inode = list_entry(wb->b_io.prev,
@@ -478,7 +501,6 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 				redirty_tail(inode);
 				continue;
 			}
-
 			/*
 			 * The inode belongs to a different superblock.
 			 * Bounce back to the caller to unpin this and
@@ -487,7 +509,15 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			return 0;
 		}
 
-		if (inode->i_state & (I_NEW | I_WILL_FREE)) {
+		/*
+		 * We can see I_FREEING here when the inod isin the process of
+		 * being reclaimed. In that case the freer is waiting on the
+		 * wb->b_lock that we currently hold to remove the inode from
+		 * the writeback list. So we don't spin on it here, requeue it
+		 * and move on to the next inode, which will allow the other
+		 * thread to free the inode when we drop the lock.
+		 */
+		if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
 			requeue_io(inode);
 			continue;
 		}
@@ -498,10 +528,11 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		if (inode_dirtied_after(inode, wbc->wb_start))
 			return 1;
 
-		BUG_ON(inode->i_state & I_FREEING);
 		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
+		spin_unlock(&wb->b_lock);
+
 		pages_skipped = wbc->pages_skipped;
 		writeback_single_inode(inode, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
@@ -509,12 +540,15 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
+			spin_lock(&wb->b_lock);
 			redirty_tail(inode);
+			spin_unlock(&wb->b_lock);
 		}
 		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
 		spin_lock(&inode_lock);
+		spin_lock(&wb->b_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
 			return 1;
@@ -534,6 +568,8 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
+	spin_lock(&wb->b_lock);
+
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 
@@ -552,6 +588,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 		if (ret)
 			break;
 	}
+	spin_unlock(&wb->b_lock);
 	spin_unlock(&inode_lock);
 	/* Leave any unwritten inodes on b_io */
 }
@@ -562,9 +599,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	spin_lock(&inode_lock);
+	spin_lock(&wb->b_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
+	spin_unlock(&wb->b_lock);
 	spin_unlock(&inode_lock);
 }
 
@@ -677,8 +716,10 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 */
 		spin_lock(&inode_lock);
 		if (!list_empty(&wb->b_more_io))  {
+			spin_lock(&wb->b_lock);
 			inode = list_entry(wb->b_more_io.prev,
 						struct inode, i_wb_list);
+			spin_unlock(&wb->b_lock);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			inode_wait_for_writeback(inode);
 		}
@@ -991,8 +1032,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 					wakeup_bdi = true;
 			}
 
+			spin_lock(&bdi->wb.b_lock);
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+			spin_unlock(&bdi->wb.b_lock);
 		}
 	}
 out:
diff --git a/fs/inode.c b/fs/inode.c
index 9f7bebd..55b062e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -25,6 +25,8 @@
 #include <linux/async.h>
 #include <linux/posix_acl.h>
 
+#include "internal.h"
+
 /*
  * Locking rules.
  *
@@ -36,6 +38,10 @@
  *   inode hash table, i_hash
  * sb inode lock protects:
  *   s_inodes, i_sb_list
+ * bdi writeback lock protects:
+ *   b_io, b_more_io, b_dirty, i_wb_list
+ * inode_lru_lock protects:
+ *   inode_lru, i_lru
  *
  * Lock orders
  * inode_lock
@@ -44,6 +50,17 @@
  *
  * inode_lock
  *   sb inode lock
+ *     inode_lru_lock
+ *       wb->b_lock
+ *         inode->i_lock
+ *
+ * inode_lock
+ *   wb->b_lock
+ *     sb_lock (pin sb for writeback)
+ *     inode->i_lock
+ *
+ * inode_lock
+ *   inode_lru
  *     inode->i_lock
  */
 /*
@@ -94,6 +111,7 @@ static struct hlist_bl_head *inode_hashtable __read_mostly;
  * allowing for low-overhead inode sync() operations.
  */
 static LIST_HEAD(inode_lru);
+static DEFINE_SPINLOCK(inode_lru_lock);
 
 /*
  * A simple spinlock to protect the list manipulations.
@@ -354,20 +372,28 @@ void iref(struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(iref);
 
+/*
+ * check against I_FREEING as inode writeback completion could race with
+ * setting the I_FREEING and removing the inode from the LRU.
+ */
 void inode_lru_list_add(struct inode *inode)
 {
-	if (list_empty(&inode->i_lru)) {
+	spin_lock(&inode_lru_lock);
+	if (list_empty(&inode->i_lru) && !(inode->i_state & I_FREEING)) {
 		list_add(&inode->i_lru, &inode_lru);
 		percpu_counter_inc(&nr_inodes_unused);
 	}
+	spin_unlock(&inode_lru_lock);
 }
 
 void inode_lru_list_del(struct inode *inode)
 {
+	spin_lock(&inode_lru_lock);
 	if (!list_empty(&inode->i_lru)) {
 		list_del_init(&inode->i_lru);
 		percpu_counter_dec(&nr_inodes_unused);
 	}
+	spin_unlock(&inode_lru_lock);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -525,8 +551,18 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
 			spin_unlock(&inode->i_lock);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
+
+			/*
+			 * move the inode off the IO lists and LRU once
+			 * I_FREEING is set so that it won't get moved back on
+			 * there if it is dirty.
+			 */
+			inode_wb_list_del(inode);
+
+			spin_lock(&inode_lru_lock);
 			list_move(&inode->i_lru, dispose);
 			percpu_counter_dec(&nr_inodes_unused);
+			spin_unlock(&inode_lru_lock);
 			continue;
 		}
 		spin_unlock(&inode->i_lock);
@@ -600,6 +636,7 @@ static void prune_icache(int nr_to_scan)
 
 	down_read(&iprune_sem);
 	spin_lock(&inode_lock);
+	spin_lock(&inode_lru_lock);
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
 
@@ -630,12 +667,14 @@ static void prune_icache(int nr_to_scan)
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
 			inode->i_ref++;
 			spin_unlock(&inode->i_lock);
+			spin_unlock(&inode_lru_lock);
 			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
 								0, -1);
 			iput(inode);
 			spin_lock(&inode_lock);
+			spin_lock(&inode_lru_lock);
 
 			/*
 			 * if we can't reclaim this inode immediately, give it
@@ -648,16 +687,24 @@ static void prune_icache(int nr_to_scan)
 			}
 		} else
 			spin_unlock(&inode->i_lock);
-		list_move(&inode->i_lru, &freeable);
-		list_del_init(&inode->i_wb_list);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
+
+		/*
+		 * move the inode off the io lists and lru once
+		 * i_freeing is set so that it won't get moved back on
+		 * there if it is dirty.
+		 */
+		inode_wb_list_del(inode);
+
+		list_move(&inode->i_lru, &freeable);
 		percpu_counter_dec(&nr_inodes_unused);
 	}
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
 		__count_vm_events(PGINODESTEAL, reap);
+	spin_unlock(&inode_lru_lock);
 	spin_unlock(&inode_lock);
 
 	dispose_list(&freeable);
@@ -1390,15 +1437,15 @@ static void iput_final(struct inode *inode)
 		inode->i_state &= ~I_WILL_FREE;
 		__remove_inode_hash(inode);
 	}
-	list_del_init(&inode->i_wb_list);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 
 	/*
-	 * After we delete the inode from the LRU here, we avoid moving dirty
-	 * inodes back onto the LRU now because I_FREEING is set and hence
-	 * writeback_single_inode() won't move the inode around.
+	 * After we delete the inode from the LRU and IO lists here, we avoid
+	 * moving dirty inodes back onto the LRU now because I_FREEING is set
+	 * and hence writeback_single_inode() won't move the inode around.
 	 */
+	inode_wb_list_del(inode);
 	inode_lru_list_del(inode);
 
 	spin_lock(&sb->s_inodes_lock);
diff --git a/fs/internal.h b/fs/internal.h
index ece3565..f8825ae 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -107,3 +107,8 @@ extern void release_open_intent(struct nameidata *);
  */
 extern void inode_lru_list_add(struct inode *inode);
 extern void inode_lru_list_del(struct inode *inode);
+
+/*
+ * fs-writeback.c
+ */
+extern void inode_wb_list_del(struct inode *inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 35b0074..995a3ad 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -57,6 +57,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	spinlock_t b_lock;		/* writeback lists lock */
 };
 
 struct backing_dev_info {
@@ -108,6 +109,8 @@ int bdi_writeback_thread(void *data);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_arm_supers_timer(void);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_lock_two(struct backing_dev_info *bdi1,
+				struct backing_dev_info *bdi2);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 15d5097..40b84c6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -74,12 +74,14 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
 	spin_lock(&inode_lock);
+	spin_lock(&wb->b_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
 	list_for_each_entry(inode, &wb->b_io, i_wb_list)
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
+	spin_unlock(&wb->b_lock);
 	spin_unlock(&inode_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -634,6 +636,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	spin_lock_init(&wb->b_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
 
@@ -671,6 +674,19 @@ err:
 }
 EXPORT_SYMBOL(bdi_init);
 
+void bdi_lock_two(struct backing_dev_info *bdi1,
+				struct backing_dev_info *bdi2)
+{
+	if (bdi1 < bdi2) {
+		spin_lock(&bdi1->wb.b_lock);
+		spin_lock_nested(&bdi2->wb.b_lock, 1);
+	} else {
+		spin_lock(&bdi2->wb.b_lock);
+		spin_lock_nested(&bdi1->wb.b_lock, 1);
+	}
+}
+EXPORT_SYMBOL(bdi_lock_two);
+
 void bdi_destroy(struct backing_dev_info *bdi)
 {
 	int i;
@@ -683,9 +699,12 @@ void bdi_destroy(struct backing_dev_info *bdi)
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 
 		spin_lock(&inode_lock);
+		bdi_lock_two(bdi, &default_backing_dev_info);
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+		spin_unlock(&bdi->wb.b_lock);
+		spin_unlock(&dst->b_lock);
 		spin_unlock(&inode_lock);
 	}
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 13/19] fs: Protect inode->i_state with the inode->i_lock
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (11 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 12/19] fs: split locking of inode writeback and LRU lists Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-16  8:14 ` [PATCH 14/19] fs: introduce a per-cpu last_ino allocator Dave Chinner
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

We currently protect the per-inode state flags with the inode_lock.
Using a global lock to protect per-object state is overkill when we
coul duse a per-inode lock to protect the state.  Use the
inode->i_lock for this, and wrap all the state changes and checks
with the inode->i_lock.

Based on work originally written by Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/drop_caches.c       |    9 +++--
 fs/fs-writeback.c      |   48 +++++++++++++++++++------
 fs/inode.c             |   93 ++++++++++++++++++++++++++++++++++-------------
 fs/nilfs2/gcdat.c      |    1 +
 fs/notify/inode_mark.c |    6 ++-
 fs/quota/dquot.c       |   12 +++---
 6 files changed, 120 insertions(+), 49 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index dfe8cb1..f958dd8 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -19,11 +19,12 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
-			continue;
-		if (inode->i_mapping->nrpages == 0)
-			continue;
 		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    (inode->i_mapping->nrpages == 0)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f159141..b75678b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -304,10 +304,12 @@ static void inode_wait_for_writeback(struct inode *inode)
 	wait_queue_head_t *wqh;
 
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
-	 while (inode->i_state & I_SYNC) {
+	while (inode->i_state & I_SYNC) {
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
 		spin_lock(&inode_lock);
+		spin_lock(&inode->i_lock);
 	}
 }
 
@@ -331,6 +333,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	unsigned dirty;
 	int ret;
 
+	spin_lock(&inode->i_lock);
 	if (!inode->i_ref)
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
@@ -346,6 +349,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
+			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.b_lock);
 			requeue_io(inode);
 			spin_unlock(&bdi->wb.b_lock);
@@ -363,6 +367,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	/* Set I_SYNC, reset I_DIRTY_PAGES */
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
@@ -384,8 +389,10 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	 * write_inode()
 	 */
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -395,6 +402,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	}
 
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
@@ -403,6 +411,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * sometimes bales out without doing anything.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
+			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.b_lock);
 			if (wbc->nr_to_write <= 0) {
 				/*
@@ -427,6 +436,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * submission or metadata updates after data IO
 			 * completion.
 			 */
+			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.b_lock);
 			redirty_tail(inode);
 			spin_unlock(&bdi->wb.b_lock);
@@ -437,10 +447,15 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * does not move dirty inodes to the LRU and dirty
 			 * inodes are removed from the LRU during scanning.
 			 */
+			int unused = inode->i_ref == 0;
+			spin_unlock(&inode->i_lock);
 			inode_wb_list_del(inode);
-			if (!inode->i_ref)
+			if (unused)
 				inode_lru_list_add(inode);
 		}
+	} else {
+		/* freer will clean up */
+		spin_unlock(&inode->i_lock);
 	}
 	inode_sync_complete(inode);
 	return ret;
@@ -517,7 +532,9 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		 * and move on to the next inode, which will allow the other
 		 * thread to free the inode when we drop the lock.
 		 */
+		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
+			spin_unlock(&inode->i_lock);
 			requeue_io(inode);
 			continue;
 		}
@@ -525,10 +542,11 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		 * Was this inode dirtied after sync_sb_inodes was called?
 		 * This keeps sync from extra jobs and livelock.
 		 */
-		if (inode_dirtied_after(inode, wbc->wb_start))
+		if (inode_dirtied_after(inode, wbc->wb_start)) {
+			spin_unlock(&inode->i_lock);
 			return 1;
+		}
 
-		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->b_lock);
@@ -719,9 +737,11 @@ static long wb_writeback(struct bdi_writeback *wb,
 			spin_lock(&wb->b_lock);
 			inode = list_entry(wb->b_more_io.prev,
 						struct inode, i_wb_list);
+			spin_lock(&inode->i_lock);
 			spin_unlock(&wb->b_lock);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			inode_wait_for_writeback(inode);
+			spin_unlock(&inode->i_lock);
 		}
 		spin_unlock(&inode_lock);
 	}
@@ -987,6 +1007,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		block_dump___mark_inode_dirty(inode);
 
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
 
@@ -998,7 +1019,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 * superblock list, based upon its state.
 		 */
 		if (inode->i_state & I_SYNC)
-			goto out;
+			goto out_unlock;
 
 		/*
 		 * Only add valid (hashed) inodes to the superblock's
@@ -1006,10 +1027,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 */
 		if (!S_ISBLK(inode->i_mode)) {
 			if (inode_unhashed(inode))
-				goto out;
+				goto out_unlock;
 		}
 		if (inode->i_state & I_FREEING)
-			goto out;
+			goto out_unlock;
 
 		/*
 		 * If the inode was already on b_dirty/b_io/b_more_io, don't
@@ -1032,12 +1053,16 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 					wakeup_bdi = true;
 			}
 
+			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.b_lock);
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
 			spin_unlock(&bdi->wb.b_lock);
+			goto out;
 		}
 	}
+out_unlock:
+	spin_unlock(&inode->i_lock);
 out:
 	spin_unlock(&inode_lock);
 
@@ -1086,12 +1111,13 @@ static void wait_sb_inodes(struct super_block *sb)
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		struct address_space *mapping;
 
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
-			continue;
+		spin_lock(&inode->i_lock);
 		mapping = inode->i_mapping;
-		if (mapping->nrpages == 0)
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    mapping->nrpages == 0) {
+			spin_unlock(&inode->i_lock);
 			continue;
-		spin_lock(&inode->i_lock);
+		}
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
diff --git a/fs/inode.c b/fs/inode.c
index 55b062e..1ac9a61 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -33,7 +33,7 @@
  * inode->i_lock is *always* the innermost lock.
  *
  * inode->i_lock protects:
- *   i_ref
+ *   i_ref i_state
  * inode hash lock protects:
  *   inode hash table, i_hash
  * sb inode lock protects:
@@ -178,7 +178,7 @@ int proc_nr_inodes(ctl_table *table, int write,
 static void wake_up_inode(struct inode *inode)
 {
 	/*
-	 * Prevent speculative execution through spin_unlock(&inode_lock);
+	 * Prevent speculative execution through spin_unlock(&inode->i_lock);
 	 */
 	smp_mb();
 	wake_up_bit(&inode->i_state, __I_NEW);
@@ -466,7 +466,9 @@ void end_writeback(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
 	inode_sync_wait(inode);
+	spin_lock(&inode->i_lock);
 	inode->i_state = I_FREEING | I_CLEAR;
+	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(end_writeback);
 
@@ -543,14 +545,16 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
 		if (tmp == head)
 			break;
 		inode = list_entry(tmp, struct inode, i_sb_list);
-		if (inode->i_state & I_NEW)
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & I_NEW) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		invalidate_inode_buffers(inode);
-		spin_lock(&inode->i_lock);
 		if (!inode->i_ref) {
-			spin_unlock(&inode->i_lock);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
+			spin_unlock(&inode->i_lock);
 
 			/*
 			 * move the inode off the IO lists and LRU once
@@ -601,6 +605,7 @@ EXPORT_SYMBOL(invalidate_inodes);
 
 static int can_unuse(struct inode *inode)
 {
+	assert_spin_locked(&inode->i_lock);
 	if (inode->i_state)
 		return 0;
 	if (inode_has_buffers(inode))
@@ -659,9 +664,9 @@ static void prune_icache(int nr_to_scan)
 
 		/* recently referenced inodes get one more pass */
 		if (inode->i_state & I_REFERENCED) {
+			inode->i_state &= ~I_REFERENCED;
 			spin_unlock(&inode->i_lock);
 			list_move(&inode->i_lru, &inode_lru);
-			inode->i_state &= ~I_REFERENCED;
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
@@ -675,6 +680,7 @@ static void prune_icache(int nr_to_scan)
 			iput(inode);
 			spin_lock(&inode_lock);
 			spin_lock(&inode_lru_lock);
+			spin_lock(&inode->i_lock);
 
 			/*
 			 * if we can't reclaim this inode immediately, give it
@@ -683,12 +689,14 @@ static void prune_icache(int nr_to_scan)
 			 */
 			if (!can_unuse(inode)) {
 				list_move(&inode->i_lru, &inode_lru);
+				spin_unlock(&inode->i_lock);
 				continue;
 			}
-		} else
-			spin_unlock(&inode->i_lock);
+		}
+
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
+		spin_unlock(&inode->i_lock);
 
 		/*
 		 * move the inode off the io lists and lru once
@@ -742,7 +750,7 @@ static struct shrinker icache_shrinker = {
 
 static void __wait_on_freeing_inode(struct inode *inode);
 /*
- * Called with the inode lock held.
+ * Called with the inode->i_lock held.
  * NOTE: we are not increasing the inode-refcount, you must take a reference
  * by hand after calling find_inode now! This simplifies iunique and won't
  * add any additional branch in the common code.
@@ -760,8 +768,11 @@ repeat:
 	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
-		if (!test(inode, data))
+		spin_lock(&inode->i_lock);
+		if (!test(inode, data)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
 			hlist_bl_unlock(b);
 			__wait_on_freeing_inode(inode);
@@ -791,6 +802,7 @@ repeat:
 			continue;
 		if (inode->i_sb != sb)
 			continue;
+		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
 			hlist_bl_unlock(b);
 			__wait_on_freeing_inode(inode);
@@ -865,9 +877,14 @@ struct inode *new_inode(struct super_block *sb)
 	inode = alloc_inode(sb);
 	if (inode) {
 		spin_lock(&inode_lock);
-		__inode_add_to_lists(sb, NULL, inode);
+
+		/*
+		 * set the inode state before we make the inode accessible to
+		 * the outside world.
+		 */
 		inode->i_ino = ++last_ino;
 		inode->i_state = 0;
+		__inode_add_to_lists(sb, NULL, inode);
 		spin_unlock(&inode_lock);
 	}
 	return inode;
@@ -934,8 +951,12 @@ static struct inode *get_new_inode(struct super_block *sb,
 			if (set(inode, data))
 				goto set_failed;
 
-			__inode_add_to_lists(sb, b, inode);
+			/*
+			 * Set the inode state before we make the inode
+			 * visible to the outside world.
+			 */
 			inode->i_state = I_NEW;
+			__inode_add_to_lists(sb, b, inode);
 			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -949,7 +970,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		spin_lock(&old->i_lock);
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
@@ -982,9 +1002,13 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		/* We released the lock, so.. */
 		old = find_inode_fast(sb, b, ino);
 		if (!old) {
+			/*
+			 * Set the inode state before we make the inode
+			 * visible to the outside world.
+			 */
 			inode->i_ino = ino;
-			__inode_add_to_lists(sb, b, inode);
 			inode->i_state = I_NEW;
+			__inode_add_to_lists(sb, b, inode);
 			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -998,7 +1022,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		spin_lock(&old->i_lock);
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
@@ -1099,7 +1122,6 @@ static struct inode *ifind(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode(sb, b, test, data);
 	if (inode) {
-		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
@@ -1135,7 +1157,6 @@ static struct inode *ifind_fast(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode_fast(sb, b, ino);
 	if (inode) {
-		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
@@ -1312,8 +1333,11 @@ int insert_inode_locked(struct inode *inode)
 				continue;
 			if (old->i_sb != sb)
 				continue;
-			if (old->i_state & (I_FREEING|I_WILL_FREE))
+			spin_lock(&old->i_lock);
+			if (old->i_state & (I_FREEING|I_WILL_FREE)) {
+				spin_unlock(&old->i_lock);
 				continue;
+			}
 			break;
 		}
 		if (likely(!node)) {
@@ -1322,7 +1346,6 @@ int insert_inode_locked(struct inode *inode)
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		spin_lock(&old->i_lock);
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
 		hlist_bl_unlock(b);
@@ -1343,6 +1366,10 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 	struct super_block *sb = inode->i_sb;
 	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 
+	/*
+	 * Nobody else can see the new inode yet, so it is safe to set flags
+	 * without locking here.
+	 */
 	inode->i_state |= I_NEW;
 
 	while (1) {
@@ -1356,8 +1383,11 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 				continue;
 			if (!test(old, data))
 				continue;
-			if (old->i_state & (I_FREEING|I_WILL_FREE))
+			spin_lock(&old->i_lock);
+			if (old->i_state & (I_FREEING|I_WILL_FREE)) {
+				spin_unlock(&old->i_lock);
 				continue;
+			}
 			break;
 		}
 		if (likely(!node)) {
@@ -1366,7 +1396,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		spin_lock(&old->i_lock);
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
 		hlist_bl_unlock(b);
@@ -1415,6 +1444,8 @@ static void iput_final(struct inode *inode)
 	const struct super_operations *op = inode->i_sb->s_op;
 	int drop;
 
+	assert_spin_locked(&inode->i_lock);
+
 	if (op && op->drop_inode)
 		drop = op->drop_inode(inode);
 	else
@@ -1423,22 +1454,30 @@ static void iput_final(struct inode *inode)
 	if (!drop) {
 		if (sb->s_flags & MS_ACTIVE) {
 			inode->i_state |= I_REFERENCED;
-			if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+			if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
+			    list_empty(&inode->i_lru)) {
+				spin_unlock(&inode->i_lock);
 				inode_lru_list_add(inode);
+				return;
+			}
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 			return;
 		}
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_WILL_FREE;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		write_inode_now(inode, 1);
 		spin_lock(&inode_lock);
+		__remove_inode_hash(inode);
+		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		__remove_inode_hash(inode);
 	}
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
+	spin_unlock(&inode->i_lock);
 
 	/*
 	 * After we delete the inode from the LRU and IO lists here, we avoid
@@ -1472,12 +1511,11 @@ static void iput_final(struct inode *inode)
 void iput(struct inode *inode)
 {
 	if (inode) {
-		BUG_ON(inode->i_state & I_CLEAR);
-
 		spin_lock(&inode_lock);
 		spin_lock(&inode->i_lock);
+		BUG_ON(inode->i_state & I_CLEAR);
+
 		if (--inode->i_ref == 0) {
-			spin_unlock(&inode->i_lock);
 			iput_final(inode);
 			return;
 		}
@@ -1663,6 +1701,8 @@ EXPORT_SYMBOL(inode_wait);
  * wake_up_inode() after removing from the hash list will DTRT.
  *
  * This is called with inode_lock held.
+ *
+ * Called with i_lock held and returns with it dropped.
  */
 static void __wait_on_freeing_inode(struct inode *inode)
 {
@@ -1670,6 +1710,7 @@ static void __wait_on_freeing_inode(struct inode *inode)
 	DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	schedule();
 	finish_wait(wq, &wait.wait);
diff --git a/fs/nilfs2/gcdat.c b/fs/nilfs2/gcdat.c
index 84a45d1..c51f0e8 100644
--- a/fs/nilfs2/gcdat.c
+++ b/fs/nilfs2/gcdat.c
@@ -27,6 +27,7 @@
 #include "page.h"
 #include "mdt.h"
 
+/* XXX: what protects i_state? */
 int nilfs_init_gcdat_inode(struct the_nilfs *nilfs)
 {
 	struct inode *dat = nilfs->ns_dat, *gcdat = nilfs->ns_gc_dat;
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 4ed0e43..203146b 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -249,8 +249,11 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
 		 * the inode cannot have any associated watches.
 		 */
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 
 		/*
 		 * If i_ref is zero, the inode cannot have any watches and
@@ -258,7 +261,6 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		 * evict all inodes with zero i_ref from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
-		spin_lock(&inode->i_lock);
 		if (!inode->i_ref) {
 			spin_unlock(&inode->i_lock);
 			continue;
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 7ef5411..b02a3e1 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -899,18 +899,18 @@ static void add_dquot_ref(struct super_block *sb, int type)
 	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    !atomic_read(&inode->i_writecount) ||
+		    !dqinit_needed(inode, type)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
 			reserved = 1;
 #endif
-		if (!atomic_read(&inode->i_writecount))
-			continue;
-		if (!dqinit_needed(inode, type))
-			continue;
 
-		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 14/19] fs: introduce a per-cpu last_ino allocator
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (12 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 13/19] fs: Protect inode->i_state with the inode->i_lock Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-16  8:14 ` [PATCH 15/19] fs: Make iunique independent of inode_lock Dave Chinner
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Eric Dumazet <eric.dumazet@gmail.com>

new_inode() dirties a contended cache line to get increasing
inode numbers. This limits performance on workloads that cause
significant parallel inode allocation.

Solve this problem by using a per_cpu variable fed by the shared
last_ino in batches of 1024 allocations.  This reduces contention on
the shared last_ino, and give same spreading ino numbers than before
(i.e. same wraparound after 2^32 allocations).

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/inode.c |   45 ++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 1ac9a61..05852a1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -850,6 +850,43 @@ void inode_add_to_lists(struct super_block *sb, struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(inode_add_to_lists);
 
+/*
+ * Each cpu owns a range of LAST_INO_BATCH numbers.
+ * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
+ * to renew the exhausted range.
+ *
+ * This does not significantly increase overflow rate because every CPU can
+ * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
+ * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
+ * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
+ * overflow rate by 2x, which does not seem too significant.
+ *
+ * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
+ * error if st_ino won't fit in target struct field. Use 32bit counter
+ * here to attempt to avoid that.
+ */
+#define LAST_INO_BATCH 1024
+static DEFINE_PER_CPU(unsigned int, last_ino);
+
+static unsigned int get_next_ino(void)
+{
+	unsigned int *p = &get_cpu_var(last_ino);
+	unsigned int res = *p;
+
+#ifdef CONFIG_SMP
+	if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
+		static atomic_t shared_last_ino;
+		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
+
+		res = next - LAST_INO_BATCH;
+	}
+#endif
+
+	*p = ++res;
+	put_cpu_var(last_ino);
+	return res;
+}
+
 /**
  *	new_inode 	- obtain an inode
  *	@sb: superblock
@@ -864,12 +901,6 @@ EXPORT_SYMBOL_GPL(inode_add_to_lists);
  */
 struct inode *new_inode(struct super_block *sb)
 {
-	/*
-	 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
-	 * error if st_ino won't fit in target struct field. Use 32bit counter
-	 * here to attempt to avoid that.
-	 */
-	static unsigned int last_ino;
 	struct inode *inode;
 
 	spin_lock_prefetch(&inode_lock);
@@ -882,7 +913,7 @@ struct inode *new_inode(struct super_block *sb)
 		 * set the inode state before we make the inode accessible to
 		 * the outside world.
 		 */
-		inode->i_ino = ++last_ino;
+		inode->i_ino = get_next_ino();
 		inode->i_state = 0;
 		__inode_add_to_lists(sb, NULL, inode);
 		spin_unlock(&inode_lock);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 15/19] fs: Make iunique independent of inode_lock
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (13 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 14/19] fs: introduce a per-cpu last_ino allocator Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-16  8:14 ` [PATCH 16/19] fs: icache remove inode_lock Dave Chinner
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Nick Piggin <npiggin@suse.de>

Before removing the inode_lock, the iunique counter needs to be made
independent of the inode_lock. Add a new lock to protect the iunique
counter and nest it inside the inode_lock to provide the same
protection that the inode_lock currently provides.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/inode.c |   35 ++++++++++++++++++++++++++++-------
 1 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 05852a1..6526d70 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1063,6 +1063,30 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 	return inode;
 }
 
+/*
+ * search the inode cache for a matching inode number.
+ * If we find one, then the inode number we are trying to
+ * allocate is not unique and so we should not use it.
+ *
+ * Returns 1 if the inode number is unique, 0 if it is not.
+ */
+static int test_inode_iunique(struct super_block * sb, unsigned long ino)
+{
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, ino);
+	struct hlist_bl_node *node;
+	struct inode *inode;
+
+	hlist_bl_lock(b);
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
+		if (inode->i_ino == ino && inode->i_sb == sb) {
+			hlist_bl_unlock(b);
+			return 0;
+		}
+	}
+	hlist_bl_unlock(b);
+	return 1;
+}
+
 /**
  *	iunique - get a unique inode number
  *	@sb: superblock
@@ -1084,20 +1108,17 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
 	 * error if st_ino won't fit in target struct field. Use 32bit counter
 	 * here to attempt to avoid that.
 	 */
+	static DEFINE_SPINLOCK(iunique_lock);
 	static unsigned int counter;
-	struct inode *inode;
-	struct hlist_bl_head *b;
 	ino_t res;
 
-	spin_lock(&inode_lock);
+	spin_lock(&iunique_lock);
 	do {
 		if (counter <= max_reserved)
 			counter = max_reserved + 1;
 		res = counter++;
-		b = inode_hashtable + hash(sb, res);
-		inode = find_inode_fast(sb, b, res);
-	} while (inode != NULL);
-	spin_unlock(&inode_lock);
+	} while (!test_inode_iunique(sb, res));
+	spin_unlock(&iunique_lock);
 
 	return res;
 }
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 16/19] fs: icache remove inode_lock
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (14 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 15/19] fs: Make iunique independent of inode_lock Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-16  8:14 ` [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal Dave Chinner
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

All the functionality that the inode_lock protected has now been
wrapped up in new independent locks and/or functionality. Hence the
inode_lock does not serve a purpose any longer and hence can now be
removed.

Based on work originally done by Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/filesystems/Locking |    2 +-
 Documentation/filesystems/porting |    8 ++-
 Documentation/filesystems/vfs.txt |    2 +-
 fs/block_dev.c                    |    2 -
 fs/buffer.c                       |    2 +-
 fs/drop_caches.c                  |    4 -
 fs/fs-writeback.c                 |   84 +++++++------------------
 fs/inode.c                        |  125 +++++++------------------------------
 fs/logfs/inode.c                  |    2 +-
 fs/notify/inode_mark.c            |   10 +--
 fs/notify/mark.c                  |    1 -
 fs/notify/vfsmount_mark.c         |    1 -
 fs/ntfs/inode.c                   |    4 +-
 fs/ocfs2/inode.c                  |    2 +-
 fs/quota/dquot.c                  |   12 +---
 include/linux/fs.h                |    2 +-
 include/linux/writeback.h         |    2 -
 mm/backing-dev.c                  |    4 -
 mm/filemap.c                      |    6 +-
 mm/rmap.c                         |    6 +-
 20 files changed, 73 insertions(+), 208 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 2db4283..7f98cd5 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -114,7 +114,7 @@ alloc_inode:
 destroy_inode:
 dirty_inode:				(must not sleep)
 write_inode:
-drop_inode:				!!!inode_lock!!!
+drop_inode:				!!!i_lock!!!
 evict_inode:
 put_super:		write
 write_super:		read
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index b12c895..f182795 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -299,7 +299,7 @@ be used instead.  It gets called whenever the inode is evicted, whether it has
 remaining links or not.  Caller does *not* evict the pagecache or inode-associated
 metadata buffers; getting rid of those is responsibility of method, as it had
 been for ->delete_inode().
-	->drop_inode() returns int now; it's called on final iput() with inode_lock
+	->drop_inode() returns int now; it's called on final iput() with i_lock
 held and it returns true if filesystems wants the inode to be dropped.  As before,
 generic_drop_inode() is still the default and it's been updated appropriately.
 generic_delete_inode() is also alive and it consists simply of return 1.  Note that
@@ -318,3 +318,9 @@ if it's zero is not *and* *never* *had* *been* enough.  Final unlink() and iput(
 may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
 free the on-disk inode, you may end up doing that while ->write_inode() is writing
 to it.
+
+[mandatory]
+        The i_count field in the inode has been replaced with i_ref, which is
+a regular integer instead of an atomic_t.  Filesystems should not manipulate
+it directly but use helpers like igrab(), iref() and iput().
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 0dbbbe4..7ab923c 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -246,7 +246,7 @@ or bottom half).
 	should be synchronous or not, not all filesystems check this flag.
 
   drop_inode: called when the last access to the inode is dropped,
-	with the inode_lock spinlock held.
+	with the i_lock and sb_inode_list_lock spinlock held.
 
 	This method should be either NULL (normal UNIX filesystem
 	semantics) or "generic_delete_inode" (for filesystems that do not
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7909775..dae9871 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -58,14 +58,12 @@ static void bdev_inode_switch_bdi(struct inode *inode,
 {
 	struct backing_dev_info *old = inode->i_data.backing_dev_info;
 
-	spin_lock(&inode_lock);
 	bdi_lock_two(old, dst);
 	inode->i_data.backing_dev_info = dst;
 	if (!list_empty(&inode->i_wb_list))
 		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
 	spin_unlock(&old->wb.b_lock);
 	spin_unlock(&dst->wb.b_lock);
-	spin_unlock(&inode_lock);
 }
 
 static sector_t max_block(struct block_device *bdev)
diff --git a/fs/buffer.c b/fs/buffer.c
index 3e7dca2..66f7afd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1145,7 +1145,7 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
  * inode list.
  *
  * mark_buffer_dirty() is atomic.  It takes bh->b_page->mapping->private_lock,
- * mapping->tree_lock and the global inode_lock.
+ * and mapping->tree_lock.
  */
 void mark_buffer_dirty(struct buffer_head *bh)
 {
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index f958dd8..bd39f65 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,7 +16,6 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
@@ -28,15 +27,12 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
-		spin_lock(&inode_lock);
 		spin_lock(&sb->s_inodes_lock);
 	}
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 	iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b75678b..5761751 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -206,7 +206,7 @@ static void requeue_io(struct inode *inode)
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
-	 * Prevent speculative execution through spin_unlock(&inode_lock);
+	 * Prevent speculative execution through spin_unlock(&inode->i_lock);
 	 */
 	smp_mb();
 	wake_up_bit(&inode->i_state, __I_SYNC);
@@ -306,27 +306,30 @@ static void inode_wait_for_writeback(struct inode *inode)
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	while (inode->i_state & I_SYNC) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
-		spin_lock(&inode_lock);
 		spin_lock(&inode->i_lock);
 	}
 }
 
-/*
- * Write out an inode's dirty pages.  Called under inode_lock.  Either the
- * caller has a reference on the inode or the inode has I_WILL_FREE set.
+/**
+ * sync_inode - write an inode and its pages to disk.
+ * @inode: the inode to sync
+ * @wbc: controls the writeback mode
+ *
+ * sync_inode() will write an inode and its pages to disk.  It will also
+ * correctly update the inode on its superblock's dirty inode lists and will
+ * update inode->i_state.
+ *
+ * The caller must have a ref on the inode or the inode has I_WILL_FREE set.
  *
- * If `wait' is set, wait on the writeout.
+ * If @wbc->sync_mode == WB_SYNC_ALL the we are doing a data integrity
+ * operation so we need to wait on the writeout.
  *
  * The whole writeout design is quite complex and fragile.  We want to avoid
  * starvation of particular inodes when others are being redirtied, prevent
  * livelocks, etc.
- *
- * Called under inode_lock.
  */
-static int
-writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
+int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct address_space *mapping = inode->i_mapping;
@@ -368,7 +371,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
 
@@ -388,12 +390,10 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	 * due to delalloc, clear dirty metadata flags right before
 	 * write_inode()
 	 */
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
@@ -401,7 +401,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			ret = err;
 	}
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
@@ -460,6 +459,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	inode_sync_complete(inode);
 	return ret;
 }
+EXPORT_SYMBOL(sync_inode);
 
 /*
  * For background writeback the caller does not have the sb pinned
@@ -552,7 +552,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		spin_unlock(&wb->b_lock);
 
 		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wbc);
+		sync_inode(inode, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
@@ -562,10 +562,8 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			redirty_tail(inode);
 			spin_unlock(&wb->b_lock);
 		}
-		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
-		spin_lock(&inode_lock);
 		spin_lock(&wb->b_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
@@ -585,9 +583,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&inode_lock);
 	spin_lock(&wb->b_lock);
-
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 
@@ -607,7 +603,6 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 			break;
 	}
 	spin_unlock(&wb->b_lock);
-	spin_unlock(&inode_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
@@ -616,13 +611,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
 {
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_lock);
 	spin_lock(&wb->b_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&wb->b_lock);
-	spin_unlock(&inode_lock);
 }
 
 /*
@@ -732,7 +725,6 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
-		spin_lock(&inode_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			spin_lock(&wb->b_lock);
 			inode = list_entry(wb->b_more_io.prev,
@@ -743,7 +735,6 @@ static long wb_writeback(struct bdi_writeback *wb,
 			inode_wait_for_writeback(inode);
 			spin_unlock(&inode->i_lock);
 		}
-		spin_unlock(&inode_lock);
 	}
 
 	return wrote;
@@ -1006,7 +997,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 	if (unlikely(block_dump))
 		block_dump___mark_inode_dirty(inode);
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
@@ -1064,8 +1054,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 out_unlock:
 	spin_unlock(&inode->i_lock);
 out:
-	spin_unlock(&inode_lock);
-
 	if (wakeup_bdi)
 		bdi_wakeup_thread_delayed(bdi);
 }
@@ -1098,7 +1086,6 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 
 	/*
@@ -1121,14 +1108,12 @@ static void wait_sb_inodes(struct super_block *sb)
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 		/*
-		 * We hold a reference to 'inode' so it couldn't have
-		 * been removed from s_inodes list while we dropped the
-		 * inode_lock.  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it
-		 * under inode_lock. So we keep the reference and iput
-		 * it later.
+		 * We hold a reference to 'inode' so it couldn't have been
+		 * removed from s_inodes list while we dropped the
+		 * s_inodes_lock.  We cannot iput the inode now as we can be
+		 * holding the last reference and we cannot iput it under
+		 * s_inodes_lock. So we keep the reference and iput it later.
 		 */
 		iput(old_inode);
 		old_inode = inode;
@@ -1137,11 +1122,9 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&inode_lock);
 		spin_lock(&sb->s_inodes_lock);
 	}
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 	iput(old_inode);
 }
 
@@ -1244,33 +1227,10 @@ int write_inode_now(struct inode *inode, int sync)
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&inode_lock);
-	ret = writeback_single_inode(inode, &wbc);
-	spin_unlock(&inode_lock);
+	ret = sync_inode(inode, &wbc);
 	if (sync)
 		inode_sync_wait(inode);
 	return ret;
 }
 EXPORT_SYMBOL(write_inode_now);
 
-/**
- * sync_inode - write an inode and its pages to disk.
- * @inode: the inode to sync
- * @wbc: controls the writeback mode
- *
- * sync_inode() will write an inode and its pages to disk.  It will also
- * correctly update the inode on its superblock's dirty inode lists and will
- * update inode->i_state.
- *
- * The caller must have a ref on the inode.
- */
-int sync_inode(struct inode *inode, struct writeback_control *wbc)
-{
-	int ret;
-
-	spin_lock(&inode_lock);
-	ret = writeback_single_inode(inode, wbc);
-	spin_unlock(&inode_lock);
-	return ret;
-}
-EXPORT_SYMBOL(sync_inode);
diff --git a/fs/inode.c b/fs/inode.c
index 6526d70..5cddf45 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -44,24 +44,20 @@
  *   inode_lru, i_lru
  *
  * Lock orders
- * inode_lock
- *   inode hash bucket lock
- *     inode->i_lock
+ * inode hash bucket lock
+ *   inode->i_lock
  *
- * inode_lock
- *   sb inode lock
- *     inode_lru_lock
- *       wb->b_lock
- *         inode->i_lock
+ * sb inode lock
+ *   inode_lru_lock
+ *     wb->b_lock
+ *       inode->i_lock
  *
- * inode_lock
- *   wb->b_lock
- *     sb_lock (pin sb for writeback)
- *     inode->i_lock
+ * wb->b_lock
+ *   sb_lock (pin sb for writeback)
+ *   inode->i_lock
  *
- * inode_lock
- *   inode_lru
- *     inode->i_lock
+ * inode_lru
+ *   inode->i_lock
  */
 /*
  * This is needed for the following functions:
@@ -114,14 +110,6 @@ static LIST_HEAD(inode_lru);
 static DEFINE_SPINLOCK(inode_lru_lock);
 
 /*
- * A simple spinlock to protect the list manipulations.
- *
- * NOTE! You also have to own the lock if you change
- * the i_state of an inode while it is in use..
- */
-DEFINE_SPINLOCK(inode_lock);
-
-/*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages
  * icache shrinking path, and the umount path.  Without this exclusion,
  * by the time prune_icache calls iput for the inode whose pages it has
@@ -364,11 +352,9 @@ static void init_once(void *foo)
 void iref(struct inode *inode)
 {
 	WARN_ON(inode->i_ref < 1);
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	inode->i_ref++;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL_GPL(iref);
 
@@ -419,22 +405,19 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 	struct hlist_bl_head *b;
 
 	b = inode_hashtable + hash(inode->i_sb, hashval);
-	spin_lock(&inode_lock);
 	hlist_bl_lock(b);
 	hlist_bl_add_head(&inode->i_hash, b);
 	hlist_bl_unlock(b);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
 
 /**
- *	__remove_inode_hash - remove an inode from the hash
+ *	remove_inode_hash - remove an inode from the hash
  *	@inode: inode to unhash
  *
- *	Remove an inode from the superblock. inode->i_lock must be
- *	held.
+ *	Remove an inode from the superblock.
  */
-static void __remove_inode_hash(struct inode *inode)
+void remove_inode_hash(struct inode *inode)
 {
 	struct hlist_bl_head *b;
 
@@ -443,19 +426,6 @@ static void __remove_inode_hash(struct inode *inode)
 	hlist_bl_del_init(&inode->i_hash);
 	hlist_bl_unlock(b);
 }
-
-/**
- *	remove_inode_hash - remove an inode from the hash
- *	@inode: inode to unhash
- *
- *	Remove an inode from the superblock.
- */
-void remove_inode_hash(struct inode *inode)
-{
-	spin_lock(&inode_lock);
-	__remove_inode_hash(inode);
-	spin_unlock(&inode_lock);
-}
 EXPORT_SYMBOL(remove_inode_hash);
 
 void end_writeback(struct inode *inode)
@@ -506,12 +476,10 @@ static void dispose_list(struct list_head *head)
 
 		evict(inode);
 
-		spin_lock(&inode_lock);
-		__remove_inode_hash(inode);
+		remove_inode_hash(inode);
 		spin_lock(&inode->i_sb->s_inodes_lock);
 		list_del_init(&inode->i_sb_list);
 		spin_unlock(&inode->i_sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 
 		wake_up_inode(inode);
 		destroy_inode(inode);
@@ -538,7 +506,6 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
 		 * change during umount anymore, and because iprune_sem keeps
 		 * shrink_icache_memory() away.
 		 */
-		cond_resched_lock(&inode_lock);
 		cond_resched_lock(&sb->s_inodes_lock);
 
 		next = next->next;
@@ -589,12 +556,10 @@ int invalidate_inodes(struct super_block *sb)
 	LIST_HEAD(throw_away);
 
 	down_write(&iprune_sem);
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	fsnotify_unmount_inodes(&sb->s_inodes);
 	busy = invalidate_list(sb, &sb->s_inodes, &throw_away);
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 
 	dispose_list(&throw_away);
 	up_write(&iprune_sem);
@@ -619,7 +584,7 @@ static int can_unuse(struct inode *inode)
 
 /*
  * Scan `goal' inodes on the unused list for freeable ones. They are moved to a
- * temporary list and then are freed outside inode_lock by dispose_list().
+ * temporary list and then are freed outside locks by dispose_list().
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  If the inode has metadata buffers attached to
@@ -640,7 +605,6 @@ static void prune_icache(int nr_to_scan)
 	unsigned long reap = 0;
 
 	down_read(&iprune_sem);
-	spin_lock(&inode_lock);
 	spin_lock(&inode_lru_lock);
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
@@ -673,12 +637,10 @@ static void prune_icache(int nr_to_scan)
 			inode->i_ref++;
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lru_lock);
-			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
 								0, -1);
 			iput(inode);
-			spin_lock(&inode_lock);
 			spin_lock(&inode_lru_lock);
 			spin_lock(&inode->i_lock);
 
@@ -713,7 +675,6 @@ static void prune_icache(int nr_to_scan)
 	else
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&inode_lru_lock);
-	spin_unlock(&inode_lock);
 
 	dispose_list(&freeable);
 	up_read(&iprune_sem);
@@ -834,9 +795,9 @@ __inode_add_to_lists(struct super_block *sb, struct hlist_bl_head *b,
  * @inode: inode to mark in use
  *
  * When an inode is allocated it needs to be accounted for, added to the in use
- * list, the owning superblock and the inode hash. This needs to be done under
- * the inode_lock, so export a function to do this rather than the inode lock
- * itself. We calculate the hash list to add to here so it is all internal
+ * list, the owning superblock and the inode hash.
+ *
+ * We calculate the hash list to add to here so it is all internal
  * which requires the caller to have already set up the inode number in the
  * inode to add.
  */
@@ -844,9 +805,7 @@ void inode_add_to_lists(struct super_block *sb, struct inode *inode)
 {
 	struct hlist_bl_head *b = inode_hashtable + hash(sb, inode->i_ino);
 
-	spin_lock(&inode_lock);
 	__inode_add_to_lists(sb, b, inode);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL_GPL(inode_add_to_lists);
 
@@ -903,12 +862,8 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&inode_lock);
-
 	inode = alloc_inode(sb);
 	if (inode) {
-		spin_lock(&inode_lock);
-
 		/*
 		 * set the inode state before we make the inode accessible to
 		 * the outside world.
@@ -916,7 +871,6 @@ struct inode *new_inode(struct super_block *sb)
 		inode->i_ino = get_next_ino();
 		inode->i_state = 0;
 		__inode_add_to_lists(sb, NULL, inode);
-		spin_unlock(&inode_lock);
 	}
 	return inode;
 }
@@ -975,7 +929,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 	if (inode) {
 		struct inode *old;
 
-		spin_lock(&inode_lock);
 		/* We released the lock, so.. */
 		old = find_inode(sb, b, test, data);
 		if (!old) {
@@ -988,7 +941,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 			 */
 			inode->i_state = I_NEW;
 			__inode_add_to_lists(sb, b, inode);
-			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -1003,7 +955,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 */
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
 		wait_on_inode(inode);
@@ -1011,7 +962,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 	return inode;
 
 set_failed:
-	spin_unlock(&inode_lock);
 	destroy_inode(inode);
 	return NULL;
 }
@@ -1029,7 +979,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 	if (inode) {
 		struct inode *old;
 
-		spin_lock(&inode_lock);
 		/* We released the lock, so.. */
 		old = find_inode_fast(sb, b, ino);
 		if (!old) {
@@ -1040,7 +989,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 			inode->i_ino = ino;
 			inode->i_state = I_NEW;
 			__inode_add_to_lists(sb, b, inode);
-			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -1055,7 +1003,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 */
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
 		wait_on_inode(inode);
@@ -1126,7 +1073,6 @@ EXPORT_SYMBOL(iunique);
 
 struct inode *igrab(struct inode *inode)
 {
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
 		inode->i_ref++;
@@ -1140,7 +1086,6 @@ struct inode *igrab(struct inode *inode)
 		 */
 		inode = NULL;
 	}
-	spin_unlock(&inode_lock);
 	return inode;
 }
 EXPORT_SYMBOL(igrab);
@@ -1162,7 +1107,7 @@ EXPORT_SYMBOL(igrab);
  *
  * Otherwise NULL is returned.
  *
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
  */
 static struct inode *ifind(struct super_block *sb,
 		struct hlist_bl_head *b,
@@ -1171,17 +1116,14 @@ static struct inode *ifind(struct super_block *sb,
 {
 	struct inode *inode;
 
-	spin_lock(&inode_lock);
 	inode = find_inode(sb, b, test, data);
 	if (inode) {
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
 		return inode;
 	}
-	spin_unlock(&inode_lock);
 	return NULL;
 }
 
@@ -1206,16 +1148,13 @@ static struct inode *ifind_fast(struct super_block *sb,
 {
 	struct inode *inode;
 
-	spin_lock(&inode_lock);
 	inode = find_inode_fast(sb, b, ino);
 	if (inode) {
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		wait_on_inode(inode);
 		return inode;
 	}
-	spin_unlock(&inode_lock);
 	return NULL;
 }
 
@@ -1238,7 +1177,7 @@ static struct inode *ifind_fast(struct super_block *sb,
  *
  * Otherwise NULL is returned.
  *
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
  */
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
@@ -1266,7 +1205,7 @@ EXPORT_SYMBOL(ilookup5_nowait);
  *
  * Otherwise NULL is returned.
  *
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
  */
 struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
@@ -1317,7 +1256,7 @@ EXPORT_SYMBOL(ilookup);
  * inode and this is returned locked, hashed, and with the I_NEW flag set. The
  * file system gets to fill it in before unlocking it via unlock_new_inode().
  *
- * Note both @test and @set are called with the inode_lock held, so can't sleep.
+ * Note both @test and @set are called with the i_lock held, so can't sleep.
  */
 struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *),
@@ -1378,7 +1317,6 @@ int insert_inode_locked(struct inode *inode)
 	while (1) {
 		struct hlist_bl_node *node;
 		struct inode *old = NULL;
-		spin_lock(&inode_lock);
 		hlist_bl_lock(b);
 		hlist_bl_for_each_entry(old, node, b, i_hash) {
 			if (old->i_ino != ino)
@@ -1395,13 +1333,11 @@ int insert_inode_locked(struct inode *inode)
 		if (likely(!node)) {
 			hlist_bl_add_head(&inode->i_hash, b);
 			hlist_bl_unlock(b);
-			spin_unlock(&inode_lock);
 			return 0;
 		}
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
 		hlist_bl_unlock(b);
-		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!inode_unhashed(old))) {
 			iput(old);
@@ -1428,7 +1364,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		struct hlist_bl_node *node;
 		struct inode *old = NULL;
 
-		spin_lock(&inode_lock);
 		hlist_bl_lock(b);
 		hlist_bl_for_each_entry(old, node, b, i_hash) {
 			if (old->i_sb != sb)
@@ -1445,13 +1380,11 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		if (likely(!node)) {
 			hlist_bl_add_head(&inode->i_hash, b);
 			hlist_bl_unlock(b);
-			spin_unlock(&inode_lock);
 			return 0;
 		}
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
 		hlist_bl_unlock(b);
-		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!inode_unhashed(old))) {
 			iput(old);
@@ -1513,16 +1446,13 @@ static void iput_final(struct inode *inode)
 				return;
 			}
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_lock);
 			return;
 		}
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_WILL_FREE;
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		write_inode_now(inode, 1);
-		spin_lock(&inode_lock);
-		__remove_inode_hash(inode);
+		remove_inode_hash(inode);
 		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
@@ -1543,7 +1473,6 @@ static void iput_final(struct inode *inode)
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&sb->s_inodes_lock);
 
-	spin_unlock(&inode_lock);
 	evict(inode);
 	remove_inode_hash(inode);
 	wake_up_inode(inode);
@@ -1563,7 +1492,6 @@ static void iput_final(struct inode *inode)
 void iput(struct inode *inode)
 {
 	if (inode) {
-		spin_lock(&inode_lock);
 		spin_lock(&inode->i_lock);
 		BUG_ON(inode->i_state & I_CLEAR);
 
@@ -1572,7 +1500,6 @@ void iput(struct inode *inode)
 			return;
 		}
 		spin_unlock(&inode->i_lock);
-		spin_lock(&inode_lock);
 	}
 }
 EXPORT_SYMBOL(iput);
@@ -1752,8 +1679,6 @@ EXPORT_SYMBOL(inode_wait);
  * It doesn't matter if I_NEW is not set initially, a call to
  * wake_up_inode() after removing from the hash list will DTRT.
  *
- * This is called with inode_lock held.
- *
  * Called with i_lock held and returns with it dropped.
  */
 static void __wait_on_freeing_inode(struct inode *inode)
@@ -1763,10 +1688,8 @@ static void __wait_on_freeing_inode(struct inode *inode)
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	schedule();
 	finish_wait(wq, &wait.wait);
-	spin_lock(&inode_lock);
 }
 
 static __initdata unsigned long ihash_entries;
diff --git a/fs/logfs/inode.c b/fs/logfs/inode.c
index d8c71ec..a67b607 100644
--- a/fs/logfs/inode.c
+++ b/fs/logfs/inode.c
@@ -286,7 +286,7 @@ static int logfs_write_inode(struct inode *inode, struct writeback_control *wbc)
 	return ret;
 }
 
-/* called with inode_lock held */
+/* called with i_lock held */
 static int logfs_drop_inode(struct inode *inode)
 {
 	struct logfs_super *super = logfs_super(inode->i_sb);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 203146b..265ecba 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -22,7 +22,6 @@
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
-#include <linux/writeback.h> /* for inode_lock */
 
 #include <asm/atomic.h>
 
@@ -232,9 +231,8 @@ out:
  * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
  * @list: list of inodes being unmounted (sb->s_inodes)
  *
- * Called with inode_lock held, protecting the unmounting super block's list
- * of inodes, and with iprune_mutex held, keeping shrink_icache_memory() at bay.
- * We temporarily drop inode_lock, however, and CAN block.
+ * Called with iprune_mutex held, keeping shrink_icache_memory() at bay.
+ * sb->s_inodes_lock protects the super block's list of inodes.
  */
 void fsnotify_unmount_inodes(struct list_head *list)
 {
@@ -288,13 +286,12 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		}
 
 		/*
-		 * We can safely drop inode_lock here because we hold
+		 * We can safely drop sb->s_inodes_lock here because we hold
 		 * references on both inode and next_i.  Also no new inodes
 		 * will be added since the umount has begun.  Finally,
 		 * iprune_mutex keeps shrink_icache_memory() away.
 		 */
 		spin_unlock(&sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 
 		if (need_iput_tmp)
 			iput(need_iput_tmp);
@@ -306,7 +303,6 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 		iput(inode);
 
-		spin_lock(&inode_lock);
 		spin_lock(&sb->s_inodes_lock);
 	}
 }
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 325185e..50c0085 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -91,7 +91,6 @@
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/srcu.h>
-#include <linux/writeback.h> /* for inode_lock */
 
 #include <asm/atomic.h>
 
diff --git a/fs/notify/vfsmount_mark.c b/fs/notify/vfsmount_mark.c
index 56772b5..6f8eefe 100644
--- a/fs/notify/vfsmount_mark.c
+++ b/fs/notify/vfsmount_mark.c
@@ -23,7 +23,6 @@
 #include <linux/mount.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
-#include <linux/writeback.h> /* for inode_lock */
 
 #include <asm/atomic.h>
 
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index 07fdef8..9b9375a 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -54,7 +54,7 @@
  *
  * Return 1 if the attributes match and 0 if not.
  *
- * NOTE: This function runs with the inode_lock spin lock held so it is not
+ * NOTE: This function runs with the i_lock spin lock held so it is not
  * allowed to sleep.
  */
 int ntfs_test_inode(struct inode *vi, ntfs_attr *na)
@@ -98,7 +98,7 @@ int ntfs_test_inode(struct inode *vi, ntfs_attr *na)
  *
  * Return 0 on success and -errno on error.
  *
- * NOTE: This function runs with the inode_lock spin lock held so it is not
+ * NOTE: This function runs with the i_lock spin lock held so it is not
  * allowed to sleep. (Hence the GFP_ATOMIC allocation.)
  */
 static int ntfs_init_locked_inode(struct inode *vi, ntfs_attr *na)
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index eece3e0..65c61e2 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1195,7 +1195,7 @@ void ocfs2_evict_inode(struct inode *inode)
 	ocfs2_clear_inode(inode);
 }
 
-/* Called under inode_lock, with no more references on the
+/* Called under i_lock, with no more references on the
  * struct inode, so it's safe here to check the flags field
  * and to manipulate i_nlink without any other locks. */
 int ocfs2_drop_inode(struct inode *inode)
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index b02a3e1..178bed4 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -76,7 +76,7 @@
 #include <linux/buffer_head.h>
 #include <linux/capability.h>
 #include <linux/quotaops.h>
-#include <linux/writeback.h> /* for inode_lock, oddly enough.. */
+#include <linux/writeback.h>
 
 #include <asm/uaccess.h>
 
@@ -896,7 +896,6 @@ static void add_dquot_ref(struct super_block *sb, int type)
 	int reserved = 0;
 #endif
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
@@ -914,21 +913,18 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 
 		iput(old_inode);
 		__dquot_initialize(inode, type);
 		/* We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the inode_lock.
+		 * removed from s_inodes list while we dropped the lock.
 		 * We cannot iput the inode now as we can be holding the last
-		 * reference and we cannot iput it under inode_lock. So we
+		 * reference and we cannot iput it under the lock. So we
 		 * keep the reference and iput it later. */
 		old_inode = inode;
-		spin_lock(&inode_lock);
 		spin_lock(&sb->s_inodes_lock);
 	}
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1009,7 +1005,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 	struct inode *inode;
 	int reserved = 0;
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
@@ -1025,7 +1020,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 		}
 	}
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 962a606..da15124 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1593,7 +1593,7 @@ struct super_operations {
 };
 
 /*
- * Inode state bits.  Protected by inode_lock.
+ * Inode state bits.  Protected by i_lock.
  *
  * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
  * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 242b6f8..fa38cf0 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -9,8 +9,6 @@
 
 struct backing_dev_info;
 
-extern spinlock_t inode_lock;
-
 /*
  * fs/fs-writeback.c
  */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 40b84c6..af060d4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -73,7 +73,6 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 	struct inode *inode;
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
-	spin_lock(&inode_lock);
 	spin_lock(&wb->b_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
@@ -82,7 +81,6 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
 	spin_unlock(&wb->b_lock);
-	spin_unlock(&inode_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -698,14 +696,12 @@ void bdi_destroy(struct backing_dev_info *bdi)
 	if (bdi_has_dirty_io(bdi)) {
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 
-		spin_lock(&inode_lock);
 		bdi_lock_two(bdi, &default_backing_dev_info);
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
 		spin_unlock(&bdi->wb.b_lock);
 		spin_unlock(&dst->b_lock);
-		spin_unlock(&inode_lock);
 	}
 
 	bdi_unregister(bdi);
diff --git a/mm/filemap.c b/mm/filemap.c
index 3d4df44..ece6ef2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -80,7 +80,7 @@
  *  ->i_mutex
  *    ->i_alloc_sem             (various)
  *
- *  ->inode_lock
+ *  ->i_lock
  *    ->sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
  *
@@ -98,8 +98,8 @@
  *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
- *    ->inode_lock		(page_remove_rmap->set_page_dirty)
- *    ->inode_lock		(zap_pte_range->set_page_dirty)
+ *    ->i_lock			(page_remove_rmap->set_page_dirty)
+ *    ->i_lock			(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
  *
  *  ->task->proc_lock
diff --git a/mm/rmap.c b/mm/rmap.c
index 92e6757..dbfccae 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -31,11 +31,11 @@
  *             swap_lock (in swap_duplicate, swap_info_get)
  *               mmlist_lock (in mmput, drain_mmlist and others)
  *               mapping->private_lock (in __set_page_dirty_buffers)
- *               inode_lock (in set_page_dirty's __mark_inode_dirty)
- *                 sb_lock (within inode_lock in fs/fs-writeback.c)
+ *               i_lock (in set_page_dirty's __mark_inode_dirty)
+ *                 sb_lock (within i_lock in fs/fs-writeback.c)
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
- *                           within inode_lock in __sync_single_inode)
+ *                           within i_lock in __sync_single_inode)
  *
  * (code doesn't rely on that order so it could be switched around)
  * ->tasklist_lock
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (15 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 16/19] fs: icache remove inode_lock Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-17  1:30   ` Christoph Hellwig
  2010-10-16  8:14 ` [PATCH 18/19] fs: split __inode_add_to_list Dave Chinner
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Inode reclaim can push many inodes into the I_FREEING state before
it actually frees them. During the time it gathers these inodes, it
can call iput(), invalidate_mapping_pages, be preempted, etc. As a
result, holding inodes in I_FREEING can cause pauses.

After the inode scalability work, there is not a big reason to batch
up inodes to reclaim them, so we can dispose them as they are found
from the LRU.

Unmount does a very similar reclaim process via invalidate_list(),
but currently uses the i_lru list to aggregate inodes for a batched
disposal. This requires taking the inode_lru_lock for every inode we
want to dispose. Instead, take the inodes off the superblock inode
list (as we already hold the lock) and use i_sb_list as the
aggregator for inodes to dispose to reduce lock traffic.

Further, iput_final() does the same inode cleanup as reclaim and
unmount, so convert them all to use a single function for destroying
inodes. This is written such that the callers can optimise list
removals to avoid unneccessary lock round trips when removing inodes
from lists.

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c |  109 +++++++++++++++++++++++++++++++----------------------------
 1 files changed, 57 insertions(+), 52 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 5cddf45..bea1657 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -32,6 +32,8 @@
  *
  * inode->i_lock is *always* the innermost lock.
  *
+ * inode->i_lock is *always* the innermost lock.
+ *
  * inode->i_lock protects:
  *   i_ref i_state
  * inode hash lock protects:
@@ -49,8 +51,8 @@
  *
  * sb inode lock
  *   inode_lru_lock
- *     wb->b_lock
- *       inode->i_lock
+ *   wb->b_lock
+ *     inode->i_lock
  *
  * wb->b_lock
  *   sb_lock (pin sb for writeback)
@@ -460,6 +462,44 @@ static void evict(struct inode *inode)
 }
 
 /*
+ * Free the inode passed in, removing it from the lists it is still connected
+ * to but avoiding unnecessary lock round-trips for the lists it is no longer
+ * on.
+ *
+ * An inode must already be marked I_FREEING so that we avoid the inode being
+ * moved back onto lists if we race with other code that manipulates the lists
+ * (e.g. writeback_single_inode). The caller
+ */
+static void dispose_one_inode(struct inode *inode)
+{
+	BUG_ON(!(inode->i_state & I_FREEING));
+
+	/*
+	 * move the inode off the IO lists and LRU once
+	 * I_FREEING is set so that it won't get moved back on
+	 * there if it is dirty.
+	 */
+	if (!list_empty(&inode->i_wb_list))
+		inode_wb_list_del(inode);
+
+	if (!list_empty(&inode->i_lru))
+		inode_lru_list_del(inode);
+
+	if (!list_empty(&inode->i_sb_list)) {
+		spin_lock(&inode->i_sb->s_inodes_lock);
+		list_del_init(&inode->i_sb_list);
+		spin_unlock(&inode->i_sb->s_inodes_lock);
+	}
+
+	evict(inode);
+
+	remove_inode_hash(inode);
+	wake_up_inode(inode);
+	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
+	destroy_inode(inode);
+}
+
+/*
  * dispose_list - dispose of the contents of a local list
  * @head: the head of the list to free
  *
@@ -471,18 +511,10 @@ static void dispose_list(struct list_head *head)
 	while (!list_empty(head)) {
 		struct inode *inode;
 
-		inode = list_first_entry(head, struct inode, i_lru);
-		list_del_init(&inode->i_lru);
-
-		evict(inode);
-
-		remove_inode_hash(inode);
-		spin_lock(&inode->i_sb->s_inodes_lock);
+		inode = list_first_entry(head, struct inode, i_sb_list);
 		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inodes_lock);
 
-		wake_up_inode(inode);
-		destroy_inode(inode);
+		dispose_one_inode(inode);
 	}
 }
 
@@ -523,17 +555,8 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
 			inode->i_state |= I_FREEING;
 			spin_unlock(&inode->i_lock);
 
-			/*
-			 * move the inode off the IO lists and LRU once
-			 * I_FREEING is set so that it won't get moved back on
-			 * there if it is dirty.
-			 */
-			inode_wb_list_del(inode);
-
-			spin_lock(&inode_lru_lock);
-			list_move(&inode->i_lru, dispose);
-			percpu_counter_dec(&nr_inodes_unused);
-			spin_unlock(&inode_lru_lock);
+			/* save a lock round trip by removing the inode here. */
+			list_move(&inode->i_sb_list, dispose);
 			continue;
 		}
 		spin_unlock(&inode->i_lock);
@@ -552,17 +575,17 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
  */
 int invalidate_inodes(struct super_block *sb)
 {
-	int busy;
 	LIST_HEAD(throw_away);
+	int busy;
 
 	down_write(&iprune_sem);
 	spin_lock(&sb->s_inodes_lock);
 	fsnotify_unmount_inodes(&sb->s_inodes);
 	busy = invalidate_list(sb, &sb->s_inodes, &throw_away);
 	spin_unlock(&sb->s_inodes_lock);
+	up_write(&iprune_sem);
 
 	dispose_list(&throw_away);
-	up_write(&iprune_sem);
 
 	return busy;
 }
@@ -600,7 +623,6 @@ static int can_unuse(struct inode *inode)
  */
 static void prune_icache(int nr_to_scan)
 {
-	LIST_HEAD(freeable);
 	int nr_scanned;
 	unsigned long reap = 0;
 
@@ -660,15 +682,15 @@ static void prune_icache(int nr_to_scan)
 		inode->i_state |= I_FREEING;
 		spin_unlock(&inode->i_lock);
 
-		/*
-		 * move the inode off the io lists and lru once
-		 * i_freeing is set so that it won't get moved back on
-		 * there if it is dirty.
-		 */
-		inode_wb_list_del(inode);
-
-		list_move(&inode->i_lru, &freeable);
+		/* save a lock round trip by removing the inode here. */
+		list_del_init(&inode->i_lru);
 		percpu_counter_dec(&nr_inodes_unused);
+		spin_unlock(&inode_lru_lock);
+
+		dispose_one_inode(inode);
+		cond_resched();
+
+		spin_lock(&inode_lru_lock);
 	}
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
@@ -676,7 +698,6 @@ static void prune_icache(int nr_to_scan)
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&inode_lru_lock);
 
-	dispose_list(&freeable);
 	up_read(&iprune_sem);
 }
 
@@ -1461,23 +1482,7 @@ static void iput_final(struct inode *inode)
 	inode->i_state |= I_FREEING;
 	spin_unlock(&inode->i_lock);
 
-	/*
-	 * After we delete the inode from the LRU and IO lists here, we avoid
-	 * moving dirty inodes back onto the LRU now because I_FREEING is set
-	 * and hence writeback_single_inode() won't move the inode around.
-	 */
-	inode_wb_list_del(inode);
-	inode_lru_list_del(inode);
-
-	spin_lock(&sb->s_inodes_lock);
-	list_del_init(&inode->i_sb_list);
-	spin_unlock(&sb->s_inodes_lock);
-
-	evict(inode);
-	remove_inode_hash(inode);
-	wake_up_inode(inode);
-	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
-	destroy_inode(inode);
+	dispose_one_inode(inode);
 }
 
 /**
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 18/19] fs: split __inode_add_to_list
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (16 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-16  8:14 ` [PATCH 19/19] fs: do not assign default i_ino in new_inode Dave Chinner
  2010-10-16 17:55 ` Inode Lock Scalability V4 Nick Piggin
  19 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Christoph Hellwig <hch@lst.de>

__inode_add_to_list does two things that aren't related.  First it adds
the inode to the s_inodes list in the superblock, and second it optionally
adds the inode to the inode hash.  Now that these don't even share the
same lock there is no need to keeps this functionally together.  Split
out an add_to_inode_hash helper from __insert_inode_hash to add an inode
to a pre-calculated hash bucket for use by the various iget version, and
a inode_sb_list_add helper from __inode_add_to_list to just add an
inode to the per-sb list.  The inode.c-internal callers of
__inode_add_to_list are converted to a sequence of inode_sb_list_add
and __insert_inode_hash (if needed), and the only use of inode_add_to_list
in XFS is replaced with a call to inode_sb_list_add and insert_inode_hash.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c                  |   91 +++++++++++++++++++------------------------
 fs/xfs/linux-2.6/xfs_iops.c |    4 +-
 include/linux/fs.h          |    5 +-
 3 files changed, 46 insertions(+), 54 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index bea1657..3bd45fb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -384,6 +384,29 @@ void inode_lru_list_del(struct inode *inode)
 	spin_unlock(&inode_lru_lock);
 }
 
+/**
+ * inode_sb_list_add - add inode to the superblock list of inodes
+ * @inode: inode to add
+ */
+void inode_sb_list_add(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+
+	spin_lock(&sb->s_inodes_lock);
+	list_add(&inode->i_sb_list, &sb->s_inodes);
+	spin_unlock(&sb->s_inodes_lock);
+}
+EXPORT_SYMBOL_GPL(inode_sb_list_add);
+
+static void inode_sb_list_del(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+
+	spin_lock(&sb->s_inodes_lock);
+	list_del_init(&inode->i_sb_list);
+	spin_unlock(&sb->s_inodes_lock);
+}
+
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
 {
 	unsigned long tmp;
@@ -394,6 +417,13 @@ static unsigned long hash(struct super_block *sb, unsigned long hashval)
 	return tmp & I_HASHMASK;
 }
 
+static void inode_hash_list_add(struct hlist_bl_head *b, struct inode *inode)
+{
+	hlist_bl_lock(b);
+	hlist_bl_add_head(&inode->i_hash, b);
+	hlist_bl_unlock(b);
+}
+
 /**
  *	__insert_inode_hash - hash an inode
  *	@inode: unhashed inode
@@ -407,9 +437,7 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 	struct hlist_bl_head *b;
 
 	b = inode_hashtable + hash(inode->i_sb, hashval);
-	hlist_bl_lock(b);
-	hlist_bl_add_head(&inode->i_hash, b);
-	hlist_bl_unlock(b);
+	inode_hash_list_add(b, inode);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
 
@@ -468,28 +496,21 @@ static void evict(struct inode *inode)
  *
  * An inode must already be marked I_FREEING so that we avoid the inode being
  * moved back onto lists if we race with other code that manipulates the lists
- * (e.g. writeback_single_inode). The caller
+ * (e.g. writeback_single_inode). I_FREEING will prevent the other threads from
+ * manipulting the lists.
  */
 static void dispose_one_inode(struct inode *inode)
 {
 	BUG_ON(!(inode->i_state & I_FREEING));
 
-	/*
-	 * move the inode off the IO lists and LRU once
-	 * I_FREEING is set so that it won't get moved back on
-	 * there if it is dirty.
-	 */
 	if (!list_empty(&inode->i_wb_list))
 		inode_wb_list_del(inode);
 
 	if (!list_empty(&inode->i_lru))
 		inode_lru_list_del(inode);
 
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inodes_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inodes_lock);
-	}
+	if (!list_empty(&inode->i_sb_list))
+		inode_sb_list_del(inode);
 
 	evict(inode);
 
@@ -796,40 +817,6 @@ repeat:
 	return node ? inode : NULL;
 }
 
-static inline void
-__inode_add_to_lists(struct super_block *sb, struct hlist_bl_head *b,
-			struct inode *inode)
-{
-	spin_lock(&sb->s_inodes_lock);
-	list_add(&inode->i_sb_list, &sb->s_inodes);
-	spin_unlock(&sb->s_inodes_lock);
-	if (b) {
-		hlist_bl_lock(b);
-		hlist_bl_add_head(&inode->i_hash, b);
-		hlist_bl_unlock(b);
-	}
-}
-
-/**
- * inode_add_to_lists - add a new inode to relevant lists
- * @sb: superblock inode belongs to
- * @inode: inode to mark in use
- *
- * When an inode is allocated it needs to be accounted for, added to the in use
- * list, the owning superblock and the inode hash.
- *
- * We calculate the hash list to add to here so it is all internal
- * which requires the caller to have already set up the inode number in the
- * inode to add.
- */
-void inode_add_to_lists(struct super_block *sb, struct inode *inode)
-{
-	struct hlist_bl_head *b = inode_hashtable + hash(sb, inode->i_ino);
-
-	__inode_add_to_lists(sb, b, inode);
-}
-EXPORT_SYMBOL_GPL(inode_add_to_lists);
-
 /*
  * Each cpu owns a range of LAST_INO_BATCH numbers.
  * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
@@ -891,7 +878,7 @@ struct inode *new_inode(struct super_block *sb)
 		 */
 		inode->i_ino = get_next_ino();
 		inode->i_state = 0;
-		__inode_add_to_lists(sb, NULL, inode);
+		inode_sb_list_add(inode);
 	}
 	return inode;
 }
@@ -961,7 +948,8 @@ static struct inode *get_new_inode(struct super_block *sb,
 			 * visible to the outside world.
 			 */
 			inode->i_state = I_NEW;
-			__inode_add_to_lists(sb, b, inode);
+			inode_sb_list_add(inode);
+			inode_hash_list_add(b, inode);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -1009,7 +997,8 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 			 */
 			inode->i_ino = ino;
 			inode->i_state = I_NEW;
-			__inode_add_to_lists(sb, b, inode);
+			inode_sb_list_add(inode);
+			inode_hash_list_add(b, inode);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index b7ec465..3c7cea3 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -795,7 +795,9 @@ xfs_setup_inode(
 
 	inode->i_ino = ip->i_ino;
 	inode->i_state = I_NEW;
-	inode_add_to_lists(ip->i_mount->m_super, inode);
+
+	inode_sb_list_add(inode);
+	insert_inode_hash(inode);
 
 	inode->i_mode	= ip->i_d.di_mode;
 	inode->i_nlink	= ip->i_d.di_nlink;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index da15124..abdb756 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2170,7 +2170,6 @@ extern loff_t vfs_llseek(struct file *file, loff_t offset, int origin);
 
 extern int inode_init_always(struct super_block *, struct inode *);
 extern void inode_init_once(struct inode *);
-extern void inode_add_to_lists(struct super_block *, struct inode *);
 extern void iput(struct inode *);
 extern struct inode * igrab(struct inode *);
 extern ino_t iunique(struct super_block *, ino_t);
@@ -2202,9 +2201,11 @@ extern int file_remove_suid(struct file *);
 
 extern void __insert_inode_hash(struct inode *, unsigned long hashval);
 extern void remove_inode_hash(struct inode *);
-static inline void insert_inode_hash(struct inode *inode) {
+static inline void insert_inode_hash(struct inode *inode)
+{
 	__insert_inode_hash(inode, inode->i_ino);
 }
+extern void inode_sb_list_add(struct inode *inode);
 
 #ifdef CONFIG_BLOCK
 extern void submit_bio(int, struct bio *);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 19/19] fs: do not assign default i_ino in new_inode
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (17 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 18/19] fs: split __inode_add_to_list Dave Chinner
@ 2010-10-16  8:14 ` Dave Chinner
  2010-10-16  9:09   ` Eric Dumazet
  2010-10-16 17:55 ` Inode Lock Scalability V4 Nick Piggin
  19 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2010-10-16  8:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Christoph Hellwig <hch@lst.de>

Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves.  For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 drivers/infiniband/hw/ipath/ipath_fs.c |    1 +
 drivers/infiniband/hw/qib/qib_fs.c     |    1 +
 drivers/misc/ibmasm/ibmasmfs.c         |    1 +
 drivers/oprofile/oprofilefs.c          |    1 +
 drivers/usb/core/inode.c               |    1 +
 drivers/usb/gadget/f_fs.c              |    1 +
 drivers/usb/gadget/inode.c             |    1 +
 fs/anon_inodes.c                       |    1 +
 fs/autofs4/inode.c                     |    1 +
 fs/binfmt_misc.c                       |    1 +
 fs/configfs/inode.c                    |    1 +
 fs/debugfs/inode.c                     |    1 +
 fs/ext4/mballoc.c                      |    1 +
 fs/freevxfs/vxfs_inode.c               |    1 +
 fs/fuse/control.c                      |    1 +
 fs/hugetlbfs/inode.c                   |    1 +
 fs/inode.c                             |    3 ++-
 fs/ocfs2/dlmfs/dlmfs.c                 |    2 ++
 fs/pipe.c                              |    2 ++
 fs/proc/base.c                         |    2 ++
 fs/proc/proc_sysctl.c                  |    2 ++
 fs/ramfs/inode.c                       |    1 +
 fs/xfs/linux-2.6/xfs_buf.c             |    1 +
 include/linux/fs.h                     |    1 +
 ipc/mqueue.c                           |    1 +
 kernel/cgroup.c                        |    1 +
 mm/shmem.c                             |    1 +
 net/socket.c                           |    1 +
 net/sunrpc/rpc_pipe.c                  |    1 +
 security/inode.c                       |    1 +
 security/selinux/selinuxfs.c           |    1 +
 31 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_fs.c b/drivers/infiniband/hw/ipath/ipath_fs.c
index 2fca708..3d7c1df 100644
--- a/drivers/infiniband/hw/ipath/ipath_fs.c
+++ b/drivers/infiniband/hw/ipath/ipath_fs.c
@@ -57,6 +57,7 @@ static int ipathfs_mknod(struct inode *dir, struct dentry *dentry,
 		goto bail;
 	}
 
+	inode->i_ino = get_next_ino();
 	inode->i_mode = mode;
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	inode->i_private = data;
diff --git a/drivers/infiniband/hw/qib/qib_fs.c b/drivers/infiniband/hw/qib/qib_fs.c
index 9f989c0..0a8da2a 100644
--- a/drivers/infiniband/hw/qib/qib_fs.c
+++ b/drivers/infiniband/hw/qib/qib_fs.c
@@ -58,6 +58,7 @@ static int qibfs_mknod(struct inode *dir, struct dentry *dentry,
 		goto bail;
 	}
 
+	inode->i_ino = get_next_ino();
 	inode->i_mode = mode;
 	inode->i_uid = 0;
 	inode->i_gid = 0;
diff --git a/drivers/misc/ibmasm/ibmasmfs.c b/drivers/misc/ibmasm/ibmasmfs.c
index 8844a3f..1ebe935 100644
--- a/drivers/misc/ibmasm/ibmasmfs.c
+++ b/drivers/misc/ibmasm/ibmasmfs.c
@@ -146,6 +146,7 @@ static struct inode *ibmasmfs_make_inode(struct super_block *sb, int mode)
 	struct inode *ret = new_inode(sb);
 
 	if (ret) {
+		ret->i_ino = get_next_ino();
 		ret->i_mode = mode;
 		ret->i_atime = ret->i_mtime = ret->i_ctime = CURRENT_TIME;
 	}
diff --git a/drivers/oprofile/oprofilefs.c b/drivers/oprofile/oprofilefs.c
index 2766a6d..5acc58d 100644
--- a/drivers/oprofile/oprofilefs.c
+++ b/drivers/oprofile/oprofilefs.c
@@ -28,6 +28,7 @@ static struct inode *oprofilefs_get_inode(struct super_block *sb, int mode)
 	struct inode *inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	}
diff --git a/drivers/usb/core/inode.c b/drivers/usb/core/inode.c
index 095fa53..e2f63c0 100644
--- a/drivers/usb/core/inode.c
+++ b/drivers/usb/core/inode.c
@@ -276,6 +276,7 @@ static struct inode *usbfs_get_inode (struct super_block *sb, int mode, dev_t de
 	struct inode *inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_uid = current_fsuid();
 		inode->i_gid = current_fsgid();
diff --git a/drivers/usb/gadget/f_fs.c b/drivers/usb/gadget/f_fs.c
index e4f5950..e093fd8 100644
--- a/drivers/usb/gadget/f_fs.c
+++ b/drivers/usb/gadget/f_fs.c
@@ -980,6 +980,7 @@ ffs_sb_make_inode(struct super_block *sb, void *data,
 	if (likely(inode)) {
 		struct timespec current_time = CURRENT_TIME;
 
+		inode->i_ino	 = usbfs_get_inode();
 		inode->i_mode    = perms->mode;
 		inode->i_uid     = perms->uid;
 		inode->i_gid     = perms->gid;
diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index fc35406..136e78d 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -1994,6 +1994,7 @@ gadgetfs_make_inode (struct super_block *sb,
 	struct inode *inode = new_inode (sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_uid = default_uid;
 		inode->i_gid = default_gid;
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 451be78..327c484 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -189,6 +189,7 @@ static struct inode *anon_inode_mkinode(void)
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
 
+	inode->i_ino = get_next_ino();
 	inode->i_fop = &anon_inode_fops;
 
 	inode->i_mapping->a_ops = &anon_aops;
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 821b2b9..ac87e49 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -398,6 +398,7 @@ struct inode *autofs4_get_inode(struct super_block *sb,
 		inode->i_gid = sb->s_root->d_inode->i_gid;
 	}
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+	inode->i_ino = get_next_ino();
 
 	if (S_ISDIR(inf->mode)) {
 		inode->i_nlink = 2;
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index fd0cc0b..37c4aef 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -495,6 +495,7 @@ static struct inode *bm_get_inode(struct super_block *sb, int mode)
 	struct inode * inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_atime = inode->i_mtime = inode->i_ctime =
 			current_fs_time(inode->i_sb);
diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
index cf78d44..253476d 100644
--- a/fs/configfs/inode.c
+++ b/fs/configfs/inode.c
@@ -135,6 +135,7 @@ struct inode * configfs_new_inode(mode_t mode, struct configfs_dirent * sd)
 {
 	struct inode * inode = new_inode(configfs_sb);
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mapping->a_ops = &configfs_aops;
 		inode->i_mapping->backing_dev_info = &configfs_backing_dev_info;
 		inode->i_op = &configfs_inode_operations;
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 30a87b3..a4ed838 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -40,6 +40,7 @@ static struct inode *debugfs_get_inode(struct super_block *sb, int mode, dev_t d
 	struct inode *inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4b4ad4b..96e2bf3 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2373,6 +2373,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
 		printk(KERN_ERR "EXT4-fs: can't get new inode\n");
 		goto err_freesgi;
 	}
+	sbi->s_buddy_cache->i_ino = get_next_ino();
 	EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
 	for (i = 0; i < ngroups; i++) {
 		desc = ext4_get_group_desc(sb, i, NULL);
diff --git a/fs/freevxfs/vxfs_inode.c b/fs/freevxfs/vxfs_inode.c
index 79d1b4e..8c04eac 100644
--- a/fs/freevxfs/vxfs_inode.c
+++ b/fs/freevxfs/vxfs_inode.c
@@ -260,6 +260,7 @@ vxfs_get_fake_inode(struct super_block *sbp, struct vxfs_inode_info *vip)
 	struct inode			*ip = NULL;
 
 	if ((ip = new_inode(sbp))) {
+		ip->i_ino = get_next_ino();
 		vxfs_iinit(ip, vip);
 		ip->i_mapping->a_ops = &vxfs_aops;
 	}
diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index 3773fd6..3f67de2 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -218,6 +218,7 @@ static struct dentry *fuse_ctl_add_dentry(struct dentry *parent,
 	if (!inode)
 		return NULL;
 
+	inode->i_ino = get_next_ino();
 	inode->i_mode = mode;
 	inode->i_uid = fc->user_id;
 	inode->i_gid = fc->group_id;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 6e5bd42..b83f9ff 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -455,6 +455,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid,
 	inode = new_inode(sb);
 	if (inode) {
 		struct hugetlbfs_inode_info *info;
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_uid = uid;
 		inode->i_gid = gid;
diff --git a/fs/inode.c b/fs/inode.c
index 3bd45fb..e39c8e5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -835,7 +835,7 @@ repeat:
 #define LAST_INO_BATCH 1024
 static DEFINE_PER_CPU(unsigned int, last_ino);
 
-static unsigned int get_next_ino(void)
+unsigned int get_next_ino(void)
 {
 	unsigned int *p = &get_cpu_var(last_ino);
 	unsigned int res = *p;
@@ -853,6 +853,7 @@ static unsigned int get_next_ino(void)
 	put_cpu_var(last_ino);
 	return res;
 }
+EXPORT_SYMBOL(get_next_ino);
 
 /**
  *	new_inode 	- obtain an inode
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index c2903b8..124d400 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -400,6 +400,7 @@ static struct inode *dlmfs_get_root_inode(struct super_block *sb)
 	if (inode) {
 		ip = DLMFS_I(inode);
 
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_uid = current_fsuid();
 		inode->i_gid = current_fsgid();
@@ -425,6 +426,7 @@ static struct inode *dlmfs_get_inode(struct inode *parent,
 	if (!inode)
 		return NULL;
 
+	inode->i_ino = get_next_ino();
 	inode->i_mode = mode;
 	inode->i_uid = current_fsuid();
 	inode->i_gid = current_fsgid();
diff --git a/fs/pipe.c b/fs/pipe.c
index 279eef9..acd453b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -954,6 +954,8 @@ static struct inode * get_pipe_inode(void)
 	if (!inode)
 		goto fail_inode;
 
+	inode->i_ino = get_next_ino();
+
 	pipe = alloc_pipe_info(inode);
 	if (!pipe)
 		goto fail_iput;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8e4adda..d2efd66 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1600,6 +1600,7 @@ static struct inode *proc_pid_make_inode(struct super_block * sb, struct task_st
 
 	/* Common stuff */
 	ei = PROC_I(inode);
+	inode->i_ino = get_next_ino();
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
 	inode->i_op = &proc_def_inode_operations;
 
@@ -2542,6 +2543,7 @@ static struct dentry *proc_base_instantiate(struct inode *dir,
 
 	/* Initialize the inode */
 	ei = PROC_I(inode);
+	inode->i_ino = get_next_ino();
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
 
 	/*
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 5be436e..f473a7b 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -23,6 +23,8 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 	if (!inode)
 		goto out;
 
+	inode->i_ino = get_next_ino();
+
 	sysctl_head_get(head);
 	ei = PROC_I(inode);
 	ei->sysctl = head;
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index a5ebae7..67fadb1 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -58,6 +58,7 @@ struct inode *ramfs_get_inode(struct super_block *sb,
 	struct inode * inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode_init_owner(inode, dir, mode);
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 286e36e..a47e6db 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -1572,6 +1572,7 @@ xfs_mapping_buftarg(
 			XFS_BUFTARG_NAME(btp));
 		return ENOMEM;
 	}
+	inode->i_ino = get_next_ino();
 	inode->i_mode = S_IFBLK;
 	inode->i_bdev = bdev;
 	inode->i_rdev = bdev->bd_dev;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index abdb756..213272b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2189,6 +2189,7 @@ extern struct inode * iget_locked(struct super_block *, unsigned long);
 extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
 extern int insert_inode_locked(struct inode *);
 extern void unlock_new_inode(struct inode *);
+extern unsigned int get_next_ino(void);
 
 extern void iref(struct inode *inode);
 extern void iget_failed(struct inode *);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index d53a2c1..a72f3c5 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -116,6 +116,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
 
 	inode = new_inode(sb);
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_uid = current_fsuid();
 		inode->i_gid = current_fsgid();
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c9483d8..e28f8e5 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -778,6 +778,7 @@ static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
 	struct inode *inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_uid = current_fsuid();
 		inode->i_gid = current_fsgid();
diff --git a/mm/shmem.c b/mm/shmem.c
index 419de2c..504ae65 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1586,6 +1586,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 
 	inode = new_inode(sb);
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode_init_owner(inode, dir, mode);
 		inode->i_blocks = 0;
 		inode->i_mapping->backing_dev_info = &shmem_backing_dev_info;
diff --git a/net/socket.c b/net/socket.c
index 715ca57..56114ec 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -480,6 +480,7 @@ static struct socket *sock_alloc(void)
 	sock = SOCKET_I(inode);
 
 	kmemcheck_annotate_bitfield(sock, type);
+	inode->i_ino = get_next_ino();
 	inode->i_mode = S_IFSOCK | S_IRWXUGO;
 	inode->i_uid = current_fsuid();
 	inode->i_gid = current_fsgid();
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index 8c8eef2..70da9a4 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -453,6 +453,7 @@ rpc_get_inode(struct super_block *sb, umode_t mode)
 	struct inode *inode = new_inode(sb);
 	if (!inode)
 		return NULL;
+	inode->i_ino = get_next_ino();
 	inode->i_mode = mode;
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	switch(mode & S_IFMT) {
diff --git a/security/inode.c b/security/inode.c
index 8c777f0..d3321c2 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -60,6 +60,7 @@ static struct inode *get_inode(struct super_block *sb, int mode, dev_t dev)
 	struct inode *inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index 79a1bb6..9e98cdc 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -785,6 +785,7 @@ static struct inode *sel_make_inode(struct super_block *sb, int mode)
 	struct inode *ret = new_inode(sb);
 
 	if (ret) {
+		ret->i_ino = get_next_ino();
 		ret->i_mode = mode;
 		ret->i_atime = ret->i_mtime = ret->i_ctime = CURRENT_TIME;
 	}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 03/19] fs: Convert nr_inodes and nr_unused to per-cpu counters
  2010-10-16  8:13 ` [PATCH 03/19] fs: Convert nr_inodes and nr_unused to per-cpu counters Dave Chinner
@ 2010-10-16  8:29   ` Eric Dumazet
  2010-10-16 10:04     ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Eric Dumazet @ 2010-10-16  8:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

Le samedi 16 octobre 2010 à 19:13 +1100, Dave Chinner a écrit :
> From: Dave Chinner <dchinner@redhat.com>
> 
> The number of inodes allocated does not need to be tied to the
> addition or removal of an inode to/from a list. If we are not tied
> to a list lock, we could update the counters when inodes are
> initialised or destroyed, but to do that we need to convert the
> counters to be per-cpu (i.e. independent of a lock). This means that
> we have the freedom to change the list/locking implementation
> without needing to care about the counters.
> 
> Based on a patch originally from Eric Dumazet.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---

NACK

Some people believe percpu_counter object is the right answer to such
distributed counters, because the loop is done on 'online' cpus instead
of 'possible' cpus. "It must be better if number of possible cpus is
4096 and only one or two cpus are online"...

But if we do this loop only on rare events, like
"cat /proc/sys/fs/inode-nr", then the percpu_counter() is more
expensive, because percpu_add() _is_ more expensive :

- Its a function call and lot of instructions/cycles per call, while
this_cpu_inc(nr_inodes) is a single instruction, using no register on
x86.

- Its possibly accessing a shared spinlock and counter when the percpu
counter reaches the batch limit.


To recap : nr_inodes is not a counter that needs to be estimated in real
time, since we have not limit on number of inodes in the machine (limit
is the memory allocator).

Unless someone can prove "cat /proc/sys/fs/inode-nr" must be performed
thousand of times per second on their setup, the choice I made to scale
nr_inodes is better over the 'obvious percpu_counter choice'

This choice was made to scale some counters in network stack some years
ago.

Thanks

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 19/19] fs: do not assign default i_ino in new_inode
  2010-10-16  8:14 ` [PATCH 19/19] fs: do not assign default i_ino in new_inode Dave Chinner
@ 2010-10-16  9:09   ` Eric Dumazet
  2010-10-16 16:35     ` Christoph Hellwig
  0 siblings, 1 reply; 59+ messages in thread
From: Eric Dumazet @ 2010-10-16  9:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

Le samedi 16 octobre 2010 à 19:14 +1100, Dave Chinner a écrit :
> From: Christoph Hellwig <hch@lst.de>
> 
> Instead of always assigning an increasing inode number in new_inode
> move the call to assign it into those callers that actually need it.
> For now callers that need it is estimated conservatively, that is
> the call is added to all filesystems that do not assign an i_ino
> by themselves.  For a few more filesystems we can avoid assigning
> any inode number given that they aren't user visible, and for others
> it could be done lazily when an inode number is actually needed,
> but that's left for later patches.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

I wonder if adding a flag in super_block to explicitely say :

"I dont need new_inode() allocates a i_ino for my new inode, because
I'll take care of this myself later"

would be safer, permiting each fs maintainer to assert the flag instead
of a single patch.



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 04/19] fs: Implement lazy LRU updates for inodes.
  2010-10-16  8:13 ` [PATCH 04/19] fs: Implement lazy LRU updates for inodes Dave Chinner
@ 2010-10-16  9:29   ` Nick Piggin
  2010-10-16 16:59     ` Christoph Hellwig
  0 siblings, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2010-10-16  9:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 07:13:58PM +1100, Dave Chinner wrote:
> @@ -502,11 +527,15 @@ static void prune_icache(int nr_to_scan)
>  			iput(inode);
>  			spin_lock(&inode_lock);
>  
> -			if (inode != list_entry(inode_unused.next,
> -						struct inode, i_list))
> -				continue;	/* wrong inode or list_empty */
> -			if (!can_unuse(inode))
> +			/*
> +			 * if we can't reclaim this inode immediately, give it
> +			 * another pass through the free list so we don't spin
> +			 * on it.
> +			 */
> +			if (!can_unuse(inode)) {
> +				list_move(&inode->i_list, &inode_unused);
>  				continue;
> +			}
>  		}
>  		list_move(&inode->i_list, &freeable);
>  		WARN_ON(inode->i_state & I_NEW);

This is a bug, actually 2 bugs, which is why I omitted it in the version
you picked up. I agree we want the optimisation though, so I've added it
back in my tree.

After you iput() and then re take the inode lock, you can't reference
the inode because you don't know what happened to it. You need to keep
that pointer check to verify it is still there.

Secondly, iput sets I_REFERENCED, so you can't leave can_unuse
unmodified otherwise then it's basically always a false negative. It
needs to check for (i_state & ~I_REFERENCED)



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 01/19] fs: switch bdev inode bdi's correctly
  2010-10-16  8:13 ` [PATCH 01/19] fs: switch bdev inode bdi's correctly Dave Chinner
@ 2010-10-16  9:30   ` Nick Piggin
  2010-10-16 16:31   ` Christoph Hellwig
  1 sibling, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2010-10-16  9:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 07:13:55PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> bdev inodes can remain dirty even after their last close. Hence the
> BDI associated with the bdev->inode gets modified duringthe last
> close to point to the default BDI. However, the bdev inode still
> needs to be moved to the dirty lists of the new BDI, otherwise it
> will corrupt the writeback list is was left on.
> 
> Add a new function bdev_inode_switch_bdi() to move all the bdi state
> from the old bdi to the new one safely. This is only a temporary
> measure until the bdev inode<->bdi lifecycle problems are sorted
> out.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Nice to see you and Christoph eliminated the need for that bdev:inode
hack in my tree. Looks good.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 02/19] kernel: add bl_list
  2010-10-16  8:13 ` [PATCH 02/19] kernel: add bl_list Dave Chinner
@ 2010-10-16  9:51   ` Nick Piggin
  2010-10-16 16:32     ` Christoph Hellwig
  0 siblings, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2010-10-16  9:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 07:13:56PM +1100, Dave Chinner wrote:
> From: Nick Piggin <npiggin@suse.de>
> +/**
> + * hlist_bl_lock	- lock a hash list
> + * @h:	hash list head to lock
> + */
> +static inline void hlist_bl_lock(struct hlist_bl_head *h)
> +{
> +	bit_spin_lock(0, (unsigned long *)h);
> +}
> +
> +/**
> + * hlist_bl_unlock	- unlock a hash list
> + * @h:	hash list head to unlock
> + */
> +static inline void hlist_bl_unlock(struct hlist_bl_head *h)
> +{
> +	__bit_spin_unlock(0, (unsigned long *)h);
> +}

I don't know why Christoph asked for these; he's usually against
obfuscating wrapper functions.

bit_spin_lock is fine in callers so please leave it at that.
Other users may not want a spinlock, or might want to use
bit_spin_is_locked, bit_spin_trylock etc.


> diff --git a/include/linux/poison.h b/include/linux/poison.h
> index 2110a81..d367d39 100644
> --- a/include/linux/poison.h
> +++ b/include/linux/poison.h
> @@ -22,6 +22,8 @@
>  #define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
>  #define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)
>  
> +#define BL_LIST_POISON1  ((void *) 0x00300300 + POISON_POINTER_DELTA)
> +#define BL_LIST_POISON2  ((void *) 0x00400400 + POISON_POINTER_DELTA)
>  /********** include/linux/timer.h **********/
>  /*
>   * Magic number "tsta" to indicate a static timer initializer

I don't see the point of this, but if there is one it should be
documented in a different patch and propagated to other types of
lists too. I wouldn't worry about it though.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 03/19] fs: Convert nr_inodes and nr_unused to per-cpu counters
  2010-10-16  8:29   ` Eric Dumazet
@ 2010-10-16 10:04     ` Dave Chinner
  2010-10-16 10:27       ` Eric Dumazet
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2010-10-16 10:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 10:29:36AM +0200, Eric Dumazet wrote:
> Le samedi 16 octobre 2010 à 19:13 +1100, Dave Chinner a écrit :
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The number of inodes allocated does not need to be tied to the
> > addition or removal of an inode to/from a list. If we are not tied
> > to a list lock, we could update the counters when inodes are
> > initialised or destroyed, but to do that we need to convert the
> > counters to be per-cpu (i.e. independent of a lock). This means that
> > we have the freedom to change the list/locking implementation
> > without needing to care about the counters.
> > 
> > Based on a patch originally from Eric Dumazet.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > ---
> 
> NACK
> 
> Some people believe percpu_counter object is the right answer to such
> distributed counters, because the loop is done on 'online' cpus instead
> of 'possible' cpus. "It must be better if number of possible cpus is
> 4096 and only one or two cpus are online"...
> 
> But if we do this loop only on rare events, like
> "cat /proc/sys/fs/inode-nr", then the percpu_counter() is more
> expensive, because percpu_add() _is_ more expensive :
> 
> - Its a function call and lot of instructions/cycles per call, while
> this_cpu_inc(nr_inodes) is a single instruction, using no register on
> x86.
> 
> - Its possibly accessing a shared spinlock and counter when the percpu
> counter reaches the batch limit.
> 
> 
> To recap : nr_inodes is not a counter that needs to be estimated in real
> time, since we have not limit on number of inodes in the machine (limit
> is the memory allocator).
> 
> Unless someone can prove "cat /proc/sys/fs/inode-nr" must be performed
> thousand of times per second on their setup, the choice I made to scale
> nr_inodes is better over the 'obvious percpu_counter choice'

get_nr_inodes_unused() is called on every single shrinker call. i.e.
for every 128 inodes we attempt to reclaim. Given that I'm seeing
inode reclaim rate in the order of a million per second on a 8p box,
that meets your criteria for using the generic percpu counter
infrastructure.

Also, get_nr_inodes() is also called by get_nr_dirty_inodes(), which is
called by the high level inode writeback code, so will typically be
called in the order of tens of times per second, and the number of
calls increased depending on the number of filesystems that are
active. It's still much higher frequency than your "cat
/proc/sys/fs/inode-nr" example indicates it might be called.

So I'd say that by your reasoning, the choice of using the generic
percpu counters is the right one to make.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 03/19] fs: Convert nr_inodes and nr_unused to per-cpu counters
  2010-10-16 10:04     ` Dave Chinner
@ 2010-10-16 10:27       ` Eric Dumazet
  2010-10-16 17:26         ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Eric Dumazet @ 2010-10-16 10:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

Le samedi 16 octobre 2010 à 21:04 +1100, Dave Chinner a écrit :

> get_nr_inodes_unused() is called on every single shrinker call. i.e.
> for every 128 inodes we attempt to reclaim. Given that I'm seeing
> inode reclaim rate in the order of a million per second on a 8p box,
> that meets your criteria for using the generic percpu counter
> infrastructure.
> 
> Also, get_nr_inodes() is also called by get_nr_dirty_inodes(), which is
> called by the high level inode writeback code, so will typically be
> called in the order of tens of times per second, and the number of
> calls increased depending on the number of filesystems that are
> active. It's still much higher frequency than your "cat
> /proc/sys/fs/inode-nr" example indicates it might be called.
> 
> So I'd say that by your reasoning, the choice of using the generic
> percpu counters is the right one to make.

You missed one thing :

In cases you mention, you want a precise count (aka
percpu_counter_sum_positive()), not the approximate one (aka
percpu_counter_read_positive()).

The only difference is then the possible/online cpu loop difference.

I am saying :

No need to maintain an approximate counter if we dont _read_ it, ever.

Andrew answer : Eric, cant you implement this generically ?



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 01/19] fs: switch bdev inode bdi's correctly
  2010-10-16  8:13 ` [PATCH 01/19] fs: switch bdev inode bdi's correctly Dave Chinner
  2010-10-16  9:30   ` Nick Piggin
@ 2010-10-16 16:31   ` Christoph Hellwig
  1 sibling, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2010-10-16 16:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 07:13:55PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> bdev inodes can remain dirty even after their last close. Hence the
> BDI associated with the bdev->inode gets modified duringthe last
> close to point to the default BDI. However, the bdev inode still
> needs to be moved to the dirty lists of the new BDI, otherwise it
> will corrupt the writeback list is was left on.
> 
> Add a new function bdev_inode_switch_bdi() to move all the bdi state
> from the old bdi to the new one safely. This is only a temporary
> measure until the bdev inode<->bdi lifecycle problems are sorted
> out.

Ok for now, I really hope my progress on sorting out the bdi mess
will eventually get rid of this.

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 02/19] kernel: add bl_list
  2010-10-16  9:51   ` Nick Piggin
@ 2010-10-16 16:32     ` Christoph Hellwig
  0 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2010-10-16 16:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 08:51:50PM +1100, Nick Piggin wrote:
> I don't know why Christoph asked for these; he's usually against
> obfuscating wrapper functions.
> 
> bit_spin_lock is fine in callers so please leave it at that.
> Other users may not want a spinlock, or might want to use
> bit_spin_is_locked, bit_spin_trylock etc.

See the comments on the previous posting of this.  It provides a useful
abtraction.  And yes, eventually we might need more wrappers, but
we can add them as needed.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 19/19] fs: do not assign default i_ino in new_inode
  2010-10-16  9:09   ` Eric Dumazet
@ 2010-10-16 16:35     ` Christoph Hellwig
  2010-10-18  9:11       ` Eric Dumazet
  0 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2010-10-16 16:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 11:09:13AM +0200, Eric Dumazet wrote:
> I wonder if adding a flag in super_block to explicitely say :
> 
> "I dont need new_inode() allocates a i_ino for my new inode, because
> I'll take care of this myself later"
> 
> would be safer, permiting each fs maintainer to assert the flag instead
> of a single patch.

What's the point, really?  Assigning i_ino in new_inode always has been
an utterly stupid idea to start with.  I fixed the few filesystem that
need it to do it explicitly.  There's nothing unsafe about it - checking
callers of new_inode for manual i_ino assignment was trivial.

Conditional code like the one you suggested is simply evil - it
complicates things instead of simplifying them.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 04/19] fs: Implement lazy LRU updates for inodes.
  2010-10-16  9:29   ` Nick Piggin
@ 2010-10-16 16:59     ` Christoph Hellwig
  2010-10-16 17:29       ` Nick Piggin
  2010-10-17  1:53       ` Dave Chinner
  0 siblings, 2 replies; 59+ messages in thread
From: Christoph Hellwig @ 2010-10-16 16:59 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 08:29:16PM +1100, Nick Piggin wrote:
> On Sat, Oct 16, 2010 at 07:13:58PM +1100, Dave Chinner wrote:
> > @@ -502,11 +527,15 @@ static void prune_icache(int nr_to_scan)
> >  			iput(inode);
> >  			spin_lock(&inode_lock);
> >  
> > -			if (inode != list_entry(inode_unused.next,
> > -						struct inode, i_list))
> > -				continue;	/* wrong inode or list_empty */
> > -			if (!can_unuse(inode))
> > +			/*
> > +			 * if we can't reclaim this inode immediately, give it
> > +			 * another pass through the free list so we don't spin
> > +			 * on it.
> > +			 */
> > +			if (!can_unuse(inode)) {
> > +				list_move(&inode->i_list, &inode_unused);
> >  				continue;
> > +			}
> >  		}
> >  		list_move(&inode->i_list, &freeable);
> >  		WARN_ON(inode->i_state & I_NEW);
> 
> This is a bug, actually 2 bugs, which is why I omitted it in the version
> you picked up. I agree we want the optimisation though, so I've added it
> back in my tree.
> 
> After you iput() and then re take the inode lock, you can't reference
> the inode because you don't know what happened to it. You need to keep
> that pointer check to verify it is still there.

I don't think the pointer check will work either.  By the time we retake
the lru lock the inode might already have been reaped through a call
to invalidate_inodes.  There's no way we can do anything with it after
iput.  What we could do is using variant of can_unuse to decide to move
the inode to the front of the lru before doing the iput.  That way
we'll get to it next after retaking the lru lock if it's still there.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 03/19] fs: Convert nr_inodes and nr_unused to per-cpu counters
  2010-10-16 10:27       ` Eric Dumazet
@ 2010-10-16 17:26         ` Peter Zijlstra
  2010-10-17  1:09           ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2010-10-16 17:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Sat, 2010-10-16 at 12:27 +0200, Eric Dumazet wrote:
> 
> In cases you mention, you want a precise count (aka
> percpu_counter_sum_positive()), not the approximate one (aka
> percpu_counter_read_positive()).
> 
> The only difference is then the possible/online cpu loop difference.
> 
In either case, SGI is screwed, doing millions of for_each_*_cpu() loops
per second isn't going to work for them.

fwiw, for_each_*_cpu() takes longer than a single jiffy tick on those
machines.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 04/19] fs: Implement lazy LRU updates for inodes.
  2010-10-16 16:59     ` Christoph Hellwig
@ 2010-10-16 17:29       ` Nick Piggin
  2010-10-16 17:34         ` Nick Piggin
  2010-10-17  0:47         ` Christoph Hellwig
  2010-10-17  1:53       ` Dave Chinner
  1 sibling, 2 replies; 59+ messages in thread
From: Nick Piggin @ 2010-10-16 17:29 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 12:59:30PM -0400, Christoph Hellwig wrote:
> On Sat, Oct 16, 2010 at 08:29:16PM +1100, Nick Piggin wrote:
> > On Sat, Oct 16, 2010 at 07:13:58PM +1100, Dave Chinner wrote:
> > > @@ -502,11 +527,15 @@ static void prune_icache(int nr_to_scan)
> > >  			iput(inode);
> > >  			spin_lock(&inode_lock);
> > >  
> > > -			if (inode != list_entry(inode_unused.next,
> > > -						struct inode, i_list))
> > > -				continue;	/* wrong inode or list_empty */
> > > -			if (!can_unuse(inode))
> > > +			/*
> > > +			 * if we can't reclaim this inode immediately, give it
> > > +			 * another pass through the free list so we don't spin
> > > +			 * on it.
> > > +			 */
> > > +			if (!can_unuse(inode)) {
> > > +				list_move(&inode->i_list, &inode_unused);
> > >  				continue;
> > > +			}
> > >  		}
> > >  		list_move(&inode->i_list, &freeable);
> > >  		WARN_ON(inode->i_state & I_NEW);
> > 
> > This is a bug, actually 2 bugs, which is why I omitted it in the version
> > you picked up. I agree we want the optimisation though, so I've added it
> > back in my tree.
> > 
> > After you iput() and then re take the inode lock, you can't reference
> > the inode because you don't know what happened to it. You need to keep
> > that pointer check to verify it is still there.
> 
> I don't think the pointer check will work either.  By the time we retake
> the lru lock the inode might already have been reaped through a call
> to invalidate_inodes.  There's no way we can do anything with it after

I don't think you're right. If we re take inode_lock, ensure it is on
the LRU, and call the can_unuse checks, there is no more problem than
the regular loop taking items from the LRU, AFAIKS.

> iput.  What we could do is using variant of can_unuse to decide to move
> the inode to the front of the lru before doing the iput.  That way
> we'll get to it next after retaking the lru lock if it's still there.

This might actually be the better approach anyway (even for upstream)
-- it means we don't have to worry about the "check head element"
heuristic of the LRU check which could get false negatives if there is
a lot of concurrency on the LRU.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 04/19] fs: Implement lazy LRU updates for inodes.
  2010-10-16 17:29       ` Nick Piggin
@ 2010-10-16 17:34         ` Nick Piggin
  2010-10-17  0:47           ` Christoph Hellwig
  2010-10-17  0:47         ` Christoph Hellwig
  1 sibling, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2010-10-16 17:34 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 04:29:24AM +1100, Nick Piggin wrote:
> On Sat, Oct 16, 2010 at 12:59:30PM -0400, Christoph Hellwig wrote:
> > I don't think the pointer check will work either.  By the time we retake
> > the lru lock the inode might already have been reaped through a call
> > to invalidate_inodes.  There's no way we can do anything with it after
> 
> I don't think you're right. If we re take inode_lock, ensure it is on
> the LRU, and call the can_unuse checks, there is no more problem than
> the regular loop taking items from the LRU, AFAIKS.
> 
> > iput.  What we could do is using variant of can_unuse to decide to move
> > the inode to the front of the lru before doing the iput.  That way
> > we'll get to it next after retaking the lru lock if it's still there.
> 
> This might actually be the better approach anyway (even for upstream)
> -- it means we don't have to worry about the "check head element"
> heuristic of the LRU check which could get false negatives if there is
> a lot of concurrency on the LRU.

Oh hmm, but then you do have the double lock of the LRU lock.

if (can_unuse_after_iput(inode)) {
  spin_lock(&inode_lock);
  list_move(inode, list tail)
  spin_unlock(&inode_lock);
}
iput(inode);
spin_lock(&inode_lock);

Is that worth it?


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Inode Lock Scalability V4
  2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
                   ` (18 preceding siblings ...)
  2010-10-16  8:14 ` [PATCH 19/19] fs: do not assign default i_ino in new_inode Dave Chinner
@ 2010-10-16 17:55 ` Nick Piggin
  2010-10-17  2:47   ` Dave Chinner
  19 siblings, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2010-10-16 17:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 07:13:54PM +1100, Dave Chinner wrote:
> Version 2 of this series is a complete rework of the original patch
> series.  Nick's original code nested list locks inside the the
> inode->i_lock, resulting in a large mess of trylock operations to
> get locks out of order all over the place. In many cases, the reason
> fo this lock ordering is removed later on in Nick's series as
> cleanups are introduced.

I think you must misunderstand how my patchset is structured. It is not
that it was simply thrown together with cleanups piled on top of crap as
I discovered new ways to fix trylocks.

It is structured so that for the first steps, each (semi-)independent
aspect of inode_lock's concurrency protection is taken over by a new
lock. I've just made these locks dumb globals for simplicity at this
point.

Often, the required locking of these different aspects of the icache
overlap. Also inode_lock remains in place until the end. This makes lock
ordering get ugly.

But the point is that once we get past the trylocks and actually have
the locks held, it is relatively easy to demonstrate that the protection
provided is exactly what inode_lock provided.

After inode_lock is removed, I start to scale the locks properly and
introduce more complicated transforms to the code to improve the
locking. I really like how I split it up.

In your patch set you've basically pulled all these steps into the
inode_lock splitting itself. This is my first big objection.

Second I actually prefer how I cover all the icache state of an inode
with i_lock. You have avoided some coverage in the interest of reducing
lock ordering complexity, but as I have demonstrated RCU tends to work
as well for this too (and is mostly all solved in the second half of my
series).


> This patch set is just the basic inode_lock breakup patches plus a
> few more simple changes to the inode code. It stops short of
> introducing RCU inode freeing because those changes are not
> completely baked yet.

It also doesn't contain per-zone locking and lrus, or scalability of
superblock list locking.

And while the rcu-walk path walking is not fully baked, it has been
reviewed by Linus and is in pretty good shape. So I prefer to utilise
RCU locking here too, seeing as we know it will go in.

I much prefer to fix all of the problematic global locks in icache up
front and do the icache merge in a single release so we don't have
locking changes there stringing out over several releases.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 04/19] fs: Implement lazy LRU updates for inodes.
  2010-10-16 17:29       ` Nick Piggin
  2010-10-16 17:34         ` Nick Piggin
@ 2010-10-17  0:47         ` Christoph Hellwig
  2010-10-17  2:09           ` Nick Piggin
  1 sibling, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2010-10-17  0:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 04:29:24AM +1100, Nick Piggin wrote:
> > I don't think the pointer check will work either.  By the time we retake
> > the lru lock the inode might already have been reaped through a call
> > to invalidate_inodes.  There's no way we can do anything with it after
> 
> I don't think you're right. If we re take inode_lock, ensure it is on
> the LRU, and call the can_unuse checks, there is no more problem than
> the regular loop taking items from the LRU, AFAIKS.

As long as we have the global inode lock it should indeed be safe.
But once we have a separate lru lock (global or per-zone, with or
without i_lock during the addition) there is nothing preventing the
inode from getting reused and re-added to the lru in the meantime.
Sure this is an extremly unlikely case, but there is no locking to
prevent it once inode_lock is gone.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 04/19] fs: Implement lazy LRU updates for inodes.
  2010-10-16 17:34         ` Nick Piggin
@ 2010-10-17  0:47           ` Christoph Hellwig
  0 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2010-10-17  0:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 04:34:22AM +1100, Nick Piggin wrote:
> > This might actually be the better approach anyway (even for upstream)
> > -- it means we don't have to worry about the "check head element"
> > heuristic of the LRU check which could get false negatives if there is
> > a lot of concurrency on the LRU.
> 
> Oh hmm, but then you do have the double lock of the LRU lock.
> 
> if (can_unuse_after_iput(inode)) {
>   spin_lock(&inode_lock);
>   list_move(inode, list tail)
>   spin_unlock(&inode_lock);
> }
> iput(inode);
> spin_lock(&inode_lock);
> 
> Is that worth it?

Probably not.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 03/19] fs: Convert nr_inodes and nr_unused to per-cpu counters
  2010-10-16 17:26         ` Peter Zijlstra
@ 2010-10-17  1:09           ` Dave Chinner
  2010-10-17  1:12             ` Christoph Hellwig
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2010-10-17  1:09 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Eric Dumazet, linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 07:26:48PM +0200, Peter Zijlstra wrote:
> On Sat, 2010-10-16 at 12:27 +0200, Eric Dumazet wrote:
> > 
> > In cases you mention, you want a precise count (aka
> > percpu_counter_sum_positive()), not the approximate one (aka
> > percpu_counter_read_positive()).
> > 
> > The only difference is then the possible/online cpu loop difference.
> > 
> In either case, SGI is screwed, doing millions of for_each_*_cpu() loops
> per second isn't going to work for them.
> 
> fwiw, for_each_*_cpu() takes longer than a single jiffy tick on those
> machines.

Yes, agreed. I'm not sure we need exact summation for these counters,
but I haven't wanted to bring inaccuracies into the code at this
point in time. I need to investigate the effect of using the
approximate summation values in all the cases they are used.

I don't see that there will be a problem doing this, but it's not
something that needs to be cared about right now. After we get to
the store-free path walks in place, we can revisit all these side
issues more fully.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 03/19] fs: Convert nr_inodes and nr_unused to per-cpu counters
  2010-10-17  1:09           ` Dave Chinner
@ 2010-10-17  1:12             ` Christoph Hellwig
  2010-10-17  2:16               ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2010-10-17  1:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Peter Zijlstra, Eric Dumazet, linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 12:09:44PM +1100, Dave Chinner wrote:
> > fwiw, for_each_*_cpu() takes longer than a single jiffy tick on those
> > machines.
> 
> Yes, agreed. I'm not sure we need exact summation for these counters,
> but I haven't wanted to bring inaccuracies into the code at this
> point in time. I need to investigate the effect of using the
> approximate summation values in all the cases they are used.

Use of the dirty inodes numbers in the writeback code is something
that does not make much sense.  It was added as an undocumented
workaround somewhere in the old writeback code, and spread to even
more sites over time.  I'm pretty sure we don't actually need it,
but I'm not quite sure what we actually need.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal
  2010-10-16  8:14 ` [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal Dave Chinner
@ 2010-10-17  1:30   ` Christoph Hellwig
  2010-10-17  2:49     ` Nick Piggin
  0 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2010-10-17  1:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

>   * inode->i_lock is *always* the innermost lock.
>   *
> + * inode->i_lock is *always* the innermost lock.
> + *

No need to repeat, we got it..

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 04/19] fs: Implement lazy LRU updates for inodes.
  2010-10-16 16:59     ` Christoph Hellwig
  2010-10-16 17:29       ` Nick Piggin
@ 2010-10-17  1:53       ` Dave Chinner
  1 sibling, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-17  1:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 12:59:30PM -0400, Christoph Hellwig wrote:
> On Sat, Oct 16, 2010 at 08:29:16PM +1100, Nick Piggin wrote:
> > On Sat, Oct 16, 2010 at 07:13:58PM +1100, Dave Chinner wrote:
> > > @@ -502,11 +527,15 @@ static void prune_icache(int nr_to_scan)
> > >  			iput(inode);
> > >  			spin_lock(&inode_lock);
> > >  
> > > -			if (inode != list_entry(inode_unused.next,
> > > -						struct inode, i_list))
> > > -				continue;	/* wrong inode or list_empty */
> > > -			if (!can_unuse(inode))
> > > +			/*
> > > +			 * if we can't reclaim this inode immediately, give it
> > > +			 * another pass through the free list so we don't spin
> > > +			 * on it.
> > > +			 */
> > > +			if (!can_unuse(inode)) {
> > > +				list_move(&inode->i_list, &inode_unused);
> > >  				continue;
> > > +			}
> > >  		}
> > >  		list_move(&inode->i_list, &freeable);
> > >  		WARN_ON(inode->i_state & I_NEW);
> > 
> > This is a bug, actually 2 bugs, which is why I omitted it in the version
> > you picked up. I agree we want the optimisation though, so I've added it
> > back in my tree.
> > 
> > After you iput() and then re take the inode lock, you can't reference
> > the inode because you don't know what happened to it. You need to keep
> > that pointer check to verify it is still there.
> 
> I don't think the pointer check will work either.  By the time we retake
> the lru lock the inode might already have been reaped through a call
> to invalidate_inodes.  There's no way we can do anything with it after
> iput.  What we could do is using variant of can_unuse to decide to move
> the inode to the front of the lru before doing the iput.  That way
> we'll get to it next after retaking the lru lock if it's still there.

But as Nick pointed out I_REFERENCED will be set, and so it'll move
to the back of the list next time around it we leave it on the list.
If we race with another reclaimer, it'll see a reference count, and
move it to the back of the list without clearing the I_REFERENCED
flag we set.

IOWs, there does not seem to any point to me in keeping the can_unuse()
optimisation. All it really is doing is repeating the behaviour that
will occur if we leave the inode where it is on the list and run the
loop again. If we really care that it counts as two scanned items
now, we can add a scan count back in after the iput() call....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 04/19] fs: Implement lazy LRU updates for inodes.
  2010-10-17  0:47         ` Christoph Hellwig
@ 2010-10-17  2:09           ` Nick Piggin
  0 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2010-10-17  2:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 08:47:10PM -0400, Christoph Hellwig wrote:
> On Sun, Oct 17, 2010 at 04:29:24AM +1100, Nick Piggin wrote:
> > > I don't think the pointer check will work either.  By the time we retake
> > > the lru lock the inode might already have been reaped through a call
> > > to invalidate_inodes.  There's no way we can do anything with it after
> > 
> > I don't think you're right. If we re take inode_lock, ensure it is on
> > the LRU, and call the can_unuse checks, there is no more problem than
> > the regular loop taking items from the LRU, AFAIKS.
> 
> As long as we have the global inode lock it should indeed be safe.
> But once we have a separate lru lock (global or per-zone, with or
> without i_lock during the addition) there is nothing preventing the
> inode from getting reused and re-added to the lru in the meantime.
> Sure this is an extremly unlikely case, but there is no locking to
> prevent it once inode_lock is gone.

No. There is nothing preventing that exact reuse from happening in
mainline _today_ either, because the inode_lock is dropped there too.

The point is that it is a heuristic that works (apparently) most of the
time but if it gets it wrong then it's not a big deal, it's only the LRU
position anyway. It would work exactly the same with separate global or
per-zone lru locks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 03/19] fs: Convert nr_inodes and nr_unused to per-cpu counters
  2010-10-17  1:12             ` Christoph Hellwig
@ 2010-10-17  2:16               ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-17  2:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Peter Zijlstra, Eric Dumazet, linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 09:12:04PM -0400, Christoph Hellwig wrote:
> On Sun, Oct 17, 2010 at 12:09:44PM +1100, Dave Chinner wrote:
> > > fwiw, for_each_*_cpu() takes longer than a single jiffy tick on those
> > > machines.
> > 
> > Yes, agreed. I'm not sure we need exact summation for these counters,
> > but I haven't wanted to bring inaccuracies into the code at this
> > point in time. I need to investigate the effect of using the
> > approximate summation values in all the cases they are used.
> 
> Use of the dirty inodes numbers in the writeback code is something
> that does not make much sense.  It was added as an undocumented
> workaround somewhere in the old writeback code, and spread to even
> more sites over time.  I'm pretty sure we don't actually need it,
> but I'm not quite sure what we actually need.

IIRC, it was needed so that we trigger writeback on purely dirty
inodes. i.e. inodes that have no dirty pages and so wbc->nr_to_write
could be changed and not have background writeback abort because
either nr_to_write was zero or wbc->nr_to_write did not change...

A hack, yes, and one that definitely needs to be revisited outside
the scope of this patch set.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Inode Lock Scalability V4
  2010-10-16 17:55 ` Inode Lock Scalability V4 Nick Piggin
@ 2010-10-17  2:47   ` Dave Chinner
  2010-10-17  2:55     ` Nick Piggin
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2010-10-17  2:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 04:55:15AM +1100, Nick Piggin wrote:
> On Sat, Oct 16, 2010 at 07:13:54PM +1100, Dave Chinner wrote:
> > This patch set is just the basic inode_lock breakup patches plus a
> > few more simple changes to the inode code. It stops short of
> > introducing RCU inode freeing because those changes are not
> > completely baked yet.
> 
> It also doesn't contain per-zone locking and lrus, or scalability of
> superblock list locking.

Sure - that's all explained in the description of what the series
actually contains later on. 

> And while the rcu-walk path walking is not fully baked, it has been
> reviewed by Linus and is in pretty good shape. So I prefer to utilise
> RCU locking here too, seeing as we know it will go in.

I deliberately left out the RCU changes as we know that the version
that is in your tree causes siginificant performance regressions for
single threaded and some parallel workloads on small (<=8p)
machines.  There is more development needed there so, IMO, it has
never been a candidate for this series which is aimed directly at
.37 inclusion.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal
  2010-10-17  1:30   ` Christoph Hellwig
@ 2010-10-17  2:49     ` Nick Piggin
  2010-10-17  4:13       ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2010-10-17  2:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Sat, Oct 16, 2010 at 09:30:47PM -0400, Christoph Hellwig wrote:
> >   * inode->i_lock is *always* the innermost lock.
> >   *
> > + * inode->i_lock is *always* the innermost lock.
> > + *
> 
> No need to repeat, we got it..

Except that I didn't see where you fixed all the places where it is
*not* the innermost lock. Like for example places that take dcache_lock
inside i_lock.

Really, the assertions that my series is causing the world to end
because it makes i_lock no longer the innermost lock (not that it is
anyway) etc. is just not constructive in the slightest.

A lock is a lock. If we take another lock inside it, it is not an
innermost lock. If we do it properly and don't introduce deadlocks, it
is not a bug.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Inode Lock Scalability V4
  2010-10-17  2:47   ` Dave Chinner
@ 2010-10-17  2:55     ` Nick Piggin
  2010-10-17  2:57       ` Nick Piggin
  2010-10-17  6:10       ` Dave Chinner
  0 siblings, 2 replies; 59+ messages in thread
From: Nick Piggin @ 2010-10-17  2:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 01:47:59PM +1100, Dave Chinner wrote:
> On Sun, Oct 17, 2010 at 04:55:15AM +1100, Nick Piggin wrote:
> > On Sat, Oct 16, 2010 at 07:13:54PM +1100, Dave Chinner wrote:
> > > This patch set is just the basic inode_lock breakup patches plus a
> > > few more simple changes to the inode code. It stops short of
> > > introducing RCU inode freeing because those changes are not
> > > completely baked yet.
> > 
> > It also doesn't contain per-zone locking and lrus, or scalability of
> > superblock list locking.
> 
> Sure - that's all explained in the description of what the series
> actually contains later on. 
> 
> > And while the rcu-walk path walking is not fully baked, it has been
> > reviewed by Linus and is in pretty good shape. So I prefer to utilise
> > RCU locking here too, seeing as we know it will go in.
> 
> I deliberately left out the RCU changes as we know that the version
> that is in your tree causes siginificant performance regressions for
> single threaded and some parallel workloads on small (<=8p)
> machines.

The worst-case microbenchmark is not a "significant performance
regression". It is a worst case demonstration. With the parallel
workloads, are you referring to your postmark xfs workload? It was
actually due to lazy LRU, IIRC. I didn't think RCU overhead was
noticable there actually.

Anyway, I've already gone over this couple of months ago when we
were discussing it. We know it could cause some small regressions,
if they are small it is considered acceptable and outweighed
greatly by fastpath speedup. And I have a design to do slab RCU
which can be used if regressions are large. Linus signed off on
this, in fact. Why weren't you debating it then?


>  There is more development needed there so, IMO, it has
> never been a candidate for this series which is aimed directly at
> .37 inclusion.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Inode Lock Scalability V4
  2010-10-17  2:55     ` Nick Piggin
@ 2010-10-17  2:57       ` Nick Piggin
  2010-10-17  6:10       ` Dave Chinner
  1 sibling, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2010-10-17  2:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 01:55:33PM +1100, Nick Piggin wrote:
> On Sun, Oct 17, 2010 at 01:47:59PM +1100, Dave Chinner wrote:
> > On Sun, Oct 17, 2010 at 04:55:15AM +1100, Nick Piggin wrote:
> > > On Sat, Oct 16, 2010 at 07:13:54PM +1100, Dave Chinner wrote:
> > > > This patch set is just the basic inode_lock breakup patches plus a
> > > > few more simple changes to the inode code. It stops short of
> > > > introducing RCU inode freeing because those changes are not
> > > > completely baked yet.
> > > 
> > > It also doesn't contain per-zone locking and lrus, or scalability of
> > > superblock list locking.
> > 
> > Sure - that's all explained in the description of what the series
> > actually contains later on. 
> > 
> > > And while the rcu-walk path walking is not fully baked, it has been
> > > reviewed by Linus and is in pretty good shape. So I prefer to utilise
> > > RCU locking here too, seeing as we know it will go in.
> > 
> > I deliberately left out the RCU changes as we know that the version
> > that is in your tree causes siginificant performance regressions for
> > single threaded and some parallel workloads on small (<=8p)
> > machines.
> 
> The worst-case microbenchmark is not a "significant performance
> regression". It is a worst case demonstration. With the parallel
> workloads, are you referring to your postmark xfs workload? It was
> actually due to lazy LRU, IIRC. I didn't think RCU overhead was
> noticable there actually.
> 
> Anyway, I've already gone over this couple of months ago when we
> were discussing it. We know it could cause some small regressions,
> if they are small it is considered acceptable and outweighed
> greatly by fastpath speedup. And I have a design to do slab RCU
> which can be used if regressions are large. Linus signed off on
> this, in fact. Why weren't you debating it then?

It goes to the heart of the locking model, and I think it is silly
just to do this and then go and rewrite locking a release or two later
when rcu is introduced. And change it again when you start listening
to the people who want per-zone lrus.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal
  2010-10-17  2:49     ` Nick Piggin
@ 2010-10-17  4:13       ` Dave Chinner
  2010-10-17  4:35         ` Nick Piggin
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2010-10-17  4:13 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 01:49:23PM +1100, Nick Piggin wrote:
> On Sat, Oct 16, 2010 at 09:30:47PM -0400, Christoph Hellwig wrote:
> > >   * inode->i_lock is *always* the innermost lock.
> > >   *
> > > + * inode->i_lock is *always* the innermost lock.
> > > + *
> > 
> > No need to repeat, we got it..
> 
> Except that I didn't see where you fixed all the places where it is
> *not* the innermost lock. Like for example places that take dcache_lock
> inside i_lock.

I can't find any code outside of ceph where the dcache_lock is used
within 200 lines of code of the inode->i_lock. The ceph code is not
nesting them, though. And AFAICT, the i_lock is not used at all in the
dentry code.

So I must be missing something if this is occurring - can you point
out where this lock ordering is occurring in the mainline code?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal
  2010-10-17  4:13       ` Dave Chinner
@ 2010-10-17  4:35         ` Nick Piggin
  2010-10-17  5:13           ` Nick Piggin
  0 siblings, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2010-10-17  4:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 03:13:13PM +1100, Dave Chinner wrote:
> On Sun, Oct 17, 2010 at 01:49:23PM +1100, Nick Piggin wrote:
> > On Sat, Oct 16, 2010 at 09:30:47PM -0400, Christoph Hellwig wrote:
> > > >   * inode->i_lock is *always* the innermost lock.
> > > >   *
> > > > + * inode->i_lock is *always* the innermost lock.
> > > > + *
> > > 
> > > No need to repeat, we got it..
> > 
> > Except that I didn't see where you fixed all the places where it is
> > *not* the innermost lock. Like for example places that take dcache_lock
> > inside i_lock.
> 
> I can't find any code outside of ceph where the dcache_lock is used
> within 200 lines of code of the inode->i_lock. The ceph code is not
> nesting them, though.

You mustn't have looked very hard? From ceph:

       spin_unlock(&dcache_lock);
       spin_unlock(&inode->i_lock);

(and yes, acquisition side does go in i_lock->dcache_lock order)

If you want to start mandating that i_lock *must* be an inner lock,
please establish the lock order, fix existing code that makes use
of nesting it, and provide a justification to merge the patch.

If the justification is compelling and it gets merged, I can happily
use another lock in the inode for my icache state lock. So please
don't make these off hand insinuations about how that makes your code
better and mine worse. If it is better, let's see a proper
justification and upstream patch, and the issue is really not
central to my locking design.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal
  2010-10-17  4:35         ` Nick Piggin
@ 2010-10-17  5:13           ` Nick Piggin
  2010-10-17  6:52             ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2010-10-17  5:13 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 03:35:14PM +1100, Nick Piggin wrote:
> On Sun, Oct 17, 2010 at 03:13:13PM +1100, Dave Chinner wrote:
> > On Sun, Oct 17, 2010 at 01:49:23PM +1100, Nick Piggin wrote:
> > > On Sat, Oct 16, 2010 at 09:30:47PM -0400, Christoph Hellwig wrote:
> > > > >   * inode->i_lock is *always* the innermost lock.
> > > > >   *
> > > > > + * inode->i_lock is *always* the innermost lock.
> > > > > + *
> > > > 
> > > > No need to repeat, we got it..
> > > 
> > > Except that I didn't see where you fixed all the places where it is
> > > *not* the innermost lock. Like for example places that take dcache_lock
> > > inside i_lock.
> > 
> > I can't find any code outside of ceph where the dcache_lock is used
> > within 200 lines of code of the inode->i_lock. The ceph code is not
> > nesting them, though.
> 
> You mustn't have looked very hard? From ceph:
> 
>        spin_unlock(&dcache_lock);
>        spin_unlock(&inode->i_lock);
> 
> (and yes, acquisition side does go in i_lock->dcache_lock order)

A really quick grep reveals cifs is using GlobalSMBSeslock inside i_lock
too.

Everything uses i_size seqlock inside it, but I guess that's *always*
the inner innermost lock so it doesn't really count.
 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Inode Lock Scalability V4
  2010-10-17  2:55     ` Nick Piggin
  2010-10-17  2:57       ` Nick Piggin
@ 2010-10-17  6:10       ` Dave Chinner
  2010-10-17  6:34         ` Nick Piggin
  1 sibling, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2010-10-17  6:10 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 01:55:33PM +1100, Nick Piggin wrote:
> On Sun, Oct 17, 2010 at 01:47:59PM +1100, Dave Chinner wrote:
> > On Sun, Oct 17, 2010 at 04:55:15AM +1100, Nick Piggin wrote:
> > > On Sat, Oct 16, 2010 at 07:13:54PM +1100, Dave Chinner wrote:
> > > > This patch set is just the basic inode_lock breakup patches plus a
> > > > few more simple changes to the inode code. It stops short of
> > > > introducing RCU inode freeing because those changes are not
> > > > completely baked yet.
> > > 
> > > It also doesn't contain per-zone locking and lrus, or scalability of
> > > superblock list locking.
> > 
> > Sure - that's all explained in the description of what the series
> > actually contains later on. 
> > 
> > > And while the rcu-walk path walking is not fully baked, it has been
> > > reviewed by Linus and is in pretty good shape. So I prefer to utilise
> > > RCU locking here too, seeing as we know it will go in.
> > 
> > I deliberately left out the RCU changes as we know that the version
> > that is in your tree causes siginificant performance regressions for
> > single threaded and some parallel workloads on small (<=8p)
> > machines.
> 
> The worst-case microbenchmark is not a "significant performance
> regression". It is a worst case demonstration. With the parallel
> workloads, are you referring to your postmark xfs workload? It was
> actually due to lazy LRU, IIRC.

Actually, I wasn't refering to the regressions I reported from
fs_mark runs on XFS - I was refering to your "worse case
demonstration" numbers and the comments made during the discussion
that followed.  It wasn't clear to me what the plan was to use
SLAB_DESTROY_BY_RCU or not and the commit messages didn't help,
so I left it out because I was not about to bite off more than I
could chew for .37.

As it is, the lazy LRU code doesn't appear to cause any fs_mark
performance regressions in the testing I've done of my series on
either ext4 or XFS. Hence I don't think that was the cause of any of
the performance problems I originally measured using fs_mark.

And you are right that it wasn't RCU overhead, because....

> I didn't think RCU overhead was noticable there actually.

.... I later noticed you never converted the XFS inode cache to use
RCU inode freeing. Which means that none of the RCU tree walks
are actually protected by RCU when XFS is used with your tree.
Maybe that was causing problems.

But if it's not RCU freeing (or lack thereof) or lazy LRU, it's one
of the other scalability patches that I left out of my series that
was causing the problem.

> Anyway, I've already gone over this couple of months ago when we
> were discussing it. We know it could cause some small regressions,
> if they are small it is considered acceptable and outweighed
> greatly by fastpath speedup. And I have a design to do slab RCU
> which can be used if regressions are large. Linus signed off on
> this, in fact. Why weren't you debating it then?

I try not to debate stuff I don't understand or have no information
about. That discussion is where I first learnt about the existence
of SLAB_DESTROY_BY_RCU. Clueless is not a great position to start
from in a discussion with Linus...

Anyway, that is ancient history. Now I've got patches to convert the
XFS inode cache to use RCU freeing via SLAB_DESTROY_BY_RCU thanks to
what I learnt from that discussion.  The patches don't show any
performance degradation at up to 16p in the benchmarking I've done
so far when combined with the the inode-scale series and the .37 XFS
queue. Hence I think XFS will be ready to go for RCU freed inodes in
.38 regardless of whether the VFS gets there or not.

And as a result of XFS being able to implement this functionality
independently of the VFS, I'm completely ambivialent as to how the
VFS goes about implementing RCU inode freeing. If the VFS
maintainers want to go straight to using SLAB_DESTROY_BY_RCU to
minimise the worst case overhead, then that's what I'll to do...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Inode Lock Scalability V4
  2010-10-17  6:10       ` Dave Chinner
@ 2010-10-17  6:34         ` Nick Piggin
  0 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2010-10-17  6:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 05:10:42PM +1100, Dave Chinner wrote:
> On Sun, Oct 17, 2010 at 01:55:33PM +1100, Nick Piggin wrote:
> > On Sun, Oct 17, 2010 at 01:47:59PM +1100, Dave Chinner wrote:
> > > On Sun, Oct 17, 2010 at 04:55:15AM +1100, Nick Piggin wrote:
> > > > On Sat, Oct 16, 2010 at 07:13:54PM +1100, Dave Chinner wrote:
> > > > > This patch set is just the basic inode_lock breakup patches plus a
> > > > > few more simple changes to the inode code. It stops short of
> > > > > introducing RCU inode freeing because those changes are not
> > > > > completely baked yet.
> > > > 
> > > > It also doesn't contain per-zone locking and lrus, or scalability of
> > > > superblock list locking.
> > > 
> > > Sure - that's all explained in the description of what the series
> > > actually contains later on. 
> > > 
> > > > And while the rcu-walk path walking is not fully baked, it has been
> > > > reviewed by Linus and is in pretty good shape. So I prefer to utilise
> > > > RCU locking here too, seeing as we know it will go in.
> > > 
> > > I deliberately left out the RCU changes as we know that the version
> > > that is in your tree causes siginificant performance regressions for
> > > single threaded and some parallel workloads on small (<=8p)
> > > machines.
> > 
> > The worst-case microbenchmark is not a "significant performance
> > regression". It is a worst case demonstration. With the parallel
> > workloads, are you referring to your postmark xfs workload? It was
> > actually due to lazy LRU, IIRC.
> 
> Actually, I wasn't refering to the regressions I reported from
> fs_mark runs on XFS - I was refering to your "worse case
> demonstration" numbers and the comments made during the discussion
> that followed.  It wasn't clear to me what the plan was to use

OK, so why do you keep saying RCU changes aren't agreed on? There was a
lot of discussion about this an we reached agreement.  Really, you blame
me for delaying things and keep obstructing things yourself about things
that have already been discussed and you haven't followed.

And Christoph with his endless rubbish about not doing scalability
changes because he has "ideas" about changing the global hashes, and
then telling me I'm deliberately delaying things.  Of course never
actually offering up any real substance or trying to address my points
that I reply with _every_ single time he drops this "oh let's wait and I
have some ideas about the hash" into the argument.

It's really rude.


> SLAB_DESTROY_BY_RCU or not and the commit messages didn't help,
> so I left it out because I was not about to bite off more than I
> could chew for .37.

I don't understand. It's already bitten off. It's already there.
Linus was ready to pull the whole thing in fact, but I wanted
to wait to get more people on board.

I've also raised a lot of concerns about how your series is structured,
how you want to merge it, the locking design, and how it will delay
things further and just require all the same testing from the same
people I have been asking.


> As it is, the lazy LRU code doesn't appear to cause any fs_mark
> performance regressions in the testing I've done of my series on
> either ext4 or XFS.

I think it does, with the double movement of the inodes in that
reclaim path. I wasn't able to reproduce your results exactly,
however I did see some slowdowns with that in inode intensive
workloads.


> Hence I don't think that was the cause of any of
> the performance problems I originally measured using fs_mark.
> 
> And you are right that it wasn't RCU overhead, because....

Yep.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal
  2010-10-17  5:13           ` Nick Piggin
@ 2010-10-17  6:52             ` Dave Chinner
  2010-10-17  7:05               ` Nick Piggin
  2010-10-18 21:27               ` Sage Weil
  0 siblings, 2 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-17  6:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 04:13:10PM +1100, Nick Piggin wrote:
> On Sun, Oct 17, 2010 at 03:35:14PM +1100, Nick Piggin wrote:
> > On Sun, Oct 17, 2010 at 03:13:13PM +1100, Dave Chinner wrote:
> > > On Sun, Oct 17, 2010 at 01:49:23PM +1100, Nick Piggin wrote:
> > > > On Sat, Oct 16, 2010 at 09:30:47PM -0400, Christoph Hellwig wrote:
> > > > > >   * inode->i_lock is *always* the innermost lock.
> > > > > >   *
> > > > > > + * inode->i_lock is *always* the innermost lock.
> > > > > > + *
> > > > > 
> > > > > No need to repeat, we got it..
> > > > 
> > > > Except that I didn't see where you fixed all the places where it is
> > > > *not* the innermost lock. Like for example places that take dcache_lock
> > > > inside i_lock.
> > > 
> > > I can't find any code outside of ceph where the dcache_lock is used
> > > within 200 lines of code of the inode->i_lock. The ceph code is not
> > > nesting them, though.
> > 
> > You mustn't have looked very hard? From ceph:
> > 
> >        spin_unlock(&dcache_lock);
> >        spin_unlock(&inode->i_lock);
> > 
> > (and yes, acquisition side does go in i_lock->dcache_lock order)

Sorry, easy to miss with a quick grep when the locks are taken in
different functions.

Anyway, this one looks difficult to fix without knowing something
about Ceph and wtf it is doing there. It's one to punt to the
maintainer to solve as it's not critical to this patch set.

> A really quick grep reveals cifs is using GlobalSMBSeslock inside i_lock
> too.

I'm having a grep-fail day. Where is that one?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal
  2010-10-17  6:52             ` Dave Chinner
@ 2010-10-17  7:05               ` Nick Piggin
  2010-10-17 23:39                 ` Dave Chinner
  2010-10-18 21:27               ` Sage Weil
  1 sibling, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2010-10-17  7:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 05:52:45PM +1100, Dave Chinner wrote:
> On Sun, Oct 17, 2010 at 04:13:10PM +1100, Nick Piggin wrote:
> > On Sun, Oct 17, 2010 at 03:35:14PM +1100, Nick Piggin wrote:
> > > On Sun, Oct 17, 2010 at 03:13:13PM +1100, Dave Chinner wrote:
> > > > On Sun, Oct 17, 2010 at 01:49:23PM +1100, Nick Piggin wrote:
> > > > > On Sat, Oct 16, 2010 at 09:30:47PM -0400, Christoph Hellwig wrote:
> > > > > > >   * inode->i_lock is *always* the innermost lock.
> > > > > > >   *
> > > > > > > + * inode->i_lock is *always* the innermost lock.
> > > > > > > + *
> > > > > > 
> > > > > > No need to repeat, we got it..
> > > > > 
> > > > > Except that I didn't see where you fixed all the places where it is
> > > > > *not* the innermost lock. Like for example places that take dcache_lock
> > > > > inside i_lock.
> > > > 
> > > > I can't find any code outside of ceph where the dcache_lock is used
> > > > within 200 lines of code of the inode->i_lock. The ceph code is not
> > > > nesting them, though.
> > > 
> > > You mustn't have looked very hard? From ceph:
> > > 
> > >        spin_unlock(&dcache_lock);
> > >        spin_unlock(&inode->i_lock);
> > > 
> > > (and yes, acquisition side does go in i_lock->dcache_lock order)
> 
> Sorry, easy to miss with a quick grep when the locks are taken in
> different functions.

Easy to see they're nested when they're dropped in adjacent lines. That
should give you a clue to go and check their lock order.
 
 
> Anyway, this one looks difficult to fix without knowing something
> about Ceph and wtf it is doing there. It's one to punt to the
> maintainer to solve as it's not critical to this patch set.

I thought the raison detre for your starting to write your own vfs
scale branch was because you objected to i_lock not being an "innermost"
lock (not that it was before my patch).

So I don't get it. If your patch mandates it to be an innermost lock,
then you absolutely do need to fix the filesystems before changing the
lock order.

 
> > A really quick grep reveals cifs is using GlobalSMBSeslock inside i_lock
> > too.
> 
> I'm having a grep-fail day. Where is that one?

Uh, inside one of the 6 places that i_lock is taken in cifs. The only
non-trivial one, not surprisingly.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal
  2010-10-17  7:05               ` Nick Piggin
@ 2010-10-17 23:39                 ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2010-10-17 23:39 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

On Sun, Oct 17, 2010 at 06:05:19PM +1100, Nick Piggin wrote:
> On Sun, Oct 17, 2010 at 05:52:45PM +1100, Dave Chinner wrote:
> > > A really quick grep reveals cifs is using GlobalSMBSeslock inside i_lock
> > > too.
> > 
> > I'm having a grep-fail day. Where is that one?
> 
> Uh, inside one of the 6 places that i_lock is taken in cifs. The only
> non-trivial one, not surprisingly.

It's two function calls away from the use of i_lock, in a different file.
A quick grep on a warm, sunny Sunday afternoon between walking the
dog and heading down to the pub for a beer and dinner isn't going to
find that. :/

Anyway, this one looks fairly easy to fix as the code that locks
GlobalSMBSeslock is only called from this location. I'll code up a
patch to fix it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 19/19] fs: do not assign default i_ino in new_inode
  2010-10-16 16:35     ` Christoph Hellwig
@ 2010-10-18  9:11       ` Eric Dumazet
  2010-10-18 14:48         ` Christoph Hellwig
  0 siblings, 1 reply; 59+ messages in thread
From: Eric Dumazet @ 2010-10-18  9:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

Le samedi 16 octobre 2010 à 12:35 -0400, Christoph Hellwig a écrit :

> What's the point, really?  Assigning i_ino in new_inode always has been
> an utterly stupid idea to start with.  I fixed the few filesystem that
> need it to do it explicitly.  There's nothing unsafe about it - checking
> callers of new_inode for manual i_ino assignment was trivial.
> 
> Conditional code like the one you suggested is simply evil - it
> complicates things instead of simplifying them.

This is what we call code factorization. I wont call it evil.

If we want to change get_next_ino(void) implementation to 
get_next_ino(struct inode *), or even get_next_inode(struct inode
*inode, struct super_block *sb)
 then we must go through all fs after your patch to add the new
parameter.

I proposed an implementation on get_next_ino() on 32bit arches, with no
per_cpu and shared counter, assuming we know the inode pointer. With
your patch, it become very difficult to implement such an idea.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 19/19] fs: do not assign default i_ino in new_inode
  2010-10-18  9:11       ` Eric Dumazet
@ 2010-10-18 14:48         ` Christoph Hellwig
  0 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2010-10-18 14:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel

On Mon, Oct 18, 2010 at 11:11:39AM +0200, Eric Dumazet wrote:
> This is what we call code factorization. I wont call it evil.
> 
> If we want to change get_next_ino(void) implementation to 
> get_next_ino(struct inode *), or even get_next_inode(struct inode
> *inode, struct super_block *sb)
>  then we must go through all fs after your patch to add the new
> parameter.
> 
> I proposed an implementation on get_next_ino() on 32bit arches, with no
> per_cpu and shared counter, assuming we know the inode pointer. With
> your patch, it become very difficult to implement such an idea.

Why?  You now have all call sites that care and can trivially change
them to whatever argument they need.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal
  2010-10-17  6:52             ` Dave Chinner
  2010-10-17  7:05               ` Nick Piggin
@ 2010-10-18 21:27               ` Sage Weil
  2010-10-19  3:54                 ` Nick Piggin
  1 sibling, 1 reply; 59+ messages in thread
From: Sage Weil @ 2010-10-18 21:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, Christoph Hellwig, linux-fsdevel, linux-kernel

Hi guys,

I'm a bit late the party, but:

On Sun, 17 Oct 2010, Dave Chinner wrote:
> On Sun, Oct 17, 2010 at 04:13:10PM +1100, Nick Piggin wrote:
> > On Sun, Oct 17, 2010 at 03:35:14PM +1100, Nick Piggin wrote:
> > > On Sun, Oct 17, 2010 at 03:13:13PM +1100, Dave Chinner wrote:
> > > > On Sun, Oct 17, 2010 at 01:49:23PM +1100, Nick Piggin wrote:
> > > > > On Sat, Oct 16, 2010 at 09:30:47PM -0400, Christoph Hellwig wrote:
> > > > > > >   * inode->i_lock is *always* the innermost lock.
> > > > > > >   *
> > > > > > > + * inode->i_lock is *always* the innermost lock.
> > > > > > > + *
> > > > > > 
> > > > > > No need to repeat, we got it..
> > > > > 
> > > > > Except that I didn't see where you fixed all the places where it is
> > > > > *not* the innermost lock. Like for example places that take dcache_lock
> > > > > inside i_lock.
> > > > 
> > > > I can't find any code outside of ceph where the dcache_lock is used
> > > > within 200 lines of code of the inode->i_lock. The ceph code is not
> > > > nesting them, though.
> > > 
> > > You mustn't have looked very hard? From ceph:
> > > 
> > >        spin_unlock(&dcache_lock);
> > >        spin_unlock(&inode->i_lock);
> > > 
> > > (and yes, acquisition side does go in i_lock->dcache_lock order)
> 
> Sorry, easy to miss with a quick grep when the locks are taken in
> different functions.
> 
> Anyway, this one looks difficult to fix without knowing something
> about Ceph and wtf it is doing there. It's one to punt to the
> maintainer to solve as it's not critical to this patch set.

I took a look at this code and was able to remove the i_lock -> 
dcache_lock nesting.  To be honest, I'm not sure why I did it that way in 
the first place.  The only point (now) where i_lock is used/needed is 
while checking the ceph inode flags, and that's done without holding 
dcache_lock (presumably soon to be d_lock).

The patch is 622386be in the for-next branch of
	git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git

Anyway, hope this helps!  
sage

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal
  2010-10-18 21:27               ` Sage Weil
@ 2010-10-19  3:54                 ` Nick Piggin
  0 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2010-10-19  3:54 UTC (permalink / raw)
  To: Sage Weil
  Cc: Dave Chinner, Nick Piggin, Christoph Hellwig, linux-fsdevel,
	linux-kernel

On Mon, Oct 18, 2010 at 02:27:02PM -0700, Sage Weil wrote:
> I took a look at this code and was able to remove the i_lock -> 
> dcache_lock nesting.  To be honest, I'm not sure why I did it that way in 
> the first place.  The only point (now) where i_lock is used/needed is 
> while checking the ceph inode flags, and that's done without holding 
> dcache_lock (presumably soon to be d_lock).

Great, it was a bit nasty of ceph to add that in without documenting
it, but I guess it got past reviewers so it was up to them to ack it
or point it out.

I think we should probably enforce a rule that dcache_lock not nest
inside i_lock :) Anybody wanting to do that in future had better have
a really good reason.

 
> The patch is 622386be in the for-next branch of
> 	git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
> 
> Anyway, hope this helps!  

It does, thanks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2010-10-19  3:54 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-10-16  8:13 Inode Lock Scalability V4 Dave Chinner
2010-10-16  8:13 ` [PATCH 01/19] fs: switch bdev inode bdi's correctly Dave Chinner
2010-10-16  9:30   ` Nick Piggin
2010-10-16 16:31   ` Christoph Hellwig
2010-10-16  8:13 ` [PATCH 02/19] kernel: add bl_list Dave Chinner
2010-10-16  9:51   ` Nick Piggin
2010-10-16 16:32     ` Christoph Hellwig
2010-10-16  8:13 ` [PATCH 03/19] fs: Convert nr_inodes and nr_unused to per-cpu counters Dave Chinner
2010-10-16  8:29   ` Eric Dumazet
2010-10-16 10:04     ` Dave Chinner
2010-10-16 10:27       ` Eric Dumazet
2010-10-16 17:26         ` Peter Zijlstra
2010-10-17  1:09           ` Dave Chinner
2010-10-17  1:12             ` Christoph Hellwig
2010-10-17  2:16               ` Dave Chinner
2010-10-16  8:13 ` [PATCH 04/19] fs: Implement lazy LRU updates for inodes Dave Chinner
2010-10-16  9:29   ` Nick Piggin
2010-10-16 16:59     ` Christoph Hellwig
2010-10-16 17:29       ` Nick Piggin
2010-10-16 17:34         ` Nick Piggin
2010-10-17  0:47           ` Christoph Hellwig
2010-10-17  0:47         ` Christoph Hellwig
2010-10-17  2:09           ` Nick Piggin
2010-10-17  1:53       ` Dave Chinner
2010-10-16  8:13 ` [PATCH 05/19] fs: inode split IO and LRU lists Dave Chinner
2010-10-16  8:14 ` [PATCH 06/19] fs: Clean up inode reference counting Dave Chinner
2010-10-16  8:14 ` [PATCH 07/19] exofs: use iput() for inode reference count decrements Dave Chinner
2010-10-16  8:14 ` [PATCH 08/19] fs: rework icount to be a locked variable Dave Chinner
2010-10-16  8:14 ` [PATCH 09/19] fs: Factor inode hash operations into functions Dave Chinner
2010-10-16  8:14 ` [PATCH 10/19] fs: Introduce per-bucket inode hash locks Dave Chinner
2010-10-16  8:14 ` [PATCH 11/19] fs: add a per-superblock lock for the inode list Dave Chinner
2010-10-16  8:14 ` [PATCH 12/19] fs: split locking of inode writeback and LRU lists Dave Chinner
2010-10-16  8:14 ` [PATCH 13/19] fs: Protect inode->i_state with the inode->i_lock Dave Chinner
2010-10-16  8:14 ` [PATCH 14/19] fs: introduce a per-cpu last_ino allocator Dave Chinner
2010-10-16  8:14 ` [PATCH 15/19] fs: Make iunique independent of inode_lock Dave Chinner
2010-10-16  8:14 ` [PATCH 16/19] fs: icache remove inode_lock Dave Chinner
2010-10-16  8:14 ` [PATCH 17/19] fs: Reduce inode I_FREEING and factor inode disposal Dave Chinner
2010-10-17  1:30   ` Christoph Hellwig
2010-10-17  2:49     ` Nick Piggin
2010-10-17  4:13       ` Dave Chinner
2010-10-17  4:35         ` Nick Piggin
2010-10-17  5:13           ` Nick Piggin
2010-10-17  6:52             ` Dave Chinner
2010-10-17  7:05               ` Nick Piggin
2010-10-17 23:39                 ` Dave Chinner
2010-10-18 21:27               ` Sage Weil
2010-10-19  3:54                 ` Nick Piggin
2010-10-16  8:14 ` [PATCH 18/19] fs: split __inode_add_to_list Dave Chinner
2010-10-16  8:14 ` [PATCH 19/19] fs: do not assign default i_ino in new_inode Dave Chinner
2010-10-16  9:09   ` Eric Dumazet
2010-10-16 16:35     ` Christoph Hellwig
2010-10-18  9:11       ` Eric Dumazet
2010-10-18 14:48         ` Christoph Hellwig
2010-10-16 17:55 ` Inode Lock Scalability V4 Nick Piggin
2010-10-17  2:47   ` Dave Chinner
2010-10-17  2:55     ` Nick Piggin
2010-10-17  2:57       ` Nick Piggin
2010-10-17  6:10       ` Dave Chinner
2010-10-17  6:34         ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).