* Inode Lock Scalability V6
@ 2010-10-21  0:49 Dave Chinner
From: Dave Chinner @ 2010-10-21  0:49 UTC
  To: linux-fsdevel; +Cc: linux-kernel

This patch set is derived from Nick Piggin's VFS scalability tree.
It is an attempt to push forward the process of finer-grained review
of the series for upstream inclusion. I'm hitting VFS lock contention
problems with XFS on 8-16p machines now, so I need to get this stuff
moving.

This patch set is just the basic inode_lock breakup patches plus a
few more simple changes to the inode code. It stops short of
introducing RCU inode freeing because those changes are not
completely baked yet.

As a result, the full inode handling improvements of Nick's patch
set are not realised with this short series. However, my own testing
indicates that lock traffic and contention are down by an order of
magnitude on an 8-way box for parallel inode create and unlink
workloads, so there are still significant improvements from this
patch set alone.

Version 2 of this series was a complete rework of the original patch
series.  I've pulled in several of the cleanups and re-ordered the
series such that cleanups, factoring and list splitting are done
before any of the locking changes. Instead of converting the inode
state flags first, I've converted them last, ensuring that
manipulations are kept inside other locks rather than outside them.

The series is made up of the following steps (a sketch of the
resulting lock responsibilities follows the list):

	- inode counters are made per-cpu
	- inode LRU manipulations are made lazy
	- i_list is split into two lists (growing the inode by two
	  pointers), one for tracking LRU status and one for
	  writeback status
	- reference counting is factored, then renamed and locked
	  differently
	- protect the iunique counter with its own lock
	- hash lookups and reference counting are cleaned up
	- inode hash operations are factored, then locked per bucket
	- superblock inode list is locked per-superblock
	- inode LRU is locked via a global lock
		- unclear what the best way to split this up from
		  here is, so no attempt is made to optimise
		  further.
		- Currently not showing signs of contention under
		  any workload on an 8p machine.
	- inode IO lists are locked via a per-BDI lock
		- further analysis needed to determine the next step
		  in optimising this list. It is extremely contended
		  under parallel workloads because foreground
		  throttling (balance_dirty_pages) causes unbound
		  writeback parallelism and contention. Fixing the
		  unbound parallelism, I think, is a more important
		  first optimisation step than making the list
		  per-cpu.
	- lock i_state operations with i_lock
	- removed unnecessary i_state lock avoidance optimisations
	- convert last_ino allocation to a percpu counter
	- remove inode_lock
	- push inode number assignment out of the inode allocation code and
	  into the filesystems that require it
	- factor destroying an inode into dispose_one_inode() which
	  is called from reclaim, dispose_list and iput_final.
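
To make the end state concrete, here is a rough sketch of which lock
ends up protecting what once inode_lock is gone. This is an
illustration distilled from the steps above, not code from the
patches; exact names and nesting are defined by the individual
patches:

/*
 * Sketch only: lock responsibilities at the end of this series.
 *
 * inode->i_lock		inode->i_state (and, by the end of
 *				the series, the reference count)
 * hash bucket bit lock		one hash chain each (hlist_bl, bit 0
 *				of the chain head pointer)
 * per-sb inode list lock	sb->s_inodes
 * inode_lru_lock (global)	the inode LRU list
 * per-BDI writeback lock	the bdi dirty/IO lists (b_dirty,
 *				b_io, b_more_io)
 */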

Version 6:
- removed reference to sb_inode_list_lock in documentation
- removed references to writeback_single_inode in comments.
- cleaned up some typos reported by Christian Stroetmann
  <stroetmann@ontolinux.com>.
- dropped unnecessary EXPORT_SYMBOL for bdi_lock_two().
- cleaned up stale remove_inode_hash comment.
- added a new patch to fix an inode hash lookup/removal race by
  protecting wake_up_inode() with the i_lock. This also removes the
  now-unnecessary memory-barrier-based inode_lock contention
  optimisation for clearing I_NEW in unlock_new_inode.

Version 5:
- removed buggy can_unuse() optimisation in prune_icache that the
  lazy LRU code exposes.
- Christoph found a nasty bug in the new hash locking code where the
  hash lock is dropped between the lookup and insert in
  get_new_inode[_fast](). This lookup and insert needs to be atomic,
  so it needs fixing. Thanks to Christoph for finding and fixing
  the bug.

  Detailed changes:
	- iunique rework moved forward in the series to before the
	  inode hash locking changes
	- new patch to move inode reference on successful lookup
	  back inside find_inode[_fast]()
	- moved splitting of inode_add_to_lists forward to before
	  the inode hash locking changes
	- modified the introduction of the new inode hash list locks
	  to be taken outside find_inode[_fast]() and held until the
	  new inode is inserted into the hash. They cover the same
	  scope as the inode_lock covered. This is the bug fix.

Version 4:
- re-added inode reference count check in writeback_single_inode()
  when the inode is clean and only attempt to add the inode to the
  LRU if the inode is unreferenced.
- moved hash_bl_[un]lock into hlist_bl.h introductory patch.
- updated documentation and comments still referencing i_count
- updated documentation and comments still referencing inode_lock
- removed a couple of unneeded include files.
- writeback_single_inode() and sync_inode are now the same, so fold
  writeback_single_inode() into sync_inode.
- moved lock ordering comments around into the patches that
  introduce the locks or change the ordering.
- cleaned up dispose_one_inode comments and layout.
- added patch to start of series to move bdev inodes around bdi's
  as they change the bdi in the inode mapping during the final put
  of the bdev. Changes to this new code propagate through the subsequent
  scalability patches.

Version 3:
- whitespace fix in inode_init_early.
- dropped patch that moves inodes around bdi lists as problem is now
  fixed in mainline.
- added comments explaining lazy inode LRU manipulations.
- added inode_lru_list_{add,del} helpers much earlier to avoid
  needing to export then unexport inode counters.
- renamed i_io to i_wb_list.
- removed iref_locked and just open code internal inode reference
  increments.
- added a WARN_ON() condition to detect iref() being called without
  a pre-existing reference count.
- added kerneldoc comment to iref().
- dropped iref_read() wrapper function patch
- killed the inode_hash_bucket wrapper, use hlist_bl_head directly
- moved spin_[un]lock_bucket wrappers to list_bl.h, and renamed them
  hlist_bl_[un]lock()
- added inode_unhashed() helper function.
- documented use of I_FREEING to ensure removal from inode lru and
  writeback lists is kept sane when the inode is being freed.
- added inode_wb_list_del() helper to avoid exporting the
  inode_to_bdi() function.
- added comments to explain why we need to set the i_state field
  before adding new inodes to various lists
- renamed last_ino_get() to get_next_ino().
- kept invalidate_list/dispose_list pairing for invalidate_inodes(),
  but changed the dispose list to use the i_sb_list pointer in the
  inode instead of the i_lru to avoid needing to take the
  inode_lru_lock for every inode on the superblock list.
- added patch from Christoph Hellwig to split up inode_add_to_lists.
  Modified the new function names to match the naming convention
  used by all the other list helpers in inode.c, and added a
  matching inode_sb_list_del() function for symmetry.
- added patch from Christoph Hellwig to move inode number assignment
  in get_new_inode() to the callers that don't directly assign an
  inode number.

Version 2:
- complete rework of series
--
The following changes since commit cb655d0f3d57c23db51b981648e452988c0223f9:

  Linux 2.6.36-rc7 (2010-10-06 13:39:52 -0700)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git inode-scale

Christoph Hellwig (4):
      fs: Stop abusing find_inode_fast in iunique
      fs: move i_ref increments into find_inode/find_inode_fast
      fs: remove inode_add_to_list/__inode_add_to_list
      fs: do not assign default i_ino in new_inode

Dave Chinner (13):
      fs: switch bdev inode bdi's correctly
      fs: Convert nr_inodes and nr_unused to per-cpu counters
      fs: Clean up inode reference counting
      exofs: use iput() for inode reference count decrements
      fs: rework icount to be a locked variable
      fs: Factor inode hash operations into functions
      fs: Introduce per-bucket inode hash locks
      fs: add a per-superblock lock for the inode list
      fs: split locking of inode writeback and LRU lists
      fs: Protect inode->i_state with the inode->i_lock
      fs: protect wake_up_inode with inode->i_lock
      fs: icache remove inode_lock
      fs: Reduce inode I_FREEING and factor inode disposal

Eric Dumazet (1):
      fs: introduce a per-cpu last_ino allocator

Nick Piggin (3):
      kernel: add bl_list
      fs: Implement lazy LRU updates for inodes
      fs: inode split IO and LRU lists

 Documentation/filesystems/Locking        |    2 +-
 Documentation/filesystems/porting        |    8 +-
 Documentation/filesystems/vfs.txt        |   16 +-
 arch/powerpc/platforms/cell/spufs/file.c |    2 +-
 drivers/infiniband/hw/ipath/ipath_fs.c   |    1 +
 drivers/infiniband/hw/qib/qib_fs.c       |    1 +
 drivers/misc/ibmasm/ibmasmfs.c           |    1 +
 drivers/oprofile/oprofilefs.c            |    1 +
 drivers/usb/core/inode.c                 |    1 +
 drivers/usb/gadget/f_fs.c                |    1 +
 drivers/usb/gadget/inode.c               |    1 +
 fs/9p/vfs_inode.c                        |    5 +-
 fs/affs/inode.c                          |    2 +-
 fs/afs/dir.c                             |    2 +-
 fs/anon_inodes.c                         |    8 +-
 fs/autofs4/inode.c                       |    1 +
 fs/bfs/dir.c                             |    2 +-
 fs/binfmt_misc.c                         |    1 +
 fs/block_dev.c                           |   42 +-
 fs/btrfs/inode.c                         |   18 +-
 fs/buffer.c                              |    2 +-
 fs/ceph/mds_client.c                     |    2 +-
 fs/cifs/inode.c                          |    2 +-
 fs/coda/dir.c                            |    2 +-
 fs/configfs/inode.c                      |    1 +
 fs/debugfs/inode.c                       |    1 +
 fs/drop_caches.c                         |   19 +-
 fs/exofs/inode.c                         |    6 +-
 fs/exofs/namei.c                         |    2 +-
 fs/ext2/namei.c                          |    2 +-
 fs/ext3/ialloc.c                         |    4 +-
 fs/ext3/namei.c                          |    2 +-
 fs/ext4/ialloc.c                         |    4 +-
 fs/ext4/mballoc.c                        |    1 +
 fs/ext4/namei.c                          |    2 +-
 fs/freevxfs/vxfs_inode.c                 |    1 +
 fs/fs-writeback.c                        |  235 +++++----
 fs/fuse/control.c                        |    1 +
 fs/gfs2/ops_inode.c                      |    2 +-
 fs/hfs/hfs_fs.h                          |    2 +-
 fs/hfs/inode.c                           |    2 +-
 fs/hfsplus/dir.c                         |    2 +-
 fs/hfsplus/hfsplus_fs.h                  |    2 +-
 fs/hfsplus/inode.c                       |    2 +-
 fs/hpfs/inode.c                          |    2 +-
 fs/hugetlbfs/inode.c                     |    1 +
 fs/inode.c                               |  850 +++++++++++++++++++-----------
 fs/internal.h                            |   11 +
 fs/jffs2/dir.c                           |    4 +-
 fs/jfs/jfs_txnmgr.c                      |    2 +-
 fs/jfs/namei.c                           |    2 +-
 fs/libfs.c                               |    2 +-
 fs/locks.c                               |    2 +-
 fs/logfs/dir.c                           |    2 +-
 fs/logfs/inode.c                         |    2 +-
 fs/logfs/readwrite.c                     |    2 +-
 fs/minix/namei.c                         |    2 +-
 fs/namei.c                               |    2 +-
 fs/nfs/dir.c                             |    2 +-
 fs/nfs/getroot.c                         |    2 +-
 fs/nfs/inode.c                           |    4 +-
 fs/nfs/nfs4state.c                       |    2 +-
 fs/nfs/write.c                           |    2 +-
 fs/nilfs2/gcdat.c                        |    1 +
 fs/nilfs2/gcinode.c                      |   22 +-
 fs/nilfs2/mdt.c                          |    5 +-
 fs/nilfs2/namei.c                        |    2 +-
 fs/nilfs2/segment.c                      |    2 +-
 fs/nilfs2/the_nilfs.h                    |    2 +-
 fs/notify/inode_mark.c                   |   46 +-
 fs/notify/mark.c                         |    1 -
 fs/notify/vfsmount_mark.c                |    1 -
 fs/ntfs/inode.c                          |   10 +-
 fs/ntfs/super.c                          |    6 +-
 fs/ocfs2/dlmfs/dlmfs.c                   |    2 +
 fs/ocfs2/inode.c                         |    2 +-
 fs/ocfs2/namei.c                         |    2 +-
 fs/pipe.c                                |    2 +
 fs/proc/base.c                           |    2 +
 fs/proc/proc_sysctl.c                    |    2 +
 fs/quota/dquot.c                         |   32 +-
 fs/ramfs/inode.c                         |    1 +
 fs/reiserfs/namei.c                      |    2 +-
 fs/reiserfs/stree.c                      |    2 +-
 fs/reiserfs/xattr.c                      |    2 +-
 fs/smbfs/inode.c                         |    2 +-
 fs/super.c                               |    1 +
 fs/sysv/namei.c                          |    2 +-
 fs/ubifs/dir.c                           |    2 +-
 fs/ubifs/super.c                         |    2 +-
 fs/udf/inode.c                           |    2 +-
 fs/udf/namei.c                           |    2 +-
 fs/ufs/namei.c                           |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c               |    1 +
 fs/xfs/linux-2.6/xfs_iops.c              |    6 +-
 fs/xfs/linux-2.6/xfs_trace.h             |    2 +-
 fs/xfs/xfs_inode.h                       |    3 +-
 include/linux/backing-dev.h              |    3 +
 include/linux/fs.h                       |   43 +-
 include/linux/list_bl.h                  |  146 +++++
 include/linux/poison.h                   |    2 +
 include/linux/writeback.h                |    4 -
 ipc/mqueue.c                             |    3 +-
 kernel/cgroup.c                          |    1 +
 kernel/futex.c                           |    2 +-
 kernel/sysctl.c                          |    4 +-
 mm/backing-dev.c                         |   28 +-
 mm/filemap.c                             |    6 +-
 mm/rmap.c                                |    6 +-
 mm/shmem.c                               |    7 +-
 net/socket.c                             |    3 +-
 net/sunrpc/rpc_pipe.c                    |    1 +
 security/inode.c                         |    1 +
 security/selinux/selinuxfs.c             |    1 +
 114 files changed, 1128 insertions(+), 624 deletions(-)
 create mode 100644 include/linux/list_bl.h

* [PATCH 01/21] fs: switch bdev inode bdi's correctly
From: Dave Chinner @ 2010-10-21  0:49 UTC
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

bdev inodes can remain dirty even after their last close. Hence the
BDI associated with the bdev inode gets modified during the last
close to point to the default BDI. However, the bdev inode still
needs to be moved to the dirty lists of the new BDI, otherwise it
will corrupt the writeback list it was left on.

Add a new function bdev_inode_switch_bdi() to move all the bdi state
from the old bdi to the new one safely. This is only a temporary
measure until the bdev inode<->bdi lifecycle problems are sorted
out.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/block_dev.c |   26 +++++++++++++++++++++-----
 1 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 50e8c85..501eab5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -48,6 +48,21 @@ inline struct block_device *I_BDEV(struct inode *inode)
 
 EXPORT_SYMBOL(I_BDEV);
 
+/*
+ * Move the inode from its current bdi to a new bdi. If the inode is dirty
+ * we need to move it onto the dirty list of @dst so that the inode is always
+ * on the right list.
+ */
+static void bdev_inode_switch_bdi(struct inode *inode,
+			struct backing_dev_info *dst)
+{
+	spin_lock(&inode_lock);
+	inode->i_data.backing_dev_info = dst;
+	if (inode->i_state & I_DIRTY)
+		list_move(&inode->i_list, &dst->wb.b_dirty);
+	spin_unlock(&inode_lock);
+}
+
 static sector_t max_block(struct block_device *bdev)
 {
 	sector_t retval = ~((sector_t)0);
@@ -1390,7 +1405,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 				bdi = blk_get_backing_dev_info(bdev);
 				if (bdi == NULL)
 					bdi = &default_backing_dev_info;
-				bdev->bd_inode->i_data.backing_dev_info = bdi;
+				bdev_inode_switch_bdi(bdev->bd_inode, bdi);
 			}
 			if (bdev->bd_invalidated)
 				rescan_partitions(disk, bdev);
@@ -1405,8 +1420,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 			if (ret)
 				goto out_clear;
 			bdev->bd_contains = whole;
-			bdev->bd_inode->i_data.backing_dev_info =
-			   whole->bd_inode->i_data.backing_dev_info;
+			bdev_inode_switch_bdi(bdev->bd_inode,
+				whole->bd_inode->i_data.backing_dev_info);
 			bdev->bd_part = disk_get_part(disk, partno);
 			if (!(disk->flags & GENHD_FL_UP) ||
 			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
@@ -1439,7 +1454,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 	disk_put_part(bdev->bd_part);
 	bdev->bd_disk = NULL;
 	bdev->bd_part = NULL;
-	bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
+	bdev_inode_switch_bdi(bdev->bd_inode, &default_backing_dev_info);
 	if (bdev != bdev->bd_contains)
 		__blkdev_put(bdev->bd_contains, mode, 1);
 	bdev->bd_contains = NULL;
@@ -1533,7 +1548,8 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		disk_put_part(bdev->bd_part);
 		bdev->bd_part = NULL;
 		bdev->bd_disk = NULL;
-		bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
+		bdev_inode_switch_bdi(bdev->bd_inode,
+					&default_backing_dev_info);
 		if (bdev != bdev->bd_contains)
 			victim = bdev->bd_contains;
 		bdev->bd_contains = NULL;
-- 
1.7.1

* [PATCH 02/21] kernel: add bl_list
From: Dave Chinner @ 2010-10-21  0:49 UTC
  To: linux-fsdevel; +Cc: linux-kernel

From: Nick Piggin <npiggin@suse.de>

Introduce a type of hlist that can support the use of the lowest bit
in the hlist_head. This will be subsequently used to implement
per-bucket bit spinlock for inode hashes.
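
As an illustration of the intended use (a sketch only: the "bl_hash"
table and "struct foo" are hypothetical, while the hlist_bl_* calls
are the ones added below), a per-bucket locked insert looks like:

	static struct hlist_bl_head bl_hash[1024];

	struct foo {
		struct hlist_bl_node node;
		unsigned long key;
	};

	static void foo_insert(struct foo *f, unsigned int bucket)
	{
		/* bit_spin_lock on bit 0 of bl_hash[bucket].first */
		hlist_bl_lock(&bl_hash[bucket]);
		hlist_bl_add_head(&f->node, &bl_hash[bucket]);
		hlist_bl_unlock(&bl_hash[bucket]);
	}

The head pointer and the lock share a single word, so each bucket
needs no separate spinlock and the table stays one pointer per
bucket.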

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/list_bl.h |  145 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/poison.h  |    2 +
 2 files changed, 147 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/list_bl.h

diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
new file mode 100644
index 0000000..0d791ff
--- /dev/null
+++ b/include/linux/list_bl.h
@@ -0,0 +1,145 @@
+#ifndef _LINUX_LIST_BL_H
+#define _LINUX_LIST_BL_H
+
+#include <linux/list.h>
+#include <linux/bit_spinlock.h>
+
+/*
+ * Special version of lists, where head of the list has a bit spinlock
+ * in the lowest bit. This is useful for scalable hash tables without
+ * increasing memory footprint overhead.
+ *
+ * For modification operations, the 0 bit of hlist_bl_head->first
+ * pointer must be set.
+ */
+
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define LIST_BL_LOCKMASK	1UL
+#else
+#define LIST_BL_LOCKMASK	0UL
+#endif
+
+#ifdef CONFIG_DEBUG_LIST
+#define LIST_BL_BUG_ON(x) BUG_ON(x)
+#else
+#define LIST_BL_BUG_ON(x)
+#endif
+
+
+struct hlist_bl_head {
+	struct hlist_bl_node *first;
+};
+
+struct hlist_bl_node {
+	struct hlist_bl_node *next, **pprev;
+};
+#define INIT_HLIST_BL_HEAD(ptr) \
+	((ptr)->first = NULL)
+
+static inline void init_hlist_bl_node(struct hlist_bl_node *h)
+{
+	h->next = NULL;
+	h->pprev = NULL;
+}
+
+#define hlist_bl_entry(ptr, type, member) container_of(ptr, type, member)
+
+static inline int hlist_bl_unhashed(const struct hlist_bl_node *h)
+{
+	return !h->pprev;
+}
+
+static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
+{
+	return (struct hlist_bl_node *)
+		((unsigned long)h->first & ~LIST_BL_LOCKMASK);
+}
+
+static inline void hlist_bl_set_first(struct hlist_bl_head *h,
+					struct hlist_bl_node *n)
+{
+	LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+	LIST_BL_BUG_ON(!bit_spin_is_locked(0, (unsigned long *)&h->first));
+	h->first = (struct hlist_bl_node *)((unsigned long)n | LIST_BL_LOCKMASK);
+}
+
+static inline int hlist_bl_empty(const struct hlist_bl_head *h)
+{
+	return !((unsigned long)h->first & ~LIST_BL_LOCKMASK);
+}
+
+static inline void hlist_bl_add_head(struct hlist_bl_node *n,
+					struct hlist_bl_head *h)
+{
+	struct hlist_bl_node *first = hlist_bl_first(h);
+
+	n->next = first;
+	if (first)
+		first->pprev = &n->next;
+	n->pprev = &h->first;
+	hlist_bl_set_first(h, n);
+}
+
+static inline void __hlist_bl_del(struct hlist_bl_node *n)
+{
+	struct hlist_bl_node *next = n->next;
+	struct hlist_bl_node **pprev = n->pprev;
+
+	LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+
+	/* pprev may be `first`, so be careful not to lose the lock bit */
+	*pprev = (struct hlist_bl_node *)
+			((unsigned long)next |
+			 ((unsigned long)*pprev & LIST_BL_LOCKMASK));
+	if (next)
+		next->pprev = pprev;
+}
+
+static inline void hlist_bl_del(struct hlist_bl_node *n)
+{
+	__hlist_bl_del(n);
+	n->next = BL_LIST_POISON1;
+	n->pprev = BL_LIST_POISON2;
+}
+
+static inline void hlist_bl_del_init(struct hlist_bl_node *n)
+{
+	if (!hlist_bl_unhashed(n)) {
+		__hlist_bl_del(n);
+		init_hlist_bl_node(n);
+	}
+}
+
+/**
+ * hlist_bl_for_each_entry	- iterate over list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_bl_node to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry(tpos, pos, head, member)		\
+	for (pos = hlist_bl_first(head);				\
+	     pos &&							\
+		({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
+	     pos = pos->next)
+
+/**
+ * hlist_bl_lock	- lock a hash list
+ * @h:	hash list head to lock
+ */
+static inline void hlist_bl_lock(struct hlist_bl_head *h)
+{
+	bit_spin_lock(0, (unsigned long *)h);
+}
+
+/**
+ * hlist_bl_unlock	- unlock a hash list
+ * @h:	hash list head to unlock
+ */
+static inline void hlist_bl_unlock(struct hlist_bl_head *h)
+{
+	__bit_spin_unlock(0, (unsigned long *)h);
+}
+
+#endif /* _LINUX_LIST_BL_H */
diff --git a/include/linux/poison.h b/include/linux/poison.h
index 2110a81..d367d39 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -22,6 +22,8 @@
 #define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
 #define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)
 
+#define BL_LIST_POISON1  ((void *) 0x00300300 + POISON_POINTER_DELTA)
+#define BL_LIST_POISON2  ((void *) 0x00400400 + POISON_POINTER_DELTA)
 /********** include/linux/timer.h **********/
 /*
  * Magic number "tsta" to indicate a static timer initializer
-- 
1.7.1

* [PATCH 03/21] fs: Convert nr_inodes and nr_unused to per-cpu counters
From: Dave Chinner @ 2010-10-21  0:49 UTC
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.

Based on a patch originally from Eric Dumazet.
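
For reference, the usage pattern this introduces (condensed from the
fs/inode.c hunks below, not a new API):

	static struct percpu_counter nr_inodes;

	percpu_counter_init(&nr_inodes, 0);	/* inode_init() */
	percpu_counter_inc(&nr_inodes);		/* inode_init_always() */
	percpu_counter_dec(&nr_inodes);		/* __destroy_inode() */

	/* summing is slower, but only done for /proc and heuristics */
	s64 nr = percpu_counter_sum_positive(&nr_inodes);

The inc/dec fast paths touch only per-cpu state in the common case,
so the counters no longer need to be updated under any list lock.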

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/fs-writeback.c  |    5 +--
 fs/inode.c         |   64 ++++++++++++++++++++++++++++++++++++---------------
 include/linux/fs.h |    4 ++-
 kernel/sysctl.c    |    4 +-
 4 files changed, 52 insertions(+), 25 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ab38fef..58a95b7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -723,7 +723,7 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
 	wb->last_old_flush = jiffies;
 	nr_pages = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+			get_nr_dirty_inodes();
 
 	if (nr_pages) {
 		struct wb_writeback_work work = {
@@ -1090,8 +1090,7 @@ void writeback_inodes_sb(struct super_block *sb)
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	work.nr_pages = nr_dirty + nr_unstable +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+	work.nr_pages = nr_dirty + nr_unstable + get_nr_dirty_inodes();
 
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
diff --git a/fs/inode.c b/fs/inode.c
index 8646433..b3b6a4b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -103,8 +103,41 @@ static DECLARE_RWSEM(iprune_sem);
  */
 struct inodes_stat_t inodes_stat;
 
+static struct percpu_counter nr_inodes __cacheline_aligned_in_smp;
+static struct percpu_counter nr_inodes_unused __cacheline_aligned_in_smp;
+
 static struct kmem_cache *inode_cachep __read_mostly;
 
+static inline int get_nr_inodes(void)
+{
+	return percpu_counter_sum_positive(&nr_inodes);
+}
+
+static inline int get_nr_inodes_unused(void)
+{
+	return percpu_counter_sum_positive(&nr_inodes_unused);
+}
+
+int get_nr_dirty_inodes(void)
+{
+	int nr_dirty = get_nr_inodes() - get_nr_inodes_unused();
+	return nr_dirty > 0 ? nr_dirty : 0;
+
+}
+
+/*
+ * Handle nr_inode sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_inodes(ctl_table *table, int write,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	inodes_stat.nr_inodes = get_nr_inodes();
+	inodes_stat.nr_unused = get_nr_inodes_unused();
+	return proc_dointvec(table, write, buffer, lenp, ppos);
+}
+#endif
+
 static void wake_up_inode(struct inode *inode)
 {
 	/*
@@ -192,6 +225,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_fsnotify_mask = 0;
 #endif
 
+	percpu_counter_inc(&nr_inodes);
+
 	return 0;
 out:
 	return -ENOMEM;
@@ -232,6 +267,7 @@ void __destroy_inode(struct inode *inode)
 	if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
 		posix_acl_release(inode->i_default_acl);
 #endif
+	percpu_counter_dec(&nr_inodes);
 }
 EXPORT_SYMBOL(__destroy_inode);
 
@@ -286,7 +322,7 @@ void __iget(struct inode *inode)
 
 	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 		list_move(&inode->i_list, &inode_in_use);
-	inodes_stat.nr_unused--;
+	percpu_counter_dec(&nr_inodes_unused);
 }
 
 void end_writeback(struct inode *inode)
@@ -327,8 +363,6 @@ static void evict(struct inode *inode)
  */
 static void dispose_list(struct list_head *head)
 {
-	int nr_disposed = 0;
-
 	while (!list_empty(head)) {
 		struct inode *inode;
 
@@ -344,11 +378,7 @@ static void dispose_list(struct list_head *head)
 
 		wake_up_inode(inode);
 		destroy_inode(inode);
-		nr_disposed++;
 	}
-	spin_lock(&inode_lock);
-	inodes_stat.nr_inodes -= nr_disposed;
-	spin_unlock(&inode_lock);
 }
 
 /*
@@ -357,7 +387,7 @@ static void dispose_list(struct list_head *head)
 static int invalidate_list(struct list_head *head, struct list_head *dispose)
 {
 	struct list_head *next;
-	int busy = 0, count = 0;
+	int busy = 0;
 
 	next = head->next;
 	for (;;) {
@@ -383,13 +413,11 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 			list_move(&inode->i_list, dispose);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
-			count++;
+			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
 		busy = 1;
 	}
-	/* only unused inodes may be cached with i_count zero */
-	inodes_stat.nr_unused -= count;
 	return busy;
 }
 
@@ -448,7 +476,6 @@ static int can_unuse(struct inode *inode)
 static void prune_icache(int nr_to_scan)
 {
 	LIST_HEAD(freeable);
-	int nr_pruned = 0;
 	int nr_scanned;
 	unsigned long reap = 0;
 
@@ -484,9 +511,8 @@ static void prune_icache(int nr_to_scan)
 		list_move(&inode->i_list, &freeable);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
-		nr_pruned++;
+		percpu_counter_dec(&nr_inodes_unused);
 	}
-	inodes_stat.nr_unused -= nr_pruned;
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
@@ -518,7 +544,7 @@ static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 			return -1;
 		prune_icache(nr);
 	}
-	return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
+	return (get_nr_inodes_unused() / 100) * sysctl_vfs_cache_pressure;
 }
 
 static struct shrinker icache_shrinker = {
@@ -595,7 +621,6 @@ static inline void
 __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
 			struct inode *inode)
 {
-	inodes_stat.nr_inodes++;
 	list_add(&inode->i_list, &inode_in_use);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	if (head)
@@ -1215,7 +1240,7 @@ static void iput_final(struct inode *inode)
 	if (!drop) {
 		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 			list_move(&inode->i_list, &inode_unused);
-		inodes_stat.nr_unused++;
+		percpu_counter_inc(&nr_inodes_unused);
 		if (sb->s_flags & MS_ACTIVE) {
 			spin_unlock(&inode_lock);
 			return;
@@ -1227,14 +1252,13 @@ static void iput_final(struct inode *inode)
 		spin_lock(&inode_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		inodes_stat.nr_unused--;
+		percpu_counter_dec(&nr_inodes_unused);
 		hlist_del_init(&inode->i_hash);
 	}
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
 	evict(inode);
 	spin_lock(&inode_lock);
@@ -1503,6 +1527,8 @@ void __init inode_init(void)
 					 SLAB_MEM_SPREAD),
 					 init_once);
 	register_shrinker(&icache_shrinker);
+	percpu_counter_init(&nr_inodes, 0);
+	percpu_counter_init(&nr_inodes_unused, 0);
 
 	/* Hash may have been set up in inode_init_early */
 	if (!hashdist)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63d069b..1fb92f9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -407,6 +407,7 @@ extern struct files_stat_struct files_stat;
 extern int get_max_files(void);
 extern int sysctl_nr_open;
 extern struct inodes_stat_t inodes_stat;
+extern int get_nr_dirty_inodes(void);
 extern int leases_enable, lease_break_time;
 
 struct buffer_head;
@@ -2474,7 +2475,8 @@ ssize_t simple_attr_write(struct file *file, const char __user *buf,
 struct ctl_table;
 int proc_nr_files(struct ctl_table *table, int write,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
-
+int proc_nr_inodes(struct ctl_table *table, int write,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);
 int __init get_filesystem_list(char *buf);
 
 #define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f88552c..33d1733 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1340,14 +1340,14 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 2*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_nr_inodes,
 	},
 	{
 		.procname	= "inode-state",
 		.data		= &inodes_stat,
 		.maxlen		= 7*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_nr_inodes,
 	},
 	{
 		.procname	= "file-nr",
-- 
1.7.1

* [PATCH 04/21] fs: Implement lazy LRU updates for inodes
From: Dave Chinner @ 2010-10-21  0:49 UTC
  To: linux-fsdevel; +Cc: linux-kernel

From: Nick Piggin <npiggin@suse.de>

Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic.  We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount of
contention on LRU updates.

This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under their own lock.
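
Condensed from the prune_icache() changes below (counter updates and
pagecache/buffer handling omitted), reclaim now ages the inode at
the tail of the LRU as follows:

	if (atomic_read(&inode->i_count) ||
	    (inode->i_state & ~I_REFERENCED)) {
		/* back in use, or dirty/locked: lazily drop from LRU */
		list_del_init(&inode->i_list);
	} else if (inode->i_state & I_REFERENCED) {
		/* touched since the last scan: one more pass */
		inode->i_state &= ~I_REFERENCED;
		list_move(&inode->i_list, &inode_unused);
	} else {
		/* cold and unreferenced: reclaim it */
	}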

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/fs-writeback.c         |   14 +++---
 fs/inode.c                |  111 +++++++++++++++++++++++++++-----------------
 fs/internal.h             |    6 +++
 include/linux/fs.h        |   13 +++---
 include/linux/writeback.h |    1 -
 5 files changed, 88 insertions(+), 57 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 58a95b7..33e9857 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -408,16 +408,16 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * completion.
 			 */
 			redirty_tail(inode);
-		} else if (atomic_read(&inode->i_count)) {
-			/*
-			 * The inode is clean, inuse
-			 */
-			list_move(&inode->i_list, &inode_in_use);
 		} else {
 			/*
-			 * The inode is clean, unused
+			 * The inode is clean. If it is unused, then make sure
+			 * that it is put on the LRU correctly as iput_final()
+			 * does not move dirty inodes to the LRU and dirty
+			 * inodes are removed from the LRU during scanning.
 			 */
-			list_move(&inode->i_list, &inode_unused);
+			list_del_init(&inode->i_list);
+			if (!atomic_read(&inode->i_count))
+				inode_lru_list_add(inode);
 		}
 	}
 	inode_sync_complete(inode);
diff --git a/fs/inode.c b/fs/inode.c
index b3b6a4b..f47ec71 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -72,7 +72,6 @@ static unsigned int i_hash_shift __read_mostly;
  * allowing for low-overhead inode sync() operations.
  */
 
-LIST_HEAD(inode_in_use);
 LIST_HEAD(inode_unused);
 static struct hlist_head *inode_hashtable __read_mostly;
 
@@ -291,6 +290,7 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_dentry);
 	INIT_LIST_HEAD(&inode->i_devices);
+	INIT_LIST_HEAD(&inode->i_list);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
@@ -317,12 +317,21 @@ static void init_once(void *foo)
  */
 void __iget(struct inode *inode)
 {
-	if (atomic_inc_return(&inode->i_count) != 1)
-		return;
+	atomic_inc(&inode->i_count);
+}
+
+void inode_lru_list_add(struct inode *inode)
+{
+	list_add(&inode->i_list, &inode_unused);
+	percpu_counter_inc(&nr_inodes_unused);
+}
 
-	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
-		list_move(&inode->i_list, &inode_in_use);
-	percpu_counter_dec(&nr_inodes_unused);
+void inode_lru_list_del(struct inode *inode)
+{
+	if (!list_empty(&inode->i_list)) {
+		list_del_init(&inode->i_list);
+		percpu_counter_dec(&nr_inodes_unused);
+	}
 }
 
 void end_writeback(struct inode *inode)
@@ -367,7 +376,7 @@ static void dispose_list(struct list_head *head)
 		struct inode *inode;
 
 		inode = list_first_entry(head, struct inode, i_list);
-		list_del(&inode->i_list);
+		list_del_init(&inode->i_list);
 
 		evict(inode);
 
@@ -410,9 +419,9 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 			continue;
 		invalidate_inode_buffers(inode);
 		if (!atomic_read(&inode->i_count)) {
-			list_move(&inode->i_list, dispose);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
+			list_move(&inode->i_list, dispose);
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
@@ -447,31 +456,21 @@ int invalidate_inodes(struct super_block *sb)
 }
 EXPORT_SYMBOL(invalidate_inodes);
 
-static int can_unuse(struct inode *inode)
-{
-	if (inode->i_state)
-		return 0;
-	if (inode_has_buffers(inode))
-		return 0;
-	if (atomic_read(&inode->i_count))
-		return 0;
-	if (inode->i_data.nrpages)
-		return 0;
-	return 1;
-}
-
 /*
- * Scan `goal' inodes on the unused list for freeable ones. They are moved to
- * a temporary list and then are freed outside inode_lock by dispose_list().
+ * Scan `goal' inodes on the unused list for freeable ones. They are moved to a
+ * temporary list and then are freed outside inode_lock by dispose_list().
  *
  * Any inodes which are pinned purely because of attached pagecache have their
- * pagecache removed.  We expect the final iput() on that inode to add it to
- * the front of the inode_unused list.  So look for it there and if the
- * inode is still freeable, proceed.  The right inode is found 99.9% of the
- * time in testing on a 4-way.
+ * pagecache removed.  If the inode has metadata buffers attached to
+ * mapping->private_list then try to remove them.
  *
- * If the inode has metadata buffers attached to mapping->private_list then
- * try to remove them.
+ * If the inode has the I_REFERENCED flag set, it means that it has been used
+ * recently - the flag is set in iput_final(). When we encounter such an inode,
+ * clear the flag and move it to the back of the LRU so it gets another pass
+ * through the LRU before it gets reclaimed. This is necessary because of the
+ * fact we are doing lazy LRU updates to minimise lock contention so the LRU
+ * does not have strict ordering. Hence we don't want to reclaim inodes with
+ * this flag set because they are the inodes that are out of order.
  */
 static void prune_icache(int nr_to_scan)
 {
@@ -489,8 +488,21 @@ static void prune_icache(int nr_to_scan)
 
 		inode = list_entry(inode_unused.prev, struct inode, i_list);
 
-		if (inode->i_state || atomic_read(&inode->i_count)) {
+		/*
+		 * Referenced or dirty inodes are still in use. Give them
+		 * another pass through the LRU as we cannot reclaim them now.
+		 */
+		if (atomic_read(&inode->i_count) ||
+		    (inode->i_state & ~I_REFERENCED)) {
+			list_del_init(&inode->i_list);
+			percpu_counter_dec(&nr_inodes_unused);
+			continue;
+		}
+
+		/* recently referenced inodes get one more pass */
+		if (inode->i_state & I_REFERENCED) {
 			list_move(&inode->i_list, &inode_unused);
+			inode->i_state &= ~I_REFERENCED;
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
@@ -500,13 +512,19 @@ static void prune_icache(int nr_to_scan)
 				reap += invalidate_mapping_pages(&inode->i_data,
 								0, -1);
 			iput(inode);
-			spin_lock(&inode_lock);
 
-			if (inode != list_entry(inode_unused.next,
-						struct inode, i_list))
-				continue;	/* wrong inode or list_empty */
-			if (!can_unuse(inode))
-				continue;
+			/*
+			 * Rather than try to determine if we can still use the
+			 * inode after calling iput(), leave the inode where it
+			 * is on the LRU. If we race with another reclaimer,
+			 * that reclaimer will either see a reference count
+			 * or the I_REFERENCED flag, and move the inode to the
+			 * back of the LRU. If we don't race, then we'll see
+			 * the I_REFERENCED flag on the next pass and do the
+			 * same. Either way, we won't spin on it in this loop.
+			 */
+			spin_lock(&inode_lock);
+			continue;
 		}
 		list_move(&inode->i_list, &freeable);
 		WARN_ON(inode->i_state & I_NEW);
@@ -621,7 +639,6 @@ static inline void
 __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
 			struct inode *inode)
 {
-	list_add(&inode->i_list, &inode_in_use);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	if (head)
 		hlist_add_head(&inode->i_hash, head);
@@ -1238,10 +1255,12 @@ static void iput_final(struct inode *inode)
 		drop = generic_drop_inode(inode);
 
 	if (!drop) {
-		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
-			list_move(&inode->i_list, &inode_unused);
-		percpu_counter_inc(&nr_inodes_unused);
 		if (sb->s_flags & MS_ACTIVE) {
+			inode->i_state |= I_REFERENCED;
+			if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
+				list_del_init(&inode->i_list);
+				inode_lru_list_add(inode);
+			}
 			spin_unlock(&inode_lock);
 			return;
 		}
@@ -1252,13 +1271,19 @@ static void iput_final(struct inode *inode)
 		spin_lock(&inode_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		percpu_counter_dec(&nr_inodes_unused);
 		hlist_del_init(&inode->i_hash);
 	}
-	list_del_init(&inode->i_list);
-	list_del_init(&inode->i_sb_list);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
+
+	/*
+	 * After we delete the inode from the LRU here, we avoid moving dirty
+	 * inodes back onto the LRU now because I_FREEING is set and hence
+	 * writeback_single_inode() won't move the inode around.
+	 */
+	inode_lru_list_del(inode);
+
+	list_del_init(&inode->i_sb_list);
 	spin_unlock(&inode_lock);
 	evict(inode);
 	spin_lock(&inode_lock);
diff --git a/fs/internal.h b/fs/internal.h
index a6910e9..ece3565 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -101,3 +101,9 @@ extern void put_super(struct super_block *sb);
 struct nameidata;
 extern struct file *nameidata_to_filp(struct nameidata *);
 extern void release_open_intent(struct nameidata *);
+
+/*
+ * inode.c
+ */
+extern void inode_lru_list_add(struct inode *inode);
+extern void inode_lru_list_del(struct inode *inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1fb92f9..af1d516 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1632,16 +1632,17 @@ struct super_operations {
  *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
-#define I_DIRTY_SYNC		1
-#define I_DIRTY_DATASYNC	2
-#define I_DIRTY_PAGES		4
+#define I_DIRTY_SYNC		0x01
+#define I_DIRTY_DATASYNC	0x02
+#define I_DIRTY_PAGES		0x04
 #define __I_NEW			3
 #define I_NEW			(1 << __I_NEW)
-#define I_WILL_FREE		16
-#define I_FREEING		32
-#define I_CLEAR			64
+#define I_WILL_FREE		0x10
+#define I_FREEING		0x20
+#define I_CLEAR			0x40
 #define __I_SYNC		7
 #define I_SYNC			(1 << __I_SYNC)
+#define I_REFERENCED		0x100
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 72a5d64..f956b66 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -10,7 +10,6 @@
 struct backing_dev_info;
 
 extern spinlock_t inode_lock;
-extern struct list_head inode_in_use;
 extern struct list_head inode_unused;
 
 /*
-- 
1.7.1

* [PATCH 05/21] fs: inode split IO and LRU lists
From: Dave Chinner @ 2010-10-21  0:49 UTC
  To: linux-fsdevel; +Cc: linux-kernel

From: Nick Piggin <npiggin@suse.de>

The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.
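
A sketch of where this leads (the lock names here come from later
patches in this series and are not part of this patch):

	struct inode {
		...
		struct list_head i_wb_list;	/* writeback lists; later
						   under a per-BDI lock */
		struct list_head i_lru;		/* inode LRU; later under
						   the global inode_lru_lock */
		...
	};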

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/block_dev.c            |    4 ++--
 fs/fs-writeback.c         |   27 ++++++++++++++-------------
 fs/inode.c                |   38 +++++++++++++++++++++-----------------
 fs/nilfs2/mdt.c           |    3 ++-
 include/linux/fs.h        |    3 ++-
 include/linux/writeback.h |    1 -
 mm/backing-dev.c          |    6 +++---
 7 files changed, 44 insertions(+), 38 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 501eab5..63b1c4c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -58,8 +58,8 @@ static void bdev_inode_switch_bdi(struct inode *inode,
 {
 	spin_lock(&inode_lock);
 	inode->i_data.backing_dev_info = dst;
-	if (inode->i_state & I_DIRTY)
-		list_move(&inode->i_list, &dst->wb.b_dirty);
+	if (!list_empty(&inode->i_wb_list))
+		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
 	spin_unlock(&inode_lock);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 33e9857..92d73b6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -172,11 +172,11 @@ static void redirty_tail(struct inode *inode)
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
-		tail = list_entry(wb->b_dirty.next, struct inode, i_list);
+		tail = list_entry(wb->b_dirty.next, struct inode, i_wb_list);
 		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_list, &wb->b_dirty);
+	list_move(&inode->i_wb_list, &wb->b_dirty);
 }
 
 /*
@@ -186,7 +186,7 @@ static void requeue_io(struct inode *inode)
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 
-	list_move(&inode->i_list, &wb->b_more_io);
+	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -227,14 +227,15 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 	int do_sb_sort = 0;
 
 	while (!list_empty(delaying_queue)) {
-		inode = list_entry(delaying_queue->prev, struct inode, i_list);
+		inode = list_entry(delaying_queue->prev,
+						struct inode, i_wb_list);
 		if (older_than_this &&
 		    inode_dirtied_after(inode, *older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
-		list_move(&inode->i_list, &tmp);
+		list_move(&inode->i_wb_list, &tmp);
 	}
 
 	/* just one sb in list, splice to dispatch_queue and we're done */
@@ -245,12 +246,12 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 
 	/* Move inodes from one superblock together */
 	while (!list_empty(&tmp)) {
-		inode = list_entry(tmp.prev, struct inode, i_list);
+		inode = list_entry(tmp.prev, struct inode, i_wb_list);
 		sb = inode->i_sb;
 		list_for_each_prev_safe(pos, node, &tmp) {
-			inode = list_entry(pos, struct inode, i_list);
+			inode = list_entry(pos, struct inode, i_wb_list);
 			if (inode->i_sb == sb)
-				list_move(&inode->i_list, dispatch_queue);
+				list_move(&inode->i_wb_list, dispatch_queue);
 		}
 	}
 }
@@ -415,7 +416,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * does not move dirty inodes to the LRU and dirty
 			 * inodes are removed from the LRU during scanning.
 			 */
-			list_del_init(&inode->i_list);
+			list_del_init(&inode->i_wb_list);
 			if (!atomic_read(&inode->i_count))
 				inode_lru_list_add(inode);
 		}
@@ -466,7 +467,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 	while (!list_empty(&wb->b_io)) {
 		long pages_skipped;
 		struct inode *inode = list_entry(wb->b_io.prev,
-						 struct inode, i_list);
+						 struct inode, i_wb_list);
 
 		if (inode->i_sb != sb) {
 			if (only_this_sb) {
@@ -537,7 +538,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = list_entry(wb->b_io.prev,
-						 struct inode, i_list);
+						 struct inode, i_wb_list);
 		struct super_block *sb = inode->i_sb;
 
 		if (!pin_sb_for_writeback(sb)) {
@@ -676,7 +677,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		spin_lock(&inode_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = list_entry(wb->b_more_io.prev,
-						struct inode, i_list);
+						struct inode, i_wb_list);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			inode_wait_for_writeback(inode);
 		}
@@ -990,7 +991,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 			}
 
 			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list, &bdi->wb.b_dirty);
+			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
 		}
 	}
 out:
diff --git a/fs/inode.c b/fs/inode.c
index f47ec71..e88d582 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -72,7 +72,7 @@ static unsigned int i_hash_shift __read_mostly;
  * allowing for low-overhead inode sync() operations.
  */
 
-LIST_HEAD(inode_unused);
+static LIST_HEAD(inode_lru);
 static struct hlist_head *inode_hashtable __read_mostly;
 
 /*
@@ -272,6 +272,7 @@ EXPORT_SYMBOL(__destroy_inode);
 
 void destroy_inode(struct inode *inode)
 {
+	BUG_ON(!list_empty(&inode->i_lru));
 	__destroy_inode(inode);
 	if (inode->i_sb->s_op->destroy_inode)
 		inode->i_sb->s_op->destroy_inode(inode);
@@ -290,7 +291,8 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_dentry);
 	INIT_LIST_HEAD(&inode->i_devices);
-	INIT_LIST_HEAD(&inode->i_list);
+	INIT_LIST_HEAD(&inode->i_wb_list);
+	INIT_LIST_HEAD(&inode->i_lru);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
@@ -322,14 +324,16 @@ void __iget(struct inode *inode)
 
 void inode_lru_list_add(struct inode *inode)
 {
-	list_add(&inode->i_list, &inode_unused);
-	percpu_counter_inc(&nr_inodes_unused);
+	if (list_empty(&inode->i_lru)) {
+		list_add(&inode->i_lru, &inode_lru);
+		percpu_counter_inc(&nr_inodes_unused);
+	}
 }
 
 void inode_lru_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_list)) {
-		list_del_init(&inode->i_list);
+	if (!list_empty(&inode->i_lru)) {
+		list_del_init(&inode->i_lru);
 		percpu_counter_dec(&nr_inodes_unused);
 	}
 }
@@ -375,8 +379,8 @@ static void dispose_list(struct list_head *head)
 	while (!list_empty(head)) {
 		struct inode *inode;
 
-		inode = list_first_entry(head, struct inode, i_list);
-		list_del_init(&inode->i_list);
+		inode = list_first_entry(head, struct inode, i_lru);
+		list_del_init(&inode->i_lru);
 
 		evict(inode);
 
@@ -421,7 +425,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 		if (!atomic_read(&inode->i_count)) {
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
-			list_move(&inode->i_list, dispose);
+			list_move(&inode->i_lru, dispose);
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
@@ -483,10 +487,10 @@ static void prune_icache(int nr_to_scan)
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
 
-		if (list_empty(&inode_unused))
+		if (list_empty(&inode_lru))
 			break;
 
-		inode = list_entry(inode_unused.prev, struct inode, i_list);
+		inode = list_entry(inode_lru.prev, struct inode, i_lru);
 
 		/*
 		 * Referenced or dirty inodes are still in use. Give them
@@ -494,14 +498,14 @@ static void prune_icache(int nr_to_scan)
 		 */
 		if (atomic_read(&inode->i_count) ||
 		    (inode->i_state & ~I_REFERENCED)) {
-			list_del_init(&inode->i_list);
+			list_del_init(&inode->i_lru);
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
 
 		/* recently referenced inodes get one more pass */
 		if (inode->i_state & I_REFERENCED) {
-			list_move(&inode->i_list, &inode_unused);
+			list_move(&inode->i_lru, &inode_lru);
 			inode->i_state &= ~I_REFERENCED;
 			continue;
 		}
@@ -526,7 +530,8 @@ static void prune_icache(int nr_to_scan)
 			spin_lock(&inode_lock);
 			continue;
 		}
-		list_move(&inode->i_list, &freeable);
+		list_move(&inode->i_lru, &freeable);
+		list_del_init(&inode->i_wb_list);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
 		percpu_counter_dec(&nr_inodes_unused);
@@ -1257,10 +1262,8 @@ static void iput_final(struct inode *inode)
 	if (!drop) {
 		if (sb->s_flags & MS_ACTIVE) {
 			inode->i_state |= I_REFERENCED;
-			if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
-				list_del_init(&inode->i_list);
+			if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 				inode_lru_list_add(inode);
-			}
 			spin_unlock(&inode_lock);
 			return;
 		}
@@ -1273,6 +1276,7 @@ static void iput_final(struct inode *inode)
 		inode->i_state &= ~I_WILL_FREE;
 		hlist_del_init(&inode->i_hash);
 	}
+	list_del_init(&inode->i_wb_list);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index d01aff4..62756b4 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -504,7 +504,8 @@ nilfs_mdt_new_common(struct the_nilfs *nilfs, struct super_block *sb,
 #endif
 		inode->dirtied_when = 0;
 
-		INIT_LIST_HEAD(&inode->i_list);
+		INIT_LIST_HEAD(&inode->i_wb_list);
+		INIT_LIST_HEAD(&inode->i_lru);
 		INIT_LIST_HEAD(&inode->i_sb_list);
 		inode->i_state = 0;
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index af1d516..90d2b47 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -725,7 +725,8 @@ struct posix_acl;
 
 struct inode {
 	struct hlist_node	i_hash;
-	struct list_head	i_list;		/* backing dev IO list */
+	struct list_head	i_wb_list;	/* backing dev IO list */
+	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
 	struct list_head	i_dentry;
 	unsigned long		i_ino;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index f956b66..242b6f8 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -10,7 +10,6 @@
 struct backing_dev_info;
 
 extern spinlock_t inode_lock;
-extern struct list_head inode_unused;
 
 /*
  * fs/fs-writeback.c
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 65d4204..15d5097 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -74,11 +74,11 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
 	spin_lock(&inode_lock);
-	list_for_each_entry(inode, &wb->b_dirty, i_list)
+	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
-	list_for_each_entry(inode, &wb->b_io, i_list)
+	list_for_each_entry(inode, &wb->b_io, i_wb_list)
 		nr_io++;
-	list_for_each_entry(inode, &wb->b_more_io, i_list)
+	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
 	spin_unlock(&inode_lock);
 
-- 
1.7.1

* [PATCH 06/21] fs: Clean up inode reference counting
From: Dave Chinner @ 2010-10-21  0:49 UTC
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Lots of filesystem code open codes the act of taking a reference to
an inode. Factor the open-coded lock, increment and unlock sequence
into a single function, iref(). This removes most direct external
references to the inode reference count.

Originally based on a patch from Nick Piggin.
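
To illustrate the conversion (a sketch only, not part of the patch;
"examplefs" is hypothetical), a typical ->link() method that used to
bump the counter directly now calls iref() instead:

	static int examplefs_link(struct dentry *old_dentry, struct inode *dir,
				  struct dentry *dentry)
	{
		struct inode *inode = old_dentry->d_inode;

		inode->i_ctime = CURRENT_TIME;
		inc_nlink(inode);
		/* was: atomic_inc(&inode->i_count); */
		iref(inode);	/* the dentry already holds a reference */
		d_instantiate(dentry, inode);
		return 0;
	}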

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/9p/vfs_inode.c           |    5 +++--
 fs/affs/inode.c             |    2 +-
 fs/afs/dir.c                |    2 +-
 fs/anon_inodes.c            |    7 +------
 fs/bfs/dir.c                |    2 +-
 fs/block_dev.c              |   13 ++++++-------
 fs/btrfs/inode.c            |    2 +-
 fs/coda/dir.c               |    2 +-
 fs/drop_caches.c            |    2 +-
 fs/exofs/inode.c            |    2 +-
 fs/exofs/namei.c            |    2 +-
 fs/ext2/namei.c             |    2 +-
 fs/ext3/namei.c             |    2 +-
 fs/ext4/namei.c             |    2 +-
 fs/fs-writeback.c           |    7 +++----
 fs/gfs2/ops_inode.c         |    2 +-
 fs/hfsplus/dir.c            |    2 +-
 fs/inode.c                  |   34 ++++++++++++++++++++++------------
 fs/jffs2/dir.c              |    4 ++--
 fs/jfs/jfs_txnmgr.c         |    2 +-
 fs/jfs/namei.c              |    2 +-
 fs/libfs.c                  |    2 +-
 fs/logfs/dir.c              |    2 +-
 fs/minix/namei.c            |    2 +-
 fs/namei.c                  |    2 +-
 fs/nfs/dir.c                |    2 +-
 fs/nfs/getroot.c            |    2 +-
 fs/nfs/write.c              |    2 +-
 fs/nilfs2/namei.c           |    2 +-
 fs/notify/inode_mark.c      |    8 ++++----
 fs/ntfs/super.c             |    4 ++--
 fs/ocfs2/namei.c            |    2 +-
 fs/quota/dquot.c            |    2 +-
 fs/reiserfs/namei.c         |    2 +-
 fs/sysv/namei.c             |    2 +-
 fs/ubifs/dir.c              |    2 +-
 fs/udf/namei.c              |    2 +-
 fs/ufs/namei.c              |    2 +-
 fs/xfs/linux-2.6/xfs_iops.c |    2 +-
 fs/xfs/xfs_inode.h          |    2 +-
 include/linux/fs.h          |    2 +-
 ipc/mqueue.c                |    2 +-
 kernel/futex.c              |    2 +-
 mm/shmem.c                  |    2 +-
 net/socket.c                |    2 +-
 45 files changed, 80 insertions(+), 76 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 9e670d5..1f76624 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1789,9 +1789,10 @@ v9fs_vfs_link_dotl(struct dentry *old_dentry, struct inode *dir,
 		kfree(st);
 	} else {
 		/* Caching disabled. No need to get up-to-date stat info.
-		 * This dentry will be released immediately. So, just i_count++
+		 * This dentry will be released immediately. So, just take
+		 * a reference.
 		 */
-		atomic_inc(&old_dentry->d_inode->i_count);
+		iref(old_dentry->d_inode);
 	}
 
 	dentry->d_op = old_dentry->d_op;
diff --git a/fs/affs/inode.c b/fs/affs/inode.c
index 3a0fdec..2100852 100644
--- a/fs/affs/inode.c
+++ b/fs/affs/inode.c
@@ -388,7 +388,7 @@ affs_add_entry(struct inode *dir, struct inode *inode, struct dentry *dentry, s3
 		affs_adjust_checksum(inode_bh, block - be32_to_cpu(chain));
 		mark_buffer_dirty_inode(inode_bh, inode);
 		inode->i_nlink = 2;
-		atomic_inc(&inode->i_count);
+		iref(inode);
 	}
 	affs_fix_checksum(sb, bh);
 	mark_buffer_dirty_inode(bh, inode);
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 0d38c09..87d8c03 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -1045,7 +1045,7 @@ static int afs_link(struct dentry *from, struct inode *dir,
 	if (ret < 0)
 		goto link_error;
 
-	atomic_inc(&vnode->vfs_inode.i_count);
+	iref(&vnode->vfs_inode);
 	d_instantiate(dentry, &vnode->vfs_inode);
 	key_put(key);
 	_leave(" = 0");
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index e4b75d6..451be78 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -109,12 +109,7 @@ struct file *anon_inode_getfile(const char *name,
 		goto err_module;
 
 	path.mnt = mntget(anon_inode_mnt);
-	/*
-	 * We know the anon_inode inode count is always greater than zero,
-	 * so we can avoid doing an igrab() and we can use an open-coded
-	 * atomic_inc().
-	 */
-	atomic_inc(&anon_inode_inode->i_count);
+	iref(anon_inode_inode);
 
 	path.dentry->d_op = &anon_inodefs_dentry_operations;
 	d_instantiate(path.dentry, anon_inode_inode);
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index d967e05..6e93a37 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -176,7 +176,7 @@ static int bfs_link(struct dentry *old, struct inode *dir,
 	inc_nlink(inode);
 	inode->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(new, inode);
 	mutex_unlock(&info->bfs_lock);
 	return 0;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 63b1c4c..11ad53d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -565,7 +565,7 @@ EXPORT_SYMBOL(bdget);
  */
 struct block_device *bdgrab(struct block_device *bdev)
 {
-	atomic_inc(&bdev->bd_inode->i_count);
+	iref(bdev->bd_inode);
 	return bdev;
 }
 
@@ -595,7 +595,7 @@ static struct block_device *bd_acquire(struct inode *inode)
 	spin_lock(&bdev_lock);
 	bdev = inode->i_bdev;
 	if (bdev) {
-		atomic_inc(&bdev->bd_inode->i_count);
+		bdgrab(bdev);
 		spin_unlock(&bdev_lock);
 		return bdev;
 	}
@@ -606,12 +606,11 @@ static struct block_device *bd_acquire(struct inode *inode)
 		spin_lock(&bdev_lock);
 		if (!inode->i_bdev) {
 			/*
-			 * We take an additional bd_inode->i_count for inode,
-			 * and it's released in clear_inode() of inode.
-			 * So, we can access it via ->i_mapping always
-			 * without igrab().
+			 * We take an additional bdev reference here so
+			 * we can access it via ->i_mapping always
+			 * without first needing to grab a reference.
 			 */
-			atomic_inc(&bdev->bd_inode->i_count);
+			bdgrab(bdev);
 			inode->i_bdev = bdev;
 			inode->i_mapping = bdev->bd_inode->i_mapping;
 			list_add(&inode->i_devices, &bdev->bd_inodes);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c038644..80e28bf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4758,7 +4758,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
 	}
 
 	btrfs_set_trans_block_group(trans, dir);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = btrfs_add_nondir(trans, dentry, inode, 1, index);
 
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index ccd98b0..ac8b913 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -303,7 +303,7 @@ static int coda_link(struct dentry *source_de, struct inode *dir_inode,
 	}
 
 	coda_dir_update_mtime(dir_inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(de, inode);
 	inc_nlink(inode);
 
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 2195c21..c2721fa 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -22,7 +22,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index eb7368e..b631ff3 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1154,7 +1154,7 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)
 	/* increment the refcount so that the inode will still be around when we
 	 * reach the callback
 	 */
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	ios->done = create_done;
 	ios->private = inode;
diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
index b7dd0c2..f2a30a0 100644
--- a/fs/exofs/namei.c
+++ b/fs/exofs/namei.c
@@ -153,7 +153,7 @@ static int exofs_link(struct dentry *old_dentry, struct inode *dir,
 
 	inode->i_ctime = CURRENT_TIME;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	return exofs_add_nondir(dentry, inode);
 }
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 71efb0e..b15435f 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -206,7 +206,7 @@ static int ext2_link (struct dentry * old_dentry, struct inode * dir,
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = ext2_add_link(dentry, inode);
 	if (!err) {
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 2b35ddb..6c7a5d6 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -2260,7 +2260,7 @@ retry:
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = ext3_add_entry(handle, dentry, inode);
 	if (!err) {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 314c0d3..a406a85 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2312,7 +2312,7 @@ retry:
 
 	inode->i_ctime = ext4_current_time(inode);
 	ext4_inc_count(handle, inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = ext4_add_entry(handle, dentry, inode);
 	if (!err) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 92d73b6..595dfc6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -298,8 +298,7 @@ static void inode_wait_for_writeback(struct inode *inode)
 
 /*
  * Write out an inode's dirty pages.  Called under inode_lock.  Either the
- * caller has ref on the inode (either via __iget or via syscall against an fd)
- * or the inode has I_WILL_FREE set (via generic_forget_inode)
+ * caller has a reference on the inode or the inode has I_WILL_FREE set.
  *
  * If `wait' is set, wait on the writeout.
  *
@@ -500,7 +499,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			return 1;
 
 		BUG_ON(inode->i_state & I_FREEING);
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 		pages_skipped = wbc->pages_skipped;
 		writeback_single_inode(inode, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
@@ -1046,7 +1045,7 @@ static void wait_sb_inodes(struct super_block *sb)
 		mapping = inode->i_mapping;
 		if (mapping->nrpages == 0)
 			continue;
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 		spin_unlock(&inode_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have
diff --git a/fs/gfs2/ops_inode.c b/fs/gfs2/ops_inode.c
index 1009be2..508407d 100644
--- a/fs/gfs2/ops_inode.c
+++ b/fs/gfs2/ops_inode.c
@@ -253,7 +253,7 @@ out_parent:
 	gfs2_holder_uninit(ghs);
 	gfs2_holder_uninit(ghs + 1);
 	if (!error) {
-		atomic_inc(&inode->i_count);
+		iref(inode);
 		d_instantiate(dentry, inode);
 		mark_inode_dirty(inode);
 	}
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 764fd1b..e2ce54d 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -301,7 +301,7 @@ static int hfsplus_link(struct dentry *src_dentry, struct inode *dst_dir,
 
 	inc_nlink(inode);
 	hfsplus_instantiate(dst_dentry, inode, cnid);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	inode->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
 	HFSPLUS_SB(sb).file_count++;
diff --git a/fs/inode.c b/fs/inode.c
index e88d582..c53d1b3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -314,13 +314,23 @@ static void init_once(void *foo)
 	inode_init_once(inode);
 }
 
-/*
- * inode_lock must be held
+/**
+ * iref - increment the reference count on an inode
+ * @inode:	inode to take a reference on
+ *
+ * iref() should be called to take an extra reference to an inode. The
+ * caller must already hold a reference (e.g. one obtained via igrab()),
+ * as iref() does not check for the inode being freed and hence cannot
+ * be used to obtain an initial reference to the inode.
  */
-void __iget(struct inode *inode)
+void iref(struct inode *inode)
 {
+	WARN_ON(atomic_read(&inode->i_count) < 1);
+	spin_lock(&inode_lock);
 	atomic_inc(&inode->i_count);
+	spin_unlock(&inode_lock);
 }
+EXPORT_SYMBOL_GPL(iref);
 
 void inode_lru_list_add(struct inode *inode)
 {
@@ -510,7 +520,7 @@ static void prune_icache(int nr_to_scan)
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
-			__iget(inode);
+			atomic_inc(&inode->i_count);
 			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
@@ -578,7 +588,7 @@ static struct shrinker icache_shrinker = {
 static void __wait_on_freeing_inode(struct inode *inode);
 /*
  * Called with the inode lock held.
- * NOTE: we are not increasing the inode-refcount, you must call __iget()
+ * NOTE: we are not increasing the inode-refcount, you must take a reference
  * by hand after calling find_inode now! This simplifies iunique and won't
  * add any additional branch in the common code.
  */
@@ -782,7 +792,7 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		__iget(old);
+		atomic_inc(&old->i_count);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -829,7 +839,7 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		__iget(old);
+		atomic_inc(&old->i_count);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -882,7 +892,7 @@ struct inode *igrab(struct inode *inode)
 {
 	spin_lock(&inode_lock);
 	if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 	else
 		/*
 		 * Handle the case where s_op->clear_inode has not been
@@ -923,7 +933,7 @@ static struct inode *ifind(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode(sb, head, test, data);
 	if (inode) {
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 		spin_unlock(&inode_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
@@ -956,7 +966,7 @@ static struct inode *ifind_fast(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode_fast(sb, head, ino);
 	if (inode) {
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 		spin_unlock(&inode_lock);
 		wait_on_inode(inode);
 		return inode;
@@ -1139,7 +1149,7 @@ int insert_inode_locked(struct inode *inode)
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		__iget(old);
+		atomic_inc(&old->i_count);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1178,7 +1188,7 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		__iget(old);
+		atomic_inc(&old->i_count);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index ed78a3c..797a034 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -289,7 +289,7 @@ static int jffs2_link (struct dentry *old_dentry, struct inode *dir_i, struct de
 		mutex_unlock(&f->sem);
 		d_instantiate(dentry, old_dentry->d_inode);
 		dir_i->i_mtime = dir_i->i_ctime = ITIME(now);
-		atomic_inc(&old_dentry->d_inode->i_count);
+		iref(old_dentry->d_inode);
 	}
 	return ret;
 }
@@ -864,7 +864,7 @@ static int jffs2_rename (struct inode *old_dir_i, struct dentry *old_dentry,
 		printk(KERN_NOTICE "jffs2_rename(): Link succeeded, unlink failed (err %d). You now have a hard link\n", ret);
 		/* Might as well let the VFS know */
 		d_instantiate(new_dentry, old_dentry->d_inode);
-		atomic_inc(&old_dentry->d_inode->i_count);
+		iref(old_dentry->d_inode);
 		new_dir_i->i_mtime = new_dir_i->i_ctime = ITIME(now);
 		return ret;
 	}
diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c
index d945ea7..3e6dd08 100644
--- a/fs/jfs/jfs_txnmgr.c
+++ b/fs/jfs/jfs_txnmgr.c
@@ -1279,7 +1279,7 @@ int txCommit(tid_t tid,		/* transaction identifier */
 	 * lazy commit thread finishes processing
 	 */
 	if (tblk->xflag & COMMIT_DELETE) {
-		atomic_inc(&tblk->u.ip->i_count);
+		iref(tblk->u.ip);
 		/*
 		 * Avoid a rare deadlock
 		 *
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index a9cf8e8..3d3566e 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -839,7 +839,7 @@ static int jfs_link(struct dentry *old_dentry,
 	ip->i_ctime = CURRENT_TIME;
 	dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	mark_inode_dirty(dir);
-	atomic_inc(&ip->i_count);
+	iref(ip);
 
 	iplist[0] = ip;
 	iplist[1] = dir;
diff --git a/fs/libfs.c b/fs/libfs.c
index 0a9da95..f190d73 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -255,7 +255,7 @@ int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *den
 
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	dget(dentry);
 	d_instantiate(dentry, inode);
 	return 0;
diff --git a/fs/logfs/dir.c b/fs/logfs/dir.c
index 9777eb5..8522edc 100644
--- a/fs/logfs/dir.c
+++ b/fs/logfs/dir.c
@@ -569,7 +569,7 @@ static int logfs_link(struct dentry *old_dentry, struct inode *dir,
 		return -EMLINK;
 
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	inode->i_nlink++;
 	mark_inode_dirty_sync(inode);
 
diff --git a/fs/minix/namei.c b/fs/minix/namei.c
index f3f3578..7563a82 100644
--- a/fs/minix/namei.c
+++ b/fs/minix/namei.c
@@ -101,7 +101,7 @@ static int minix_link(struct dentry * old_dentry, struct inode * dir,
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	return add_nondir(dentry, inode);
 }
 
diff --git a/fs/namei.c b/fs/namei.c
index 24896e8..5fb93f3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2291,7 +2291,7 @@ static long do_unlinkat(int dfd, const char __user *pathname)
 			goto slashes;
 		inode = dentry->d_inode;
 		if (inode)
-			atomic_inc(&inode->i_count);
+			iref(inode);
 		error = mnt_want_write(nd.path.mnt);
 		if (error)
 			goto exit2;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index e257172..5482ede 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1580,7 +1580,7 @@ nfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
 	d_drop(dentry);
 	error = NFS_PROTO(dir)->link(inode, dir, &dentry->d_name);
 	if (error == 0) {
-		atomic_inc(&inode->i_count);
+		iref(inode);
 		d_add(dentry, inode);
 	}
 	return error;
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index a70e446..5aaa2be 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -55,7 +55,7 @@ static int nfs_superblock_set_dummy_root(struct super_block *sb, struct inode *i
 			return -ENOMEM;
 		}
 		/* Circumvent igrab(): we know the inode is not being freed */
-		atomic_inc(&inode->i_count);
+		iref(inode);
 		/*
 		 * Ensure that this dentry is invisible to d_find_alias().
 		 * Otherwise, it may be spliced into the tree by
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 874972d..d1c2f08 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -390,7 +390,7 @@ static int nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
 	error = radix_tree_insert(&nfsi->nfs_page_tree, req->wb_index, req);
 	BUG_ON(error);
 	if (!nfsi->npages) {
-		igrab(inode);
+		iref(inode);
 		if (nfs_have_delegation(inode, FMODE_WRITE))
 			nfsi->change_attr++;
 	}
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index ad6ed2c..fbd3348 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -219,7 +219,7 @@ static int nilfs_link(struct dentry *old_dentry, struct inode *dir,
 
 	inode->i_ctime = CURRENT_TIME;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = nilfs_add_nondir(dentry, inode);
 	if (!err)
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 33297c0..fa7f3b8 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -244,7 +244,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		struct inode *need_iput_tmp;
 
 		/*
-		 * We cannot __iget() an inode in state I_FREEING,
+		 * We cannot iref() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
 		 * the inode cannot have any associated watches.
 		 */
@@ -253,7 +253,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 		/*
 		 * If i_count is zero, the inode cannot have any watches and
-		 * doing an __iget/iput with MS_ACTIVE clear would actually
+		 * doing an iref/iput with MS_ACTIVE clear would actually
 		 * evict all inodes with zero i_count from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
@@ -265,7 +265,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 		/* In case fsnotify_inode_delete() drops a reference. */
 		if (inode != need_iput_tmp)
-			__iget(inode);
+			atomic_inc(&inode->i_count);
 		else
 			need_iput_tmp = NULL;
 
@@ -273,7 +273,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		if ((&next_i->i_sb_list != list) &&
 		    atomic_read(&next_i->i_count) &&
 		    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
-			__iget(next_i);
+			atomic_inc(&next_i->i_count);
 			need_iput = next_i;
 		}
 
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index 5128061..52b48e3 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -2929,8 +2929,8 @@ static int ntfs_fill_super(struct super_block *sb, void *opt, const int silent)
 		goto unl_upcase_iput_tmp_ino_err_out_now;
 	}
 	if ((sb->s_root = d_alloc_root(vol->root_ino))) {
-		/* We increment i_count simulating an ntfs_iget(). */
-		atomic_inc(&vol->root_ino->i_count);
+		/* Simulate an ntfs_iget() call */
+		iref(vol->root_ino);
 		ntfs_debug("Exiting, status successful.");
 		/* Release the default upcase if it has no users. */
 		mutex_lock(&ntfs_lock);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index a00dda2..0e002f6 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -741,7 +741,7 @@ static int ocfs2_link(struct dentry *old_dentry,
 		goto out_commit;
 	}
 
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	dentry->d_op = &ocfs2_dentry_ops;
 	d_instantiate(dentry, inode);
 
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index aad1316..38d4304 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -909,7 +909,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		if (!dqinit_needed(inode, type))
 			continue;
 
-		__iget(inode);
+		atomic_inc(&inode->i_count);
 		spin_unlock(&inode_lock);
 
 		iput(old_inode);
diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c
index ee78d4a..f19bb3d 100644
--- a/fs/reiserfs/namei.c
+++ b/fs/reiserfs/namei.c
@@ -1156,7 +1156,7 @@ static int reiserfs_link(struct dentry *old_dentry, struct inode *dir,
 	inode->i_ctime = CURRENT_TIME_SEC;
 	reiserfs_update_sd(&th, inode);
 
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(dentry, inode);
 	retval = journal_end(&th, dir->i_sb, jbegin_count);
 	reiserfs_write_unlock(dir->i_sb);
diff --git a/fs/sysv/namei.c b/fs/sysv/namei.c
index 33e047b..765974f 100644
--- a/fs/sysv/namei.c
+++ b/fs/sysv/namei.c
@@ -126,7 +126,7 @@ static int sysv_link(struct dentry * old_dentry, struct inode * dir,
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	return add_nondir(dentry, inode);
 }
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 87ebcce..9e8281f 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -550,7 +550,7 @@ static int ubifs_link(struct dentry *old_dentry, struct inode *dir,
 
 	lock_2_inodes(dir, inode);
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	inode->i_ctime = ubifs_current_time(inode);
 	dir->i_size += sz_change;
 	dir_ui->ui_size = dir->i_size;
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index bf5fc67..f6e232a 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -1101,7 +1101,7 @@ static int udf_link(struct dentry *old_dentry, struct inode *dir,
 	inc_nlink(inode);
 	inode->i_ctime = current_fs_time(inode->i_sb);
 	mark_inode_dirty(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(dentry, inode);
 	unlock_kernel();
 
diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
index b056f02..2a598eb 100644
--- a/fs/ufs/namei.c
+++ b/fs/ufs/namei.c
@@ -180,7 +180,7 @@ static int ufs_link (struct dentry * old_dentry, struct inode * dir,
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	error = ufs_add_nondir(dentry, inode);
 	unlock_kernel();
diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index b1fc2a6..b7ec465 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -352,7 +352,7 @@ xfs_vn_link(
 	if (unlikely(error))
 		return -error;
 
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(dentry, inode);
 	return 0;
 }
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0898c54..cbb4791 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -482,7 +482,7 @@ void		xfs_mark_inode_dirty_sync(xfs_inode_t *);
 #define IHOLD(ip) \
 do { \
 	ASSERT(atomic_read(&VFS_I(ip)->i_count) > 0) ; \
-	atomic_inc(&(VFS_I(ip)->i_count)); \
+	iref(VFS_I(ip)); \
 	trace_xfs_ihold(ip, _THIS_IP_); \
 } while (0)
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 90d2b47..6eb94b0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2184,7 +2184,7 @@ extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struc
 extern int insert_inode_locked(struct inode *);
 extern void unlock_new_inode(struct inode *);
 
-extern void __iget(struct inode * inode);
+extern void iref(struct inode *inode);
 extern void iget_failed(struct inode *);
 extern void end_writeback(struct inode *);
 extern void destroy_inode(struct inode *);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c60e519..d53a2c1 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -769,7 +769,7 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
 
 	inode = dentry->d_inode;
 	if (inode)
-		atomic_inc(&inode->i_count);
+		iref(inode);
 	err = mnt_want_write(ipc_ns->mq_mnt);
 	if (err)
 		goto out_err;
diff --git a/kernel/futex.c b/kernel/futex.c
index 6a3a5fa..3bb418c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -168,7 +168,7 @@ static void get_futex_key_refs(union futex_key *key)
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		atomic_inc(&key->shared.inode->i_count);
+		iref(key->shared.inode);
 		break;
 	case FUT_OFF_MMSHARED:
 		atomic_inc(&key->private.mm->mm_count);
diff --git a/mm/shmem.c b/mm/shmem.c
index 080b09a..7d0bc16 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1903,7 +1903,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
 	dir->i_size += BOGO_DIRENT_SIZE;
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);	/* New dentry reference */
+	iref(inode);
 	dget(dentry);		/* Extra pinning count for the created dentry */
 	d_instantiate(dentry, inode);
 out:
diff --git a/net/socket.c b/net/socket.c
index 2270b94..715ca57 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -377,7 +377,7 @@ static int sock_alloc_file(struct socket *sock, struct file **f, int flags)
 		  &socket_file_ops);
 	if (unlikely(!file)) {
 		/* drop dentry, keep inode */
-		atomic_inc(&path.dentry->d_inode->i_count);
+		iref(path.dentry->d_inode);
 		path_put(&path);
 		put_unused_fd(fd);
 		return -ENFILE;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 07/21] exofs: use iput() for inode reference count decrements
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (5 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 06/21] fs: Clean up inode reference counting Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21  0:49 ` [PATCH 08/21] fs: rework icount to be a locked variable Dave Chinner
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Direct modification of the inode reference count is a no-no. Convert
the exofs decrements to call iput() instead of acting directly on
i_count.
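
For background (a sketch, not part of the patch): a bare decrement
can drop the last reference without running the final-put processing
that iput() provides, so the inode is never evicted or freed:

	/* Wrong: if this drops the last reference, iput_final() never
	 * runs and the inode is leaked. */
	atomic_dec(&inode->i_count);

	/* Right: iput() runs the final-put processing (eviction and
	 * freeing) when the reference count reaches zero. */
	iput(inode);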

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/exofs/inode.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index b631ff3..0fb4d4c 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1101,7 +1101,7 @@ static void create_done(struct exofs_io_state *ios, void *p)
 
 	set_obj_created(oi);
 
-	atomic_dec(&inode->i_count);
+	iput(inode);
 	wake_up(&oi->i_wq);
 }
 
@@ -1161,7 +1161,7 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)
 	ios->cred = oi->i_cred;
 	ret = exofs_sbi_create(ios);
 	if (ret) {
-		atomic_dec(&inode->i_count);
+		iput(inode);
 		exofs_put_io_state(ios);
 		return ERR_PTR(ret);
 	}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 08/21] fs: rework icount to be a locked variable
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (6 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 07/21] exofs: use iput() for inode reference count decrements Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21 19:40   ` Al Viro
  2010-10-21  0:49 ` [PATCH 09/21] fs: Factor inode hash operations into functions Dave Chinner
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

The inode reference count is currently an atomic variable so that it
can be sampled/modified outside the inode_lock. However, the
inode_lock is still needed to synchronise dropping the final
reference with checks against the inode state.

To avoid needing the protection of the inode lock, protect the inode
reference count with the per-inode i_lock and convert it to a normal
variable. To avoid existing out-of-tree code accidentally compiling
against the new method, rename the i_count field to i_ref. This is
relatively straightforward, as only a few external references to
the i_count field remain.

Based on work originally from Nick Piggin.
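
The conversion pattern, restated here as a standalone sketch of what
the hunks below do, replaces the lock-free atomic operation with a
plain integer manipulated under the per-inode lock:

	/* before: lock-free increment of the atomic counter */
	atomic_inc(&inode->i_count);

	/* after: plain increment, serialised by inode->i_lock */
	spin_lock(&inode->i_lock);
	inode->i_ref++;
	spin_unlock(&inode->i_lock);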

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/filesystems/vfs.txt        |   14 +++---
 arch/powerpc/platforms/cell/spufs/file.c |    2 +-
 fs/btrfs/inode.c                         |   14 ++++--
 fs/ceph/mds_client.c                     |    2 +-
 fs/cifs/inode.c                          |    2 +-
 fs/drop_caches.c                         |    4 +-
 fs/ext3/ialloc.c                         |    4 +-
 fs/ext4/ialloc.c                         |    4 +-
 fs/fs-writeback.c                        |   12 +++--
 fs/hpfs/inode.c                          |    2 +-
 fs/inode.c                               |   79 ++++++++++++++++++++++-------
 fs/locks.c                               |    2 +-
 fs/logfs/readwrite.c                     |    2 +-
 fs/nfs/inode.c                           |    4 +-
 fs/nfs/nfs4state.c                       |    2 +-
 fs/nilfs2/mdt.c                          |    2 +-
 fs/notify/inode_mark.c                   |   25 ++++++---
 fs/ntfs/inode.c                          |    6 +-
 fs/ntfs/super.c                          |    2 +-
 fs/quota/dquot.c                         |    4 +-
 fs/reiserfs/stree.c                      |    2 +-
 fs/smbfs/inode.c                         |    2 +-
 fs/ubifs/super.c                         |    2 +-
 fs/udf/inode.c                           |    2 +-
 fs/xfs/linux-2.6/xfs_trace.h             |    2 +-
 fs/xfs/xfs_inode.h                       |    1 -
 include/linux/fs.h                       |    4 +-
 27 files changed, 132 insertions(+), 71 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index ed7e5ef..0dbbbe4 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -347,8 +347,8 @@ otherwise noted.
   lookup: called when the VFS needs to look up an inode in a parent
 	directory. The name to look for is found in the dentry. This
 	method must call d_add() to insert the found inode into the
-	dentry. The "i_count" field in the inode structure should be
-	incremented. If the named inode does not exist a NULL inode
+	dentry. A reference to the inode should be taken via the
+	iref() function.  If the named inode does not exist a NULL inode
 	should be inserted into the dentry (this is called a negative
 	dentry). Returning an error code from this routine must only
 	be done on a real error, otherwise creating inodes with system
@@ -926,11 +926,11 @@ manipulate dentries:
 	d_instantiate()
 
   d_instantiate: add a dentry to the alias hash list for the inode and
-	updates the "d_inode" member. The "i_count" member in the
-	inode structure should be set/incremented. If the inode
-	pointer is NULL, the dentry is called a "negative
-	dentry". This function is commonly called when an inode is
-	created for an existing negative dentry
+	updates the "d_inode" member. A reference to the inode
+	should be taken via the iref() function.  If the inode
+	pointer is NULL, the dentry is called a "negative dentry".
+	This function is commonly called when an inode is created
+	for an existing negative dentry
 
   d_lookup: look up a dentry given its parent and path name component
 	It looks up the child of that given name from the dcache
diff --git a/arch/powerpc/platforms/cell/spufs/file.c b/arch/powerpc/platforms/cell/spufs/file.c
index 1a40da9..03d8ed3 100644
--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -1549,7 +1549,7 @@ static int spufs_mfc_open(struct inode *inode, struct file *file)
 	if (ctx->owner != current->mm)
 		return -EINVAL;
 
-	if (atomic_read(&inode->i_count) != 1)
+	if (inode->i_ref != 1)
 		return -EBUSY;
 
 	mutex_lock(&ctx->mapping_lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 80e28bf..7947bf0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1964,8 +1964,14 @@ void btrfs_add_delayed_iput(struct inode *inode)
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	struct delayed_iput *delayed;
 
-	if (atomic_add_unless(&inode->i_count, -1, 1))
+	/* XXX: filesystems should not play refcount games like this */
+	spin_lock(&inode->i_lock);
+	if (inode->i_ref > 1) {
+		inode->i_ref--;
+		spin_unlock(&inode->i_lock);
 		return;
+	}
+	spin_unlock(&inode->i_lock);
 
 	delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
 	delayed->inode = inode;
@@ -2718,10 +2724,10 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir,
 		return ERR_PTR(-ENOSPC);
 
 	/* check if someone else holds a reference */
-	if (S_ISDIR(inode->i_mode) && atomic_read(&inode->i_count) > 1)
+	if (S_ISDIR(inode->i_mode) && inode->i_ref > 1)
 		return ERR_PTR(-ENOSPC);
 
-	if (atomic_read(&inode->i_count) > 2)
+	if (inode->i_ref > 2)
 		return ERR_PTR(-ENOSPC);
 
 	if (xchg(&root->fs_info->enospc_unlink, 1))
@@ -3939,7 +3945,7 @@ again:
 		inode = igrab(&entry->vfs_inode);
 		if (inode) {
 			spin_unlock(&root->inode_lock);
-			if (atomic_read(&inode->i_count) > 1)
+			if (inode->i_ref > 1)
 				d_prune_aliases(inode);
 			/*
 			 * btrfs_drop_inode will have it removed from
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index fad95f8..1217580 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1102,7 +1102,7 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
 		spin_unlock(&inode->i_lock);
 		d_prune_aliases(inode);
 		dout("trim_caps_cb %p cap %p  pruned, count now %d\n",
-		     inode, cap, atomic_read(&inode->i_count));
+		     inode, cap, inode->i_ref);
 		return 0;
 	}
 
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 53cce8c..f13f2d0 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1641,7 +1641,7 @@ int cifs_revalidate_dentry(struct dentry *dentry)
 	}
 
 	cFYI(1, "Revalidate: %s inode 0x%p count %d dentry: 0x%p d_time %ld "
-		 "jiffies %ld", full_path, inode, inode->i_count.counter,
+		 "jiffies %ld", full_path, inode, inode->i_ref,
 		 dentry, dentry->d_time, jiffies);
 
 	if (CIFS_SB(sb)->tcon->unix_ext)
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index c2721fa..10c8c5a 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -22,7 +22,9 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index 4ab72db..fb20ac7 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -100,9 +100,9 @@ void ext3_free_inode (handle_t *handle, struct inode * inode)
 	struct ext3_sb_info *sbi;
 	int fatal = 0, err;
 
-	if (atomic_read(&inode->i_count) > 1) {
+	if (inode->i_ref > 1) {
 		printk ("ext3_free_inode: inode has count=%d\n",
-					atomic_read(&inode->i_count));
+					inode->i_ref);
 		return;
 	}
 	if (inode->i_nlink) {
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 45853e0..56d0bb0 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -189,9 +189,9 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
 	struct ext4_sb_info *sbi;
 	int fatal = 0, err, count, cleared;
 
-	if (atomic_read(&inode->i_count) > 1) {
+	if (inode->i_ref > 1) {
 		printk(KERN_ERR "ext4_free_inode: inode has count=%d\n",
-		       atomic_read(&inode->i_count));
+		       inode->i_ref);
 		return;
 	}
 	if (inode->i_nlink) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 595dfc6..9832beb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -315,7 +315,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	unsigned dirty;
 	int ret;
 
-	if (!atomic_read(&inode->i_count))
+	if (!inode->i_ref)
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
 		WARN_ON(inode->i_state & I_WILL_FREE);
@@ -416,7 +416,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * inodes are removed from the LRU during scanning.
 			 */
 			list_del_init(&inode->i_wb_list);
-			if (!atomic_read(&inode->i_count))
+			if (!inode->i_ref)
 				inode_lru_list_add(inode);
 		}
 	}
@@ -499,7 +499,9 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			return 1;
 
 		BUG_ON(inode->i_state & I_FREEING);
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
 		pages_skipped = wbc->pages_skipped;
 		writeback_single_inode(inode, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
@@ -1045,7 +1047,9 @@ static void wait_sb_inodes(struct super_block *sb)
 		mapping = inode->i_mapping;
 		if (mapping->nrpages == 0)
 			continue;
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have
diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c
index 56f0da1..67147bf 100644
--- a/fs/hpfs/inode.c
+++ b/fs/hpfs/inode.c
@@ -183,7 +183,7 @@ void hpfs_write_inode(struct inode *i)
 	struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
 	struct inode *parent;
 	if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
-	if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) {
+	if (hpfs_inode->i_rddir_off && !i->i_ref) {
 		if (*hpfs_inode->i_rddir_off) printk("HPFS: write_inode: some position still there\n");
 		kfree(hpfs_inode->i_rddir_off);
 		hpfs_inode->i_rddir_off = NULL;
diff --git a/fs/inode.c b/fs/inode.c
index c53d1b3..77b71ce 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -26,6 +26,15 @@
 #include <linux/posix_acl.h>
 
 /*
+ * Locking rules.
+ *
+ * inode->i_lock is *always* the innermost lock.
+ *
+ * inode->i_lock protects:
+ *   i_ref
+ */
+
+/*
  * This is needed for the following functions:
  *  - inode_has_buffers
  *  - invalidate_inode_buffers
@@ -64,9 +73,9 @@ static unsigned int i_hash_shift __read_mostly;
  * Each inode can be on two separate lists. One is
  * the hash list of the inode, used for lookups. The
  * other linked list is the "type" list:
- *  "in_use" - valid inode, i_count > 0, i_nlink > 0
+ *  "in_use" - valid inode, i_ref > 0, i_nlink > 0
  *  "dirty"  - as "in_use" but also dirty
- *  "unused" - valid inode, i_count = 0
+ *  "unused" - valid inode, i_ref = 0
  *
  * A "dirty" list is maintained for each super block,
  * allowing for low-overhead inode sync() operations.
@@ -164,7 +173,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_sb = sb;
 	inode->i_blkbits = sb->s_blocksize_bits;
 	inode->i_flags = 0;
-	atomic_set(&inode->i_count, 1);
+	inode->i_ref = 1;
 	inode->i_op = &empty_iops;
 	inode->i_fop = &empty_fops;
 	inode->i_nlink = 1;
@@ -325,9 +334,11 @@ static void init_once(void *foo)
  */
 void iref(struct inode *inode)
 {
-	WARN_ON(atomic_read(&inode->i_count) < 1);
+	WARN_ON(inode->i_ref < 1);
 	spin_lock(&inode_lock);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_ref++;
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL_GPL(iref);
@@ -432,13 +443,16 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 		if (inode->i_state & I_NEW)
 			continue;
 		invalidate_inode_buffers(inode);
-		if (!atomic_read(&inode->i_count)) {
+		spin_lock(&inode->i_lock);
+		if (!inode->i_ref) {
+			spin_unlock(&inode->i_lock);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
 			list_move(&inode->i_lru, dispose);
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
+		spin_unlock(&inode->i_lock);
 		busy = 1;
 	}
 	return busy;
@@ -506,8 +520,9 @@ static void prune_icache(int nr_to_scan)
 		 * Referenced or dirty inodes are still in use. Give them
 		 * another pass through the LRU as we cannot reclaim them now.
 		 */
-		if (atomic_read(&inode->i_count) ||
-		    (inode->i_state & ~I_REFERENCED)) {
+		spin_lock(&inode->i_lock);
+		if (inode->i_ref || (inode->i_state & ~I_REFERENCED)) {
+			spin_unlock(&inode->i_lock);
 			list_del_init(&inode->i_lru);
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
@@ -515,12 +530,14 @@ static void prune_icache(int nr_to_scan)
 
 		/* recently referenced inodes get one more pass */
 		if (inode->i_state & I_REFERENCED) {
+			spin_unlock(&inode->i_lock);
 			list_move(&inode->i_lru, &inode_lru);
 			inode->i_state &= ~I_REFERENCED;
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
-			atomic_inc(&inode->i_count);
+			inode->i_ref++;
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
@@ -540,6 +557,7 @@ static void prune_icache(int nr_to_scan)
 			spin_lock(&inode_lock);
 			continue;
 		}
+		spin_unlock(&inode->i_lock);
 		list_move(&inode->i_lru, &freeable);
 		list_del_init(&inode->i_wb_list);
 		WARN_ON(inode->i_state & I_NEW);
@@ -792,7 +810,9 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		atomic_inc(&old->i_count);
+		spin_lock(&old->i_lock);
+		old->i_ref++;
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -839,7 +859,9 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		atomic_inc(&old->i_count);
+		spin_lock(&old->i_lock);
+		old->i_ref++;
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -891,15 +913,19 @@ EXPORT_SYMBOL(iunique);
 struct inode *igrab(struct inode *inode)
 {
 	spin_lock(&inode_lock);
-	if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
-		atomic_inc(&inode->i_count);
-	else
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
+	} else {
+		spin_unlock(&inode->i_lock);
 		/*
 		 * Handle the case where s_op->clear_inode has not been
 		 * called yet, and somebody is calling igrab
 		 * while the inode is getting freed.
 		 */
 		inode = NULL;
+	}
 	spin_unlock(&inode_lock);
 	return inode;
 }
@@ -933,7 +959,9 @@ static struct inode *ifind(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode(sb, head, test, data);
 	if (inode) {
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
@@ -966,7 +994,9 @@ static struct inode *ifind_fast(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode_fast(sb, head, ino);
 	if (inode) {
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(inode);
 		return inode;
@@ -1149,7 +1179,9 @@ int insert_inode_locked(struct inode *inode)
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		atomic_inc(&old->i_count);
+		spin_lock(&old->i_lock);
+		old->i_ref++;
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1188,7 +1220,9 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		atomic_inc(&old->i_count);
+		spin_lock(&old->i_lock);
+		old->i_ref++;
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1322,8 +1356,15 @@ void iput(struct inode *inode)
 	if (inode) {
 		BUG_ON(inode->i_state & I_CLEAR);
 
-		if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
+		spin_lock(&inode_lock);
+		spin_lock(&inode->i_lock);
+		if (--inode->i_ref == 0) {
+			spin_unlock(&inode->i_lock);
 			iput_final(inode);
+			return;
+		}
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&inode_lock);
 	}
 }
 EXPORT_SYMBOL(iput);
diff --git a/fs/locks.c b/fs/locks.c
index ab24d49..4dec81a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1376,7 +1376,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
 			goto out;
 		if ((arg == F_WRLCK)
 		    && ((atomic_read(&dentry->d_count) > 1)
-			|| (atomic_read(&inode->i_count) > 1)))
+			|| inode->i_ref > 1))
 			goto out;
 	}
 
diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
index 6127baf..1b26a8d 100644
--- a/fs/logfs/readwrite.c
+++ b/fs/logfs/readwrite.c
@@ -1002,7 +1002,7 @@ static int __logfs_is_valid_block(struct inode *inode, u64 bix, u64 ofs)
 {
 	struct logfs_inode *li = logfs_inode(inode);
 
-	if ((inode->i_nlink == 0) && atomic_read(&inode->i_count) == 1)
+	if ((inode->i_nlink == 0) && inode->i_ref == 1)
 		return 0;
 
 	if (bix < I0_BLOCKS)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 7d2d6c7..32a9c69 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -384,7 +384,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
 	dprintk("NFS: nfs_fhget(%s/%Ld ct=%d)\n",
 		inode->i_sb->s_id,
 		(long long)NFS_FILEID(inode),
-		atomic_read(&inode->i_count));
+		inode->i_ref);
 
 out:
 	return inode;
@@ -1190,7 +1190,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 
 	dfprintk(VFS, "NFS: %s(%s/%ld ct=%d info=0x%x)\n",
 			__func__, inode->i_sb->s_id, inode->i_ino,
-			atomic_read(&inode->i_count), fattr->valid);
+			inode->i_ref, fattr->valid);
 
 	if ((fattr->valid & NFS_ATTR_FATTR_FILEID) && nfsi->fileid != fattr->fileid)
 		goto out_fileid;
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 3e2f19b..d7fc5d0 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -506,8 +506,8 @@ nfs4_get_open_state(struct inode *inode, struct nfs4_state_owner *owner)
 		state->owner = owner;
 		atomic_inc(&owner->so_count);
 		list_add(&state->inode_states, &nfsi->open_states);
-		state->inode = igrab(inode);
 		spin_unlock(&inode->i_lock);
+		state->inode = igrab(inode);
 		/* Note: The reclaim code dictates that we add stateless
 		 * and read-only stateids to the end of the list */
 		list_add_tail(&state->open_states, &owner->so_states);
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index 62756b4..939459d 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -480,7 +480,7 @@ nilfs_mdt_new_common(struct the_nilfs *nilfs, struct super_block *sb,
 		inode->i_sb = sb; /* sb may be NULL for some meta data files */
 		inode->i_blkbits = nilfs->ns_blocksize_bits;
 		inode->i_flags = 0;
-		atomic_set(&inode->i_count, 1);
+		inode->i_ref = 1;
 		inode->i_nlink = 1;
 		inode->i_ino = ino;
 		inode->i_mode = S_IFREG;
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index fa7f3b8..1a4c117 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -252,29 +252,36 @@ void fsnotify_unmount_inodes(struct list_head *list)
 			continue;
 
 		/*
-		 * If i_count is zero, the inode cannot have any watches and
+		 * If i_ref is zero, the inode cannot have any watches and
 		 * doing an iref/iput with MS_ACTIVE clear would actually
-		 * evict all inodes with zero i_count from icache which is
+		 * evict all inodes with zero i_ref from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
-		if (!atomic_read(&inode->i_count))
+		spin_lock(&inode->i_lock);
+		if (!inode->i_ref) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 
 		need_iput_tmp = need_iput;
 		need_iput = NULL;
 
 		/* In case fsnotify_inode_delete() drops a reference. */
 		if (inode != need_iput_tmp)
-			atomic_inc(&inode->i_count);
+			inode->i_ref++;
 		else
 			need_iput_tmp = NULL;
+		spin_unlock(&inode->i_lock);
 
 		/* In case the dropping of a reference would nuke next_i. */
-		if ((&next_i->i_sb_list != list) &&
-		    atomic_read(&next_i->i_count) &&
-		    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
-			atomic_inc(&next_i->i_count);
-			need_iput = next_i;
+		if (&next_i->i_sb_list != list) {
+			spin_lock(&next_i->i_lock);
+			if (next_i->i_ref &&
+			    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
+				next_i->i_ref++;
+				need_iput = next_i;
+			}
+			spin_unlock(&next_i->i_lock);
 		}
 
 		/*
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index 93622b1..07fdef8 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -531,7 +531,7 @@ err_corrupt_attr:
  *
  * Q: What locks are held when the function is called?
  * A: i_state has I_NEW set, hence the inode is locked, also
- *    i_count is set to 1, so it is not going to go away
+ *    i_ref is set to 1, so it is not going to go away
  *    i_flags is set to 0 and we have no business touching it.  Only an ioctl()
  *    is allowed to write to them. We should of course be honouring them but
  *    we need to do that using the IS_* macros defined in include/linux/fs.h.
@@ -1208,7 +1208,7 @@ err_out:
  *
  * Q: What locks are held when the function is called?
  * A: i_state has I_NEW set, hence the inode is locked, also
- *    i_count is set to 1, so it is not going to go away
+ *    i_ref is set to 1, so it is not going to go away
  *
  * Return 0 on success and -errno on error.  In the error case, the inode will
  * have had make_bad_inode() executed on it.
@@ -1475,7 +1475,7 @@ err_out:
  *
  * Q: What locks are held when the function is called?
  * A: i_state has I_NEW set, hence the inode is locked, also
- *    i_count is set to 1, so it is not going to go away
+ *    i_ref is set to 1, so it is not going to go away
  *
  * Return 0 on success and -errno on error.  In the error case, the inode will
  * have had make_bad_inode() executed on it.
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index 52b48e3..181eddb 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -2689,7 +2689,7 @@ static const struct super_operations ntfs_sops = {
 	//					   held. See fs/inode.c::
 	//					   generic_drop_inode(). */
 	//.delete_inode	= NULL,			/* VFS: Delete inode from disk.
-	//					   Called when i_count becomes
+	//					   Called when i_ref becomes
 	//					   0 and i_nlink is also 0. */
 	//.write_super	= NULL,			/* Flush dirty super block to
 	//					   disk. */
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 38d4304..326df72 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -909,7 +909,9 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		if (!dqinit_needed(inode, type))
 			continue;
 
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 
 		iput(old_inode);
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 313d39d..42d3311 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1477,7 +1477,7 @@ static int maybe_indirect_to_direct(struct reiserfs_transaction_handle *th,
 	 ** reading in the last block.  The user will hit problems trying to
 	 ** read the file, but for now we just skip the indirect2direct
 	 */
-	if (atomic_read(&inode->i_count) > 1 ||
+	if (inode->i_ref > 1 ||
 	    !tail_has_to_be_packed(inode) ||
 	    !page || (REISERFS_I(inode)->i_flags & i_nopack_mask)) {
 		/* leave tail in an unformatted node */
diff --git a/fs/smbfs/inode.c b/fs/smbfs/inode.c
index 450c919..85ff606 100644
--- a/fs/smbfs/inode.c
+++ b/fs/smbfs/inode.c
@@ -320,7 +320,7 @@ out:
 }
 
 /*
- * This routine is called when i_nlink == 0 and i_count goes to 0.
+ * This routine is called when i_nlink == 0 and i_ref goes to 0.
  * All blocking cleanup operations need to go here to avoid races.
  */
 static void
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index cd5900b..ead1f89 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -342,7 +342,7 @@ static void ubifs_evict_inode(struct inode *inode)
 		goto out;
 
 	dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
-	ubifs_assert(!atomic_read(&inode->i_count));
+	ubifs_assert(!inode->i_ref);
 
 	truncate_inode_pages(&inode->i_data, 0);
 
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index fc48f37..05b0445 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -1071,7 +1071,7 @@ static void __udf_read_inode(struct inode *inode)
 	 *      i_flags = sb->s_flags
 	 *      i_state = 0
 	 * clean_inode(): zero fills and sets
-	 *      i_count = 1
+	 *      i_ref = 1
 	 *      i_nlink = 1
 	 *      i_op = NULL;
 	 */
diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index be5dffd..0428b06 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -599,7 +599,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
-		__entry->count = atomic_read(&VFS_I(ip)->i_count);
+		__entry->count = VFS_I(ip)->i_ref;
 		__entry->pincount = atomic_read(&ip->i_pincount);
 		__entry->caller_ip = caller_ip;
 	),
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index cbb4791..1e41fa8 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -481,7 +481,6 @@ void		xfs_mark_inode_dirty_sync(xfs_inode_t *);
 
 #define IHOLD(ip) \
 do { \
-	ASSERT(atomic_read(&VFS_I(ip)->i_count) > 0) ; \
 	iref(VFS_I(ip)); \
 	trace_xfs_ihold(ip, _THIS_IP_); \
 } while (0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6eb94b0..c720d65 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -730,7 +730,7 @@ struct inode {
 	struct list_head	i_sb_list;
 	struct list_head	i_dentry;
 	unsigned long		i_ino;
-	atomic_t		i_count;
+	unsigned int		i_ref;
 	unsigned int		i_nlink;
 	uid_t			i_uid;
 	gid_t			i_gid;
@@ -1612,7 +1612,7 @@ struct super_operations {
  *			also cause waiting on I_NEW, without I_NEW actually
  *			being set.  find_inode() uses this to prevent returning
  *			nearly-dead inodes.
- * I_WILL_FREE		Must be set when calling write_inode_now() if i_count
+ * I_WILL_FREE		Must be set when calling write_inode_now() if i_ref
  *			is zero.  I_FREEING must be set when I_WILL_FREE is
  *			cleared.
  * I_FREEING		Set when inode is about to be freed but still has dirty
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 09/21] fs: Factor inode hash operations into functions
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (7 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 08/21] fs: rework icount to be a locked variable Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21  0:49 ` [PATCH 10/21] fs: Stop abusing find_inode_fast in iunique Dave Chinner
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Before replacing the inode hash locking with a more scalable
mechanism, factor the removal of the inode from the hashes rather
than open coding it in several places.
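
For illustration only (this sketch is not part of the diff below),
the intended split between the two variants is: callers that already
hold inode_lock use the double-underscore helper, everyone else uses
the wrapper that takes the lock itself:

	/* with inode_lock already held, e.g. in dispose_list() */
	__remove_inode_hash(inode);

	/* otherwise the wrapper acquires inode_lock for us */
	remove_inode_hash(inode);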

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/inode.c |  100 +++++++++++++++++++++++++++++++++--------------------------
 1 files changed, 56 insertions(+), 44 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 77b71ce..cfcafee 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -359,6 +359,59 @@ void inode_lru_list_del(struct inode *inode)
 	}
 }
 
+static unsigned long hash(struct super_block *sb, unsigned long hashval)
+{
+	unsigned long tmp;
+
+	tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
+			L1_CACHE_BYTES;
+	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> I_HASHBITS);
+	return tmp & I_HASHMASK;
+}
+
+/**
+ *	__insert_inode_hash - hash an inode
+ *	@inode: unhashed inode
+ *	@hashval: unsigned long value used to locate this object in the
+ *		inode_hashtable.
+ *
+ *	Add an inode to the inode hash for this superblock.
+ */
+void __insert_inode_hash(struct inode *inode, unsigned long hashval)
+{
+	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+	spin_lock(&inode_lock);
+	hlist_add_head(&inode->i_hash, head);
+	spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL(__insert_inode_hash);
+
+/**
+ *	__remove_inode_hash - remove an inode from the hash
+ *	@inode: inode to unhash
+ *
+ *	Remove an inode from the superblock. inode_lock must be
+ *	held.
+ */
+static void __remove_inode_hash(struct inode *inode)
+{
+	hlist_del_init(&inode->i_hash);
+}
+
+/**
+ *	remove_inode_hash - remove an inode from the hash
+ *	@inode: inode to unhash
+ *
+ *	Remove an inode from the superblock.
+ */
+void remove_inode_hash(struct inode *inode)
+{
+	spin_lock(&inode_lock);
+	hlist_del_init(&inode->i_hash);
+	spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL(remove_inode_hash);
+
 void end_writeback(struct inode *inode)
 {
 	might_sleep();
@@ -406,7 +459,7 @@ static void dispose_list(struct list_head *head)
 		evict(inode);
 
 		spin_lock(&inode_lock);
-		hlist_del_init(&inode->i_hash);
+		__remove_inode_hash(inode);
 		list_del_init(&inode->i_sb_list);
 		spin_unlock(&inode_lock);
 
@@ -658,16 +711,6 @@ repeat:
 	return node ? inode : NULL;
 }
 
-static unsigned long hash(struct super_block *sb, unsigned long hashval)
-{
-	unsigned long tmp;
-
-	tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
-			L1_CACHE_BYTES;
-	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> I_HASHBITS);
-	return tmp & I_HASHMASK;
-}
-
 static inline void
 __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
 			struct inode *inode)
@@ -1234,36 +1277,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 }
 EXPORT_SYMBOL(insert_inode_locked4);
 
-/**
- *	__insert_inode_hash - hash an inode
- *	@inode: unhashed inode
- *	@hashval: unsigned long value used to locate this object in the
- *		inode_hashtable.
- *
- *	Add an inode to the inode hash for this superblock.
- */
-void __insert_inode_hash(struct inode *inode, unsigned long hashval)
-{
-	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
-	spin_lock(&inode_lock);
-	hlist_add_head(&inode->i_hash, head);
-	spin_unlock(&inode_lock);
-}
-EXPORT_SYMBOL(__insert_inode_hash);
-
-/**
- *	remove_inode_hash - remove an inode from the hash
- *	@inode: inode to unhash
- *
- *	Remove an inode from the superblock.
- */
-void remove_inode_hash(struct inode *inode)
-{
-	spin_lock(&inode_lock);
-	hlist_del_init(&inode->i_hash);
-	spin_unlock(&inode_lock);
-}
-EXPORT_SYMBOL(remove_inode_hash);
 
 int generic_delete_inode(struct inode *inode)
 {
@@ -1319,6 +1332,7 @@ static void iput_final(struct inode *inode)
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
 		hlist_del_init(&inode->i_hash);
+		__remove_inode_hash(inode);
 	}
 	list_del_init(&inode->i_wb_list);
 	WARN_ON(inode->i_state & I_NEW);
@@ -1334,9 +1348,7 @@ static void iput_final(struct inode *inode)
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&inode_lock);
 	evict(inode);
-	spin_lock(&inode_lock);
-	hlist_del_init(&inode->i_hash);
-	spin_unlock(&inode_lock);
+	remove_inode_hash(inode);
 	wake_up_inode(inode);
 	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
 	destroy_inode(inode);
-- 
1.7.1


* [PATCH 10/21] fs: Stop abusing find_inode_fast in iunique
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (8 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 09/21] fs: Factor inode hash operations into functions Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21  0:49 ` [PATCH 11/21] fs: move i_ref increments into find_inode/find_inode_fast Dave Chinner
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Christoph Hellwig <hch@lst.de>

Stop abusing find_inode_fast for iunique and open-code the inode hash
walk. Introduce a new iunique_lock to protect the iunique counter once
inode_lock is removed.
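
For context, a hypothetical caller looks like this -- filesystems
that have to fabricate inode numbers at runtime use iunique() to
pick one that is not already present in the inode cache:

	/* reserve inode numbers <= 15 for internal use */
	inode->i_ino = iunique(sb, 15);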

Based on a patch originally from Nick Piggin.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c |   30 +++++++++++++++++++++++++-----
 1 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index cfcafee..77ff091 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -913,6 +913,27 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 	return inode;
 }
 
+/*
+ * search the inode cache for a matching inode number.
+ * If we find one, then the inode number we are trying to
+ * allocate is not unique and so we should not use it.
+ *
+ * Returns 1 if the inode number is unique, 0 if it is not.
+ */
+static int test_inode_iunique(struct super_block *sb, unsigned long ino)
+{
+	struct hlist_head *b = inode_hashtable + hash(sb, ino);
+	struct hlist_node *node;
+	struct inode *inode;
+
+	hlist_for_each_entry(inode, node, b, i_hash) {
+		if (inode->i_ino == ino && inode->i_sb == sb)
+			return 0;
+	}
+
+	return 1;
+}
+
 /**
  *	iunique - get a unique inode number
  *	@sb: superblock
@@ -934,19 +955,18 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
 	 * error if st_ino won't fit in target struct field. Use 32bit counter
 	 * here to attempt to avoid that.
 	 */
+	static DEFINE_SPINLOCK(iunique_lock);
 	static unsigned int counter;
-	struct inode *inode;
-	struct hlist_head *head;
 	ino_t res;
 
 	spin_lock(&inode_lock);
+	spin_lock(&iunique_lock);
 	do {
 		if (counter <= max_reserved)
 			counter = max_reserved + 1;
 		res = counter++;
-		head = inode_hashtable + hash(sb, res);
-		inode = find_inode_fast(sb, head, res);
-	} while (inode != NULL);
+	} while (!test_inode_iunique(sb, res));
+	spin_unlock(&iunique_lock);
 	spin_unlock(&inode_lock);
 
 	return res;
-- 
1.7.1


* [PATCH 11/21] fs: move i_ref increments into find_inode/find_inode_fast
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (9 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 10/21] fs: Stop abusing find_inode_fast in iunique Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21  0:49 ` [PATCH 12/21] fs: remove inode_add_to_list/__inode_add_to_list Dave Chinner
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Christoph Hellwig <hch@lst.de>

Now that iunique is not abusing find_inode anymore, we can move the
i_ref increment back to where it belongs.
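
With the increment done inside the hash walk, a lookup path such as
ifind_fast() reduces to the shape below (a sketch of the code after
this patch):

	spin_lock(&inode_lock);
	inode = find_inode_fast(sb, head, ino);	/* referenced if found */
	spin_unlock(&inode_lock);
	if (inode)
		wait_on_inode(inode);
	return inode;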

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c |   30 +++++++++++-------------------
 1 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 77ff091..6b97eb7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -657,11 +657,9 @@ static struct shrinker icache_shrinker = {
 };
 
 static void __wait_on_freeing_inode(struct inode *inode);
+
 /*
  * Called with the inode lock held.
- * NOTE: we are not increasing the inode-refcount, you must take a reference
- * by hand after calling find_inode now! This simplifies iunique and won't
- * add any additional branch in the common code.
  */
 static struct inode *find_inode(struct super_block *sb,
 				struct hlist_head *head,
@@ -681,9 +679,12 @@ repeat:
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
-		break;
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
+		return inode;
 	}
-	return node ? inode : NULL;
+	return NULL;
 }
 
 /*
@@ -706,9 +707,12 @@ repeat:
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
-		break;
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
+		return inode;
 	}
-	return node ? inode : NULL;
+	return NULL;
 }
 
 static inline void
@@ -853,9 +857,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		spin_lock(&old->i_lock);
-		old->i_ref++;
-		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -902,9 +903,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		spin_lock(&old->i_lock);
-		old->i_ref++;
-		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -1022,9 +1020,6 @@ static struct inode *ifind(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode(sb, head, test, data);
 	if (inode) {
-		spin_lock(&inode->i_lock);
-		inode->i_ref++;
-		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
@@ -1057,9 +1052,6 @@ static struct inode *ifind_fast(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode_fast(sb, head, ino);
 	if (inode) {
-		spin_lock(&inode->i_lock);
-		inode->i_ref++;
-		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(inode);
 		return inode;
-- 
1.7.1


* [PATCH 12/21] fs: remove inode_add_to_list/__inode_add_to_list
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (10 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 11/21] fs: move i_ref increments into find_inode/find_inode_fast Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21  0:49 ` [PATCH 13/21] fs: Introduce per-bucket inode hash locks Dave Chinner
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Christoph Hellwig <hch@lst.de>

Split up inode_add_to_list/__inode_add_to_list.  Locking for the two
lists will be split soon, so these helpers really don't buy us much
anymore.

The __ prefixes for the sb list helpers will go away soon, but until
inode_lock is gone we'll need them to distinguish between the locked
and unlocked variants.
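
The caller-visible change is small; a filesystem setting up a new
inode (as XFS does below) now performs the two insertions
explicitly:

	inode->i_ino = ip->i_ino;
	inode->i_state = I_NEW;

	inode_sb_list_add(inode);
	insert_inode_hash(inode);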

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c                  |   70 +++++++++++++++++++-----------------------
 fs/xfs/linux-2.6/xfs_iops.c |    4 ++-
 include/linux/fs.h          |    5 ++-
 3 files changed, 38 insertions(+), 41 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 6b97eb7..301dff5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -359,6 +359,28 @@ void inode_lru_list_del(struct inode *inode)
 	}
 }
 
+static inline void __inode_sb_list_add(struct inode *inode)
+{
+	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
+}
+
+/**
+ * inode_sb_list_add - add inode to the superblock list of inodes
+ * @inode: inode to add
+ */
+void inode_sb_list_add(struct inode *inode)
+{
+	spin_lock(&inode_lock);
+	__inode_sb_list_add(inode);
+	spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL_GPL(inode_sb_list_add);
+
+static inline void __inode_sb_list_del(struct inode *inode)
+{
+	list_del_init(&inode->i_sb_list);
+}
+
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
 {
 	unsigned long tmp;
@@ -379,9 +401,10 @@ static unsigned long hash(struct super_block *sb, unsigned long hashval)
  */
 void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 {
-	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+	struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);
+
 	spin_lock(&inode_lock);
-	hlist_add_head(&inode->i_hash, head);
+	hlist_add_head(&inode->i_hash, b);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
@@ -460,7 +483,7 @@ static void dispose_list(struct list_head *head)
 
 		spin_lock(&inode_lock);
 		__remove_inode_hash(inode);
-		list_del_init(&inode->i_sb_list);
+		__inode_sb_list_del(inode);
 		spin_unlock(&inode_lock);
 
 		wake_up_inode(inode);
@@ -715,37 +738,6 @@ repeat:
 	return NULL;
 }
 
-static inline void
-__inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
-			struct inode *inode)
-{
-	list_add(&inode->i_sb_list, &sb->s_inodes);
-	if (head)
-		hlist_add_head(&inode->i_hash, head);
-}
-
-/**
- * inode_add_to_lists - add a new inode to relevant lists
- * @sb: superblock inode belongs to
- * @inode: inode to mark in use
- *
- * When an inode is allocated it needs to be accounted for, added to the in use
- * list, the owning superblock and the inode hash. This needs to be done under
- * the inode_lock, so export a function to do this rather than the inode lock
- * itself. We calculate the hash list to add to here so it is all internal
- * which requires the caller to have already set up the inode number in the
- * inode to add.
- */
-void inode_add_to_lists(struct super_block *sb, struct inode *inode)
-{
-	struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);
-
-	spin_lock(&inode_lock);
-	__inode_add_to_lists(sb, head, inode);
-	spin_unlock(&inode_lock);
-}
-EXPORT_SYMBOL_GPL(inode_add_to_lists);
-
 /**
  *	new_inode 	- obtain an inode
  *	@sb: superblock
@@ -773,7 +765,7 @@ struct inode *new_inode(struct super_block *sb)
 	inode = alloc_inode(sb);
 	if (inode) {
 		spin_lock(&inode_lock);
-		__inode_add_to_lists(sb, NULL, inode);
+		__inode_sb_list_add(inode);
 		inode->i_ino = ++last_ino;
 		inode->i_state = 0;
 		spin_unlock(&inode_lock);
@@ -842,7 +834,8 @@ static struct inode *get_new_inode(struct super_block *sb,
 			if (set(inode, data))
 				goto set_failed;
 
-			__inode_add_to_lists(sb, head, inode);
+			hlist_add_head(&inode->i_hash, head);
+			__inode_sb_list_add(inode);
 			inode->i_state = I_NEW;
 			spin_unlock(&inode_lock);
 
@@ -888,7 +881,8 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		old = find_inode_fast(sb, head, ino);
 		if (!old) {
 			inode->i_ino = ino;
-			__inode_add_to_lists(sb, head, inode);
+			hlist_add_head(&inode->i_hash, head);
+			__inode_sb_list_add(inode);
 			inode->i_state = I_NEW;
 			spin_unlock(&inode_lock);
 
@@ -1357,7 +1351,7 @@ static void iput_final(struct inode *inode)
 	 */
 	inode_lru_list_del(inode);
 
-	list_del_init(&inode->i_sb_list);
+	__inode_sb_list_del(inode);
 	spin_unlock(&inode_lock);
 	evict(inode);
 	remove_inode_hash(inode);
diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index b7ec465..3c7cea3 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -795,7 +795,9 @@ xfs_setup_inode(
 
 	inode->i_ino = ip->i_ino;
 	inode->i_state = I_NEW;
-	inode_add_to_lists(ip->i_mount->m_super, inode);
+
+	inode_sb_list_add(inode);
+	insert_inode_hash(inode);
 
 	inode->i_mode	= ip->i_d.di_mode;
 	inode->i_nlink	= ip->i_d.di_nlink;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c720d65..baf8d32 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2163,7 +2163,6 @@ extern loff_t vfs_llseek(struct file *file, loff_t offset, int origin);
 
 extern int inode_init_always(struct super_block *, struct inode *);
 extern void inode_init_once(struct inode *);
-extern void inode_add_to_lists(struct super_block *, struct inode *);
 extern void iput(struct inode *);
 extern struct inode * igrab(struct inode *);
 extern ino_t iunique(struct super_block *, ino_t);
@@ -2195,9 +2194,11 @@ extern int file_remove_suid(struct file *);
 
 extern void __insert_inode_hash(struct inode *, unsigned long hashval);
 extern void remove_inode_hash(struct inode *);
-static inline void insert_inode_hash(struct inode *inode) {
+static inline void insert_inode_hash(struct inode *inode)
+{
 	__insert_inode_hash(inode, inode->i_ino);
 }
+extern void inode_sb_list_add(struct inode *inode);
 
 #ifdef CONFIG_BLOCK
 extern void submit_bio(int, struct bio *);
-- 
1.7.1



* [PATCH 13/21] fs: Introduce per-bucket inode hash locks
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (11 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 12/21] fs: remove inode_add_to_list/__inode_add_to_list Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21  0:49 ` [PATCH 14/21] fs: add a per-superblock lock for the inode list Dave Chinner
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Protecting the inode hash with a single lock is not scalable.  Convert
the inode hash to use the new bit-locked hash list implementation
that allows per-bucket locks to be used. This allows us to replace
the global inode_lock with finer grained locking without increasing
the size of the hash table.
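
The per-bucket pattern is cheap because the lock is a bit spinlock
embedded in the low bit of each bucket's head pointer. A sketch of
the insert path (matching the diff below):

	struct hlist_bl_head *b = inode_hashtable + hash(inode->i_sb, hashval);

	hlist_bl_lock(b);
	hlist_bl_add_head(&inode->i_hash, b);
	hlist_bl_unlock(b);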

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c        |    2 +-
 fs/fs-writeback.c       |    2 +-
 fs/hfs/hfs_fs.h         |    2 +-
 fs/hfs/inode.c          |    2 +-
 fs/hfsplus/hfsplus_fs.h |    2 +-
 fs/hfsplus/inode.c      |    2 +-
 fs/inode.c              |  168 ++++++++++++++++++++++++++++-------------------
 fs/nilfs2/gcinode.c     |   22 ++++---
 fs/nilfs2/segment.c     |    2 +-
 fs/nilfs2/the_nilfs.h   |    2 +-
 fs/reiserfs/xattr.c     |    2 +-
 include/linux/fs.h      |    8 ++-
 include/linux/list_bl.h |    1 +
 mm/shmem.c              |    4 +-
 14 files changed, 132 insertions(+), 89 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7947bf0..c7a2bef 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3855,7 +3855,7 @@ again:
 	p = &root->inode_tree.rb_node;
 	parent = NULL;
 
-	if (hlist_unhashed(&inode->i_hash))
+	if (inode_unhashed(inode))
 		return;
 
 	spin_lock(&root->inode_lock);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9832beb..1fb5d95 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -964,7 +964,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 * dirty list.  Add blockdev inodes as well.
 		 */
 		if (!S_ISBLK(inode->i_mode)) {
-			if (hlist_unhashed(&inode->i_hash))
+			if (inode_unhashed(inode))
 				goto out;
 		}
 		if (inode->i_state & I_FREEING)
diff --git a/fs/hfs/hfs_fs.h b/fs/hfs/hfs_fs.h
index 4f55651..24591be 100644
--- a/fs/hfs/hfs_fs.h
+++ b/fs/hfs/hfs_fs.h
@@ -148,7 +148,7 @@ struct hfs_sb_info {
 
 	int fs_div;
 
-	struct hlist_head rsrc_inodes;
+	struct hlist_bl_head rsrc_inodes;
 };
 
 #define HFS_FLG_BITMAP_DIRTY	0
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 397b7ad..7778298 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -524,7 +524,7 @@ static struct dentry *hfs_file_lookup(struct inode *dir, struct dentry *dentry,
 	HFS_I(inode)->rsrc_inode = dir;
 	HFS_I(dir)->rsrc_inode = inode;
 	igrab(dir);
-	hlist_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
+	hlist_bl_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
 	mark_inode_dirty(inode);
 out:
 	d_add(dentry, inode);
diff --git a/fs/hfsplus/hfsplus_fs.h b/fs/hfsplus/hfsplus_fs.h
index dc856be..499f5a5 100644
--- a/fs/hfsplus/hfsplus_fs.h
+++ b/fs/hfsplus/hfsplus_fs.h
@@ -144,7 +144,7 @@ struct hfsplus_sb_info {
 
 	unsigned long flags;
 
-	struct hlist_head rsrc_inodes;
+	struct hlist_bl_head rsrc_inodes;
 };
 
 #define HFSPLUS_SB_WRITEBACKUP	0x0001
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index c5a979d..b755cf0 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -202,7 +202,7 @@ static struct dentry *hfsplus_file_lookup(struct inode *dir, struct dentry *dent
 	HFSPLUS_I(inode).rsrc_inode = dir;
 	HFSPLUS_I(dir).rsrc_inode = inode;
 	igrab(dir);
-	hlist_add_head(&inode->i_hash, &HFSPLUS_SB(sb).rsrc_inodes);
+	hlist_bl_add_head(&inode->i_hash, &HFSPLUS_SB(sb).rsrc_inodes);
 	mark_inode_dirty(inode);
 out:
 	d_add(dentry, inode);
diff --git a/fs/inode.c b/fs/inode.c
index 301dff5..da6b73b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -32,6 +32,13 @@
  *
  * inode->i_lock protects:
  *   i_ref
+ * inode hash lock protects:
+ *   inode hash table, i_hash
+ *
+ * Lock orders
+ * inode_lock
+ *   inode hash bucket lock
+ *     inode->i_lock
  */
 
 /*
@@ -68,6 +75,7 @@
 
 static unsigned int i_hash_mask __read_mostly;
 static unsigned int i_hash_shift __read_mostly;
+static struct hlist_bl_head *inode_hashtable __read_mostly;
 
 /*
  * Each inode can be on two separate lists. One is
@@ -80,9 +88,7 @@ static unsigned int i_hash_shift __read_mostly;
  * A "dirty" list is maintained for each super block,
  * allowing for low-overhead inode sync() operations.
  */
-
 static LIST_HEAD(inode_lru);
-static struct hlist_head *inode_hashtable __read_mostly;
 
 /*
  * A simple spinlock to protect the list manipulations.
@@ -297,7 +303,7 @@ void destroy_inode(struct inode *inode)
 void inode_init_once(struct inode *inode)
 {
 	memset(inode, 0, sizeof(*inode));
-	INIT_HLIST_NODE(&inode->i_hash);
+	init_hlist_bl_node(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_dentry);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_wb_list);
@@ -401,10 +407,12 @@ static unsigned long hash(struct super_block *sb, unsigned long hashval)
  */
 void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 {
-	struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);
+	struct hlist_bl_head *b = inode_hashtable + hash(inode->i_sb, hashval);
 
 	spin_lock(&inode_lock);
-	hlist_add_head(&inode->i_hash, b);
+	hlist_bl_lock(b);
+	hlist_bl_add_head(&inode->i_hash, b);
+	hlist_bl_unlock(b);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
@@ -418,7 +426,12 @@ EXPORT_SYMBOL(__insert_inode_hash);
  */
 static void __remove_inode_hash(struct inode *inode)
 {
-	hlist_del_init(&inode->i_hash);
+	struct hlist_bl_head *b;
+
+	b = inode_hashtable + hash(inode->i_sb, inode->i_ino);
+	hlist_bl_lock(b);
+	hlist_bl_del_init(&inode->i_hash);
+	hlist_bl_unlock(b);
 }
 
 /**
@@ -430,7 +443,7 @@ static void __remove_inode_hash(struct inode *inode)
 void remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_lock);
-	hlist_del_init(&inode->i_hash);
+	__remove_inode_hash(inode);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(remove_inode_hash);
@@ -685,21 +698,23 @@ static void __wait_on_freeing_inode(struct inode *inode);
  * Called with the inode lock held.
  */
 static struct inode *find_inode(struct super_block *sb,
-				struct hlist_head *head,
+				struct hlist_bl_head *b,
 				int (*test)(struct inode *, void *),
 				void *data)
 {
-	struct hlist_node *node;
+	struct hlist_bl_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
 		if (!test(inode, data))
 			continue;
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			hlist_bl_unlock(b);
 			__wait_on_freeing_inode(inode);
+			hlist_bl_lock(b);
 			goto repeat;
 		}
 		spin_lock(&inode->i_lock);
@@ -715,19 +730,21 @@ repeat:
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+		struct hlist_bl_head *b, unsigned long ino)
 {
-	struct hlist_node *node;
+	struct hlist_bl_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			hlist_bl_unlock(b);
 			__wait_on_freeing_inode(inode);
+			hlist_bl_lock(b);
 			goto repeat;
 		}
 		spin_lock(&inode->i_lock);
@@ -816,7 +833,7 @@ EXPORT_SYMBOL(unlock_new_inode);
  *	-- rmk@arm.uk.linux.org
  */
 static struct inode *get_new_inode(struct super_block *sb,
-				struct hlist_head *head,
+				struct hlist_bl_head *b,
 				int (*test)(struct inode *, void *),
 				int (*set)(struct inode *, void *),
 				void *data)
@@ -828,13 +845,15 @@ static struct inode *get_new_inode(struct super_block *sb,
 		struct inode *old;
 
 		spin_lock(&inode_lock);
+		hlist_bl_lock(b);
 		/* We released the lock, so.. */
-		old = find_inode(sb, head, test, data);
+		old = find_inode(sb, b, test, data);
 		if (!old) {
 			if (set(inode, data))
 				goto set_failed;
 
-			hlist_add_head(&inode->i_hash, head);
+			hlist_bl_add_head(&inode->i_hash, b);
+			hlist_bl_unlock(b);
 			__inode_sb_list_add(inode);
 			inode->i_state = I_NEW;
 			spin_unlock(&inode_lock);
@@ -850,6 +869,7 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
+		hlist_bl_unlock(b);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -858,6 +878,7 @@ static struct inode *get_new_inode(struct super_block *sb,
 	return inode;
 
 set_failed:
+	hlist_bl_unlock(b);
 	spin_unlock(&inode_lock);
 	destroy_inode(inode);
 	return NULL;
@@ -868,7 +889,7 @@ set_failed:
  * comment at iget_locked for details.
  */
 static struct inode *get_new_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				struct hlist_bl_head *b, unsigned long ino)
 {
 	struct inode *inode;
 
@@ -877,11 +898,13 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		struct inode *old;
 
 		spin_lock(&inode_lock);
+		hlist_bl_lock(b);
 		/* We released the lock, so.. */
-		old = find_inode_fast(sb, head, ino);
+		old = find_inode_fast(sb, b, ino);
 		if (!old) {
 			inode->i_ino = ino;
-			hlist_add_head(&inode->i_hash, head);
+			hlist_bl_add_head(&inode->i_hash, b);
+			hlist_bl_unlock(b);
 			__inode_sb_list_add(inode);
 			inode->i_state = I_NEW;
 			spin_unlock(&inode_lock);
@@ -897,6 +920,7 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
+		hlist_bl_unlock(b);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -914,15 +938,19 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
  */
 static int test_inode_iunique(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *b = inode_hashtable + hash(sb, ino);
-	struct hlist_node *node;
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, ino);
+	struct hlist_bl_node *node;
 	struct inode *inode;
 
-	hlist_for_each_entry(inode, node, b, i_hash) {
-		if (inode->i_ino == ino && inode->i_sb == sb)
+	hlist_bl_lock(b);
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
+		if (inode->i_ino == ino && inode->i_sb == sb) {
+			hlist_bl_unlock(b);
 			return 0;
+		}
 	}
 
+	hlist_bl_unlock(b);
 	return 1;
 }
 
@@ -1006,21 +1034,21 @@ EXPORT_SYMBOL(igrab);
  * Note, @test is called with the inode_lock held, so can't sleep.
  */
 static struct inode *ifind(struct super_block *sb,
-		struct hlist_head *head, int (*test)(struct inode *, void *),
+		struct hlist_bl_head *b,
+		int (*test)(struct inode *, void *),
 		void *data, const int wait)
 {
 	struct inode *inode;
 
 	spin_lock(&inode_lock);
-	inode = find_inode(sb, head, test, data);
-	if (inode) {
-		spin_unlock(&inode_lock);
-		if (likely(wait))
-			wait_on_inode(inode);
-		return inode;
-	}
+	hlist_bl_lock(b);
+	inode = find_inode(sb, b, test, data);
+	hlist_bl_unlock(b);
 	spin_unlock(&inode_lock);
-	return NULL;
+
+	if (inode && likely(wait))
+		wait_on_inode(inode);
+	return inode;
 }
 
 /**
@@ -1039,19 +1067,20 @@ static struct inode *ifind(struct super_block *sb,
  * Otherwise NULL is returned.
  */
 static struct inode *ifind_fast(struct super_block *sb,
-		struct hlist_head *head, unsigned long ino)
+		struct hlist_bl_head *b,
+		unsigned long ino)
 {
 	struct inode *inode;
 
 	spin_lock(&inode_lock);
-	inode = find_inode_fast(sb, head, ino);
-	if (inode) {
-		spin_unlock(&inode_lock);
-		wait_on_inode(inode);
-		return inode;
-	}
+	hlist_bl_lock(b);
+	inode = find_inode_fast(sb, b, ino);
+	hlist_bl_unlock(b);
 	spin_unlock(&inode_lock);
-	return NULL;
+
+	if (inode)
+		wait_on_inode(inode);
+	return inode;
 }
 
 /**
@@ -1078,9 +1107,9 @@ static struct inode *ifind_fast(struct super_block *sb,
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 
-	return ifind(sb, head, test, data, 0);
+	return ifind(sb, b, test, data, 0);
 }
 EXPORT_SYMBOL(ilookup5_nowait);
 
@@ -1106,9 +1135,9 @@ EXPORT_SYMBOL(ilookup5_nowait);
 struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 
-	return ifind(sb, head, test, data, 1);
+	return ifind(sb, b, test, data, 1);
 }
 EXPORT_SYMBOL(ilookup5);
 
@@ -1128,9 +1157,9 @@ EXPORT_SYMBOL(ilookup5);
  */
 struct inode *ilookup(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, ino);
 
-	return ifind_fast(sb, head, ino);
+	return ifind_fast(sb, b, ino);
 }
 EXPORT_SYMBOL(ilookup);
 
@@ -1158,17 +1187,17 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *),
 		int (*set)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 	struct inode *inode;
 
-	inode = ifind(sb, head, test, data, 1);
+	inode = ifind(sb, b, test, data, 1);
 	if (inode)
 		return inode;
 	/*
 	 * get_new_inode() will do the right thing, re-trying the search
 	 * in case it had to block at any point.
 	 */
-	return get_new_inode(sb, head, test, set, data);
+	return get_new_inode(sb, b, test, set, data);
 }
 EXPORT_SYMBOL(iget5_locked);
 
@@ -1189,17 +1218,17 @@ EXPORT_SYMBOL(iget5_locked);
  */
 struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
 
-	inode = ifind_fast(sb, head, ino);
+	inode = ifind_fast(sb, b, ino);
 	if (inode)
 		return inode;
 	/*
 	 * get_new_inode_fast() will do the right thing, re-trying the search
 	 * in case it had to block at any point.
 	 */
-	return get_new_inode_fast(sb, head, ino);
+	return get_new_inode_fast(sb, b, ino);
 }
 EXPORT_SYMBOL(iget_locked);
 
@@ -1207,14 +1236,15 @@ int insert_inode_locked(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 	ino_t ino = inode->i_ino;
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, ino);
 
 	inode->i_state |= I_NEW;
 	while (1) {
-		struct hlist_node *node;
+		struct hlist_bl_node *node;
 		struct inode *old = NULL;
 		spin_lock(&inode_lock);
-		hlist_for_each_entry(old, node, head, i_hash) {
+		hlist_bl_lock(b);
+		hlist_bl_for_each_entry(old, node, b, i_hash) {
 			if (old->i_ino != ino)
 				continue;
 			if (old->i_sb != sb)
@@ -1224,16 +1254,18 @@ int insert_inode_locked(struct inode *inode)
 			break;
 		}
 		if (likely(!node)) {
-			hlist_add_head(&inode->i_hash, head);
+			hlist_bl_add_head(&inode->i_hash, b);
+			hlist_bl_unlock(b);
 			spin_unlock(&inode_lock);
 			return 0;
 		}
 		spin_lock(&old->i_lock);
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
+		hlist_bl_unlock(b);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
-		if (unlikely(!hlist_unhashed(&old->i_hash))) {
+		if (unlikely(!inode_unhashed(old))) {
 			iput(old);
 			return -EBUSY;
 		}
@@ -1246,16 +1278,17 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
 	struct super_block *sb = inode->i_sb;
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 
 	inode->i_state |= I_NEW;
 
 	while (1) {
-		struct hlist_node *node;
+		struct hlist_bl_node *node;
 		struct inode *old = NULL;
 
 		spin_lock(&inode_lock);
-		hlist_for_each_entry(old, node, head, i_hash) {
+		hlist_bl_lock(b);
+		hlist_bl_for_each_entry(old, node, b, i_hash) {
 			if (old->i_sb != sb)
 				continue;
 			if (!test(old, data))
@@ -1265,16 +1298,18 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 			break;
 		}
 		if (likely(!node)) {
-			hlist_add_head(&inode->i_hash, head);
+			hlist_bl_add_head(&inode->i_hash, b);
+			hlist_bl_unlock(b);
 			spin_unlock(&inode_lock);
 			return 0;
 		}
 		spin_lock(&old->i_lock);
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
+		hlist_bl_unlock(b);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
-		if (unlikely(!hlist_unhashed(&old->i_hash))) {
+		if (unlikely(!inode_unhashed(old))) {
 			iput(old);
 			return -EBUSY;
 		}
@@ -1297,7 +1332,7 @@ EXPORT_SYMBOL(generic_delete_inode);
  */
 int generic_drop_inode(struct inode *inode)
 {
-	return !inode->i_nlink || hlist_unhashed(&inode->i_hash);
+	return !inode->i_nlink || inode_unhashed(inode);
 }
 EXPORT_SYMBOL_GPL(generic_drop_inode);
 
@@ -1337,7 +1372,6 @@ static void iput_final(struct inode *inode)
 		spin_lock(&inode_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		hlist_del_init(&inode->i_hash);
 		__remove_inode_hash(inode);
 	}
 	list_del_init(&inode->i_wb_list);
@@ -1601,7 +1635,7 @@ void __init inode_init_early(void)
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct hlist_head),
+					sizeof(struct hlist_bl_head),
 					ihash_entries,
 					14,
 					HASH_EARLY,
@@ -1634,7 +1668,7 @@ void __init inode_init(void)
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct hlist_head),
+					sizeof(struct hlist_bl_head),
 					ihash_entries,
 					14,
 					0,
@@ -1643,7 +1677,7 @@ void __init inode_init(void)
 					0);
 
 	for (loop = 0; loop < (1 << i_hash_shift); loop++)
-		INIT_HLIST_HEAD(&inode_hashtable[loop]);
+		INIT_HLIST_BL_HEAD(&inode_hashtable[loop]);
 }
 
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
diff --git a/fs/nilfs2/gcinode.c b/fs/nilfs2/gcinode.c
index bed3a78..ce7344e 100644
--- a/fs/nilfs2/gcinode.c
+++ b/fs/nilfs2/gcinode.c
@@ -196,13 +196,13 @@ int nilfs_init_gccache(struct the_nilfs *nilfs)
 	INIT_LIST_HEAD(&nilfs->ns_gc_inodes);
 
 	nilfs->ns_gc_inodes_h =
-		kmalloc(sizeof(struct hlist_head) * NILFS_GCINODE_HASH_SIZE,
+		kmalloc(sizeof(struct hlist_bl_head) * NILFS_GCINODE_HASH_SIZE,
 			GFP_NOFS);
 	if (nilfs->ns_gc_inodes_h == NULL)
 		return -ENOMEM;
 
 	for (loop = 0; loop < NILFS_GCINODE_HASH_SIZE; loop++)
-		INIT_HLIST_HEAD(&nilfs->ns_gc_inodes_h[loop]);
+		INIT_HLIST_BL_HEAD(&nilfs->ns_gc_inodes_h[loop]);
 	return 0;
 }
 
@@ -254,18 +254,18 @@ static unsigned long ihash(ino_t ino, __u64 cno)
  */
 struct inode *nilfs_gc_iget(struct the_nilfs *nilfs, ino_t ino, __u64 cno)
 {
-	struct hlist_head *head = nilfs->ns_gc_inodes_h + ihash(ino, cno);
-	struct hlist_node *node;
+	struct hlist_bl_head *head = nilfs->ns_gc_inodes_h + ihash(ino, cno);
+	struct hlist_bl_node *node;
 	struct inode *inode;
 
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	hlist_bl_for_each_entry(inode, node, head, i_hash) {
 		if (inode->i_ino == ino && NILFS_I(inode)->i_cno == cno)
 			return inode;
 	}
 
 	inode = alloc_gcinode(nilfs, ino, cno);
 	if (likely(inode)) {
-		hlist_add_head(&inode->i_hash, head);
+		hlist_bl_add_head(&inode->i_hash, head);
 		list_add(&NILFS_I(inode)->i_dirty, &nilfs->ns_gc_inodes);
 	}
 	return inode;
@@ -284,16 +284,18 @@ void nilfs_clear_gcinode(struct inode *inode)
  */
 void nilfs_remove_all_gcinode(struct the_nilfs *nilfs)
 {
-	struct hlist_head *head = nilfs->ns_gc_inodes_h;
-	struct hlist_node *node, *n;
+	struct hlist_bl_head *head = nilfs->ns_gc_inodes_h;
+	struct hlist_bl_node *node;
 	struct inode *inode;
 	int loop;
 
 	for (loop = 0; loop < NILFS_GCINODE_HASH_SIZE; loop++, head++) {
-		hlist_for_each_entry_safe(inode, node, n, head, i_hash) {
-			hlist_del_init(&inode->i_hash);
+restart:
+		hlist_bl_for_each_entry(inode, node, head, i_hash) {
+			hlist_bl_del_init(&inode->i_hash);
 			list_del_init(&NILFS_I(inode)->i_dirty);
 			nilfs_clear_gcinode(inode); /* might sleep */
+			goto restart;
 		}
 	}
 }
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 9fd051a..038251c 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -2452,7 +2452,7 @@ nilfs_remove_written_gcinodes(struct the_nilfs *nilfs, struct list_head *head)
 	list_for_each_entry_safe(ii, n, head, i_dirty) {
 		if (!test_bit(NILFS_I_UPDATED, &ii->i_state))
 			continue;
-		hlist_del_init(&ii->vfs_inode.i_hash);
+		hlist_bl_del_init(&ii->vfs_inode.i_hash);
 		list_del_init(&ii->i_dirty);
 		nilfs_clear_gcinode(&ii->vfs_inode);
 	}
diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index f785a7b..1ab441a 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -167,7 +167,7 @@ struct the_nilfs {
 
 	/* GC inode list and hash table head */
 	struct list_head	ns_gc_inodes;
-	struct hlist_head      *ns_gc_inodes_h;
+	struct hlist_bl_head      *ns_gc_inodes_h;
 
 	/* Disk layout information (static) */
 	unsigned int		ns_blocksize_bits;
diff --git a/fs/reiserfs/xattr.c b/fs/reiserfs/xattr.c
index 8c4cf27..b246e3c 100644
--- a/fs/reiserfs/xattr.c
+++ b/fs/reiserfs/xattr.c
@@ -424,7 +424,7 @@ int reiserfs_prepare_write(struct file *f, struct page *page,
 static void update_ctime(struct inode *inode)
 {
 	struct timespec now = current_fs_time(inode->i_sb);
-	if (hlist_unhashed(&inode->i_hash) || !inode->i_nlink ||
+	if (inode_unhashed(inode) || !inode->i_nlink ||
 	    timespec_equal(&inode->i_ctime, &now))
 		return;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index baf8d32..adcbfb9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -383,6 +383,7 @@ struct inodes_stat_t {
 #include <linux/capability.h>
 #include <linux/semaphore.h>
 #include <linux/fiemap.h>
+#include <linux/list_bl.h>
 
 #include <asm/atomic.h>
 #include <asm/byteorder.h>
@@ -724,7 +725,7 @@ struct posix_acl;
 #define ACL_NOT_CACHED ((void *)(-1))
 
 struct inode {
-	struct hlist_node	i_hash;
+	struct hlist_bl_node	i_hash;
 	struct list_head	i_wb_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
@@ -789,6 +790,11 @@ struct inode {
 	void			*i_private; /* fs or device private pointer */
 };
 
+static inline int inode_unhashed(struct inode *inode)
+{
+	return hlist_bl_unhashed(&inode->i_hash);
+}
+
 /*
  * inode->i_mutex nesting subclasses for the lock validator:
  *
diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
index 0d791ff..5bb2370 100644
--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -126,6 +126,7 @@ static inline void hlist_bl_del_init(struct hlist_bl_node *n)
 
 #endif
 
+
 /**
  * hlist_bl_lock	- lock a hash list
  * @h:	hash list head to lock
diff --git a/mm/shmem.c b/mm/shmem.c
index 7d0bc16..419de2c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2146,7 +2146,7 @@ static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
 	if (*len < 3)
 		return 255;
 
-	if (hlist_unhashed(&inode->i_hash)) {
+	if (inode_unhashed(inode)) {
 		/* Unfortunately insert_inode_hash is not idempotent,
 		 * so as we hash inodes here rather than at creation
 		 * time, we need a lock to ensure we only try
@@ -2154,7 +2154,7 @@ static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
 		 */
 		static DEFINE_SPINLOCK(lock);
 		spin_lock(&lock);
-		if (hlist_unhashed(&inode->i_hash))
+		if (inode_unhashed(inode))
 			__insert_inode_hash(inode,
 					    inode->i_ino + inode->i_generation);
 		spin_unlock(&lock);
-- 
1.7.1


* [PATCH 14/21] fs: add a per-superblock lock for the inode list
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (12 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 13/21] fs: Introduce per-bucket inode hash locks Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21  0:49 ` [PATCH 15/21] fs: split locking of inode writeback and LRU lists Dave Chinner
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

To allow removal of the inode_lock, we first need to protect the
superblock inode list with its own lock instead of using the
inode_lock. Add a lock to the superblock to protect this list and
nest the new lock inside the inode_lock around the list operations
it needs to protect.
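
Until inode_lock goes away, the new lock nests inside it, so all of
the converted walks (drop_caches, quota, fsnotify, wait_sb_inodes)
take the pair in the same order. A sketch of the resulting pattern:

	spin_lock(&inode_lock);
	spin_lock(&sb->s_inodes_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
			continue;
		/* per-inode work */
	}
	spin_unlock(&sb->s_inodes_lock);
	spin_unlock(&inode_lock);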

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/drop_caches.c       |    4 ++++
 fs/fs-writeback.c      |    4 ++++
 fs/inode.c             |   29 +++++++++++++++++++++++------
 fs/notify/inode_mark.c |    3 +++
 fs/quota/dquot.c       |    6 ++++++
 fs/super.c             |    1 +
 include/linux/fs.h     |    1 +
 7 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 10c8c5a..dfe8cb1 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,6 +17,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 	struct inode *inode, *toput_inode = NULL;
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
 			continue;
@@ -25,12 +26,15 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 		spin_lock(&inode_lock);
+		spin_lock(&sb->s_inodes_lock);
 	}
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 	iput(toput_inode);
 }
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1fb5d95..676e048 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1031,6 +1031,7 @@ static void wait_sb_inodes(struct super_block *sb)
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -1050,6 +1051,7 @@ static void wait_sb_inodes(struct super_block *sb)
 		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have
@@ -1067,7 +1069,9 @@ static void wait_sb_inodes(struct super_block *sb)
 		cond_resched();
 
 		spin_lock(&inode_lock);
+		spin_lock(&sb->s_inodes_lock);
 	}
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 	iput(old_inode);
 }
diff --git a/fs/inode.c b/fs/inode.c
index da6b73b..734aadf 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -34,13 +34,18 @@
  *   i_ref
  * inode hash lock protects:
  *   inode hash table, i_hash
+ * sb inode lock protects:
+ *   s_inodes, i_sb_list
  *
  * Lock orders
  * inode_lock
  *   inode hash bucket lock
  *     inode->i_lock
+ *
+ * inode_lock
+ *   sb inode lock
+ *     inode->i_lock
  */
-
 /*
  * This is needed for the following functions:
  *  - inode_has_buffers
@@ -365,9 +370,13 @@ void inode_lru_list_del(struct inode *inode)
 	}
 }
 
-static inline void __inode_sb_list_add(struct inode *inode)
+static void __inode_sb_list_add(struct inode *inode)
 {
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
+	struct super_block *sb = inode->i_sb;
+
+	spin_lock(&sb->s_inodes_lock);
+	list_add(&inode->i_sb_list, &sb->s_inodes);
+	spin_unlock(&sb->s_inodes_lock);
 }
 
 /**
@@ -382,9 +391,13 @@ void inode_sb_list_add(struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
-static inline void __inode_sb_list_del(struct inode *inode)
+static void __inode_sb_list_del(struct inode *inode)
 {
+	struct super_block *sb = inode->i_sb;
+
+	spin_lock(&sb->s_inodes_lock);
 	list_del_init(&inode->i_sb_list);
+	spin_unlock(&sb->s_inodes_lock);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -507,7 +520,8 @@ static void dispose_list(struct list_head *head)
 /*
  * Invalidate all inodes for a device.
  */
-static int invalidate_list(struct list_head *head, struct list_head *dispose)
+static int invalidate_list(struct super_block *sb, struct list_head *head,
+			struct list_head *dispose)
 {
 	struct list_head *next;
 	int busy = 0;
@@ -524,6 +538,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 		 * shrink_icache_memory() away.
 		 */
 		cond_resched_lock(&inode_lock);
+		cond_resched_lock(&sb->s_inodes_lock);
 
 		next = next->next;
 		if (tmp == head)
@@ -562,8 +577,10 @@ int invalidate_inodes(struct super_block *sb)
 
 	down_write(&iprune_sem);
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 	fsnotify_unmount_inodes(&sb->s_inodes);
-	busy = invalidate_list(&sb->s_inodes, &throw_away);
+	busy = invalidate_list(sb, &sb->s_inodes, &throw_away);
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 
 	dispose_list(&throw_away);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 1a4c117..4ed0e43 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -242,6 +242,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 	list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
 		struct inode *need_iput_tmp;
+		struct super_block *sb = inode->i_sb;
 
 		/*
 		 * We cannot iref() an inode in state I_FREEING,
@@ -290,6 +291,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		 * will be added since the umount has begun.  Finally,
 		 * iprune_mutex keeps shrink_icache_memory() away.
 		 */
+		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 
 		if (need_iput_tmp)
@@ -303,5 +305,6 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		iput(inode);
 
 		spin_lock(&inode_lock);
+		spin_lock(&sb->s_inodes_lock);
 	}
 }
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 326df72..7ef5411 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -897,6 +897,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 #endif
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
 			continue;
@@ -912,6 +913,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 
 		iput(old_inode);
@@ -923,7 +925,9 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		 * keep the reference and iput it later. */
 		old_inode = inode;
 		spin_lock(&inode_lock);
+		spin_lock(&sb->s_inodes_lock);
 	}
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 	iput(old_inode);
 
@@ -1006,6 +1010,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 	int reserved = 0;
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
@@ -1019,6 +1024,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 			remove_inode_dquot_ref(inode, type, tofree_head);
 		}
 	}
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
diff --git a/fs/super.c b/fs/super.c
index 8819e3a..c5332e5 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -76,6 +76,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		INIT_LIST_HEAD(&s->s_dentry_lru);
 		init_rwsem(&s->s_umount);
 		mutex_init(&s->s_lock);
+		spin_lock_init(&s->s_inodes_lock);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
 		/*
 		 * The locking rules for s_lock are up to the
diff --git a/include/linux/fs.h b/include/linux/fs.h
index adcbfb9..f222ce8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1347,6 +1347,7 @@ struct super_block {
 #endif
 	const struct xattr_handler **s_xattr;
 
+	spinlock_t		s_inodes_lock;	/* lock for s_inodes */
 	struct list_head	s_inodes;	/* all inodes */
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 #ifdef CONFIG_SMP
-- 
1.7.1



* [PATCH 15/21] fs: split locking of inode writeback and LRU lists
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (13 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 14/21] fs: add a per-superblock lock for the inode list Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21  0:49 ` [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock Dave Chinner
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Given that the inode LRU and IO lists are split apart, they do not
need to be protected by the same lock. So in preparation for removal
of the inode_lock, add new locks for them. The writeback lists are
only ever accessed in the context of a bdi, so add a per-BDI lock to
protect manipulations of these lists.

For the inode LRU, introduce a simple global lock to protect it.
While this could be made per-sb, it is not yet clear what the next
step in optimising/parallelising inode reclaim should be. Rather
than optimise now, leave it as a global list and lock until further
analysis can be done.
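
A sketch of what the new global lock protects (names as introduced
by this patch):

	spin_lock(&inode_lru_lock);
	list_add(&inode->i_lru, &inode_lru);
	spin_unlock(&inode_lru_lock);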

Because the inode can now be on different lists protected by
different locks while it is being freed (i.e. freeing is no longer an
atomic state transition), we need to ensure that the I_FREEING state
flag is set before we start removing inodes from the IO and LRU
lists. If we then race with other threads during freeing, they will
notice that I_FREEING is set and can take appropriate action to
avoid problems.
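
In other words, the disposal path must follow this order (sketch;
note that i_state is still serialised by inode_lock at this point in
the series):

	inode->i_state |= I_FREEING;	/* under inode_lock */

	/* now safe to unlink from the independently locked lists */
	inode_wb_list_del(inode);	/* takes bdi->wb.b_lock */
	inode_lru_list_del(inode);	/* takes inode_lru_lock */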

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/block_dev.c              |    5 +++
 fs/fs-writeback.c           |   51 +++++++++++++++++++++++++++++++++---
 fs/inode.c                  |   61 ++++++++++++++++++++++++++++++++++++++-----
 fs/internal.h               |    5 +++
 include/linux/backing-dev.h |    3 ++
 mm/backing-dev.c            |   18 ++++++++++++
 6 files changed, 132 insertions(+), 11 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 11ad53d..7909775 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -56,10 +56,15 @@ EXPORT_SYMBOL(I_BDEV);
 static void bdev_inode_switch_bdi(struct inode *inode,
 			struct backing_dev_info *dst)
 {
+	struct backing_dev_info *old = inode->i_data.backing_dev_info;
+
 	spin_lock(&inode_lock);
+	bdi_lock_two(old, dst);
 	inode->i_data.backing_dev_info = dst;
 	if (!list_empty(&inode->i_wb_list))
 		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
+	spin_unlock(&old->wb.b_lock);
+	spin_unlock(&dst->wb.b_lock);
 	spin_unlock(&inode_lock);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 676e048..36106e6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -157,6 +157,18 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
 }
 
 /*
+ * Remove the inode from the writeback list it is on.
+ */
+void inode_wb_list_del(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.b_lock);
+	list_del_init(&inode->i_wb_list);
+	spin_unlock(&bdi->wb.b_lock);
+}
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -169,6 +181,7 @@ static void redirty_tail(struct inode *inode)
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 
+	assert_spin_locked(&wb->b_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
@@ -186,6 +199,7 @@ static void requeue_io(struct inode *inode)
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 
+	assert_spin_locked(&wb->b_lock);
 	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
@@ -269,6 +283,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
+	assert_spin_locked(&wb->b_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
 	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
@@ -311,6 +326,7 @@ static void inode_wait_for_writeback(struct inode *inode)
 static int
 writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 {
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct address_space *mapping = inode->i_mapping;
 	unsigned dirty;
 	int ret;
@@ -330,7 +346,9 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
+			spin_lock(&bdi->wb.b_lock);
 			requeue_io(inode);
+			spin_unlock(&bdi->wb.b_lock);
 			return 0;
 		}
 
@@ -385,6 +403,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * sometimes bales out without doing anything.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
+			spin_lock(&bdi->wb.b_lock);
 			if (wbc->nr_to_write <= 0) {
 				/*
 				 * slice used up: queue for next turn
@@ -400,6 +419,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 				 */
 				redirty_tail(inode);
 			}
+			spin_unlock(&bdi->wb.b_lock);
 		} else if (inode->i_state & I_DIRTY) {
 			/*
 			 * Filesystems can dirty the inode during writeback
@@ -407,7 +427,9 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * submission or metadata updates after data IO
 			 * completion.
 			 */
+			spin_lock(&bdi->wb.b_lock);
 			redirty_tail(inode);
+			spin_unlock(&bdi->wb.b_lock);
 		} else {
 			/*
 			 * The inode is clean. If it is unused, then make sure
@@ -415,7 +437,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * does not move dirty inodes to the LRU and dirty
 			 * inodes are removed from the LRU during scanning.
 			 */
-			list_del_init(&inode->i_wb_list);
+			inode_wb_list_del(inode);
 			if (!inode->i_ref)
 				inode_lru_list_add(inode);
 		}
@@ -463,6 +485,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
 static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		struct writeback_control *wbc, bool only_this_sb)
 {
+	assert_spin_locked(&wb->b_lock);
 	while (!list_empty(&wb->b_io)) {
 		long pages_skipped;
 		struct inode *inode = list_entry(wb->b_io.prev,
@@ -478,7 +501,6 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 				redirty_tail(inode);
 				continue;
 			}
-
 			/*
 			 * The inode belongs to a different superblock.
 			 * Bounce back to the caller to unpin this and
@@ -487,7 +509,15 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			return 0;
 		}
 
-		if (inode->i_state & (I_NEW | I_WILL_FREE)) {
+		/*
+		 * We can see I_FREEING here when the inode is in the process of
+		 * being reclaimed. In that case the freer is waiting on the
+		 * wb->b_lock that we currently hold to remove the inode from
+		 * the writeback list. So we don't spin on it here, requeue it
+		 * and move on to the next inode, which will allow the other
+		 * thread to free the inode when we drop the lock.
+		 */
+		if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
 			requeue_io(inode);
 			continue;
 		}
@@ -498,10 +528,11 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		if (inode_dirtied_after(inode, wbc->wb_start))
 			return 1;
 
-		BUG_ON(inode->i_state & I_FREEING);
 		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
+		spin_unlock(&wb->b_lock);
+
 		pages_skipped = wbc->pages_skipped;
 		writeback_single_inode(inode, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
@@ -509,12 +540,15 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
+			spin_lock(&wb->b_lock);
 			redirty_tail(inode);
+			spin_unlock(&wb->b_lock);
 		}
 		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
 		spin_lock(&inode_lock);
+		spin_lock(&wb->b_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
 			return 1;
@@ -534,6 +568,8 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
+	spin_lock(&wb->b_lock);
+
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 
@@ -552,6 +588,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 		if (ret)
 			break;
 	}
+	spin_unlock(&wb->b_lock);
 	spin_unlock(&inode_lock);
 	/* Leave any unwritten inodes on b_io */
 }
@@ -562,9 +599,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	spin_lock(&inode_lock);
+	spin_lock(&wb->b_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
+	spin_unlock(&wb->b_lock);
 	spin_unlock(&inode_lock);
 }
 
@@ -677,8 +716,10 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 */
 		spin_lock(&inode_lock);
 		if (!list_empty(&wb->b_more_io))  {
+			spin_lock(&wb->b_lock);
 			inode = list_entry(wb->b_more_io.prev,
 						struct inode, i_wb_list);
+			spin_unlock(&wb->b_lock);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			inode_wait_for_writeback(inode);
 		}
@@ -991,8 +1032,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 					wakeup_bdi = true;
 			}
 
+			spin_lock(&bdi->wb.b_lock);
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+			spin_unlock(&bdi->wb.b_lock);
 		}
 	}
 out:
diff --git a/fs/inode.c b/fs/inode.c
index 734aadf..959c8b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -25,6 +25,8 @@
 #include <linux/async.h>
 #include <linux/posix_acl.h>
 
+#include "internal.h"
+
 /*
  * Locking rules.
  *
@@ -36,6 +38,10 @@
  *   inode hash table, i_hash
  * sb inode lock protects:
  *   s_inodes, i_sb_list
+ * bdi writeback lock protects:
+ *   b_io, b_more_io, b_dirty, i_wb_list
+ * inode_lru_lock protects:
+ *   inode_lru, i_lru
  *
  * Lock orders
  * inode_lock
@@ -44,6 +50,17 @@
  *
  * inode_lock
  *   sb inode lock
+ *     inode_lru_lock
+ *       wb->b_lock
+ *         inode->i_lock
+ *
+ * inode_lock
+ *   wb->b_lock
+ *     sb_lock (pin sb for writeback)
+ *     inode->i_lock
+ *
+ * inode_lock
+ *   inode_lru_lock
  *     inode->i_lock
  */
 /*
@@ -94,6 +111,7 @@ static struct hlist_bl_head *inode_hashtable __read_mostly;
  * allowing for low-overhead inode sync() operations.
  */
 static LIST_HEAD(inode_lru);
+static DEFINE_SPINLOCK(inode_lru_lock);
 
 /*
  * A simple spinlock to protect the list manipulations.
@@ -354,20 +372,28 @@ void iref(struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(iref);
 
+/*
+ * Check against I_FREEING as inode writeback completion could race with
+ * setting I_FREEING and removing the inode from the LRU.
+ */
 void inode_lru_list_add(struct inode *inode)
 {
-	if (list_empty(&inode->i_lru)) {
+	spin_lock(&inode_lru_lock);
+	if (list_empty(&inode->i_lru) && !(inode->i_state & I_FREEING)) {
 		list_add(&inode->i_lru, &inode_lru);
 		percpu_counter_inc(&nr_inodes_unused);
 	}
+	spin_unlock(&inode_lru_lock);
 }
 
 void inode_lru_list_del(struct inode *inode)
 {
+	spin_lock(&inode_lru_lock);
 	if (!list_empty(&inode->i_lru)) {
 		list_del_init(&inode->i_lru);
 		percpu_counter_dec(&nr_inodes_unused);
 	}
+	spin_unlock(&inode_lru_lock);
 }
 
 static void __inode_sb_list_add(struct inode *inode)
@@ -552,8 +578,18 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
 			spin_unlock(&inode->i_lock);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
+
+			/*
+			 * move the inode off the IO lists and LRU once
+			 * I_FREEING is set so that it won't get moved back on
+			 * there if it is dirty.
+			 */
+			inode_wb_list_del(inode);
+
+			spin_lock(&inode_lru_lock);
 			list_move(&inode->i_lru, dispose);
 			percpu_counter_dec(&nr_inodes_unused);
+			spin_unlock(&inode_lru_lock);
 			continue;
 		}
 		spin_unlock(&inode->i_lock);
@@ -614,6 +650,7 @@ static void prune_icache(int nr_to_scan)
 
 	down_read(&iprune_sem);
 	spin_lock(&inode_lock);
+	spin_lock(&inode_lru_lock);
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
 
@@ -644,6 +681,7 @@ static void prune_icache(int nr_to_scan)
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
 			inode->i_ref++;
 			spin_unlock(&inode->i_lock);
+			spin_unlock(&inode_lru_lock);
 			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
@@ -661,19 +699,28 @@ static void prune_icache(int nr_to_scan)
 			 * same. Either way, we won't spin on it in this loop.
 			 */
 			spin_lock(&inode_lock);
+			spin_lock(&inode_lru_lock);
 			continue;
 		}
 		spin_unlock(&inode->i_lock);
-		list_move(&inode->i_lru, &freeable);
-		list_del_init(&inode->i_wb_list);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
+
+		/*
+		 * move the inode off the IO lists and LRU once
+		 * I_FREEING is set so that it won't get moved back on
+		 * there if it is dirty.
+		 */
+		inode_wb_list_del(inode);
+
+		list_move(&inode->i_lru, &freeable);
 		percpu_counter_dec(&nr_inodes_unused);
 	}
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
 		__count_vm_events(PGINODESTEAL, reap);
+	spin_unlock(&inode_lru_lock);
 	spin_unlock(&inode_lock);
 
 	dispose_list(&freeable);
@@ -1391,15 +1438,15 @@ static void iput_final(struct inode *inode)
 		inode->i_state &= ~I_WILL_FREE;
 		__remove_inode_hash(inode);
 	}
-	list_del_init(&inode->i_wb_list);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 
 	/*
-	 * After we delete the inode from the LRU here, we avoid moving dirty
-	 * inodes back onto the LRU now because I_FREEING is set and hence
-	 * writeback_single_inode() won't move the inode around.
+	 * After we delete the inode from the LRU and IO lists here, we avoid
+	 * moving dirty inodes back onto the LRU now because I_FREEING is set
+	 * and hence writeback_single_inode() won't move the inode around.
 	 */
+	inode_wb_list_del(inode);
 	inode_lru_list_del(inode);
 
 	__inode_sb_list_del(inode);
diff --git a/fs/internal.h b/fs/internal.h
index ece3565..f8825ae 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -107,3 +107,8 @@ extern void release_open_intent(struct nameidata *);
  */
 extern void inode_lru_list_add(struct inode *inode);
 extern void inode_lru_list_del(struct inode *inode);
+
+/*
+ * fs-writeback.c
+ */
+extern void inode_wb_list_del(struct inode *inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 35b0074..995a3ad 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -57,6 +57,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	spinlock_t b_lock;		/* writeback lists lock */
 };
 
 struct backing_dev_info {
@@ -108,6 +109,8 @@ int bdi_writeback_thread(void *data);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_arm_supers_timer(void);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_lock_two(struct backing_dev_info *bdi1,
+				struct backing_dev_info *bdi2);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 15d5097..52442bd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -74,12 +74,14 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
 	spin_lock(&inode_lock);
+	spin_lock(&wb->b_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
 	list_for_each_entry(inode, &wb->b_io, i_wb_list)
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
+	spin_unlock(&wb->b_lock);
 	spin_unlock(&inode_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -634,6 +636,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	spin_lock_init(&wb->b_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
 
@@ -671,6 +674,18 @@ err:
 }
 EXPORT_SYMBOL(bdi_init);
 
+void bdi_lock_two(struct backing_dev_info *bdi1,
+				struct backing_dev_info *bdi2)
+{
+	if (bdi1 < bdi2) {
+		spin_lock(&bdi1->wb.b_lock);
+		spin_lock_nested(&bdi2->wb.b_lock, 1);
+	} else {
+		spin_lock(&bdi2->wb.b_lock);
+		spin_lock_nested(&bdi1->wb.b_lock, 1);
+	}
+}
+
 void bdi_destroy(struct backing_dev_info *bdi)
 {
 	int i;
@@ -683,9 +698,12 @@ void bdi_destroy(struct backing_dev_info *bdi)
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 
 		spin_lock(&inode_lock);
+		bdi_lock_two(bdi, &default_backing_dev_info);
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+		spin_unlock(&bdi->wb.b_lock);
+		spin_unlock(&dst->b_lock);
 		spin_unlock(&inode_lock);
 	}
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (14 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 15/21] fs: split locking of inode writeback and LRU lists Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-22  1:56   ` Al Viro
  2010-10-21  0:49 ` [PATCH 17/21] fs: protect wake_up_inode with inode->i_lock Dave Chinner
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

We currently protect the per-inode state flags with the inode_lock.
Using a global lock to protect per-object state is overkill when we
could use a per-inode lock to protect the state.  Use inode->i_lock
for this, and wrap all the i_state changes and checks with it.
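
The conversion follows the same pattern throughout: take
inode->i_lock before inspecting or modifying i_state, and drop it
before anything that can block. A minimal sketch of the shape most
call sites take (the flags tested and the action taken on a busy
inode vary by call site):

	spin_lock(&inode->i_lock);
	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
		spin_unlock(&inode->i_lock);
		continue;	/* or requeue/bail out, per call site */
	}
	inode->i_ref++;
	spin_unlock(&inode->i_lock);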

Based on work originally written by Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/drop_caches.c       |    9 +++--
 fs/fs-writeback.c      |   48 +++++++++++++++++++++------
 fs/inode.c             |   85 ++++++++++++++++++++++++++++++++++--------------
 fs/nilfs2/gcdat.c      |    1 +
 fs/notify/inode_mark.c |    6 ++-
 fs/quota/dquot.c       |   12 +++---
 6 files changed, 113 insertions(+), 48 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index dfe8cb1..f958dd8 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -19,11 +19,12 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
-			continue;
-		if (inode->i_mapping->nrpages == 0)
-			continue;
 		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    (inode->i_mapping->nrpages == 0)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 36106e6..807d936 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -304,10 +304,12 @@ static void inode_wait_for_writeback(struct inode *inode)
 	wait_queue_head_t *wqh;
 
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
-	 while (inode->i_state & I_SYNC) {
+	while (inode->i_state & I_SYNC) {
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
 		spin_lock(&inode_lock);
+		spin_lock(&inode->i_lock);
 	}
 }
 
@@ -331,6 +333,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	unsigned dirty;
 	int ret;
 
+	spin_lock(&inode->i_lock);
 	if (!inode->i_ref)
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
@@ -346,6 +349,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
+			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.b_lock);
 			requeue_io(inode);
 			spin_unlock(&bdi->wb.b_lock);
@@ -363,6 +367,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	/* Set I_SYNC, reset I_DIRTY_PAGES */
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
@@ -384,8 +389,10 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	 * write_inode()
 	 */
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -395,6 +402,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	}
 
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
@@ -403,6 +411,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * sometimes bales out without doing anything.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
+			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.b_lock);
 			if (wbc->nr_to_write <= 0) {
 				/*
@@ -427,6 +436,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * submission or metadata updates after data IO
 			 * completion.
 			 */
+			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.b_lock);
 			redirty_tail(inode);
 			spin_unlock(&bdi->wb.b_lock);
@@ -437,10 +447,15 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * does not move dirty inodes to the LRU and dirty
 			 * inodes are removed from the LRU during scanning.
 			 */
+			int unused = inode->i_ref == 0;
+			spin_unlock(&inode->i_lock);
 			inode_wb_list_del(inode);
-			if (!inode->i_ref)
+			if (unused)
 				inode_lru_list_add(inode);
 		}
+	} else {
+		/* freer will clean up */
+		spin_unlock(&inode->i_lock);
 	}
 	inode_sync_complete(inode);
 	return ret;
@@ -517,7 +532,9 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		 * and move on to the next inode, which will allow the other
 		 * thread to free the inode when we drop the lock.
 		 */
+		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
+			spin_unlock(&inode->i_lock);
 			requeue_io(inode);
 			continue;
 		}
@@ -525,10 +542,11 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		 * Was this inode dirtied after sync_sb_inodes was called?
 		 * This keeps sync from extra jobs and livelock.
 		 */
-		if (inode_dirtied_after(inode, wbc->wb_start))
+		if (inode_dirtied_after(inode, wbc->wb_start)) {
+			spin_unlock(&inode->i_lock);
 			return 1;
+		}
 
-		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->b_lock);
@@ -719,9 +737,11 @@ static long wb_writeback(struct bdi_writeback *wb,
 			spin_lock(&wb->b_lock);
 			inode = list_entry(wb->b_more_io.prev,
 						struct inode, i_wb_list);
+			spin_lock(&inode->i_lock);
 			spin_unlock(&wb->b_lock);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			inode_wait_for_writeback(inode);
+			spin_unlock(&inode->i_lock);
 		}
 		spin_unlock(&inode_lock);
 	}
@@ -987,6 +1007,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		block_dump___mark_inode_dirty(inode);
 
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
 
@@ -998,7 +1019,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 * superblock list, based upon its state.
 		 */
 		if (inode->i_state & I_SYNC)
-			goto out;
+			goto out_unlock;
 
 		/*
 		 * Only add valid (hashed) inodes to the superblock's
@@ -1006,10 +1027,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 */
 		if (!S_ISBLK(inode->i_mode)) {
 			if (inode_unhashed(inode))
-				goto out;
+				goto out_unlock;
 		}
 		if (inode->i_state & I_FREEING)
-			goto out;
+			goto out_unlock;
 
 		/*
 		 * If the inode was already on b_dirty/b_io/b_more_io, don't
@@ -1032,12 +1053,16 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 					wakeup_bdi = true;
 			}
 
+			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.b_lock);
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
 			spin_unlock(&bdi->wb.b_lock);
+			goto out;
 		}
 	}
+out_unlock:
+	spin_unlock(&inode->i_lock);
 out:
 	spin_unlock(&inode_lock);
 
@@ -1086,12 +1111,13 @@ static void wait_sb_inodes(struct super_block *sb)
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		struct address_space *mapping;
 
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
-			continue;
+		spin_lock(&inode->i_lock);
 		mapping = inode->i_mapping;
-		if (mapping->nrpages == 0)
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    mapping->nrpages == 0) {
+			spin_unlock(&inode->i_lock);
 			continue;
-		spin_lock(&inode->i_lock);
+		}
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
diff --git a/fs/inode.c b/fs/inode.c
index 959c8b5..32bac58 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -33,7 +33,7 @@
  * inode->i_lock is *always* the innermost lock.
  *
  * inode->i_lock protects:
- *   i_ref
+ *   i_ref i_state
  * inode hash lock protects:
  *   inode hash table, i_hash
  * sb inode lock protects:
@@ -178,7 +178,7 @@ int proc_nr_inodes(ctl_table *table, int write,
 static void wake_up_inode(struct inode *inode)
 {
 	/*
-	 * Prevent speculative execution through spin_unlock(&inode_lock);
+	 * Prevent speculative execution through spin_unlock(&inode->i_lock);
 	 */
 	smp_mb();
 	wake_up_bit(&inode->i_state, __I_NEW);
@@ -495,7 +495,9 @@ void end_writeback(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
 	inode_sync_wait(inode);
+	spin_lock(&inode->i_lock);
 	inode->i_state = I_FREEING | I_CLEAR;
+	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(end_writeback);
 
@@ -570,14 +572,16 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
 		if (tmp == head)
 			break;
 		inode = list_entry(tmp, struct inode, i_sb_list);
-		if (inode->i_state & I_NEW)
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & I_NEW) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		invalidate_inode_buffers(inode);
-		spin_lock(&inode->i_lock);
 		if (!inode->i_ref) {
-			spin_unlock(&inode->i_lock);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
+			spin_unlock(&inode->i_lock);
 
 			/*
 			 * move the inode off the IO lists and LRU once
@@ -673,9 +677,9 @@ static void prune_icache(int nr_to_scan)
 
 		/* recently referenced inodes get one more pass */
 		if (inode->i_state & I_REFERENCED) {
+			inode->i_state &= ~I_REFERENCED;
 			spin_unlock(&inode->i_lock);
 			list_move(&inode->i_lru, &inode_lru);
-			inode->i_state &= ~I_REFERENCED;
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
@@ -702,9 +706,9 @@ static void prune_icache(int nr_to_scan)
 			spin_lock(&inode_lru_lock);
 			continue;
 		}
-		spin_unlock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
+		spin_unlock(&inode->i_lock);
 
 		/*
 		 * move the inode off the IO lists and LRU once
@@ -758,9 +762,6 @@ static struct shrinker icache_shrinker = {
 
 static void __wait_on_freeing_inode(struct inode *inode);
 
-/*
- * Called with the inode lock held.
- */
 static struct inode *find_inode(struct super_block *sb,
 				struct hlist_bl_head *b,
 				int (*test)(struct inode *, void *),
@@ -773,15 +774,17 @@ repeat:
 	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
-		if (!test(inode, data))
+		spin_lock(&inode->i_lock);
+		if (!test(inode, data)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
 			hlist_bl_unlock(b);
 			__wait_on_freeing_inode(inode);
 			hlist_bl_lock(b);
 			goto repeat;
 		}
-		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -805,13 +808,13 @@ repeat:
 			continue;
 		if (inode->i_sb != sb)
 			continue;
+		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
 			hlist_bl_unlock(b);
 			__wait_on_freeing_inode(inode);
 			hlist_bl_lock(b);
 			goto repeat;
 		}
-		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -846,9 +849,13 @@ struct inode *new_inode(struct super_block *sb)
 	inode = alloc_inode(sb);
 	if (inode) {
 		spin_lock(&inode_lock);
-		__inode_sb_list_add(inode);
+		/*
+		 * set the inode state before we make the inode accessible to
+		 * the outside world.
+		 */
 		inode->i_ino = ++last_ino;
 		inode->i_state = 0;
+		__inode_sb_list_add(inode);
 		spin_unlock(&inode_lock);
 	}
 	return inode;
@@ -916,10 +923,14 @@ static struct inode *get_new_inode(struct super_block *sb,
 			if (set(inode, data))
 				goto set_failed;
 
+			/*
+			 * Set the inode state before we make the inode
+			 * visible to the outside world.
+			 */
+			inode->i_state = I_NEW;
 			hlist_bl_add_head(&inode->i_hash, b);
 			hlist_bl_unlock(b);
 			__inode_sb_list_add(inode);
-			inode->i_state = I_NEW;
 			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -966,11 +977,15 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		/* We released the lock, so.. */
 		old = find_inode_fast(sb, b, ino);
 		if (!old) {
+			/*
+			 * Set the inode state before we make the inode
+			 * visible to the outside world.
+			 */
 			inode->i_ino = ino;
+			inode->i_state = I_NEW;
 			hlist_bl_add_head(&inode->i_hash, b);
 			hlist_bl_unlock(b);
 			__inode_sb_list_add(inode);
-			inode->i_state = I_NEW;
 			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -1313,8 +1328,11 @@ int insert_inode_locked(struct inode *inode)
 				continue;
 			if (old->i_sb != sb)
 				continue;
-			if (old->i_state & (I_FREEING|I_WILL_FREE))
+			spin_lock(&old->i_lock);
+			if (old->i_state & (I_FREEING|I_WILL_FREE)) {
+				spin_unlock(&old->i_lock);
 				continue;
+			}
 			break;
 		}
 		if (likely(!node)) {
@@ -1323,7 +1341,6 @@ int insert_inode_locked(struct inode *inode)
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		spin_lock(&old->i_lock);
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
 		hlist_bl_unlock(b);
@@ -1344,6 +1361,10 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 	struct super_block *sb = inode->i_sb;
 	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 
+	/*
+	 * Nobody else can see the new inode yet, so it is safe to set flags
+	 * without locking here.
+	 */
 	inode->i_state |= I_NEW;
 
 	while (1) {
@@ -1357,8 +1378,11 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 				continue;
 			if (!test(old, data))
 				continue;
-			if (old->i_state & (I_FREEING|I_WILL_FREE))
+			spin_lock(&old->i_lock);
+			if (old->i_state & (I_FREEING|I_WILL_FREE)) {
+				spin_unlock(&old->i_lock);
 				continue;
+			}
 			break;
 		}
 		if (likely(!node)) {
@@ -1367,7 +1391,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		spin_lock(&old->i_lock);
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
 		hlist_bl_unlock(b);
@@ -1416,6 +1439,8 @@ static void iput_final(struct inode *inode)
 	const struct super_operations *op = inode->i_sb->s_op;
 	int drop;
 
+	assert_spin_locked(&inode->i_lock);
+
 	if (op && op->drop_inode)
 		drop = op->drop_inode(inode);
 	else
@@ -1424,22 +1449,30 @@ static void iput_final(struct inode *inode)
 	if (!drop) {
 		if (sb->s_flags & MS_ACTIVE) {
 			inode->i_state |= I_REFERENCED;
-			if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+			if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
+			    list_empty(&inode->i_lru)) {
+				spin_unlock(&inode->i_lock);
 				inode_lru_list_add(inode);
+				return;
+			}
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 			return;
 		}
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_WILL_FREE;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		write_inode_now(inode, 1);
 		spin_lock(&inode_lock);
+		__remove_inode_hash(inode);
+		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		__remove_inode_hash(inode);
 	}
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
+	spin_unlock(&inode->i_lock);
 
 	/*
 	 * After we delete the inode from the LRU and IO lists here, we avoid
@@ -1470,12 +1503,11 @@ static void iput_final(struct inode *inode)
 void iput(struct inode *inode)
 {
 	if (inode) {
-		BUG_ON(inode->i_state & I_CLEAR);
-
 		spin_lock(&inode_lock);
 		spin_lock(&inode->i_lock);
+		BUG_ON(inode->i_state & I_CLEAR);
+
 		if (--inode->i_ref == 0) {
-			spin_unlock(&inode->i_lock);
 			iput_final(inode);
 			return;
 		}
@@ -1661,6 +1693,8 @@ EXPORT_SYMBOL(inode_wait);
  * wake_up_inode() after removing from the hash list will DTRT.
  *
  * This is called with inode_lock held.
+ *
+ * Called with i_lock held and returns with it dropped.
  */
 static void __wait_on_freeing_inode(struct inode *inode)
 {
@@ -1668,6 +1702,7 @@ static void __wait_on_freeing_inode(struct inode *inode)
 	DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	schedule();
 	finish_wait(wq, &wait.wait);
diff --git a/fs/nilfs2/gcdat.c b/fs/nilfs2/gcdat.c
index 84a45d1..c51f0e8 100644
--- a/fs/nilfs2/gcdat.c
+++ b/fs/nilfs2/gcdat.c
@@ -27,6 +27,7 @@
 #include "page.h"
 #include "mdt.h"
 
+/* XXX: what protects i_state? */
 int nilfs_init_gcdat_inode(struct the_nilfs *nilfs)
 {
 	struct inode *dat = nilfs->ns_dat, *gcdat = nilfs->ns_gc_dat;
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 4ed0e43..203146b 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -249,8 +249,11 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
 		 * the inode cannot have any associated watches.
 		 */
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 
 		/*
 		 * If i_ref is zero, the inode cannot have any watches and
@@ -258,7 +261,6 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		 * evict all inodes with zero i_ref from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
-		spin_lock(&inode->i_lock);
 		if (!inode->i_ref) {
 			spin_unlock(&inode->i_lock);
 			continue;
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 7ef5411..b02a3e1 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -899,18 +899,18 @@ static void add_dquot_ref(struct super_block *sb, int type)
 	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    !atomic_read(&inode->i_writecount) ||
+		    !dqinit_needed(inode, type)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
 			reserved = 1;
 #endif
-		if (!atomic_read(&inode->i_writecount))
-			continue;
-		if (!dqinit_needed(inode, type))
-			continue;
 
-		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 17/21] fs: protect wake_up_inode with inode->i_lock
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (15 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21  2:17   ` Christoph Hellwig
  2010-10-21  0:49 ` [PATCH 18/21] fs: introduce a per-cpu last_ino allocator Dave Chinner
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Currently we have an optimisation in place in unlock_new_inode() to
avoid taking the inode_lock. This uses memory barriers to ensure
that the clearing of I_NEW can be seen by all other CPUs. It also
means the wake_up_inode() call relies on a memory barrier to ensure
that the processes it is waking see any changes made to the i_state
field prior to the wakeup being issued.
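
For reference, the waker side of that optimisation boils down to the
barrier pairing below (as removed by the diff in this patch):

	smp_mb();	/* initialisation stores visible before clearing I_NEW */
	inode->i_state &= ~I_NEW;
	smp_mb();	/* i_state change visible before the wakeup is issued */
	wake_up_bit(&inode->i_state, __I_NEW);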

This serialisation is also necessary to protect the inode against
lookup/freeing races once the inode_lock is removed. The current
lookup and wait code is atomic as it is all performed under the
inode_lock, while the modified code now splits the locks. Hence we
can get the following race:

Thread 1:				Thread 2:
					iput_final
					spin_lock(&inode->i_lock);
hlist_bl_lock()
  Find inode
    spin_lock(&inode->i_lock)
					......
					inode->i_state = I_FREEING;
					spin_unlock(&inode->i_lock);
					remove_inode_hash()
					   hlist_bl_lock()
    ......
    if (I_FREEING)
	hlist_bl_unlock()
					   ......
					   hlist_bl_del_init()
					   hlist_bl_unlock()
					wake_up_inode()

	__wait_on_freeing_inode()
	  put on waitqueue
	  spin_unlock(&inode->i_lock);
	  schedule()

To avoid this race, wake ups need to be serialised against the
waiters, and using inode->i_lock is the natural solution.
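
With the wakeup issued under the same lock the waiter holds while
queueing itself, that window is closed. The waker side becomes:

	spin_lock(&inode->i_lock);
	wake_up_bit(&inode->i_state, __I_NEW);
	spin_unlock(&inode->i_lock);

and __wait_on_freeing_inode() only drops inode->i_lock after it has
put itself on the waitqueue, so it either sees I_FREEING before the
freer takes the lock or is already queued when the wakeup fires.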

The memory barrier optimisation is no longer needed to avoid the
global inode_lock contention as the inode->i_state field is now
protected by inode->i_lock. Hence we can revert the code to a much
simpler form and correctly serialise wake ups against
__wait_on_freeing_inode() by holding the inode->i_lock while we do
the wakeup.

Given the newfound simplicity of wake_up_inode() and the fact that
we need to change i_state in unlock_new_inode() before the wakeup,
just open code the wakeup in the couple of spots it is used and
remove wake_up_inode() entirely.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c |   41 ++++++++++++++++++-----------------------
 1 files changed, 18 insertions(+), 23 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 32bac58..ba514a1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -175,15 +175,6 @@ int proc_nr_inodes(ctl_table *table, int write,
 }
 #endif
 
-static void wake_up_inode(struct inode *inode)
-{
-	/*
-	 * Prevent speculative execution through spin_unlock(&inode->i_lock);
-	 */
-	smp_mb();
-	wake_up_bit(&inode->i_state, __I_NEW);
-}
-
 /**
  * inode_init_always - perform inode structure intialisation
  * @sb: superblock inode belongs to
@@ -540,7 +531,9 @@ static void dispose_list(struct list_head *head)
 		__inode_sb_list_del(inode);
 		spin_unlock(&inode_lock);
 
-		wake_up_inode(inode);
+		spin_lock(&inode->i_lock);
+		wake_up_bit(&inode->i_state, __I_NEW);
+		spin_unlock(&inode->i_lock);
 		destroy_inode(inode);
 	}
 }
@@ -862,6 +855,13 @@ struct inode *new_inode(struct super_block *sb)
 }
 EXPORT_SYMBOL(new_inode);
 
+/**
+ * unlock_new_inode - clear the I_NEW state and wake up any waiters
+ * @inode:	new inode to unlock
+ *
+ * Called when the inode is fully initialised to clear the new state of the
+ * inode and wake up anyone waiting for the inode to finish initialisation.
+ */
 void unlock_new_inode(struct inode *inode)
 {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -881,19 +881,11 @@ void unlock_new_inode(struct inode *inode)
 		}
 	}
 #endif
-	/*
-	 * This is special!  We do not need the spinlock when clearing I_NEW,
-	 * because we're guaranteed that nobody else tries to do anything about
-	 * the state of the inode when it is locked, as we just created it (so
-	 * there can be no old holders that haven't tested I_NEW).
-	 * However we must emit the memory barrier so that other CPUs reliably
-	 * see the clearing of I_NEW after the other inode initialisation has
-	 * completed.
-	 */
-	smp_mb();
+	spin_lock(&inode->i_lock);
 	WARN_ON(!(inode->i_state & I_NEW));
 	inode->i_state &= ~I_NEW;
-	wake_up_inode(inode);
+	wake_up_bit(&inode->i_state, __I_NEW);
+	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(unlock_new_inode);
 
@@ -1486,8 +1478,10 @@ static void iput_final(struct inode *inode)
 	spin_unlock(&inode_lock);
 	evict(inode);
 	remove_inode_hash(inode);
-	wake_up_inode(inode);
+	spin_lock(&inode->i_lock);
+	wake_up_bit(&inode->i_state, __I_NEW);
 	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
+	spin_unlock(&inode->i_lock);
 	destroy_inode(inode);
 }
 
@@ -1690,7 +1684,8 @@ EXPORT_SYMBOL(inode_wait);
  * to recheck inode state.
  *
  * It doesn't matter if I_NEW is not set initially, a call to
- * wake_up_inode() after removing from the hash list will DTRT.
+ * wake_up_bit(&inode->i_state, __I_NEW) with the i_lock held after removing
+ * from the hash list will DTRT.
  *
  * This is called with inode_lock held.
  *
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 18/21] fs: introduce a per-cpu last_ino allocator
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (16 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 17/21] fs: protect wake_up_inode with inode->i_lock Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21  0:49 ` [PATCH 19/21] fs: icache remove inode_lock Dave Chinner
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Eric Dumazet <eric.dumazet@gmail.com>

new_inode() dirties a contended cache line to get increasing
inode numbers. This limits performance on workloads that cause
significant parallel inode allocation.

Solve this problem by using a per_cpu variable fed by the shared
last_ino in batches of 1024 allocations.  This reduces contention on
the shared last_ino and gives the same spread of inode numbers as
before (i.e. the same wraparound after 2^32 allocations).
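
To illustrate the batching, here is a user-space sketch of the same
logic, using C11 atomics and a thread-local variable standing in for
the per-cpu one (illustrative only; the kernel version is in the
diff below):

	#include <stdatomic.h>

	#define LAST_INO_BATCH	1024

	static atomic_uint shared_last_ino;
	static _Thread_local unsigned int last_ino;

	static unsigned int get_next_ino(void)
	{
		unsigned int res = last_ino;

		/*
		 * Once per LAST_INO_BATCH calls, grab a fresh range
		 * from the shared counter; every other call touches
		 * only thread-local data.
		 */
		if ((res & (LAST_INO_BATCH - 1)) == 0)
			res = atomic_fetch_add(&shared_last_ino,
					       LAST_INO_BATCH);

		last_ino = ++res;
		return res;
	}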

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/inode.c |   45 ++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index ba514a1..b33b57c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -815,6 +815,43 @@ repeat:
 	return NULL;
 }
 
+/*
+ * Each cpu owns a range of LAST_INO_BATCH numbers.
+ * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
+ * to renew the exhausted range.
+ *
+ * This does not significantly increase overflow rate because every CPU can
+ * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
+ * NR_CPUS*(LAST_INO_BATCH-1) wastage. At NR_CPUS=4096 and LAST_INO_BATCH=1024
+ * this is ~0.1% of the 2^32 range in the worst case. Even a 50% wastage would
+ * only increase the overflow rate by 2x, which does not seem too significant.
+ *
+ * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
+ * error if st_ino won't fit in target struct field. Use 32bit counter
+ * here to attempt to avoid that.
+ */
+#define LAST_INO_BATCH 1024
+static DEFINE_PER_CPU(unsigned int, last_ino);
+
+static unsigned int get_next_ino(void)
+{
+	unsigned int *p = &get_cpu_var(last_ino);
+	unsigned int res = *p;
+
+#ifdef CONFIG_SMP
+	if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
+		static atomic_t shared_last_ino;
+		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
+
+		res = next - LAST_INO_BATCH;
+	}
+#endif
+
+	*p = ++res;
+	put_cpu_var(last_ino);
+	return res;
+}
+
 /**
  *	new_inode 	- obtain an inode
  *	@sb: superblock
@@ -829,12 +866,6 @@ repeat:
  */
 struct inode *new_inode(struct super_block *sb)
 {
-	/*
-	 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
-	 * error if st_ino won't fit in target struct field. Use 32bit counter
-	 * here to attempt to avoid that.
-	 */
-	static unsigned int last_ino;
 	struct inode *inode;
 
 	spin_lock_prefetch(&inode_lock);
@@ -846,7 +877,7 @@ struct inode *new_inode(struct super_block *sb)
 		 * set the inode state before we make the inode accessible to
 		 * the outside world.
 		 */
-		inode->i_ino = ++last_ino;
+		inode->i_ino = get_next_ino();
 		inode->i_state = 0;
 		__inode_sb_list_add(inode);
 		spin_unlock(&inode_lock);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 19/21] fs: icache remove inode_lock
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (17 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 18/21] fs: introduce a per-cpu last_ino allocator Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21  2:14   ` Christian Stroetmann
  2010-10-21  0:49 ` [PATCH 20/21] fs: Reduce inode I_FREEING and factor inode disposal Dave Chinner
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

All the functionality that the inode_lock protected has now been
wrapped up in new independent locks and/or functionality. The
inode_lock therefore no longer serves any purpose and can now be
removed.
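
The shape of the change is the same at every call site: delete the
enclosing inode_lock critical section and rely on whichever of the
new locks now protects that data. Taking iref() from the diff below:

	/* before */
	spin_lock(&inode_lock);
	spin_lock(&inode->i_lock);
	inode->i_ref++;
	spin_unlock(&inode->i_lock);
	spin_unlock(&inode_lock);

	/* after */
	spin_lock(&inode->i_lock);
	inode->i_ref++;
	spin_unlock(&inode->i_lock);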

Based on work originally done by Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/filesystems/Locking |    2 +-
 Documentation/filesystems/porting |    8 ++-
 Documentation/filesystems/vfs.txt |    2 +-
 fs/block_dev.c                    |    2 -
 fs/buffer.c                       |    2 +-
 fs/drop_caches.c                  |    4 -
 fs/fs-writeback.c                 |   85 ++++++----------------
 fs/inode.c                        |  147 ++++++++-----------------------------
 fs/logfs/inode.c                  |    2 +-
 fs/notify/inode_mark.c            |   10 +--
 fs/notify/mark.c                  |    1 -
 fs/notify/vfsmount_mark.c         |    1 -
 fs/ntfs/inode.c                   |    4 +-
 fs/ocfs2/inode.c                  |    2 +-
 fs/quota/dquot.c                  |   12 +--
 include/linux/fs.h                |    2 +-
 include/linux/writeback.h         |    2 -
 mm/backing-dev.c                  |    4 -
 mm/filemap.c                      |    6 +-
 mm/rmap.c                         |    6 +-
 20 files changed, 81 insertions(+), 223 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 2db4283..7f98cd5 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -114,7 +114,7 @@ alloc_inode:
 destroy_inode:
 dirty_inode:				(must not sleep)
 write_inode:
-drop_inode:				!!!inode_lock!!!
+drop_inode:				!!!i_lock!!!
 evict_inode:
 put_super:		write
 write_super:		read
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index b12c895..f182795 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -299,7 +299,7 @@ be used instead.  It gets called whenever the inode is evicted, whether it has
 remaining links or not.  Caller does *not* evict the pagecache or inode-associated
 metadata buffers; getting rid of those is responsibility of method, as it had
 been for ->delete_inode().
-	->drop_inode() returns int now; it's called on final iput() with inode_lock
+	->drop_inode() returns int now; it's called on final iput() with i_lock
 held and it returns true if filesystems wants the inode to be dropped.  As before,
 generic_drop_inode() is still the default and it's been updated appropriately.
 generic_delete_inode() is also alive and it consists simply of return 1.  Note that
@@ -318,3 +318,9 @@ if it's zero is not *and* *never* *had* *been* enough.  Final unlink() and iput(
 may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
 free the on-disk inode, you may end up doing that while ->write_inode() is writing
 to it.
+
+[mandatory]
+        The i_count field in the inode has been replaced with i_ref, which is
+a regular integer instead of an atomic_t.  Filesystems should not manipulate
+it directly but use helpers like igrab(), iref() and iput().
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 0dbbbe4..cc0fd79 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -246,7 +246,7 @@ or bottom half).
 	should be synchronous or not, not all filesystems check this flag.
 
   drop_inode: called when the last access to the inode is dropped,
-	with the inode_lock spinlock held.
+	with the i_lock spinlock held.
 
 	This method should be either NULL (normal UNIX filesystem
 	semantics) or "generic_delete_inode" (for filesystems that do not
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7909775..dae9871 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -58,14 +58,12 @@ static void bdev_inode_switch_bdi(struct inode *inode,
 {
 	struct backing_dev_info *old = inode->i_data.backing_dev_info;
 
-	spin_lock(&inode_lock);
 	bdi_lock_two(old, dst);
 	inode->i_data.backing_dev_info = dst;
 	if (!list_empty(&inode->i_wb_list))
 		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
 	spin_unlock(&old->wb.b_lock);
 	spin_unlock(&dst->wb.b_lock);
-	spin_unlock(&inode_lock);
 }
 
 static sector_t max_block(struct block_device *bdev)
diff --git a/fs/buffer.c b/fs/buffer.c
index 3e7dca2..66f7afd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1145,7 +1145,7 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
  * inode list.
  *
  * mark_buffer_dirty() is atomic.  It takes bh->b_page->mapping->private_lock,
- * mapping->tree_lock and the global inode_lock.
+ * and mapping->tree_lock.
  */
 void mark_buffer_dirty(struct buffer_head *bh)
 {
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index f958dd8..bd39f65 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,7 +16,6 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
@@ -28,15 +27,12 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
-		spin_lock(&inode_lock);
 		spin_lock(&sb->s_inodes_lock);
 	}
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 	iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 807d936..f0f5ca0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -206,7 +206,7 @@ static void requeue_io(struct inode *inode)
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
-	 * Prevent speculative execution through spin_unlock(&inode_lock);
+	 * Prevent speculative execution through spin_unlock(&inode->i_lock);
 	 */
 	smp_mb();
 	wake_up_bit(&inode->i_state, __I_SYNC);
@@ -306,27 +306,30 @@ static void inode_wait_for_writeback(struct inode *inode)
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	while (inode->i_state & I_SYNC) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
-		spin_lock(&inode_lock);
 		spin_lock(&inode->i_lock);
 	}
 }
 
-/*
- * Write out an inode's dirty pages.  Called under inode_lock.  Either the
- * caller has a reference on the inode or the inode has I_WILL_FREE set.
+/**
+ * sync_inode - write an inode and its pages to disk.
+ * @inode: the inode to sync
+ * @wbc: controls the writeback mode
+ *
+ * sync_inode() will write an inode and its pages to disk.  It will also
+ * correctly update the inode on its superblock's dirty inode lists and will
+ * update inode->i_state.
  *
- * If `wait' is set, wait on the writeout.
+ * The caller must have a ref on the inode or the inode has I_WILL_FREE set.
+ *
+ * If @wbc->sync_mode == WB_SYNC_ALL then we are doing a data integrity
+ * operation so we need to wait on the writeout.
  *
  * The whole writeout design is quite complex and fragile.  We want to avoid
  * starvation of particular inodes when others are being redirtied, prevent
  * livelocks, etc.
- *
- * Called under inode_lock.
  */
-static int
-writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
+int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct address_space *mapping = inode->i_mapping;
@@ -368,7 +371,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
 
@@ -388,12 +390,10 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	 * due to delalloc, clear dirty metadata flags right before
 	 * write_inode()
 	 */
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
@@ -401,7 +401,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			ret = err;
 	}
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
@@ -460,6 +459,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	inode_sync_complete(inode);
 	return ret;
 }
+EXPORT_SYMBOL(sync_inode);
 
 /*
  * For background writeback the caller does not have the sb pinned
@@ -552,7 +552,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		spin_unlock(&wb->b_lock);
 
 		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wbc);
+		sync_inode(inode, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
@@ -562,10 +562,8 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			redirty_tail(inode);
 			spin_unlock(&wb->b_lock);
 		}
-		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
-		spin_lock(&inode_lock);
 		spin_lock(&wb->b_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
@@ -585,9 +583,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&inode_lock);
 	spin_lock(&wb->b_lock);
-
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 
@@ -607,7 +603,6 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 			break;
 	}
 	spin_unlock(&wb->b_lock);
-	spin_unlock(&inode_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
@@ -616,13 +611,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
 {
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_lock);
 	spin_lock(&wb->b_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&wb->b_lock);
-	spin_unlock(&inode_lock);
 }
 
 /*
@@ -732,7 +725,6 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
-		spin_lock(&inode_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			spin_lock(&wb->b_lock);
 			inode = list_entry(wb->b_more_io.prev,
@@ -743,7 +735,6 @@ static long wb_writeback(struct bdi_writeback *wb,
 			inode_wait_for_writeback(inode);
 			spin_unlock(&inode->i_lock);
 		}
-		spin_unlock(&inode_lock);
 	}
 
 	return wrote;
@@ -1006,7 +997,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 	if (unlikely(block_dump))
 		block_dump___mark_inode_dirty(inode);
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
@@ -1064,8 +1054,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 out_unlock:
 	spin_unlock(&inode->i_lock);
 out:
-	spin_unlock(&inode_lock);
-
 	if (wakeup_bdi)
 		bdi_wakeup_thread_delayed(bdi);
 }
@@ -1098,7 +1086,6 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 
 	/*
@@ -1121,14 +1108,12 @@ static void wait_sb_inodes(struct super_block *sb)
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 		/*
-		 * We hold a reference to 'inode' so it couldn't have
-		 * been removed from s_inodes list while we dropped the
-		 * inode_lock.  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it
-		 * under inode_lock. So we keep the reference and iput
-		 * it later.
+		 * We hold a reference to 'inode' so it couldn't have been
+		 * removed from s_inodes list while we dropped the
+		 * s_inodes_lock.  We cannot iput the inode now as we can be
+		 * holding the last reference and we cannot iput it under
+		 * s_inodes_lock. So we keep the reference and iput it later.
 		 */
 		iput(old_inode);
 		old_inode = inode;
@@ -1137,11 +1122,9 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&inode_lock);
 		spin_lock(&sb->s_inodes_lock);
 	}
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 	iput(old_inode);
 }
 
@@ -1244,33 +1227,9 @@ int write_inode_now(struct inode *inode, int sync)
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&inode_lock);
-	ret = writeback_single_inode(inode, &wbc);
-	spin_unlock(&inode_lock);
+	ret = sync_inode(inode, &wbc);
 	if (sync)
 		inode_sync_wait(inode);
 	return ret;
 }
 EXPORT_SYMBOL(write_inode_now);
-
-/**
- * sync_inode - write an inode and its pages to disk.
- * @inode: the inode to sync
- * @wbc: controls the writeback mode
- *
- * sync_inode() will write an inode and its pages to disk.  It will also
- * correctly update the inode on its superblock's dirty inode lists and will
- * update inode->i_state.
- *
- * The caller must have a ref on the inode.
- */
-int sync_inode(struct inode *inode, struct writeback_control *wbc)
-{
-	int ret;
-
-	spin_lock(&inode_lock);
-	ret = writeback_single_inode(inode, wbc);
-	spin_unlock(&inode_lock);
-	return ret;
-}
-EXPORT_SYMBOL(sync_inode);
diff --git a/fs/inode.c b/fs/inode.c
index b33b57c..0046ea8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -44,24 +44,20 @@
  *   inode_lru, i_lru
  *
  * Lock orders
- * inode_lock
- *   inode hash bucket lock
- *     inode->i_lock
+ * inode hash bucket lock
+ *   inode->i_lock
  *
- * inode_lock
- *   sb inode lock
- *     inode_lru_lock
- *       wb->b_lock
- *         inode->i_lock
+ * sb inode lock
+ *   inode_lru_lock
+ *     wb->b_lock
+ *       inode->i_lock
  *
- * inode_lock
- *   wb->b_lock
- *     sb_lock (pin sb for writeback)
- *     inode->i_lock
+ * wb->b_lock
+ *   sb_lock (pin sb for writeback)
+ *   inode->i_lock
  *
- * inode_lock
- *   inode_lru_lock
- *     inode->i_lock
+ * inode_lru_lock
+ *   inode->i_lock
  */
 /*
  * This is needed for the following functions:
@@ -114,14 +110,6 @@ static LIST_HEAD(inode_lru);
 static DEFINE_SPINLOCK(inode_lru_lock);
 
 /*
- * A simple spinlock to protect the list manipulations.
- *
- * NOTE! You also have to own the lock if you change
- * the i_state of an inode while it is in use..
- */
-DEFINE_SPINLOCK(inode_lock);
-
-/*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages
  * icache shrinking path, and the umount path.  Without this exclusion,
  * by the time prune_icache calls iput for the inode whose pages it has
@@ -355,11 +343,9 @@ static void init_once(void *foo)
 void iref(struct inode *inode)
 {
 	WARN_ON(inode->i_ref < 1);
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	inode->i_ref++;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL_GPL(iref);
 
@@ -387,28 +373,21 @@ void inode_lru_list_del(struct inode *inode)
 	spin_unlock(&inode_lru_lock);
 }
 
-static void __inode_sb_list_add(struct inode *inode)
-{
-	struct super_block *sb = inode->i_sb;
-
-	spin_lock(&sb->s_inodes_lock);
-	list_add(&inode->i_sb_list, &sb->s_inodes);
-	spin_unlock(&sb->s_inodes_lock);
-}
-
 /**
  * inode_sb_list_add - add inode to the superblock list of inodes
  * @inode: inode to add
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode_lock);
-	__inode_sb_list_add(inode);
-	spin_unlock(&inode_lock);
+	struct super_block *sb = inode->i_sb;
+
+	spin_lock(&sb->s_inodes_lock);
+	list_add(&inode->i_sb_list, &sb->s_inodes);
+	spin_unlock(&sb->s_inodes_lock);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
-static void __inode_sb_list_del(struct inode *inode)
+static void inode_sb_list_del(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 
@@ -439,22 +418,17 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 {
 	struct hlist_bl_head *b = inode_hashtable + hash(inode->i_sb, hashval);
 
-	spin_lock(&inode_lock);
 	hlist_bl_lock(b);
 	hlist_bl_add_head(&inode->i_hash, b);
 	hlist_bl_unlock(b);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
 
 /**
- *	__remove_inode_hash - remove an inode from the hash
+ *	remove_inode_hash - remove an inode from the hash
  *	@inode: inode to unhash
- *
- *	Remove an inode from the superblock. inode->i_lock must be
- *	held.
  */
-static void __remove_inode_hash(struct inode *inode)
+void remove_inode_hash(struct inode *inode)
 {
 	struct hlist_bl_head *b;
 
@@ -463,19 +437,6 @@ static void __remove_inode_hash(struct inode *inode)
 	hlist_bl_del_init(&inode->i_hash);
 	hlist_bl_unlock(b);
 }
-
-/**
- *	remove_inode_hash - remove an inode from the hash
- *	@inode: inode to unhash
- *
- *	Remove an inode from the superblock.
- */
-void remove_inode_hash(struct inode *inode)
-{
-	spin_lock(&inode_lock);
-	__remove_inode_hash(inode);
-	spin_unlock(&inode_lock);
-}
 EXPORT_SYMBOL(remove_inode_hash);
 
 void end_writeback(struct inode *inode)
@@ -526,10 +487,8 @@ static void dispose_list(struct list_head *head)
 
 		evict(inode);
 
-		spin_lock(&inode_lock);
-		__remove_inode_hash(inode);
-		__inode_sb_list_del(inode);
-		spin_unlock(&inode_lock);
+		remove_inode_hash(inode);
+		inode_sb_list_del(inode);
 
 		spin_lock(&inode->i_lock);
 		wake_up_bit(&inode->i_state, __I_NEW);
@@ -558,7 +517,6 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
 		 * change during umount anymore, and because iprune_sem keeps
 		 * shrink_icache_memory() away.
 		 */
-		cond_resched_lock(&inode_lock);
 		cond_resched_lock(&sb->s_inodes_lock);
 
 		next = next->next;
@@ -609,12 +567,10 @@ int invalidate_inodes(struct super_block *sb)
 	LIST_HEAD(throw_away);
 
 	down_write(&iprune_sem);
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	fsnotify_unmount_inodes(&sb->s_inodes);
 	busy = invalidate_list(sb, &sb->s_inodes, &throw_away);
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 
 	dispose_list(&throw_away);
 	up_write(&iprune_sem);
@@ -625,7 +581,7 @@ EXPORT_SYMBOL(invalidate_inodes);
 
 /*
  * Scan `goal' inodes on the unused list for freeable ones. They are moved to a
- * temporary list and then are freed outside inode_lock by dispose_list().
+ * temporary list and then are freed outside locks by dispose_list().
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  If the inode has metadata buffers attached to
@@ -646,7 +602,6 @@ static void prune_icache(int nr_to_scan)
 	unsigned long reap = 0;
 
 	down_read(&iprune_sem);
-	spin_lock(&inode_lock);
 	spin_lock(&inode_lru_lock);
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
@@ -679,7 +634,6 @@ static void prune_icache(int nr_to_scan)
 			inode->i_ref++;
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lru_lock);
-			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
 								0, -1);
@@ -695,7 +649,6 @@ static void prune_icache(int nr_to_scan)
 			 * the I_REFERENCED flag on the next pass and do the
 			 * same. Either way, we won't spin on it in this loop.
 			 */
-			spin_lock(&inode_lock);
 			spin_lock(&inode_lru_lock);
 			continue;
 		}
@@ -718,7 +671,6 @@ static void prune_icache(int nr_to_scan)
 	else
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&inode_lru_lock);
-	spin_unlock(&inode_lock);
 
 	dispose_list(&freeable);
 	up_read(&iprune_sem);
@@ -868,19 +820,15 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&inode_lock);
-
 	inode = alloc_inode(sb);
 	if (inode) {
-		spin_lock(&inode_lock);
 		/*
 		 * set the inode state before we make the inode accessible to
 		 * the outside world.
 		 */
 		inode->i_ino = get_next_ino();
 		inode->i_state = 0;
-		__inode_sb_list_add(inode);
-		spin_unlock(&inode_lock);
+		inode_sb_list_add(inode);
 	}
 	return inode;
 }
@@ -938,7 +886,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 	if (inode) {
 		struct inode *old;
 
-		spin_lock(&inode_lock);
 		hlist_bl_lock(b);
 		/* We released the lock, so.. */
 		old = find_inode(sb, b, test, data);
@@ -953,8 +900,7 @@ static struct inode *get_new_inode(struct super_block *sb,
 			inode->i_state = I_NEW;
 			hlist_bl_add_head(&inode->i_hash, b);
 			hlist_bl_unlock(b);
-			__inode_sb_list_add(inode);
-			spin_unlock(&inode_lock);
+			inode_sb_list_add(inode);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -968,7 +914,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 * allocated.
 		 */
 		hlist_bl_unlock(b);
-		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
 		wait_on_inode(inode);
@@ -977,7 +922,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 
 set_failed:
 	hlist_bl_unlock(b);
-	spin_unlock(&inode_lock);
 	destroy_inode(inode);
 	return NULL;
 }
@@ -995,7 +939,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 	if (inode) {
 		struct inode *old;
 
-		spin_lock(&inode_lock);
 		hlist_bl_lock(b);
 		/* We released the lock, so.. */
 		old = find_inode_fast(sb, b, ino);
@@ -1008,8 +951,7 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 			inode->i_state = I_NEW;
 			hlist_bl_add_head(&inode->i_hash, b);
 			hlist_bl_unlock(b);
-			__inode_sb_list_add(inode);
-			spin_unlock(&inode_lock);
+			inode_sb_list_add(inode);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -1023,7 +965,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 * allocated.
 		 */
 		hlist_bl_unlock(b);
-		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
 		wait_on_inode(inode);
@@ -1081,7 +1022,6 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
 	static unsigned int counter;
 	ino_t res;
 
-	spin_lock(&inode_lock);
 	spin_lock(&iunique_lock);
 	do {
 		if (counter <= max_reserved)
@@ -1089,7 +1029,6 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
 		res = counter++;
 	} while (!test_inode_iunique(sb, res));
 	spin_unlock(&iunique_lock);
-	spin_unlock(&inode_lock);
 
 	return res;
 }
@@ -1097,7 +1036,6 @@ EXPORT_SYMBOL(iunique);
 
 struct inode *igrab(struct inode *inode)
 {
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
 		inode->i_ref++;
@@ -1111,7 +1049,6 @@ struct inode *igrab(struct inode *inode)
 		 */
 		inode = NULL;
 	}
-	spin_unlock(&inode_lock);
 	return inode;
 }
 EXPORT_SYMBOL(igrab);
@@ -1133,7 +1070,7 @@ EXPORT_SYMBOL(igrab);
  *
  * Otherwise NULL is returned.
  *
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
  */
 static struct inode *ifind(struct super_block *sb,
 		struct hlist_bl_head *b,
@@ -1142,11 +1079,9 @@ static struct inode *ifind(struct super_block *sb,
 {
 	struct inode *inode;
 
-	spin_lock(&inode_lock);
 	hlist_bl_lock(b);
 	inode = find_inode(sb, b, test, data);
 	hlist_bl_unlock(b);
-	spin_unlock(&inode_lock);
 
 	if (inode && likely(wait))
 		wait_on_inode(inode);
@@ -1174,11 +1109,9 @@ static struct inode *ifind_fast(struct super_block *sb,
 {
 	struct inode *inode;
 
-	spin_lock(&inode_lock);
 	hlist_bl_lock(b);
 	inode = find_inode_fast(sb, b, ino);
 	hlist_bl_unlock(b);
-	spin_unlock(&inode_lock);
 
 	if (inode)
 		wait_on_inode(inode);
@@ -1204,7 +1137,7 @@ static struct inode *ifind_fast(struct super_block *sb,
  *
  * Otherwise NULL is returned.
  *
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
  */
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
@@ -1232,7 +1165,7 @@ EXPORT_SYMBOL(ilookup5_nowait);
  *
  * Otherwise NULL is returned.
  *
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
  */
 struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
@@ -1283,7 +1216,7 @@ EXPORT_SYMBOL(ilookup);
  * inode and this is returned locked, hashed, and with the I_NEW flag set. The
  * file system gets to fill it in before unlocking it via unlock_new_inode().
  *
- * Note both @test and @set are called with the inode_lock held, so can't sleep.
+ * Note both @test and @set are called with the i_lock held, so can't sleep.
  */
 struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *),
@@ -1344,7 +1277,6 @@ int insert_inode_locked(struct inode *inode)
 	while (1) {
 		struct hlist_bl_node *node;
 		struct inode *old = NULL;
-		spin_lock(&inode_lock);
 		hlist_bl_lock(b);
 		hlist_bl_for_each_entry(old, node, b, i_hash) {
 			if (old->i_ino != ino)
@@ -1361,13 +1293,11 @@ int insert_inode_locked(struct inode *inode)
 		if (likely(!node)) {
 			hlist_bl_add_head(&inode->i_hash, b);
 			hlist_bl_unlock(b);
-			spin_unlock(&inode_lock);
 			return 0;
 		}
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
 		hlist_bl_unlock(b);
-		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!inode_unhashed(old))) {
 			iput(old);
@@ -1394,7 +1324,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		struct hlist_bl_node *node;
 		struct inode *old = NULL;
 
-		spin_lock(&inode_lock);
 		hlist_bl_lock(b);
 		hlist_bl_for_each_entry(old, node, b, i_hash) {
 			if (old->i_sb != sb)
@@ -1411,13 +1340,11 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		if (likely(!node)) {
 			hlist_bl_add_head(&inode->i_hash, b);
 			hlist_bl_unlock(b);
-			spin_unlock(&inode_lock);
 			return 0;
 		}
 		old->i_ref++;
 		spin_unlock(&old->i_lock);
 		hlist_bl_unlock(b);
-		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!inode_unhashed(old))) {
 			iput(old);
@@ -1479,16 +1406,13 @@ static void iput_final(struct inode *inode)
 				return;
 			}
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_lock);
 			return;
 		}
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_WILL_FREE;
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		write_inode_now(inode, 1);
-		spin_lock(&inode_lock);
-		__remove_inode_hash(inode);
+		remove_inode_hash(inode);
 		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
@@ -1500,13 +1424,12 @@ static void iput_final(struct inode *inode)
 	/*
 	 * After we delete the inode from the LRU and IO lists here, we avoid
 	 * moving dirty inodes back onto the LRU now because I_FREEING is set
-	 * and hence writeback_single_inode() won't move the inode around.
+	 * and hence sync_inode() won't move the inode around.
 	 */
 	inode_wb_list_del(inode);
 	inode_lru_list_del(inode);
 
-	__inode_sb_list_del(inode);
-	spin_unlock(&inode_lock);
+	inode_sb_list_del(inode);
 	evict(inode);
 	remove_inode_hash(inode);
 	spin_lock(&inode->i_lock);
@@ -1528,7 +1451,6 @@ static void iput_final(struct inode *inode)
 void iput(struct inode *inode)
 {
 	if (inode) {
-		spin_lock(&inode_lock);
 		spin_lock(&inode->i_lock);
 		BUG_ON(inode->i_state & I_CLEAR);
 
@@ -1537,7 +1459,6 @@ void iput(struct inode *inode)
 			return;
 		}
 		spin_unlock(&inode->i_lock);
-		spin_lock(&inode_lock);
 	}
 }
 EXPORT_SYMBOL(iput);
@@ -1718,8 +1639,6 @@ EXPORT_SYMBOL(inode_wait);
  * wake_up_bit(&inode->i_state, __I_NEW) with the i_lock held after removing
  * from the hash list will DTRT.
  *
- * This is called with inode_lock held.
- *
  * Called with i_lock held and returns with it dropped.
  */
 static void __wait_on_freeing_inode(struct inode *inode)
@@ -1729,10 +1648,8 @@ static void __wait_on_freeing_inode(struct inode *inode)
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	schedule();
 	finish_wait(wq, &wait.wait);
-	spin_lock(&inode_lock);
 }
 
 static __initdata unsigned long ihash_entries;
diff --git a/fs/logfs/inode.c b/fs/logfs/inode.c
index d8c71ec..a67b607 100644
--- a/fs/logfs/inode.c
+++ b/fs/logfs/inode.c
@@ -286,7 +286,7 @@ static int logfs_write_inode(struct inode *inode, struct writeback_control *wbc)
 	return ret;
 }
 
-/* called with inode_lock held */
+/* called with i_lock held */
 static int logfs_drop_inode(struct inode *inode)
 {
 	struct logfs_super *super = logfs_super(inode->i_sb);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 203146b..265ecba 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -22,7 +22,6 @@
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
-#include <linux/writeback.h> /* for inode_lock */
 
 #include <asm/atomic.h>
 
@@ -232,9 +231,8 @@ out:
  * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
  * @list: list of inodes being unmounted (sb->s_inodes)
  *
- * Called with inode_lock held, protecting the unmounting super block's list
- * of inodes, and with iprune_mutex held, keeping shrink_icache_memory() at bay.
- * We temporarily drop inode_lock, however, and CAN block.
+ * Called with iprune_mutex held, keeping shrink_icache_memory() at bay.
+ * sb->s_inodes_lock protects the super block's list of inodes.
  */
 void fsnotify_unmount_inodes(struct list_head *list)
 {
@@ -288,13 +286,12 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		}
 
 		/*
-		 * We can safely drop inode_lock here because we hold
+		 * We can safely drop sb->s_inodes_lock here because we hold
 		 * references on both inode and next_i.  Also no new inodes
 		 * will be added since the umount has begun.  Finally,
 		 * iprune_mutex keeps shrink_icache_memory() away.
 		 */
 		spin_unlock(&sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 
 		if (need_iput_tmp)
 			iput(need_iput_tmp);
@@ -306,7 +303,6 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 		iput(inode);
 
-		spin_lock(&inode_lock);
 		spin_lock(&sb->s_inodes_lock);
 	}
 }
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 325185e..50c0085 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -91,7 +91,6 @@
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/srcu.h>
-#include <linux/writeback.h> /* for inode_lock */
 
 #include <asm/atomic.h>
 
diff --git a/fs/notify/vfsmount_mark.c b/fs/notify/vfsmount_mark.c
index 56772b5..6f8eefe 100644
--- a/fs/notify/vfsmount_mark.c
+++ b/fs/notify/vfsmount_mark.c
@@ -23,7 +23,6 @@
 #include <linux/mount.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
-#include <linux/writeback.h> /* for inode_lock */
 
 #include <asm/atomic.h>
 
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index 07fdef8..9b9375a 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -54,7 +54,7 @@
  *
  * Return 1 if the attributes match and 0 if not.
  *
- * NOTE: This function runs with the inode_lock spin lock held so it is not
+ * NOTE: This function runs with the i_lock spin lock held so it is not
  * allowed to sleep.
  */
 int ntfs_test_inode(struct inode *vi, ntfs_attr *na)
@@ -98,7 +98,7 @@ int ntfs_test_inode(struct inode *vi, ntfs_attr *na)
  *
  * Return 0 on success and -errno on error.
  *
- * NOTE: This function runs with the inode_lock spin lock held so it is not
+ * NOTE: This function runs with the i_lock spin lock held so it is not
  * allowed to sleep. (Hence the GFP_ATOMIC allocation.)
  */
 static int ntfs_init_locked_inode(struct inode *vi, ntfs_attr *na)
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index eece3e0..65c61e2 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1195,7 +1195,7 @@ void ocfs2_evict_inode(struct inode *inode)
 	ocfs2_clear_inode(inode);
 }
 
-/* Called under inode_lock, with no more references on the
+/* Called under i_lock, with no more references on the
  * struct inode, so it's safe here to check the flags field
  * and to manipulate i_nlink without any other locks. */
 int ocfs2_drop_inode(struct inode *inode)
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index b02a3e1..178bed4 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -76,7 +76,7 @@
 #include <linux/buffer_head.h>
 #include <linux/capability.h>
 #include <linux/quotaops.h>
-#include <linux/writeback.h> /* for inode_lock, oddly enough.. */
+#include <linux/writeback.h>
 
 #include <asm/uaccess.h>
 
@@ -896,7 +896,6 @@ static void add_dquot_ref(struct super_block *sb, int type)
 	int reserved = 0;
 #endif
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
@@ -914,21 +913,18 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 
 		iput(old_inode);
 		__dquot_initialize(inode, type);
 		/* We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the inode_lock.
+		 * removed from s_inodes list while we dropped the lock.
 		 * We cannot iput the inode now as we can be holding the last
-		 * reference and we cannot iput it under inode_lock. So we
+		 * reference and we cannot iput it under the lock. So we
 		 * keep the reference and iput it later. */
 		old_inode = inode;
-		spin_lock(&inode_lock);
 		spin_lock(&sb->s_inodes_lock);
 	}
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1009,7 +1005,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 	struct inode *inode;
 	int reserved = 0;
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
@@ -1025,7 +1020,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 		}
 	}
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f222ce8..abdb756 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1593,7 +1593,7 @@ struct super_operations {
 };
 
 /*
- * Inode state bits.  Protected by inode_lock.
+ * Inode state bits.  Protected by i_lock.
  *
  * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
  * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 242b6f8..fa38cf0 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -9,8 +9,6 @@
 
 struct backing_dev_info;
 
-extern spinlock_t inode_lock;
-
 /*
  * fs/fs-writeback.c
  */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 52442bd..b38aa5d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -73,7 +73,6 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 	struct inode *inode;
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
-	spin_lock(&inode_lock);
 	spin_lock(&wb->b_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
@@ -82,7 +81,6 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
 	spin_unlock(&wb->b_lock);
-	spin_unlock(&inode_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -697,14 +695,12 @@ void bdi_destroy(struct backing_dev_info *bdi)
 	if (bdi_has_dirty_io(bdi)) {
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 
-		spin_lock(&inode_lock);
 		bdi_lock_two(bdi, &default_backing_dev_info);
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
 		spin_unlock(&bdi->wb.b_lock);
 		spin_unlock(&dst->b_lock);
-		spin_unlock(&inode_lock);
 	}
 
 	bdi_unregister(bdi);
diff --git a/mm/filemap.c b/mm/filemap.c
index 3d4df44..ece6ef2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -80,7 +80,7 @@
  *  ->i_mutex
  *    ->i_alloc_sem             (various)
  *
- *  ->inode_lock
+ *  ->i_lock
  *    ->sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
  *
@@ -98,8 +98,8 @@
  *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
- *    ->inode_lock		(page_remove_rmap->set_page_dirty)
- *    ->inode_lock		(zap_pte_range->set_page_dirty)
+ *    ->i_lock			(page_remove_rmap->set_page_dirty)
+ *    ->i_lock			(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
  *
  *  ->task->proc_lock
diff --git a/mm/rmap.c b/mm/rmap.c
index 92e6757..dbfccae 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -31,11 +31,11 @@
  *             swap_lock (in swap_duplicate, swap_info_get)
  *               mmlist_lock (in mmput, drain_mmlist and others)
  *               mapping->private_lock (in __set_page_dirty_buffers)
- *               inode_lock (in set_page_dirty's __mark_inode_dirty)
- *                 sb_lock (within inode_lock in fs/fs-writeback.c)
+ *               i_lock (in set_page_dirty's __mark_inode_dirty)
+ *                 sb_lock (within i_lock in fs/fs-writeback.c)
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
- *                           within inode_lock in __sync_single_inode)
+ *                           within i_lock in __sync_single_inode)
  *
  * (code doesn't rely on that order so it could be switched around)
  * ->tasklist_lock
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 20/21] fs: Reduce inode I_FREEING and factor inode disposal
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (18 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 19/21] fs: icache remove inode_lock Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21  0:49 ` [PATCH 21/21] fs: do not assign default i_ino in new_inode Dave Chinner
  2010-10-21  5:04 ` Inode Lock Scalability V7 (was V6) Dave Chinner
  21 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Inode reclaim can push many inodes into the I_FREEING state before
it actually frees them. During the time it gathers these inodes, it
can call iput(), call invalidate_mapping_pages(), be preempted, and
so on. As a result, inodes can be held in I_FREEING for a long time,
and anything that finds one of them and has to wait for it to be
freed stalls for just as long.

After the inode scalability work there is no longer a compelling
reason to batch up inodes for reclaim, so we can dispose of them as
they are found on the LRU, as sketched below.
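
In outline, the reworked prune_icache() loop becomes the following
(a simplified sketch using dispose_one_inode(), introduced below;
the busy, referenced and pagecache-reaping cases handled by the real
code in the fs/inode.c hunk are elided):

	int nr_scanned;

	down_read(&iprune_sem);
	spin_lock(&inode_lru_lock);
	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
		struct inode *inode;

		if (list_empty(&inode_lru))
			break;
		inode = list_entry(inode_lru.prev, struct inode, i_lru);

		spin_lock(&inode->i_lock);
		inode->i_state |= I_FREEING;
		spin_unlock(&inode->i_lock);

		/* save a lock round trip by removing the inode here */
		list_del_init(&inode->i_lru);
		percpu_counter_dec(&nr_inodes_unused);
		spin_unlock(&inode_lru_lock);

		/* dispose immediately instead of batching on a local list */
		dispose_one_inode(inode);
		cond_resched();

		spin_lock(&inode_lru_lock);
	}
	spin_unlock(&inode_lru_lock);
	up_read(&iprune_sem);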

Unmount does a very similar reclaim process via invalidate_list(),
but currently uses the i_lru list to aggregate inodes for batched
disposal. That requires taking the inode_lru_lock for every inode we
want to dispose. Instead, take the inodes off the superblock inode
list (whose s_inodes_lock we already hold) and use i_sb_list as the
aggregator for inodes to dispose, reducing lock traffic.

Further, iput_final() does the same inode cleanup as reclaim and
unmount, so convert them all to use a single function,
dispose_one_inode(), for destroying inodes. It is written so that
callers can optimise away list removals they have already performed
and avoid unnecessary lock round trips; the resulting calling
convention is sketched below.
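
A minimal sketch of that convention (simplified; the fs/inode.c
hunks below are authoritative):

	/* caller (reclaim, unmount or iput_final) marks the inode first */
	spin_lock(&inode->i_lock);
	inode->i_state |= I_FREEING;
	spin_unlock(&inode->i_lock);

	/*
	 * A caller that already holds a list lock detaches the inode
	 * from that list itself - e.g. invalidate_list() unhooks
	 * i_sb_list under sb->s_inodes_lock - so dispose_one_inode()
	 * only removes the inode from the lists it is still on.
	 */
	dispose_one_inode(inode);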

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c |  112 ++++++++++++++++++++++++++++++++----------------------------
 1 files changed, 60 insertions(+), 52 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 0046ea8..d60e3b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -49,8 +49,8 @@
  *
  * sb inode lock
  *   inode_lru_lock
- *     wb->b_lock
- *       inode->i_lock
+ *   wb->b_lock
+ *     inode->i_lock
  *
  * wb->b_lock
  *   sb_lock (pin sb for writeback)
@@ -471,6 +471,48 @@ static void evict(struct inode *inode)
 }
 
 /*
+ * Free the inode passed in, removing it from the lists it is still connected
+ * to but avoiding unnecessary lock round-trips for the lists it is no longer
+ * on.
+ *
+ * An inode must already be marked I_FREEING so that we avoid the inode being
+ * moved back onto lists if we race with other code that manipulates the lists
+ * (e.g. sync_inode). The caller is responsible for setting this.
+ */
+static void dispose_one_inode(struct inode *inode)
+{
+	BUG_ON(!(inode->i_state & I_FREEING));
+
+	/*
+	 * move the inode off the IO lists and LRU once
+	 * I_FREEING is set so that it won't get moved back on
+	 * there if it is dirty.
+	 */
+	if (!list_empty(&inode->i_wb_list))
+		inode_wb_list_del(inode);
+	if (!list_empty(&inode->i_lru))
+		inode_lru_list_del(inode);
+	if (!list_empty(&inode->i_sb_list))
+		inode_sb_list_del(inode);
+
+	evict(inode);
+
+	/*
+	 * Remove the inode from the hash before waking any waiters. This
+	 * ordering is necessary to ensure that any lookup that finds an inode
+	 * in the I_FREEING state does not race with the wake up below. The
+	 * i_lock around the wakeup ensures this is correctly serialised.
+	 */
+	remove_inode_hash(inode);
+	spin_lock(&inode->i_lock);
+	wake_up_bit(&inode->i_state, __I_NEW);
+	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
+	spin_unlock(&inode->i_lock);
+
+	destroy_inode(inode);
+}
+
+/*
  * dispose_list - dispose of the contents of a local list
  * @head: the head of the list to free
  *
@@ -482,18 +524,10 @@ static void dispose_list(struct list_head *head)
 	while (!list_empty(head)) {
 		struct inode *inode;
 
-		inode = list_first_entry(head, struct inode, i_lru);
-		list_del_init(&inode->i_lru);
-
-		evict(inode);
+		inode = list_first_entry(head, struct inode, i_sb_list);
+		list_del_init(&inode->i_sb_list);
 
-		remove_inode_hash(inode);
-		inode_sb_list_del(inode);
-
-		spin_lock(&inode->i_lock);
-		wake_up_bit(&inode->i_state, __I_NEW);
-		spin_unlock(&inode->i_lock);
-		destroy_inode(inode);
+		dispose_one_inode(inode);
 	}
 }
 
@@ -534,17 +568,8 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
 			inode->i_state |= I_FREEING;
 			spin_unlock(&inode->i_lock);
 
-			/*
-			 * move the inode off the IO lists and LRU once
-			 * I_FREEING is set so that it won't get moved back on
-			 * there if it is dirty.
-			 */
-			inode_wb_list_del(inode);
-
-			spin_lock(&inode_lru_lock);
-			list_move(&inode->i_lru, dispose);
-			percpu_counter_dec(&nr_inodes_unused);
-			spin_unlock(&inode_lru_lock);
+			/* save a lock round trip by removing the inode here. */
+			list_move(&inode->i_sb_list, dispose);
 			continue;
 		}
 		spin_unlock(&inode->i_lock);
@@ -563,17 +588,17 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
  */
 int invalidate_inodes(struct super_block *sb)
 {
-	int busy;
 	LIST_HEAD(throw_away);
+	int busy;
 
 	down_write(&iprune_sem);
 	spin_lock(&sb->s_inodes_lock);
 	fsnotify_unmount_inodes(&sb->s_inodes);
 	busy = invalidate_list(sb, &sb->s_inodes, &throw_away);
 	spin_unlock(&sb->s_inodes_lock);
+	up_write(&iprune_sem);
 
 	dispose_list(&throw_away);
-	up_write(&iprune_sem);
 
 	return busy;
 }
@@ -597,7 +622,6 @@ EXPORT_SYMBOL(invalidate_inodes);
  */
 static void prune_icache(int nr_to_scan)
 {
-	LIST_HEAD(freeable);
 	int nr_scanned;
 	unsigned long reap = 0;
 
@@ -656,15 +680,15 @@ static void prune_icache(int nr_to_scan)
 		inode->i_state |= I_FREEING;
 		spin_unlock(&inode->i_lock);
 
-		/*
-		 * move the inode off the io lists and lru once
-		 * i_freeing is set so that it won't get moved back on
-		 * there if it is dirty.
-		 */
-		inode_wb_list_del(inode);
-
-		list_move(&inode->i_lru, &freeable);
+		/* save a lock round trip by removing the inode here. */
+		list_del_init(&inode->i_lru);
 		percpu_counter_dec(&nr_inodes_unused);
+		spin_unlock(&inode_lru_lock);
+
+		dispose_one_inode(inode);
+		cond_resched();
+
+		spin_lock(&inode_lru_lock);
 	}
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
@@ -672,7 +696,6 @@ static void prune_icache(int nr_to_scan)
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&inode_lru_lock);
 
-	dispose_list(&freeable);
 	up_read(&iprune_sem);
 }
 
@@ -1421,22 +1444,7 @@ static void iput_final(struct inode *inode)
 	inode->i_state |= I_FREEING;
 	spin_unlock(&inode->i_lock);
 
-	/*
-	 * After we delete the inode from the LRU and IO lists here, we avoid
-	 * moving dirty inodes back onto the LRU now because I_FREEING is set
-	 * and hence sync_inode() won't move the inode around.
-	 */
-	inode_wb_list_del(inode);
-	inode_lru_list_del(inode);
-
-	inode_sb_list_del(inode);
-	evict(inode);
-	remove_inode_hash(inode);
-	spin_lock(&inode->i_lock);
-	wake_up_bit(&inode->i_state, __I_NEW);
-	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
-	spin_unlock(&inode->i_lock);
-	destroy_inode(inode);
+	dispose_one_inode(inode);
 }
 
 /**
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 21/21] fs: do not assign default i_ino in new_inode
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (19 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 20/21] fs: Reduce inode I_FREEING and factor inode disposal Dave Chinner
@ 2010-10-21  0:49 ` Dave Chinner
  2010-10-21  5:04 ` Inode Lock Scalability V7 (was V6) Dave Chinner
  21 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  0:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

From: Christoph Hellwig <hch@lst.de>

Instead of always assigning an increasing inode number in new_inode(),
move the call to assign it into those callers that actually need it.
For now the set of callers that need it is estimated conservatively,
that is, the call is added to all filesystems that do not assign an
i_ino by themselves.  For a few more filesystems we can avoid
assigning any inode number at all, given that their inodes aren't
user visible, and for others it could be done lazily when an inode
number is actually needed, but that's left for later patches.
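
With this, the typical converted caller follows the pattern below (a
representative sketch; the per-filesystem initialisation around it
varies, as the hunks show):

	struct inode *inode = new_inode(sb);

	if (inode) {
		/* new_inode() no longer assigns a default i_ino */
		inode->i_ino = get_next_ino();
		inode->i_mode = mode;
		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
	}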

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 drivers/infiniband/hw/ipath/ipath_fs.c |    1 +
 drivers/infiniband/hw/qib/qib_fs.c     |    1 +
 drivers/misc/ibmasm/ibmasmfs.c         |    1 +
 drivers/oprofile/oprofilefs.c          |    1 +
 drivers/usb/core/inode.c               |    1 +
 drivers/usb/gadget/f_fs.c              |    1 +
 drivers/usb/gadget/inode.c             |    1 +
 fs/anon_inodes.c                       |    1 +
 fs/autofs4/inode.c                     |    1 +
 fs/binfmt_misc.c                       |    1 +
 fs/configfs/inode.c                    |    1 +
 fs/debugfs/inode.c                     |    1 +
 fs/ext4/mballoc.c                      |    1 +
 fs/freevxfs/vxfs_inode.c               |    1 +
 fs/fuse/control.c                      |    1 +
 fs/hugetlbfs/inode.c                   |    1 +
 fs/inode.c                             |    4 ++--
 fs/ocfs2/dlmfs/dlmfs.c                 |    2 ++
 fs/pipe.c                              |    2 ++
 fs/proc/base.c                         |    2 ++
 fs/proc/proc_sysctl.c                  |    2 ++
 fs/ramfs/inode.c                       |    1 +
 fs/xfs/linux-2.6/xfs_buf.c             |    1 +
 include/linux/fs.h                     |    1 +
 ipc/mqueue.c                           |    1 +
 kernel/cgroup.c                        |    1 +
 mm/shmem.c                             |    1 +
 net/socket.c                           |    1 +
 net/sunrpc/rpc_pipe.c                  |    1 +
 security/inode.c                       |    1 +
 security/selinux/selinuxfs.c           |    1 +
 31 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_fs.c b/drivers/infiniband/hw/ipath/ipath_fs.c
index 2fca708..3d7c1df 100644
--- a/drivers/infiniband/hw/ipath/ipath_fs.c
+++ b/drivers/infiniband/hw/ipath/ipath_fs.c
@@ -57,6 +57,7 @@ static int ipathfs_mknod(struct inode *dir, struct dentry *dentry,
 		goto bail;
 	}
 
+	inode->i_ino = get_next_ino();
 	inode->i_mode = mode;
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	inode->i_private = data;
diff --git a/drivers/infiniband/hw/qib/qib_fs.c b/drivers/infiniband/hw/qib/qib_fs.c
index 9f989c0..0a8da2a 100644
--- a/drivers/infiniband/hw/qib/qib_fs.c
+++ b/drivers/infiniband/hw/qib/qib_fs.c
@@ -58,6 +58,7 @@ static int qibfs_mknod(struct inode *dir, struct dentry *dentry,
 		goto bail;
 	}
 
+	inode->i_ino = get_next_ino();
 	inode->i_mode = mode;
 	inode->i_uid = 0;
 	inode->i_gid = 0;
diff --git a/drivers/misc/ibmasm/ibmasmfs.c b/drivers/misc/ibmasm/ibmasmfs.c
index 8844a3f..1ebe935 100644
--- a/drivers/misc/ibmasm/ibmasmfs.c
+++ b/drivers/misc/ibmasm/ibmasmfs.c
@@ -146,6 +146,7 @@ static struct inode *ibmasmfs_make_inode(struct super_block *sb, int mode)
 	struct inode *ret = new_inode(sb);
 
 	if (ret) {
+		ret->i_ino = get_next_ino();
 		ret->i_mode = mode;
 		ret->i_atime = ret->i_mtime = ret->i_ctime = CURRENT_TIME;
 	}
diff --git a/drivers/oprofile/oprofilefs.c b/drivers/oprofile/oprofilefs.c
index 2766a6d..5acc58d 100644
--- a/drivers/oprofile/oprofilefs.c
+++ b/drivers/oprofile/oprofilefs.c
@@ -28,6 +28,7 @@ static struct inode *oprofilefs_get_inode(struct super_block *sb, int mode)
 	struct inode *inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	}
diff --git a/drivers/usb/core/inode.c b/drivers/usb/core/inode.c
index 095fa53..e2f63c0 100644
--- a/drivers/usb/core/inode.c
+++ b/drivers/usb/core/inode.c
@@ -276,6 +276,7 @@ static struct inode *usbfs_get_inode (struct super_block *sb, int mode, dev_t de
 	struct inode *inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_uid = current_fsuid();
 		inode->i_gid = current_fsgid();
diff --git a/drivers/usb/gadget/f_fs.c b/drivers/usb/gadget/f_fs.c
index e4f5950..e093fd8 100644
--- a/drivers/usb/gadget/f_fs.c
+++ b/drivers/usb/gadget/f_fs.c
@@ -980,6 +980,7 @@ ffs_sb_make_inode(struct super_block *sb, void *data,
 	if (likely(inode)) {
 		struct timespec current_time = CURRENT_TIME;
 
+		inode->i_ino	 = get_next_ino();
 		inode->i_mode    = perms->mode;
 		inode->i_uid     = perms->uid;
 		inode->i_gid     = perms->gid;
diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index fc35406..136e78d 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -1994,6 +1994,7 @@ gadgetfs_make_inode (struct super_block *sb,
 	struct inode *inode = new_inode (sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_uid = default_uid;
 		inode->i_gid = default_gid;
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 451be78..327c484 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -189,6 +189,7 @@ static struct inode *anon_inode_mkinode(void)
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
 
+	inode->i_ino = get_next_ino();
 	inode->i_fop = &anon_inode_fops;
 
 	inode->i_mapping->a_ops = &anon_aops;
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 821b2b9..ac87e49 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -398,6 +398,7 @@ struct inode *autofs4_get_inode(struct super_block *sb,
 		inode->i_gid = sb->s_root->d_inode->i_gid;
 	}
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+	inode->i_ino = get_next_ino();
 
 	if (S_ISDIR(inf->mode)) {
 		inode->i_nlink = 2;
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index fd0cc0b..37c4aef 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -495,6 +495,7 @@ static struct inode *bm_get_inode(struct super_block *sb, int mode)
 	struct inode * inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_atime = inode->i_mtime = inode->i_ctime =
 			current_fs_time(inode->i_sb);
diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
index cf78d44..253476d 100644
--- a/fs/configfs/inode.c
+++ b/fs/configfs/inode.c
@@ -135,6 +135,7 @@ struct inode * configfs_new_inode(mode_t mode, struct configfs_dirent * sd)
 {
 	struct inode * inode = new_inode(configfs_sb);
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mapping->a_ops = &configfs_aops;
 		inode->i_mapping->backing_dev_info = &configfs_backing_dev_info;
 		inode->i_op = &configfs_inode_operations;
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 30a87b3..a4ed838 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -40,6 +40,7 @@ static struct inode *debugfs_get_inode(struct super_block *sb, int mode, dev_t d
 	struct inode *inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4b4ad4b..96e2bf3 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2373,6 +2373,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
 		printk(KERN_ERR "EXT4-fs: can't get new inode\n");
 		goto err_freesgi;
 	}
+	sbi->s_buddy_cache->i_ino = get_next_ino();
 	EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
 	for (i = 0; i < ngroups; i++) {
 		desc = ext4_get_group_desc(sb, i, NULL);
diff --git a/fs/freevxfs/vxfs_inode.c b/fs/freevxfs/vxfs_inode.c
index 79d1b4e..8c04eac 100644
--- a/fs/freevxfs/vxfs_inode.c
+++ b/fs/freevxfs/vxfs_inode.c
@@ -260,6 +260,7 @@ vxfs_get_fake_inode(struct super_block *sbp, struct vxfs_inode_info *vip)
 	struct inode			*ip = NULL;
 
 	if ((ip = new_inode(sbp))) {
+		ip->i_ino = get_next_ino();
 		vxfs_iinit(ip, vip);
 		ip->i_mapping->a_ops = &vxfs_aops;
 	}
diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index 3773fd6..3f67de2 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -218,6 +218,7 @@ static struct dentry *fuse_ctl_add_dentry(struct dentry *parent,
 	if (!inode)
 		return NULL;
 
+	inode->i_ino = get_next_ino();
 	inode->i_mode = mode;
 	inode->i_uid = fc->user_id;
 	inode->i_gid = fc->group_id;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 6e5bd42..b83f9ff 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -455,6 +455,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid,
 	inode = new_inode(sb);
 	if (inode) {
 		struct hugetlbfs_inode_info *info;
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_uid = uid;
 		inode->i_gid = gid;
diff --git a/fs/inode.c b/fs/inode.c
index d60e3b5..0b2d986 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -808,7 +808,7 @@ repeat:
 #define LAST_INO_BATCH 1024
 static DEFINE_PER_CPU(unsigned int, last_ino);
 
-static unsigned int get_next_ino(void)
+unsigned int get_next_ino(void)
 {
 	unsigned int *p = &get_cpu_var(last_ino);
 	unsigned int res = *p;
@@ -826,6 +826,7 @@ static unsigned int get_next_ino(void)
 	put_cpu_var(last_ino);
 	return res;
 }
+EXPORT_SYMBOL(get_next_ino);
 
 /**
  *	new_inode 	- obtain an inode
@@ -849,7 +850,6 @@ struct inode *new_inode(struct super_block *sb)
 		 * set the inode state before we make the inode accessible to
 		 * the outside world.
 		 */
-		inode->i_ino = get_next_ino();
 		inode->i_state = 0;
 		inode_sb_list_add(inode);
 	}
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index c2903b8..124d400 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -400,6 +400,7 @@ static struct inode *dlmfs_get_root_inode(struct super_block *sb)
 	if (inode) {
 		ip = DLMFS_I(inode);
 
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_uid = current_fsuid();
 		inode->i_gid = current_fsgid();
@@ -425,6 +426,7 @@ static struct inode *dlmfs_get_inode(struct inode *parent,
 	if (!inode)
 		return NULL;
 
+	inode->i_ino = get_next_ino();
 	inode->i_mode = mode;
 	inode->i_uid = current_fsuid();
 	inode->i_gid = current_fsgid();
diff --git a/fs/pipe.c b/fs/pipe.c
index 279eef9..acd453b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -954,6 +954,8 @@ static struct inode * get_pipe_inode(void)
 	if (!inode)
 		goto fail_inode;
 
+	inode->i_ino = get_next_ino();
+
 	pipe = alloc_pipe_info(inode);
 	if (!pipe)
 		goto fail_iput;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8e4adda..d2efd66 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1600,6 +1600,7 @@ static struct inode *proc_pid_make_inode(struct super_block * sb, struct task_st
 
 	/* Common stuff */
 	ei = PROC_I(inode);
+	inode->i_ino = get_next_ino();
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
 	inode->i_op = &proc_def_inode_operations;
 
@@ -2542,6 +2543,7 @@ static struct dentry *proc_base_instantiate(struct inode *dir,
 
 	/* Initialize the inode */
 	ei = PROC_I(inode);
+	inode->i_ino = get_next_ino();
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
 
 	/*
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 5be436e..f473a7b 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -23,6 +23,8 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 	if (!inode)
 		goto out;
 
+	inode->i_ino = get_next_ino();
+
 	sysctl_head_get(head);
 	ei = PROC_I(inode);
 	ei->sysctl = head;
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index a5ebae7..67fadb1 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -58,6 +58,7 @@ struct inode *ramfs_get_inode(struct super_block *sb,
 	struct inode * inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode_init_owner(inode, dir, mode);
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 286e36e..a47e6db 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -1572,6 +1572,7 @@ xfs_mapping_buftarg(
 			XFS_BUFTARG_NAME(btp));
 		return ENOMEM;
 	}
+	inode->i_ino = get_next_ino();
 	inode->i_mode = S_IFBLK;
 	inode->i_bdev = bdev;
 	inode->i_rdev = bdev->bd_dev;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index abdb756..213272b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2189,6 +2189,7 @@ extern struct inode * iget_locked(struct super_block *, unsigned long);
 extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
 extern int insert_inode_locked(struct inode *);
 extern void unlock_new_inode(struct inode *);
+extern unsigned int get_next_ino(void);
 
 extern void iref(struct inode *inode);
 extern void iget_failed(struct inode *);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index d53a2c1..a72f3c5 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -116,6 +116,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
 
 	inode = new_inode(sb);
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_uid = current_fsuid();
 		inode->i_gid = current_fsgid();
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c9483d8..e28f8e5 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -778,6 +778,7 @@ static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
 	struct inode *inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_uid = current_fsuid();
 		inode->i_gid = current_fsgid();
diff --git a/mm/shmem.c b/mm/shmem.c
index 419de2c..504ae65 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1586,6 +1586,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 
 	inode = new_inode(sb);
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode_init_owner(inode, dir, mode);
 		inode->i_blocks = 0;
 		inode->i_mapping->backing_dev_info = &shmem_backing_dev_info;
diff --git a/net/socket.c b/net/socket.c
index 715ca57..56114ec 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -480,6 +480,7 @@ static struct socket *sock_alloc(void)
 	sock = SOCKET_I(inode);
 
 	kmemcheck_annotate_bitfield(sock, type);
+	inode->i_ino = get_next_ino();
 	inode->i_mode = S_IFSOCK | S_IRWXUGO;
 	inode->i_uid = current_fsuid();
 	inode->i_gid = current_fsgid();
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index 8c8eef2..70da9a4 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -453,6 +453,7 @@ rpc_get_inode(struct super_block *sb, umode_t mode)
 	struct inode *inode = new_inode(sb);
 	if (!inode)
 		return NULL;
+	inode->i_ino = get_next_ino();
 	inode->i_mode = mode;
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	switch(mode & S_IFMT) {
diff --git a/security/inode.c b/security/inode.c
index 8c777f0..d3321c2 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -60,6 +60,7 @@ static struct inode *get_inode(struct super_block *sb, int mode, dev_t dev)
 	struct inode *inode = new_inode(sb);
 
 	if (inode) {
+		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index 79a1bb6..9e98cdc 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -785,6 +785,7 @@ static struct inode *sel_make_inode(struct super_block *sb, int mode)
 	struct inode *ret = new_inode(sb);
 
 	if (ret) {
+		ret->i_ino = get_next_ino();
 		ret->i_mode = mode;
 		ret->i_atime = ret->i_mtime = ret->i_ctime = CURRENT_TIME;
 	}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/21] fs: Clean up inode reference counting
  2010-10-21  0:49 ` [PATCH 06/21] fs: Clean up inode reference counting Dave Chinner
@ 2010-10-21  1:41   ` Christoph Hellwig
  0 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2010-10-21  1:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 874972d..d1c2f08 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -390,7 +390,7 @@ static int nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
>  	error = radix_tree_insert(&nfsi->nfs_page_tree, req->wb_index, req);
>  	BUG_ON(error);
>  	if (!nfsi->npages) {
> -		igrab(inode);
> +		iref(inode);
>  		if (nfs_have_delegation(inode, FMODE_WRITE))
>  			nfsi->change_attr++;

This still needs to be an unlocked increment, as pointed out in reply
to the last iteration.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 19/21] fs: icache remove inode_lock
  2010-10-21  0:49 ` [PATCH 19/21] fs: icache remove inode_lock Dave Chinner
@ 2010-10-21  2:14   ` Christian Stroetmann
  0 siblings, 0 replies; 58+ messages in thread
From: Christian Stroetmann @ 2010-10-21  2:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

  Aloha;

On the 21.10.2010 02:49, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> All the functionality that the inode_lock protected has now been
> wrapped up in new independent locks and/or functionality. Hence the
> inode_lock does not serve a purpose any longer and hence can now be
> removed.
>
> Based on work originally done by Nick Piggin.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>   Documentation/filesystems/Locking |    2 +-
>   Documentation/filesystems/porting |    8 ++-
>   Documentation/filesystems/vfs.txt |    2 +-
>   fs/block_dev.c                    |    2 -
>   fs/buffer.c                       |    2 +-
>   fs/drop_caches.c                  |    4 -
>   fs/fs-writeback.c                 |   85 ++++++----------------
>   fs/inode.c                        |  147 ++++++++-----------------------------
>   fs/logfs/inode.c                  |    2 +-
>   fs/notify/inode_mark.c            |   10 +--
>   fs/notify/mark.c                  |    1 -
>   fs/notify/vfsmount_mark.c         |    1 -
>   fs/ntfs/inode.c                   |    4 +-
>   fs/ocfs2/inode.c                  |    2 +-
>   fs/quota/dquot.c                  |   12 +--
>   include/linux/fs.h                |    2 +-
>   include/linux/writeback.h         |    2 -
>   mm/backing-dev.c                  |    4 -
>   mm/filemap.c                      |    6 +-
>   mm/rmap.c                         |    6 +-
>   20 files changed, 81 insertions(+), 223 deletions(-)
>
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index 2db4283..7f98cd5 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -114,7 +114,7 @@ alloc_inode:
>   destroy_inode:
>   dirty_inode:				(must not sleep)
>   write_inode:
> -drop_inode:				!!!inode_lock!!!
> +drop_inode:				!!!i_lock!!!
>   evict_inode:
>   put_super:		write
>   write_super:		read
> diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
> index b12c895..f182795 100644
> --- a/Documentation/filesystems/porting
> +++ b/Documentation/filesystems/porting
> @@ -299,7 +299,7 @@ be used instead.  It gets called whenever the inode is evicted, whether it has
>   remaining links or not.  Caller does *not* evict the pagecache or inode-associated
>   metadata buffers; getting rid of those is responsibility of method, as it had
>   been for ->delete_inode().
> -	->drop_inode() returns int now; it's called on final iput() with inode_lock
> +	->drop_inode() returns int now; it's called on final iput() with i_lock
>   held and it returns true if filesystems wants the inode to be dropped.  As before,
still exists :-) :
filesystems want
>   generic_drop_inode() is still the default and it's been updated appropriately.
>   generic_delete_inode() is also alive and it consists simply of return 1.  Note that
> @@ -318,3 +318,9 @@ if it's zero is not *and* *never* *had* *been* enough.  Final unlink() and iput(
>   may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
>   free the on-disk inode, you may end up doing that while ->write_inode() is writing
>   to it.
> +
> +[mandatory]
> +        The i_count field in the inode has been replaced with i_ref, which is
> +a regular integer instead of an atomic_t.  Filesystems should not manipulate
> +it directly but use helpers like igrab(), iref() and iput().
> +
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index 0dbbbe4..cc0fd79 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -246,7 +246,7 @@ or bottom half).
>   	should be synchronous or not, not all filesystems check this flag.
>
>     drop_inode: called when the last access to the inode is dropped,
> -	with the inode_lock spinlock held.
> +	with the i_lock spinlock held.
>
>   	This method should be either NULL (normal UNIX filesystem
>   	semantics) or "generic_delete_inode" (for filesystems that do not
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 7909775..dae9871 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -58,14 +58,12 @@ static void bdev_inode_switch_bdi(struct inode *inode,
>   {
>   	struct backing_dev_info *old = inode->i_data.backing_dev_info;
>
> -	spin_lock(&inode_lock);
>   	bdi_lock_two(old, dst);
>   	inode->i_data.backing_dev_info = dst;
>   	if (!list_empty(&inode->i_wb_list))
>   		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
>   	spin_unlock(&old->wb.b_lock);
>   	spin_unlock(&dst->wb.b_lock);
> -	spin_unlock(&inode_lock);
>   }
>
>   static sector_t max_block(struct block_device *bdev)
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 3e7dca2..66f7afd 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1145,7 +1145,7 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
>    * inode list.
>    *
>    * mark_buffer_dirty() is atomic.  It takes bh->b_page->mapping->private_lock,
> - * mapping->tree_lock and the global inode_lock.
> + * and mapping->tree_lock.
>    */
>   void mark_buffer_dirty(struct buffer_head *bh)
>   {
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index f958dd8..bd39f65 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -16,7 +16,6 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
>   {
>   	struct inode *inode, *toput_inode = NULL;
>
> -	spin_lock(&inode_lock);
>   	spin_lock(&sb->s_inodes_lock);
>   	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
>   		spin_lock(&inode->i_lock);
> @@ -28,15 +27,12 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
>   		inode->i_ref++;
>   		spin_unlock(&inode->i_lock);
>   		spin_unlock(&sb->s_inodes_lock);
> -		spin_unlock(&inode_lock);
>   		invalidate_mapping_pages(inode->i_mapping, 0, -1);
>   		iput(toput_inode);
>   		toput_inode = inode;
> -		spin_lock(&inode_lock);
>   		spin_lock(&sb->s_inodes_lock);
>   	}
>   	spin_unlock(&sb->s_inodes_lock);
> -	spin_unlock(&inode_lock);
>   	iput(toput_inode);
>   }
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 807d936..f0f5ca0 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -206,7 +206,7 @@ static void requeue_io(struct inode *inode)
>   static void inode_sync_complete(struct inode *inode)
>   {
>   	/*
> -	 * Prevent speculative execution through spin_unlock(&inode_lock);
> +	 * Prevent speculative execution through spin_unlock(&inode->i_lock);
>   	 */
>   	smp_mb();
>   	wake_up_bit(&inode->i_state, __I_SYNC);
> @@ -306,27 +306,30 @@ static void inode_wait_for_writeback(struct inode *inode)
>   	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
>   	while (inode->i_state & I_SYNC) {
>   		spin_unlock(&inode->i_lock);
> -		spin_unlock(&inode_lock);
>   		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
> -		spin_lock(&inode_lock);
>   		spin_lock(&inode->i_lock);
>   	}
>   }
>
> -/*
> - * Write out an inode's dirty pages.  Called under inode_lock.  Either the
> - * caller has a reference on the inode or the inode has I_WILL_FREE set.
> +/**
> + * sync_inode - write an inode and its pages to disk.
> + * @inode: the inode to sync
> + * @wbc: controls the writeback mode
> + *
> + * sync_inode() will write an inode and its pages to disk.  It will also
> + * correctly update the inode on its superblock's dirty inode lists and will
> + * update inode->i_state.
>    *
> - * If `wait' is set, wait on the writeout.
> + * The caller must have a ref on the inode or the inode has I_WILL_FREE set.
> + *
> + * If @wbc->sync_mode == WB_SYNC_ALL then we are doing a data integrity
> + * operation so we need to wait on the writeout.
>    *
>    * The whole writeout design is quite complex and fragile.  We want to avoid
>    * starvation of particular inodes when others are being redirtied, prevent
>    * livelocks, etc.
> - *
> - * Called under inode_lock.
>    */
> -static int
> -writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> +int sync_inode(struct inode *inode, struct writeback_control *wbc)
>   {
>   	struct backing_dev_info *bdi = inode_to_bdi(inode);
>   	struct address_space *mapping = inode->i_mapping;
> @@ -368,7 +371,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>   	inode->i_state |= I_SYNC;
>   	inode->i_state &= ~I_DIRTY_PAGES;
>   	spin_unlock(&inode->i_lock);
> -	spin_unlock(&inode_lock);
>
>   	ret = do_writepages(mapping, wbc);
>
> @@ -388,12 +390,10 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>   	 * due to delalloc, clear dirty metadata flags right before
>   	 * write_inode()
>   	 */
> -	spin_lock(&inode_lock);
>   	spin_lock(&inode->i_lock);
>   	dirty = inode->i_state & I_DIRTY;
>   	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
>   	spin_unlock(&inode->i_lock);
> -	spin_unlock(&inode_lock);
>   	/* Don't write the inode if only I_DIRTY_PAGES was set */
>   	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
>   		int err = write_inode(inode, wbc);
> @@ -401,7 +401,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>   			ret = err;
>   	}
>
> -	spin_lock(&inode_lock);
>   	spin_lock(&inode->i_lock);
>   	inode->i_state &= ~I_SYNC;
>   	if (!(inode->i_state & I_FREEING)) {
> @@ -460,6 +459,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>   	inode_sync_complete(inode);
>   	return ret;
>   }
> +EXPORT_SYMBOL(sync_inode);
>
>   /*
>    * For background writeback the caller does not have the sb pinned
> @@ -552,7 +552,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
>   		spin_unlock(&wb->b_lock);
>
>   		pages_skipped = wbc->pages_skipped;
> -		writeback_single_inode(inode, wbc);
> +		sync_inode(inode, wbc);
>   		if (wbc->pages_skipped != pages_skipped) {
>   			/*
>   			 * writeback is not making progress due to locked
> @@ -562,10 +562,8 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
>   			redirty_tail(inode);
>   			spin_unlock(&wb->b_lock);
>   		}
> -		spin_unlock(&inode_lock);
>   		iput(inode);
>   		cond_resched();
> -		spin_lock(&inode_lock);
>   		spin_lock(&wb->b_lock);
>   		if (wbc->nr_to_write <= 0) {
>   			wbc->more_io = 1;
> @@ -585,9 +583,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
>
>   	if (!wbc->wb_start)
>   		wbc->wb_start = jiffies; /* livelock avoidance */
> -	spin_lock(&inode_lock);
>   	spin_lock(&wb->b_lock);
> -
>   	if (!wbc->for_kupdate || list_empty(&wb->b_io))
>   		queue_io(wb, wbc->older_than_this);
>
> @@ -607,7 +603,6 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
>   			break;
>   	}
>   	spin_unlock(&wb->b_lock);
> -	spin_unlock(&inode_lock);
>   	/* Leave any unwritten inodes on b_io */
>   }
>
> @@ -616,13 +611,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
>   {
>   	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> -	spin_lock(&inode_lock);
>   	spin_lock(&wb->b_lock);
>   	if (!wbc->for_kupdate || list_empty(&wb->b_io))
>   		queue_io(wb, wbc->older_than_this);
>   	writeback_sb_inodes(sb, wb, wbc, true);
>   	spin_unlock(&wb->b_lock);
> -	spin_unlock(&inode_lock);
>   }
>
>   /*
> @@ -732,7 +725,6 @@ static long wb_writeback(struct bdi_writeback *wb,
>   		 * become available for writeback. Otherwise
>   		 * we'll just busyloop.
>   		 */
> -		spin_lock(&inode_lock);
>   		if (!list_empty(&wb->b_more_io))  {
>   			spin_lock(&wb->b_lock);
>   			inode = list_entry(wb->b_more_io.prev,
> @@ -743,7 +735,6 @@ static long wb_writeback(struct bdi_writeback *wb,
>   			inode_wait_for_writeback(inode);
>   			spin_unlock(&inode->i_lock);
>   		}
> -		spin_unlock(&inode_lock);
>   	}
>
>   	return wrote;
> @@ -1006,7 +997,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>   	if (unlikely(block_dump))
>   		block_dump___mark_inode_dirty(inode);
>
> -	spin_lock(&inode_lock);
>   	spin_lock(&inode->i_lock);
>   	if ((inode->i_state & flags) != flags) {
>   		const int was_dirty = inode->i_state & I_DIRTY;
> @@ -1064,8 +1054,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>   out_unlock:
>   	spin_unlock(&inode->i_lock);
>   out:
> -	spin_unlock(&inode_lock);
> -
>   	if (wakeup_bdi)
>   		bdi_wakeup_thread_delayed(bdi);
>   }
> @@ -1098,7 +1086,6 @@ static void wait_sb_inodes(struct super_block *sb)
>   	 */
>   	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> -	spin_lock(&inode_lock);
>   	spin_lock(&sb->s_inodes_lock);
>
>   	/*
> @@ -1121,14 +1108,12 @@ static void wait_sb_inodes(struct super_block *sb)
>   		inode->i_ref++;
>   		spin_unlock(&inode->i_lock);
>   		spin_unlock(&sb->s_inodes_lock);
> -		spin_unlock(&inode_lock);
>   		/*
> -		 * We hold a reference to 'inode' so it couldn't have
> -		 * been removed from s_inodes list while we dropped the
> -		 * inode_lock.  We cannot iput the inode now as we can
> -		 * be holding the last reference and we cannot iput it
> -		 * under inode_lock. So we keep the reference and iput
> -		 * it later.
> +		 * We hold a reference to 'inode' so it couldn't have been
> +		 * removed from s_inodes list while we dropped the
> +		 * s_inodes_lock.  We cannot iput the inode now as we can be
> +		 * holding the last reference and we cannot iput it under
> +		 * s_inodes_lock. So we keep the reference and iput it later.
>   		 */
>   		iput(old_inode);
>   		old_inode = inode;
> @@ -1137,11 +1122,9 @@ static void wait_sb_inodes(struct super_block *sb)
>
>   		cond_resched();
>
> -		spin_lock(&inode_lock);
>   		spin_lock(&sb->s_inodes_lock);
>   	}
>   	spin_unlock(&sb->s_inodes_lock);
> -	spin_unlock(&inode_lock);
>   	iput(old_inode);
>   }
>
> @@ -1244,33 +1227,9 @@ int write_inode_now(struct inode *inode, int sync)
>   		wbc.nr_to_write = 0;
>
>   	might_sleep();
> -	spin_lock(&inode_lock);
> -	ret = writeback_single_inode(inode, &wbc);
> -	spin_unlock(&inode_lock);
> +	ret = sync_inode(inode, &wbc);
>   	if (sync)
>   		inode_sync_wait(inode);
>   	return ret;
>   }
>   EXPORT_SYMBOL(write_inode_now);
> -
> -/**
> - * sync_inode - write an inode and its pages to disk.
> - * @inode: the inode to sync
> - * @wbc: controls the writeback mode
> - *
> - * sync_inode() will write an inode and its pages to disk.  It will also
> - * correctly update the inode on its superblock's dirty inode lists and will
> - * update inode->i_state.
> - *
> - * The caller must have a ref on the inode.
> - */
> -int sync_inode(struct inode *inode, struct writeback_control *wbc)
> -{
> -	int ret;
> -
> -	spin_lock(&inode_lock);
> -	ret = writeback_single_inode(inode, wbc);
> -	spin_unlock(&inode_lock);
> -	return ret;
> -}
> -EXPORT_SYMBOL(sync_inode);
> diff --git a/fs/inode.c b/fs/inode.c
> index b33b57c..0046ea8 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -44,24 +44,20 @@
>    *   inode_lru, i_lru
>    *
>    * Lock orders
> - * inode_lock
> - *   inode hash bucket lock
> - *     inode->i_lock
> + * inode hash bucket lock
> + *   inode->i_lock
>    *
> - * inode_lock
> - *   sb inode lock
> - *     inode_lru_lock
> - *       wb->b_lock
> - *         inode->i_lock
> + * sb inode lock
> + *   inode_lru_lock
> + *     wb->b_lock
> + *       inode->i_lock
>    *
> - * inode_lock
> - *   wb->b_lock
> - *     sb_lock (pin sb for writeback)
> - *     inode->i_lock
> + * wb->b_lock
> + *   sb_lock (pin sb for writeback)
> + *   inode->i_lock
>    *
> - * inode_lock
> - *   inode_lru
> - *     inode->i_lock
> + * inode_lru
> + *   inode->i_lock
>    */
>   /*
>    * This is needed for the following functions:
> @@ -114,14 +110,6 @@ static LIST_HEAD(inode_lru);
>   static DEFINE_SPINLOCK(inode_lru_lock);
>
>   /*
> - * A simple spinlock to protect the list manipulations.
> - *
> - * NOTE! You also have to own the lock if you change
> - * the i_state of an inode while it is in use..
> - */
> -DEFINE_SPINLOCK(inode_lock);
> -
> -/*
>    * iprune_sem provides exclusion between the kswapd or try_to_free_pages
>    * icache shrinking path, and the umount path.  Without this exclusion,
>    * by the time prune_icache calls iput for the inode whose pages it has
> @@ -355,11 +343,9 @@ static void init_once(void *foo)
>   void iref(struct inode *inode)
>   {
>   	WARN_ON(inode->i_ref < 1);
> -	spin_lock(&inode_lock);
>   	spin_lock(&inode->i_lock);
>   	inode->i_ref++;
>   	spin_unlock(&inode->i_lock);
> -	spin_unlock(&inode_lock);
>   }
>   EXPORT_SYMBOL_GPL(iref);
>
> @@ -387,28 +373,21 @@ void inode_lru_list_del(struct inode *inode)
>   	spin_unlock(&inode_lru_lock);
>   }
>
> -static void __inode_sb_list_add(struct inode *inode)
> -{
> -	struct super_block *sb = inode->i_sb;
> -
> -	spin_lock(&sb->s_inodes_lock);
> -	list_add(&inode->i_sb_list, &sb->s_inodes);
> -	spin_unlock(&sb->s_inodes_lock);
> -}
> -
>   /**
>    * inode_sb_list_add - add inode to the superblock list of inodes
>    * @inode: inode to add
>    */
>   void inode_sb_list_add(struct inode *inode)
>   {
> -	spin_lock(&inode_lock);
> -	__inode_sb_list_add(inode);
> -	spin_unlock(&inode_lock);
> +	struct super_block *sb = inode->i_sb;
> +
> +	spin_lock(&sb->s_inodes_lock);
> +	list_add(&inode->i_sb_list, &sb->s_inodes);
> +	spin_unlock(&sb->s_inodes_lock);
>   }
>   EXPORT_SYMBOL_GPL(inode_sb_list_add);
>
> -static void __inode_sb_list_del(struct inode *inode)
> +static void inode_sb_list_del(struct inode *inode)
>   {
>   	struct super_block *sb = inode->i_sb;
>
> @@ -439,22 +418,17 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
>   {
>   	struct hlist_bl_head *b = inode_hashtable + hash(inode->i_sb, hashval);
>
> -	spin_lock(&inode_lock);
>   	hlist_bl_lock(b);
>   	hlist_bl_add_head(&inode->i_hash, b);
>   	hlist_bl_unlock(b);
> -	spin_unlock(&inode_lock);
>   }
>   EXPORT_SYMBOL(__insert_inode_hash);
>
>   /**
> - *	__remove_inode_hash - remove an inode from the hash
> + *	remove_inode_hash - remove an inode from the hash
>    *	@inode: inode to unhash
> - *
> - *	Remove an inode from the superblock. inode->i_lock must be
> - *	held.
>    */
> -static void __remove_inode_hash(struct inode *inode)
> +void remove_inode_hash(struct inode *inode)
>   {
>   	struct hlist_bl_head *b;
>
> @@ -463,19 +437,6 @@ static void __remove_inode_hash(struct inode *inode)
>   	hlist_bl_del_init(&inode->i_hash);
>   	hlist_bl_unlock(b);
>   }
> -
> -/**
> - *	remove_inode_hash - remove an inode from the hash
> - *	@inode: inode to unhash
> - *
> - *	Remove an inode from the superblock.
> - */
> -void remove_inode_hash(struct inode *inode)
> -{
> -	spin_lock(&inode_lock);
> -	__remove_inode_hash(inode);
> -	spin_unlock(&inode_lock);
> -}
>   EXPORT_SYMBOL(remove_inode_hash);
>
>   void end_writeback(struct inode *inode)
> @@ -526,10 +487,8 @@ static void dispose_list(struct list_head *head)
>
>   		evict(inode);
>
> -		spin_lock(&inode_lock);
> -		__remove_inode_hash(inode);
> -		__inode_sb_list_del(inode);
> -		spin_unlock(&inode_lock);
> +		remove_inode_hash(inode);
> +		inode_sb_list_del(inode);
>
>   		spin_lock(&inode->i_lock);
>   		wake_up_bit(&inode->i_state, __I_NEW);
> @@ -558,7 +517,6 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
>   		 * change during umount anymore, and because iprune_sem keeps
>   		 * shrink_icache_memory() away.
>   		 */
> -		cond_resched_lock(&inode_lock);
>   		cond_resched_lock(&sb->s_inodes_lock);
>
>   		next = next->next;
> @@ -609,12 +567,10 @@ int invalidate_inodes(struct super_block *sb)
>   	LIST_HEAD(throw_away);
>
>   	down_write(&iprune_sem);
> -	spin_lock(&inode_lock);
>   	spin_lock(&sb->s_inodes_lock);
>   	fsnotify_unmount_inodes(&sb->s_inodes);
>   	busy = invalidate_list(sb, &sb->s_inodes, &throw_away);
>   	spin_unlock(&sb->s_inodes_lock);
> -	spin_unlock(&inode_lock);
>
>   	dispose_list(&throw_away);
>   	up_write(&iprune_sem);
> @@ -625,7 +581,7 @@ EXPORT_SYMBOL(invalidate_inodes);
>
>   /*
>    * Scan `goal' inodes on the unused list for freeable ones. They are moved to a
> - * temporary list and then are freed outside inode_lock by dispose_list().
> + * temporary list and then are freed outside locks by dispose_list().
>    *
>    * Any inodes which are pinned purely because of attached pagecache have their
>    * pagecache removed.  If the inode has metadata buffers attached to
> @@ -646,7 +602,6 @@ static void prune_icache(int nr_to_scan)
>   	unsigned long reap = 0;
>
>   	down_read(&iprune_sem);
> -	spin_lock(&inode_lock);
>   	spin_lock(&inode_lru_lock);
>   	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
>   		struct inode *inode;
> @@ -679,7 +634,6 @@ static void prune_icache(int nr_to_scan)
>   			inode->i_ref++;
>   			spin_unlock(&inode->i_lock);
>   			spin_unlock(&inode_lru_lock);
> -			spin_unlock(&inode_lock);
>   			if (remove_inode_buffers(inode))
>   				reap += invalidate_mapping_pages(&inode->i_data,
>   								0, -1);
> @@ -695,7 +649,6 @@ static void prune_icache(int nr_to_scan)
>   			 * the I_REFERENCED flag on the next pass and do the
>   			 * same. Either way, we won't spin on it in this loop.
>   			 */
> -			spin_lock(&inode_lock);
>   			spin_lock(&inode_lru_lock);
>   			continue;
>   		}
> @@ -718,7 +671,6 @@ static void prune_icache(int nr_to_scan)
>   	else
>   		__count_vm_events(PGINODESTEAL, reap);
>   	spin_unlock(&inode_lru_lock);
> -	spin_unlock(&inode_lock);
>
>   	dispose_list(&freeable);
>   	up_read(&iprune_sem);
> @@ -868,19 +820,15 @@ struct inode *new_inode(struct super_block *sb)
>   {
>   	struct inode *inode;
>
> -	spin_lock_prefetch(&inode_lock);
> -
>   	inode = alloc_inode(sb);
>   	if (inode) {
> -		spin_lock(&inode_lock);
>   		/*
>   		 * set the inode state before we make the inode accessible to
>   		 * the outside world.
>   		 */
>   		inode->i_ino = get_next_ino();
>   		inode->i_state = 0;
> -		__inode_sb_list_add(inode);
> -		spin_unlock(&inode_lock);
> +		inode_sb_list_add(inode);
>   	}
>   	return inode;
>   }
> @@ -938,7 +886,6 @@ static struct inode *get_new_inode(struct super_block *sb,
>   	if (inode) {
>   		struct inode *old;
>
> -		spin_lock(&inode_lock);
>   		hlist_bl_lock(b);
>   		/* We released the lock, so.. */
>   		old = find_inode(sb, b, test, data);
> @@ -953,8 +900,7 @@ static struct inode *get_new_inode(struct super_block *sb,
>   			inode->i_state = I_NEW;
>   			hlist_bl_add_head(&inode->i_hash, b);
>   			hlist_bl_unlock(b);
> -			__inode_sb_list_add(inode);
> -			spin_unlock(&inode_lock);
> +			inode_sb_list_add(inode);
>
>   			/* Return the locked inode with I_NEW set, the
>   			 * caller is responsible for filling in the contents
> @@ -968,7 +914,6 @@ static struct inode *get_new_inode(struct super_block *sb,
>   		 * allocated.
>   		 */
>   		hlist_bl_unlock(b);
> -		spin_unlock(&inode_lock);
>   		destroy_inode(inode);
>   		inode = old;
>   		wait_on_inode(inode);
> @@ -977,7 +922,6 @@ static struct inode *get_new_inode(struct super_block *sb,
>
>   set_failed:
>   	hlist_bl_unlock(b);
> -	spin_unlock(&inode_lock);
>   	destroy_inode(inode);
>   	return NULL;
>   }
> @@ -995,7 +939,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
>   	if (inode) {
>   		struct inode *old;
>
> -		spin_lock(&inode_lock);
>   		hlist_bl_lock(b);
>   		/* We released the lock, so.. */
>   		old = find_inode_fast(sb, b, ino);
> @@ -1008,8 +951,7 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
>   			inode->i_state = I_NEW;
>   			hlist_bl_add_head(&inode->i_hash, b);
>   			hlist_bl_unlock(b);
> -			__inode_sb_list_add(inode);
> -			spin_unlock(&inode_lock);
> +			inode_sb_list_add(inode);
>
>   			/* Return the locked inode with I_NEW set, the
>   			 * caller is responsible for filling in the contents
> @@ -1023,7 +965,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
>   		 * allocated.
>   		 */
>   		hlist_bl_unlock(b);
> -		spin_unlock(&inode_lock);
>   		destroy_inode(inode);
>   		inode = old;
>   		wait_on_inode(inode);
> @@ -1081,7 +1022,6 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
>   	static unsigned int counter;
>   	ino_t res;
>
> -	spin_lock(&inode_lock);
>   	spin_lock(&iunique_lock);
>   	do {
>   		if (counter <= max_reserved)
> @@ -1089,7 +1029,6 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
>   		res = counter++;
>   	} while (!test_inode_iunique(sb, res));
>   	spin_unlock(&iunique_lock);
> -	spin_unlock(&inode_lock);
>
>   	return res;
>   }
> @@ -1097,7 +1036,6 @@ EXPORT_SYMBOL(iunique);
>
>   struct inode *igrab(struct inode *inode)
>   {
> -	spin_lock(&inode_lock);
>   	spin_lock(&inode->i_lock);
>   	if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
>   		inode->i_ref++;
> @@ -1111,7 +1049,6 @@ struct inode *igrab(struct inode *inode)
>   		 */
>   		inode = NULL;
>   	}
> -	spin_unlock(&inode_lock);
>   	return inode;
>   }
>   EXPORT_SYMBOL(igrab);
> @@ -1133,7 +1070,7 @@ EXPORT_SYMBOL(igrab);
>    *
>    * Otherwise NULL is returned.
>    *
> - * Note, @test is called with the inode_lock held, so can't sleep.
> + * Note, @test is called with the i_lock held, so can't sleep.
>    */
>   static struct inode *ifind(struct super_block *sb,
>   		struct hlist_bl_head *b,
> @@ -1142,11 +1079,9 @@ static struct inode *ifind(struct super_block *sb,
>   {
>   	struct inode *inode;
>
> -	spin_lock(&inode_lock);
>   	hlist_bl_lock(b);
>   	inode = find_inode(sb, b, test, data);
>   	hlist_bl_unlock(b);
> -	spin_unlock(&inode_lock);
>
>   	if (inode && likely(wait))
>   		wait_on_inode(inode);
> @@ -1174,11 +1109,9 @@ static struct inode *ifind_fast(struct super_block *sb,
>   {
>   	struct inode *inode;
>
> -	spin_lock(&inode_lock);
>   	hlist_bl_lock(b);
>   	inode = find_inode_fast(sb, b, ino);
>   	hlist_bl_unlock(b);
> -	spin_unlock(&inode_lock);
>
>   	if (inode)
>   		wait_on_inode(inode);
> @@ -1204,7 +1137,7 @@ static struct inode *ifind_fast(struct super_block *sb,
>    *
>    * Otherwise NULL is returned.
>    *
> - * Note, @test is called with the inode_lock held, so can't sleep.
> + * Note, @test is called with the i_lock held, so can't sleep.
>    */
>   struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
>   		int (*test)(struct inode *, void *), void *data)
> @@ -1232,7 +1165,7 @@ EXPORT_SYMBOL(ilookup5_nowait);
>    *
>    * Otherwise NULL is returned.
>    *
> - * Note, @test is called with the inode_lock held, so can't sleep.
> + * Note, @test is called with the i_lock held, so can't sleep.
>    */
>   struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
>   		int (*test)(struct inode *, void *), void *data)
> @@ -1283,7 +1216,7 @@ EXPORT_SYMBOL(ilookup);
>    * inode and this is returned locked, hashed, and with the I_NEW flag set. The
>    * file system gets to fill it in before unlocking it via unlock_new_inode().
>    *
> - * Note both @test and @set are called with the inode_lock held, so can't sleep.
> + * Note both @test and @set are called with the i_lock held, so can't sleep.
As mentioned before, maybe this is more consistent with the other notes
below:
NOTE: Both @test and @set are called with the i_lock held, so can't sleep.

[...]
> - * NOTE: This function runs with the inode_lock spin lock held so it is not
> + * NOTE: This function runs with the i_lock spin lock held so it is not
>    * allowed to sleep.
>    */
>   int ntfs_test_inode(struct inode *vi, ntfs_attr *na)
> @@ -98,7 +98,7 @@ int ntfs_test_inode(struct inode *vi, ntfs_attr *na)
>    *
>    * Return 0 on success and -errno on error.
>    *
> - * NOTE: This function runs with the inode_lock spin lock held so it is not
> + * NOTE: This function runs with the i_lock spin lock held so it is not
>    * allowed to sleep. (Hence the GFP_ATOMIC allocation.)
>    */
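
As an aside, a minimal sketch of how a caller might use the newly
exported sync_inode() (illustrative only, mirroring what
write_inode_now() does in this patch; the caller must hold a reference
as the comment above requires):

	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_ALL,	/* data integrity: wait on writeout */
		.nr_to_write	= LONG_MAX,
		.range_start	= 0,
		.range_end	= LLONG_MAX,
	};
	int err;

	/* we hold a reference on inode (or it has I_WILL_FREE set) */
	err = sync_inode(inode, &wbc);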

Have fun
Christian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 04/21] fs: Implement lazy LRU updates for inodes
  2010-10-21  0:49 ` [PATCH 04/21] fs: Implement lazy LRU updates for inodes Dave Chinner
@ 2010-10-21  2:14   ` Christian Stroetmann
  2010-10-21 10:07   ` Nick Piggin
  2010-10-23  9:32   ` Al Viro
  2 siblings, 0 replies; 58+ messages in thread
From: Christian Stroetmann @ 2010-10-21  2:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

  Just two typos

On the 21.10.2010 02:49, Dave Chinner wrote:
> From: Nick Piggin<npiggin@suse.de>
>
> Convert the inode LRU to use lazy updates to reduce lock and
> cacheline traffic.  We avoid moving inodes around in the LRU list
> during iget/iput operations so these frequent operations don't need
> to access the LRUs. Instead, we defer the refcount checks to
> reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
> reclaim that iget has touched the inode in the past. This means that
> only reclaim should be touching the LRU with any frequency, hence
> significantly reducing lock acquisitions and the amount of contention
> on LRU updates.
>
> This also removes the inode_in_use list, which means we now only
> have one list for tracking the inode LRU status. This makes it much
> simpler to split out the LRU list operations under its own lock.
>
> Signed-off-by: Nick Piggin<npiggin@suse.de>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig<hch@lst.de>
> ---
>   fs/fs-writeback.c         |   14 +++---
>   fs/inode.c                |  111 +++++++++++++++++++++++++++-----------------
>   fs/internal.h             |    6 +++
>   include/linux/fs.h        |   13 +++---
>   include/linux/writeback.h |    1 -
>   5 files changed, 88 insertions(+), 57 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 58a95b7..33e9857 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -408,16 +408,16 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>   			 * completion.
>   			 */
>   			redirty_tail(inode);
> -		} else if (atomic_read(&inode->i_count)) {
> -			/*
> -			 * The inode is clean, inuse
> -			 */
> -			list_move(&inode->i_list, &inode_in_use);
>   		} else {
>   			/*
> -			 * The inode is clean, unused
> +			 * The inode is clean. If it is unused, then make sure
> +			 * that it is put on the LRU correctly as iput_final()
> +			 * does not move dirty inodes to the LRU and dirty
> +			 * inodes are removed from the LRU during scanning.
>   			 */
> -			list_move(&inode->i_list, &inode_unused);
> +			list_del_init(&inode->i_list);
> +			if (!atomic_read(&inode->i_count))
> +				inode_lru_list_add(inode);
>   		}
>   	}
>   	inode_sync_complete(inode);
> diff --git a/fs/inode.c b/fs/inode.c
> index b3b6a4b..f47ec71 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -72,7 +72,6 @@ static unsigned int i_hash_shift __read_mostly;
>    * allowing for low-overhead inode sync() operations.
>    */
>
> -LIST_HEAD(inode_in_use);
>   LIST_HEAD(inode_unused);
>   static struct hlist_head *inode_hashtable __read_mostly;
>
> @@ -291,6 +290,7 @@ void inode_init_once(struct inode *inode)
>   	INIT_HLIST_NODE(&inode->i_hash);
>   	INIT_LIST_HEAD(&inode->i_dentry);
>   	INIT_LIST_HEAD(&inode->i_devices);
> +	INIT_LIST_HEAD(&inode->i_list);
>   	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
>   	spin_lock_init(&inode->i_data.tree_lock);
>   	spin_lock_init(&inode->i_data.i_mmap_lock);
> @@ -317,12 +317,21 @@ static void init_once(void *foo)
>    */
>   void __iget(struct inode *inode)
>   {
> -	if (atomic_inc_return(&inode->i_count) != 1)
> -		return;
> +	atomic_inc(&inode->i_count);
> +}
> +
> +void inode_lru_list_add(struct inode *inode)
> +{
> +	list_add(&inode->i_list, &inode_unused);
> +	percpu_counter_inc(&nr_inodes_unused);
> +}
>
> -	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
> -		list_move(&inode->i_list, &inode_in_use);
> -	percpu_counter_dec(&nr_inodes_unused);
> +void inode_lru_list_del(struct inode *inode)
> +{
> +	if (!list_empty(&inode->i_list)) {
> +		list_del_init(&inode->i_list);
> +		percpu_counter_dec(&nr_inodes_unused);
> +	}
>   }
>
>   void end_writeback(struct inode *inode)
> @@ -367,7 +376,7 @@ static void dispose_list(struct list_head *head)
>   		struct inode *inode;
>
>   		inode = list_first_entry(head, struct inode, i_list);
> -		list_del(&inode->i_list);
> +		list_del_init(&inode->i_list);
>
>   		evict(inode);
>
> @@ -410,9 +419,9 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
>   			continue;
>   		invalidate_inode_buffers(inode);
>   		if (!atomic_read(&inode->i_count)) {
> -			list_move(&inode->i_list, dispose);
>   			WARN_ON(inode->i_state & I_NEW);
>   			inode->i_state |= I_FREEING;
> +			list_move(&inode->i_list, dispose);
>   			percpu_counter_dec(&nr_inodes_unused);
>   			continue;
>   		}
> @@ -447,31 +456,21 @@ int invalidate_inodes(struct super_block *sb)
>   }
>   EXPORT_SYMBOL(invalidate_inodes);
>
> -static int can_unuse(struct inode *inode)
> -{
> -	if (inode->i_state)
> -		return 0;
> -	if (inode_has_buffers(inode))
> -		return 0;
> -	if (atomic_read(&inode->i_count))
> -		return 0;
> -	if (inode->i_data.nrpages)
> -		return 0;
> -	return 1;
> -}
> -
>   /*
> - * Scan `goal' inodes on the unused list for freeable ones. They are moved to
> - * a temporary list and then are freed outside inode_lock by dispose_list().
> + * Scan `goal' inodes on the unused list for freeable ones. They are moved to a
> + * temporary list and then are freed outside inode_lock by dispose_list().
>    *
>    * Any inodes which are pinned purely because of attached pagecache have their
> - * pagecache removed.  We expect the final iput() on that inode to add it to
> - * the front of the inode_unused list.  So look for it there and if the
> - * inode is still freeable, proceed.  The right inode is found 99.9% of the
> - * time in testing on a 4-way.
> + * pagecache removed.  If the inode has metadata buffers attached to
> + * mapping->private_list then try to remove them.
>    *
> - * If the inode has metadata buffers attached to mapping->private_list then
> - * try to remove them.
> + * If the inode has the I_REFERENCED flag set, it means that it has been used
, then it means
> + * recently - the flag is set in iput_final(). When we encounter such an inode,
> + * clear the flag and move it to the back of the LRU so it gets another pass
> + * through the LRU before it gets reclaimed. This is necessary because of the
> + * fact we are doing lazy LRU updates to minimise lock contention so the LRU
> + * does not have strict ordering. Hence we don't want to reclaim inodes with
> + * this flag set because they are the inodes that are out of order.
>    */
>   static void prune_icache(int nr_to_scan)
>   {
> @@ -489,8 +488,21 @@ static void prune_icache(int nr_to_scan)
>
>   		inode = list_entry(inode_unused.prev, struct inode, i_list);
>
> -		if (inode->i_state || atomic_read(&inode->i_count)) {
> +		/*
> +		 * Referenced or dirty inodes are still in use. Give them
> +		 * another pass through the LRU as we cannot reclaim them now.
> +		 */
> +		if (atomic_read(&inode->i_count) ||
> +		    (inode->i_state & ~I_REFERENCED)) {
> +			list_del_init(&inode->i_list);
> +			percpu_counter_dec(&nr_inodes_unused);
> +			continue;
> +		}
> +
> +		/* recently referenced inodes get one more pass */
> +		if (inode->i_state & I_REFERENCED) {
>   			list_move(&inode->i_list, &inode_unused);
> +			inode->i_state &= ~I_REFERENCED;
>   			continue;
>   		}
>   		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> @@ -500,13 +512,19 @@ static void prune_icache(int nr_to_scan)
>   				reap += invalidate_mapping_pages(&inode->i_data,
>   								0, -1);
>   			iput(inode);
> -			spin_lock(&inode_lock);
>
> -			if (inode != list_entry(inode_unused.next,
> -						struct inode, i_list))
> -				continue;	/* wrong inode or list_empty */
> -			if (!can_unuse(inode))
> -				continue;
> +			/*
> +			 * Rather than try to determine if we can still use the
> +			 * inode after calling iput(), leave the inode where it
> +			 * is on the LRU. If we race with another reclaimer,
, then that reclaimer
> +			 * that reclaimer will either see a reference count
> +			 * or the I_REFERENCED flag, and move the inode to the
> +			 * back of the LRU. If we don't race, then we'll see
> +			 * the I_REFERENCED flag on the next pass and do the
> +			 * same. Either way, we won't spin on it in this loop.
> +			 */
> +			spin_lock(&inode_lock);
> +			continue;
[...]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 17/21] fs: protect wake_up_inode with inode->i_lock
  2010-10-21  0:49 ` [PATCH 17/21] fs: protect wake_up_inode with inode->i_lock Dave Chinner
@ 2010-10-21  2:17   ` Christoph Hellwig
  2010-10-21 13:16     ` Nick Piggin
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2010-10-21  2:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

Looks good, and thanks for documenting unlock_new_inode.


Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Inode Lock Scalability V7 (was V6)
  2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
                   ` (20 preceding siblings ...)
  2010-10-21  0:49 ` [PATCH 21/21] fs: do not assign default i_ino in new_inode Dave Chinner
@ 2010-10-21  5:04 ` Dave Chinner
  2010-10-21 13:20   ` Nick Piggin
  21 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2010-10-21  5:04 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Folks,

I just pushed out a new version of this patchset, which is pretty much a
rebase on 2.6.36. I'm not going to post all the patches as not much changed -
mostly comments. The changelog for the update is:

Version 7:
- rebase on 2.6.36
- iref() to inode->i_ref++ conversion in fs/nfs/write.c
- removed stray inode hash removal call from patches it didn't
  belong in.
- cleaned up another stale remove_inode_hash comment.
- cleaned up more comments as reported by Christian Stroetmann
  <stroetmann@ontolinux.com>.

--

The following changes since commit f6f94e2ab1b33f0082ac22d71f66385a60d8157f:

  Linux 2.6.36 (2010-10-20 13:30:22 -0700)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git inode-scale

Christoph Hellwig (4):
      fs: Stop abusing find_inode_fast in iunique
      fs: move i_ref increments into find_inode/find_inode_fast
      fs: remove inode_add_to_list/__inode_add_to_list
      fs: do not assign default i_ino in new_inode

Dave Chinner (13):
      fs: switch bdev inode bdi's correctly
      fs: Convert nr_inodes and nr_unused to per-cpu counters
      fs: Clean up inode reference counting
      exofs: use iput() for inode reference count decrements
      fs: rework icount to be a locked variable
      fs: Factor inode hash operations into functions
      fs: Introduce per-bucket inode hash locks
      fs: add a per-superblock lock for the inode list
      fs: split locking of inode writeback and LRU lists
      fs: Protect inode->i_state with the inode->i_lock
      fs: protect wake_up_inode with inode->i_lock
      fs: icache remove inode_lock
      fs: Reduce inode I_FREEING and factor inode disposal

Eric Dumazet (1):
      fs: introduce a per-cpu last_ino allocator

Nick Piggin (3):
      kernel: add bl_list
      fs: Implement lazy LRU updates for inodes
      fs: inode split IO and LRU lists

 Documentation/filesystems/Locking        |    2 +-
 Documentation/filesystems/porting        |   16 +-
 Documentation/filesystems/vfs.txt        |   16 +-
 arch/powerpc/platforms/cell/spufs/file.c |    2 +-
 drivers/infiniband/hw/ipath/ipath_fs.c   |    1 +
 drivers/infiniband/hw/qib/qib_fs.c       |    1 +
 drivers/misc/ibmasm/ibmasmfs.c           |    1 +
 drivers/oprofile/oprofilefs.c            |    1 +
 drivers/usb/core/inode.c                 |    1 +
 drivers/usb/gadget/f_fs.c                |    1 +
 drivers/usb/gadget/inode.c               |    1 +
 fs/9p/vfs_inode.c                        |    5 +-
 fs/affs/inode.c                          |    2 +-
 fs/afs/dir.c                             |    2 +-
 fs/anon_inodes.c                         |    8 +-
 fs/autofs4/inode.c                       |    1 +
 fs/bfs/dir.c                             |    2 +-
 fs/binfmt_misc.c                         |    1 +
 fs/block_dev.c                           |   42 +-
 fs/btrfs/inode.c                         |   18 +-
 fs/buffer.c                              |    2 +-
 fs/ceph/mds_client.c                     |    2 +-
 fs/cifs/inode.c                          |    2 +-
 fs/coda/dir.c                            |    2 +-
 fs/configfs/inode.c                      |    1 +
 fs/debugfs/inode.c                       |    1 +
 fs/drop_caches.c                         |   19 +-
 fs/exofs/inode.c                         |    6 +-
 fs/exofs/namei.c                         |    2 +-
 fs/ext2/namei.c                          |    2 +-
 fs/ext3/ialloc.c                         |    4 +-
 fs/ext3/namei.c                          |    2 +-
 fs/ext4/ialloc.c                         |    4 +-
 fs/ext4/mballoc.c                        |    1 +
 fs/ext4/namei.c                          |    2 +-
 fs/freevxfs/vxfs_inode.c                 |    1 +
 fs/fs-writeback.c                        |  235 +++++----
 fs/fuse/control.c                        |    1 +
 fs/gfs2/ops_inode.c                      |    2 +-
 fs/hfs/hfs_fs.h                          |    2 +-
 fs/hfs/inode.c                           |    2 +-
 fs/hfsplus/dir.c                         |    2 +-
 fs/hfsplus/hfsplus_fs.h                  |    2 +-
 fs/hfsplus/inode.c                       |    2 +-
 fs/hpfs/inode.c                          |    2 +-
 fs/hugetlbfs/inode.c                     |    1 +
 fs/inode.c                               |  852 +++++++++++++++++++-----------
 fs/internal.h                            |   11 +
 fs/jffs2/dir.c                           |    4 +-
 fs/jfs/jfs_txnmgr.c                      |    2 +-
 fs/jfs/namei.c                           |    2 +-
 fs/libfs.c                               |    2 +-
 fs/locks.c                               |    2 +-
 fs/logfs/dir.c                           |    2 +-
 fs/logfs/inode.c                         |    2 +-
 fs/logfs/readwrite.c                     |    2 +-
 fs/minix/namei.c                         |    2 +-
 fs/namei.c                               |    2 +-
 fs/nfs/dir.c                             |    2 +-
 fs/nfs/getroot.c                         |    2 +-
 fs/nfs/inode.c                           |    4 +-
 fs/nfs/nfs4state.c                       |    2 +-
 fs/nfs/write.c                           |    2 +-
 fs/nilfs2/gcdat.c                        |    1 +
 fs/nilfs2/gcinode.c                      |   22 +-
 fs/nilfs2/mdt.c                          |    5 +-
 fs/nilfs2/namei.c                        |    2 +-
 fs/nilfs2/segment.c                      |    2 +-
 fs/nilfs2/the_nilfs.h                    |    2 +-
 fs/notify/inode_mark.c                   |   46 +-
 fs/notify/mark.c                         |    1 -
 fs/notify/vfsmount_mark.c                |    1 -
 fs/ntfs/inode.c                          |   10 +-
 fs/ntfs/super.c                          |    6 +-
 fs/ocfs2/dlmfs/dlmfs.c                   |    2 +
 fs/ocfs2/inode.c                         |    2 +-
 fs/ocfs2/namei.c                         |    2 +-
 fs/pipe.c                                |    2 +
 fs/proc/base.c                           |    2 +
 fs/proc/proc_sysctl.c                    |    2 +
 fs/quota/dquot.c                         |   32 +-
 fs/ramfs/inode.c                         |    1 +
 fs/reiserfs/namei.c                      |    2 +-
 fs/reiserfs/stree.c                      |    2 +-
 fs/reiserfs/xattr.c                      |    2 +-
 fs/smbfs/inode.c                         |    2 +-
 fs/super.c                               |    1 +
 fs/sysv/namei.c                          |    2 +-
 fs/ubifs/dir.c                           |    2 +-
 fs/ubifs/super.c                         |    2 +-
 fs/udf/inode.c                           |    2 +-
 fs/udf/namei.c                           |    2 +-
 fs/ufs/namei.c                           |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c               |    1 +
 fs/xfs/linux-2.6/xfs_iops.c              |    6 +-
 fs/xfs/linux-2.6/xfs_trace.h             |    2 +-
 fs/xfs/xfs_inode.h                       |    3 +-
 include/linux/backing-dev.h              |    3 +
 include/linux/fs.h                       |   43 +-
 include/linux/list_bl.h                  |  146 +++++
 include/linux/poison.h                   |    2 +
 include/linux/writeback.h                |    4 -
 ipc/mqueue.c                             |    3 +-
 kernel/cgroup.c                          |    1 +
 kernel/futex.c                           |    2 +-
 kernel/sysctl.c                          |    4 +-
 mm/backing-dev.c                         |   28 +-
 mm/filemap.c                             |    6 +-
 mm/rmap.c                                |    6 +-
 mm/shmem.c                               |    7 +-
 net/socket.c                             |    3 +-
 net/sunrpc/rpc_pipe.c                    |    1 +
 security/inode.c                         |    1 +
 security/selinux/selinuxfs.c             |    1 +
 114 files changed, 1134 insertions(+), 628 deletions(-)
 create mode 100644 include/linux/list_bl.h

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 04/21] fs: Implement lazy LRU updates for inodes
  2010-10-21  0:49 ` [PATCH 04/21] fs: Implement lazy LRU updates for inodes Dave Chinner
  2010-10-21  2:14   ` Christian Stroetmann
@ 2010-10-21 10:07   ` Nick Piggin
  2010-10-21 12:22     ` Christoph Hellwig
  2010-10-23  9:32   ` Al Viro
  2 siblings, 1 reply; 58+ messages in thread
From: Nick Piggin @ 2010-10-21 10:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

On Thu, Oct 21, 2010 at 11:49:29AM +1100, Dave Chinner wrote:
>  		} else {
>  			/*
> -			 * The inode is clean, unused
> +			 * The inode is clean. If it is unused, then make sure
> +			 * that it is put on the LRU correctly as iput_final()
> +			 * does not move dirty inodes to the LRU and dirty
> +			 * inodes are removed from the LRU during scanning.
>  			 */
> -			list_move(&inode->i_list, &inode_unused);
> +			list_del_init(&inode->i_list);
> +			if (!atomic_read(&inode->i_count))
> +				inode_lru_list_add(inode);

This "optimisation" is surely wrong. How could we have no reference
on the inode at this point?


> -static int can_unuse(struct inode *inode)
> -{
> -	if (inode->i_state)
> -		return 0;
> -	if (inode_has_buffers(inode))
> -		return 0;
> -	if (atomic_read(&inode->i_count))
> -		return 0;
> -	if (inode->i_data.nrpages)
> -		return 0;
> -	return 1;
> -}

Avoiding the reclaim optimisation? As I said, I noticed some increased
scanning in heavy reclaim from removing this.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 04/21] fs: Implement lazy LRU updates for inodes
  2010-10-21 10:07   ` Nick Piggin
@ 2010-10-21 12:22     ` Christoph Hellwig
  0 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2010-10-21 12:22 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Oct 21, 2010 at 09:07:06PM +1100, Nick Piggin wrote:
> On Thu, Oct 21, 2010 at 11:49:29AM +1100, Dave Chinner wrote:
> >  		} else {
> >  			/*
> > -			 * The inode is clean, unused
> > +			 * The inode is clean. If it is unused, then make sure
> > +			 * that it is put on the LRU correctly as iput_final()
> > +			 * does not move dirty inodes to the LRU and dirty
> > +			 * inodes are removed from the LRU during scanning.
> >  			 */
> > -			list_move(&inode->i_list, &inode_unused);
> > +			list_del_init(&inode->i_list);
> > +			if (!atomic_read(&inode->i_count))
> > +				inode_lru_list_add(inode);
> 
> This "optimisation" is surely wrong. How could we have no reference
> on the inode at this point?

Good question.  iput_final does so for unlinked inodes or umount,
and that should be about it as it's the only place setting I_WILL_FREE
and we require that for a 0 refcount at the beginning of
writeback_single_inode.  But adding it to the LRU in that case
is rather pointless as we will remove it again a little later.

So I think the assignment can be safely removed, but I'd rather do it in
a separate, properly documented patch than hide it somewhere
unrelated.  That patch could however go towards the beginning of the
series to make things easier.
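
For illustration, the clean-inode branch of writeback_single_inode()
might then reduce to something like this (a sketch of the suggested
cleanup, not a posted patch):

	} else {
		/*
		 * The inode is clean. Unreferenced inodes only reach here
		 * on the I_FREEING/I_WILL_FREE paths and are disposed of
		 * shortly afterwards, so there is no need to put them back
		 * on the LRU; just drop the inode from the writeback lists.
		 */
		list_del_init(&inode->i_list);
	}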

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 17/21] fs: protect wake_up_inode with inode->i_lock
  2010-10-21  2:17   ` Christoph Hellwig
@ 2010-10-21 13:16     ` Nick Piggin
  0 siblings, 0 replies; 58+ messages in thread
From: Nick Piggin @ 2010-10-21 13:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Wed, Oct 20, 2010 at 10:17:22PM -0400, Christoph Hellwig wrote:
> Looks good, and thanks for documenting unlock_new_inode.

You wanted an example of how the irregular locking requires
handling of new concurrency situations? This is another good
one.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Inode Lock Scalability V7 (was V6)
  2010-10-21  5:04 ` Inode Lock Scalability V7 (was V6) Dave Chinner
@ 2010-10-21 13:20   ` Nick Piggin
  2010-10-21 23:52     ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Nick Piggin @ 2010-10-21 13:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

It seems we are at an impasse.

It doesn't help that you are ignoring the most important concerns
I've been raising with these patches: the locking model and the
patch split-up. I'd really like not to get deadlocked on this (haha),
so please let's try to debate the points. I've tried to reply to each
point others have questioned me about; whether I agree or not, I've
given reasons.

So, you know my objections to this approach already... I've got an
update on my patchset coming, so I'd like to get some discussion
going. I've cut out some of the stuff from mine so we don't get
bogged down in boring things like per-zone locking or changing of
the hash table data structure.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 08/21] fs: rework icount to be a locked variable
  2010-10-21  0:49 ` [PATCH 08/21] fs: rework icount to be a locked variable Dave Chinner
@ 2010-10-21 19:40   ` Al Viro
  2010-10-21 22:32     ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Al Viro @ 2010-10-21 19:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

On Thu, Oct 21, 2010 at 11:49:33AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The inode reference count is currently an atomic variable so that it
> can be sampled/modified outside the inode_lock. However, the
> inode_lock is still needed to synchronise the final reference count
> and checks against the inode state.
> 
> To avoid needing the protection of the inode lock, protect the inode
> reference count with the per-inode i_lock and convert it to a normal
> variable. To avoid existing out-of-tree code accidentally compiling
> against the new method, rename the i_count field to i_ref. This is
> relatively straightforward as there are limited external references
> to the i_count field remaining.

BTW, the same thing as with Nick's set - a separate patch for the "clone the
reference to an inode we are already holding" helper, at the front of the queue.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 08/21] fs: rework icount to be a locked variable
  2010-10-21 19:40   ` Al Viro
@ 2010-10-21 22:32     ` Dave Chinner
  0 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2010-10-21 22:32 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel

On Thu, Oct 21, 2010 at 08:40:45PM +0100, Al Viro wrote:
> On Thu, Oct 21, 2010 at 11:49:33AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The inode reference count is currently an atomic variable so that it
> > can be sampled/modified outside the inode_lock. However, the
> > inode_lock is still needed to synchronise the final reference count
> > and checks against the inode state.
> > 
> > To avoid needing the protection of the inode lock, protect the inode
> > reference count with the per-inode i_lock and convert it to a normal
> > variable. To avoid existing out-of-tree code accidentally compiling
> > against the new method, rename the i_count field to i_ref. This is
> > relatively straightforward as there are limited external references
> > to the i_count field remaining.
> 
> BTW, the same thing as with Nick's set - separate patch for "clone the
> reference to inode we are already holding" helper, in front of queue.

Isn't that already done by patch 6, "fs: Clean up inode reference
counting"? That patch does the conversion of stand-alone
atomic_inc(&inode->i_count) into iref(inode), not this one.
Maybe I've misunderstood what you are wanting to be changed with
this patch - can you clarify, Al?

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Inode Lock Scalability V7 (was V6)
  2010-10-21 13:20   ` Nick Piggin
@ 2010-10-21 23:52     ` Dave Chinner
  2010-10-22  0:45       ` Nick Piggin
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2010-10-21 23:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 12:20:34AM +1100, Nick Piggin wrote:
> It seems we are at an impasse.
> 
> It doesn't help that you are ignoring the most important concerns
> I've been raising with these patches. The locking model and the
> patch split up. I'd really like not to get deadlocked on this (haha),
> so please let's try to debate points. I've tried to reply to each
> point others have questioned me about, whether I agree or not I've
> given reasons.
> 
> So, you know my objections to this approach already... I've got an
> update on my patchset coming, so I'd like to get some discussion
> going. I've cut out some of the stuff from mine so we don't get
> bogged down in boring things like per-zone locking or changing of
> the hash table data structure.

No point appealing to me, Nick, it's not me that you have to
convince. As I've said from the start, all I really care about is
getting the code into a shape that is acceptable to the reviewers.

As such, I don't think there is anything _new_ to discuss - I'd
simply be rehashing the same points I've already made to you over
the past couple of weeks. That has done nothing to change your mind
about anything, so it strikes me as a continuing exercise in
futility.

We have different ways of achieving the same thing, which have their
pros and cons, and I think that the reviewers of the patch sets are
aware of this. The reviewers are the people that will make the
decision on the best way to proceed, and I'll follow their lead
exactly as I have been since I started this process.

So, if you want to continue arguing that your locking model is the
One True Way, you need to convince the reviewers of the fact, not
me. I'm just going to continue to address the concerns of the
reviewers and fix bugs that are reported until a decision is made one
way or the other....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Inode Lock Scalability V7 (was V6)
  2010-10-21 23:52     ` Dave Chinner
@ 2010-10-22  0:45       ` Nick Piggin
  2010-10-22  2:20         ` Al Viro
  0 siblings, 1 reply; 58+ messages in thread
From: Nick Piggin @ 2010-10-22  0:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 10:52:27AM +1100, Dave Chinner wrote:
> On Fri, Oct 22, 2010 at 12:20:34AM +1100, Nick Piggin wrote:
> > It seems we are at an impasse.
> > 
> > It doesn't help that you are ignoring the most important concerns
> > I've been raising with these patches. The locking model and the
> > patch split up. I'd really like not to get deadlocked on this (haha),
> > so please let's try to debate points. I've tried to reply to each
> > point others have questioned me about, whether I agree or not I've
> > given reasons.
> > 
> > So, you know my objections to this approach already... I've got an
> > update on my patchset coming, so I'd like to get some discussion
> > going. I've cut out some of the stuff from mine so we don't get
> > bogged down in boring things like per-zone locking or changing of
> > the hash table data structure.
> 
> No point appealing to me, Nick, it's not me that you have to
> convince. As I've said from the start, all I really care about is
> getting the code into shape that is acceptable to the reviewers.

"The reviewers"? I _am_ a reviewer of your code, and I've made
some points, and you've ignored them.

When you've reviewed my code and had comments, I've responded to
them every time. I might not have agreed, but I tried to give you
answers.

 
> As such, I don't think there is anything _new_ to discuss - I'd
> simply be rehashing the same points I've already made to you over
> the past couple of weeks. That has done nothing to change you mind
> about anything, so it strikes me as a continuing exercise in
> futility.

No you didn't make these points to me over the past couple of weeks.
Specifically, do you agree or disagree about these points:
- introducing new concurrency situations from not having a single lock
  for an inode's icache state is a negative?
- if yes, then what aspect of your locking model justifies and outweighs
  it?
- before the inode_lock is lifted, locking changes should be as simple
  and verifiable as absolutely possible, so that bisection has less
  chance of hitting the inode_lock wall?
- further locking changes making the locking less regular and more
  complex should be done in small steps, after inode_lock is lifted

And I have kept saying I would welcome your idea to reduce i_lock width
in a small incremental patch. I still haven't figured out quite what
is so important that can't be achieved in simpler ways (like rcu, or
using a separate inode lock).

 
> We have different ways of acheiving the same thing which have their
> pros and cons, and I think that the reviewers of the patch sets are
> aware of this. The reviewers are the people that will make the
> decision on the best way to proceed, and I'll follow their lead
> exactly as I have been since I started this process.

Everybody is a reviewer. You need to be able to defend your work.

 
> So, if you want to continue arguing that your locking model is the
> One True Way, you need to convince the reviewers of the fact, not
> me.

Yes of course I want to continue to argue that, because that is what
my opinion is. What I need from you is to know why you believe yours
is better, that I might concede I was wrong; point out where you are
wrong; or show that one can be extended to have the positive aspects
of another etc.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-21  0:49 ` [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock Dave Chinner
@ 2010-10-22  1:56   ` Al Viro
  2010-10-22  2:26     ` Nick Piggin
  2010-10-22  3:14     ` Dave Chinner
  0 siblings, 2 replies; 58+ messages in thread
From: Al Viro @ 2010-10-22  1:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

On Thu, Oct 21, 2010 at 11:49:41AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We currently protect the per-inode state flags with the inode_lock.
> Using a global lock to protect per-object state is overkill when we
> could use a per-inode lock to protect the state.  Use the
> inode->i_lock for this, and wrap all the state changes and checks
> with the inode->i_lock.

> @@ -1424,22 +1449,30 @@ static void iput_final(struct inode *inode)
>  	if (!drop) {
>  		if (sb->s_flags & MS_ACTIVE) {
>  			inode->i_state |= I_REFERENCED;
> -			if (!(inode->i_state & (I_DIRTY|I_SYNC)))
> +			if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
> +			    list_empty(&inode->i_lru)) {
> +				spin_unlock(&inode->i_lock);
>  				inode_lru_list_add(inode);
> +				return;

Sorry, that's broken.  Leaving aside leaking inode_lock here (you remove
taking it later anyway), this is racy.

Look: inode is hashed.  It's alive and well.  It just has no references outside
of the lists.  Right?  Now, what's to stop another CPU from
	a) looking it up in icache
	b) doing unlink()
	c) dropping the last reference
	d) freeing the sucker
... just as you are about to call inode_lru_list_add() here?

For paths in iput() where we do set I_FREEING/I_WILL_FREE it's perfectly
fine to drop all locks once that's done.  Inode is ours, nobody will pick
it and we are free to do as we wish.

For the path that leaves the inode alive and hashed - sorry, can't do.
AFAICS, unlike hash, wb and sb locks, lru lock should nest *inside*
->i_lock.  And yes, it means trylock in prune_icache(), with "put it in
the end of the list for one more spin" if we fail.  In that case it's
really cleaner that way.
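
Concretely, the keep-it-cached tail of iput_final() under that nesting
would look something like this (just a sketch, with names taken from
the series):

	/* entered with inode->i_lock held and i_ref just dropped to zero */
	inode->i_state |= I_REFERENCED;
	if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
	    list_empty(&inode->i_lru)) {
		spin_lock(&inode_lru_lock);	/* lru lock nests inside i_lock */
		list_add(&inode->i_lru, &inode_lru);
		spin_unlock(&inode_lru_lock);
	}
	spin_unlock(&inode->i_lock);	/* only now can another CPU free it */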

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Inode Lock Scalability V7 (was V6)
  2010-10-22  0:45       ` Nick Piggin
@ 2010-10-22  2:20         ` Al Viro
  2010-10-22  2:34           ` Nick Piggin
  0 siblings, 1 reply; 58+ messages in thread
From: Al Viro @ 2010-10-22  2:20 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 11:45:40AM +1100, Nick Piggin wrote:

> No you didn't make these points to me over the past couple of weeks.
> Specifically, do you agree or disagree about these points:
> - introducing new concurrency situations from not having a single lock
>   for an inode's icache state is a negative?

I disagree.

> And I have kept saying I would welcome your idea to reduce i_lock width
> in a small incremental patch. I still haven't figured out quite what
> is so important that can't be achieved in simpler ways (like rcu, or
> using a seperate inode lock).

No, it's not a small incremental change.  It's your locking order being wrong;
the natural one is
	[hash, wb, sb] > ->i_lock > [lru]
and that's one hell of a difference compared to what you are doing.

Look:
	* iput_final() should happen under ->i_lock
	* if it leaves the inode alive, that's it; we can put it on LRU list
since lru lock nests inside ->i_lock
	* if it decides to kill the inode, it sets I_FREEING or I_WILL_FREE
before dropping ->i_lock.  Once that's done, the inode is ours and nobody
will pick it through the lists.  We can release ->i_lock and then do what's
needed.  Safely.
	* accesses of ->i_state are under ->i_lock, including the switchover
from I_WILL_FREE to I_FREEING
	* walkers of the sb, wb and hash lists can grab ->i_lock at will;
it nests inside their locks.
	* prune_icache() grabs lru lock, then trylocks ->i_lock on the
first element.  If trylock fails, we just give inode another spin through
the list by moving it to the tail; if it doesn't, we are holding ->i_lock
and can proceed safely.

What you seem to miss is that there are very few places accessing inode through
the lists (i.e. via pointers that do not contribute to refcount) and absolute
majority already checks for I_FREEING/I_WILL_FREE, refusing to pick such
inodes.  It's not an accidental subtle property of the code, it's bloody
fundamental.

As I've said, I've no religious problems with trylocks; we *do* need them for
prune_icache() to get a sane locking scheme.  But the way you put ->i_lock on
the top of the hierarchy is simply wrong.
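
A sketch of the prune_icache() side of that hierarchy (names assumed
from the series, not a tested patch):

	spin_lock(&inode_lru_lock);
	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
		struct inode *inode;

		if (list_empty(&inode_lru))
			break;
		inode = list_entry(inode_lru.prev, struct inode, i_lru);

		if (!spin_trylock(&inode->i_lock)) {
			/* contended: give it one more spin through the list */
			list_move(&inode->i_lru, &inode_lru);
			continue;
		}
		/*
		 * ->i_lock is held, so i_ref and i_state are stable and
		 * the usual checks (refcount, I_REFERENCED, buffers,
		 * pages) can be made before disposing of the inode.
		 */
		spin_unlock(&inode->i_lock);
	}
	spin_unlock(&inode_lru_lock);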

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-22  1:56   ` Al Viro
@ 2010-10-22  2:26     ` Nick Piggin
  2010-10-22  3:14     ` Dave Chinner
  1 sibling, 0 replies; 58+ messages in thread
From: Nick Piggin @ 2010-10-22  2:26 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 02:56:22AM +0100, Al Viro wrote:
> On Thu, Oct 21, 2010 at 11:49:41AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> For the path that leaves the inode alive and hashed - sorry, can't do.
> AFAICS, unlike hash, wb and sb locks, lru lock should nest *inside*
> ->i_lock.  And yes, it means trylock in prune_icache(), with "put it in
> the end of the list for one more spin" if we fail.  In that case it's
> really cleaner that way.

I still find that nesting them all inside i_lock is a much more
natural way to protect the inode's icache state.

Typically, we either have an icache data structure that we want to
look up one inode from, or we have an inode that we need to do one
*or more* icache state manipulations on.

When it involves putting the inode on or off different data structures,
holding i_lock over the sequence allows us to lift inode_lock without
much further thought.

One downside is the trylocks -- most can subsequently be removed quite
easily by doing data structure lookups without locks. I prefer this
approach, as it makes it easy to hold i_lock over the entirety of
something like iput_final, at least for the initial lock break work.
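
As a sketch of what such a lockless lookup might look like (assuming an
RCU-walkable bl-list iterator like the one the fuller patchset adds;
the names are illustrative):

	rcu_read_lock();
	hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
		if (inode->i_ino != ino || inode->i_sb != sb)
			continue;
		spin_lock(&inode->i_lock);	/* i_lock stays at the top */
		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			continue;	/* being torn down, skip it */
		}
		inode->i_ref++;		/* take a reference under i_lock */
		spin_unlock(&inode->i_lock);
		break;
	}
	rcu_read_unlock();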

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Inode Lock Scalability V7 (was V6)
  2010-10-22  2:20         ` Al Viro
@ 2010-10-22  2:34           ` Nick Piggin
  2010-10-22  2:41             ` Nick Piggin
  2010-10-22  3:07             ` Al Viro
  0 siblings, 2 replies; 58+ messages in thread
From: Nick Piggin @ 2010-10-22  2:34 UTC (permalink / raw)
  To: Al Viro; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 03:20:10AM +0100, Al Viro wrote:
> On Fri, Oct 22, 2010 at 11:45:40AM +1100, Nick Piggin wrote:
> 
> > No you didn't make these points to me over the past couple of weeks.
> > Specifically, do you agree or disagree about these points:
> > - introducing new concurrency situations from not having a single lock
> >   for an inode's icache state is a negative?
> 
> I disagree.
> 
> > And I have kept saying I would welcome your idea to reduce i_lock width
> > in a small incremental patch. I still haven't figured out quite what
> > is so important that can't be achieved in simpler ways (like rcu, or
> > using a seperate inode lock).
> 
> No, it's not a small incremental.  It's your locking order being wrong;
> the natural one is
> 	[hash, wb, sb] > ->i_lock > [lru]
> and that's one hell of a difference compared to what you are doing.

There is no reason it can't be moved to that lock order (or made to
avoid new concurrency situations), but the point is that the first
lock breaking pass does not do that.


> Look:
> 	* iput_final() should happen under ->i_lock
> 	* if it leaves the inode alive, that's it; we can put it on LRU list
> since lru lock nests inside ->i_lock
> 	* if it decides to kill the inode, it sets I_FREEING or I_WILL_FREE
> before dropping ->i_lock.  Once that's done, the inode is ours and nobody
> will pick it through the lists.  We can release ->i_lock and then do what's
> needed.  Safely.
> 	* accesses of ->i_state are under ->i_lock, including the switchover
> from I_WILL_FREE to I_FREEING
> 	* walkers of the sb, wb and hash lists can grab ->i_lock at will;
> it nests inside their locks.

What about if it is going on or off multiple data structures while
the inode is live, like inode_lock can protect today. Such as putting
it on the hash and sb list.


> 	* prune_icache() grabs lru lock, then trylocks ->i_lock on the
> first element.  If trylock fails, we just give inode another spin through
> the list by moving it to the tail; if it doesn't, we are holding ->i_lock
> and can proceed safely.
> 
> What you seem to miss is that there are very few places accessing inode through
> the lists (i.e. via pointers that do not contribute to refcount) and absolute
> majority already checks for I_FREEING/I_WILL_FREE, refusing to pick such
> inodes.  It's not an accidental subtle property of the code, it's bloody
> fundamental.

I didn't miss that, and I agree that at the point of my initial lock
break up, the locking is "wrong". Whether you correct it by changing
the lock ordering or by using RCU to do lookups is something I want to
debate further.

I think it is natural to be able to lock the inode and have it lock the
icache state.


> As I've said, I've no religious problems with trylocks; we *do* need them for
> prune_icache() to get a sane locking scheme.  But the way you put ->i_lock on
> the top of the hierarchy is simply wrong.

(well that could be avoided with RCU too)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Inode Lock Scalability V7 (was V6)
  2010-10-22  2:34           ` Nick Piggin
@ 2010-10-22  2:41             ` Nick Piggin
  2010-10-22  2:48               ` Nick Piggin
  2010-10-22  3:07             ` Al Viro
  1 sibling, 1 reply; 58+ messages in thread
From: Nick Piggin @ 2010-10-22  2:41 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Al Viro, Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 01:34:44PM +1100, Nick Piggin wrote:
> On Fri, Oct 22, 2010 at 03:20:10AM +0100, Al Viro wrote:
> > On Fri, Oct 22, 2010 at 11:45:40AM +1100, Nick Piggin wrote:
> > majority already checks for I_FREEING/I_WILL_FREE, refusing to pick such
> > inodes.  It's not an accidental subtle property of the code, it's bloody
> > fundamental.
> 
> I didn't miss that, and I agree that at the point of my initial lock
> break up, the locking is "wrong". Whether you correct it by changing
> the lock ordering or by using RCU to do lookups is something I want to
> debate further.
> 
> I think it is natural to be able to lock the inode and have it lock the
> icache state.

Importantly, to be able to manipulate the icache state in any number of
steps, under a consistent lock. Exactly like we have with inode_lock
today.

Stepping away from that, and adding code to handle new concurrencies
before inode_lock can be lifted, is just wrong.

The locking in my lock break patch is ugly and wrong, yes. But it is
always an intermediate step. I want to argue that with RCU inode work
*anyway*, there is not much point to reducing the strength of the
i_lock property because locking can be cleaned up nicely and still
keep i_lock ~= inode_lock (for a single inode).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Inode Lock Scalability V7 (was V6)
  2010-10-22  2:41             ` Nick Piggin
@ 2010-10-22  2:48               ` Nick Piggin
  2010-10-22  3:12                 ` Al Viro
  0 siblings, 1 reply; 58+ messages in thread
From: Nick Piggin @ 2010-10-22  2:48 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Al Viro, Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 01:41:52PM +1100, Nick Piggin wrote:
> The locking in my lock break patch is ugly and wrong, yes. But it is
> always an intermediate step. I want to argue that with RCU inode work
> *anyway*, there is not much point to reducing the strength of the
> i_lock property because locking can be cleaned up nicely and still
> keep i_lock ~= inode_lock (for a single inode).

The other thing is that with RCU, the idea of locking an object in
the data structure with a per object lock actually *is* much more
natural. It's hard to do it properly with just a big data structure
lock.

If I want to take a reference to an inode from a data structure, how
to do it with RCU?

rcu_read_lock();
list_for_each_entry(inode, head, i_hash) {
  spin_lock(&big_lock); /* oops, might as well not even use RCU then */
  if (!unhashed) {
    iget();
  }
  spin_unlock(&big_lock);
}
rcu_read_unlock();

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Inode Lock Scalability V7 (was V6)
  2010-10-22  2:34           ` Nick Piggin
  2010-10-22  2:41             ` Nick Piggin
@ 2010-10-22  3:07             ` Al Viro
  2010-10-22  4:46               ` Nick Piggin
  1 sibling, 1 reply; 58+ messages in thread
From: Al Viro @ 2010-10-22  3:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 01:34:44PM +1100, Nick Piggin wrote:

> > 	* walkers of the sb, wb and hash lists can grab ->i_lock at will;
> > it nests inside their locks.
> 
> What about if it is going on or off multiple data structures while
> the inode is live, like inode_lock can protect today. Such as putting
> it on the hash and sb list.

Look at the code.  You are overengineering it.  We do *not* need a framework
for messing with these lists in arbitrary ways.  Where would we need to
do that to an inode we don't hold a reference to or had placed I_FREEING
on and would need i_lock held by caller?  Even assuming that we need to
keep [present in hash, present on sb list] in sync (which I seriously doubt),
we can bloody well grab both locks before i_lock.

> > inodes.  It's not an accidental subtle property of the code, it's bloody
> > fundamental.
> 
> I didn't miss that, and I agree that at the point of my initial lock
> break up, the locking is "wrong". Whether you correct it by changing
> the lock ordering or by using RCU to do lookups is something I want to
> debate further.
> 
> I think it is natural to be able to lock the inode and have it lock the
> icache state.

Code outside of fs/inode.c and fs/fs-writeback.c generally has no business
looking at the full icache state, period.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Inode Lock Scalability V7 (was V6)
  2010-10-22  2:48               ` Nick Piggin
@ 2010-10-22  3:12                 ` Al Viro
  2010-10-22  4:48                   ` Nick Piggin
  0 siblings, 1 reply; 58+ messages in thread
From: Al Viro @ 2010-10-22  3:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 01:48:34PM +1100, Nick Piggin wrote:
> On Fri, Oct 22, 2010 at 01:41:52PM +1100, Nick Piggin wrote:
> > The locking in my lock break patch is ugly and wrong, yes. But it is
> > always an intermediate step. I want to argue that with RCU inode work
> > *anyway*, there is not much point to reducing the strength of the
> > i_lock property because locking can be cleaned up nicely and still
> > keep i_lock ~= inode_lock (for a single inode).
> 
> The other thing is that with RCU, the idea of locking an object in
> the data structure with a per object lock actually *is* much more
> natural. It's hard to do it properly with just a big data structure
> lock.
> 
> If I want to take a reference to an inode from a data structure, how
> to do it with RCU?
> 
> rcu_read_lock();
> list_for_each_entry(inode, head, i_hash) {
>   spin_lock(&big_lock); /* oops, might as well not even use RCU then */
>   if (!unhashed) {
>     iget();
>   }
>   spin_unlock(&big_lock);
> }
> rcu_read_unlock();

Huh?  Why the hell does it have to be a big lock?  You grab ->i_lock,
then look at the damn thing.  You also grab it on eviction from the
list - *inside* the lock used for serializing the write access to
your RCU list.
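
I.e. something like this (sketch only, untested; the 2.6.36
hlist_for_each_entry_rcu() signature, with node a struct hlist_node *,
and __iget() standing in for whatever the reference bump becomes):

	rcu_read_lock();
	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
		if (inode->i_ino != ino || inode->i_sb != sb)
			continue;
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
			/* being evicted under the hash write lock; skip */
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		break;
	}
	rcu_read_unlock();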

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-22  1:56   ` Al Viro
  2010-10-22  2:26     ` Nick Piggin
@ 2010-10-22  3:14     ` Dave Chinner
  2010-10-22 10:37       ` Al Viro
  2010-10-24  2:18       ` Nick Piggin
  1 sibling, 2 replies; 58+ messages in thread
From: Dave Chinner @ 2010-10-22  3:14 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 02:56:22AM +0100, Al Viro wrote:
> On Thu, Oct 21, 2010 at 11:49:41AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > We currently protect the per-inode state flags with the inode_lock.
> > Using a global lock to protect per-object state is overkill when we
> > could use a per-inode lock to protect the state.  Use the
> > inode->i_lock for this, and wrap all the state changes and checks
> > with the inode->i_lock.
> 
> > @@ -1424,22 +1449,30 @@ static void iput_final(struct inode *inode)
> >  	if (!drop) {
> >  		if (sb->s_flags & MS_ACTIVE) {
> >  			inode->i_state |= I_REFERENCED;
> > -			if (!(inode->i_state & (I_DIRTY|I_SYNC)))
> > +			if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
> > +			    list_empty(&inode->i_lru)) {
> > +				spin_unlock(&inode->i_lock);
> >  				inode_lru_list_add(inode);
> > +				return;
> 
> Sorry, that's broken.  Leaving aside leaking inode_lock here (you remove
> taking it later anyway), this is racy.
> 
> Look: inode is hashed.  It's alive and well.  It just has no references outside
> of the lists.  Right?  Now, what's to stop another CPU from
> 	a) looking it up in icache
> 	b) doing unlink()
> 	c) dropping the last reference
> 	d) freeing the sucker
> ... just as you are about to call inode_lru_list_add() here?

Nothing - I hadn't considered that as a potential inode freeing
race window, so my assumption that it was OK to do this is wrong. It
definitely needs fixing.

> For paths in iput() where we do set I_FREEING/I_WILL_FREE it's perfectly
> fine to drop all locks once that's done.  Inode is ours, nobody will pick
> it and we are free to do as we wish.

Yes, I knew that bit - I went wrong making the same assumptions
through the unused path.

> For the path that leaves the inode alive and hashed - sorry, can't do.
> AFAICS, unlike hash, wb and sb locks, lru lock should nest *inside*
> ->i_lock.  And yes, it means trylock in prune_icache(), with "put it in
> the end of the list for one more spin" if we fail.  In that case it's
> really cleaner that way.

I left it outside i_lock to be consistent with all the new
locks being introduced. fs/fs-writeback.c::sync_inode() has a
similar inode life-time wart when adding clean inodes to the lru
which I was never really happy about. I suspect it has similar
problems.

I had a bit of a think about playing refcounting games to avoid
doing the LRU add without holding the i_lock (to avoid the above
freeing problem), but that ends up with more complex and messy iput/
iput_final interactions.  Likewise, adding trylocks into the lru
list add sites doesn't solve the inode-goes-away-after-i_lock-
is-dropped problems.  A couple of other ideas I had also drowned in
complexity at birth.

AFAICT, moving the inode_lru_lock inside i_lock doesn't affect the
locking order of anything else, and I agree that putting a single
trylock in the prune_icache loop is definitely cleaner than anything
else I've been able to think of that retains the current locking
order. It will also remove the wart in sync_inode().

So, I'll swallow my rhetoric and agree with you that inode_lru_lock
inside the i_lock is the natural and easiest way to nest these
locks. I'll rework the series to do this. 
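
The prune_icache() loop then becomes something like this (sketch only,
untested; list and lock names as used in this series):

	spin_lock(&inode_lru_lock);
	while (nr_to_scan-- > 0 && !list_empty(&inode_lru)) {
		struct inode *inode;

		inode = list_first_entry(&inode_lru, struct inode, i_lru);
		if (!spin_trylock(&inode->i_lock)) {
			/* contended: give it one more spin on the list */
			list_move_tail(&inode->i_lru, &inode_lru);
			continue;
		}
		/* referenced/dirty checks and eviction under i_lock... */
		spin_unlock(&inode->i_lock);
	}
	spin_unlock(&inode_lru_lock);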

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Inode Lock Scalability V7 (was V6)
  2010-10-22  3:07             ` Al Viro
@ 2010-10-22  4:46               ` Nick Piggin
  2010-10-22  5:01                 ` Nick Piggin
  0 siblings, 1 reply; 58+ messages in thread
From: Nick Piggin @ 2010-10-22  4:46 UTC (permalink / raw)
  To: Al Viro; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 04:07:28AM +0100, Al Viro wrote:
> On Fri, Oct 22, 2010 at 01:34:44PM +1100, Nick Piggin wrote:
> 
> > > 	* walkers of the sb, wb and hash lists can grab ->i_lock at will;
> > > it nests inside their locks.
> > 
> > What about if it is going on or off multiple data structures while
> > the inode is live, like inode_lock can protect today. Such as putting
> > it on the hash and sb list.
> 
> Look at the code.  You are overengineering it.  We do *not* need a framework
> for messing with these lists in arbitrary ways.  Where would we need to
> do that to an inode we don't hold a reference to or had placed I_FREEING

Look, my point is that I believe it is an easier step to get from
inode_lock to i_lock, and then from there we can go wild.

What is your criteria for a particular lock ordering being "natural"
versus not? In almost all cases we have

[stuff with data structure] -> [stuff with inode]
and
[stuff with inode] -> [stuff with data structure]

So neither is inherently more natural, I think. So it comes down to
how the code fits together and works.

The difficulty with inode_lock breaking is not the data structures.
We know how to lock and modify them. The hardest part is verifying
that a particular inode has no new, unhandled concurrency introduced
to it (eg. the particular concurrency you pointed out in Dave's patch
just now). Agree?

So in that case, I think it is much more natural to be able to take
an inode and with i_lock, cover it from all icache state concurrency.
I object to it being called overengineering -- it's actually just a
big hammer on the inode, compared with fiddling with more complex
rules.


> on and would need i_lock held by caller?  Even assuming that we need to
> keep [present in hash, present on sb list] in sync (which I seriously doubt),
> we can bloody well grab both locks before i_lock.

I'm not saying there is. Most of the problems would be between a
particular inode's state and its membership on one of the lists.
However, with my patches, I *don't care* if there is an issue there
or not. It simply doesn't matter because it has the same protection
as inode_lock at that point.

If you want to micro optimise it, change lock orders around, and
open more concurrency, that is easily possible to do after my patches
lift inode_lock. If you do all the changes *before* inode_lock removal,
then it's not bisectable at all.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Inode Lock Scalability V7 (was V6)
  2010-10-22  3:12                 ` Al Viro
@ 2010-10-22  4:48                   ` Nick Piggin
  0 siblings, 0 replies; 58+ messages in thread
From: Nick Piggin @ 2010-10-22  4:48 UTC (permalink / raw)
  To: Al Viro; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 04:12:11AM +0100, Al Viro wrote:
> On Fri, Oct 22, 2010 at 01:48:34PM +1100, Nick Piggin wrote:
> > On Fri, Oct 22, 2010 at 01:41:52PM +1100, Nick Piggin wrote:
> > > The locking in my lock break patch is ugly and wrong, yes. But it is
> > > always an intermediate step. I want to argue that with RCU inode work
> > > *anyway*, there is not much point to reducing the strength of the
> > > i_lock property because locking can be cleaned up nicely and still
> > > keep i_lock ~= inode_lock (for a single inode).
> > 
> > The other thing is that with RCU, the idea of locking an object in
> > the data structure with a per object lock actually *is* much more
> > natural. It's hard to do it properly with just a big data structure
> > lock.
> > 
> > If I want to take a reference to an inode from a data structure, how
> > to do it with RCU?
> > 
> > rcu_read_lock();
> > list_for_each_entry(inode, head, i_hash) {
> >   spin_lock(&big_lock); /* oops, might as well not even use RCU then */
> >   if (!unhashed) {
> >     iget();
> >   }
> >   spin_unlock(&big_lock);
> > }
> > rcu_read_unlock();
> 
> Huh?  Why the hell does it have to be a big lock?  You grab ->i_lock,
> then look at the damn thing.  You also grab it on eviction from the
> list - *inside* the lock used for serializing the write access to
> your RCU list.

That sucks: it requires more acquiring and dropping of i_lock, and
it hits single-threaded performance. I looked at that.

But it also loses the i_lock = inode_lock property.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Inode Lock Scalability V7 (was V6)
  2010-10-22  4:46               ` Nick Piggin
@ 2010-10-22  5:01                 ` Nick Piggin
  0 siblings, 0 replies; 58+ messages in thread
From: Nick Piggin @ 2010-10-22  5:01 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Al Viro, Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 03:46:57PM +1100, Nick Piggin wrote:
> The difficulty with inode_lock breaking is not the data structures.
> We know how to lock and modify them. The hardest part is verifying
> that a particular inode has no new, unhandled concurrency introduced
> to it (eg. the particular concurrency you pointed out in Dave's patch
> just now). Agree?
> 
> So in that case, I think it is much more natural to be able to take
> an inode and with i_lock, cover it from all icache state concurrency.
> I object to it being called over engineering -- it's actually just a
> big hammer on the inode, compared with fiddling with more complex
> rules.

And yes, being a big hammer, it is actually ugly and clunky for
the first pass.

The intention is always that we can start steps to streamline it
now. I had been looking at switching lock orders around, reducing
i_lock coverage etc, but I found that with RCU, things got a lot
cleaner without reducing i_lock coverage. With RCU the important
part of the locking shifts back, from the read side, to the write
side. Not surprisingly, this made my lock ordering more natural, whereas it
does nothing for a lock ordering which is the other way around.

And I think you give too little credit to i_lock being used to
protect all i_state. Sure it's not strictly needed, and we could
start breaking bits and pieces. But it works really nicely, and
is maintainable and easier to convince yourself of correctness.

Have an inode? Want to do something to it? Take i_lock. You're
done. We don't _need_ to ever think about concurrent modifications.
Really, this is the "big dumb" approach IMO. Breaking things out
finer than a per-inode basis is premature optimisation at this point.
Note that my series never precluded such incremental changes, in
fact it makes them easier.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-22  3:14     ` Dave Chinner
@ 2010-10-22 10:37       ` Al Viro
  2010-10-22 11:40         ` Christoph Hellwig
  2010-10-23 21:37         ` Al Viro
  2010-10-24  2:18       ` Nick Piggin
  1 sibling, 2 replies; 58+ messages in thread
From: Al Viro @ 2010-10-22 10:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 02:14:31PM +1100, Dave Chinner wrote:

> AFAICT, moving the inode_lru_lock inside i_lock doesn't affect the
> locking order of anything else, and I agree that putting a single
> trylock in the prune_icache loop is definitely cleaner than anything
> else I've been able to think of that retains the current locking
> order. It will also remove the wart in sync_inode().
> 
> So, I'll swallow my rhetoric and agree with you that inode_lru_lock
> inside the i_lock is the natural and easiest way to nest these
> locks. I'll rework the series to do this. 

FWIW, here's what I'd prefer:

* move the trivial parts in front of queue (including exofs fix, etc., etc.)
* make sure that _everything_ walking the lists honors I_FREEING/I_WILL_FREE
as the first step.  We are very close to that already 
* protect all accesses to ->i_state with ->i_lock
* separate lru list from wb
* protect lru list by spinlock nested inside ->i_lock, with trylock in
prune_icache()
* at that point we can rip the inode_lock off the initial part of iput
moving it down to the point after having marked the inode with I_FREEING
at that point we can take hash, etc. out of inode_lock and under locks of
their own one by one.  And that kills inode_lock completely, at which point
the hierarchy is established and we can do the rest (non-atomic refcount,
etc.)

Note that I'd rather leave *all* non-trivialities along the lines of
per-zone vs per-sb for after the hierarchy is done.  I.e. let's start
with really simple "here's the single spinlock for hash, here's the
single spinlock for all sb lists".  If we get that right, the rest will
be localized; let's deal with the skeleton first.

What I'm going to do is to put together a branch with essentially cleanups
and trivial fixes, with both patchsets forked off its tip.  Then move stuff
to common stem, rediffing the branches as we go.  Then see what's left.

One more note: IMO sb list lock is better off inside the hash one; when we
do per-chain hash locks, we'll be better off if we don't have to hold sb
one over the entire chain search.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-22 10:37       ` Al Viro
@ 2010-10-22 11:40         ` Christoph Hellwig
  2010-10-23 21:40           ` Al Viro
  2010-10-23 21:37         ` Al Viro
  1 sibling, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2010-10-22 11:40 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 11:37:05AM +0100, Al Viro wrote:
> One more note: IMO sb list lock is better off inside the hash one; when we
> do per-chain hash locks, we'll be better off if we don't have to hold sb
> one over the entire chain search.

Why would you nest these two at all?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 04/21] fs: Implement lazy LRU updates for inodes
  2010-10-21  0:49 ` [PATCH 04/21] fs: Implement lazy LRU updates for inodes Dave Chinner
  2010-10-21  2:14   ` Christian Stroetmann
  2010-10-21 10:07   ` Nick Piggin
@ 2010-10-23  9:32   ` Al Viro
  2 siblings, 0 replies; 58+ messages in thread
From: Al Viro @ 2010-10-23  9:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

On Thu, Oct 21, 2010 at 11:49:29AM +1100, Dave Chinner wrote:
>  		if (sb->s_flags & MS_ACTIVE) {
> +			inode->i_state |= I_REFERENCED;
> +			if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
> +				list_del_init(&inode->i_list);
> +				inode_lru_list_add(inode);

What if it was on the list already?  Looks like you'll screw the counter...
And yes, you have the solution later in the series (do nothing if the list
isn't empty); it should be here for bisectability's sake...
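
I.e. something like (sketch; matching that later check):

			if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
			    list_empty(&inode->i_list))
				inode_lru_list_add(inode);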

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-22 10:37       ` Al Viro
  2010-10-22 11:40         ` Christoph Hellwig
@ 2010-10-23 21:37         ` Al Viro
  2010-10-24 14:13           ` Christoph Hellwig
  1 sibling, 1 reply; 58+ messages in thread
From: Al Viro @ 2010-10-23 21:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 11:37:05AM +0100, Al Viro wrote:

> What I'm going to do is to put together a branch with essentially cleanups
> and trivial fixes, with both patchsets forked off its tip.  Then move stuff
> to common stem, rediffing the branches as we go.  Then see what's left.

OK, the current (partial) set is in

git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git/ #merge-stem

What remains to be done (I'm about to fall down right now, so that'll have
to wait until tomorrow):

	* writeback_sb_inode() told to ignore I_FREEING ones in addition to
I_NEW and I_WILL_FREE ones it ignores now.  Currently I_FREEING can't be
found there at all, so that'll change nothing.
	* invalidate_inodes() - collect I_FREEING/I_WILL_FREE on a separate
list, then (after we'd evicted the stuff we'd decided to evict) wait until
they get freed by whatever's freeing them already.
	* remove_dquot_ref() - looks like we might be OK with that one being
as it is - it walks sb list of inodes and for things like prune_icache()
the inodes stay on said list all the way through evict(), so it either
doesn't care or it's already broken.  And no, I'm not discounting either
possibility - it needs further analysis.

That's it - after that we'll be OK with dropping and regaining inode_lock
between the moment when we set I_FREEING and removals from the lists.
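
For the writeback item that's just widening the existing skip, e.g.
(sketch):

		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
			requeue_io(inode);
			continue;
		}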

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-22 11:40         ` Christoph Hellwig
@ 2010-10-23 21:40           ` Al Viro
  0 siblings, 0 replies; 58+ messages in thread
From: Al Viro @ 2010-10-23 21:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 07:40:33AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 22, 2010 at 11:37:05AM +0100, Al Viro wrote:
> > One more note: IMO sb list lock is better off inside the hash one; when we
> > do per-chain hash locks, we'll be better off if we don't have to hold sb
> > one over the entire chain search.
> 
> Why would you nest these two at all?

[already said off-list, but since the question had been here...]

Insertion in hash and into sb list.  We *probably* don't care about
atomicity of that pair, but in this case we are dealing with two
topmost locks of hierarchy that might become independent.  That really
can be done as a followup.
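
I.e. insertion takes both list locks before touching either list
(sketch; sb_inode_list_lock as named in Dave's series):

	spin_lock(&inode_hash_lock);		/* or the per-bucket lock */
	spin_lock(&sb_inode_list_lock);		/* sb list nests inside hash */
	hlist_add_head(&inode->i_hash, head);
	list_add(&inode->i_sb_list, &sb->s_inodes);
	spin_unlock(&sb_inode_list_lock);
	spin_unlock(&inode_hash_lock);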

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-22  3:14     ` Dave Chinner
  2010-10-22 10:37       ` Al Viro
@ 2010-10-24  2:18       ` Nick Piggin
  1 sibling, 0 replies; 58+ messages in thread
From: Nick Piggin @ 2010-10-24  2:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, linux-fsdevel, linux-kernel

On Fri, Oct 22, 2010 at 02:14:31PM +1100, Dave Chinner wrote:
> AFAICT, moving the inode_lru_lock inside i_lock doesn't affect the
> locking order of anything else, and I agree that putting a single
> trylock in the prune_icache loop is definitely cleaner than anything
> else I've been able to think of that retains the current locking
> order. It will also remove the wart in sync_inode().
> 
> So, I'll swallow my rhetoric and agree with you that inode_lru_lock
> inside the i_lock is the natural and easiest way to nest these
> locks. I'll rework the series to do this. 

I don't know what's so unclean or wrong about the locking order.  I'm
still leaning much further than Al towards putting locks inside i_lock.

With inode-rcu, we _always_ arrive at the inode _first_, before taking
_any_ other locks (because we either come in via an external reference,
or we can find the inode from any of the data structures using RCU
rather than taking their locks).

[The exception is hash insertion uniqueness enforcement, because we have
 no inode to lock, by definition. But in that case we're OK because the
 new inode we're about to insert has no concurrency on its i_lock so that
 can be initialised locked.]

So what we have is an inode, which we need to do stuff to. That stuff
involves moving it on or off data structures, and updating its refcount
and state, etc.

That suggests the i_lock -> data-structure-lock locking order.

And look at the flexibility -- I could implement that without changing
any other code or ordering in the icache. Sure you can reorder things
around to cope with different locking strategies, but I'm not actually
given a good reason _why_ i_lock first is not the "natural" ordering.
Doing that means we simply don't _need_ to change any ordering or
concurrency rules (although it doesn't prevent changes).
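
Concretely, taking an inode off the hash in that ordering is just
(sketch; the bucket lock name is illustrative):

	spin_lock(&inode->i_lock);
	inode->i_state |= I_FREEING;
	spin_lock(&bucket->lock);	/* data structure lock nests inside */
	hlist_bl_del_init(&inode->i_hash);
	spin_unlock(&bucket->lock);
	spin_unlock(&inode->i_lock);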


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-23 21:37         ` Al Viro
@ 2010-10-24 14:13           ` Christoph Hellwig
  2010-10-24 16:21             ` Christoph Hellwig
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2010-10-24 14:13 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Sat, Oct 23, 2010 at 10:37:52PM +0100, Al Viro wrote:
> 	* invalidate_inodes() - collect I_FREEING/I_WILL_FREE on a separate
> list, then (after we'd evicted the stuff we'd decided to evict) wait until
> they get freed by whatever's freeing them already.

Note that we would only have to do this for the umount case.  For others
it's pretty pointless.

But I think there's a better way to do it, and that's per-sb inode lru
lists.  By adopting the scheme from prune_dcache we'd always have
s_umount exclusive for inode reclaims, and by definition we would not
have any ongoing reclaim when we do enter umount.  It would also allow
us to get rid of iprune_sem and the nasty unsolved locking issues it
causes.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-24 14:13           ` Christoph Hellwig
@ 2010-10-24 16:21             ` Christoph Hellwig
  2010-10-24 19:17               ` Al Viro
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2010-10-24 16:21 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Sun, Oct 24, 2010 at 10:13:10AM -0400, Christoph Hellwig wrote:
> On Sat, Oct 23, 2010 at 10:37:52PM +0100, Al Viro wrote:
> > 	* invalidate_inodes() - collect I_FREEING/I_WILL_FREE on a separate
> > list, then (after we'd evicted the stuff we'd decided to evict) wait until
> > they get freed by whatever's freeing them already.
> 
> Note that we would only have to do this for the umount case.  For others
> it's pretty pointless.

Now that I've looked into it, I think we're basically fine right now.

If we're in umount there should be no other I_FREEING inodes.

 - concurrent prune_icache is prevented by iprune_sem.
 - concurrent other invalidate_inodes should not happen because we're
   in unmount and the filesystem should not be reachable any more,
   and even if it did iprune_sem would protect us.
 - how could a concurrent iput_final happen?  filesystem is not
   accessible anymore, and iput of fs internal inodes is single-threaded
   with the rest of the actual umount process.

So just skipping over I_FREEING inodes here should be fine for
non-umount callers, and for umount we could even do a WARN_ON.
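
I.e. in the walk (sketch; "umount" being a hypothetical flag saying we
were called from generic_shutdown_super()):

		if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
			WARN_ON(umount);
			continue;
		}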

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-24 16:21             ` Christoph Hellwig
@ 2010-10-24 19:17               ` Al Viro
  2010-10-24 20:04                 ` Christoph Hellwig
  0 siblings, 1 reply; 58+ messages in thread
From: Al Viro @ 2010-10-24 19:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Sun, Oct 24, 2010 at 12:21:31PM -0400, Christoph Hellwig wrote:
> On Sun, Oct 24, 2010 at 10:13:10AM -0400, Christoph Hellwig wrote:
> > On Sat, Oct 23, 2010 at 10:37:52PM +0100, Al Viro wrote:
> > > 	* invalidate_inodes() - collect I_FREEING/I_WILL_FREE on a separate
> > > list, then (after we'd evicted the stuff we'd decided to evict) wait until
> > > they get freed by whatever's freeing them already.
> > 
> > Note that we would only have to do this for the umount case.  For others
> > it's pretty pointless.
> 
> Now that I've looked into it I think we basically fine right now.
> 
> If we're in umount there should be no other I_FREEING inodes.
> 
>  - concurrent prune_icache is prevented by iprune_sem.
>  - concurrent other invalidate_inodes should not happen because we're
>    in unmount and the filesystem should not be reachable any more,
>    and even if it did iprune_sem would protect us.
>  - how could a concurrent iput_final happen?  filesystem is not
>    accessible anymore, and iput of fs internal inodes is single-threaded
>    with the rest of the actual umount process.
> 
> So just skipping over I_FREEING inodes here should be fine for
> non-umount callers, and for umount we could even do a WARN_ON.

FWIW, I think we should kill most of invalidate_inodes() callers.  Look:
	* call in generic_shutdown_super() is legitimate.  The first one,
that is.  The second should be replaced with a check for ->s_list being
non-empty.  Note that after the first pass we should have kicked out
everything with zero i_count.  Everything that gets dropped to zero
i_count after that (i.e. during ->put_super()) will be evicted immediately
and won't stay.  I.e. the second call will evict *nothing*; it's just
an overblown way to check if there are any inodes left.
	* call in ext2_remount() is hogwash - we do that with at least
root inode pinned down, so it will fail, along with the remount attempt.
	* ntfs_fill_super() call - no-op.  MS_ACTIVE hasn't been set
yet, so there will be no inodes with zero i_count sitting around.
	* gfs2 calls - same story (no MS_ACTIVE yet in fill_super(),
MS_ACTIVE already removed *and* invalidate_inodes() already called
in gfs2_put_super())
	* smb reconnect logics.  AFAICS, that's complete crap; we *never*
retain inodes on smbfs.  IOW, nothing for invalidate_inodes() to do, other
than evict fsnotify marks.  Which is to say, we are calling the wrong
function there, even assuming that fsnotify should try to work there.
	* finally, __invalidate_device().  Which has a slew of callers of
its own and is *very* different from normal situation.  Here we have
underlying device gone bad.

So I'm going to do the following:
	1) split evict_inodes() off invalidate_inodes() and simplify it.
	2) switch generic_shutdown_super() to that sucker, called once.
	3) kill all calls of invalidate_inodes() except __invalidate_device()
one.
	4) think hard about __invalidate_device() situation.

	evict_inodes() should *not* see any inodes with
I_NEW/I_FREEING/I_WILL_FREE.  Just skip.  It might see I_DIRTY/I_SYNC,
but that's OK - evict_inode() will wait for that.
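
Roughly, for (1) (sketch, untested; still using atomic i_count here,
adjust for the i_ref rework, with dispose_list() doing the actual
eviction):

static void evict_inodes(struct super_block *sb)
{
	struct inode *inode, *next;
	LIST_HEAD(dispose);

	spin_lock(&inode_lock);
	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
		if (atomic_read(&inode->i_count))
			continue;
		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))
			continue;	/* already being freed elsewhere */
		inode->i_state |= I_FREEING;
		/* pull it off the lru (if any) onto a private list */
		list_move(&inode->i_lru, &dispose);
	}
	spin_unlock(&inode_lock);

	dispose_list(&dispose);
}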

	OTOH, invalidate_inodes() from __invalidate_device() can run in
parallel with e.g. final iput().  Currently it's not a problem, but
we'll need to start skipping I_FREEING/I_WILL_FREE ones there if we want
to change iput() locking.

	And yes, iprune_sem is a trouble waiting to happen - one fs stuck
in e.g. truncate_inode_pages() and we are seriously fucked; any non-lazy
umount() will get stuck as well.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-24 19:17               ` Al Viro
@ 2010-10-24 20:04                 ` Christoph Hellwig
  2010-10-24 20:36                   ` Al Viro
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2010-10-24 20:04 UTC (permalink / raw)
  To: Al Viro; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel

On Sun, Oct 24, 2010 at 08:17:35PM +0100, Al Viro wrote:
> 	* call in ext2_remount() is hogwash - we do that with at least
> root inode pinned down, so it will fail, along with the remount attempt.

And having it fail is a good thing.  XIP mode means different file and
address_space operations, which we don't even try to deal with right
now.  Not allowing transitions from/to it is the right thing.

> 	* smb reconnect logics.  AFAICS, that's complete crap; we *never*
> retain inodes on smbfs.  IOW, nothing for invalidate_inodes() to do, other
> than evict fsnotify marks.  Which is to say, we are calling the wrong
> function there, even assuming that fsnotify should try to work there.

I don't think it should mess with fsnotify.  fsnotify_unmount_inodes
assumes it's only called on umount right now, and sends umount
notifications to userspace (see my mail from a few days ago).  So if
you split invalidate_inodes it really should only go into the umount
one.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock
  2010-10-24 20:04                 ` Christoph Hellwig
@ 2010-10-24 20:36                   ` Al Viro
  0 siblings, 0 replies; 58+ messages in thread
From: Al Viro @ 2010-10-24 20:36 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel

On Sun, Oct 24, 2010 at 04:04:32PM -0400, Christoph Hellwig wrote:
> On Sun, Oct 24, 2010 at 08:17:35PM +0100, Al Viro wrote:
> > 	* call in ext2_remount() is hogwash - we do that with at least
> > root inode pinned down, so it will fail, along with the remount attempt.
> 
> And having it fail is a good thing.  XIP mode means different file and
> address_space operations, which we don't even try to deal with right
> now.  Not allowing transitions from/to it is the right thing.

Exactly.  But that should be done without that ridiculous call to
invalidate_inodes() - we should simply fail remount() and be done
with that.

> > 	* smb reconnect logics.  AFAICS, that's complete crap; we *never*
> > retain inodes on smbfs.  IOW, nothing for invalidate_inodes() to do, other
> > than evict fsnotify marks.  Which is to say, we are calling the wrong
> > function there, even assuming that fsnotify should try to work there.
> 
> I don't think it should mess with fsnotify.  fsnotify_unmount_inodes
> assumes it's only called on umount right now, and sends umount
> notifications to userspace (see my mail from a few days ago).  So if
> you split invalidate_inodes it really should only go into the umount
> one.

No, I mean that it's not obvious that fsnotify clients can realistically
work on smbfs in the first place.  I.e. I suspect that fsnotify should
refuse to set marks on that sucker.

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2010-10-24 20:36 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-21  0:49 Inode Lock Scalability V6 Dave Chinner
2010-10-21  0:49 ` [PATCH 01/21] fs: switch bdev inode bdi's correctly Dave Chinner
2010-10-21  0:49 ` [PATCH 02/21] kernel: add bl_list Dave Chinner
2010-10-21  0:49 ` [PATCH 03/21] fs: Convert nr_inodes and nr_unused to per-cpu counters Dave Chinner
2010-10-21  0:49 ` [PATCH 04/21] fs: Implement lazy LRU updates for inodes Dave Chinner
2010-10-21  2:14   ` Christian Stroetmann
2010-10-21 10:07   ` Nick Piggin
2010-10-21 12:22     ` Christoph Hellwig
2010-10-23  9:32   ` Al Viro
2010-10-21  0:49 ` [PATCH 05/21] fs: inode split IO and LRU lists Dave Chinner
2010-10-21  0:49 ` [PATCH 06/21] fs: Clean up inode reference counting Dave Chinner
2010-10-21  1:41   ` Christoph Hellwig
2010-10-21  0:49 ` [PATCH 07/21] exofs: use iput() for inode reference count decrements Dave Chinner
2010-10-21  0:49 ` [PATCH 08/21] fs: rework icount to be a locked variable Dave Chinner
2010-10-21 19:40   ` Al Viro
2010-10-21 22:32     ` Dave Chinner
2010-10-21  0:49 ` [PATCH 09/21] fs: Factor inode hash operations into functions Dave Chinner
2010-10-21  0:49 ` [PATCH 10/21] fs: Stop abusing find_inode_fast in iunique Dave Chinner
2010-10-21  0:49 ` [PATCH 11/21] fs: move i_ref increments into find_inode/find_inode_fast Dave Chinner
2010-10-21  0:49 ` [PATCH 12/21] fs: remove inode_add_to_list/__inode_add_to_list Dave Chinner
2010-10-21  0:49 ` [PATCH 13/21] fs: Introduce per-bucket inode hash locks Dave Chinner
2010-10-21  0:49 ` [PATCH 14/21] fs: add a per-superblock lock for the inode list Dave Chinner
2010-10-21  0:49 ` [PATCH 15/21] fs: split locking of inode writeback and LRU lists Dave Chinner
2010-10-21  0:49 ` [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock Dave Chinner
2010-10-22  1:56   ` Al Viro
2010-10-22  2:26     ` Nick Piggin
2010-10-22  3:14     ` Dave Chinner
2010-10-22 10:37       ` Al Viro
2010-10-22 11:40         ` Christoph Hellwig
2010-10-23 21:40           ` Al Viro
2010-10-23 21:37         ` Al Viro
2010-10-24 14:13           ` Christoph Hellwig
2010-10-24 16:21             ` Christoph Hellwig
2010-10-24 19:17               ` Al Viro
2010-10-24 20:04                 ` Christoph Hellwig
2010-10-24 20:36                   ` Al Viro
2010-10-24  2:18       ` Nick Piggin
2010-10-21  0:49 ` [PATCH 17/21] fs: protect wake_up_inode with inode->i_lock Dave Chinner
2010-10-21  2:17   ` Christoph Hellwig
2010-10-21 13:16     ` Nick Piggin
2010-10-21  0:49 ` [PATCH 18/21] fs: introduce a per-cpu last_ino allocator Dave Chinner
2010-10-21  0:49 ` [PATCH 19/21] fs: icache remove inode_lock Dave Chinner
2010-10-21  2:14   ` Christian Stroetmann
2010-10-21  0:49 ` [PATCH 20/21] fs: Reduce inode I_FREEING and factor inode disposal Dave Chinner
2010-10-21  0:49 ` [PATCH 21/21] fs: do not assign default i_ino in new_inode Dave Chinner
2010-10-21  5:04 ` Inode Lock Scalability V7 (was V6) Dave Chinner
2010-10-21 13:20   ` Nick Piggin
2010-10-21 23:52     ` Dave Chinner
2010-10-22  0:45       ` Nick Piggin
2010-10-22  2:20         ` Al Viro
2010-10-22  2:34           ` Nick Piggin
2010-10-22  2:41             ` Nick Piggin
2010-10-22  2:48               ` Nick Piggin
2010-10-22  3:12                 ` Al Viro
2010-10-22  4:48                   ` Nick Piggin
2010-10-22  3:07             ` Al Viro
2010-10-22  4:46               ` Nick Piggin
2010-10-22  5:01                 ` Nick Piggin
