* [PATCH 0/17] fs: Inode cache scalability
@ 2010-09-29 12:18 Dave Chinner
2010-09-29 12:18 ` [PATCH 01/17] kernel: add bl_list Dave Chinner
` (19 more replies)
0 siblings, 20 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC
To: linux-fsdevel; +Cc: linux-kernel
This patch set is derived from Nick Piggin's VFS scalability tree.
There doesn't appear to be any push to get that tree into shape for
.37, so this is an attempt to start the process of finer grained
review of the series for upstream inclusion. I'm hitting VFS lock
contention problems with XFS on 8-16p machines now, so I need to get
this stuff moving.
This patch set is just the basic inode_lock breakup patches plus a
few more simple changes to the inode code. It stops short of
introducing RCU inode freeing because those changes are not
completely baked yet. It also stops short of changing the way inodes
are tracked for writeback because I'd like not to spend my week
after -rc1 is released fixing writeback again....
As a result, the full inode handling improvements of Nick's patch
set are not realised with this short series. However, my own testing
indicates that the amount of lock traffic and contention is down by
an order of magnitude on an 8-way box for parallel inode create and
unlink workloads, so there are still significant improvements in
scalability from just this patch set.
I've only ported the patches so far, without changing anything
significant other than the commit descriptions. One thing that has
stood out as I've done this is that the ordering of the patches is
not ideal, and some things (like the inode counters) are modified
multiple times through the patch set. I'm quite happy to
reorder/rework the series to fix these problems if that is desired.
Basically I'm trying to get this patchset ready for .37 (merge
window is not really that far off now), and I'm aiming to have the
rest of the inode changes (RCU freeing, writeback, etc) ready for
.38. I may even look to some of the dcache changes for .38 depending
on how much I can get tested and reviewed in that time frame.
Comments are welcome.
The current patchset is also available at the following location.
The following changes since commit b30a3f6257ed2105259b404d419b4964e363928c:
Linux 2.6.36-rc5 (2010-09-20 16:56:53 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git inode-scale
Eric Dumazet (2):
fs: inode per-cpu last_ino allocator
fs: Convert nr_inodes to a per-cpu counter
Nick Piggin (15):
kernel: add bl_list
fs: icache lock s_inodes list
fs: icache lock inode hash
fs: icache lock i_state
fs: icache lock i_count
fs: icache lock lru/writeback lists
fs: icache atomic inodes_stat
fs: icache protect inode state
fs: Make last_ino, iunique independent of inode_lock
fs: icache remove inode_lock
fs: Factor inode hash operations into functions
fs: Introduce per-bucket inode hash locks
fs: Implement lazy LRU updates for inodes.
fs: Inode counters do not need to be atomic.
fs: Clean up inode reference counting
Documentation/filesystems/Locking | 2 +-
Documentation/filesystems/porting | 10 +-
Documentation/filesystems/vfs.txt | 2 +-
arch/powerpc/platforms/cell/spufs/file.c | 2 +-
drivers/staging/pohmelfs/inode.c | 14 +-
fs/9p/vfs_inode.c | 4 +-
fs/affs/inode.c | 2 +-
fs/afs/dir.c | 4 +-
fs/anon_inodes.c | 2 +-
fs/bfs/dir.c | 2 +-
fs/block_dev.c | 7 +-
fs/btrfs/inode.c | 23 +-
fs/buffer.c | 2 +-
fs/ceph/mds_client.c | 2 +-
fs/cifs/inode.c | 2 +-
fs/coda/dir.c | 2 +-
fs/drop_caches.c | 19 +-
fs/exofs/inode.c | 10 +-
fs/exofs/namei.c | 2 +-
fs/ext2/namei.c | 2 +-
fs/ext3/ialloc.c | 4 +-
fs/ext3/namei.c | 2 +-
fs/ext4/ialloc.c | 4 +-
fs/ext4/namei.c | 2 +-
fs/fs-writeback.c | 156 +++++---
fs/gfs2/ops_inode.c | 2 +-
fs/hfs/hfs_fs.h | 2 +-
fs/hfs/inode.c | 2 +-
fs/hfsplus/dir.c | 2 +-
fs/hfsplus/hfsplus_fs.h | 2 +-
fs/hfsplus/inode.c | 2 +-
fs/hpfs/inode.c | 2 +-
fs/inode.c | 603 ++++++++++++++++++++----------
fs/jffs2/dir.c | 4 +-
fs/jfs/jfs_txnmgr.c | 2 +-
fs/jfs/namei.c | 2 +-
fs/libfs.c | 2 +-
fs/locks.c | 2 +-
fs/logfs/dir.c | 2 +-
fs/logfs/inode.c | 2 +-
fs/logfs/readwrite.c | 6 +-
fs/minix/namei.c | 2 +-
fs/namei.c | 2 +-
fs/nfs/dir.c | 2 +-
fs/nfs/getroot.c | 4 +-
fs/nfs/inode.c | 4 +-
fs/nfs/nfs4state.c | 2 +-
fs/nfs/write.c | 2 +-
fs/nilfs2/gcdat.c | 1 +
fs/nilfs2/gcinode.c | 22 +-
fs/nilfs2/mdt.c | 2 +-
fs/nilfs2/namei.c | 2 +-
fs/nilfs2/segment.c | 2 +-
fs/nilfs2/the_nilfs.h | 2 +-
fs/notify/inode_mark.c | 46 ++-
fs/notify/mark.c | 1 -
fs/notify/vfsmount_mark.c | 1 -
fs/ntfs/inode.c | 4 +-
fs/ntfs/super.c | 2 +-
fs/ocfs2/inode.c | 2 +-
fs/ocfs2/namei.c | 2 +-
fs/quota/dquot.c | 36 +-
fs/reiserfs/namei.c | 2 +-
fs/reiserfs/stree.c | 2 +-
fs/reiserfs/xattr.c | 2 +-
fs/sysv/namei.c | 2 +-
fs/ubifs/dir.c | 2 +-
fs/ubifs/super.c | 2 +-
fs/udf/namei.c | 2 +-
fs/ufs/namei.c | 2 +-
fs/xfs/linux-2.6/xfs_iops.c | 2 +-
fs/xfs/linux-2.6/xfs_trace.h | 2 +-
fs/xfs/xfs_inode.h | 4 +-
include/linux/fs.h | 54 ++-
include/linux/list_bl.h | 127 +++++++
include/linux/rculist_bl.h | 128 +++++++
include/linux/writeback.h | 4 +-
ipc/mqueue.c | 2 +-
kernel/futex.c | 2 +-
kernel/sysctl.c | 4 +-
mm/backing-dev.c | 8 +-
mm/filemap.c | 6 +-
mm/rmap.c | 6 +-
mm/shmem.c | 6 +-
net/socket.c | 2 +-
85 files changed, 1001 insertions(+), 437 deletions(-)
create mode 100644 include/linux/list_bl.h
create mode 100644 include/linux/rculist_bl.h
* [PATCH 01/17] kernel: add bl_list
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-09-30 4:52 ` Andrew Morton
2010-10-01 5:48 ` Christoph Hellwig
2010-09-29 12:18 ` [PATCH 02/17] fs: icache lock s_inodes list Dave Chinner
` (18 subsequent siblings)
19 siblings, 2 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Introduce a type of hlist that can support the use of the lowest bit
in the hlist_head pointer. This will subsequently be used to
implement per-bucket bit spinlocks for the inode hash.
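To illustrate how the new primitives fit together (this sketch is not
part of the patch; the foo structure and table are made up), a user
locks a bucket by taking bit 0 of the head pointer with
bit_spin_lock() and manipulates the chain while that bit is held:

/* Hypothetical usage example only -- not part of this patch. */
#include <linux/list_bl.h>
#include <linux/bit_spinlock.h>

#define FOO_HASH_SIZE	64

struct foo {
	unsigned long		key;
	struct hlist_bl_node	hash;
};

static struct hlist_bl_head foo_table[FOO_HASH_SIZE];

static void foo_insert(struct foo *f)
{
	struct hlist_bl_head *b = &foo_table[f->key & (FOO_HASH_SIZE - 1)];

	/* bit 0 of b->first is the per-bucket lock */
	bit_spin_lock(0, (unsigned long *)&b->first);
	hlist_bl_add_head(&f->hash, b);
	bit_spin_unlock(0, (unsigned long *)&b->first);
}

static struct foo *foo_find(unsigned long key)
{
	struct hlist_bl_head *b = &foo_table[key & (FOO_HASH_SIZE - 1)];
	struct hlist_bl_node *pos;
	struct foo *f = NULL;

	bit_spin_lock(0, (unsigned long *)&b->first);
	hlist_bl_for_each_entry(f, pos, b, hash) {
		if (f->key == key)
			break;
	}
	bit_spin_unlock(0, (unsigned long *)&b->first);
	return pos ? f : NULL;	/* pos is NULL if we walked off the end */
}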
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
include/linux/list_bl.h | 127 +++++++++++++++++++++++++++++++++++++++++++
include/linux/rculist_bl.h | 128 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 255 insertions(+), 0 deletions(-)
create mode 100644 include/linux/list_bl.h
create mode 100644 include/linux/rculist_bl.h
diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
new file mode 100644
index 0000000..cf8acfc
--- /dev/null
+++ b/include/linux/list_bl.h
@@ -0,0 +1,127 @@
+#ifndef _LINUX_LIST_BL_H
+#define _LINUX_LIST_BL_H
+
+#include <linux/list.h>
+#include <linux/bit_spinlock.h>
+
+/*
+ * Special version of lists, where head of the list has a bit spinlock
+ * in the lowest bit. This is useful for scalable hash tables without
+ * increasing memory footprint overhead.
+ *
+ * For modification operations, the 0 bit of hlist_bl_head->first
+ * pointer must be set.
+ */
+
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define LIST_BL_LOCKMASK 1UL
+#else
+#define LIST_BL_LOCKMASK 0UL
+#endif
+
+#ifdef CONFIG_DEBUG_LIST
+#define LIST_BL_BUG_ON(x) BUG_ON(x)
+#else
+#define LIST_BL_BUG_ON(x)
+#endif
+
+
+struct hlist_bl_head {
+ struct hlist_bl_node *first;
+};
+
+struct hlist_bl_node {
+ struct hlist_bl_node *next, **pprev;
+};
+#define INIT_HLIST_BL_HEAD(ptr) \
+ ((ptr)->first = NULL)
+
+static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
+{
+ h->next = NULL;
+ h->pprev = NULL;
+}
+
+#define hlist_bl_entry(ptr, type, member) container_of(ptr,type,member)
+
+static inline int hlist_bl_unhashed(const struct hlist_bl_node *h)
+{
+ return !h->pprev;
+}
+
+static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)
+ ((unsigned long)h->first & ~LIST_BL_LOCKMASK);
+}
+
+static inline void hlist_bl_set_first(struct hlist_bl_head *h,
+ struct hlist_bl_node *n)
+{
+ LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+ LIST_BL_BUG_ON(!bit_spin_is_locked(0, (unsigned long *)&h->first));
+ h->first = (struct hlist_bl_node *)((unsigned long)n | LIST_BL_LOCKMASK);
+}
+
+static inline int hlist_bl_empty(const struct hlist_bl_head *h)
+{
+ return !((unsigned long)h->first & ~LIST_BL_LOCKMASK);
+}
+
+static inline void hlist_bl_add_head(struct hlist_bl_node *n,
+ struct hlist_bl_head *h)
+{
+ struct hlist_bl_node *first = hlist_bl_first(h);
+
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ n->pprev = &h->first;
+ hlist_bl_set_first(h, n);
+}
+
+static inline void __hlist_bl_del(struct hlist_bl_node *n)
+{
+ struct hlist_bl_node *next = n->next;
+ struct hlist_bl_node **pprev = n->pprev;
+
+ LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+
+ /* pprev may be `first`, so be careful not to lose the lock bit */
+ *pprev = (struct hlist_bl_node *)
+ ((unsigned long)next |
+ ((unsigned long)*pprev & LIST_BL_LOCKMASK));
+ if (next)
+ next->pprev = pprev;
+}
+
+static inline void hlist_bl_del(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->next = LIST_POISON1;
+ n->pprev = LIST_POISON2;
+}
+
+static inline void hlist_bl_del_init(struct hlist_bl_node *n)
+{
+ if (!hlist_bl_unhashed(n)) {
+ __hlist_bl_del(n);
+ INIT_HLIST_BL_NODE(n);
+ }
+}
+
+/**
+ * hlist_bl_for_each_entry - iterate over list of given type
+ * @tpos: the type * to use as a loop cursor.
+ * @pos: the &struct hlist_node to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the hlist_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry(tpos, pos, head, member) \
+ for (pos = hlist_bl_first(head); \
+ pos && \
+ ({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1;}); \
+ pos = pos->next)
+
+#endif
diff --git a/include/linux/rculist_bl.h b/include/linux/rculist_bl.h
new file mode 100644
index 0000000..cdfb54e
--- /dev/null
+++ b/include/linux/rculist_bl.h
@@ -0,0 +1,128 @@
+#ifndef _LINUX_RCULIST_BL_H
+#define _LINUX_RCULIST_BL_H
+
+/*
+ * RCU-protected bl list version. See include/linux/list_bl.h.
+ */
+#include <linux/list_bl.h>
+#include <linux/rcupdate.h>
+#include <linux/bit_spinlock.h>
+
+static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h,
+ struct hlist_bl_node *n)
+{
+ LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+ LIST_BL_BUG_ON(!bit_spin_is_locked(0, (unsigned long *)&h->first));
+ rcu_assign_pointer(h->first,
+ (struct hlist_bl_node *)((unsigned long)n | LIST_BL_LOCKMASK));
+}
+
+static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)
+ ((unsigned long)rcu_dereference(h->first) & ~LIST_BL_LOCKMASK);
+}
+
+/**
+ * hlist_bl_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on the node returns true after this. It is
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so list_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_bl_add_head_rcu() or
+ * hlist_bl_del_rcu(), running on this same list. However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_bl_for_each_entry_rcu().
+ */
+static inline void hlist_bl_del_init_rcu(struct hlist_bl_node *n)
+{
+ if (!hlist_bl_unhashed(n)) {
+ __hlist_bl_del(n);
+ n->pprev = NULL;
+ }
+}
+
+/**
+ * hlist_bl_del_rcu - deletes entry from hash list without re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on entry does not return true after this,
+ * the entry is in an undefined state. It is useful for RCU based
+ * lockfree traversal.
+ *
+ * In particular, it means that we can not poison the forward
+ * pointers that may still be used for walking the hash list.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry().
+ */
+static inline void hlist_bl_del_rcu(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->pprev = LIST_POISON2;
+}
+
+/**
+ * hlist_bl_add_head_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist_bl,
+ * while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs. Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_bl_add_head_rcu(struct hlist_bl_node *n,
+ struct hlist_bl_head *h)
+{
+ struct hlist_bl_node *first;
+
+ /* don't need hlist_bl_first_rcu because we're under lock */
+ first = hlist_bl_first(h);
+
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ n->pprev = &h->first;
+
+ /* need _rcu because we can have concurrent lock free readers */
+ hlist_bl_set_first_rcu(h, n);
+}
+/**
+ * hlist_bl_for_each_entry_rcu - iterate over rcu list of given type
+ * @tpos: the type * to use as a loop cursor.
+ * @pos: the &struct hlist_bl_node to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the hlist_bl_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry_rcu(tpos, pos, head, member) \
+ for (pos = hlist_bl_first_rcu(head); \
+ pos && \
+ ({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
+ pos = rcu_dereference_raw(pos->next))
+
+#endif
--
1.7.1
* [PATCH 02/17] fs: icache lock s_inodes list
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
2010-09-29 12:18 ` [PATCH 01/17] kernel: add bl_list Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-10-01 5:49 ` Christoph Hellwig
2010-09-29 12:18 ` [PATCH 03/17] fs: icache lock inode hash Dave Chinner
` (17 subsequent siblings)
19 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
To allow removal of the inode_lock, we first need to protect the
superblock inode list with its own lock instead of using the
inode_lock for this purpose. Nest the new sb_inode_list_lock inside
the inode_lock around the list operations it needs to protect.
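Schematically, every site that walks or modifies sb->s_inodes now
follows this nesting (illustrative sketch only, not code taken from
the patch):

static void example_walk_sb_inodes(struct super_block *sb)
{
	struct inode *inode;

	spin_lock(&inode_lock);
	spin_lock(&sb_inode_list_lock);		/* nests inside inode_lock */
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		/*
		 * Non-sleeping per-inode work only; to sleep, __iget()
		 * the inode and drop both locks first, as the hunks
		 * below do.
		 */
	}
	spin_unlock(&sb_inode_list_lock);
	spin_unlock(&inode_lock);
}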
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/drop_caches.c | 4 ++++
fs/fs-writeback.c | 4 ++++
fs/inode.c | 19 +++++++++++++++++++
fs/notify/inode_mark.c | 2 ++
fs/quota/dquot.c | 6 ++++++
include/linux/writeback.h | 1 +
6 files changed, 36 insertions(+), 0 deletions(-)
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 2195c21..ab69ae7 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,18 +17,22 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
struct inode *inode, *toput_inode = NULL;
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
continue;
if (inode->i_mapping->nrpages == 0)
continue;
__iget(inode);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
iput(toput_inode);
toput_inode = inode;
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
}
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
iput(toput_inode);
}
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 81e086d..9adc9d9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1021,6 +1021,7 @@ static void wait_sb_inodes(struct super_block *sb)
WARN_ON(!rwsem_is_locked(&sb->s_umount));
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
/*
* Data integrity sync. Must wait for all pages under writeback,
@@ -1038,6 +1039,7 @@ static void wait_sb_inodes(struct super_block *sb)
if (mapping->nrpages == 0)
continue;
__iget(inode);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
/*
* We hold a reference to 'inode' so it couldn't have
@@ -1055,7 +1057,9 @@ static void wait_sb_inodes(struct super_block *sb)
cond_resched();
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
}
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
iput(old_inode);
}
diff --git a/fs/inode.c b/fs/inode.c
index 8646433..ca98254 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -26,6 +26,15 @@
#include <linux/posix_acl.h>
/*
+ * Usage:
+ * sb_inode_list_lock protects:
+ * s_inodes, i_sb_list
+ *
+ * Ordering:
+ * inode_lock
+ * sb_inode_list_lock
+ */
+/*
* This is needed for the following functions:
* - inode_has_buffers
* - invalidate_inode_buffers
@@ -83,6 +92,7 @@ static struct hlist_head *inode_hashtable __read_mostly;
* the i_state of an inode while it is in use..
*/
DEFINE_SPINLOCK(inode_lock);
+DEFINE_SPINLOCK(sb_inode_list_lock);
/*
* iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -339,7 +349,9 @@ static void dispose_list(struct list_head *head)
spin_lock(&inode_lock);
hlist_del_init(&inode->i_hash);
+ spin_lock(&sb_inode_list_lock);
list_del_init(&inode->i_sb_list);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
wake_up_inode(inode);
@@ -371,6 +383,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
* shrink_icache_memory() away.
*/
cond_resched_lock(&inode_lock);
+ cond_resched_lock(&sb_inode_list_lock);
next = next->next;
if (tmp == head)
@@ -408,8 +421,10 @@ int invalidate_inodes(struct super_block *sb)
down_write(&iprune_sem);
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
fsnotify_unmount_inodes(&sb->s_inodes);
busy = invalidate_list(&sb->s_inodes, &throw_away);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
dispose_list(&throw_away);
@@ -597,7 +612,9 @@ __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
{
inodes_stat.nr_inodes++;
list_add(&inode->i_list, &inode_in_use);
+ spin_lock(&sb_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
+ spin_unlock(&sb_inode_list_lock);
if (head)
hlist_add_head(&inode->i_hash, head);
}
@@ -1231,7 +1248,9 @@ static void iput_final(struct inode *inode)
hlist_del_init(&inode->i_hash);
}
list_del_init(&inode->i_list);
+ spin_lock(&sb_inode_list_lock);
list_del_init(&inode->i_sb_list);
+ spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
inodes_stat.nr_inodes--;
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 33297c0..34b1585 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -283,6 +283,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
* will be added since the umount has begun. Finally,
* iprune_mutex keeps shrink_icache_memory() away.
*/
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
if (need_iput_tmp)
@@ -296,5 +297,6 @@ void fsnotify_unmount_inodes(struct list_head *list)
iput(inode);
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
}
}
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index aad1316..2e3b913 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -897,6 +897,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
#endif
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
continue;
@@ -910,6 +911,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
continue;
__iget(inode);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
iput(old_inode);
@@ -921,7 +923,9 @@ static void add_dquot_ref(struct super_block *sb, int type)
* keep the reference and iput it later. */
old_inode = inode;
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
}
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
iput(old_inode);
@@ -1004,6 +1008,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
int reserved = 0;
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
/*
* We have to scan also I_NEW inodes because they can already
@@ -1017,6 +1022,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
remove_inode_dquot_ref(inode, type, tofree_head);
}
}
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
#ifdef CONFIG_QUOTA_DEBUG
if (reserved) {
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 72a5d64..9974edb 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -10,6 +10,7 @@
struct backing_dev_info;
extern spinlock_t inode_lock;
+extern spinlock_t sb_inode_list_lock;
extern struct list_head inode_in_use;
extern struct list_head inode_unused;
--
1.7.1
* [PATCH 03/17] fs: icache lock inode hash
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
2010-09-29 12:18 ` [PATCH 01/17] kernel: add bl_list Dave Chinner
2010-09-29 12:18 ` [PATCH 02/17] fs: icache lock s_inodes list Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-09-30 4:52 ` Andrew Morton
2010-10-01 6:06 ` Christoph Hellwig
2010-09-29 12:18 ` [PATCH 04/17] fs: icache lock i_state Dave Chinner
` (16 subsequent siblings)
19 siblings, 2 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Currently the inode hash lists are protected by the inode_lock. To
allow removal of the inode_lock, we need to protect the inode hash
table lists with a new lock. Nest the new inode_hash_lock inside the
inode_lock to protect the hash lists.
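The resulting pattern for hash insertion and removal is, schematically
(illustrative sketch only, not code taken from the patch):

static void example_hash_inode(struct inode *inode, struct hlist_head *head)
{
	spin_lock(&inode_lock);
	spin_lock(&inode_hash_lock);	/* nests inside inode_lock */
	hlist_add_head(&inode->i_hash, head);
	spin_unlock(&inode_hash_lock);
	spin_unlock(&inode_lock);
}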
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 33 ++++++++++++++++++++++++++++++++-
include/linux/writeback.h | 1 +
2 files changed, 33 insertions(+), 1 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index ca98254..9d7ffb1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -29,10 +29,14 @@
* Usage:
* sb_inode_list_lock protects:
* s_inodes, i_sb_list
+ * inode_hash_lock protects:
+ * inode hash table, i_hash
*
* Ordering:
* inode_lock
* sb_inode_list_lock
+ * inode_lock
+ * inode_hash_lock
*/
/*
* This is needed for the following functions:
@@ -93,6 +97,7 @@ static struct hlist_head *inode_hashtable __read_mostly;
*/
DEFINE_SPINLOCK(inode_lock);
DEFINE_SPINLOCK(sb_inode_list_lock);
+DEFINE_SPINLOCK(inode_hash_lock);
/*
* iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -348,7 +353,9 @@ static void dispose_list(struct list_head *head)
evict(inode);
spin_lock(&inode_lock);
+ spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
+ spin_unlock(&inode_hash_lock);
spin_lock(&sb_inode_list_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
@@ -557,17 +564,20 @@ static struct inode *find_inode(struct super_block *sb,
struct inode *inode = NULL;
repeat:
+ spin_lock(&inode_hash_lock);
hlist_for_each_entry(inode, node, head, i_hash) {
if (inode->i_sb != sb)
continue;
if (!test(inode, data))
continue;
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+ spin_unlock(&inode_hash_lock);
__wait_on_freeing_inode(inode);
goto repeat;
}
break;
}
+ spin_unlock(&inode_hash_lock);
return node ? inode : NULL;
}
@@ -582,17 +592,20 @@ static struct inode *find_inode_fast(struct super_block *sb,
struct inode *inode = NULL;
repeat:
+ spin_lock(&inode_hash_lock);
hlist_for_each_entry(inode, node, head, i_hash) {
if (inode->i_ino != ino)
continue;
if (inode->i_sb != sb)
continue;
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+ spin_unlock(&inode_hash_lock);
__wait_on_freeing_inode(inode);
goto repeat;
}
break;
}
+ spin_unlock(&inode_hash_lock);
return node ? inode : NULL;
}
@@ -615,8 +628,11 @@ __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
spin_lock(&sb_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
- if (head)
+ if (head) {
+ spin_lock(&inode_hash_lock);
hlist_add_head(&inode->i_hash, head);
+ spin_unlock(&inode_hash_lock);
+ }
}
/**
@@ -1094,7 +1110,9 @@ int insert_inode_locked(struct inode *inode)
while (1) {
struct hlist_node *node;
struct inode *old = NULL;
+
spin_lock(&inode_lock);
+ spin_lock(&inode_hash_lock);
hlist_for_each_entry(old, node, head, i_hash) {
if (old->i_ino != ino)
continue;
@@ -1106,9 +1124,11 @@ int insert_inode_locked(struct inode *inode)
}
if (likely(!node)) {
hlist_add_head(&inode->i_hash, head);
+ spin_unlock(&inode_hash_lock);
spin_unlock(&inode_lock);
return 0;
}
+ spin_unlock(&inode_hash_lock);
__iget(old);
spin_unlock(&inode_lock);
wait_on_inode(old);
@@ -1134,6 +1154,7 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
struct inode *old = NULL;
spin_lock(&inode_lock);
+ spin_lock(&inode_hash_lock);
hlist_for_each_entry(old, node, head, i_hash) {
if (old->i_sb != sb)
continue;
@@ -1145,9 +1166,11 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
}
if (likely(!node)) {
hlist_add_head(&inode->i_hash, head);
+ spin_unlock(&inode_hash_lock);
spin_unlock(&inode_lock);
return 0;
}
+ spin_unlock(&inode_hash_lock);
__iget(old);
spin_unlock(&inode_lock);
wait_on_inode(old);
@@ -1172,7 +1195,9 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
spin_lock(&inode_lock);
+ spin_lock(&inode_hash_lock);
hlist_add_head(&inode->i_hash, head);
+ spin_unlock(&inode_hash_lock);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(__insert_inode_hash);
@@ -1186,7 +1211,9 @@ EXPORT_SYMBOL(__insert_inode_hash);
void remove_inode_hash(struct inode *inode)
{
spin_lock(&inode_lock);
+ spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
+ spin_unlock(&inode_hash_lock);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(remove_inode_hash);
@@ -1245,7 +1272,9 @@ static void iput_final(struct inode *inode)
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
inodes_stat.nr_unused--;
+ spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
+ spin_unlock(&inode_hash_lock);
}
list_del_init(&inode->i_list);
spin_lock(&sb_inode_list_lock);
@@ -1257,7 +1286,9 @@ static void iput_final(struct inode *inode)
spin_unlock(&inode_lock);
evict(inode);
spin_lock(&inode_lock);
+ spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
+ spin_unlock(&inode_hash_lock);
spin_unlock(&inode_lock);
wake_up_inode(inode);
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 9974edb..35d6e81 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -11,6 +11,7 @@ struct backing_dev_info;
extern spinlock_t inode_lock;
extern spinlock_t sb_inode_list_lock;
+extern spinlock_t inode_hash_lock;
extern struct list_head inode_in_use;
extern struct list_head inode_unused;
--
1.7.1
* [PATCH 04/17] fs: icache lock i_state
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (2 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 03/17] fs: icache lock inode hash Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-10-01 5:54 ` Christoph Hellwig
2010-09-29 12:18 ` [PATCH 05/17] fs: icache lock i_count Dave Chinner
` (15 subsequent siblings)
19 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
We currently protect the per-inode state flags with the inode_lock.
Using a global lock to protect per-object state is overkill when we
could use a per-inode lock to protect the state. Use the
inode->i_lock for this, and wrap all the state changes and checks
with the inode->i_lock.
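The canonical pattern after this change is to take i_lock (nested
inside the other icache locks), test i_state, and only then act on
the inode, for example (illustrative sketch only, not code taken from
the patch):

static int example_start_freeing_inode(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
		/* someone else controls this inode's fate; skip it */
		spin_unlock(&inode->i_lock);
		return -EBUSY;
	}
	inode->i_state |= I_FREEING;	/* state changes happen under i_lock */
	spin_unlock(&inode->i_lock);
	return 0;
}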
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/drop_caches.c | 9 ++++--
fs/fs-writeback.c | 29 +++++++++++++++--
fs/inode.c | 86 +++++++++++++++++++++++++++++++++++++++++++++-------
fs/nilfs2/gcdat.c | 1 +
fs/quota/dquot.c | 12 ++++---
5 files changed, 113 insertions(+), 24 deletions(-)
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index ab69ae7..45bdf88 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -19,11 +19,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
- continue;
- if (inode->i_mapping->nrpages == 0)
+ spin_lock(&inode->i_lock);
+ if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)
+ || inode->i_mapping->nrpages == 0) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
__iget(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9adc9d9..7bd1aef 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -280,10 +280,12 @@ static void inode_wait_for_writeback(struct inode *inode)
wait_queue_head_t *wqh;
wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
- while (inode->i_state & I_SYNC) {
+ while (inode->i_state & I_SYNC) {
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
}
}
@@ -337,6 +339,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
/* Set I_SYNC, reset I_DIRTY_PAGES */
inode->i_state |= I_SYNC;
inode->i_state &= ~I_DIRTY_PAGES;
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
ret = do_writepages(mapping, wbc);
@@ -358,8 +361,10 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* write_inode()
*/
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
dirty = inode->i_state & I_DIRTY;
inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -369,6 +374,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
}
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
inode->i_state &= ~I_SYNC;
if (!(inode->i_state & I_FREEING)) {
if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
@@ -479,7 +485,9 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
return 0;
}
+ spin_lock(&inode->i_lock);
if (inode->i_state & (I_NEW | I_WILL_FREE)) {
+ spin_unlock(&inode->i_lock);
requeue_io(inode);
continue;
}
@@ -487,8 +495,10 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
* Was this inode dirtied after sync_sb_inodes was called?
* This keeps sync from extra jobs and livelock.
*/
- if (inode_dirtied_after(inode, wbc->wb_start))
+ if (inode_dirtied_after(inode, wbc->wb_start)) {
+ spin_unlock(&inode->i_lock);
return 1;
+ }
BUG_ON(inode->i_state & I_FREEING);
__iget(inode);
@@ -501,6 +511,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
*/
redirty_tail(inode);
}
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
iput(inode);
cond_resched();
@@ -936,6 +947,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
block_dump___mark_inode_dirty(inode);
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;
@@ -986,6 +998,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
}
}
out:
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
if (wakeup_bdi)
@@ -1033,12 +1046,16 @@ static void wait_sb_inodes(struct super_block *sb)
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
struct address_space *mapping;
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
- continue;
mapping = inode->i_mapping;
if (mapping->nrpages == 0)
continue;
+ spin_lock(&inode->i_lock);
+ if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+ spin_unlock(&inode->i_lock);
+ continue;
+ }
__iget(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
/*
@@ -1165,7 +1182,9 @@ int write_inode_now(struct inode *inode, int sync)
might_sleep();
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
ret = writeback_single_inode(inode, &wbc);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
if (sync)
inode_sync_wait(inode);
@@ -1189,7 +1208,9 @@ int sync_inode(struct inode *inode, struct writeback_control *wbc)
int ret;
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
ret = writeback_single_inode(inode, wbc);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
return ret;
}
diff --git a/fs/inode.c b/fs/inode.c
index 9d7ffb1..906a4ad 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -31,10 +31,13 @@
* s_inodes, i_sb_list
* inode_hash_lock protects:
* inode hash table, i_hash
+ * inode->i_lock protects:
+ * i_state
*
* Ordering:
* inode_lock
* sb_inode_list_lock
+ * inode->i_lock
* inode_lock
* inode_hash_lock
*/
@@ -296,6 +299,8 @@ static void init_once(void *foo)
*/
void __iget(struct inode *inode)
{
+ assert_spin_locked(&inode->i_lock);
+
if (atomic_inc_return(&inode->i_count) != 1)
return;
@@ -396,16 +401,21 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
if (tmp == head)
break;
inode = list_entry(tmp, struct inode, i_sb_list);
- if (inode->i_state & I_NEW)
+ spin_lock(&inode->i_lock);
+ if (inode->i_state & I_NEW) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
invalidate_inode_buffers(inode);
if (!atomic_read(&inode->i_count)) {
list_move(&inode->i_list, dispose);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
+ spin_unlock(&inode->i_lock);
count++;
continue;
}
+ spin_unlock(&inode->i_lock);
busy = 1;
}
/* only unused inodes may be cached with i_count zero */
@@ -484,12 +494,15 @@ static void prune_icache(int nr_to_scan)
inode = list_entry(inode_unused.prev, struct inode, i_list);
+ spin_lock(&inode->i_lock);
if (inode->i_state || atomic_read(&inode->i_count)) {
list_move(&inode->i_list, &inode_unused);
+ spin_unlock(&inode->i_lock);
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
__iget(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
if (remove_inode_buffers(inode))
reap += invalidate_mapping_pages(&inode->i_data,
@@ -500,12 +513,16 @@ static void prune_icache(int nr_to_scan)
if (inode != list_entry(inode_unused.next,
struct inode, i_list))
continue; /* wrong inode or list_empty */
- if (!can_unuse(inode))
+ spin_lock(&inode->i_lock);
+ if (!can_unuse(inode)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
}
list_move(&inode->i_list, &freeable);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
+ spin_unlock(&inode->i_lock);
nr_pruned++;
}
inodes_stat.nr_unused -= nr_pruned;
@@ -568,8 +585,14 @@ repeat:
hlist_for_each_entry(inode, node, head, i_hash) {
if (inode->i_sb != sb)
continue;
- if (!test(inode, data))
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&inode_hash_lock);
+ goto repeat;
+ }
+ if (!test(inode, data)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
spin_unlock(&inode_hash_lock);
__wait_on_freeing_inode(inode);
@@ -598,6 +621,10 @@ repeat:
continue;
if (inode->i_sb != sb)
continue;
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&inode_hash_lock);
+ goto repeat;
+ }
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
spin_unlock(&inode_hash_lock);
__wait_on_freeing_inode(inode);
@@ -624,10 +651,10 @@ __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
struct inode *inode)
{
inodes_stat.nr_inodes++;
- list_add(&inode->i_list, &inode_in_use);
spin_lock(&sb_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
+ list_add(&inode->i_list, &inode_in_use);
if (head) {
spin_lock(&inode_hash_lock);
hlist_add_head(&inode->i_hash, head);
@@ -684,9 +711,9 @@ struct inode *new_inode(struct super_block *sb)
inode = alloc_inode(sb);
if (inode) {
spin_lock(&inode_lock);
- __inode_add_to_lists(sb, NULL, inode);
inode->i_ino = ++last_ino;
inode->i_state = 0;
+ __inode_add_to_lists(sb, NULL, inode);
spin_unlock(&inode_lock);
}
return inode;
@@ -753,8 +780,8 @@ static struct inode *get_new_inode(struct super_block *sb,
if (set(inode, data))
goto set_failed;
- __inode_add_to_lists(sb, head, inode);
inode->i_state = I_NEW;
+ __inode_add_to_lists(sb, head, inode);
spin_unlock(&inode_lock);
/* Return the locked inode with I_NEW set, the
@@ -769,6 +796,7 @@ static struct inode *get_new_inode(struct super_block *sb,
* allocated.
*/
__iget(old);
+ spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
@@ -777,6 +805,7 @@ static struct inode *get_new_inode(struct super_block *sb,
return inode;
set_failed:
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
destroy_inode(inode);
return NULL;
@@ -800,8 +829,8 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
old = find_inode_fast(sb, head, ino);
if (!old) {
inode->i_ino = ino;
- __inode_add_to_lists(sb, head, inode);
inode->i_state = I_NEW;
+ __inode_add_to_lists(sb, head, inode);
spin_unlock(&inode_lock);
/* Return the locked inode with I_NEW set, the
@@ -816,6 +845,7 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
* allocated.
*/
__iget(old);
+ spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
@@ -857,6 +887,7 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
res = counter++;
head = inode_hashtable + hash(sb, res);
inode = find_inode_fast(sb, head, res);
+ spin_unlock(&inode->i_lock);
} while (inode != NULL);
spin_unlock(&inode_lock);
@@ -866,18 +897,24 @@ EXPORT_SYMBOL(iunique);
struct inode *igrab(struct inode *inode)
{
+ struct inode *ret = inode;
+
spin_lock(&inode_lock);
- if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
+ spin_lock(&inode->i_lock);
+ if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
__iget(inode);
- else
+ } else {
/*
* Handle the case where s_op->clear_inode is not been
* called yet, and somebody is calling igrab
* while the inode is getting freed.
*/
- inode = NULL;
+ ret = NULL;
+ }
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
- return inode;
+
+ return ret;
}
EXPORT_SYMBOL(igrab);
@@ -910,6 +947,7 @@ static struct inode *ifind(struct super_block *sb,
inode = find_inode(sb, head, test, data);
if (inode) {
__iget(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
if (likely(wait))
wait_on_inode(inode);
@@ -943,6 +981,7 @@ static struct inode *ifind_fast(struct super_block *sb,
inode = find_inode_fast(sb, head, ino);
if (inode) {
__iget(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
wait_on_inode(inode);
return inode;
@@ -1112,6 +1151,7 @@ int insert_inode_locked(struct inode *inode)
struct inode *old = NULL;
spin_lock(&inode_lock);
+repeat:
spin_lock(&inode_hash_lock);
hlist_for_each_entry(old, node, head, i_hash) {
if (old->i_ino != ino)
@@ -1120,6 +1160,10 @@ int insert_inode_locked(struct inode *inode)
continue;
if (old->i_state & (I_FREEING|I_WILL_FREE))
continue;
+ if (!spin_trylock(&old->i_lock)) {
+ spin_unlock(&inode_hash_lock);
+ goto repeat;
+ }
break;
}
if (likely(!node)) {
@@ -1130,6 +1174,7 @@ int insert_inode_locked(struct inode *inode)
}
spin_unlock(&inode_hash_lock);
__iget(old);
+ spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1154,6 +1199,7 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
struct inode *old = NULL;
spin_lock(&inode_lock);
+repeat:
spin_lock(&inode_hash_lock);
hlist_for_each_entry(old, node, head, i_hash) {
if (old->i_sb != sb)
@@ -1162,6 +1208,10 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
continue;
if (old->i_state & (I_FREEING|I_WILL_FREE))
continue;
+ if (!spin_trylock(&old->i_lock)) {
+ spin_unlock(&inode_hash_lock);
+ goto repeat;
+ }
break;
}
if (likely(!node)) {
@@ -1172,6 +1222,7 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
}
spin_unlock(&inode_hash_lock);
__iget(old);
+ spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1256,19 +1307,27 @@ static void iput_final(struct inode *inode)
else
drop = generic_drop_inode(inode);
+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
if (!drop) {
if (!(inode->i_state & (I_DIRTY|I_SYNC)))
list_move(&inode->i_list, &inode_unused);
inodes_stat.nr_unused++;
if (sb->s_flags & MS_ACTIVE) {
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
return;
}
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_WILL_FREE;
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
write_inode_now(inode, 1);
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
inodes_stat.nr_unused--;
@@ -1277,12 +1336,12 @@ static void iput_final(struct inode *inode)
spin_unlock(&inode_hash_lock);
}
list_del_init(&inode->i_list);
- spin_lock(&sb_inode_list_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
inodes_stat.nr_inodes--;
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
evict(inode);
spin_lock(&inode_lock);
@@ -1491,6 +1550,8 @@ EXPORT_SYMBOL(inode_wait);
* wake_up_inode() after removing from the hash list will DTRT.
*
* This is called with inode_lock held.
+ *
+ * Called with i_lock held and returns with it dropped.
*/
static void __wait_on_freeing_inode(struct inode *inode)
{
@@ -1498,6 +1559,7 @@ static void __wait_on_freeing_inode(struct inode *inode)
DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
wq = bit_waitqueue(&inode->i_state, __I_NEW);
prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
schedule();
finish_wait(wq, &wait.wait);
diff --git a/fs/nilfs2/gcdat.c b/fs/nilfs2/gcdat.c
index 84a45d1..c51f0e8 100644
--- a/fs/nilfs2/gcdat.c
+++ b/fs/nilfs2/gcdat.c
@@ -27,6 +27,7 @@
#include "page.h"
#include "mdt.h"
+/* XXX: what protects i_state? */
int nilfs_init_gcdat_inode(struct the_nilfs *nilfs)
{
struct inode *dat = nilfs->ns_dat, *gcdat = nilfs->ns_gc_dat;
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 2e3b913..15f66f1 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -899,18 +899,20 @@ static void add_dquot_ref(struct super_block *sb, int type)
spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+ spin_lock(&inode->i_lock);
+ if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
+ !atomic_read(&inode->i_writecount) ||
+ !dqinit_needed(inode, type))) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
#ifdef CONFIG_QUOTA_DEBUG
if (unlikely(inode_get_rsv_space(inode) > 0))
reserved = 1;
#endif
- if (!atomic_read(&inode->i_writecount))
- continue;
- if (!dqinit_needed(inode, type))
- continue;
__iget(inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
--
1.7.1
* [PATCH 05/17] fs: icache lock i_count
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (3 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 04/17] fs: icache lock i_state Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-09-30 4:52 ` Andrew Morton
2010-09-29 12:18 ` [PATCH 06/17] fs: icache lock lru/writeback lists Dave Chinner
` (14 subsequent siblings)
19 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
The inode reference count is currently an atomic variable so that it can be
sampled/modified outside the inode_lock. However, the inode_lock is still
needed to synchronise the final reference count and checks against the inode
state.
With the inode state now protected by a per-inode lock, we can protect the
inode reference count with the same per-inode lock and still check and modify
the count and state atomically. By using the i_lock in this manner, we no
longer need an atomic reference count field, so convert all the reference
counting to be protected by the inode->i_lock and change i_count to a
non-atomic variable.
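In practice this means every open-coded atomic_inc(&inode->i_count)
in the tree becomes a plain increment under inode->i_lock, along these
lines (illustrative sketch only, not code taken from the patch):

static void example_grab_inode(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	inode->i_count++;		/* i_count is now a plain int */
	spin_unlock(&inode->i_lock);
}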
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
arch/powerpc/platforms/cell/spufs/file.c | 2 +-
drivers/staging/pohmelfs/inode.c | 14 +++++++----
fs/9p/vfs_inode.c | 4 ++-
fs/affs/inode.c | 4 ++-
fs/afs/dir.c | 4 ++-
fs/anon_inodes.c | 4 ++-
fs/bfs/dir.c | 4 ++-
fs/block_dev.c | 15 ++++++++++--
fs/btrfs/inode.c | 23 +++++++++++++++----
fs/ceph/mds_client.c | 2 +-
fs/cifs/inode.c | 2 +-
fs/coda/dir.c | 4 ++-
fs/exofs/inode.c | 12 +++++++--
fs/exofs/namei.c | 4 ++-
fs/ext2/namei.c | 4 ++-
fs/ext3/ialloc.c | 4 +-
fs/ext3/namei.c | 4 ++-
fs/ext4/ialloc.c | 4 +-
fs/ext4/namei.c | 4 ++-
fs/fs-writeback.c | 4 +-
fs/gfs2/ops_inode.c | 4 ++-
fs/hfsplus/dir.c | 4 ++-
fs/hpfs/inode.c | 2 +-
fs/inode.c | 36 ++++++++++++++++++++---------
fs/jffs2/dir.c | 8 +++++-
fs/jfs/jfs_txnmgr.c | 4 ++-
fs/jfs/namei.c | 4 ++-
fs/libfs.c | 4 ++-
fs/locks.c | 2 +-
fs/logfs/dir.c | 4 ++-
fs/logfs/readwrite.c | 6 ++++-
fs/minix/namei.c | 4 ++-
fs/namei.c | 7 ++++-
fs/nfs/dir.c | 4 ++-
fs/nfs/getroot.c | 4 ++-
fs/nfs/inode.c | 4 +-
fs/nfs/nfs4state.c | 2 +-
fs/nfs/write.c | 2 +-
fs/nilfs2/mdt.c | 2 +-
fs/nilfs2/namei.c | 4 ++-
fs/notify/inode_mark.c | 21 ++++++++++------
fs/ntfs/super.c | 4 ++-
fs/ocfs2/namei.c | 4 ++-
fs/reiserfs/namei.c | 4 ++-
fs/reiserfs/stree.c | 2 +-
fs/sysv/namei.c | 4 ++-
fs/ubifs/dir.c | 4 ++-
fs/ubifs/super.c | 2 +-
fs/udf/namei.c | 4 ++-
fs/ufs/namei.c | 4 ++-
fs/xfs/linux-2.6/xfs_iops.c | 4 ++-
fs/xfs/linux-2.6/xfs_trace.h | 2 +-
fs/xfs/xfs_inode.h | 6 +++-
include/linux/fs.h | 2 +-
ipc/mqueue.c | 7 ++++-
kernel/futex.c | 4 ++-
mm/shmem.c | 4 ++-
net/socket.c | 4 ++-
58 files changed, 224 insertions(+), 95 deletions(-)
diff --git a/arch/powerpc/platforms/cell/spufs/file.c b/arch/powerpc/platforms/cell/spufs/file.c
index 1a40da9..79238ba 100644
--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -1549,7 +1549,7 @@ static int spufs_mfc_open(struct inode *inode, struct file *file)
if (ctx->owner != current->mm)
return -EINVAL;
- if (atomic_read(&inode->i_count) != 1)
+ if (inode->i_count != 1)
return -EBUSY;
mutex_lock(&ctx->mapping_lock);
diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 97dae29..2dc95dd 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -1289,13 +1289,15 @@ static void pohmelfs_put_super(struct super_block *sb)
dprintk("%s: ino: %llu, pi: %p, inode: %p, count: %u.\n",
__func__, pi->ino, pi, inode, count);
- if (atomic_read(&inode->i_count) != count) {
+ spin_lock(&inode->i_lock);
+ if (inode->i_count != count) {
printk("%s: ino: %llu, pi: %p, inode: %p, count: %u, i_count: %d.\n",
__func__, pi->ino, pi, inode, count,
- atomic_read(&inode->i_count));
- count = atomic_read(&inode->i_count);
+ inode->i_count);
+ count = inode->i_count;
in_drop_list++;
}
+ spin_unlock(&inode->i_lock);
while (count--)
iput(&pi->vfs_inode);
@@ -1305,7 +1307,7 @@ static void pohmelfs_put_super(struct super_block *sb)
pi = POHMELFS_I(inode);
dprintk("%s: ino: %llu, pi: %p, inode: %p, i_count: %u.\n",
- __func__, pi->ino, pi, inode, atomic_read(&inode->i_count));
+ __func__, pi->ino, pi, inode, inode->i_count);
/*
* These are special inodes, they were created during
@@ -1313,7 +1315,9 @@ static void pohmelfs_put_super(struct super_block *sb)
* so they live here with reference counter being 1 and prevent
* umount from succeed since it believes that they are busy.
*/
- count = atomic_read(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ count = inode->i_count;
+ spin_unlock(&inode->i_lock);
if (count) {
list_del_init(&inode->i_sb_list);
while (count--)
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 9e670d5..5e1d774 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1791,7 +1791,9 @@ v9fs_vfs_link_dotl(struct dentry *old_dentry, struct inode *dir,
/* Caching disabled. No need to get upto date stat info.
* This dentry will be released immediately. So, just i_count++
*/
- atomic_inc(&old_dentry->d_inode->i_count);
+ spin_lock(&old_dentry->d_inode->i_lock);
+ old_dentry->d_inode->i_count++;
+ spin_unlock(&old_dentry->d_inode->i_lock);
}
dentry->d_op = old_dentry->d_op;
diff --git a/fs/affs/inode.c b/fs/affs/inode.c
index 3a0fdec..cb9e773 100644
--- a/fs/affs/inode.c
+++ b/fs/affs/inode.c
@@ -388,7 +388,9 @@ affs_add_entry(struct inode *dir, struct inode *inode, struct dentry *dentry, s3
affs_adjust_checksum(inode_bh, block - be32_to_cpu(chain));
mark_buffer_dirty_inode(inode_bh, inode);
inode->i_nlink = 2;
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
}
affs_fix_checksum(sb, bh);
mark_buffer_dirty_inode(bh, inode);
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 0d38c09..4d8598c 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -1045,7 +1045,9 @@ static int afs_link(struct dentry *from, struct inode *dir,
if (ret < 0)
goto link_error;
- atomic_inc(&vnode->vfs_inode.i_count);
+ spin_lock(&vnode->vfs_inode.i_lock);
+ vnode->vfs_inode.i_count++;
+ spin_unlock(&vnode->vfs_inode.i_lock);
d_instantiate(dentry, &vnode->vfs_inode);
key_put(key);
_leave(" = 0");
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index e4b75d6..c50dc2a 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -114,7 +114,9 @@ struct file *anon_inode_getfile(const char *name,
* so we can avoid doing an igrab() and we can use an open-coded
* atomic_inc().
*/
- atomic_inc(&anon_inode_inode->i_count);
+ spin_lock(&anon_inode_inode->i_lock);
+ anon_inode_inode->i_count++;
+ spin_unlock(&anon_inode_inode->i_lock);
path.dentry->d_op = &anon_inodefs_dentry_operations;
d_instantiate(path.dentry, anon_inode_inode);
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index d967e05..d42fc72 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -176,7 +176,9 @@ static int bfs_link(struct dentry *old, struct inode *dir,
inc_nlink(inode);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_instantiate(new, inode);
mutex_unlock(&info->bfs_lock);
return 0;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 50e8c85..140451c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -550,7 +550,12 @@ EXPORT_SYMBOL(bdget);
*/
struct block_device *bdgrab(struct block_device *bdev)
{
- atomic_inc(&bdev->bd_inode->i_count);
+ struct inode *inode = bdev->bd_inode;
+
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
+
return bdev;
}
@@ -580,7 +585,9 @@ static struct block_device *bd_acquire(struct inode *inode)
spin_lock(&bdev_lock);
bdev = inode->i_bdev;
if (bdev) {
- atomic_inc(&bdev->bd_inode->i_count);
+ spin_lock(&inode->i_lock);
+ bdev->bd_inode->i_count++;
+ spin_unlock(&inode->i_lock);
spin_unlock(&bdev_lock);
return bdev;
}
@@ -596,7 +603,9 @@ static struct block_device *bd_acquire(struct inode *inode)
* So, we can access it via ->i_mapping always
* without igrab().
*/
- atomic_inc(&bdev->bd_inode->i_count);
+ spin_lock(&inode->i_lock);
+ bdev->bd_inode->i_count++;
+ spin_unlock(&inode->i_lock);
inode->i_bdev = bdev;
inode->i_mapping = bdev->bd_inode->i_mapping;
list_add(&inode->i_devices, &bdev->bd_inodes);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c038644..ffb8aec 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1964,8 +1964,13 @@ void btrfs_add_delayed_iput(struct inode *inode)
struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
struct delayed_iput *delayed;
- if (atomic_add_unless(&inode->i_count, -1, 1))
+ spin_lock(&inode->i_lock);
+ if (inode->i_count > 1) {
+ inode->i_count--;
+ spin_unlock(&inode->i_lock);
return;
+ }
+ spin_unlock(&inode->i_lock);
delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
delayed->inode = inode;
@@ -2718,11 +2723,17 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir,
return ERR_PTR(-ENOSPC);
/* check if there is someone else holds reference */
- if (S_ISDIR(inode->i_mode) && atomic_read(&inode->i_count) > 1)
+ spin_lock(&inode->i_lock);
+ if (S_ISDIR(inode->i_mode) && inode->i_count > 1) {
+ spin_unlock(&inode->i_lock);
return ERR_PTR(-ENOSPC);
+ }
- if (atomic_read(&inode->i_count) > 2)
+ if (inode->i_count > 2) {
+ spin_unlock(&inode->i_lock);
return ERR_PTR(-ENOSPC);
+ }
+ spin_unlock(&inode->i_lock);
if (xchg(&root->fs_info->enospc_unlink, 1))
return ERR_PTR(-ENOSPC);
@@ -3939,7 +3950,7 @@ again:
inode = igrab(&entry->vfs_inode);
if (inode) {
spin_unlock(&root->inode_lock);
- if (atomic_read(&inode->i_count) > 1)
+ if (inode->i_count > 1)
d_prune_aliases(inode);
/*
* btrfs_drop_inode will have it removed from
@@ -4758,7 +4769,9 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
}
btrfs_set_trans_block_group(trans, dir);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
err = btrfs_add_nondir(trans, dentry, inode, 1, index);
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index f091b13..cb4d673 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1102,7 +1102,7 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
spin_unlock(&inode->i_lock);
d_prune_aliases(inode);
dout("trim_caps_cb %p cap %p pruned, count now %d\n",
- inode, cap, atomic_read(&inode->i_count));
+ inode, cap, inode->i_count);
return 0;
}
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 93f77d4..af0f050 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1639,7 +1639,7 @@ int cifs_revalidate_dentry(struct dentry *dentry)
}
cFYI(1, "Revalidate: %s inode 0x%p count %d dentry: 0x%p d_time %ld "
- "jiffies %ld", full_path, inode, inode->i_count.counter,
+ "jiffies %ld", full_path, inode, inode->i_count,
dentry, dentry->d_time, jiffies);
if (CIFS_SB(sb)->tcon->unix_ext)
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index ccd98b0..2e52ad6 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -303,7 +303,9 @@ static int coda_link(struct dentry *source_de, struct inode *dir_inode,
}
coda_dir_update_mtime(dir_inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_instantiate(de, inode);
inc_nlink(inode);
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index eb7368e..3e7f967 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1101,7 +1101,9 @@ static void create_done(struct exofs_io_state *ios, void *p)
set_obj_created(oi);
- atomic_dec(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count--;
+ spin_unlock(&inode->i_lock);
wake_up(&oi->i_wq);
}
@@ -1154,14 +1156,18 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)
/* increment the refcount so that the inode will still be around when we
* reach the callback
*/
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
ios->done = create_done;
ios->private = inode;
ios->cred = oi->i_cred;
ret = exofs_sbi_create(ios);
if (ret) {
- atomic_dec(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count--;
+ spin_unlock(&inode->i_lock);
exofs_put_io_state(ios);
return ERR_PTR(ret);
}
diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
index b7dd0c2..506778a 100644
--- a/fs/exofs/namei.c
+++ b/fs/exofs/namei.c
@@ -153,7 +153,9 @@ static int exofs_link(struct dentry *old_dentry, struct inode *dir,
inode->i_ctime = CURRENT_TIME;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
return exofs_add_nondir(dentry, inode);
}
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 71efb0e..a5b9a54 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -206,7 +206,9 @@ static int ext2_link (struct dentry * old_dentry, struct inode * dir,
inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
err = ext2_add_link(dentry, inode);
if (!err) {
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index 4ab72db..cdb398b 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -100,9 +100,9 @@ void ext3_free_inode (handle_t *handle, struct inode * inode)
struct ext3_sb_info *sbi;
int fatal = 0, err;
- if (atomic_read(&inode->i_count) > 1) {
+ if (inode->i_count > 1) {
printk ("ext3_free_inode: inode has count=%d\n",
- atomic_read(&inode->i_count));
+ inode->i_count);
return;
}
if (inode->i_nlink) {
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 2b35ddb..4e3b5ff 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -2260,7 +2260,9 @@ retry:
inode->i_ctime = CURRENT_TIME_SEC;
inc_nlink(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
err = ext3_add_entry(handle, dentry, inode);
if (!err) {
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 45853e0..720b42d 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -189,9 +189,9 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
struct ext4_sb_info *sbi;
int fatal = 0, err, count, cleared;
- if (atomic_read(&inode->i_count) > 1) {
+ if (inode->i_count > 1) {
printk(KERN_ERR "ext4_free_inode: inode has count=%d\n",
- atomic_read(&inode->i_count));
+ inode->i_count);
return;
}
if (inode->i_nlink) {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 314c0d3..4dbb5e5 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2312,7 +2312,9 @@ retry:
inode->i_ctime = ext4_current_time(inode);
ext4_inc_count(handle, inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
err = ext4_add_entry(handle, dentry, inode);
if (!err) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 7bd1aef..2edaad7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -309,7 +309,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
unsigned dirty;
int ret;
- if (!atomic_read(&inode->i_count))
+ if (!inode->i_count)
WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
else
WARN_ON(inode->i_state & I_WILL_FREE);
@@ -406,7 +406,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* completion.
*/
redirty_tail(inode);
- } else if (atomic_read(&inode->i_count)) {
+ } else if (inode->i_count) {
/*
* The inode is clean, inuse
*/
diff --git a/fs/gfs2/ops_inode.c b/fs/gfs2/ops_inode.c
index 1009be2..49c38dc 100644
--- a/fs/gfs2/ops_inode.c
+++ b/fs/gfs2/ops_inode.c
@@ -253,7 +253,9 @@ out_parent:
gfs2_holder_uninit(ghs);
gfs2_holder_uninit(ghs + 1);
if (!error) {
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_instantiate(dentry, inode);
mark_inode_dirty(inode);
}
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 764fd1b..55fa48d 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -301,7 +301,9 @@ static int hfsplus_link(struct dentry *src_dentry, struct inode *dst_dir,
inc_nlink(inode);
hfsplus_instantiate(dst_dentry, inode, cnid);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
HFSPLUS_SB(sb).file_count++;
diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c
index 56f0da1..7f6b7ef 100644
--- a/fs/hpfs/inode.c
+++ b/fs/hpfs/inode.c
@@ -183,7 +183,7 @@ void hpfs_write_inode(struct inode *i)
struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
struct inode *parent;
if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
- if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) {
+ if (hpfs_inode->i_rddir_off && !i->i_count) {
if (*hpfs_inode->i_rddir_off) printk("HPFS: write_inode: some position still there\n");
kfree(hpfs_inode->i_rddir_off);
hpfs_inode->i_rddir_off = NULL;
diff --git a/fs/inode.c b/fs/inode.c
index 906a4ad..2e8ab8e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -32,14 +32,13 @@
* inode_hash_lock protects:
* inode hash table, i_hash
* inode->i_lock protects:
- * i_state
+ * i_state, i_count
*
* Ordering:
* inode_lock
* sb_inode_list_lock
* inode->i_lock
- * inode_lock
- * inode_hash_lock
+ * inode_hash_lock
*/
/*
* This is needed for the following functions:
@@ -150,7 +149,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
inode->i_sb = sb;
inode->i_blkbits = sb->s_blocksize_bits;
inode->i_flags = 0;
- atomic_set(&inode->i_count, 1);
+ inode->i_count = 1;
inode->i_op = &empty_iops;
inode->i_fop = &empty_fops;
inode->i_nlink = 1;
@@ -301,7 +300,8 @@ void __iget(struct inode *inode)
{
assert_spin_locked(&inode->i_lock);
- if (atomic_inc_return(&inode->i_count) != 1)
+ inode->i_count++;
+ if (inode->i_count > 1)
return;
if (!(inode->i_state & (I_DIRTY|I_SYNC)))
@@ -407,7 +407,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
continue;
}
invalidate_inode_buffers(inode);
- if (!atomic_read(&inode->i_count)) {
+ if (!inode->i_count) {
list_move(&inode->i_list, dispose);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
@@ -457,7 +457,7 @@ static int can_unuse(struct inode *inode)
return 0;
if (inode_has_buffers(inode))
return 0;
- if (atomic_read(&inode->i_count))
+ if (inode->i_count)
return 0;
if (inode->i_data.nrpages)
return 0;
@@ -495,7 +495,7 @@ static void prune_icache(int nr_to_scan)
inode = list_entry(inode_unused.prev, struct inode, i_list);
spin_lock(&inode->i_lock);
- if (inode->i_state || atomic_read(&inode->i_count)) {
+ if (inode->i_state || inode->i_count) {
list_move(&inode->i_list, &inode_unused);
spin_unlock(&inode->i_lock);
continue;
@@ -1307,8 +1307,6 @@ static void iput_final(struct inode *inode)
else
drop = generic_drop_inode(inode);
- spin_lock(&sb_inode_list_lock);
- spin_lock(&inode->i_lock);
if (!drop) {
if (!(inode->i_state & (I_DIRTY|I_SYNC)))
list_move(&inode->i_list, &inode_unused);
@@ -1368,8 +1366,24 @@ void iput(struct inode *inode)
if (inode) {
BUG_ON(inode->i_state & I_CLEAR);
- if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
+retry:
+ spin_lock(&inode->i_lock);
+ if (inode->i_count == 1) {
+ if (!spin_trylock(&inode_lock)) {
+ spin_unlock(&inode->i_lock);
+ goto retry;
+ }
+ if (!spin_trylock(&sb_inode_list_lock)) {
+ spin_unlock(&inode_lock);
+ spin_unlock(&inode->i_lock);
+ goto retry;
+ }
+ inode->i_count--;
iput_final(inode);
+ } else {
+ inode->i_count--;
+ spin_unlock(&inode->i_lock);
+ }
}
}
EXPORT_SYMBOL(iput);
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index ed78a3c..4d1bcfa 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -289,7 +289,9 @@ static int jffs2_link (struct dentry *old_dentry, struct inode *dir_i, struct de
mutex_unlock(&f->sem);
d_instantiate(dentry, old_dentry->d_inode);
dir_i->i_mtime = dir_i->i_ctime = ITIME(now);
- atomic_inc(&old_dentry->d_inode->i_count);
+ spin_lock(&old_dentry->d_inode->i_lock);
+ old_dentry->d_inode->i_count++;
+ spin_unlock(&old_dentry->d_inode->i_lock);
}
return ret;
}
@@ -864,7 +866,9 @@ static int jffs2_rename (struct inode *old_dir_i, struct dentry *old_dentry,
printk(KERN_NOTICE "jffs2_rename(): Link succeeded, unlink failed (err %d). You now have a hard link\n", ret);
/* Might as well let the VFS know */
d_instantiate(new_dentry, old_dentry->d_inode);
- atomic_inc(&old_dentry->d_inode->i_count);
+ spin_lock(&old_dentry->d_inode->i_lock);
+ old_dentry->d_inode->i_count++;
+ spin_unlock(&old_dentry->d_inode->i_lock);
new_dir_i->i_mtime = new_dir_i->i_ctime = ITIME(now);
return ret;
}
diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c
index d945ea7..820212f 100644
--- a/fs/jfs/jfs_txnmgr.c
+++ b/fs/jfs/jfs_txnmgr.c
@@ -1279,7 +1279,9 @@ int txCommit(tid_t tid, /* transaction identifier */
* lazy commit thread finishes processing
*/
if (tblk->xflag & COMMIT_DELETE) {
- atomic_inc(&tblk->u.ip->i_count);
+ spin_lock(&tblk->u.ip->i_lock);
+ tblk->u.ip->i_count++;
+ spin_unlock(&tblk->u.ip->i_lock);
/*
* Avoid a rare deadlock
*
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index a9cf8e8..3259008 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -839,7 +839,9 @@ static int jfs_link(struct dentry *old_dentry,
ip->i_ctime = CURRENT_TIME;
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
mark_inode_dirty(dir);
- atomic_inc(&ip->i_count);
+ spin_lock(&ip->i_lock);
+ ip->i_count++;
+ spin_unlock(&ip->i_lock);
iplist[0] = ip;
iplist[1] = dir;
diff --git a/fs/libfs.c b/fs/libfs.c
index 0a9da95..0fa4dbe 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -255,7 +255,9 @@ int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *den
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
inc_nlink(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
dget(dentry);
d_instantiate(dentry, inode);
return 0;
diff --git a/fs/locks.c b/fs/locks.c
index ab24d49..6d97e65 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1376,7 +1376,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
goto out;
if ((arg == F_WRLCK)
&& ((atomic_read(&dentry->d_count) > 1)
- || (atomic_read(&inode->i_count) > 1)))
+ || inode->i_count > 1))
goto out;
}
diff --git a/fs/logfs/dir.c b/fs/logfs/dir.c
index 9777eb5..90eb51f 100644
--- a/fs/logfs/dir.c
+++ b/fs/logfs/dir.c
@@ -569,7 +569,9 @@ static int logfs_link(struct dentry *old_dentry, struct inode *dir,
return -EMLINK;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
inode->i_nlink++;
mark_inode_dirty_sync(inode);
diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
index 6127baf..0d92191 100644
--- a/fs/logfs/readwrite.c
+++ b/fs/logfs/readwrite.c
@@ -1002,8 +1002,12 @@ static int __logfs_is_valid_block(struct inode *inode, u64 bix, u64 ofs)
{
struct logfs_inode *li = logfs_inode(inode);
- if ((inode->i_nlink == 0) && atomic_read(&inode->i_count) == 1)
+ spin_lock(&inode->i_lock);
+ if ((inode->i_nlink == 0) && inode->i_count == 1) {
+ spin_unlock(&inode->i_lock);
return 0;
+ }
+ spin_unlock(&inode->i_lock);
if (bix < I0_BLOCKS)
return logfs_is_valid_direct(li, bix, ofs);
diff --git a/fs/minix/namei.c b/fs/minix/namei.c
index f3f3578..a4a160f 100644
--- a/fs/minix/namei.c
+++ b/fs/minix/namei.c
@@ -101,7 +101,9 @@ static int minix_link(struct dentry * old_dentry, struct inode * dir,
inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
return add_nondir(dentry, inode);
}
diff --git a/fs/namei.c b/fs/namei.c
index 24896e8..817d6bb 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2290,8 +2290,11 @@ static long do_unlinkat(int dfd, const char __user *pathname)
if (nd.last.name[nd.last.len])
goto slashes;
inode = dentry->d_inode;
- if (inode)
- atomic_inc(&inode->i_count);
+ if (inode) {
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
+ }
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit2;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index e257172..375b6b5 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1580,7 +1580,9 @@ nfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
d_drop(dentry);
error = NFS_PROTO(dir)->link(inode, dir, &dentry->d_name);
if (error == 0) {
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_add(dentry, inode);
}
return error;
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index a70e446..c6db37e 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -55,7 +55,9 @@ static int nfs_superblock_set_dummy_root(struct super_block *sb, struct inode *i
return -ENOMEM;
}
/* Circumvent igrab(): we know the inode is not being freed */
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
/*
* Ensure that this dentry is invisible to d_find_alias().
* Otherwise, it may be spliced into the tree by
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 7d2d6c7..dfb79e5 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -384,7 +384,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
dprintk("NFS: nfs_fhget(%s/%Ld ct=%d)\n",
inode->i_sb->s_id,
(long long)NFS_FILEID(inode),
- atomic_read(&inode->i_count));
+ inode->i_count);
out:
return inode;
@@ -1190,7 +1190,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
dfprintk(VFS, "NFS: %s(%s/%ld ct=%d info=0x%x)\n",
__func__, inode->i_sb->s_id, inode->i_ino,
- atomic_read(&inode->i_count), fattr->valid);
+ inode->i_count, fattr->valid);
if ((fattr->valid & NFS_ATTR_FATTR_FILEID) && nfsi->fileid != fattr->fileid)
goto out_fileid;
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 3e2f19b..d7fc5d0 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -506,8 +506,8 @@ nfs4_get_open_state(struct inode *inode, struct nfs4_state_owner *owner)
state->owner = owner;
atomic_inc(&owner->so_count);
list_add(&state->inode_states, &nfsi->open_states);
- state->inode = igrab(inode);
spin_unlock(&inode->i_lock);
+ state->inode = igrab(inode);
/* Note: The reclaim code dictates that we add stateless
* and read-only stateids to the end of the list */
list_add_tail(&state->open_states, &owner->so_states);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 874972d..129ebaa 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -390,7 +390,7 @@ static int nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
error = radix_tree_insert(&nfsi->nfs_page_tree, req->wb_index, req);
BUG_ON(error);
if (!nfsi->npages) {
- igrab(inode);
+ __iget(inode);
if (nfs_have_delegation(inode, FMODE_WRITE))
nfsi->change_attr++;
}
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index d01aff4..db0b75b 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -480,7 +480,7 @@ nilfs_mdt_new_common(struct the_nilfs *nilfs, struct super_block *sb,
inode->i_sb = sb; /* sb may be NULL for some meta data files */
inode->i_blkbits = nilfs->ns_blocksize_bits;
inode->i_flags = 0;
- atomic_set(&inode->i_count, 1);
+ inode->i_count = 1;
inode->i_nlink = 1;
inode->i_ino = ino;
inode->i_mode = S_IFREG;
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index ad6ed2c..9e287ea 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -219,7 +219,9 @@ static int nilfs_link(struct dentry *old_dentry, struct inode *dir,
inode->i_ctime = CURRENT_TIME;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
err = nilfs_add_nondir(dentry, inode);
if (!err)
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 34b1585..70f7e16 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -257,24 +257,29 @@ void fsnotify_unmount_inodes(struct list_head *list)
* evict all inodes with zero i_count from icache which is
* unnecessarily violent and may in fact be illegal to do.
*/
- if (!atomic_read(&inode->i_count))
+ if (!inode->i_count)
continue;
need_iput_tmp = need_iput;
need_iput = NULL;
/* In case fsnotify_inode_delete() drops a reference. */
- if (inode != need_iput_tmp)
+ if (inode != need_iput_tmp) {
+ spin_lock(&inode->i_lock);
__iget(inode);
- else
+ spin_unlock(&inode->i_lock);
+ } else
need_iput_tmp = NULL;
/* In case the dropping of a reference would nuke next_i. */
- if ((&next_i->i_sb_list != list) &&
- atomic_read(&next_i->i_count) &&
- !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
- __iget(next_i);
- need_iput = next_i;
+ if (&next_i->i_sb_list != list) {
+ spin_lock(&next_i->i_lock);
+ if (next_i->i_count &&
+ !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
+ __iget(next_i);
+ need_iput = next_i;
+ }
+ spin_unlock(&next_i->i_lock);
}
/*
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index 5128061..2e380ba 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -2930,7 +2930,9 @@ static int ntfs_fill_super(struct super_block *sb, void *opt, const int silent)
}
if ((sb->s_root = d_alloc_root(vol->root_ino))) {
/* We increment i_count simulating an ntfs_iget(). */
- atomic_inc(&vol->root_ino->i_count);
+ spin_lock(&vol->root_ino->i_lock);
+ vol->root_ino->i_count++;
+ spin_unlock(&vol->root_ino->i_lock);
ntfs_debug("Exiting, status successful.");
/* Release the default upcase if it has no users. */
mutex_lock(&ntfs_lock);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index a00dda2..9c46feb 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -741,7 +741,9 @@ static int ocfs2_link(struct dentry *old_dentry,
goto out_commit;
}
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
dentry->d_op = &ocfs2_dentry_ops;
d_instantiate(dentry, inode);
diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c
index ee78d4a..1efebb2 100644
--- a/fs/reiserfs/namei.c
+++ b/fs/reiserfs/namei.c
@@ -1156,7 +1156,9 @@ static int reiserfs_link(struct dentry *old_dentry, struct inode *dir,
inode->i_ctime = CURRENT_TIME_SEC;
reiserfs_update_sd(&th, inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_instantiate(dentry, inode);
retval = journal_end(&th, dir->i_sb, jbegin_count);
reiserfs_write_unlock(dir->i_sb);
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 313d39d..b8d4d67 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1477,7 +1477,7 @@ static int maybe_indirect_to_direct(struct reiserfs_transaction_handle *th,
** reading in the last block. The user will hit problems trying to
** read the file, but for now we just skip the indirect2direct
*/
- if (atomic_read(&inode->i_count) > 1 ||
+ if (inode->i_count > 1 ||
!tail_has_to_be_packed(inode) ||
!page || (REISERFS_I(inode)->i_flags & i_nopack_mask)) {
/* leave tail in an unformatted node */
diff --git a/fs/sysv/namei.c b/fs/sysv/namei.c
index 33e047b..d63da9b 100644
--- a/fs/sysv/namei.c
+++ b/fs/sysv/namei.c
@@ -126,7 +126,9 @@ static int sysv_link(struct dentry * old_dentry, struct inode * dir,
inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
return add_nondir(dentry, inode);
}
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 87ebcce..c204b5c 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -550,7 +550,9 @@ static int ubifs_link(struct dentry *old_dentry, struct inode *dir,
lock_2_inodes(dir, inode);
inc_nlink(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
inode->i_ctime = ubifs_current_time(inode);
dir->i_size += sz_change;
dir_ui->ui_size = dir->i_size;
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index cd5900b..c3c79b7 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -342,7 +342,7 @@ static void ubifs_evict_inode(struct inode *inode)
goto out;
dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
- ubifs_assert(!atomic_read(&inode->i_count));
+ ubifs_assert(!inode->i_count);
truncate_inode_pages(&inode->i_data, 0);
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index bf5fc67..d8b0dc8 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -1101,7 +1101,9 @@ static int udf_link(struct dentry *old_dentry, struct inode *dir,
inc_nlink(inode);
inode->i_ctime = current_fs_time(inode->i_sb);
mark_inode_dirty(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_instantiate(dentry, inode);
unlock_kernel();
diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
index b056f02..8cbf920 100644
--- a/fs/ufs/namei.c
+++ b/fs/ufs/namei.c
@@ -180,7 +180,9 @@ static int ufs_link (struct dentry * old_dentry, struct inode * dir,
inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
error = ufs_add_nondir(dentry, inode);
unlock_kernel();
diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index b1fc2a6..332cdf5 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -352,7 +352,9 @@ xfs_vn_link(
if (unlikely(error))
return -error;
- atomic_inc(&inode->i_count);
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
d_instantiate(dentry, inode);
return 0;
}
diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index be5dffd..065db30 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -599,7 +599,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
TP_fast_assign(
__entry->dev = VFS_I(ip)->i_sb->s_dev;
__entry->ino = ip->i_ino;
- __entry->count = atomic_read(&VFS_I(ip)->i_count);
+ __entry->count = VFS_I(ip)->i_count;
__entry->pincount = atomic_read(&ip->i_pincount);
__entry->caller_ip = caller_ip;
),
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0898c54..859628b 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -481,8 +481,10 @@ void xfs_mark_inode_dirty_sync(xfs_inode_t *);
#define IHOLD(ip) \
do { \
- ASSERT(atomic_read(&VFS_I(ip)->i_count) > 0) ; \
- atomic_inc(&(VFS_I(ip)->i_count)); \
+ spin_lock(&VFS_I(ip)->i_lock); \
+ ASSERT(VFS_I(ip)->i_count > 0) ; \
+ VFS_I(ip)->i_count++; \
+ spin_unlock(&VFS_I(ip)->i_lock); \
trace_xfs_ihold(ip, _THIS_IP_); \
} while (0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 76041b6..415c88e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -728,7 +728,7 @@ struct inode {
struct list_head i_sb_list;
struct list_head i_dentry;
unsigned long i_ino;
- atomic_t i_count;
+ unsigned int i_count;
unsigned int i_nlink;
uid_t i_uid;
gid_t i_gid;
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c60e519..7fe9efb 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -768,8 +768,11 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
}
inode = dentry->d_inode;
- if (inode)
- atomic_inc(&inode->i_count);
+ if (inode) {
+ spin_lock(&inode->i_lock);
+ inode->i_count++;
+ spin_unlock(&inode->i_lock);
+ }
err = mnt_want_write(ipc_ns->mq_mnt);
if (err)
goto out_err;
diff --git a/kernel/futex.c b/kernel/futex.c
index 6a3a5fa..e9dfa00 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -168,7 +168,9 @@ static void get_futex_key_refs(union futex_key *key)
switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
case FUT_OFF_INODE:
- atomic_inc(&key->shared.inode->i_count);
+ spin_lock(&key->shared.inode->i_lock);
+ key->shared.inode->i_count++;
+ spin_unlock(&key->shared.inode->i_lock);
break;
case FUT_OFF_MMSHARED:
atomic_inc(&key->private.mm->mm_count);
diff --git a/mm/shmem.c b/mm/shmem.c
index 080b09a..56229bb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1903,7 +1903,9 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
dir->i_size += BOGO_DIRENT_SIZE;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
inc_nlink(inode);
- atomic_inc(&inode->i_count); /* New dentry reference */
+ spin_lock(&inode->i_lock);
+ inode->i_count++; /* New dentry reference */
+ spin_unlock(&inode->i_lock);
dget(dentry); /* Extra pinning count for the created dentry */
d_instantiate(dentry, inode);
out:
diff --git a/net/socket.c b/net/socket.c
index 2270b94..5431af1 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -377,7 +377,9 @@ static int sock_alloc_file(struct socket *sock, struct file **f, int flags)
&socket_file_ops);
if (unlikely(!file)) {
/* drop dentry, keep inode */
- atomic_inc(&path.dentry->d_inode->i_count);
+ spin_lock(&path.dentry->d_inode->i_lock);
+ path.dentry->d_inode->i_count++;
+ spin_unlock(&path.dentry->d_inode->i_lock);
path_put(&path);
put_unused_fd(fd);
return -ENFILE;
--
1.7.1
* [PATCH 06/17] fs: icache lock lru/writeback lists
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (4 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 05/17] fs: icache lock i_count Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-09-30 4:52 ` Andrew Morton
2010-10-01 6:01 ` Christoph Hellwig
2010-09-29 12:18 ` [PATCH 07/17] fs: icache atomic inodes_stat Dave Chinner
` (13 subsequent siblings)
19 siblings, 2 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
The inode moves between different lists, all currently protected by the
inode_lock. Introduce a new lock that protects all of the lists (dirty, unused,
in use, etc) that the inode moves around as it changes state. As this lock
mostly protects the writeback lists, name it wb_inode_list_lock and nest all
the list manipulations under it inside the current inode_lock scope.
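For review clarity, the nesting this establishes when an inode moves between
writeback lists looks roughly like the following sketch (not lifted verbatim
from the patch; the target list chosen here is illustrative):
	spin_lock(&inode_lock);
	spin_lock(&inode->i_lock);
	spin_lock(&wb_inode_list_lock);
	/* i_list can now be moved safely, e.g. onto the bdi dirty list */
	list_move(&inode->i_list, &bdi->wb.b_dirty);
	spin_unlock(&wb_inode_list_lock);
	spin_unlock(&inode->i_lock);
	spin_unlock(&inode_lock);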
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/fs-writeback.c | 48 +++++++++++++++++++++++++++++++++++++++++++-
fs/inode.c | 44 +++++++++++++++++++++++++++++++++-------
include/linux/writeback.h | 1 +
mm/backing-dev.c | 4 +++
4 files changed, 87 insertions(+), 10 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2edaad7..fb7b723 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -161,6 +161,7 @@ static void redirty_tail(struct inode *inode)
{
struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
+ assert_spin_locked(&wb_inode_list_lock);
if (!list_empty(&wb->b_dirty)) {
struct inode *tail;
@@ -178,6 +179,7 @@ static void requeue_io(struct inode *inode)
{
struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
+ assert_spin_locked(&wb_inode_list_lock);
list_move(&inode->i_list, &wb->b_more_io);
}
@@ -218,6 +220,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
struct inode *inode;
int do_sb_sort = 0;
+ assert_spin_locked(&wb_inode_list_lock);
while (!list_empty(delaying_queue)) {
inode = list_entry(delaying_queue->prev, struct inode, i_list);
if (older_than_this &&
@@ -281,11 +284,13 @@ static void inode_wait_for_writeback(struct inode *inode)
wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
while (inode->i_state & I_SYNC) {
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
+ spin_lock(&wb_inode_list_lock);
}
}
@@ -339,6 +344,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
/* Set I_SYNC, reset I_DIRTY_PAGES */
inode->i_state |= I_SYNC;
inode->i_state &= ~I_DIRTY_PAGES;
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
@@ -375,6 +381,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
+ spin_lock(&wb_inode_list_lock);
inode->i_state &= ~I_SYNC;
if (!(inode->i_state & I_FREEING)) {
if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
@@ -461,11 +468,18 @@ static bool pin_sb_for_writeback(struct super_block *sb)
static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
struct writeback_control *wbc, bool only_this_sb)
{
+again:
while (!list_empty(&wb->b_io)) {
long pages_skipped;
struct inode *inode = list_entry(wb->b_io.prev,
struct inode, i_list);
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&wb_inode_list_lock);
+ spin_lock(&wb_inode_list_lock);
+ goto again;
+ }
+
if (inode->i_sb != sb) {
if (only_this_sb) {
/*
@@ -474,9 +488,12 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
* to it back onto the dirty list.
*/
redirty_tail(inode);
+ spin_unlock(&inode->i_lock);
continue;
}
+ spin_unlock(&inode->i_lock);
+
/*
* The inode belongs to a different superblock.
* Bounce back to the caller to unpin this and
@@ -485,10 +502,9 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
return 0;
}
- spin_lock(&inode->i_lock);
if (inode->i_state & (I_NEW | I_WILL_FREE)) {
- spin_unlock(&inode->i_lock);
requeue_io(inode);
+ spin_unlock(&inode->i_lock);
continue;
}
/*
@@ -511,11 +527,13 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
*/
redirty_tail(inode);
}
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
iput(inode);
cond_resched();
spin_lock(&inode_lock);
+ spin_lock(&wb_inode_list_lock);
if (wbc->nr_to_write <= 0) {
wbc->more_io = 1;
return 1;
@@ -535,6 +553,9 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
if (!wbc->wb_start)
wbc->wb_start = jiffies; /* livelock avoidance */
spin_lock(&inode_lock);
+again:
+ spin_lock(&wb_inode_list_lock);
+
if (!wbc->for_kupdate || list_empty(&wb->b_io))
queue_io(wb, wbc->older_than_this);
@@ -544,7 +565,12 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
struct super_block *sb = inode->i_sb;
if (!pin_sb_for_writeback(sb)) {
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&wb_inode_list_lock);
+ goto again;
+ }
requeue_io(inode);
+ spin_unlock(&inode->i_lock);
continue;
}
ret = writeback_sb_inodes(sb, wb, wbc, false);
@@ -553,6 +579,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
if (ret)
break;
}
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode_lock);
/* Leave any unwritten inodes on b_io */
}
@@ -563,9 +590,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
WARN_ON(!rwsem_is_locked(&sb->s_umount));
spin_lock(&inode_lock);
+ spin_lock(&wb_inode_list_lock);
if (!wbc->for_kupdate || list_empty(&wb->b_io))
queue_io(wb, wbc->older_than_this);
writeback_sb_inodes(sb, wb, wbc, true);
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode_lock);
}
@@ -676,13 +705,22 @@ static long wb_writeback(struct bdi_writeback *wb,
* become available for writeback. Otherwise
* we'll just busyloop.
*/
+retry:
spin_lock(&inode_lock);
+ spin_lock(&wb_inode_list_lock);
if (!list_empty(&wb->b_more_io)) {
inode = list_entry(wb->b_more_io.prev,
struct inode, i_list);
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&wb_inode_list_lock);
+ spin_unlock(&inode_lock);
+ goto retry;
+ }
trace_wbc_writeback_wait(&wbc, wb->bdi);
inode_wait_for_writeback(inode);
+ spin_unlock(&inode->i_lock);
}
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode_lock);
}
@@ -993,8 +1031,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
wakeup_bdi = true;
}
+ spin_lock(&wb_inode_list_lock);
inode->dirtied_when = jiffies;
list_move(&inode->i_list, &bdi->wb.b_dirty);
+ spin_unlock(&wb_inode_list_lock);
}
}
out:
@@ -1183,7 +1223,9 @@ int write_inode_now(struct inode *inode, int sync)
might_sleep();
spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
+ spin_lock(&wb_inode_list_lock);
ret = writeback_single_inode(inode, &wbc);
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
if (sync)
@@ -1209,7 +1251,9 @@ int sync_inode(struct inode *inode, struct writeback_control *wbc)
spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
+ spin_lock(&wb_inode_list_lock);
ret = writeback_single_inode(inode, wbc);
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
return ret;
diff --git a/fs/inode.c b/fs/inode.c
index 2e8ab8e..e15620f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -31,6 +31,8 @@
* s_inodes, i_sb_list
* inode_hash_lock protects:
* inode hash table, i_hash
+ * wb_inode_list_lock protects:
+ * inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_list
* inode->i_lock protects:
* i_state, i_count
*
@@ -38,6 +40,7 @@
* inode_lock
* sb_inode_list_lock
* inode->i_lock
+ * wb_inode_list_lock
* inode_hash_lock
*/
/*
@@ -99,6 +102,7 @@ static struct hlist_head *inode_hashtable __read_mostly;
*/
DEFINE_SPINLOCK(inode_lock);
DEFINE_SPINLOCK(sb_inode_list_lock);
+DEFINE_SPINLOCK(wb_inode_list_lock);
DEFINE_SPINLOCK(inode_hash_lock);
/*
@@ -304,8 +308,11 @@ void __iget(struct inode *inode)
if (inode->i_count > 1)
return;
- if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+ if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
+ spin_lock(&wb_inode_list_lock);
list_move(&inode->i_list, &inode_in_use);
+ spin_unlock(&wb_inode_list_lock);
+ }
inodes_stat.nr_unused--;
}
@@ -408,7 +415,9 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
}
invalidate_inode_buffers(inode);
if (!inode->i_count) {
+ spin_lock(&wb_inode_list_lock);
list_move(&inode->i_list, dispose);
+ spin_unlock(&wb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
@@ -486,6 +495,8 @@ static void prune_icache(int nr_to_scan)
down_read(&iprune_sem);
spin_lock(&inode_lock);
+again:
+ spin_lock(&wb_inode_list_lock);
for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
struct inode *inode;
@@ -494,13 +505,17 @@ static void prune_icache(int nr_to_scan)
inode = list_entry(inode_unused.prev, struct inode, i_list);
- spin_lock(&inode->i_lock);
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&wb_inode_list_lock);
+ goto again;
+ }
if (inode->i_state || inode->i_count) {
list_move(&inode->i_list, &inode_unused);
spin_unlock(&inode->i_lock);
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+ spin_unlock(&wb_inode_list_lock);
__iget(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
@@ -509,11 +524,16 @@ static void prune_icache(int nr_to_scan)
0, -1);
iput(inode);
spin_lock(&inode_lock);
+again2:
+ spin_lock(&wb_inode_list_lock);
if (inode != list_entry(inode_unused.next,
struct inode, i_list))
continue; /* wrong inode or list_empty */
- spin_lock(&inode->i_lock);
+ if (!spin_trylock(&inode->i_lock)) {
+ spin_unlock(&wb_inode_list_lock);
+ goto again2;
+ }
if (!can_unuse(inode)) {
spin_unlock(&inode->i_lock);
continue;
@@ -531,6 +551,7 @@ static void prune_icache(int nr_to_scan)
else
__count_vm_events(PGINODESTEAL, reap);
spin_unlock(&inode_lock);
+ spin_unlock(&wb_inode_list_lock);
dispose_list(&freeable);
up_read(&iprune_sem);
@@ -654,7 +675,9 @@ __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
spin_lock(&sb_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
+ spin_lock(&wb_inode_list_lock);
list_add(&inode->i_list, &inode_in_use);
+ spin_unlock(&wb_inode_list_lock);
if (head) {
spin_lock(&inode_hash_lock);
hlist_add_head(&inode->i_hash, head);
@@ -1308,8 +1331,11 @@ static void iput_final(struct inode *inode)
drop = generic_drop_inode(inode);
if (!drop) {
- if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+ if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
+ spin_lock(&wb_inode_list_lock);
list_move(&inode->i_list, &inode_unused);
+ spin_unlock(&wb_inode_list_lock);
+ }
inodes_stat.nr_unused++;
if (sb->s_flags & MS_ACTIVE) {
spin_unlock(&inode->i_lock);
@@ -1333,7 +1359,9 @@ static void iput_final(struct inode *inode)
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
}
+ spin_lock(&wb_inode_list_lock);
list_del_init(&inode->i_list);
+ spin_unlock(&wb_inode_list_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
@@ -1366,17 +1394,17 @@ void iput(struct inode *inode)
if (inode) {
BUG_ON(inode->i_state & I_CLEAR);
-retry:
+retry1:
spin_lock(&inode->i_lock);
if (inode->i_count == 1) {
if (!spin_trylock(&inode_lock)) {
+retry2:
spin_unlock(&inode->i_lock);
- goto retry;
+ goto retry1;
}
if (!spin_trylock(&sb_inode_list_lock)) {
spin_unlock(&inode_lock);
- spin_unlock(&inode->i_lock);
- goto retry;
+ goto retry2;
}
inode->i_count--;
iput_final(inode);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 35d6e81..8b9c24f 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -11,6 +11,7 @@ struct backing_dev_info;
extern spinlock_t inode_lock;
extern spinlock_t sb_inode_list_lock;
+extern spinlock_t wb_inode_list_lock;
extern spinlock_t inode_hash_lock;
extern struct list_head inode_in_use;
extern struct list_head inode_unused;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c2bf86f..b1e2987 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -73,12 +73,14 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
nr_wb = nr_dirty = nr_io = nr_more_io = 0;
spin_lock(&inode_lock);
+ spin_lock(&wb_inode_list_lock);
list_for_each_entry(inode, &wb->b_dirty, i_list)
nr_dirty++;
list_for_each_entry(inode, &wb->b_io, i_list)
nr_io++;
list_for_each_entry(inode, &wb->b_more_io, i_list)
nr_more_io++;
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode_lock);
global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -681,9 +683,11 @@ void bdi_destroy(struct backing_dev_info *bdi)
struct bdi_writeback *dst = &default_backing_dev_info.wb;
spin_lock(&inode_lock);
+ spin_lock(&wb_inode_list_lock);
list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
list_splice(&bdi->wb.b_io, &dst->b_io);
list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+ spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode_lock);
}
--
1.7.1
* [PATCH 07/17] fs: icache atomic inodes_stat
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (5 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 06/17] fs: icache lock lru/writeback lists Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-09-30 4:52 ` Andrew Morton
2010-09-29 12:18 ` [PATCH 08/17] fs: icache protect inode state Dave Chinner
` (12 subsequent siblings)
19 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
The inode use statistics are currently protected by the inode_lock.
Before we can remove the inode_lock, we need to protect these
counters against races. Do this by converting them to atomic
counters so they are not dependent on any lock at all.
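The conversion itself is the standard atomic_t pattern; in sketch form (the
calls below all appear in the diff):
	atomic_inc(&inodes_stat.nr_inodes);
	atomic_sub(nr_disposed, &inodes_stat.nr_inodes);
	atomic_dec(&inodes_stat.nr_unused);
	/* readers need no lock either */
	nr_pages = atomic_read(&inodes_stat.nr_inodes) -
		   atomic_read(&inodes_stat.nr_unused);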
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/fs-writeback.c | 6 ++++--
fs/inode.c | 26 ++++++++++++++------------
include/linux/fs.h | 13 ++++++-------
3 files changed, 24 insertions(+), 21 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index fb7b723..f6c8975 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -764,7 +764,8 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
wb->last_old_flush = jiffies;
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused);
+ (atomic_read(&inodes_stat.nr_inodes) -
+ atomic_read(&inodes_stat.nr_unused));
if (nr_pages) {
struct wb_writeback_work work = {
@@ -1144,7 +1145,8 @@ void writeback_inodes_sb(struct super_block *sb)
WARN_ON(!rwsem_is_locked(&sb->s_umount));
work.nr_pages = nr_dirty + nr_unstable +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused);
+ (atomic_read(&inodes_stat.nr_inodes) -
+ atomic_read(&inodes_stat.nr_unused));
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);
diff --git a/fs/inode.c b/fs/inode.c
index e15620f..6d982e6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -122,7 +122,10 @@ static DECLARE_RWSEM(iprune_sem);
/*
* Statistics gathering..
*/
-struct inodes_stat_t inodes_stat;
+struct inodes_stat_t inodes_stat = {
+ .nr_inodes = ATOMIC_INIT(0),
+ .nr_unused = ATOMIC_INIT(0),
+};
static struct kmem_cache *inode_cachep __read_mostly;
@@ -313,7 +316,7 @@ void __iget(struct inode *inode)
list_move(&inode->i_list, &inode_in_use);
spin_unlock(&wb_inode_list_lock);
}
- inodes_stat.nr_unused--;
+ atomic_dec(&inodes_stat.nr_unused);
}
void end_writeback(struct inode *inode)
@@ -377,9 +380,7 @@ static void dispose_list(struct list_head *head)
destroy_inode(inode);
nr_disposed++;
}
- spin_lock(&inode_lock);
- inodes_stat.nr_inodes -= nr_disposed;
- spin_unlock(&inode_lock);
+ atomic_sub(nr_disposed, &inodes_stat.nr_inodes);
}
/*
@@ -428,7 +429,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
busy = 1;
}
/* only unused inodes may be cached with i_count zero */
- inodes_stat.nr_unused -= count;
+ atomic_sub(count, &inodes_stat.nr_unused);
return busy;
}
@@ -545,7 +546,7 @@ again2:
spin_unlock(&inode->i_lock);
nr_pruned++;
}
- inodes_stat.nr_unused -= nr_pruned;
+ atomic_sub(nr_pruned, &inodes_stat.nr_unused);
if (current_is_kswapd())
__count_vm_events(KSWAPD_INODESTEAL, reap);
else
@@ -578,7 +579,8 @@ static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
return -1;
prune_icache(nr);
}
- return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
+ return (atomic_read(&inodes_stat.nr_unused) / 100) *
+ sysctl_vfs_cache_pressure;
}
static struct shrinker icache_shrinker = {
@@ -671,7 +673,7 @@ static inline void
__inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
struct inode *inode)
{
- inodes_stat.nr_inodes++;
+ atomic_inc(&inodes_stat.nr_inodes);
spin_lock(&sb_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
@@ -1336,7 +1338,7 @@ static void iput_final(struct inode *inode)
list_move(&inode->i_list, &inode_unused);
spin_unlock(&wb_inode_list_lock);
}
- inodes_stat.nr_unused++;
+ atomic_inc(&inodes_stat.nr_unused);
if (sb->s_flags & MS_ACTIVE) {
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
@@ -1354,7 +1356,7 @@ static void iput_final(struct inode *inode)
spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
- inodes_stat.nr_unused--;
+ atomic_dec(&inodes_stat.nr_unused);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
@@ -1366,9 +1368,9 @@ static void iput_final(struct inode *inode)
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
- inodes_stat.nr_inodes--;
spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
+ atomic_dec(&inodes_stat.nr_inodes);
evict(inode);
spin_lock(&inode_lock);
spin_lock(&inode_hash_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 415c88e..d60f256 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -39,13 +39,6 @@ struct files_stat_struct {
int max_files; /* tunable */
};
-struct inodes_stat_t {
- int nr_inodes;
- int nr_unused;
- int dummy[5]; /* padding for sysctl ABI compatibility */
-};
-
-
#define NR_FILE 8192 /* this can well be larger on a larger system */
#define MAY_EXEC 1
@@ -416,6 +409,12 @@ typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
ssize_t bytes, void *private, int ret,
bool is_async);
+struct inodes_stat_t {
+ atomic_t nr_inodes;
+ atomic_t nr_unused;
+ int dummy[5]; /* padding for sysctl ABI compatibility */
+};
+
/*
* Attribute flags. These should be or-ed together to figure out what
* has been changed!
--
1.7.1
* [PATCH 08/17] fs: icache protect inode state
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (6 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 07/17] fs: icache atomic inodes_stat Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-10-01 6:02 ` Christoph Hellwig
2010-09-29 12:18 ` [PATCH 09/17] fs: Make last_ino, iunique independent of inode_lock Dave Chinner
` (11 subsequent siblings)
19 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Before removing the inode_lock, we need to protect the inode list
operations with the inode->i_lock. This ensures that all inode state
changes are serialised even though the lists the inode moves between
may be protected by different locks. Hence an inode in transit from
one list to another is safely protected without needing to hold all
the list locks at the same time.
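The dispose_list() hunk below is representative of the resulting pattern;
sketched here with the lock names introduced earlier in this series:
	spin_lock(&inode_lock);
	spin_lock(&sb_inode_list_lock);
	spin_lock(&inode->i_lock);	/* pins i_state/i_hash/i_list/i_sb_list */
	spin_lock(&inode_hash_lock);
	hlist_del_init(&inode->i_hash);
	spin_unlock(&inode_hash_lock);
	list_del_init(&inode->i_sb_list);
	spin_unlock(&sb_inode_list_lock);
	spin_unlock(&inode->i_lock);
	spin_unlock(&inode_lock);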
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 33 +++++++++++++++++++++++++++++----
1 files changed, 29 insertions(+), 4 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 6d982e6..8fbc4d4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -34,7 +34,11 @@
* wb_inode_list_lock protects:
* inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_list
* inode->i_lock protects:
- * i_state, i_count
+ * i_state
+ * i_count
+ * i_hash
+ * i_list
+ * i_sb_list
*
* Ordering:
* inode_lock
@@ -368,12 +372,14 @@ static void dispose_list(struct list_head *head)
evict(inode);
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
- spin_lock(&sb_inode_list_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
wake_up_inode(inode);
@@ -674,7 +680,6 @@ __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
struct inode *inode)
{
atomic_inc(&inodes_stat.nr_inodes);
- spin_lock(&sb_inode_list_lock);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
spin_lock(&wb_inode_list_lock);
@@ -704,7 +709,10 @@ void inode_add_to_lists(struct super_block *sb, struct inode *inode)
struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
__inode_add_to_lists(sb, head, inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL_GPL(inode_add_to_lists);
@@ -736,9 +744,12 @@ struct inode *new_inode(struct super_block *sb)
inode = alloc_inode(sb);
if (inode) {
spin_lock(&inode_lock);
+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
inode->i_ino = ++last_ino;
inode->i_state = 0;
__inode_add_to_lists(sb, NULL, inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
}
return inode;
@@ -802,11 +813,14 @@ static struct inode *get_new_inode(struct super_block *sb,
/* We released the lock, so.. */
old = find_inode(sb, head, test, data);
if (!old) {
+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
if (set(inode, data))
goto set_failed;
inode->i_state = I_NEW;
__inode_add_to_lists(sb, head, inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
/* Return the locked inode with I_NEW set, the
@@ -831,6 +845,7 @@ static struct inode *get_new_inode(struct super_block *sb,
set_failed:
spin_unlock(&inode->i_lock);
+ spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode_lock);
destroy_inode(inode);
return NULL;
@@ -853,9 +868,12 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
/* We released the lock, so.. */
old = find_inode_fast(sb, head, ino);
if (!old) {
+ spin_lock(&sb_inode_list_lock);
+ spin_lock(&inode->i_lock);
inode->i_ino = ino;
inode->i_state = I_NEW;
__inode_add_to_lists(sb, head, inode);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
/* Return the locked inode with I_NEW set, the
@@ -1270,10 +1288,13 @@ EXPORT_SYMBOL(insert_inode_locked4);
void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_add_head(&inode->i_hash, head);
spin_unlock(&inode_hash_lock);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(__insert_inode_hash);
@@ -1287,9 +1308,11 @@ EXPORT_SYMBOL(__insert_inode_hash);
void remove_inode_hash(struct inode *inode)
{
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(remove_inode_hash);
@@ -1356,10 +1379,10 @@ static void iput_final(struct inode *inode)
spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
- atomic_dec(&inodes_stat.nr_unused);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
+ atomic_dec(&inodes_stat.nr_unused);
}
spin_lock(&wb_inode_list_lock);
list_del_init(&inode->i_list);
@@ -1373,9 +1396,11 @@ static void iput_final(struct inode *inode)
atomic_dec(&inodes_stat.nr_inodes);
evict(inode);
spin_lock(&inode_lock);
+ spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
wake_up_inode(inode);
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
--
1.7.1
* [PATCH 09/17] fs: Make last_ino, iunique independent of inode_lock
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (7 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 08/17] fs: icache protect inode state Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-09-30 4:53 ` Andrew Morton
2010-10-01 6:08 ` Christoph Hellwig
2010-09-29 12:18 ` [PATCH 10/17] fs: icache remove inode_lock Dave Chinner
` (10 subsequent siblings)
19 siblings, 2 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Before removing the inode_lock, we need to make the last_ino and iunique
counters independent of the inode_lock. last_ino can be trivially converted to
an atomic variable, while the iunique counter needs a new lock nested inside
the inode_lock to provide the same protection that the inode_lock previously
provided.
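In sketch form (both fragments mirror the hunks below), the last_ino side is
simply:
	static atomic_t last_ino = ATOMIC_INIT(0);
	inode->i_ino = (unsigned int)atomic_inc_return(&last_ino);
while iunique() gets its own nested lock around the counter and the hash probe:
	static DEFINE_SPINLOCK(unique_lock);
	spin_lock(&unique_lock);
	do {
		if (counter <= max_reserved)
			counter = max_reserved + 1;
		res = counter++;
		head = inode_hashtable + hash(sb, res);
	} while (!test_inode_iunique(sb, head, res));
	spin_unlock(&unique_lock);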
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 28 ++++++++++++++++++++++------
1 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 8fbc4d4..c45b2db 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -736,7 +736,7 @@ struct inode *new_inode(struct super_block *sb)
* error if st_ino won't fit in target struct field. Use 32bit counter
* here to attempt to avoid that.
*/
- static unsigned int last_ino;
+ static atomic_t last_ino = ATOMIC_INIT(0);
struct inode *inode;
spin_lock_prefetch(&inode_lock);
@@ -746,7 +746,7 @@ struct inode *new_inode(struct super_block *sb)
spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
- inode->i_ino = ++last_ino;
+ inode->i_ino = (unsigned int)atomic_inc_return(&last_ino);
inode->i_state = 0;
__inode_add_to_lists(sb, NULL, inode);
spin_unlock(&inode->i_lock);
@@ -897,6 +897,22 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
return inode;
}
+static int test_inode_iunique(struct super_block * sb, struct hlist_head *head, unsigned long ino)
+{
+ struct hlist_node *node;
+ struct inode * inode = NULL;
+
+ spin_lock(&inode_hash_lock);
+ hlist_for_each_entry(inode, node, head, i_hash) {
+ if (inode->i_ino == ino && inode->i_sb == sb) {
+ spin_unlock(&inode_hash_lock);
+ return 0;
+ }
+ }
+ spin_unlock(&inode_hash_lock);
+ return 1;
+}
+
/**
* iunique - get a unique inode number
* @sb: superblock
@@ -918,20 +934,20 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
* error if st_ino won't fit in target struct field. Use 32bit counter
* here to attempt to avoid that.
*/
+ static DEFINE_SPINLOCK(unique_lock);
static unsigned int counter;
- struct inode *inode;
struct hlist_head *head;
ino_t res;
spin_lock(&inode_lock);
+ spin_lock(&unique_lock);
do {
if (counter <= max_reserved)
counter = max_reserved + 1;
res = counter++;
head = inode_hashtable + hash(sb, res);
- inode = find_inode_fast(sb, head, res);
- spin_unlock(&inode->i_lock);
- } while (inode != NULL);
+ } while (!test_inode_iunique(sb, head, res));
+ spin_unlock(&unique_lock);
spin_unlock(&inode_lock);
return res;
--
1.7.1
* [PATCH 10/17] fs: icache remove inode_lock
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (8 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 09/17] fs: Make last_ino, iunique independent of inode_lock Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-09-29 12:18 ` [PATCH 11/17] fs: Factor inode hash operations into functions Dave Chinner
` (9 subsequent siblings)
19 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
All the state that the inode_lock used to protect is now covered by
new independent locks and/or mechanisms. The inode_lock therefore no
longer serves any purpose and can be removed.
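For most callers the change is mechanical: the outer inode_lock pair is simply
dropped and the inner fine-grained lock stands on its own. For example, the
drop_pagecache_sb() hunk below reduces to roughly this (sketch only, inner
skip/__iget logic elided):
	spin_lock(&sb_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		/* ... per-inode checks and __iget() unchanged ... */
		spin_unlock(&inode->i_lock);
	}
	spin_unlock(&sb_inode_list_lock);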
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Documentation/filesystems/Locking | 2 +-
Documentation/filesystems/porting | 10 ++++-
Documentation/filesystems/vfs.txt | 2 +-
fs/buffer.c | 2 +-
fs/drop_caches.c | 4 --
fs/fs-writeback.c | 47 ++++--------------
fs/inode.c | 95 +++++++------------------------------
fs/logfs/inode.c | 2 +-
fs/notify/inode_mark.c | 23 +++++----
fs/notify/mark.c | 1 -
fs/notify/vfsmount_mark.c | 1 -
fs/ntfs/inode.c | 4 +-
fs/ocfs2/inode.c | 2 +-
fs/quota/dquot.c | 16 ++----
include/linux/fs.h | 2 +-
include/linux/writeback.h | 1 -
mm/backing-dev.c | 4 --
mm/filemap.c | 6 +-
mm/rmap.c | 6 +-
19 files changed, 69 insertions(+), 161 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 2db4283..e92dad2 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -114,7 +114,7 @@ alloc_inode:
destroy_inode:
dirty_inode: (must not sleep)
write_inode:
-drop_inode: !!!inode_lock!!!
+drop_inode: !!!i_lock, sb_inode_list_lock!!!
evict_inode:
put_super: write
write_super: read
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index b12c895..ab07213 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -299,7 +299,7 @@ be used instead. It gets called whenever the inode is evicted, whether it has
remaining links or not. Caller does *not* evict the pagecache or inode-associated
metadata buffers; getting rid of those is responsibility of method, as it had
been for ->delete_inode().
- ->drop_inode() returns int now; it's called on final iput() with inode_lock
+ ->drop_inode() returns int now; it's called on final iput() with i_lock
held and it returns true if filesystems wants the inode to be dropped. As before,
generic_drop_inode() is still the default and it's been updated appropriately.
generic_delete_inode() is also alive and it consists simply of return 1. Note that
@@ -318,3 +318,11 @@ if it's zero is not *and* *never* *had* *been* enough. Final unlink() and iput(
may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
free the on-disk inode, you may end up doing that while ->write_inode() is writing
to it.
+
+
+[mandatory]
+ inode_lock is gone, replaced by fine grained locks. See fs/inode.c
+for details of what locks to replace inode_lock with in order to protect
+particular things. Most of the time, a filesystem only needs ->i_lock, which
+protects *all* the inode state and its membership on lists that was
+previously protected with inode_lock.
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index ed7e5ef..405beb2 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -246,7 +246,7 @@ or bottom half).
should be synchronous or not, not all filesystems check this flag.
drop_inode: called when the last access to the inode is dropped,
- with the inode_lock spinlock held.
+ with the i_lock and sb_inode_list_lock spinlock held.
This method should be either NULL (normal UNIX filesystem
semantics) or "generic_delete_inode" (for filesystems that do not
diff --git a/fs/buffer.c b/fs/buffer.c
index 3e7dca2..66f7afd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1145,7 +1145,7 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
* inode list.
*
* mark_buffer_dirty() is atomic. It takes bh->b_page->mapping->private_lock,
- * mapping->tree_lock and the global inode_lock.
+ * and mapping->tree_lock.
*/
void mark_buffer_dirty(struct buffer_head *bh)
{
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 45bdf88..0884447 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,7 +16,6 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
{
struct inode *inode, *toput_inode = NULL;
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
@@ -28,15 +27,12 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
__iget(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
iput(toput_inode);
toput_inode = inode;
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
}
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
iput(toput_inode);
}
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f6c8975..dc983ea 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -186,7 +186,7 @@ static void requeue_io(struct inode *inode)
static void inode_sync_complete(struct inode *inode)
{
/*
- * Prevent speculative execution through spin_unlock(&inode_lock);
+ * Prevent speculative execution through spin_unlock(&inode->i_lock);
*/
smp_mb();
wake_up_bit(&inode->i_state, __I_SYNC);
@@ -286,18 +286,16 @@ static void inode_wait_for_writeback(struct inode *inode)
while (inode->i_state & I_SYNC) {
spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&wb_inode_list_lock);
}
}
/*
- * Write out an inode's dirty pages. Called under inode_lock. Either the
- * caller has ref on the inode (either via __iget or via syscall against an fd)
- * or the inode has I_WILL_FREE set (via generic_forget_inode)
+ * Write out an inode's dirty pages. Either the caller has ref on the inode
+ * (either via __iget or via syscall against an fd) or the inode has
+ * I_WILL_FREE set (via generic_forget_inode)
*
* If `wait' is set, wait on the writeout.
*
@@ -305,7 +303,8 @@ static void inode_wait_for_writeback(struct inode *inode)
* starvation of particular inodes when others are being redirtied, prevent
* livelocks, etc.
*
- * Called under inode_lock.
+ * Called under wb_inode_list_lock and i_lock. May drop the locks but returns
+ * with them locked.
*/
static int
writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
@@ -346,7 +345,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
inode->i_state &= ~I_DIRTY_PAGES;
spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
ret = do_writepages(mapping, wbc);
@@ -366,12 +364,10 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* due to delalloc, clear dirty metadata flags right before
* write_inode()
*/
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
dirty = inode->i_state & I_DIRTY;
inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
int err = write_inode(inode, wbc);
@@ -379,7 +375,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
ret = err;
}
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&wb_inode_list_lock);
inode->i_state &= ~I_SYNC;
@@ -529,10 +524,8 @@ again:
}
spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
iput(inode);
cond_resched();
- spin_lock(&inode_lock);
spin_lock(&wb_inode_list_lock);
if (wbc->nr_to_write <= 0) {
wbc->more_io = 1;
@@ -552,7 +545,6 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
if (!wbc->wb_start)
wbc->wb_start = jiffies; /* livelock avoidance */
- spin_lock(&inode_lock);
again:
spin_lock(&wb_inode_list_lock);
@@ -580,7 +572,6 @@ again:
break;
}
spin_unlock(&wb_inode_list_lock);
- spin_unlock(&inode_lock);
/* Leave any unwritten inodes on b_io */
}
@@ -589,13 +580,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
{
WARN_ON(!rwsem_is_locked(&sb->s_umount));
- spin_lock(&inode_lock);
spin_lock(&wb_inode_list_lock);
if (!wbc->for_kupdate || list_empty(&wb->b_io))
queue_io(wb, wbc->older_than_this);
writeback_sb_inodes(sb, wb, wbc, true);
spin_unlock(&wb_inode_list_lock);
- spin_unlock(&inode_lock);
}
/*
@@ -706,14 +695,12 @@ static long wb_writeback(struct bdi_writeback *wb,
* we'll just busyloop.
*/
retry:
- spin_lock(&inode_lock);
spin_lock(&wb_inode_list_lock);
if (!list_empty(&wb->b_more_io)) {
inode = list_entry(wb->b_more_io.prev,
struct inode, i_list);
if (!spin_trylock(&inode->i_lock)) {
spin_unlock(&wb_inode_list_lock);
- spin_unlock(&inode_lock);
goto retry;
}
trace_wbc_writeback_wait(&wbc, wb->bdi);
@@ -721,7 +708,6 @@ retry:
spin_unlock(&inode->i_lock);
}
spin_unlock(&wb_inode_list_lock);
- spin_unlock(&inode_lock);
}
return wrote;
@@ -985,7 +971,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (unlikely(block_dump))
block_dump___mark_inode_dirty(inode);
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;
@@ -1040,7 +1025,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
}
out:
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
if (wakeup_bdi)
bdi_wakeup_thread_delayed(bdi);
@@ -1074,7 +1058,6 @@ static void wait_sb_inodes(struct super_block *sb)
*/
WARN_ON(!rwsem_is_locked(&sb->s_umount));
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
/*
@@ -1098,14 +1081,12 @@ static void wait_sb_inodes(struct super_block *sb)
__iget(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
/*
- * We hold a reference to 'inode' so it couldn't have
- * been removed from s_inodes list while we dropped the
- * inode_lock. We cannot iput the inode now as we can
- * be holding the last reference and we cannot iput it
- * under inode_lock. So we keep the reference and iput
- * it later.
+ * We hold a reference to 'inode' so it couldn't have been
+ * removed from s_inodes list while we dropped the
+ * sb_inode_list_lock. We cannot iput the inode now as we can
+ * be holding the last reference and we cannot iput it under
+ * spinlock. So we keep the reference and iput it later.
*/
iput(old_inode);
old_inode = inode;
@@ -1114,11 +1095,9 @@ static void wait_sb_inodes(struct super_block *sb)
cond_resched();
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
}
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
iput(old_inode);
}
@@ -1223,13 +1202,11 @@ int write_inode_now(struct inode *inode, int sync)
wbc.nr_to_write = 0;
might_sleep();
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&wb_inode_list_lock);
ret = writeback_single_inode(inode, &wbc);
spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
if (sync)
inode_sync_wait(inode);
return ret;
@@ -1251,13 +1228,11 @@ int sync_inode(struct inode *inode, struct writeback_control *wbc)
{
int ret;
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&wb_inode_list_lock);
ret = writeback_single_inode(inode, wbc);
spin_unlock(&wb_inode_list_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
return ret;
}
EXPORT_SYMBOL(sync_inode);
diff --git a/fs/inode.c b/fs/inode.c
index c45b2db..153f8d2 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -41,11 +41,10 @@
* i_sb_list
*
* Ordering:
- * inode_lock
- * sb_inode_list_lock
- * inode->i_lock
- * wb_inode_list_lock
- * inode_hash_lock
+ * sb_inode_list_lock
+ * inode->i_lock
+ * wb_inode_list_lock
+ * inode_hash_lock
*/
/*
* This is needed for the following functions:
@@ -104,7 +103,6 @@ static struct hlist_head *inode_hashtable __read_mostly;
* NOTE! You also have to own the lock if you change
* the i_state of an inode while it is in use..
*/
-DEFINE_SPINLOCK(inode_lock);
DEFINE_SPINLOCK(sb_inode_list_lock);
DEFINE_SPINLOCK(wb_inode_list_lock);
DEFINE_SPINLOCK(inode_hash_lock);
@@ -136,7 +134,7 @@ static struct kmem_cache *inode_cachep __read_mostly;
static void wake_up_inode(struct inode *inode)
{
/*
- * Prevent speculative execution through spin_unlock(&inode_lock);
+ * Prevent speculative execution through spin_unlock(&inode->i_lock);
*/
smp_mb();
wake_up_bit(&inode->i_state, __I_NEW);
@@ -305,7 +303,7 @@ static void init_once(void *foo)
}
/*
- * inode_lock must be held
+ * i_lock must be held
*/
void __iget(struct inode *inode)
{
@@ -371,16 +369,14 @@ static void dispose_list(struct list_head *head)
evict(inode);
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
list_del_init(&inode->i_sb_list);
- spin_unlock(&sb_inode_list_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
+ spin_unlock(&sb_inode_list_lock);
wake_up_inode(inode);
destroy_inode(inode);
@@ -408,7 +404,6 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
* change during umount anymore, and because iprune_sem keeps
* shrink_icache_memory() away.
*/
- cond_resched_lock(&inode_lock);
cond_resched_lock(&sb_inode_list_lock);
next = next->next;
@@ -453,12 +448,10 @@ int invalidate_inodes(struct super_block *sb)
LIST_HEAD(throw_away);
down_write(&iprune_sem);
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
fsnotify_unmount_inodes(&sb->s_inodes);
busy = invalidate_list(&sb->s_inodes, &throw_away);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
dispose_list(&throw_away);
up_write(&iprune_sem);
@@ -482,7 +475,7 @@ static int can_unuse(struct inode *inode)
/*
* Scan `goal' inodes on the unused list for freeable ones. They are moved to
- * a temporary list and then are freed outside inode_lock by dispose_list().
+ * a temporary list and then are freed outside LRU lock by dispose_list().
*
* Any inodes which are pinned purely because of attached pagecache have their
* pagecache removed. We expect the final iput() on that inode to add it to
@@ -501,7 +494,6 @@ static void prune_icache(int nr_to_scan)
unsigned long reap = 0;
down_read(&iprune_sem);
- spin_lock(&inode_lock);
again:
spin_lock(&wb_inode_list_lock);
for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
@@ -525,12 +517,10 @@ again:
spin_unlock(&wb_inode_list_lock);
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
if (remove_inode_buffers(inode))
reap += invalidate_mapping_pages(&inode->i_data,
0, -1);
iput(inode);
- spin_lock(&inode_lock);
again2:
spin_lock(&wb_inode_list_lock);
@@ -557,7 +547,6 @@ again2:
__count_vm_events(KSWAPD_INODESTEAL, reap);
else
__count_vm_events(PGINODESTEAL, reap);
- spin_unlock(&inode_lock);
spin_unlock(&wb_inode_list_lock);
dispose_list(&freeable);
@@ -698,9 +687,9 @@ __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
* @inode: inode to mark in use
*
* When an inode is allocated it needs to be accounted for, added to the in use
- * list, the owning superblock and the inode hash. This needs to be done under
- * the inode_lock, so export a function to do this rather than the inode lock
- * itself. We calculate the hash list to add to here so it is all internal
+ * list, the owning superblock and the inode hash.
+ *
+ * We calculate the hash list to add to here so it is all internal
* which requires the caller to have already set up the inode number in the
* inode to add.
*/
@@ -708,12 +697,10 @@ void inode_add_to_lists(struct super_block *sb, struct inode *inode)
{
struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
__inode_add_to_lists(sb, head, inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
}
EXPORT_SYMBOL_GPL(inode_add_to_lists);
@@ -739,18 +726,14 @@ struct inode *new_inode(struct super_block *sb)
static atomic_t last_ino = ATOMIC_INIT(0);
struct inode *inode;
- spin_lock_prefetch(&inode_lock);
-
inode = alloc_inode(sb);
if (inode) {
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
inode->i_ino = (unsigned int)atomic_inc_return(&last_ino);
inode->i_state = 0;
__inode_add_to_lists(sb, NULL, inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
}
return inode;
}
@@ -809,7 +792,6 @@ static struct inode *get_new_inode(struct super_block *sb,
if (inode) {
struct inode *old;
- spin_lock(&inode_lock);
/* We released the lock, so.. */
old = find_inode(sb, head, test, data);
if (!old) {
@@ -821,7 +803,6 @@ static struct inode *get_new_inode(struct super_block *sb,
inode->i_state = I_NEW;
__inode_add_to_lists(sb, head, inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
/* Return the locked inode with I_NEW set, the
* caller is responsible for filling in the contents
@@ -836,7 +817,6 @@ static struct inode *get_new_inode(struct super_block *sb,
*/
__iget(old);
spin_unlock(&old->i_lock);
- spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
wait_on_inode(inode);
@@ -846,7 +826,6 @@ static struct inode *get_new_inode(struct super_block *sb,
set_failed:
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
destroy_inode(inode);
return NULL;
}
@@ -864,7 +843,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
if (inode) {
struct inode *old;
- spin_lock(&inode_lock);
/* We released the lock, so.. */
old = find_inode_fast(sb, head, ino);
if (!old) {
@@ -874,7 +852,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
inode->i_state = I_NEW;
__inode_add_to_lists(sb, head, inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
/* Return the locked inode with I_NEW set, the
* caller is responsible for filling in the contents
@@ -889,7 +866,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
*/
__iget(old);
spin_unlock(&old->i_lock);
- spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
wait_on_inode(inode);
@@ -939,7 +915,6 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
struct hlist_head *head;
ino_t res;
- spin_lock(&inode_lock);
spin_lock(&unique_lock);
do {
if (counter <= max_reserved)
@@ -948,7 +923,6 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
head = inode_hashtable + hash(sb, res);
} while (!test_inode_iunique(sb, head, res));
spin_unlock(&unique_lock);
- spin_unlock(&inode_lock);
return res;
}
@@ -958,7 +932,6 @@ struct inode *igrab(struct inode *inode)
{
struct inode *ret = inode;
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
__iget(inode);
@@ -971,7 +944,6 @@ struct inode *igrab(struct inode *inode)
ret = NULL;
}
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
return ret;
}
@@ -994,7 +966,7 @@ EXPORT_SYMBOL(igrab);
*
* Otherwise NULL is returned.
*
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
*/
static struct inode *ifind(struct super_block *sb,
struct hlist_head *head, int (*test)(struct inode *, void *),
@@ -1002,17 +974,14 @@ static struct inode *ifind(struct super_block *sb,
{
struct inode *inode;
- spin_lock(&inode_lock);
inode = find_inode(sb, head, test, data);
if (inode) {
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
if (likely(wait))
wait_on_inode(inode);
return inode;
}
- spin_unlock(&inode_lock);
return NULL;
}
@@ -1036,16 +1005,13 @@ static struct inode *ifind_fast(struct super_block *sb,
{
struct inode *inode;
- spin_lock(&inode_lock);
inode = find_inode_fast(sb, head, ino);
if (inode) {
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
wait_on_inode(inode);
return inode;
}
- spin_unlock(&inode_lock);
return NULL;
}
@@ -1068,7 +1034,7 @@ static struct inode *ifind_fast(struct super_block *sb,
*
* Otherwise NULL is returned.
*
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
*/
struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
@@ -1096,7 +1062,7 @@ EXPORT_SYMBOL(ilookup5_nowait);
*
* Otherwise NULL is returned.
*
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
*/
struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
@@ -1147,7 +1113,7 @@ EXPORT_SYMBOL(ilookup);
* inode and this is returned locked, hashed, and with the I_NEW flag set. The
* file system gets to fill it in before unlocking it via unlock_new_inode().
*
- * Note both @test and @set are called with the inode_lock held, so can't sleep.
+ * Note both @test and @set are called with the i_lock held, so can't sleep.
*/
struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *),
@@ -1209,7 +1175,6 @@ int insert_inode_locked(struct inode *inode)
struct hlist_node *node;
struct inode *old = NULL;
- spin_lock(&inode_lock);
repeat:
spin_lock(&inode_hash_lock);
hlist_for_each_entry(old, node, head, i_hash) {
@@ -1228,13 +1193,11 @@ repeat:
if (likely(!node)) {
hlist_add_head(&inode->i_hash, head);
spin_unlock(&inode_hash_lock);
- spin_unlock(&inode_lock);
return 0;
}
spin_unlock(&inode_hash_lock);
__iget(old);
spin_unlock(&old->i_lock);
- spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_unhashed(&old->i_hash))) {
iput(old);
@@ -1257,7 +1220,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
struct hlist_node *node;
struct inode *old = NULL;
- spin_lock(&inode_lock);
repeat:
spin_lock(&inode_hash_lock);
hlist_for_each_entry(old, node, head, i_hash) {
@@ -1276,13 +1238,11 @@ repeat:
if (likely(!node)) {
hlist_add_head(&inode->i_hash, head);
spin_unlock(&inode_hash_lock);
- spin_unlock(&inode_lock);
return 0;
}
spin_unlock(&inode_hash_lock);
__iget(old);
spin_unlock(&old->i_lock);
- spin_unlock(&inode_lock);
wait_on_inode(old);
if (unlikely(!hlist_unhashed(&old->i_hash))) {
iput(old);
@@ -1305,13 +1265,11 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_add_head(&inode->i_hash, head);
spin_unlock(&inode_hash_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(__insert_inode_hash);
@@ -1323,13 +1281,11 @@ EXPORT_SYMBOL(__insert_inode_hash);
*/
void remove_inode_hash(struct inode *inode)
{
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(remove_inode_hash);
@@ -1381,16 +1337,13 @@ static void iput_final(struct inode *inode)
if (sb->s_flags & MS_ACTIVE) {
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
return;
}
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_WILL_FREE;
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
write_inode_now(inode, 1);
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
@@ -1408,16 +1361,13 @@ static void iput_final(struct inode *inode)
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
atomic_dec(&inodes_stat.nr_inodes);
evict(inode);
- spin_lock(&inode_lock);
spin_lock(&inode->i_lock);
spin_lock(&inode_hash_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_hash_lock);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
wake_up_inode(inode);
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
destroy_inode(inode);
@@ -1437,17 +1387,12 @@ void iput(struct inode *inode)
if (inode) {
BUG_ON(inode->i_state & I_CLEAR);
-retry1:
+retry:
spin_lock(&inode->i_lock);
if (inode->i_count == 1) {
- if (!spin_trylock(&inode_lock)) {
-retry2:
- spin_unlock(&inode->i_lock);
- goto retry1;
- }
if (!spin_trylock(&sb_inode_list_lock)) {
- spin_unlock(&inode_lock);
- goto retry2;
+ spin_unlock(&inode->i_lock);
+ goto retry;
}
inode->i_count--;
iput_final(inode);
@@ -1634,8 +1579,6 @@ EXPORT_SYMBOL(inode_wait);
* It doesn't matter if I_NEW is not set initially, a call to
* wake_up_inode() after removing from the hash list will DTRT.
*
- * This is called with inode_lock held.
- *
* Called with i_lock held and returns with it dropped.
*/
static void __wait_on_freeing_inode(struct inode *inode)
@@ -1645,10 +1588,8 @@ static void __wait_on_freeing_inode(struct inode *inode)
wq = bit_waitqueue(&inode->i_state, __I_NEW);
prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lock);
schedule();
finish_wait(wq, &wait.wait);
- spin_lock(&inode_lock);
}
static __initdata unsigned long ihash_entries;
diff --git a/fs/logfs/inode.c b/fs/logfs/inode.c
index d8c71ec..a67b607 100644
--- a/fs/logfs/inode.c
+++ b/fs/logfs/inode.c
@@ -286,7 +286,7 @@ static int logfs_write_inode(struct inode *inode, struct writeback_control *wbc)
return ret;
}
-/* called with inode_lock held */
+/* called with i_lock held */
static int logfs_drop_inode(struct inode *inode)
{
struct logfs_super *super = logfs_super(inode->i_sb);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 70f7e16..e51d065 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -22,7 +22,7 @@
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/spinlock.h>
-#include <linux/writeback.h> /* for inode_lock */
+#include <linux/writeback.h>
#include <asm/atomic.h>
@@ -232,9 +232,8 @@ out:
* fsnotify_unmount_inodes - an sb is unmounting. handle any watched inodes.
* @list: list of inodes being unmounted (sb->s_inodes)
*
- * Called with inode_lock held, protecting the unmounting super block's list
- * of inodes, and with iprune_mutex held, keeping shrink_icache_memory() at bay.
- * We temporarily drop inode_lock, however, and CAN block.
+ * Called with iprune_mutex held, keeping shrink_icache_memory() at bay, and
+ * with sb_inode_list_lock held to protect the super block's list of inodes.
*/
void fsnotify_unmount_inodes(struct list_head *list)
{
@@ -243,13 +242,16 @@ void fsnotify_unmount_inodes(struct list_head *list)
list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
struct inode *need_iput_tmp;
+ spin_lock(&inode->i_lock);
/*
* We cannot __iget() an inode in state I_FREEING,
* I_WILL_FREE, or I_NEW which is fine because by that point
* the inode cannot have any associated watches.
*/
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+ if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
/*
* If i_count is zero, the inode cannot have any watches and
@@ -257,19 +259,20 @@ void fsnotify_unmount_inodes(struct list_head *list)
* evict all inodes with zero i_count from icache which is
* unnecessarily violent and may in fact be illegal to do.
*/
- if (!inode->i_count)
+ if (!inode->i_count) {
+ spin_unlock(&inode->i_lock);
continue;
+ }
need_iput_tmp = need_iput;
need_iput = NULL;
/* In case fsnotify_inode_delete() drops a reference. */
if (inode != need_iput_tmp) {
- spin_lock(&inode->i_lock);
__iget(inode);
- spin_unlock(&inode->i_lock);
} else
need_iput_tmp = NULL;
+ spin_unlock(&inode->i_lock);
/* In case the dropping of a reference would nuke next_i. */
if (&next_i->i_sb_list != list) {
@@ -283,13 +286,12 @@ void fsnotify_unmount_inodes(struct list_head *list)
}
/*
- * We can safely drop inode_lock here because we hold
+ * We can safely drop sb_inode_list_lock here because we hold
* references on both inode and next_i. Also no new inodes
* will be added since the umount has begun. Finally,
* iprune_mutex keeps shrink_icache_memory() away.
*/
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
if (need_iput_tmp)
iput(need_iput_tmp);
@@ -301,7 +303,6 @@ void fsnotify_unmount_inodes(struct list_head *list)
iput(inode);
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
}
}
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 325185e..50c0085 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -91,7 +91,6 @@
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/srcu.h>
-#include <linux/writeback.h> /* for inode_lock */
#include <asm/atomic.h>
diff --git a/fs/notify/vfsmount_mark.c b/fs/notify/vfsmount_mark.c
index 56772b5..6f8eefe 100644
--- a/fs/notify/vfsmount_mark.c
+++ b/fs/notify/vfsmount_mark.c
@@ -23,7 +23,6 @@
#include <linux/mount.h>
#include <linux/mutex.h>
#include <linux/spinlock.h>
-#include <linux/writeback.h> /* for inode_lock */
#include <asm/atomic.h>
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index 93622b1..7c530f3 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -54,7 +54,7 @@
*
* Return 1 if the attributes match and 0 if not.
*
- * NOTE: This function runs with the inode_lock spin lock held so it is not
+ * NOTE: This function runs with the i_lock spin lock held so it is not
* allowed to sleep.
*/
int ntfs_test_inode(struct inode *vi, ntfs_attr *na)
@@ -98,7 +98,7 @@ int ntfs_test_inode(struct inode *vi, ntfs_attr *na)
*
* Return 0 on success and -errno on error.
*
- * NOTE: This function runs with the inode_lock spin lock held so it is not
+ * NOTE: This function runs with the i_lock spin lock held so it is not
* allowed to sleep. (Hence the GFP_ATOMIC allocation.)
*/
static int ntfs_init_locked_inode(struct inode *vi, ntfs_attr *na)
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index eece3e0..65c61e2 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1195,7 +1195,7 @@ void ocfs2_evict_inode(struct inode *inode)
ocfs2_clear_inode(inode);
}
-/* Called under inode_lock, with no more references on the
+/* Called under i_lock, with no more references on the
* struct inode, so it's safe here to check the flags field
* and to manipulate i_nlink without any other locks. */
int ocfs2_drop_inode(struct inode *inode)
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 15f66f1..69bc754 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -76,7 +76,7 @@
#include <linux/buffer_head.h>
#include <linux/capability.h>
#include <linux/quotaops.h>
-#include <linux/writeback.h> /* for inode_lock, oddly enough.. */
+#include <linux/writeback.h>
#include <asm/uaccess.h>
@@ -896,7 +896,6 @@ static void add_dquot_ref(struct super_block *sb, int type)
int reserved = 0;
#endif
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
@@ -914,21 +913,18 @@ static void add_dquot_ref(struct super_block *sb, int type)
__iget(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
iput(old_inode);
__dquot_initialize(inode, type);
/* We hold a reference to 'inode' so it couldn't have been
- * removed from s_inodes list while we dropped the inode_lock.
- * We cannot iput the inode now as we can be holding the last
- * reference and we cannot iput it under inode_lock. So we
- * keep the reference and iput it later. */
+ * removed from s_inodes list while we dropped the
+ * sb_inode_list_lock. We cannot iput the inode now as we can
+ * be holding the last reference and we cannot iput it under
+ * lock. So we keep the reference and iput it later. */
old_inode = inode;
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
}
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
iput(old_inode);
#ifdef CONFIG_QUOTA_DEBUG
@@ -1009,7 +1005,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
struct inode *inode;
int reserved = 0;
- spin_lock(&inode_lock);
spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
/*
@@ -1025,7 +1020,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
}
}
spin_unlock(&sb_inode_list_lock);
- spin_unlock(&inode_lock);
#ifdef CONFIG_QUOTA_DEBUG
if (reserved) {
printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d60f256..46a51b9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1579,7 +1579,7 @@ struct super_operations {
};
/*
- * Inode state bits. Protected by inode_lock.
+ * Inode state bits. Protected by i_lock.
*
* Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
* I_DIRTY_DATASYNC and I_DIRTY_PAGES.
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 8b9c24f..d266f0d 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -9,7 +9,6 @@
struct backing_dev_info;
-extern spinlock_t inode_lock;
extern spinlock_t sb_inode_list_lock;
extern spinlock_t wb_inode_list_lock;
extern spinlock_t inode_hash_lock;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b1e2987..c874f7c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -72,7 +72,6 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
struct inode *inode;
nr_wb = nr_dirty = nr_io = nr_more_io = 0;
- spin_lock(&inode_lock);
spin_lock(&wb_inode_list_lock);
list_for_each_entry(inode, &wb->b_dirty, i_list)
nr_dirty++;
@@ -81,7 +80,6 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
list_for_each_entry(inode, &wb->b_more_io, i_list)
nr_more_io++;
spin_unlock(&wb_inode_list_lock);
- spin_unlock(&inode_lock);
global_dirty_limits(&background_thresh, &dirty_thresh);
bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -682,13 +680,11 @@ void bdi_destroy(struct backing_dev_info *bdi)
if (bdi_has_dirty_io(bdi)) {
struct bdi_writeback *dst = &default_backing_dev_info.wb;
- spin_lock(&inode_lock);
spin_lock(&wb_inode_list_lock);
list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
list_splice(&bdi->wb.b_io, &dst->b_io);
list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
spin_unlock(&wb_inode_list_lock);
- spin_unlock(&inode_lock);
}
bdi_unregister(bdi);
diff --git a/mm/filemap.c b/mm/filemap.c
index 3d4df44..ece6ef2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -80,7 +80,7 @@
* ->i_mutex
* ->i_alloc_sem (various)
*
- * ->inode_lock
+ * ->i_lock
* ->sb_lock (fs/fs-writeback.c)
* ->mapping->tree_lock (__sync_single_inode)
*
@@ -98,8 +98,8 @@
* ->zone.lru_lock (check_pte_range->isolate_lru_page)
* ->private_lock (page_remove_rmap->set_page_dirty)
* ->tree_lock (page_remove_rmap->set_page_dirty)
- * ->inode_lock (page_remove_rmap->set_page_dirty)
- * ->inode_lock (zap_pte_range->set_page_dirty)
+ * ->i_lock (page_remove_rmap->set_page_dirty)
+ * ->i_lock (zap_pte_range->set_page_dirty)
* ->private_lock (zap_pte_range->__set_page_dirty_buffers)
*
* ->task->proc_lock
diff --git a/mm/rmap.c b/mm/rmap.c
index f6f0d2d..9aa1d3e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -31,11 +31,11 @@
* swap_lock (in swap_duplicate, swap_info_get)
* mmlist_lock (in mmput, drain_mmlist and others)
* mapping->private_lock (in __set_page_dirty_buffers)
- * inode_lock (in set_page_dirty's __mark_inode_dirty)
- * sb_lock (within inode_lock in fs/fs-writeback.c)
+ * i_lock (in set_page_dirty's __mark_inode_dirty)
+ * sb_lock (within i_lock in fs/fs-writeback.c)
* mapping->tree_lock (widely used, in set_page_dirty,
* in arch-dependent flush_dcache_mmap_lock,
- * within inode_lock in __sync_single_inode)
+ * within i_lock in __sync_single_inode)
*
* (code doesn't rely on that order so it could be switched around)
* ->tasklist_lock
--
1.7.1
^ permalink raw reply related [flat|nested] 111+ messages in thread
* [PATCH 11/17] fs: Factor inode hash operations into functions
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (9 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 10/17] fs: icache remove inode_lock Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-10-01 6:06 ` Christoph Hellwig
2010-09-29 12:18 ` [PATCH 12/17] fs: Introduce per-bucket inode hash locks Dave Chinner
` (8 subsequent siblings)
19 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Before we can replace the inode hash locking with a more scalable
mechanism, we need to remove external users of the inode_hash_lock.
Make it private by adding a function __remove_inode_hash that can be
called by filesystems instead of open-coding their own inode hash
removal operations.
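As a hedged sketch (the call site below is hypothetical, not taken from a
filesystem converted in this series), a filesystem that used to open-code hash
removal under inode_hash_lock would now simply do:

	/* caller must hold inode->i_lock for the __ variant */
	spin_lock(&inode->i_lock);
	__remove_inode_hash(inode);   /* hash bucket locking stays internal to fs/inode.c */
	spin_unlock(&inode->i_lock);

	/* or, if i_lock is not already held: */
	remove_inode_hash(inode);

Either way the filesystem no longer needs to know how the hash itself is locked.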
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 39 ++++++++++++++++++++++++---------------
include/linux/fs.h | 1 +
include/linux/writeback.h | 1 -
3 files changed, 25 insertions(+), 16 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 153f8d2..24141cc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -105,7 +105,7 @@ static struct hlist_head *inode_hashtable __read_mostly;
*/
DEFINE_SPINLOCK(sb_inode_list_lock);
DEFINE_SPINLOCK(wb_inode_list_lock);
-DEFINE_SPINLOCK(inode_hash_lock);
+static DEFINE_SPINLOCK(inode_hash_lock);
/*
* iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -371,9 +371,7 @@ static void dispose_list(struct list_head *head)
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
- spin_lock(&inode_hash_lock);
- hlist_del_init(&inode->i_hash);
- spin_unlock(&inode_hash_lock);
+ __remove_inode_hash(inode);
list_del_init(&inode->i_sb_list);
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
@@ -1274,6 +1272,20 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
EXPORT_SYMBOL(__insert_inode_hash);
/**
+ * __remove_inode_hash - remove an inode from the hash
+ * @inode: inode to unhash
+ *
+ * Remove an inode from the inode hash table. inode->i_lock must be
+ * held.
+ */
+void __remove_inode_hash(struct inode *inode)
+{
+ spin_lock(&inode_hash_lock);
+ hlist_del_init(&inode->i_hash);
+ spin_unlock(&inode_hash_lock);
+}
+
+/**
* remove_inode_hash - remove an inode from the hash
* @inode: inode to unhash
*
@@ -1282,9 +1294,7 @@ EXPORT_SYMBOL(__insert_inode_hash);
void remove_inode_hash(struct inode *inode)
{
spin_lock(&inode->i_lock);
- spin_lock(&inode_hash_lock);
- hlist_del_init(&inode->i_hash);
- spin_unlock(&inode_hash_lock);
+ __remove_inode_hash(inode);
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(remove_inode_hash);
@@ -1348,9 +1358,7 @@ static void iput_final(struct inode *inode)
spin_lock(&inode->i_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
- spin_lock(&inode_hash_lock);
- hlist_del_init(&inode->i_hash);
- spin_unlock(&inode_hash_lock);
+ __remove_inode_hash(inode);
atomic_dec(&inodes_stat.nr_unused);
}
spin_lock(&wb_inode_list_lock);
@@ -1363,11 +1371,12 @@ static void iput_final(struct inode *inode)
spin_unlock(&inode->i_lock);
atomic_dec(&inodes_stat.nr_inodes);
evict(inode);
- spin_lock(&inode->i_lock);
- spin_lock(&inode_hash_lock);
- hlist_del_init(&inode->i_hash);
- spin_unlock(&inode_hash_lock);
- spin_unlock(&inode->i_lock);
+
+ /*
+ * i_lock is required to delete from hash because find_inode_fast
+ * might find us but go to sleep before we run wake_up_inode.
+ */
+ remove_inode_hash(inode);
wake_up_inode(inode);
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
destroy_inode(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 46a51b9..da0ebf1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2186,6 +2186,7 @@ extern int should_remove_suid(struct dentry *);
extern int file_remove_suid(struct file *);
extern void __insert_inode_hash(struct inode *, unsigned long hashval);
+extern void __remove_inode_hash(struct inode *);
extern void remove_inode_hash(struct inode *);
static inline void insert_inode_hash(struct inode *inode) {
__insert_inode_hash(inode, inode->i_ino);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d266f0d..0f6fe0c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -11,7 +11,6 @@ struct backing_dev_info;
extern spinlock_t sb_inode_list_lock;
extern spinlock_t wb_inode_list_lock;
-extern spinlock_t inode_hash_lock;
extern struct list_head inode_in_use;
extern struct list_head inode_unused;
--
1.7.1
^ permalink raw reply related [flat|nested] 111+ messages in thread
* [PATCH 12/17] fs: Introduce per-bucket inode hash locks
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (10 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 11/17] fs: Factor inode hash operations into functions Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-09-30 1:52 ` Christoph Hellwig
2010-09-29 12:18 ` [PATCH 13/17] fs: Implement lazy LRU updates for inodes Dave Chinner
` (7 subsequent siblings)
19 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Now that the inode_hash_lock is private, we can change the hash locking
to be more scalable. Convert the inode hash to use the new
bit-locked hash list implementation, which allows per-bucket locks to
be used. This allows us to replace the global inode_hash_lock with
finer-grained locking without increasing the size of the hash table.
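As a simplified sketch of the resulting pattern (retry and I_FREEING handling
omitted; see find_inode_fast() in the diff for the real logic), a hash lookup
now serialises only against other users of the same bucket instead of against
every hash operation in the system:

	struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
	struct hlist_bl_node *node;
	struct inode *inode;

	spin_lock_bucket(b);              /* bit 0 of the bucket head is the lock */
	hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
		if (inode->i_sb == sb && inode->i_ino == ino)
			break;            /* found: node is non-NULL here */
	}
	spin_unlock_bucket(b);
	return node ? inode : NULL;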
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/btrfs/inode.c | 2 +-
fs/fs-writeback.c | 2 +-
fs/hfs/hfs_fs.h | 2 +-
fs/hfs/inode.c | 2 +-
fs/hfsplus/hfsplus_fs.h | 2 +-
fs/hfsplus/inode.c | 2 +-
fs/inode.c | 198 ++++++++++++++++++++++++++---------------------
fs/nilfs2/gcinode.c | 22 +++---
fs/nilfs2/segment.c | 2 +-
fs/nilfs2/the_nilfs.h | 2 +-
fs/reiserfs/xattr.c | 2 +-
include/linux/fs.h | 3 +-
mm/shmem.c | 4 +-
13 files changed, 135 insertions(+), 110 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ffb8aec..7675f0c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3860,7 +3860,7 @@ again:
p = &root->inode_tree.rb_node;
parent = NULL;
- if (hlist_unhashed(&inode->i_hash))
+ if (hlist_bl_unhashed(&inode->i_hash))
return;
spin_lock(&root->inode_lock);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index dc983ea..9b63fc7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -990,7 +990,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* dirty list. Add blockdev inodes as well.
*/
if (!S_ISBLK(inode->i_mode)) {
- if (hlist_unhashed(&inode->i_hash))
+ if (hlist_bl_unhashed(&inode->i_hash))
goto out;
}
if (inode->i_state & I_FREEING)
diff --git a/fs/hfs/hfs_fs.h b/fs/hfs/hfs_fs.h
index 4f55651..24591be 100644
--- a/fs/hfs/hfs_fs.h
+++ b/fs/hfs/hfs_fs.h
@@ -148,7 +148,7 @@ struct hfs_sb_info {
int fs_div;
- struct hlist_head rsrc_inodes;
+ struct hlist_bl_head rsrc_inodes;
};
#define HFS_FLG_BITMAP_DIRTY 0
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 397b7ad..7778298 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -524,7 +524,7 @@ static struct dentry *hfs_file_lookup(struct inode *dir, struct dentry *dentry,
HFS_I(inode)->rsrc_inode = dir;
HFS_I(dir)->rsrc_inode = inode;
igrab(dir);
- hlist_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
+ hlist_bl_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
mark_inode_dirty(inode);
out:
d_add(dentry, inode);
diff --git a/fs/hfsplus/hfsplus_fs.h b/fs/hfsplus/hfsplus_fs.h
index dc856be..499f5a5 100644
--- a/fs/hfsplus/hfsplus_fs.h
+++ b/fs/hfsplus/hfsplus_fs.h
@@ -144,7 +144,7 @@ struct hfsplus_sb_info {
unsigned long flags;
- struct hlist_head rsrc_inodes;
+ struct hlist_bl_head rsrc_inodes;
};
#define HFSPLUS_SB_WRITEBACKUP 0x0001
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index c5a979d..b755cf0 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -202,7 +202,7 @@ static struct dentry *hfsplus_file_lookup(struct inode *dir, struct dentry *dent
HFSPLUS_I(inode).rsrc_inode = dir;
HFSPLUS_I(dir).rsrc_inode = inode;
igrab(dir);
- hlist_add_head(&inode->i_hash, &HFSPLUS_SB(sb).rsrc_inodes);
+ hlist_bl_add_head(&inode->i_hash, &HFSPLUS_SB(sb).rsrc_inodes);
mark_inode_dirty(inode);
out:
d_add(dentry, inode);
diff --git a/fs/inode.c b/fs/inode.c
index 24141cc..9d1a0fc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -24,12 +24,13 @@
#include <linux/mount.h>
#include <linux/async.h>
#include <linux/posix_acl.h>
+#include <linux/bit_spinlock.h>
/*
* Usage:
* sb_inode_list_lock protects:
* s_inodes, i_sb_list
- * inode_hash_lock protects:
+ * inode_hash_bucket lock protects:
* inode hash table, i_hash
* wb_inode_list_lock protects:
* inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_list
@@ -44,7 +45,7 @@
* sb_inode_list_lock
* inode->i_lock
* wb_inode_list_lock
- * inode_hash_lock
+ * inode_hash_bucket lock
*/
/*
* This is needed for the following functions:
@@ -95,7 +96,22 @@ static unsigned int i_hash_shift __read_mostly;
LIST_HEAD(inode_in_use);
LIST_HEAD(inode_unused);
-static struct hlist_head *inode_hashtable __read_mostly;
+
+struct inode_hash_bucket {
+ struct hlist_bl_head head;
+};
+
+static inline void spin_lock_bucket(struct inode_hash_bucket *b)
+{
+ bit_spin_lock(0, (unsigned long *)b);
+}
+
+static inline void spin_unlock_bucket(struct inode_hash_bucket *b)
+{
+ __bit_spin_unlock(0, (unsigned long *)b);
+}
+
+static struct inode_hash_bucket *inode_hashtable __read_mostly;
/*
* A simple spinlock to protect the list manipulations.
@@ -105,7 +121,6 @@ static struct hlist_head *inode_hashtable __read_mostly;
*/
DEFINE_SPINLOCK(sb_inode_list_lock);
DEFINE_SPINLOCK(wb_inode_list_lock);
-static DEFINE_SPINLOCK(inode_hash_lock);
/*
* iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -278,7 +293,7 @@ void destroy_inode(struct inode *inode)
void inode_init_once(struct inode *inode)
{
memset(inode, 0, sizeof(*inode));
- INIT_HLIST_NODE(&inode->i_hash);
+ INIT_HLIST_BL_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_devices);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
@@ -589,20 +604,21 @@ static void __wait_on_freeing_inode(struct inode *inode);
* add any additional branch in the common code.
*/
static struct inode *find_inode(struct super_block *sb,
- struct hlist_head *head,
+ struct inode_hash_bucket *b,
int (*test)(struct inode *, void *),
void *data)
{
- struct hlist_node *node;
+ struct hlist_bl_node *node;
struct inode *inode = NULL;
repeat:
- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(inode, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
if (inode->i_sb != sb)
continue;
if (!spin_trylock(&inode->i_lock)) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
+ cpu_relax();
goto repeat;
}
if (!test(inode, data)) {
@@ -610,13 +626,13 @@ repeat:
continue;
}
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
__wait_on_freeing_inode(inode);
goto repeat;
}
break;
}
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
return node ? inode : NULL;
}
@@ -625,30 +641,32 @@ repeat:
* iget_locked for details.
*/
static struct inode *find_inode_fast(struct super_block *sb,
- struct hlist_head *head, unsigned long ino)
+ struct inode_hash_bucket *b,
+ unsigned long ino)
{
- struct hlist_node *node;
+ struct hlist_bl_node *node;
struct inode *inode = NULL;
repeat:
- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(inode, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
if (inode->i_ino != ino)
continue;
if (inode->i_sb != sb)
continue;
if (!spin_trylock(&inode->i_lock)) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
+ cpu_relax();
goto repeat;
}
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
__wait_on_freeing_inode(inode);
goto repeat;
}
break;
}
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
return node ? inode : NULL;
}
@@ -663,7 +681,7 @@ static unsigned long hash(struct super_block *sb, unsigned long hashval)
}
static inline void
-__inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
+__inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
struct inode *inode)
{
atomic_inc(&inodes_stat.nr_inodes);
@@ -672,10 +690,10 @@ __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
spin_lock(&wb_inode_list_lock);
list_add(&inode->i_list, &inode_in_use);
spin_unlock(&wb_inode_list_lock);
- if (head) {
- spin_lock(&inode_hash_lock);
- hlist_add_head(&inode->i_hash, head);
- spin_unlock(&inode_hash_lock);
+ if (b) {
+ spin_lock_bucket(b);
+ hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
}
}
@@ -693,11 +711,11 @@ __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
*/
void inode_add_to_lists(struct super_block *sb, struct inode *inode)
{
- struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, inode->i_ino);
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
- __inode_add_to_lists(sb, head, inode);
+ __inode_add_to_lists(sb, b, inode);
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL_GPL(inode_add_to_lists);
@@ -779,7 +797,7 @@ EXPORT_SYMBOL(unlock_new_inode);
* -- rmk@arm.uk.linux.org
*/
static struct inode *get_new_inode(struct super_block *sb,
- struct hlist_head *head,
+ struct inode_hash_bucket *b,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *),
void *data)
@@ -791,7 +809,7 @@ static struct inode *get_new_inode(struct super_block *sb,
struct inode *old;
/* We released the lock, so.. */
- old = find_inode(sb, head, test, data);
+ old = find_inode(sb, b, test, data);
if (!old) {
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
@@ -799,7 +817,7 @@ static struct inode *get_new_inode(struct super_block *sb,
goto set_failed;
inode->i_state = I_NEW;
- __inode_add_to_lists(sb, head, inode);
+ __inode_add_to_lists(sb, b, inode);
spin_unlock(&inode->i_lock);
/* Return the locked inode with I_NEW set, the
@@ -833,7 +851,7 @@ set_failed:
* comment at iget_locked for details.
*/
static struct inode *get_new_inode_fast(struct super_block *sb,
- struct hlist_head *head, unsigned long ino)
+ struct inode_hash_bucket *b, unsigned long ino)
{
struct inode *inode;
@@ -842,13 +860,13 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
struct inode *old;
/* We released the lock, so.. */
- old = find_inode_fast(sb, head, ino);
+ old = find_inode_fast(sb, b, ino);
if (!old) {
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
inode->i_ino = ino;
inode->i_state = I_NEW;
- __inode_add_to_lists(sb, head, inode);
+ __inode_add_to_lists(sb, b, inode);
spin_unlock(&inode->i_lock);
/* Return the locked inode with I_NEW set, the
@@ -871,19 +889,20 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
return inode;
}
-static int test_inode_iunique(struct super_block * sb, struct hlist_head *head, unsigned long ino)
+static int test_inode_iunique(struct super_block *sb,
+ struct inode_hash_bucket *b, unsigned long ino)
{
- struct hlist_node *node;
- struct inode * inode = NULL;
+ struct hlist_bl_node *node;
+ struct inode *inode = NULL;
- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(inode, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
if (inode->i_ino == ino && inode->i_sb == sb) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
return 0;
}
}
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
return 1;
}
@@ -910,7 +929,7 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
*/
static DEFINE_SPINLOCK(unique_lock);
static unsigned int counter;
- struct hlist_head *head;
+ struct inode_hash_bucket *b;
ino_t res;
spin_lock(&unique_lock);
@@ -918,8 +937,8 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
if (counter <= max_reserved)
counter = max_reserved + 1;
res = counter++;
- head = inode_hashtable + hash(sb, res);
- } while (!test_inode_iunique(sb, head, res));
+ b = inode_hashtable + hash(sb, res);
+ } while (!test_inode_iunique(sb, b, res));
spin_unlock(&unique_lock);
return res;
@@ -967,12 +986,13 @@ EXPORT_SYMBOL(igrab);
* Note, @test is called with the i_lock held, so can't sleep.
*/
static struct inode *ifind(struct super_block *sb,
- struct hlist_head *head, int (*test)(struct inode *, void *),
+ struct inode_hash_bucket *b,
+ int (*test)(struct inode *, void *),
void *data, const int wait)
{
struct inode *inode;
- inode = find_inode(sb, head, test, data);
+ inode = find_inode(sb, b, test, data);
if (inode) {
__iget(inode);
spin_unlock(&inode->i_lock);
@@ -999,11 +1019,12 @@ static struct inode *ifind(struct super_block *sb,
* Otherwise NULL is returned.
*/
static struct inode *ifind_fast(struct super_block *sb,
- struct hlist_head *head, unsigned long ino)
+ struct inode_hash_bucket *b,
+ unsigned long ino)
{
struct inode *inode;
- inode = find_inode_fast(sb, head, ino);
+ inode = find_inode_fast(sb, b, ino);
if (inode) {
__iget(inode);
spin_unlock(&inode->i_lock);
@@ -1037,9 +1058,9 @@ static struct inode *ifind_fast(struct super_block *sb,
struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
- return ifind(sb, head, test, data, 0);
+ return ifind(sb, b, test, data, 0);
}
EXPORT_SYMBOL(ilookup5_nowait);
@@ -1065,9 +1086,9 @@ EXPORT_SYMBOL(ilookup5_nowait);
struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
- return ifind(sb, head, test, data, 1);
+ return ifind(sb, b, test, data, 1);
}
EXPORT_SYMBOL(ilookup5);
@@ -1087,9 +1108,9 @@ EXPORT_SYMBOL(ilookup5);
*/
struct inode *ilookup(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
- return ifind_fast(sb, head, ino);
+ return ifind_fast(sb, b, ino);
}
EXPORT_SYMBOL(ilookup);
@@ -1117,17 +1138,17 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *), void *data)
{
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
struct inode *inode;
- inode = ifind(sb, head, test, data, 1);
+ inode = ifind(sb, b, test, data, 1);
if (inode)
return inode;
/*
* get_new_inode() will do the right thing, re-trying the search
* in case it had to block at any point.
*/
- return get_new_inode(sb, head, test, set, data);
+ return get_new_inode(sb, b, test, set, data);
}
EXPORT_SYMBOL(iget5_locked);
@@ -1148,17 +1169,17 @@ EXPORT_SYMBOL(iget5_locked);
*/
struct inode *iget_locked(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
struct inode *inode;
- inode = ifind_fast(sb, head, ino);
+ inode = ifind_fast(sb, b, ino);
if (inode)
return inode;
/*
* get_new_inode_fast() will do the right thing, re-trying the search
* in case it had to block at any point.
*/
- return get_new_inode_fast(sb, head, ino);
+ return get_new_inode_fast(sb, b, ino);
}
EXPORT_SYMBOL(iget_locked);
@@ -1166,16 +1187,16 @@ int insert_inode_locked(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
ino_t ino = inode->i_ino;
- struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
inode->i_state |= I_NEW;
while (1) {
- struct hlist_node *node;
+ struct hlist_bl_node *node;
struct inode *old = NULL;
repeat:
- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(old, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
if (old->i_ino != ino)
continue;
if (old->i_sb != sb)
@@ -1183,21 +1204,21 @@ repeat:
if (old->i_state & (I_FREEING|I_WILL_FREE))
continue;
if (!spin_trylock(&old->i_lock)) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
goto repeat;
}
break;
}
if (likely(!node)) {
- hlist_add_head(&inode->i_hash, head);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
return 0;
}
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
__iget(old);
spin_unlock(&old->i_lock);
wait_on_inode(old);
- if (unlikely(!hlist_unhashed(&old->i_hash))) {
+ if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
iput(old);
return -EBUSY;
}
@@ -1210,17 +1231,17 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
struct super_block *sb = inode->i_sb;
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
inode->i_state |= I_NEW;
while (1) {
- struct hlist_node *node;
+ struct hlist_bl_node *node;
struct inode *old = NULL;
repeat:
- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(old, node, head, i_hash) {
+ spin_lock_bucket(b);
+ hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
if (old->i_sb != sb)
continue;
if (!test(old, data))
@@ -1228,21 +1249,21 @@ repeat:
if (old->i_state & (I_FREEING|I_WILL_FREE))
continue;
if (!spin_trylock(&old->i_lock)) {
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
goto repeat;
}
break;
}
if (likely(!node)) {
- hlist_add_head(&inode->i_hash, head);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
return 0;
}
- spin_unlock(&inode_hash_lock);
+ spin_unlock_bucket(b);
__iget(old);
spin_unlock(&old->i_lock);
wait_on_inode(old);
- if (unlikely(!hlist_unhashed(&old->i_hash))) {
+ if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
iput(old);
return -EBUSY;
}
@@ -1261,12 +1282,12 @@ EXPORT_SYMBOL(insert_inode_locked4);
*/
void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
- struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+ struct inode_hash_bucket *b = inode_hashtable + hash(inode->i_sb, hashval);
spin_lock(&inode->i_lock);
- spin_lock(&inode_hash_lock);
- hlist_add_head(&inode->i_hash, head);
- spin_unlock(&inode_hash_lock);
+ spin_lock_bucket(b);
+ hlist_bl_add_head(&inode->i_hash, &b->head);
+ spin_unlock_bucket(b);
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(__insert_inode_hash);
@@ -1280,9 +1301,10 @@ EXPORT_SYMBOL(__insert_inode_hash);
*/
void __remove_inode_hash(struct inode *inode)
{
- spin_lock(&inode_hash_lock);
- hlist_del_init(&inode->i_hash);
- spin_unlock(&inode_hash_lock);
+ struct inode_hash_bucket *b = inode_hashtable + hash(inode->i_sb, inode->i_ino);
+ spin_lock_bucket(b);
+ hlist_bl_del_init(&inode->i_hash);
+ spin_unlock_bucket(b);
}
/**
@@ -1312,7 +1334,7 @@ EXPORT_SYMBOL(generic_delete_inode);
*/
int generic_drop_inode(struct inode *inode)
{
- return !inode->i_nlink || hlist_unhashed(&inode->i_hash);
+ return !inode->i_nlink || hlist_bl_unhashed(&inode->i_hash);
}
EXPORT_SYMBOL_GPL(generic_drop_inode);
@@ -1626,7 +1648,7 @@ void __init inode_init_early(void)
inode_hashtable =
alloc_large_system_hash("Inode-cache",
- sizeof(struct hlist_head),
+ sizeof(struct inode_hash_bucket),
ihash_entries,
14,
HASH_EARLY,
@@ -1635,7 +1657,7 @@ void __init inode_init_early(void)
0);
for (loop = 0; loop < (1 << i_hash_shift); loop++)
- INIT_HLIST_HEAD(&inode_hashtable[loop]);
+ INIT_HLIST_BL_HEAD(&inode_hashtable[loop].head);
}
void __init inode_init(void)
@@ -1657,7 +1679,7 @@ void __init inode_init(void)
inode_hashtable =
alloc_large_system_hash("Inode-cache",
- sizeof(struct hlist_head),
+ sizeof(struct inode_hash_bucket),
ihash_entries,
14,
0,
@@ -1666,7 +1688,7 @@ void __init inode_init(void)
0);
for (loop = 0; loop < (1 << i_hash_shift); loop++)
- INIT_HLIST_HEAD(&inode_hashtable[loop]);
+ INIT_HLIST_BL_HEAD(&inode_hashtable[loop].head);
}
void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
diff --git a/fs/nilfs2/gcinode.c b/fs/nilfs2/gcinode.c
index bed3a78..ce7344e 100644
--- a/fs/nilfs2/gcinode.c
+++ b/fs/nilfs2/gcinode.c
@@ -196,13 +196,13 @@ int nilfs_init_gccache(struct the_nilfs *nilfs)
INIT_LIST_HEAD(&nilfs->ns_gc_inodes);
nilfs->ns_gc_inodes_h =
- kmalloc(sizeof(struct hlist_head) * NILFS_GCINODE_HASH_SIZE,
+ kmalloc(sizeof(struct hlist_bl_head) * NILFS_GCINODE_HASH_SIZE,
GFP_NOFS);
if (nilfs->ns_gc_inodes_h == NULL)
return -ENOMEM;
for (loop = 0; loop < NILFS_GCINODE_HASH_SIZE; loop++)
- INIT_HLIST_HEAD(&nilfs->ns_gc_inodes_h[loop]);
+ INIT_HLIST_BL_HEAD(&nilfs->ns_gc_inodes_h[loop]);
return 0;
}
@@ -254,18 +254,18 @@ static unsigned long ihash(ino_t ino, __u64 cno)
*/
struct inode *nilfs_gc_iget(struct the_nilfs *nilfs, ino_t ino, __u64 cno)
{
- struct hlist_head *head = nilfs->ns_gc_inodes_h + ihash(ino, cno);
- struct hlist_node *node;
+ struct hlist_bl_head *head = nilfs->ns_gc_inodes_h + ihash(ino, cno);
+ struct hlist_bl_node *node;
struct inode *inode;
- hlist_for_each_entry(inode, node, head, i_hash) {
+ hlist_bl_for_each_entry(inode, node, head, i_hash) {
if (inode->i_ino == ino && NILFS_I(inode)->i_cno == cno)
return inode;
}
inode = alloc_gcinode(nilfs, ino, cno);
if (likely(inode)) {
- hlist_add_head(&inode->i_hash, head);
+ hlist_bl_add_head(&inode->i_hash, head);
list_add(&NILFS_I(inode)->i_dirty, &nilfs->ns_gc_inodes);
}
return inode;
@@ -284,16 +284,18 @@ void nilfs_clear_gcinode(struct inode *inode)
*/
void nilfs_remove_all_gcinode(struct the_nilfs *nilfs)
{
- struct hlist_head *head = nilfs->ns_gc_inodes_h;
- struct hlist_node *node, *n;
+ struct hlist_bl_head *head = nilfs->ns_gc_inodes_h;
+ struct hlist_bl_node *node;
struct inode *inode;
int loop;
for (loop = 0; loop < NILFS_GCINODE_HASH_SIZE; loop++, head++) {
- hlist_for_each_entry_safe(inode, node, n, head, i_hash) {
- hlist_del_init(&inode->i_hash);
+restart:
+ hlist_bl_for_each_entry(inode, node, head, i_hash) {
+ hlist_bl_del_init(&inode->i_hash);
list_del_init(&NILFS_I(inode)->i_dirty);
nilfs_clear_gcinode(inode); /* might sleep */
+ goto restart;
}
}
}
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 9fd051a..038251c 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -2452,7 +2452,7 @@ nilfs_remove_written_gcinodes(struct the_nilfs *nilfs, struct list_head *head)
list_for_each_entry_safe(ii, n, head, i_dirty) {
if (!test_bit(NILFS_I_UPDATED, &ii->i_state))
continue;
- hlist_del_init(&ii->vfs_inode.i_hash);
+ hlist_bl_del_init(&ii->vfs_inode.i_hash);
list_del_init(&ii->i_dirty);
nilfs_clear_gcinode(&ii->vfs_inode);
}
diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index f785a7b..1ab441a 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -167,7 +167,7 @@ struct the_nilfs {
/* GC inode list and hash table head */
struct list_head ns_gc_inodes;
- struct hlist_head *ns_gc_inodes_h;
+ struct hlist_bl_head *ns_gc_inodes_h;
/* Disk layout information (static) */
unsigned int ns_blocksize_bits;
diff --git a/fs/reiserfs/xattr.c b/fs/reiserfs/xattr.c
index 8c4cf27..ea2f55c 100644
--- a/fs/reiserfs/xattr.c
+++ b/fs/reiserfs/xattr.c
@@ -424,7 +424,7 @@ int reiserfs_prepare_write(struct file *f, struct page *page,
static void update_ctime(struct inode *inode)
{
struct timespec now = current_fs_time(inode->i_sb);
- if (hlist_unhashed(&inode->i_hash) || !inode->i_nlink ||
+ if (hlist_bl_unhashed(&inode->i_hash) || !inode->i_nlink ||
timespec_equal(&inode->i_ctime, &now))
return;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index da0ebf1..f06be07 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -376,6 +376,7 @@ struct files_stat_struct {
#include <linux/capability.h>
#include <linux/semaphore.h>
#include <linux/fiemap.h>
+#include <linux/rculist_bl.h>
#include <asm/atomic.h>
#include <asm/byteorder.h>
@@ -722,7 +723,7 @@ struct posix_acl;
#define ACL_NOT_CACHED ((void *)(-1))
struct inode {
- struct hlist_node i_hash;
+ struct hlist_bl_node i_hash;
struct list_head i_list; /* backing dev IO list */
struct list_head i_sb_list;
struct list_head i_dentry;
diff --git a/mm/shmem.c b/mm/shmem.c
index 56229bb..b83b442 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2148,7 +2148,7 @@ static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
if (*len < 3)
return 255;
- if (hlist_unhashed(&inode->i_hash)) {
+ if (hlist_bl_unhashed(&inode->i_hash)) {
/* Unfortunately insert_inode_hash is not idempotent,
* so as we hash inodes here rather than at creation
* time, we need a lock to ensure we only try
@@ -2156,7 +2156,7 @@ static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
*/
static DEFINE_SPINLOCK(lock);
spin_lock(&lock);
- if (hlist_unhashed(&inode->i_hash))
+ if (hlist_bl_unhashed(&inode->i_hash))
__insert_inode_hash(inode,
inode->i_ino + inode->i_generation);
spin_unlock(&lock);
--
1.7.1
^ permalink raw reply related [flat|nested] 111+ messages in thread
* [PATCH 13/17] fs: Implement lazy LRU updates for inodes.
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (11 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 12/17] fs: Introduce per-bucket inode hash locks Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-09-30 2:05 ` Christoph Hellwig
2010-09-29 12:18 ` [PATCH 14/17] fs: Inode counters do not need to be atomic Dave Chinner
` (6 subsequent siblings)
19 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic. We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount of contention
on LRU updates.
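In outline, the reclaim-side check this turns into looks roughly like the
following (a simplified sketch of the prune_icache() hunks below; locking
and the buffer/pagecache handling are omitted, and all identifiers are the
ones used in the patch):

	if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
		/* pinned or already being torn down: take it off the LRU */
		list_del_init(&inode->i_list);
		atomic_dec(&inodes_stat.nr_unused);
		continue;
	}
	if (inode->i_state & I_REFERENCED) {
		/* touched since the last scan: clear the flag, second chance */
		inode->i_state &= ~I_REFERENCED;
		list_move(&inode->i_list, &inode_unused);
		continue;
	}
	/* otherwise the inode is a candidate for freeing */
	list_move(&inode->i_list, &freeable);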
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/fs-writeback.c | 20 +++--------
fs/inode.c | 81 +++++++++++++--------------------------------
include/linux/fs.h | 20 +++++++----
include/linux/writeback.h | 1 -
4 files changed, 42 insertions(+), 80 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9b63fc7..432a4df 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -408,15 +408,8 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* completion.
*/
redirty_tail(inode);
- } else if (inode->i_count) {
- /*
- * The inode is clean, inuse
- */
- list_move(&inode->i_list, &inode_in_use);
} else {
- /*
- * The inode is clean, unused
- */
+ /* The inode is clean */
list_move(&inode->i_list, &inode_unused);
}
}
@@ -1058,8 +1051,6 @@ static void wait_sb_inodes(struct super_block *sb)
*/
WARN_ON(!rwsem_is_locked(&sb->s_umount));
- spin_lock(&sb_inode_list_lock);
-
/*
* Data integrity sync. Must wait for all pages under writeback,
* because there may have been pages dirtied before our sync
@@ -1067,6 +1058,7 @@ static void wait_sb_inodes(struct super_block *sb)
* In which case, the inode may not be on the dirty list, but
* we still have to wait for that writeout.
*/
+ spin_lock(&sb_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
struct address_space *mapping;
@@ -1083,10 +1075,10 @@ static void wait_sb_inodes(struct super_block *sb)
spin_unlock(&sb_inode_list_lock);
/*
* We hold a reference to 'inode' so it couldn't have been
- * removed from s_inodes list while we dropped the
- * sb_inode_list_lock. We cannot iput the inode now as we can
- * be holding the last reference and we cannot iput it under
- * spinlock. So we keep the reference and iput it later.
+ * removed from s_inodes list while we dropped the i_lock. We
+ * cannot iput the inode now as we can be holding the last
+ * reference and we cannot iput it under spinlock. So we keep
+ * the reference and iput it later.
*/
iput(old_inode);
old_inode = inode;
diff --git a/fs/inode.c b/fs/inode.c
index 9d1a0fc..50599d7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -94,7 +94,6 @@ static unsigned int i_hash_shift __read_mostly;
* allowing for low-overhead inode sync() operations.
*/
-LIST_HEAD(inode_in_use);
LIST_HEAD(inode_unused);
struct inode_hash_bucket {
@@ -296,6 +295,7 @@ void inode_init_once(struct inode *inode)
INIT_HLIST_BL_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_devices);
+ INIT_LIST_HEAD(&inode->i_list);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
spin_lock_init(&inode->i_data.tree_lock);
spin_lock_init(&inode->i_data.i_mmap_lock);
@@ -317,25 +317,6 @@ static void init_once(void *foo)
inode_init_once(inode);
}
-/*
- * i_lock must be held
- */
-void __iget(struct inode *inode)
-{
- assert_spin_locked(&inode->i_lock);
-
- inode->i_count++;
- if (inode->i_count > 1)
- return;
-
- if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
- spin_lock(&wb_inode_list_lock);
- list_move(&inode->i_list, &inode_in_use);
- spin_unlock(&wb_inode_list_lock);
- }
- atomic_dec(&inodes_stat.nr_unused);
-}
-
void end_writeback(struct inode *inode)
{
might_sleep();
@@ -380,7 +361,7 @@ static void dispose_list(struct list_head *head)
struct inode *inode;
inode = list_first_entry(head, struct inode, i_list);
- list_del(&inode->i_list);
+ list_del_init(&inode->i_list);
evict(inode);
@@ -431,11 +412,12 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
invalidate_inode_buffers(inode);
if (!inode->i_count) {
spin_lock(&wb_inode_list_lock);
- list_move(&inode->i_list, dispose);
+ list_del(&inode->i_list);
spin_unlock(&wb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
+ list_add(&inode->i_list, dispose);
count++;
continue;
}
@@ -473,19 +455,6 @@ int invalidate_inodes(struct super_block *sb)
}
EXPORT_SYMBOL(invalidate_inodes);
-static int can_unuse(struct inode *inode)
-{
- if (inode->i_state)
- return 0;
- if (inode_has_buffers(inode))
- return 0;
- if (inode->i_count)
- return 0;
- if (inode->i_data.nrpages)
- return 0;
- return 1;
-}
-
/*
* Scan `goal' inodes on the unused list for freeable ones. They are moved to
* a temporary list and then are freed outside LRU lock by dispose_list().
@@ -503,13 +472,12 @@ static void prune_icache(int nr_to_scan)
{
LIST_HEAD(freeable);
int nr_pruned = 0;
- int nr_scanned;
unsigned long reap = 0;
down_read(&iprune_sem);
again:
spin_lock(&wb_inode_list_lock);
- for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
+ for (; nr_to_scan; nr_to_scan--) {
struct inode *inode;
if (list_empty(&inode_unused))
@@ -521,33 +489,30 @@ again:
spin_unlock(&wb_inode_list_lock);
goto again;
}
- if (inode->i_state || inode->i_count) {
+ if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
+ list_del_init(&inode->i_list);
+ spin_unlock(&inode->i_lock);
+ atomic_dec(&inodes_stat.nr_unused);
+ continue;
+ }
+ if (inode->i_state) {
list_move(&inode->i_list, &inode_unused);
+ inode->i_state &= ~I_REFERENCED;
spin_unlock(&inode->i_lock);
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+ list_move(&inode->i_list, &inode_unused);
spin_unlock(&wb_inode_list_lock);
__iget(inode);
spin_unlock(&inode->i_lock);
+
if (remove_inode_buffers(inode))
reap += invalidate_mapping_pages(&inode->i_data,
0, -1);
iput(inode);
-again2:
spin_lock(&wb_inode_list_lock);
-
- if (inode != list_entry(inode_unused.next,
- struct inode, i_list))
- continue; /* wrong inode or list_empty */
- if (!spin_trylock(&inode->i_lock)) {
- spin_unlock(&wb_inode_list_lock);
- goto again2;
- }
- if (!can_unuse(inode)) {
- spin_unlock(&inode->i_lock);
- continue;
- }
+ continue;
}
list_move(&inode->i_list, &freeable);
WARN_ON(inode->i_state & I_NEW);
@@ -687,9 +652,6 @@ __inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
atomic_inc(&inodes_stat.nr_inodes);
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
- spin_lock(&wb_inode_list_lock);
- list_add(&inode->i_list, &inode_in_use);
- spin_unlock(&wb_inode_list_lock);
if (b) {
spin_lock_bucket(b);
hlist_bl_add_head(&inode->i_hash, &b->head);
@@ -1381,11 +1343,14 @@ static void iput_final(struct inode *inode)
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_WILL_FREE;
__remove_inode_hash(inode);
- atomic_dec(&inodes_stat.nr_unused);
}
- spin_lock(&wb_inode_list_lock);
- list_del_init(&inode->i_list);
- spin_unlock(&wb_inode_list_lock);
+ if (!list_empty(&inode->i_list)) {
+ spin_lock(&wb_inode_list_lock);
+ list_del_init(&inode->i_list);
+ spin_unlock(&wb_inode_list_lock);
+ if (!inode->i_state)
+ atomic_dec(&inodes_stat.nr_unused);
+ }
list_del_init(&inode->i_sb_list);
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f06be07..096a5eb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1627,16 +1627,17 @@ struct super_operations {
*
* Q: What is the difference between I_WILL_FREE and I_FREEING?
*/
-#define I_DIRTY_SYNC 1
-#define I_DIRTY_DATASYNC 2
-#define I_DIRTY_PAGES 4
+#define I_DIRTY_SYNC 0x01
+#define I_DIRTY_DATASYNC 0x02
+#define I_DIRTY_PAGES 0x04
#define __I_NEW 3
#define I_NEW (1 << __I_NEW)
-#define I_WILL_FREE 16
-#define I_FREEING 32
-#define I_CLEAR 64
+#define I_WILL_FREE 0x10
+#define I_FREEING 0x20
+#define I_CLEAR 0x40
#define __I_SYNC 7
#define I_SYNC (1 << __I_SYNC)
+#define I_REFERENCED 0x100
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
@@ -2177,7 +2178,6 @@ extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struc
extern int insert_inode_locked(struct inode *);
extern void unlock_new_inode(struct inode *);
-extern void __iget(struct inode * inode);
extern void iget_failed(struct inode *);
extern void end_writeback(struct inode *);
extern void destroy_inode(struct inode *);
@@ -2392,6 +2392,12 @@ extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt);
extern void save_mount_options(struct super_block *sb, char *options);
extern void replace_mount_options(struct super_block *sb, char *options);
+static inline void __iget(struct inode *inode)
+{
+ assert_spin_locked(&inode->i_lock);
+ inode->i_count++;
+}
+
static inline ino_t parent_ino(struct dentry *dentry)
{
ino_t res;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0f6fe0c..cde6993 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -11,7 +11,6 @@ struct backing_dev_info;
extern spinlock_t sb_inode_list_lock;
extern spinlock_t wb_inode_list_lock;
-extern struct list_head inode_in_use;
extern struct list_head inode_unused;
/*
--
1.7.1
^ permalink raw reply related [flat|nested] 111+ messages in thread
* [PATCH 14/17] fs: Inode counters do not need to be atomic.
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (12 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 13/17] fs: Implement lazy LRU updates for inodes Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-09-29 12:18 ` [PATCH 15/17] fs: inode per-cpu last_ino allocator Dave Chinner
` (5 subsequent siblings)
19 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Atomic counters do not scale on large machines, so convert them
back to normal variables protected by spin locks. We can do this
because the counters are associated with specific list operations
that are protected by locks; nr_inodes can be protected by the
sb_inode_list_lock, and nr_unused can be protected by the
wb_inode_list_lock.
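As a sketch of the resulting pattern (not a literal hunk; all names are
from the diff below), each counter update now nests inside the lock that
already serialises the corresponding list operation:

	spin_lock(&wb_inode_list_lock);
	list_del_init(&inode->i_list);
	inodes_stat.nr_unused--;	/* serialised by wb_inode_list_lock */
	spin_unlock(&wb_inode_list_lock);

	spin_lock(&sb_inode_list_lock);
	list_add(&inode->i_sb_list, &sb->s_inodes);
	inodes_stat.nr_inodes++;	/* serialised by sb_inode_list_lock */
	spin_unlock(&sb_inode_list_lock);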
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/fs-writeback.c | 6 ++----
fs/inode.c | 30 ++++++++++++------------------
include/linux/fs.h | 12 ++++++------
3 files changed, 20 insertions(+), 28 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 432a4df..8e390e8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -743,8 +743,7 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
wb->last_old_flush = jiffies;
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) +
- (atomic_read(&inodes_stat.nr_inodes) -
- atomic_read(&inodes_stat.nr_unused));
+ inodes_stat.nr_inodes - inodes_stat.nr_unused;
if (nr_pages) {
struct wb_writeback_work work = {
@@ -1116,8 +1115,7 @@ void writeback_inodes_sb(struct super_block *sb)
WARN_ON(!rwsem_is_locked(&sb->s_umount));
work.nr_pages = nr_dirty + nr_unstable +
- (atomic_read(&inodes_stat.nr_inodes) -
- atomic_read(&inodes_stat.nr_unused));
+ inodes_stat.nr_inodes - inodes_stat.nr_unused;
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);
diff --git a/fs/inode.c b/fs/inode.c
index 50599d7..d279517 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -139,8 +139,8 @@ static DECLARE_RWSEM(iprune_sem);
* Statistics gathering..
*/
struct inodes_stat_t inodes_stat = {
- .nr_inodes = ATOMIC_INIT(0),
- .nr_unused = ATOMIC_INIT(0),
+ .nr_inodes = 0,
+ .nr_unused = 0,
};
static struct kmem_cache *inode_cachep __read_mostly;
@@ -376,7 +376,6 @@ static void dispose_list(struct list_head *head)
destroy_inode(inode);
nr_disposed++;
}
- atomic_sub(nr_disposed, &inodes_stat.nr_inodes);
}
/*
@@ -385,7 +384,7 @@ static void dispose_list(struct list_head *head)
static int invalidate_list(struct list_head *head, struct list_head *dispose)
{
struct list_head *next;
- int busy = 0, count = 0;
+ int busy = 0;
next = head->next;
for (;;) {
@@ -413,19 +412,17 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
if (!inode->i_count) {
spin_lock(&wb_inode_list_lock);
list_del(&inode->i_list);
+ inodes_stat.nr_unused--;
spin_unlock(&wb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
list_add(&inode->i_list, dispose);
- count++;
continue;
}
spin_unlock(&inode->i_lock);
busy = 1;
}
- /* only unused inodes may be cached with i_count zero */
- atomic_sub(count, &inodes_stat.nr_unused);
return busy;
}
@@ -471,7 +468,6 @@ EXPORT_SYMBOL(invalidate_inodes);
static void prune_icache(int nr_to_scan)
{
LIST_HEAD(freeable);
- int nr_pruned = 0;
unsigned long reap = 0;
down_read(&iprune_sem);
@@ -492,7 +488,7 @@ again:
if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
list_del_init(&inode->i_list);
spin_unlock(&inode->i_lock);
- atomic_dec(&inodes_stat.nr_unused);
+ inodes_stat.nr_unused--;
continue;
}
if (inode->i_state) {
@@ -518,9 +514,8 @@ again:
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- nr_pruned++;
+ inodes_stat.nr_unused--;
}
- atomic_sub(nr_pruned, &inodes_stat.nr_unused);
if (current_is_kswapd())
__count_vm_events(KSWAPD_INODESTEAL, reap);
else
@@ -552,8 +547,7 @@ static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
return -1;
prune_icache(nr);
}
- return (atomic_read(&inodes_stat.nr_unused) / 100) *
- sysctl_vfs_cache_pressure;
+ return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
}
static struct shrinker icache_shrinker = {
@@ -649,7 +643,7 @@ static inline void
__inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
struct inode *inode)
{
- atomic_inc(&inodes_stat.nr_inodes);
+ inodes_stat.nr_inodes++;
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
if (b) {
@@ -1325,9 +1319,9 @@ static void iput_final(struct inode *inode)
if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
spin_lock(&wb_inode_list_lock);
list_move(&inode->i_list, &inode_unused);
+ inodes_stat.nr_unused++;
spin_unlock(&wb_inode_list_lock);
}
- atomic_inc(&inodes_stat.nr_unused);
if (sb->s_flags & MS_ACTIVE) {
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
@@ -1347,16 +1341,16 @@ static void iput_final(struct inode *inode)
if (!list_empty(&inode->i_list)) {
spin_lock(&wb_inode_list_lock);
list_del_init(&inode->i_list);
- spin_unlock(&wb_inode_list_lock);
if (!inode->i_state)
- atomic_dec(&inodes_stat.nr_unused);
+ inodes_stat.nr_unused--;
+ spin_unlock(&wb_inode_list_lock);
}
list_del_init(&inode->i_sb_list);
+ inodes_stat.nr_inodes--;
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
- atomic_dec(&inodes_stat.nr_inodes);
evict(inode);
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 096a5eb..3a43313 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -32,6 +32,12 @@
#define SEEK_END 2 /* seek relative to end of file */
#define SEEK_MAX SEEK_END
+struct inodes_stat_t {
+ int nr_inodes;
+ int nr_unused;
+ int dummy[5]; /* padding for sysctl ABI compatibility */
+};
+
/* And dynamically-tunable limits and defaults: */
struct files_stat_struct {
int nr_files; /* read only */
@@ -410,12 +416,6 @@ typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
ssize_t bytes, void *private, int ret,
bool is_async);
-struct inodes_stat_t {
- atomic_t nr_inodes;
- atomic_t nr_unused;
- int dummy[5]; /* padding for sysctl ABI compatibility */
-};
-
/*
* Attribute flags. These should be or-ed together to figure out what
* has been changed!
--
1.7.1
^ permalink raw reply related [flat|nested] 111+ messages in thread
* [PATCH 15/17] fs: inode per-cpu last_ino allocator
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (13 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 14/17] fs: Inode counters do not need to be atomic Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-09-30 2:07 ` Christoph Hellwig
2010-09-30 4:53 ` Andrew Morton
2010-09-29 12:18 ` [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter Dave Chinner
` (4 subsequent siblings)
19 siblings, 2 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Eric Dumazet <dada1@cosmosbay.com>
last_ino was converted to an atomic variable to allow the inode_lock
to go away. However, contended atomics do not scale on large
machines, and new_inode() triggers excessive contention in such
situations.
Solve this problem by giving each CPU a per-cpu variable, refilled
from the shared last_ino only once every 1024 allocations.
This reduces contention on the shared last_ino and gives the same
spread of inode numbers as before (i.e. the same wraparound after 2^32
allocations).
[npiggin: some extra commenting and use of defines]
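For illustration, with LAST_INO_BATCH of 1024 and assuming each CPU's
first allocation completes before the other CPU refills (the exact
interleaving depends on scheduling), the allocation pattern works out to:

	/*
	 * CPU0, 1st new_inode():   shared_last_ino 0 -> 1024,    i_ino = 1
	 * CPU0, next 1023 calls:   i_ino = 2 .. 1024, no shared access
	 * CPU1, 1st new_inode():   shared_last_ino 1024 -> 2048, i_ino = 1025
	 * CPU1, next 1023 calls:   i_ino = 1026 .. 2048, no shared access
	 */

So the shared counter is dirtied once per 1024 allocations per CPU
instead of once per allocation.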
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 43 insertions(+), 7 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index d279517..1388450 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -653,6 +653,48 @@ __inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
}
}
+#ifdef CONFIG_SMP
+#define LAST_INO_BATCH 1024
+/*
+ * Each cpu owns a range of LAST_INO_BATCH numbers.
+ * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
+ * to renew the exhausted range.
+ *
+ * This does not significantly increase overflow rate because every CPU can
+ * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
+ * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
+ * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
+ * overflow rate by 2x, which does not seem too significant.
+ *
+ * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
+ * error if st_ino won't fit in target struct field. Use 32bit counter
+ * here to attempt to avoid that.
+ */
+static DEFINE_PER_CPU(unsigned int, last_ino);
+static atomic_t shared_last_ino;
+
+static unsigned int last_ino_get(void)
+{
+ unsigned int *p = &get_cpu_var(last_ino);
+ unsigned int res = *p;
+
+ if (unlikely((res & (LAST_INO_BATCH-1)) == 0))
+ res = (unsigned int)atomic_add_return(LAST_INO_BATCH,
+ &shared_last_ino) - LAST_INO_BATCH;
+
+ *p = ++res;
+ put_cpu_var(last_ino);
+ return res;
+}
+#else
+static unsigned int last_ino_get(void)
+{
+ static unsigned int last_ino;
+
+ return ++last_ino;
+}
+#endif
+
/**
* inode_add_to_lists - add a new inode to relevant lists
* @sb: superblock inode belongs to
@@ -690,19 +732,13 @@ EXPORT_SYMBOL_GPL(inode_add_to_lists);
*/
struct inode *new_inode(struct super_block *sb)
{
- /*
- * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
- * error if st_ino won't fit in target struct field. Use 32bit counter
- * here to attempt to avoid that.
- */
- static atomic_t last_ino = ATOMIC_INIT(0);
struct inode *inode;
inode = alloc_inode(sb);
if (inode) {
spin_lock(&sb_inode_list_lock);
spin_lock(&inode->i_lock);
- inode->i_ino = (unsigned int)atomic_inc_return(&last_ino);
+ inode->i_ino = last_ino_get();
inode->i_state = 0;
__inode_add_to_lists(sb, NULL, inode);
spin_unlock(&inode->i_lock);
--
1.7.1
^ permalink raw reply related [flat|nested] 111+ messages in thread
* [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (14 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 15/17] fs: inode per-cpu last_ino allocator Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-09-30 2:12 ` Christoph Hellwig
2010-09-30 4:53 ` Andrew Morton
2010-09-29 12:18 ` [PATCH 17/17] fs: Clean up inode reference counting Dave Chinner
` (3 subsequent siblings)
19 siblings, 2 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Eric Dumazet <dada1@cosmosbay.com>
The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.
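In outline (a sketch restating the hunks below, not new code): allocation
and destruction touch only a local per-cpu variable, and readers sum over
all CPUs, clamping the result to zero because per-cpu deltas can make the
instantaneous total go transiently negative:

	static DEFINE_PER_CPU(unsigned int, nr_inodes);

	this_cpu_inc(nr_inodes);	/* inode_init_always() */
	this_cpu_dec(nr_inodes);	/* __destroy_inode() */

	int get_nr_inodes(void)		/* slow path: sysctl and writeback */
	{
		int i, sum = 0;

		for_each_possible_cpu(i)
			sum += per_cpu(nr_inodes, i);
		return sum < 0 ? 0 : sum;
	}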
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/fs-writeback.c | 16 +++++++++++++---
fs/inode.c | 35 +++++++++++++++++++++++++++++++++--
include/linux/fs.h | 5 ++++-
kernel/sysctl.c | 4 ++--
4 files changed, 52 insertions(+), 8 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8e390e8..348cc18 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -728,6 +728,7 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
{
unsigned long expired;
long nr_pages;
+ int nr_dirty_inodes;
/*
* When set to zero, disable periodic writeback
@@ -740,10 +741,15 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
if (time_before(jiffies, expired))
return 0;
+ /* approximate dirty inodes */
+ nr_dirty_inodes = get_nr_inodes() - get_nr_inodes_unused();
+ if (nr_dirty_inodes < 0)
+ nr_dirty_inodes = 0;
+
wb->last_old_flush = jiffies;
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) +
- inodes_stat.nr_inodes - inodes_stat.nr_unused;
+ nr_dirty_inodes;
if (nr_pages) {
struct wb_writeback_work work = {
@@ -1105,6 +1111,7 @@ void writeback_inodes_sb(struct super_block *sb)
{
unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+ int nr_dirty_inodes;
DECLARE_COMPLETION_ONSTACK(done);
struct wb_writeback_work work = {
.sb = sb,
@@ -1114,8 +1121,11 @@ void writeback_inodes_sb(struct super_block *sb)
WARN_ON(!rwsem_is_locked(&sb->s_umount));
- work.nr_pages = nr_dirty + nr_unstable +
- inodes_stat.nr_inodes - inodes_stat.nr_unused;
+ nr_dirty_inodes = get_nr_inodes() - get_nr_inodes_unused();
+ if (nr_dirty_inodes < 0)
+ nr_dirty_inodes = 0;
+
+ work.nr_pages = nr_dirty + nr_unstable + nr_dirty_inodes;
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);
diff --git a/fs/inode.c b/fs/inode.c
index 1388450..a91efab 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -143,8 +143,38 @@ struct inodes_stat_t inodes_stat = {
.nr_unused = 0,
};
+static DEFINE_PER_CPU(unsigned int, nr_inodes);
+
static struct kmem_cache *inode_cachep __read_mostly;
+int get_nr_inodes(void)
+{
+ int i;
+ int sum = 0;
+ for_each_possible_cpu(i)
+ sum += per_cpu(nr_inodes, i);
+ return sum < 0 ? 0 : sum;
+}
+
+int get_nr_inodes_unused(void)
+{
+ return inodes_stat.nr_unused;
+}
+
+/*
+ * Handle nr_inodes sysctl
+ */
+int proc_nr_inodes(ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+ inodes_stat.nr_inodes = get_nr_inodes();
+ return proc_dointvec(table, write, buffer, lenp, ppos);
+#else
+ return -ENOSYS;
+#endif
+}
+
static void wake_up_inode(struct inode *inode)
{
/*
@@ -232,6 +262,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
inode->i_fsnotify_mask = 0;
#endif
+ this_cpu_inc(nr_inodes);
+
return 0;
out:
return -ENOMEM;
@@ -272,6 +304,7 @@ void __destroy_inode(struct inode *inode)
if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
posix_acl_release(inode->i_default_acl);
#endif
+ this_cpu_dec(nr_inodes);
}
EXPORT_SYMBOL(__destroy_inode);
@@ -643,7 +676,6 @@ static inline void
__inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
struct inode *inode)
{
- inodes_stat.nr_inodes++;
list_add(&inode->i_sb_list, &sb->s_inodes);
spin_unlock(&sb_inode_list_lock);
if (b) {
@@ -1382,7 +1414,6 @@ static void iput_final(struct inode *inode)
spin_unlock(&wb_inode_list_lock);
}
list_del_init(&inode->i_sb_list);
- inodes_stat.nr_inodes--;
spin_unlock(&sb_inode_list_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3a43313..d2ee5d0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -407,6 +407,8 @@ extern struct files_stat_struct files_stat;
extern int get_max_files(void);
extern int sysctl_nr_open;
extern struct inodes_stat_t inodes_stat;
+extern int get_nr_inodes(void);
+extern int get_nr_inodes_unused(void);
extern int leases_enable, lease_break_time;
struct buffer_head;
@@ -2477,7 +2479,8 @@ ssize_t simple_attr_write(struct file *file, const char __user *buf,
struct ctl_table;
int proc_nr_files(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);
-
+int proc_nr_inodes(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
int __init get_filesystem_list(char *buf);
#define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f88552c..33d1733 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1340,14 +1340,14 @@ static struct ctl_table fs_table[] = {
.data = &inodes_stat,
.maxlen = 2*sizeof(int),
.mode = 0444,
- .proc_handler = proc_dointvec,
+ .proc_handler = proc_nr_inodes,
},
{
.procname = "inode-state",
.data = &inodes_stat,
.maxlen = 7*sizeof(int),
.mode = 0444,
- .proc_handler = proc_dointvec,
+ .proc_handler = proc_nr_inodes,
},
{
.procname = "file-nr",
--
1.7.1
^ permalink raw reply related [flat|nested] 111+ messages in thread
* [PATCH 17/17] fs: Clean up inode reference counting
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (15 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter Dave Chinner
@ 2010-09-29 12:18 ` Dave Chinner
2010-09-30 2:15 ` Christoph Hellwig
2010-09-30 4:53 ` Andrew Morton
2010-09-29 23:57 ` [PATCH 0/17] fs: Inode cache scalability Christoph Hellwig
` (2 subsequent siblings)
19 siblings, 2 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-29 12:18 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-kernel
From: Nick Piggin <npiggin@suse.de>
Lots of filesystem code open-codes the act of getting a reference to
an inode. Factor the open-coded inode lock, increment, unlock sequence into
a function iget(). Then rename __iget to iget_ilock so that nothing
is directly incrementing the inode reference count for trivial
operations.
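The conversion itself is mechanical; a sketch of the before/after pattern
used throughout the diff below (all names are from the fs.h hunk):

	/* before: open-coded reference bump */
	spin_lock(&inode->i_lock);
	inode->i_count++;
	spin_unlock(&inode->i_lock);

	/* after: the new helper does the same thing */
	iget(inode);

	/* callers that already hold i_lock use the renamed helper */
	spin_lock(&inode->i_lock);
	iget_ilock(inode);	/* was __iget(inode) */
	spin_unlock(&inode->i_lock);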
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/affs/inode.c | 4 +---
fs/anon_inodes.c | 4 +---
fs/bfs/dir.c | 4 +---
fs/block_dev.c | 14 +++-----------
fs/btrfs/inode.c | 4 +---
fs/coda/dir.c | 4 +---
fs/drop_caches.c | 2 +-
fs/exofs/inode.c | 4 +---
fs/exofs/namei.c | 4 +---
fs/ext2/namei.c | 4 +---
fs/ext3/namei.c | 4 +---
fs/ext4/namei.c | 4 +---
fs/fs-writeback.c | 6 +++---
fs/gfs2/ops_inode.c | 4 +---
fs/hfsplus/dir.c | 4 +---
fs/inode.c | 18 +++++++++---------
fs/jffs2/dir.c | 8 ++------
fs/jfs/jfs_txnmgr.c | 4 +---
fs/jfs/namei.c | 4 +---
fs/libfs.c | 4 +---
fs/logfs/dir.c | 4 +---
fs/minix/namei.c | 4 +---
fs/namei.c | 7 ++-----
fs/nfs/dir.c | 4 +---
fs/nfs/getroot.c | 6 ++----
fs/nfs/write.c | 2 +-
fs/nilfs2/namei.c | 4 +---
fs/notify/inode_mark.c | 8 ++++----
fs/ntfs/super.c | 4 +---
fs/ocfs2/namei.c | 4 +---
fs/quota/dquot.c | 2 +-
fs/reiserfs/namei.c | 4 +---
fs/sysv/namei.c | 4 +---
fs/ubifs/dir.c | 4 +---
fs/udf/namei.c | 4 +---
fs/ufs/namei.c | 4 +---
fs/xfs/linux-2.6/xfs_iops.c | 4 +---
fs/xfs/xfs_inode.h | 4 +---
include/linux/fs.h | 10 +++++++++-
ipc/mqueue.c | 7 ++-----
kernel/futex.c | 4 +---
mm/shmem.c | 4 +---
net/socket.c | 4 +---
43 files changed, 70 insertions(+), 144 deletions(-)
diff --git a/fs/affs/inode.c b/fs/affs/inode.c
index cb9e773..f892be2 100644
--- a/fs/affs/inode.c
+++ b/fs/affs/inode.c
@@ -388,9 +388,7 @@ affs_add_entry(struct inode *dir, struct inode *inode, struct dentry *dentry, s3
affs_adjust_checksum(inode_bh, block - be32_to_cpu(chain));
mark_buffer_dirty_inode(inode_bh, inode);
inode->i_nlink = 2;
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
}
affs_fix_checksum(sb, bh);
mark_buffer_dirty_inode(bh, inode);
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index c50dc2a..517286c 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -114,9 +114,7 @@ struct file *anon_inode_getfile(const char *name,
* so we can avoid doing an igrab() and we can use an open-coded
* atomic_inc().
*/
- spin_lock(&anon_inode_inode->i_lock);
- anon_inode_inode->i_count++;
- spin_unlock(&anon_inode_inode->i_lock);
+ iget(anon_inode_inode);
path.dentry->d_op = &anon_inodefs_dentry_operations;
d_instantiate(path.dentry, anon_inode_inode);
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index d42fc72..0a42a91 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -176,9 +176,7 @@ static int bfs_link(struct dentry *old, struct inode *dir,
inc_nlink(inode);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
d_instantiate(new, inode);
mutex_unlock(&info->bfs_lock);
return 0;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 140451c..276a641 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -550,11 +550,7 @@ EXPORT_SYMBOL(bdget);
*/
struct block_device *bdgrab(struct block_device *bdev)
{
- struct inode *inode = bdev->bd_inode;
-
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(bdev->bd_inode);
return bdev;
}
@@ -585,9 +581,7 @@ static struct block_device *bd_acquire(struct inode *inode)
spin_lock(&bdev_lock);
bdev = inode->i_bdev;
if (bdev) {
- spin_lock(&inode->i_lock);
- bdev->bd_inode->i_count++;
- spin_unlock(&inode->i_lock);
+ bdgrab(bdev);
spin_unlock(&bdev_lock);
return bdev;
}
@@ -603,9 +597,7 @@ static struct block_device *bd_acquire(struct inode *inode)
* So, we can access it via ->i_mapping always
* without igrab().
*/
- spin_lock(&inode->i_lock);
- bdev->bd_inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(bdev->bd_inode);
inode->i_bdev = bdev;
inode->i_mapping = bdev->bd_inode->i_mapping;
list_add(&inode->i_devices, &bdev->bd_inodes);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7675f0c..6664ddf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4769,9 +4769,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
}
btrfs_set_trans_block_group(trans, dir);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
err = btrfs_add_nondir(trans, dentry, inode, 1, index);
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index 2e52ad6..f97bc3b 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -303,9 +303,7 @@ static int coda_link(struct dentry *source_de, struct inode *dir_inode,
}
coda_dir_update_mtime(dir_inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
d_instantiate(de, inode);
inc_nlink(inode);
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 0884447..af0b6da 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -24,7 +24,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
+ iget_ilock(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 3e7f967..fd0b9f2 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1156,9 +1156,7 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)
/* increment the refcount so that the inode will still be around when we
* reach the callback
*/
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
ios->done = create_done;
ios->private = inode;
diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
index 506778a..9bdc11f 100644
--- a/fs/exofs/namei.c
+++ b/fs/exofs/namei.c
@@ -153,9 +153,7 @@ static int exofs_link(struct dentry *old_dentry, struct inode *dir,
inode->i_ctime = CURRENT_TIME;
inode_inc_link_count(inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
return exofs_add_nondir(dentry, inode);
}
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index a5b9a54..499326b 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -206,9 +206,7 @@ static int ext2_link (struct dentry * old_dentry, struct inode * dir,
inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
err = ext2_add_link(dentry, inode);
if (!err) {
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 4e3b5ff..3742e5a 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -2260,9 +2260,7 @@ retry:
inode->i_ctime = CURRENT_TIME_SEC;
inc_nlink(inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
err = ext3_add_entry(handle, dentry, inode);
if (!err) {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 4dbb5e5..9752c00 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2312,9 +2312,7 @@ retry:
inode->i_ctime = ext4_current_time(inode);
ext4_inc_count(handle, inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
err = ext4_add_entry(handle, dentry, inode);
if (!err) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 348cc18..9c10fdc 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -294,7 +294,7 @@ static void inode_wait_for_writeback(struct inode *inode)
/*
* Write out an inode's dirty pages. Either the caller has ref on the inode
- * (either via __iget or via syscall against an fd) or the inode has
+ * (either via iget or via syscall against an fd) or the inode has
* I_WILL_FREE set (via generic_forget_inode)
*
* If `wait' is set, wait on the writeout.
@@ -505,7 +505,7 @@ again:
}
BUG_ON(inode->i_state & I_FREEING);
- __iget(inode);
+ iget_ilock(inode);
pages_skipped = wbc->pages_skipped;
writeback_single_inode(inode, wbc);
if (wbc->pages_skipped != pages_skipped) {
@@ -1075,7 +1075,7 @@ static void wait_sb_inodes(struct super_block *sb)
spin_unlock(&inode->i_lock);
continue;
}
- __iget(inode);
+ iget_ilock(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
/*
diff --git a/fs/gfs2/ops_inode.c b/fs/gfs2/ops_inode.c
index 49c38dc..4f1719a 100644
--- a/fs/gfs2/ops_inode.c
+++ b/fs/gfs2/ops_inode.c
@@ -253,9 +253,7 @@ out_parent:
gfs2_holder_uninit(ghs);
gfs2_holder_uninit(ghs + 1);
if (!error) {
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
d_instantiate(dentry, inode);
mark_inode_dirty(inode);
}
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 55fa48d..f163fab 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -301,9 +301,7 @@ static int hfsplus_link(struct dentry *src_dentry, struct inode *dst_dir,
inc_nlink(inode);
hfsplus_instantiate(dst_dentry, inode, cnid);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
HFSPLUS_SB(sb).file_count++;
diff --git a/fs/inode.c b/fs/inode.c
index a91efab..57eb850 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -533,7 +533,7 @@ again:
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
list_move(&inode->i_list, &inode_unused);
spin_unlock(&wb_inode_list_lock);
- __iget(inode);
+ iget_ilock(inode);
spin_unlock(&inode->i_lock);
if (remove_inode_buffers(inode))
@@ -591,7 +591,7 @@ static struct shrinker icache_shrinker = {
static void __wait_on_freeing_inode(struct inode *inode);
/*
* Called with the inode lock held.
- * NOTE: we are not increasing the inode-refcount, you must call __iget()
+ * NOTE: we are not increasing the inode-refcount, you must call iget_ilock()
* by hand after calling find_inode now! This simplifies iunique and won't
* add any additional branch in the common code.
*/
@@ -855,7 +855,7 @@ static struct inode *get_new_inode(struct super_block *sb,
* us. Use the old inode instead of the one we just
* allocated.
*/
- __iget(old);
+ iget_ilock(old);
spin_unlock(&old->i_lock);
destroy_inode(inode);
inode = old;
@@ -904,7 +904,7 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
* us. Use the old inode instead of the one we just
* allocated.
*/
- __iget(old);
+ iget_ilock(old);
spin_unlock(&old->i_lock);
destroy_inode(inode);
inode = old;
@@ -975,7 +975,7 @@ struct inode *igrab(struct inode *inode)
spin_lock(&inode->i_lock);
if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
- __iget(inode);
+ iget_ilock(inode);
} else {
/*
* Handle the case where s_op->clear_inode is not been
@@ -1018,7 +1018,7 @@ static struct inode *ifind(struct super_block *sb,
inode = find_inode(sb, b, test, data);
if (inode) {
- __iget(inode);
+ iget_ilock(inode);
spin_unlock(&inode->i_lock);
if (likely(wait))
wait_on_inode(inode);
@@ -1050,7 +1050,7 @@ static struct inode *ifind_fast(struct super_block *sb,
inode = find_inode_fast(sb, b, ino);
if (inode) {
- __iget(inode);
+ iget_ilock(inode);
spin_unlock(&inode->i_lock);
wait_on_inode(inode);
return inode;
@@ -1239,7 +1239,7 @@ repeat:
return 0;
}
spin_unlock_bucket(b);
- __iget(old);
+ iget_ilock(old);
spin_unlock(&old->i_lock);
wait_on_inode(old);
if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
@@ -1284,7 +1284,7 @@ repeat:
return 0;
}
spin_unlock_bucket(b);
- __iget(old);
+ iget_ilock(old);
spin_unlock(&old->i_lock);
wait_on_inode(old);
if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 4d1bcfa..85f523b 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -289,9 +289,7 @@ static int jffs2_link (struct dentry *old_dentry, struct inode *dir_i, struct de
mutex_unlock(&f->sem);
d_instantiate(dentry, old_dentry->d_inode);
dir_i->i_mtime = dir_i->i_ctime = ITIME(now);
- spin_lock(&old_dentry->d_inode->i_lock);
- old_dentry->d_inode->i_count++;
- spin_unlock(&old_dentry->d_inode->i_lock);
+ iget(old_dentry->d_inode);
}
return ret;
}
@@ -866,9 +864,7 @@ static int jffs2_rename (struct inode *old_dir_i, struct dentry *old_dentry,
printk(KERN_NOTICE "jffs2_rename(): Link succeeded, unlink failed (err %d). You now have a hard link\n", ret);
/* Might as well let the VFS know */
d_instantiate(new_dentry, old_dentry->d_inode);
- spin_lock(&old_dentry->d_inode->i_lock);
- old_dentry->d_inode->i_count++;
- spin_unlock(&old_dentry->d_inode->i_lock);
+ iget(old_dentry->d_inode);
new_dir_i->i_mtime = new_dir_i->i_ctime = ITIME(now);
return ret;
}
diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c
index 820212f..d5764b7 100644
--- a/fs/jfs/jfs_txnmgr.c
+++ b/fs/jfs/jfs_txnmgr.c
@@ -1279,9 +1279,7 @@ int txCommit(tid_t tid, /* transaction identifier */
* lazy commit thread finishes processing
*/
if (tblk->xflag & COMMIT_DELETE) {
- spin_lock(&tblk->u.ip->i_lock);
- tblk->u.ip->i_count++;
- spin_unlock(&tblk->u.ip->i_lock);
+ iget(tblk->u.ip);
/*
* Avoid a rare deadlock
*
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 3259008..7b1c8ea 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -839,9 +839,7 @@ static int jfs_link(struct dentry *old_dentry,
ip->i_ctime = CURRENT_TIME;
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
mark_inode_dirty(dir);
- spin_lock(&ip->i_lock);
- ip->i_count++;
- spin_unlock(&ip->i_lock);
+ iget(ip);
iplist[0] = ip;
iplist[1] = dir;
diff --git a/fs/libfs.c b/fs/libfs.c
index 0fa4dbe..1269ddb 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -255,9 +255,7 @@ int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *den
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
inc_nlink(inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
dget(dentry);
d_instantiate(dentry, inode);
return 0;
diff --git a/fs/logfs/dir.c b/fs/logfs/dir.c
index 90eb51f..4a09080 100644
--- a/fs/logfs/dir.c
+++ b/fs/logfs/dir.c
@@ -569,9 +569,7 @@ static int logfs_link(struct dentry *old_dentry, struct inode *dir,
return -EMLINK;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
inode->i_nlink++;
mark_inode_dirty_sync(inode);
diff --git a/fs/minix/namei.c b/fs/minix/namei.c
index a4a160f..0d5a830 100644
--- a/fs/minix/namei.c
+++ b/fs/minix/namei.c
@@ -101,9 +101,7 @@ static int minix_link(struct dentry * old_dentry, struct inode * dir,
inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
return add_nondir(dentry, inode);
}
diff --git a/fs/namei.c b/fs/namei.c
index 817d6bb..e905661 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2290,11 +2290,8 @@ static long do_unlinkat(int dfd, const char __user *pathname)
if (nd.last.name[nd.last.len])
goto slashes;
inode = dentry->d_inode;
- if (inode) {
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
- }
+ if (inode)
+ iget(inode);
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit2;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 375b6b5..c1435a9 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1580,9 +1580,7 @@ nfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
d_drop(dentry);
error = NFS_PROTO(dir)->link(inode, dir, &dentry->d_name);
if (error == 0) {
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
d_add(dentry, inode);
}
return error;
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index c6db37e..89e4ab8 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -54,10 +54,8 @@ static int nfs_superblock_set_dummy_root(struct super_block *sb, struct inode *i
iput(inode);
return -ENOMEM;
}
- /* Circumvent igrab(): we know the inode is not being freed */
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ /* We know the inode is not being freed */
+ iget(inode);
/*
* Ensure that this dentry is invisible to d_find_alias().
* Otherwise, it may be spliced into the tree by
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 129ebaa..346a9db 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -390,7 +390,7 @@ static int nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
error = radix_tree_insert(&nfsi->nfs_page_tree, req->wb_index, req);
BUG_ON(error);
if (!nfsi->npages) {
- __iget(inode);
+ iget_ilock(inode);
if (nfs_have_delegation(inode, FMODE_WRITE))
nfsi->change_attr++;
}
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index 9e287ea..99dedf3 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -219,9 +219,7 @@ static int nilfs_link(struct dentry *old_dentry, struct inode *dir,
inode->i_ctime = CURRENT_TIME;
inode_inc_link_count(inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
err = nilfs_add_nondir(dentry, inode);
if (!err)
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index e51d065..ecc8c18 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -244,7 +244,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
spin_lock(&inode->i_lock);
/*
- * We cannot __iget() an inode in state I_FREEING,
+ * We cannot iget() an inode in state I_FREEING,
* I_WILL_FREE, or I_NEW which is fine because by that point
* the inode cannot have any associated watches.
*/
@@ -255,7 +255,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
/*
* If i_count is zero, the inode cannot have any watches and
- * doing an __iget/iput with MS_ACTIVE clear would actually
+ * doing an iget/iput with MS_ACTIVE clear would actually
* evict all inodes with zero i_count from icache which is
* unnecessarily violent and may in fact be illegal to do.
*/
@@ -269,7 +269,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
/* In case fsnotify_inode_delete() drops a reference. */
if (inode != need_iput_tmp) {
- __iget(inode);
+ iget_ilock(inode);
} else
need_iput_tmp = NULL;
spin_unlock(&inode->i_lock);
@@ -279,7 +279,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
spin_lock(&next_i->i_lock);
if (next_i->i_count &&
!(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
- __iget(next_i);
+ iget_ilock(next_i);
need_iput = next_i;
}
spin_unlock(&next_i->i_lock);
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index 2e380ba..a0e45f3 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -2930,9 +2930,7 @@ static int ntfs_fill_super(struct super_block *sb, void *opt, const int silent)
}
if ((sb->s_root = d_alloc_root(vol->root_ino))) {
/* We increment i_count simulating an ntfs_iget(). */
- spin_lock(&vol->root_ino->i_lock);
- vol->root_ino->i_count++;
- spin_unlock(&vol->root_ino->i_lock);
+ iget(vol->root_ino);
ntfs_debug("Exiting, status successful.");
/* Release the default upcase if it has no users. */
mutex_lock(&ntfs_lock);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 9c46feb..d2d972b 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -741,9 +741,7 @@ static int ocfs2_link(struct dentry *old_dentry,
goto out_commit;
}
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
dentry->d_op = &ocfs2_dentry_ops;
d_instantiate(dentry, inode);
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 69bc754..697bd6e 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -910,7 +910,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
reserved = 1;
#endif
- __iget(inode);
+ iget_ilock(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb_inode_list_lock);
diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c
index 1efebb2..1de28a0 100644
--- a/fs/reiserfs/namei.c
+++ b/fs/reiserfs/namei.c
@@ -1156,9 +1156,7 @@ static int reiserfs_link(struct dentry *old_dentry, struct inode *dir,
inode->i_ctime = CURRENT_TIME_SEC;
reiserfs_update_sd(&th, inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
d_instantiate(dentry, inode);
retval = journal_end(&th, dir->i_sb, jbegin_count);
reiserfs_write_unlock(dir->i_sb);
diff --git a/fs/sysv/namei.c b/fs/sysv/namei.c
index d63da9b..1620948 100644
--- a/fs/sysv/namei.c
+++ b/fs/sysv/namei.c
@@ -126,9 +126,7 @@ static int sysv_link(struct dentry * old_dentry, struct inode * dir,
inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
return add_nondir(dentry, inode);
}
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index c204b5c..7ccb249 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -550,9 +550,7 @@ static int ubifs_link(struct dentry *old_dentry, struct inode *dir,
lock_2_inodes(dir, inode);
inc_nlink(inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
inode->i_ctime = ubifs_current_time(inode);
dir->i_size += sz_change;
dir_ui->ui_size = dir->i_size;
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index d8b0dc8..acd070b 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -1101,9 +1101,7 @@ static int udf_link(struct dentry *old_dentry, struct inode *dir,
inc_nlink(inode);
inode->i_ctime = current_fs_time(inode->i_sb);
mark_inode_dirty(inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
d_instantiate(dentry, inode);
unlock_kernel();
diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
index 8cbf920..f6e9232 100644
--- a/fs/ufs/namei.c
+++ b/fs/ufs/namei.c
@@ -180,9 +180,7 @@ static int ufs_link (struct dentry * old_dentry, struct inode * dir,
inode->i_ctime = CURRENT_TIME_SEC;
inode_inc_link_count(inode);
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
error = ufs_add_nondir(dentry, inode);
unlock_kernel();
diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index 332cdf5..8e9652f 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -352,9 +352,7 @@ xfs_vn_link(
if (unlikely(error))
return -error;
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
+ iget(inode);
d_instantiate(dentry, inode);
return 0;
}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 859628b..820a791 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -481,10 +481,8 @@ void xfs_mark_inode_dirty_sync(xfs_inode_t *);
#define IHOLD(ip) \
do { \
- spin_lock(&VFS_I(ip)->i_lock); \
ASSERT(VFS_I(ip)->i_count > 0) ; \
- VFS_I(ip)->i_count++; \
- spin_unlock(&VFS_I(ip)->i_lock); \
+ iget(VFS_I(ip)); \
trace_xfs_ihold(ip, _THIS_IP_); \
} while (0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d2ee5d0..b07847c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2394,12 +2394,20 @@ extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt);
extern void save_mount_options(struct super_block *sb, char *options);
extern void replace_mount_options(struct super_block *sb, char *options);
-static inline void __iget(struct inode *inode)
+static inline void iget_ilock(struct inode *inode)
{
assert_spin_locked(&inode->i_lock);
+ BUG_ON(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE));
inode->i_count++;
}
+static inline void iget(struct inode *inode)
+{
+ spin_lock(&inode->i_lock);
+ iget_ilock(inode);
+ spin_unlock(&inode->i_lock);
+}
+
static inline ino_t parent_ino(struct dentry *dentry)
{
ino_t res;
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 7fe9efb..625dcfa 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -768,11 +768,8 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
}
inode = dentry->d_inode;
- if (inode) {
- spin_lock(&inode->i_lock);
- inode->i_count++;
- spin_unlock(&inode->i_lock);
- }
+ if (inode)
+ iget(inode);
err = mnt_want_write(ipc_ns->mq_mnt);
if (err)
goto out_err;
diff --git a/kernel/futex.c b/kernel/futex.c
index e9dfa00..b587a8b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -168,9 +168,7 @@ static void get_futex_key_refs(union futex_key *key)
switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
case FUT_OFF_INODE:
- spin_lock(&key->shared.inode->i_lock);
- key->shared.inode->i_count++;
- spin_unlock(&key->shared.inode->i_lock);
+ iget(key->shared.inode);
break;
case FUT_OFF_MMSHARED:
atomic_inc(&key->private.mm->mm_count);
diff --git a/mm/shmem.c b/mm/shmem.c
index b83b442..b24943a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1903,9 +1903,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
dir->i_size += BOGO_DIRENT_SIZE;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
inc_nlink(inode);
- spin_lock(&inode->i_lock);
- inode->i_count++; /* New dentry reference */
- spin_unlock(&inode->i_lock);
+ iget(inode);
dget(dentry); /* Extra pinning count for the created dentry */
d_instantiate(dentry, inode);
out:
diff --git a/net/socket.c b/net/socket.c
index 5431af1..c5afc82 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -377,9 +377,7 @@ static int sock_alloc_file(struct socket *sock, struct file **f, int flags)
&socket_file_ops);
if (unlikely(!file)) {
/* drop dentry, keep inode */
- spin_lock(&path.dentry->d_inode->i_lock);
- path.dentry->d_inode->i_count++;
- spin_unlock(&path.dentry->d_inode->i_lock);
+ iget(path.dentry->d_inode);
path_put(&path);
put_unused_fd(fd);
return -ENFILE;
--
1.7.1
^ permalink raw reply related [flat|nested] 111+ messages in thread
* Re: [PATCH 0/17] fs: Inode cache scalability
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (16 preceding siblings ...)
2010-09-29 12:18 ` [PATCH 17/17] fs: Clean up inode reference counting Dave Chinner
@ 2010-09-29 23:57 ` Christoph Hellwig
2010-09-30 0:24 ` Dave Chinner
2010-09-30 2:21 ` Christoph Hellwig
2010-10-02 23:10 ` Carlos Carvalho
19 siblings, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-09-29 23:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 10:18:32PM +1000, Dave Chinner wrote:
> I've only ported the patches so far, without changing anything
> significant other than the comit descriptions. One thing that has
> stood out as I've done this is that the ordering of the patches is
> not ideal, and some things (like the inode counters) are modified
> multiple times through the patch set. I'm quite happy to
> reorder/rework the series to fix these problems if that is desired.
There are two obvious ordering issues: first, the inode counters that you
mentioned. I think this is easily fixed by simply dropping both patches
messing with them - we should have the new locks protecting them in place once
inode_lock is dropped. The other one is the clean up inode reference
counting patch, which sounds like it should be earlier in the series so
that we have the helpers in place before touching all places that
opencode an inode reference count increment.
Thanks for picking up the work on this while Nick is travelling; some
more comments on the individual patches will follow.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 0/17] fs: Inode cache scalability
2010-09-29 23:57 ` [PATCH 0/17] fs: Inode cache scalability Christoph Hellwig
@ 2010-09-30 0:24 ` Dave Chinner
0 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-30 0:24 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 07:57:16PM -0400, Christoph Hellwig wrote:
> On Wed, Sep 29, 2010 at 10:18:32PM +1000, Dave Chinner wrote:
> > I've only ported the patches so far, without changing anything
> > significant other than the comit descriptions. One thing that has
> > stood out as I've done this is that the ordering of the patches is
> > not ideal, and some things (like the inode counters) are modified
> > multiple times through the patch set. I'm quite happy to
> > reorder/rework the series to fix these problems if that is desired.
>
> There are two obvious ordering issues: first, the inode counters that you
> mentioned. I think this is easily fixed by simply dropping both patches
> messing with them - we should have the new locks protecting them in place once
> inode_lock is dropped. The other one is the clean up inode reference
> counting patch, which sounds like it should be earlier in the series so
> that we have the helpers in place before touching all places that
> opencode an inode reference count increment.
Yeah, I thought you'd want that. ;) I'll reorder the iget helper
patch to be the first in the series which should reduce churn quite
a bit, and then convert both the nr_inode and nr_unused counters to
be per-cpu before any of the other modifications and so they can be
ignored completely in later patches.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 12/17] fs: Introduce per-bucket inode hash locks
2010-09-29 12:18 ` [PATCH 12/17] fs: Introduce per-bucket inode hash locks Dave Chinner
@ 2010-09-30 1:52 ` Christoph Hellwig
2010-09-30 2:43 ` Dave Chinner
2010-10-16 7:55 ` Nick Piggin
0 siblings, 2 replies; 111+ messages in thread
From: Christoph Hellwig @ 2010-09-30 1:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
Instead of doing the lock overkill on a still fundamentally global data
structure, what about replacing this with something better? I know
you've already done this with the XFS icache, and while the per-AG
concept obviously can't be generic, at least some of the lessons could be
applied.
Then again, how much testing did this get anyway, given that your
benchmark ran mostly on XFS, which doesn't hit this at all?
If it was up to me I'd drop this (and the bl_list addition) from the
series for now and wait for people who care about the scalability of
the generic icache code to come up with a better data structure.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 13/17] fs: Implement lazy LRU updates for inodes.
2010-09-29 12:18 ` [PATCH 13/17] fs: Implement lazy LRU updates for inodes Dave Chinner
@ 2010-09-30 2:05 ` Christoph Hellwig
2010-10-16 7:54 ` Nick Piggin
0 siblings, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-09-30 2:05 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
> @@ -1058,8 +1051,6 @@ static void wait_sb_inodes(struct super_block *sb)
> */
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> - spin_lock(&sb_inode_list_lock);
> -
> /*
> * Data integrity sync. Must wait for all pages under writeback,
> * because there may have been pages dirtied before our sync
> @@ -1067,6 +1058,7 @@ static void wait_sb_inodes(struct super_block *sb)
> * In which case, the inode may not be on the dirty list, but
> * we still have to wait for that writeout.
> */
> + spin_lock(&sb_inode_list_lock);
I think this should be folded back into the patch introducing
sb_inode_list_lock.
> @@ -1083,10 +1075,10 @@ static void wait_sb_inodes(struct super_block *sb)
> spin_unlock(&sb_inode_list_lock);
> /*
> * We hold a reference to 'inode' so it couldn't have been
> - * removed from s_inodes list while we dropped the
> - * sb_inode_list_lock. We cannot iput the inode now as we can
> - * be holding the last reference and we cannot iput it under
> - * spinlock. So we keep the reference and iput it later.
> + * removed from s_inodes list while we dropped the i_lock. We
> + * cannot iput the inode now as we can be holding the last
> + * reference and we cannot iput it under spinlock. So we keep
> + * the reference and iput it later.
This also looks like a hunk that got in by accident and should be merged
into an earlier patch.
> @@ -431,11 +412,12 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
> invalidate_inode_buffers(inode);
> if (!inode->i_count) {
> spin_lock(&wb_inode_list_lock);
> - list_move(&inode->i_list, dispose);
> + list_del(&inode->i_list);
> spin_unlock(&wb_inode_list_lock);
> WARN_ON(inode->i_state & I_NEW);
> inode->i_state |= I_FREEING;
> spin_unlock(&inode->i_lock);
> + list_add(&inode->i_list, dispose);
Moving the list_add out of the lock looks fine, but I can't really
see how it's related to the rest of the patch.
> + if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
> + list_del_init(&inode->i_list);
> + spin_unlock(&inode->i_lock);
> + atomic_dec(&inodes_stat.nr_unused);
> + continue;
> + }
> + if (inode->i_state) {
Slightly confusing but okay given the only i_state that will get us here
is I_REFERENCED. Do we really care about the additional cycle or two a
dumb compiler might generate when writing
if (inode->i_state & I_REFERENCED)
?
> if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> + list_move(&inode->i_list, &inode_unused);
Why are we now moving the inode to the front of the list?
> @@ -687,9 +652,6 @@ __inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
> atomic_inc(&inodes_stat.nr_inodes);
> list_add(&inode->i_sb_list, &sb->s_inodes);
> spin_unlock(&sb_inode_list_lock);
> - spin_lock(&wb_inode_list_lock);
> - list_add(&inode->i_list, &inode_in_use);
> - spin_unlock(&wb_inode_list_lock);
> if (b) {
> spin_lock_bucket(b);
> hlist_bl_add_head(&inode->i_hash, &b->head);
At some point it might be worth splitting this into
inode_add_to_sb_list
and
__inode_add_to_hash
but that can be left for later.
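Roughly along these lines, say (just a sketch derived from the hunk above;
it assumes a spin_unlock_bucket() counterpart to the spin_lock_bucket()
shown there, and leaves the nr_inodes accounting out):
	static void inode_add_to_sb_list(struct super_block *sb, struct inode *inode)
	{
		spin_lock(&sb_inode_list_lock);
		list_add(&inode->i_sb_list, &sb->s_inodes);
		spin_unlock(&sb_inode_list_lock);
	}

	static void __inode_add_to_hash(struct inode_hash_bucket *b, struct inode *inode)
	{
		spin_lock_bucket(b);
		hlist_bl_add_head(&inode->i_hash, &b->head);
		spin_unlock_bucket(b);
	}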
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 15/17] fs: inode per-cpu last_ino allocator
2010-09-29 12:18 ` [PATCH 15/17] fs: inode per-cpu last_ino allocator Dave Chinner
@ 2010-09-30 2:07 ` Christoph Hellwig
2010-10-06 6:29 ` Dave Chinner
2010-09-30 4:53 ` Andrew Morton
1 sibling, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-09-30 2:07 UTC (permalink / raw)
To: Dave Chinner, dada1; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 10:18:47PM +1000, Dave Chinner wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
>
> last_ino was converted to an atomic variable to allow the inode_lock
> to go away. However, contended atomics do not scale on large
> machines, and new_inode() triggers excessive contention in such
> situations.
And the good thing is most users of new_inode couldn't care less about
the fake i_ino assigned because they have a real inode number. So
the first step is to move the i_ino assignment into a separate helper
and only use it in those filesystems that need it. Second step is
to figure out why some filesystems need iunique() and some are fine
with the incrementing counter and then we should find a scalable way
to generate an inode number - preferably just one and not two, but if
that's not possible we need some documentation on why which one is
needed.
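Something like this, perhaps (sketch only; the helper name is made up):
	/*
	 * Only for filesystems that have no real inode numbers and just
	 * need a unique-ish cookie in i_ino.
	 */
	static inline void inode_assign_fake_ino(struct inode *inode)
	{
		inode->i_ino = last_ino_get();
	}
so that filesystems with real inode numbers never touch last_ino at all.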
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter
2010-09-29 12:18 ` [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter Dave Chinner
@ 2010-09-30 2:12 ` Christoph Hellwig
2010-09-30 4:53 ` Andrew Morton
1 sibling, 0 replies; 111+ messages in thread
From: Christoph Hellwig @ 2010-09-30 2:12 UTC (permalink / raw)
To: Dave Chinner, dada1; +Cc: linux-fsdevel, linux-kernel
> +/*
> + * Handle nr_dentry sysctl
> + */
> +int proc_nr_inodes(ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> + inodes_stat.nr_inodes = get_nr_inodes();
> + return proc_dointvec(table, write, buffer, lenp, ppos);
> +#else
> + return -ENOSYS;
> +#endif
> +}
Why would we even bother to define the handler if we don't have
/proc/sys/ support?
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 17/17] fs: Clean up inode reference counting
2010-09-29 12:18 ` [PATCH 17/17] fs: Clean up inode reference counting Dave Chinner
@ 2010-09-30 2:15 ` Christoph Hellwig
2010-10-16 7:55 ` Nick Piggin
2010-09-30 4:53 ` Andrew Morton
1 sibling, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-09-30 2:15 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
Besides moving this much earlier in the series as mentioned before,
the most important thing is giving the helpers a different name, as
the iget name is already used for a different purpose, even if we
got rid of the original iget and only have iget_locked. I think
iref/iref_locked would be good enough names for the helpers.
One other thing is that a lot of the current igrab users could
use this helper instead - there are very few places where we look at
inodes that don't already have a permanent reference when we try
to acquire another one.
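I.e. nothing more than the rename applied to the helpers from patch 17
(sketch):
	static inline void iref_locked(struct inode *inode)
	{
		assert_spin_locked(&inode->i_lock);
		BUG_ON(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE));
		inode->i_count++;
	}

	static inline void iref(struct inode *inode)
	{
		spin_lock(&inode->i_lock);
		iref_locked(inode);
		spin_unlock(&inode->i_lock);
	}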
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 0/17] fs: Inode cache scalability
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (17 preceding siblings ...)
2010-09-29 23:57 ` [PATCH 0/17] fs: Inode cache scalability Christoph Hellwig
@ 2010-09-30 2:21 ` Christoph Hellwig
2010-10-02 23:10 ` Carlos Carvalho
19 siblings, 0 replies; 111+ messages in thread
From: Christoph Hellwig @ 2010-09-30 2:21 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
Btw, another relatively easy patch from Nick's series is:
http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git;a=commitdiff;h=950bbff18b19663abc2d260e352efd51fd79dbf7
although I think the second hunk could be done better by simply not
calling __inode_add_to_lists but just factoring out a hunk to add the
inode only to the i_sb_list which is all that's left after the patches
in this series. And of course both hunks should be separate patches.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 12/17] fs: Introduce per-bucket inode hash locks
2010-09-30 1:52 ` Christoph Hellwig
@ 2010-09-30 2:43 ` Dave Chinner
2010-10-16 7:55 ` Nick Piggin
1 sibling, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-30 2:43 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 09:52:14PM -0400, Christoph Hellwig wrote:
>
> Instead of doing the lock overkill on a still fundamentally global data
> structure, what about replacing this with something better? I know
> you've already done this with the XFS icache, and while the per-AG
> concept obviously can't be generic, at least some of the lessons could be
> applied.
The XFS inode cache design is tied tightly to the inode layout in
XFS, so the tree-per-ag-per-mount parallelism design really does not
work in a generic manner. Sure, we could probably make it a
hashed-tree rather than hashed-link-list design, but that's a much
more fundamental change than just splitting the locks up.
> Then again, how much testing did this get anyway, given that your
> benchmark ran mostly on XFS, which doesn't hit this at all?
I've been running comparative benchmarks on ext4 as well so that I
also test all the generic paths.
> If it was up to me I'd drop this (and the bl_list addition) from the
> series for now and wait for people who care about the scalability of
> the generic icache code to come up with a better data structure.
I think that it's going to take a lot of work to come up with
something more generically optimal, so in the mean time I think this
is a net win for filesystems that use the generic icache.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 01/17] kernel: add bl_list
2010-09-29 12:18 ` [PATCH 01/17] kernel: add bl_list Dave Chinner
@ 2010-09-30 4:52 ` Andrew Morton
2010-10-16 7:55 ` Nick Piggin
2010-10-01 5:48 ` Christoph Hellwig
1 sibling, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-09-30 4:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, 29 Sep 2010 22:18:33 +1000 Dave Chinner <david@fromorbit.com> wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> Introduce a type of hlist that can support the use of the lowest bit
> in the hlist_head. This will be subsequently used to implement
> per-bucket bit spinlock for inode hashes.
>
>
> ...
>
> +static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
> +{
> + h->next = NULL;
> + h->pprev = NULL;
> +}
No need to shout.
>
> ...
>
> +static inline void hlist_bl_del(struct hlist_bl_node *n)
> +{
> + __hlist_bl_del(n);
> + n->next = LIST_POISON1;
> + n->pprev = LIST_POISON2;
> +}
I'd suggest creating new poison values for hlist_bl's, leave
LIST_POISON1 and LIST_POISON2 for list_head (and any other list
variants which went and used them :()
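Something like this, say (sketch; the values are made up, just distinct
from the list_head ones):
	/* non-dereferenceable poison values private to hlist_bl */
	#define LIST_BL_POISON1	((void *) 0x00300300)
	#define LIST_BL_POISON2	((void *) 0x00400400)

	static inline void hlist_bl_del(struct hlist_bl_node *n)
	{
		__hlist_bl_del(n);
		n->next = LIST_BL_POISON1;
		n->pprev = LIST_BL_POISON2;
	}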
>
> ...
>
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 03/17] fs: icache lock inode hash
2010-09-29 12:18 ` [PATCH 03/17] fs: icache lock inode hash Dave Chinner
@ 2010-09-30 4:52 ` Andrew Morton
2010-09-30 6:13 ` Dave Chinner
2010-10-01 6:06 ` Christoph Hellwig
1 sibling, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-09-30 4:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, 29 Sep 2010 22:18:35 +1000 Dave Chinner <david@fromorbit.com> wrote:
> DEFINE_SPINLOCK(inode_lock);
> DEFINE_SPINLOCK(sb_inode_list_lock);
> +DEFINE_SPINLOCK(inode_hash_lock);
I assume these all go away later on. If not, they'll probably all land
in the same cacheline!
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 05/17] fs: icache lock i_count
2010-09-29 12:18 ` [PATCH 05/17] fs: icache lock i_count Dave Chinner
@ 2010-09-30 4:52 ` Andrew Morton
2010-10-01 5:55 ` Christoph Hellwig
0 siblings, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-09-30 4:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, 29 Sep 2010 22:18:37 +1000 Dave Chinner <david@fromorbit.com> wrote:
> - if (atomic_read(&inode->i_count) != 1)
> + if (inode->i_count != 1)
This really should have been renamed to catch unconverted code.
Such code usually wouldn't compile anyway, but it will if it takes the
address of i_count only (for example).
And maybe we should access this guy via accessor functions, dunno.
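E.g. an accessor could be as simple as (sketch; the name is made up):
	/* read-only accessor so open-coded i_count users stand out */
	static inline unsigned int inode_refcount(struct inode *inode)
	{
		return inode->i_count;
	}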
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 06/17] fs: icache lock lru/writeback lists
2010-09-29 12:18 ` [PATCH 06/17] fs: icache lock lru/writeback lists Dave Chinner
@ 2010-09-30 4:52 ` Andrew Morton
2010-09-30 6:16 ` Dave Chinner
2010-10-16 7:55 ` Nick Piggin
2010-10-01 6:01 ` Christoph Hellwig
1 sibling, 2 replies; 111+ messages in thread
From: Andrew Morton @ 2010-09-30 4:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, 29 Sep 2010 22:18:38 +1000 Dave Chinner <david@fromorbit.com> wrote:
> The inode moves between different lists protected by the inode_lock. Introduce
> a new lock that protects all of the lists (dirty, unused, in use, etc) that the
> inode will move around as it changes state. As this is mostly a list for
> protecting the writeback lists, name it wb_inode_list_lock and nest all the
> list manipulations in this lock inside the current inode_lock scope.
All those spin_trylock()s are real ugly. They're unexplained in the
changelog and unexplained in code comments.
I'd suggest that each such site have a comment explaining why we're
resorting to this.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 07/17] fs: icache atomic inodes_stat
2010-09-29 12:18 ` [PATCH 07/17] fs: icache atomic inodes_stat Dave Chinner
@ 2010-09-30 4:52 ` Andrew Morton
2010-09-30 6:20 ` Dave Chinner
2010-10-16 7:56 ` Nick Piggin
0 siblings, 2 replies; 111+ messages in thread
From: Andrew Morton @ 2010-09-30 4:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, 29 Sep 2010 22:18:39 +1000 Dave Chinner <david@fromorbit.com> wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> The inode use statistics are currently protected by the inode_lock.
> Before we can remove the inode_lock, we need to protect these
> counters against races. Do this by converting them to atomic
> counters so they ar enot dependent on any lock at all.
typo
>
> ...
>
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -764,7 +764,8 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
> wb->last_old_flush = jiffies;
> nr_pages = global_page_state(NR_FILE_DIRTY) +
> global_page_state(NR_UNSTABLE_NFS) +
> - (inodes_stat.nr_inodes - inodes_stat.nr_unused);
> + (atomic_read(&inodes_stat.nr_inodes) -
> + atomic_read(&inodes_stat.nr_unused));
race bug.
> if (nr_pages) {
> struct wb_writeback_work work = {
> @@ -1144,7 +1145,8 @@ void writeback_inodes_sb(struct super_block *sb)
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> work.nr_pages = nr_dirty + nr_unstable +
> - (inodes_stat.nr_inodes - inodes_stat.nr_unused);
> + (atomic_read(&inodes_stat.nr_inodes) -
> + atomic_read(&inodes_stat.nr_unused));
and another.
OK, they aren't serious ones. But known regressions shouldn't be snuck
into the kernel unchangelogged and uncommented :(
> bdi_queue_work(sb->s_bdi, &work);
> wait_for_completion(&done);
>
> ...
>
> -struct inodes_stat_t {
> - int nr_inodes;
> - int nr_unused;
> - int dummy[5]; /* padding for sysctl ABI compatibility */
> -};
> -
> -
> #define NR_FILE 8192 /* this can well be larger on a larger system */
>
> #define MAY_EXEC 1
> @@ -416,6 +409,12 @@ typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> ssize_t bytes, void *private, int ret,
> bool is_async);
>
> +struct inodes_stat_t {
> + atomic_t nr_inodes;
> + atomic_t nr_unused;
> + int dummy[5]; /* padding for sysctl ABI compatibility */
> +};
OK, that's a hack. The first two "ints" are copied out to userspace.
This change assumes that sizeof(atomic_t)=4 and that an atomic_t has
the same layout, alignment and padding as an int.
Probably that's true in current kernels and with current architectures
but it's a hack and it's presumptive.
It shouldn't be snuck into the tree unchangelogged and uncommented.
(time passes)
OK, I see that all of this gets reverted later on. Please update the
changelog so the next reviewer doesn't get fooled.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 09/17] fs: Make last_ino, iunique independent of inode_lock
2010-09-29 12:18 ` [PATCH 09/17] fs: Make last_ino, iunique independent of inode_lock Dave Chinner
@ 2010-09-30 4:53 ` Andrew Morton
2010-10-01 6:08 ` Christoph Hellwig
1 sibling, 0 replies; 111+ messages in thread
From: Andrew Morton @ 2010-09-30 4:53 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, 29 Sep 2010 22:18:41 +1000 Dave Chinner <david@fromorbit.com> wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> Before removing the inode_lock, we need to make the last_ino and iunique
> counters independent of the inode_lock. last_ino can be trivially converted to
> an atomic variable, while the iunique counter needs a new lock nested inside
> the inode_lock to provide the same protection that the inode_lock previously
> provided.
>
>
> ...
>
> +static int test_inode_iunique(struct super_block * sb, struct hlist_head *head, unsigned long ino)
> +{
> + struct hlist_node *node;
> + struct inode * inode = NULL;
> +
> + spin_lock(&inode_hash_lock);
> + hlist_for_each_entry(inode, node, head, i_hash) {
> + if (inode->i_ino == ino && inode->i_sb == sb) {
> + spin_unlock(&inode_hash_lock);
> + return 0;
> + }
> + }
> + spin_unlock(&inode_hash_lock);
> + return 1;
> +}
Please run all the patches through checkpatch.
Please document this function. Why does it exist? What does it do?
How does it do it? Try to improve the code!
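Something along these lines, perhaps (sketch of a possible comment, based
only on what the quoted code does):
	/*
	 * test_inode_iunique - check whether an inode number is free on a superblock
	 * @sb:   superblock the number would be used on
	 * @head: hash chain that @ino hashes to
	 * @ino:  candidate inode number
	 *
	 * Walks the hash chain under inode_hash_lock and returns 1 if no inode
	 * on @sb currently uses @ino, 0 otherwise.  Used by iunique() to probe
	 * candidate inode numbers.
	 */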
>
> ...
>
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 15/17] fs: inode per-cpu last_ino allocator
2010-09-29 12:18 ` [PATCH 15/17] fs: inode per-cpu last_ino allocator Dave Chinner
2010-09-30 2:07 ` Christoph Hellwig
@ 2010-09-30 4:53 ` Andrew Morton
2010-09-30 5:36 ` Eric Dumazet
1 sibling, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-09-30 4:53 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, 29 Sep 2010 22:18:47 +1000 Dave Chinner <david@fromorbit.com> wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
>
> last_ino was converted to an atomic variable to allow the inode_lock
> to go away. However, contended atomics do not scale on large
> machines, and new_inode() triggers excessive contention in such
> situations.
>
> Solve this problem by providing to each cpu a per_cpu variable,
> feeded by the shared last_ino, but once every 1024 allocations.
> This reduces contention on the shared last_ino, and give same
> spreading ino numbers than before (i.e. same wraparound after 2^32
> allocations).
>
> [npiggin: some extra commenting and use of defines]
>
> ...
>
> +#ifdef CONFIG_SMP
> +#define LAST_INO_BATCH 1024
> +/*
> + * Each cpu owns a range of LAST_INO_BATCH numbers.
> + * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
> + * to renew the exhausted range.
> + *
> + * This does not significantly increase overflow rate because every CPU can
> + * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
> + * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
> + * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
> + * overflow rate by 2x, which does not seem too significant.
> + *
> + * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
> + * error if st_ino won't fit in target struct field. Use 32bit counter
> + * here to attempt to avoid that.
> + */
> +static DEFINE_PER_CPU(unsigned int, last_ino);
> +static atomic_t shared_last_ino;
> +
> +static unsigned int last_ino_get(void)
> +{
> + unsigned int *p = &get_cpu_var(last_ino);
> + unsigned int res = *p;
> +
> + if (unlikely((res & (LAST_INO_BATCH-1)) == 0))
> + res = (unsigned int)atomic_add_return(LAST_INO_BATCH,
> + &shared_last_ino) - LAST_INO_BATCH;
May as well remove the "- LAST_INO_BATCH" there, I think. It'll skew
the results a tad at startup, but why does that matter?
> + *p = ++res;
> + put_cpu_var(last_ino);
> + return res;
> +}
> +#else
> +static unsigned int last_ino_get(void)
> +{
> + static unsigned int last_ino;
> +
> + return ++last_ino;
> +}
This is racy with CONFIG_PREEMPT on some architectures, I suspect. I'd
suggest conversion to atomic_t with, of course, an explanatory comment ;)
> +#endif
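E.g. (sketch):
	static unsigned int last_ino_get(void)
	{
		static atomic_t last_ino;

		return (unsigned int)atomic_inc_return(&last_ino);
	}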
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter
2010-09-29 12:18 ` [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter Dave Chinner
2010-09-30 2:12 ` Christoph Hellwig
@ 2010-09-30 4:53 ` Andrew Morton
2010-09-30 6:10 ` Dave Chinner
2010-10-02 16:02 ` Christoph Hellwig
1 sibling, 2 replies; 111+ messages in thread
From: Andrew Morton @ 2010-09-30 4:53 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, 29 Sep 2010 22:18:48 +1000 Dave Chinner <david@fromorbit.com> wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
>
> The number of inodes allocated does not need to be tied to the
> addition or removal of an inode to/from a list. If we are not tied
> to a list lock, we could update the counters when inodes are
> initialised or destroyed, but to do that we need to convert the
> counters to be per-cpu (i.e. independent of a lock). This means that
> we have the freedom to change the list/locking implementation
> without needing to care about the counters.
>
>
> ...
>
> +int get_nr_inodes(void)
> +{
> + int i;
> + int sum = 0;
> + for_each_possible_cpu(i)
> + sum += per_cpu(nr_inodes, i);
> + return sum < 0 ? 0 : sum;
> +}
This reimplements percpu_counter_sum_positive(), rather poorly
If one never intends to use the approximate percpu_counter_read() then
one could initialise the counter with a really large batch value, for a
very small performance gain.
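I.e. roughly (sketch using the existing <linux/percpu_counter.h> API; where
the init call lives is a guess):
	static struct percpu_counter nr_inodes;

	static int __init inode_counter_init(void)	/* made-up init hook */
	{
		return percpu_counter_init(&nr_inodes, 0);
	}

	int get_nr_inodes(void)
	{
		return percpu_counter_sum_positive(&nr_inodes);
	}
with percpu_counter_inc()/percpu_counter_dec() in the allocation and
freeing paths.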
> +int get_nr_inodes_unused(void)
> +{
> + return inodes_stat.nr_unused;
> +}
>
> ...
>
> @@ -407,6 +407,8 @@ extern struct files_stat_struct files_stat;
> extern int get_max_files(void);
> extern int sysctl_nr_open;
> extern struct inodes_stat_t inodes_stat;
> +extern int get_nr_inodes(void);
> +extern int get_nr_inodes_unused(void);
These are pretty cruddy names. Unfortunately we don't really have a vfs
or "inode" subsystem name to prefix them with.
>
> ...
>
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 17/17] fs: Clean up inode reference counting
2010-09-29 12:18 ` [PATCH 17/17] fs: Clean up inode reference counting Dave Chinner
2010-09-30 2:15 ` Christoph Hellwig
@ 2010-09-30 4:53 ` Andrew Morton
1 sibling, 0 replies; 111+ messages in thread
From: Andrew Morton @ 2010-09-30 4:53 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, 29 Sep 2010 22:18:49 +1000 Dave Chinner <david@fromorbit.com> wrote:
> -static inline void __iget(struct inode *inode)
> +static inline void iget_ilock(struct inode *inode)
> {
> assert_spin_locked(&inode->i_lock);
> + BUG_ON(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE));
> inode->i_count++;
> }
>
> +static inline void iget(struct inode *inode)
> +{
> + spin_lock(&inode->i_lock);
> + iget_ilock(inode);
> + spin_unlock(&inode->i_lock);
> +}
I suspect that for typical configs, these functions will generate amazing
amounts of code.
Please measure this. We may decide to uninline both.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 15/17] fs: inode per-cpu last_ino allocator
2010-09-30 4:53 ` Andrew Morton
@ 2010-09-30 5:36 ` Eric Dumazet
2010-09-30 7:53 ` Eric Dumazet
0 siblings, 1 reply; 111+ messages in thread
From: Eric Dumazet @ 2010-09-30 5:36 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
Le mercredi 29 septembre 2010 à 21:53 -0700, Andrew Morton a écrit :
> On Wed, 29 Sep 2010 22:18:47 +1000 Dave Chinner <david@fromorbit.com> wrote:
>
> > From: Eric Dumazet <dada1@cosmosbay.com>
> >
Please note my new email address, thanks.
> > last_ino was converted to an atomic variable to allow the inode_lock
> > to go away. However, contended atomics do not scale on large
> > machines, and new_inode() triggers excessive contention in such
> > situations.
> >
> > Solve this problem by providing to each cpu a per_cpu variable,
> > feeded by the shared last_ino, but once every 1024 allocations.
> > This reduces contention on the shared last_ino, and give same
> > spreading ino numbers than before (i.e. same wraparound after 2^32
> > allocations).
> >
> > [npiggin: some extra commenting and use of defines]
> >
> > ...
> >
> > +#ifdef CONFIG_SMP
> > +#define LAST_INO_BATCH 1024
> > +/*
> > + * Each cpu owns a range of LAST_INO_BATCH numbers.
> > + * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
> > + * to renew the exhausted range.
> > + *
> > + * This does not significantly increase overflow rate because every CPU can
> > + * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
> > + * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
> > + * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
> > + * overflow rate by 2x, which does not seem too significant.
> > + *
> > + * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
> > + * error if st_ino won't fit in target struct field. Use 32bit counter
> > + * here to attempt to avoid that.
> > + */
> > +static DEFINE_PER_CPU(unsigned int, last_ino);
> > +static atomic_t shared_last_ino;
> > +
> > +static unsigned int last_ino_get(void)
> > +{
> > + unsigned int *p = &get_cpu_var(last_ino);
> > + unsigned int res = *p;
> > +
> > + if (unlikely((res & (LAST_INO_BATCH-1)) == 0))
> > + res = (unsigned int)atomic_add_return(LAST_INO_BATCH,
> > + &shared_last_ino) - LAST_INO_BATCH;
>
> May as well remove the "- LAST_INO_BATCH" there, I think. It'll skew
> the results a tad at startup, but why does that matter?
Because on x86, atomic_add_return(val, ptr) uses xadd() + val
So, using "atomic_add_return(val, ptr) - val" removes one instruction ;)
>
> > + *p = ++res;
> > + put_cpu_var(last_ino);
> > + return res;
> > +}
> > +#else
> > +static unsigned int last_ino_get(void)
> > +{
> > + static unsigned int last_ino;
> > +
> > + return ++last_ino;
> > +}
>
> This is racy with CONFIG_PREEMPT on some architectures, I suspect. I'd
> suggest conversion to atomic_t with, of course, an explanatory comment ;)
>
Thanks, I'll rework the patch !
I am pretty happy to see some interest in this patch series,
eventually :)
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter
2010-09-30 4:53 ` Andrew Morton
@ 2010-09-30 6:10 ` Dave Chinner
2010-10-16 7:55 ` Nick Piggin
2010-10-02 16:02 ` Christoph Hellwig
1 sibling, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2010-09-30 6:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 09:53:22PM -0700, Andrew Morton wrote:
> On Wed, 29 Sep 2010 22:18:48 +1000 Dave Chinner <david@fromorbit.com> wrote:
>
> > From: Eric Dumazet <dada1@cosmosbay.com>
> >
> > The number of inodes allocated does not need to be tied to the
> > addition or removal of an inode to/from a list. If we are not tied
> > to a list lock, we could update the counters when inodes are
> > initialised or destroyed, but to do that we need to convert the
> > counters to be per-cpu (i.e. independent of a lock). This means that
> > we have the freedom to change the list/locking implementation
> > without needing to care about the counters.
> >
> >
> > ...
> >
> > +int get_nr_inodes(void)
> > +{
> > + int i;
> > + int sum = 0;
> > + for_each_possible_cpu(i)
> > + sum += per_cpu(nr_inodes, i);
> > + return sum < 0 ? 0 : sum;
> > +}
>
> This reimplements percpu_counter_sum_positive(), rather poorly
I thought so - it was on my list of things to check when redoing
this patch. I'll fix that up appropriately.
>
> If one never intends to use the approximate percpu_counter_read() then
> one could initialise the counter with a really large batch value, for a
> very small performance gain.
>
> > +int get_nr_inodes_unused(void)
> > +{
> > + return inodes_stat.nr_unused;
> > +}
> >
> > ...
> >
> > @@ -407,6 +407,8 @@ extern struct files_stat_struct files_stat;
> > extern int get_max_files(void);
> > extern int sysctl_nr_open;
> > extern struct inodes_stat_t inodes_stat;
> > +extern int get_nr_inodes(void);
> > +extern int get_nr_inodes_unused(void);
>
> These are pretty cruddy names. Unfortunately we don't really have a vfs
> or "inode" subsystem name to prefix them with.
Will fix.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 03/17] fs: icache lock inode hash
2010-09-30 4:52 ` Andrew Morton
@ 2010-09-30 6:13 ` Dave Chinner
0 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-30 6:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 09:52:20PM -0700, Andrew Morton wrote:
> On Wed, 29 Sep 2010 22:18:35 +1000 Dave Chinner <david@fromorbit.com> wrote:
>
> > DEFINE_SPINLOCK(inode_lock);
> > DEFINE_SPINLOCK(sb_inode_list_lock);
> > +DEFINE_SPINLOCK(inode_hash_lock);
>
> I assume these all go away later on. If not, they'll probably all land
> in the same cacheline!
Indeed, the hash lock goes away in this series, as does the
inode_lock. I'll check that the other new locks don't land in the same
cacheline.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 06/17] fs: icache lock lru/writeback lists
2010-09-30 4:52 ` Andrew Morton
@ 2010-09-30 6:16 ` Dave Chinner
2010-10-16 7:55 ` Nick Piggin
1 sibling, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2010-09-30 6:16 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 09:52:40PM -0700, Andrew Morton wrote:
> On Wed, 29 Sep 2010 22:18:38 +1000 Dave Chinner <david@fromorbit.com> wrote:
>
> > The inode moves between different lists protected by the inode_lock. Introduce
> > a new lock that protects all of the lists (dirty, unused, in use, etc) that the
> > inode will move around as it changes state. As this is mostly a list for
> > protecting the writeback lists, name it wb_inode_list_lock and nest all the
> > list manipulations in this lock inside the current inode_lock scope.
>
> All those spin_trylock()s are real ugly. They're unexplained in the
> changelog and unexplained in code comments.
Yes, they are, but I don't know exactly why it is so trylock happy.
I'll try to dig out the reason for it and:
> I'd suggest that each such site have a comment explaining why we're
> resorting to this.
At least get this far.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 07/17] fs: icache atomic inodes_stat
2010-09-30 4:52 ` Andrew Morton
@ 2010-09-30 6:20 ` Dave Chinner
2010-09-30 6:37 ` Andrew Morton
2010-10-16 7:56 ` Nick Piggin
1 sibling, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2010-09-30 6:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 09:52:53PM -0700, Andrew Morton wrote:
> On Wed, 29 Sep 2010 22:18:39 +1000 Dave Chinner <david@fromorbit.com> wrote:
>
> > From: Nick Piggin <npiggin@suse.de>
> >
> > The inode use statistics are currently protected by the inode_lock.
> > Before we can remove the inode_lock, we need to protect these
> > counters against races. Do this by converting them to atomic
> > counters so they ar enot dependent on any lock at all.
>
> typo
>
> >
> > ...
> >
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -764,7 +764,8 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
> > wb->last_old_flush = jiffies;
> > nr_pages = global_page_state(NR_FILE_DIRTY) +
> > global_page_state(NR_UNSTABLE_NFS) +
> > - (inodes_stat.nr_inodes - inodes_stat.nr_unused);
> > + (atomic_read(&inodes_stat.nr_inodes) -
> > + atomic_read(&inodes_stat.nr_unused));
>
> race bug.
What new race? The code has gone from using subtraction of unlocked
counters to using subtraction of unlocked atomic counters. I can't
see any new race condition in that transformation....
> > +struct inodes_stat_t {
> > + atomic_t nr_inodes;
> > + atomic_t nr_unused;
> > + int dummy[5]; /* padding for sysctl ABI compatibility */
> > +};
>
> OK, that's a hack. The first two "ints" are copied out to userspace.
> This change assumes that sizeof(atomic_t)=4 and that an atomic_t has
> the same layout, alignment and padding as an int.
>
> Probably that's true in current kernels and with current architectures
> but it's a hack and it's presumptive.
>
> It shouldn't be snuck into the tree unchangelogged and uncommented.
>
> (time passes)
>
> OK, I see that all of this gets reverted later on. Please update the
> changelog so the next reviewer doesn't get fooled.
That's the result of the multiple changes to the counter infrastructure I
described in the preliminary series description. Clearly it needs to
be cleaned up.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 07/17] fs: icache atomic inodes_stat
2010-09-30 6:20 ` Dave Chinner
@ 2010-09-30 6:37 ` Andrew Morton
0 siblings, 0 replies; 111+ messages in thread
From: Andrew Morton @ 2010-09-30 6:37 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Thu, 30 Sep 2010 16:20:57 +1000 Dave Chinner <david@fromorbit.com> wrote:
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -764,7 +764,8 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
> > > wb->last_old_flush = jiffies;
> > > nr_pages = global_page_state(NR_FILE_DIRTY) +
> > > global_page_state(NR_UNSTABLE_NFS) +
> > > - (inodes_stat.nr_inodes - inodes_stat.nr_unused);
> > > + (atomic_read(&inodes_stat.nr_inodes) -
> > > + atomic_read(&inodes_stat.nr_unused));
> >
> > race bug.
>
> What new race? The code has gone from using subtraction of unlocked
> counters to using subtraction of unlocked atomic counters. I can't
> see any new race condition in that transformation....
Oh. I assumed that all the above used to happen under inode_lock.
Coz that's wot the changelog told me :)
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 15/17] fs: inode per-cpu last_ino allocator
2010-09-30 5:36 ` Eric Dumazet
@ 2010-09-30 7:53 ` Eric Dumazet
2010-09-30 8:14 ` Andrew Morton
0 siblings, 1 reply; 111+ messages in thread
From: Eric Dumazet @ 2010-09-30 7:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
Le jeudi 30 septembre 2010 à 07:36 +0200, Eric Dumazet a écrit :
> Le mercredi 29 septembre 2010 à 21:53 -0700, Andrew Morton a écrit :
> > > +static unsigned int last_ino_get(void)
> > > +{
> > > + static unsigned int last_ino;
> > > +
> > > + return ++last_ino;
> > > +}
> >
> > This is racy with CONFIG_PREEMPT on some architectures, I suspect. I'd
> > suggest conversion to atomic_t with, of course, an explanatory comment ;)
> >
>
In fact this code was OK when I submitted my original patch back in
2008, since it replaced fs/inode.c
inode->i_ino = ++last_ino;
And this was protected by a surrounding spinlock
(spin_lock(&inode_lock); at that time)
Even after Nick's patches, preemption is still disabled (by two
spinlocks... spin_lock(&sb_inode_list_lock); /
spin_lock(&inode->i_lock);)
So patch 15/17 seems good to me, I re-sign it as-is ;)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
If it happens preemption is re-enabled later (with future patches), we
might need to change last_ino_get() too.
Thanks !
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 15/17] fs: inode per-cpu last_ino allocator
2010-09-30 7:53 ` Eric Dumazet
@ 2010-09-30 8:14 ` Andrew Morton
2010-09-30 10:22 ` [PATCH] " Eric Dumazet
0 siblings, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-09-30 8:14 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Thu, 30 Sep 2010 09:53:09 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 30 septembre 2010 à 07:36 +0200, Eric Dumazet a écrit :
> > Le mercredi 29 septembre 2010 à 21:53 -0700, Andrew Morton a écrit :
>
> > > > +static unsigned int last_ino_get(void)
> > > > +{
> > > > + static unsigned int last_ino;
> > > > +
> > > > + return ++last_ino;
> > > > +}
> > >
> > > This is racy with CONFIG_PREEMPT on some architectures, I suspect. I'd
> > > suggest conversion to atomic_t with, of course, an explanatory comment ;)
> > >
> >
>
> In fact this code was OK when I submitted my original patch back in
> 2008, since it replaced fs/inode.c
>
> inode->i_ino = ++last_ino;
>
> And this was protected by a surrounding spinlock
> (spin_lock(&inode_lock); at that time)
>
> Even after Nick's patches, preemption is still disabled (by two
> spinlocks... spin_lock(&sb_inode_list_lock); /
> spin_lock(&inode->i_lock);)
You know, if it took you and me this long to work that out then perhaps
the code isn't quite as clear as we would like it to be, no?
I think you know what's coming next ;)
As a general rule, if a reviewer's comment doesn't result in a code
change then it should result in a changelog fix or a code comment.
Because if the code wasn't clear enough to the reviewer then it won't be
clear enough to later readers.
> So patch 15/17 seems good to me, I re-sign it as-is ;)
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> If it happens preemption is re-enabled later (with future patches), we
> might need to change last_ino_get() too.
Perhaps
WARN_ON_ONCE(preemptible());
if we had a developer-only version of WARN_ON_ONCE, which we don't.
^ permalink raw reply [flat|nested] 111+ messages in thread
* [PATCH] fs: inode per-cpu last_ino allocator
2010-09-30 8:14 ` Andrew Morton
@ 2010-09-30 10:22 ` Eric Dumazet
2010-09-30 16:45 ` Andrew Morton
0 siblings, 1 reply; 111+ messages in thread
From: Eric Dumazet @ 2010-09-30 10:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
Le jeudi 30 septembre 2010 à 01:14 -0700, Andrew Morton a écrit :
> Perhaps
>
> WARN_ON_ONCE(preemptible());
>
> if we had a developer-only version of WARN_ON_ONCE, which we don't.
Or just use a regular PER_CPU variable, even on !SMP, and get a
preempt-safe implementation.
What do you think of following patch, on top of current linux-2.6 tree ?
Thanks
[PATCH] fs: inode per-cpu last_ino allocator
new_inode() dirties a contended cache line to get increasing
inode numbers.
Solve this problem by providing each cpu with a per_cpu variable,
fed by the shared last_ino, but only once every 1024 allocations.
This reduces contention on the shared last_ino, and gives the same
spread of ino numbers as before (i.e. same wraparound after 2^32
allocations).
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 45 ++++++++++++++++++++++++++++++++++++++-------
1 file changed, 38 insertions(+), 7 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 8646433..122914e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -624,6 +624,43 @@ void inode_add_to_lists(struct super_block *sb, struct inode *inode)
}
EXPORT_SYMBOL_GPL(inode_add_to_lists);
+#define LAST_INO_BATCH 1024
+
+/*
+ * Each cpu owns a range of LAST_INO_BATCH numbers.
+ * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
+ * to renew the exhausted range.
+ *
+ * This does not significantly increase overflow rate because every CPU can
+ * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
+ * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
+ * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
+ * overflow rate by 2x, which does not seem too significant.
+ *
+ * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
+ * error if st_ino won't fit in target struct field. Use 32bit counter
+ * here to attempt to avoid that.
+ */
+static DEFINE_PER_CPU(unsigned int, last_ino);
+
+static noinline unsigned int last_ino_get(void)
+{
+ unsigned int *p = &get_cpu_var(last_ino);
+ unsigned int res = *p;
+
+#ifdef CONFIG_SMP
+ if (unlikely((res & (LAST_INO_BATCH - 1)) == 0)) {
+ static atomic_t shared_last_ino;
+ int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
+
+ res = next - LAST_INO_BATCH;
+ }
+#endif
+ *p = ++res;
+ put_cpu_var(last_ino);
+ return res;
+}
+
/**
* new_inode - obtain an inode
* @sb: superblock
@@ -638,12 +675,6 @@ EXPORT_SYMBOL_GPL(inode_add_to_lists);
*/
struct inode *new_inode(struct super_block *sb)
{
- /*
- * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
- * error if st_ino won't fit in target struct field. Use 32bit counter
- * here to attempt to avoid that.
- */
- static unsigned int last_ino;
struct inode *inode;
spin_lock_prefetch(&inode_lock);
@@ -652,7 +683,7 @@ struct inode *new_inode(struct super_block *sb)
if (inode) {
spin_lock(&inode_lock);
__inode_add_to_lists(sb, NULL, inode);
- inode->i_ino = ++last_ino;
+ inode->i_ino = last_ino_get();
inode->i_state = 0;
spin_unlock(&inode_lock);
}
^ permalink raw reply related [flat|nested] 111+ messages in thread
* Re: [PATCH] fs: inode per-cpu last_ino allocator
2010-09-30 10:22 ` [PATCH] " Eric Dumazet
@ 2010-09-30 16:45 ` Andrew Morton
2010-09-30 17:28 ` Eric Dumazet
0 siblings, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-09-30 16:45 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Thu, 30 Sep 2010 12:22:16 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 30 septembre 2010 à 01:14 -0700, Andrew Morton a écrit :
>
> > Perhaps
> >
> > WARN_ON_ONCE(preemptible());
> >
> > if we had a developer-only version of WARN_ON_ONCE, which we don't.
>
> Or just use a regular PER_CPU variable, even on !SMP, and get a
> preempt-safe implementation.
Good stuff.
> What do you think of following patch, on top of current linux-2.6 tree ?
>
> ...
>
> +static noinline unsigned int last_ino_get(void)
> +{
> + unsigned int *p = &get_cpu_var(last_ino);
> + unsigned int res = *p;
> +
> +#ifdef CONFIG_SMP
> + if (unlikely((res & (LAST_INO_BATCH - 1)) == 0)) {
> + static atomic_t shared_last_ino;
> + int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
> +
> + res = next - LAST_INO_BATCH;
> + }
> +#endif
> + *p = ++res;
> + put_cpu_var(last_ino);
> + return res;
> +}
Could eliminate `p' I guess, but that would involve using
__get_cpu_var() as an lval, which looks vile and might generate worse
code.
Readers of this code won't know why last_ino_get() was marked noinline.
It looks wrong, really.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] fs: inode per-cpu last_ino allocator
2010-09-30 16:45 ` Andrew Morton
@ 2010-09-30 17:28 ` Eric Dumazet
2010-09-30 17:39 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 111+ messages in thread
From: Eric Dumazet @ 2010-09-30 17:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
Le jeudi 30 septembre 2010 à 09:45 -0700, Andrew Morton a écrit :
> Could eliminate `p' I guess, but that would involve using
> __get_cpu_var() as an lval, which looks vile and might generate worse
> code.
>
Hmm, I see, please check this new patch, using the most modern stuff ;)
> Readers of this code won't know why last_ino_get() was marked noinline.
> It looks wrong, really.
Oops sorry, this was a temporary hack of mine to ease disassembly
analysis. Good catch !
Here is the new generated code on i686 (with the noinline) :
pretty good ;)
c02e5930 <last_ino_get>:
c02e5930: 55 push %ebp
c02e5931: 89 e5 mov %esp,%ebp
c02e5933: 64 a1 44 29 7d c0 mov %fs:0xc07d2944,%eax
c02e5939: a9 ff 03 00 00 test $0x3ff,%eax
c02e593e: 74 09 je c02e5949 <last_ino_get+0x19>
c02e5940: 40 inc %eax
c02e5941: 64 a3 44 29 7d c0 mov %eax,%fs:0xc07d2944
c02e5947: c9 leave
c02e5948: c3 ret
c02e5949: b8 00 04 00 00 mov $0x400,%eax
c02e594e: f0 0f c1 05 80 c8 92 c0 lock xadd %eax,0xc092c880
c02e5956: eb e8 jmp c02e5940 <last_ino_get+0x10>
Thanks
[PATCH] fs: inode per-cpu last_ino allocator
new_inode() dirties a contended cache line to get increasing
inode numbers.
Solve this problem by providing each cpu with a per_cpu variable,
fed by the shared last_ino, but only once every 1024 allocations.
This reduces contention on the shared last_ino, and gives the same
spread of ino numbers as before (i.e. same wraparound after 2^32
allocations).
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 47 ++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 40 insertions(+), 7 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 8646433..5c233f0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -624,6 +624,45 @@ void inode_add_to_lists(struct super_block *sb, struct inode *inode)
}
EXPORT_SYMBOL_GPL(inode_add_to_lists);
+#define LAST_INO_BATCH 1024
+
+/*
+ * Each cpu owns a range of LAST_INO_BATCH numbers.
+ * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
+ * to renew the exhausted range.
+ *
+ * This does not significantly increase overflow rate because every CPU can
+ * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
+ * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
+ * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
+ * overflow rate by 2x, which does not seem too significant.
+ *
+ * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
+ * error if st_ino won't fit in target struct field. Use 32bit counter
+ * here to attempt to avoid that.
+ */
+static DEFINE_PER_CPU(unsigned int, last_ino);
+
+static unsigned int last_ino_get(void)
+{
+ unsigned int res;
+
+ get_cpu();
+ res = __this_cpu_read(last_ino);
+#ifdef CONFIG_SMP
+ if (unlikely((res & (LAST_INO_BATCH - 1)) == 0)) {
+ static atomic_t shared_last_ino;
+ int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
+
+ res = next - LAST_INO_BATCH;
+ }
+#endif
+ res++;
+ __this_cpu_write(last_ino, res);
+ put_cpu();
+ return res;
+}
+
/**
* new_inode - obtain an inode
* @sb: superblock
@@ -638,12 +677,6 @@ EXPORT_SYMBOL_GPL(inode_add_to_lists);
*/
struct inode *new_inode(struct super_block *sb)
{
- /*
- * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
- * error if st_ino won't fit in target struct field. Use 32bit counter
- * here to attempt to avoid that.
- */
- static unsigned int last_ino;
struct inode *inode;
spin_lock_prefetch(&inode_lock);
@@ -652,7 +685,7 @@ struct inode *new_inode(struct super_block *sb)
if (inode) {
spin_lock(&inode_lock);
__inode_add_to_lists(sb, NULL, inode);
- inode->i_ino = ++last_ino;
+ inode->i_ino = last_ino_get();
inode->i_state = 0;
spin_unlock(&inode_lock);
}
^ permalink raw reply related [flat|nested] 111+ messages in thread
* Re: [PATCH] fs: inode per-cpu last_ino allocator
2010-09-30 17:28 ` Eric Dumazet
@ 2010-09-30 17:39 ` Andrew Morton
2010-09-30 18:05 ` Eric Dumazet
2010-10-01 6:12 ` Christoph Hellwig
2010-10-16 6:36 ` Nick Piggin
2 siblings, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-09-30 17:39 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Thu, 30 Sep 2010 19:28:05 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 30 septembre 2010 à 09:45 -0700, Andrew Morton a écrit :
>
> > Could eliminate `p' I guess, but that would involve using
> > __get_cpu_var() as an lval, which looks vile and might generate worse
> > code.
> >
>
> Hmm, I see, please check this new patch, using the most modern stuff ;)
>
> > Readers of this code won't know why last_ino_get() was marked noinline.
> > It looks wrong, really.
>
> Oops sorry, this was a temporary hack of mine to ease disassembly
> analysis. Good catch !
>
> Here is the new generated code on i686 (with the noinline) :
> pretty good ;)
>
> c02e5930 <last_ino_get>:
> c02e5930: 55 push %ebp
> c02e5931: 89 e5 mov %esp,%ebp
> c02e5933: 64 a1 44 29 7d c0 mov %fs:0xc07d2944,%eax
> c02e5939: a9 ff 03 00 00 test $0x3ff,%eax
> c02e593e: 74 09 je c02e5949 <last_ino_get+0x19>
> c02e5940: 40 inc %eax
> c02e5941: 64 a3 44 29 7d c0 mov %eax,%fs:0xc07d2944
> c02e5947: c9 leave
> c02e5948: c3 ret
> c02e5949: b8 00 04 00 00 mov $0x400,%eax
> c02e594e: f0 0f c1 05 80 c8 92 c0 lock xadd %eax,0xc092c880
> c02e5956: eb e8 jmp c02e5940 <last_ino_get+0x10>
>
That's uniprocessor, PREEMPT=n, I guess.
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -624,6 +624,45 @@ void inode_add_to_lists(struct super_block *sb, struct inode *inode)
> }
> EXPORT_SYMBOL_GPL(inode_add_to_lists);
>
> +#define LAST_INO_BATCH 1024
> +
> +/*
> + * Each cpu owns a range of LAST_INO_BATCH numbers.
> + * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
> + * to renew the exhausted range.
> + *
> + * This does not significantly increase overflow rate because every CPU can
> + * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
> + * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
> + * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
> + * overflow rate by 2x, which does not seem too significant.
> + *
> + * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
> + * error if st_ino won't fit in target struct field. Use 32bit counter
> + * here to attempt to avoid that.
> + */
> +static DEFINE_PER_CPU(unsigned int, last_ino);
> +
> +static unsigned int last_ino_get(void)
> +{
> + unsigned int res;
> +
> + get_cpu();
> + res = __this_cpu_read(last_ino);
> +#ifdef CONFIG_SMP
> + if (unlikely((res & (LAST_INO_BATCH - 1)) == 0)) {
> + static atomic_t shared_last_ino;
> + int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
> +
> + res = next - LAST_INO_BATCH;
> + }
> +#endif
> + res++;
> + __this_cpu_write(last_ino, res);
> + put_cpu();
> + return res;
> +}
Looks good ;)
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] fs: inode per-cpu last_ino allocator
2010-09-30 17:39 ` Andrew Morton
@ 2010-09-30 18:05 ` Eric Dumazet
0 siblings, 0 replies; 111+ messages in thread
From: Eric Dumazet @ 2010-09-30 18:05 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
Le jeudi 30 septembre 2010 à 10:39 -0700, Andrew Morton a écrit :
> On Thu, 30 Sep 2010 19:28:05 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > Le jeudi 30 septembre 2010 à 09:45 -0700, Andrew Morton a écrit :
> >
> > > Could eliminate `p' I guess, but that would involve using
> > > __get_cpu_var() as an lval, which looks vile and might generate worse
> > > code.
> > >
> >
> > Hmm, I see, please check this new patch, using the most modern stuff ;)
> >
> > > Readers of this code won't know why last_ino_get() was marked noinline.
> > > It looks wrong, really.
> >
> > Oops sorry, this was a temporary hack of mine to ease disassembly
> > analysis. Good catch !
> >
> > Here is the new generated code on i686 (with the noinline) :
> > pretty good ;)
> >
> > c02e5930 <last_ino_get>:
> > c02e5930: 55 push %ebp
> > c02e5931: 89 e5 mov %esp,%ebp
> > c02e5933: 64 a1 44 29 7d c0 mov %fs:0xc07d2944,%eax
> > c02e5939: a9 ff 03 00 00 test $0x3ff,%eax
> > c02e593e: 74 09 je c02e5949 <last_ino_get+0x19>
> > c02e5940: 40 inc %eax
> > c02e5941: 64 a3 44 29 7d c0 mov %eax,%fs:0xc07d2944
> > c02e5947: c9 leave
> > c02e5948: c3 ret
> > c02e5949: b8 00 04 00 00 mov $0x400,%eax
> > c02e594e: f0 0f c1 05 80 c8 92 c0 lock xadd %eax,0xc092c880
> > c02e5956: eb e8 jmp c02e5940 <last_ino_get+0x10>
> >
>
> That's uniprocessor, PREEMPT=n, I guess.
>
It is SMP. PREEMPT=n
On UP, you would not have the %fs prefixes, and no "lock" before the
xadd.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 01/17] kernel: add bl_list
2010-09-29 12:18 ` [PATCH 01/17] kernel: add bl_list Dave Chinner
2010-09-30 4:52 ` Andrew Morton
@ 2010-10-01 5:48 ` Christoph Hellwig
1 sibling, 0 replies; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-01 5:48 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
I don't think rculist_bl.h is actually used in this series, so there is
no point adding it.
* Re: [PATCH 02/17] fs: icache lock s_inodes list
2010-09-29 12:18 ` [PATCH 02/17] fs: icache lock s_inodes list Dave Chinner
@ 2010-10-01 5:49 ` Christoph Hellwig
2010-10-16 7:54 ` Nick Piggin
0 siblings, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-01 5:49 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 10:18:34PM +1000, Dave Chinner wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> To allow removal of the inode_lock, we first need to protect the
> superblock inode list with its own lock instead of using the
> inode_lock for this purpose. Nest the new sb_inode_list_lock inside
> the inode_lock around the list operations it needs to protect.
Is there any good reason not to make this lock per-superblock?
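For what it's worth, "per-superblock" here would roughly mean hanging the lock
off struct super_block next to the list it protects. A minimal sketch, with
the helper and the s_inodes_lock field name purely illustrative:

static void inode_sb_list_add(struct inode *inode, struct super_block *sb)
{
	spin_lock(&sb->s_inodes_lock);
	list_add(&inode->i_sb_list, &sb->s_inodes);
	spin_unlock(&sb->s_inodes_lock);
}

static void inode_sb_list_del(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	spin_lock(&sb->s_inodes_lock);
	list_del_init(&inode->i_sb_list);
	spin_unlock(&sb->s_inodes_lock);
}

The trade-off against a global sb_inode_list_lock is discussed further down
the thread.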
* Re: [PATCH 04/17] fs: icache lock i_state
2010-09-29 12:18 ` [PATCH 04/17] fs: icache lock i_state Dave Chinner
@ 2010-10-01 5:54 ` Christoph Hellwig
2010-10-16 7:54 ` Nick Piggin
0 siblings, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-01 5:54 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
> + spin_lock(&inode->i_lock);
> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)
> + || inode->i_mapping->nrpages == 0) {
This is some pretty strange formatting.
if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
inode->i_mapping->nrpages == 0) {
would be more standard.
> list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> struct address_space *mapping;
>
> - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> - continue;
> mapping = inode->i_mapping;
> if (mapping->nrpages == 0)
> continue;
> + spin_lock(&inode->i_lock);
> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> + spin_unlock(&inode->i_lock);
> + continue;
> + }
Can we access the mapping safely when the inode isn't actually fully
setup? Even if we can I'd rather not introduce this change hidden
inside an unrelated patch.
* Re: [PATCH 05/17] fs: icache lock i_count
2010-09-30 4:52 ` Andrew Morton
@ 2010-10-01 5:55 ` Christoph Hellwig
2010-10-01 6:04 ` Andrew Morton
0 siblings, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-01 5:55 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 09:52:29PM -0700, Andrew Morton wrote:
> On Wed, 29 Sep 2010 22:18:37 +1000 Dave Chinner <david@fromorbit.com> wrote:
>
> > - if (atomic_read(&inode->i_count) != 1)
> > + if (inode->i_count != 1)
>
> This really should have been renamed to catch unconverted code.
>
> Such code usually wouldn't compile anyway, but it will if it takes the
> address of i_count only (for example).
If people do whacky things they'll lose - there is a reason why C has a
fairly strict type system after all. We've changed types of variables
all the time and we didn't run into problems.
> And maybe we should access this guy via accessor functions, dunno.
Seems like complete overkill.
* Re: [PATCH 06/17] fs: icache lock lru/writeback lists
2010-09-29 12:18 ` [PATCH 06/17] fs: icache lock lru/writeback lists Dave Chinner
2010-09-30 4:52 ` Andrew Morton
@ 2010-10-01 6:01 ` Christoph Hellwig
2010-10-05 22:30 ` Dave Chinner
1 sibling, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-01 6:01 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 10:18:38PM +1000, Dave Chinner wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> The inode moves between different lists protected by the inode_lock. Introduce
> a new lock that protects all of the lists (dirty, unused, in use, etc) that the
> inode will move around as it changes state. As this is mostly a lock for
> protecting the writeback lists, name it wb_inode_list_lock and nest all the
> list manipulations in this lock inside the current inode_lock scope.
As a band-aid to get rid of the inode_lock this might be fine, but I
don't really like it. For one, all the lists are per-bdi_writeback, so
the lock should be as well. Second, the lock is held over far too long
periods during writeback, which leads to a lot of whacky trylock
operations and unlock and sleep cycles inside it. In practice we only
need it in the places where we manipulate the lists.
Also it feels like it really should nest outside i_lock, not inside it,
but I need to look more deeply to figure why that might not easily be
possible.
But maybe all that is better left as a separate patch on top of the
current queue.
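A rough sketch of that direction (assuming a new list_lock field in struct
bdi_writeback; the helper names below are illustrative): the lock lives next
to the b_dirty/b_io/b_more_io lists it protects, and only the list
manipulation itself runs under it:

static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
{
	/* move the inode back onto the per-bdi dirty list */
	spin_lock(&wb->list_lock);
	list_move(&inode->i_list, &wb->b_dirty);
	spin_unlock(&wb->list_lock);
}

static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
{
	/* park the inode for another writeback pass */
	spin_lock(&wb->list_lock);
	list_move(&inode->i_list, &wb->b_more_io);
	spin_unlock(&wb->list_lock);
}

The rest of the writeback pass would then run without the lock held.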
* Re: [PATCH 08/17] fs: icache protect inode state
2010-09-29 12:18 ` [PATCH 08/17] fs: icache protect inode state Dave Chinner
@ 2010-10-01 6:02 ` Christoph Hellwig
2010-10-16 7:54 ` Nick Piggin
0 siblings, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-01 6:02 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 10:18:40PM +1000, Dave Chinner wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> Before removing the inode_lock, we need to protect the inode list
> operations with the inode->i_lock. This ensures that all inode state
> changes are serialised regardless of the fact that the lists they
> are moving around might be protected by different locks. Hence we
> can safely protect an inode in transit from one list to another
> without needing to hold all the list locks at the same time.
The subject does not seem to match the patch description and content.
* Re: [PATCH 05/17] fs: icache lock i_count
2010-10-01 5:55 ` Christoph Hellwig
@ 2010-10-01 6:04 ` Andrew Morton
2010-10-01 6:16 ` Christoph Hellwig
0 siblings, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-10-01 6:04 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, 1 Oct 2010 01:55:36 -0400 Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Sep 29, 2010 at 09:52:29PM -0700, Andrew Morton wrote:
> > On Wed, 29 Sep 2010 22:18:37 +1000 Dave Chinner <david@fromorbit.com> wrote:
> >
> > > - if (atomic_read(&inode->i_count) != 1)
> > > + if (inode->i_count != 1)
> >
> > This really should have been renamed to catch unconverted code.
> >
> > Such code usually wouldn't compile anyway, but it will if it takes the
> > address of i_count only (for example).
>
> If people do whacky things they'll lose - there is a reason why C has a
> fairly strict type system after all. We've changed types of variables
> all the time and we didn't run into problems.
No, we've run into problems *frequently*. A common case is where we
convert a mutex to a spinlock or vice versa. If you don't rename the
lock, the code still compiles (with warnings) and crashes horridly at
runtime.
> > And maybe we should access this guy via accessor functions, dunno.
>
> Seems like complete overkill.
Still wrong. We do this frequently and we do it in areas where we
believe that the implementation might change in the future.
Had we done it with i_count from day one then this part of the patchset
would be far simpler.
* Re: [PATCH 11/17] fs: Factor inode hash operations into functions
2010-09-29 12:18 ` [PATCH 11/17] fs: Factor inode hash operations into functions Dave Chinner
@ 2010-10-01 6:06 ` Christoph Hellwig
2010-10-16 7:54 ` Nick Piggin
0 siblings, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-01 6:06 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 10:18:43PM +1000, Dave Chinner wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> Before we can replace the inode hash locking with a more scalable
> mechanism, we need to remove external users of the inode_hash_lock.
> Make it private by adding a function __remove_inode_hash that can be
> called by filesystems instead of open-coding their own inode hash
> removal operations.
I like the factoring, but this changelog is misleading. At least in
this series no new user of __remove_inode_hash appears, and I'm not sure
where it would appear anyway. Just making the function global without
actually exporting it is not helping external filesystems anyway. For
now it can simply be made static.
* Re: [PATCH 03/17] fs: icache lock inode hash
2010-09-29 12:18 ` [PATCH 03/17] fs: icache lock inode hash Dave Chinner
2010-09-30 4:52 ` Andrew Morton
@ 2010-10-01 6:06 ` Christoph Hellwig
2010-10-16 7:57 ` Nick Piggin
1 sibling, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-01 6:06 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 10:18:35PM +1000, Dave Chinner wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> Currently the inode hash lists are protected by the inode_lock. To
> allow removal of the inode_lock, we need to protect the inode hash
> table lists with a new lock. Nest the new inode_hash_lock inside the
> inode_lock to protect the hash lists.
There is no reason to make inode_hash_lock global, it's only used inside
inode.c
* Re: [PATCH 09/17] fs: Make last_ino, iunique independent of inode_lock
2010-09-29 12:18 ` [PATCH 09/17] fs: Make last_ino, iunique independent of inode_lock Dave Chinner
2010-09-30 4:53 ` Andrew Morton
@ 2010-10-01 6:08 ` Christoph Hellwig
2010-10-16 7:54 ` Nick Piggin
1 sibling, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-01 6:08 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 10:18:41PM +1000, Dave Chinner wrote:
> From: Nick Piggin <npiggin@suse.de>
>
> Before removing the inode_lock, we need to make the last_ino and iunique
> counters independent of the inode_lock. last_ino can be trivially converted to
> an atomic variable, while the iunique counter needs a new lock nested inside
> the inode_lock to provide the same protection that the inode_lock previously
> provided.
Given that last_ino becomes a per-cpu construct only a few patches later
I think there's no point to make it an atomic_t here - just reorder the
per-cpu patch before the inode_lock removal.
* Re: [PATCH] fs: inode per-cpu last_ino allocator
2010-09-30 17:28 ` Eric Dumazet
2010-09-30 17:39 ` Andrew Morton
@ 2010-10-01 6:12 ` Christoph Hellwig
2010-10-01 6:45 ` Eric Dumazet
2010-10-16 6:36 ` Nick Piggin
2 siblings, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-01 6:12 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Andrew Morton, Dave Chinner, linux-fsdevel, linux-kernel
Eric,
what workload did this matter for? As mentioned elsewhere we don't
actually need to set i_ino in new_inode in many cases. Normal
disk/network filesystems already set their own i_ino anyway and don't
need it at all. Various in-kernel filesystems don't need any valid
i_ino and Nick's full series actually has a patch dealing with it,
which only leaves various user-mountable synthetic filesystems that
want to generate an inode number, and IMHO they're better off using
iunique.
* Re: [PATCH 05/17] fs: icache lock i_count
2010-10-01 6:04 ` Andrew Morton
@ 2010-10-01 6:16 ` Christoph Hellwig
2010-10-01 6:23 ` Andrew Morton
0 siblings, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-01 6:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel
On Thu, Sep 30, 2010 at 11:04:16PM -0700, Andrew Morton wrote:
> No, we've run into problems *frequently*. A common case is where we
> convert a mutex to a spinlock or vice versa. If you don't rename the
> lock, the code still compiles (with warnings) and crashes horridly at
> runtime.
Sorry, if you run code with such obvious warnings you are begging for trouble.
If you really believe your advanced users are too stupid to read
compiler warnings, enforcing -Werror is surely better than obfuscating
the code.
> Still wrong. We do this frequently and we do it in areas where we
> believe that the implementation might change in the future.
>
> Had we done it with i_count from day one then this part of the patchset
> would be far simpler.
I don't think that's quite true. The big point of using i_lock is that
we can hold it over accessing other things that are also protected by
it. No accessor is going to help you with that. For plain open-coded
increments we indeed need a helper, as shown by Chris' aio speedup and
the churn in here - but that's already added later in the series and I
asked Dave to move it before this patch.
* Re: [PATCH 05/17] fs: icache lock i_count
2010-10-01 6:16 ` Christoph Hellwig
@ 2010-10-01 6:23 ` Andrew Morton
0 siblings, 0 replies; 111+ messages in thread
From: Andrew Morton @ 2010-10-01 6:23 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, 1 Oct 2010 02:16:02 -0400 Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Sep 30, 2010 at 11:04:16PM -0700, Andrew Morton wrote:
> > No, we've run into problems *frequently*. A common case is where we
> > convert a mutex to a spinlock or vice versa. If you don't rename the
> > lock, the code still compiles (with warnings) and crashes horridly at
> > runtime.
>
> Sorry, if you run code with that obvious warnings you beg for trouble.
> If you really believe your advanced users arw too stupid to read
> compiler warnings enforcing -Werror is for sure better than obsfucating
> the code.
Well, it has happened, fairly regularly. A common scenario is where
someone has done a conversion in one tree and someone else has touched
overlapping code in another tree and when the two meet in linux-next,
splat. Renaming the field simply eliminates this.
Of course, the warnings don't get noticed because of the enormous
warning storm which a kernel build produces (generally much worse on
non-x86, btw).
Another reason for renaming a field is when we desire that it
henceforth be accessed via accessor functions - renaming it will
reliably break any unconverted code.
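As a concrete illustration of both points (the field and helper names here
are made up for the example): rename the field so every unconverted user
fails to compile, and route access through helpers so the representation can
change again later without churn:

/* i_count renamed to i_refs; old open-coded users no longer compile */
static inline int inode_ref_count(const struct inode *inode)
{
	return inode->i_refs;
}

static inline void inode_ref(struct inode *inode)
{
	assert_spin_locked(&inode->i_lock);
	inode->i_refs++;
}

With the field renamed, a stale atomic_read(&inode->i_count) in some other
tree fails to build instead of silently misbehaving at runtime.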
* Re: [PATCH] fs: inode per-cpu last_ino allocator
2010-10-01 6:12 ` Christoph Hellwig
@ 2010-10-01 6:45 ` Eric Dumazet
0 siblings, 0 replies; 111+ messages in thread
From: Eric Dumazet @ 2010-10-01 6:45 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Dave Chinner, linux-fsdevel, linux-kernel
On Friday 1 October 2010 at 02:12 -0400, Christoph Hellwig wrote:
> Eric,
>
> what workload did this matter for? As mentioned elsewhere we don't
> actually need to set i_ino in new_inode in many cases. Normal
> disk/network filesystems already set their own i_ino anyway and don't
> need it at all. Various in-kernel filesystems don't need any valid
> i_ino and Nick's full series actually has a patch dealing with it,
> which only leaves various user-mountable synthetic filesystems that
> want to generate an inode number, and IMHO they're better off using
> iunique.
>
Don't focus on this small last_ino problem; it is only one problem out of
a huge pile.
As you might know, I am mostly a netdev guy ;)
In 2008, I tried to scale workloads that need to set up / dismantle
thousands of sockets per second. Then Nick took up the challenge, because it
was clear nobody was interested in reviewing my stuff, or maybe I was not
re-sending patches often enough.
FS guys live in a closed world, it seems; they want to control every bit
of the patches.
for (i = 0; i < LIMIT; i++)
close(socket(...));
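A minimal standalone version of that kind of micro-benchmark might look like
the sketch below (a hypothetical reconstruction, not the socketallocbench
program referenced further down):

#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
	int nproc = argc > 1 ? atoi(argv[1]) : 1;	/* like "-n 8" */
	int i, p;

	for (p = 0; p < nproc; p++) {
		if (fork() == 0) {
			/* each child hammers socket()/close() */
			for (i = 0; i < 1000000; i++)
				close(socket(AF_INET, SOCK_STREAM, 0));
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}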
My results were impressive :
(From 27.5 seconds to 1.62 s, on an 8-cpu machine)
http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-12/msg04479.html
(fs: Scalability of sockets/pipes allocation/deallocation on SMP)
<quote>
Hi Andrew
As v2 of this patch series got no new feedback, maybe it's time for mm
inclusion for a while?
In this third version I added the last two patches, one initially from
Christoph Lameter, and one to avoid dirtying mnt->mnt_count on hardwired fs.
Many thanks to Christoph and Paul for the SLAB_DESTROY_BY_RCU work done
on "struct file".
Thank you
Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(From 27.5 seconds to 1.62 s, on an 8-cpu machine)
Long version :
To allocate a socket or a pipe we :
0) Do the usual file table manipulation (pretty scalable these days,
but would be faster if 'struct file' were using SLAB_DESTROY_BY_RCU
and avoided the call_rcu() cache killer). This point is addressed by
the 6th patch.
1) allocate an inode with new_inode()
This function :
- locks inode_lock,
- dirties nr_inodes counter
- dirties inode_in_use list (for sockets/pipes, this is useless)
- dirties superblock s_inodes
- dirties last_ino counter
All these are in different cache lines unfortunately.
2) allocate a dentry
d_alloc() takes dcache_lock,
insert dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
dirties nr_dentry
3) d_instantiate() dentry (dcache_lock taken again)
4) init_file() -> atomic_inc() on sock_mnt->refcount
At close() time, we must undo these things. It's even more expensive
because of the _atomic_dec_and_lock() that stresses a lot, and because of
the two cache lines that are touched when an element is deleted from a list
(previous and next items).
This is really bad, since sockets/pipes don't need to be visible in
the dcache or an inode list per super block.
This patch series get rid of all but one contended cache lines for
sockets, pipes and anonymous fd (signalfd, timerfd, ...)
socketallocbench is a very simple program (attached to this mail) that
makes a loop :
for (i = 0; i < 1000000; i++)
close(socket(AF_INET, SOCK_STREAM, 0));
Cost if one cpu runs the program :
real 1.561s
user 0.092s
sys 1.469s
Cost if 8 processes are launched on a 8 CPU machine
(socketallocbench -n 8) :
real 27.496s <<<< !!!! >>>>
user 0.657s
sys 3m39.092s
Oprofile results (for the 8 process run, 3 times):
CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit
mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
3347352 3347352 28.0232 28.0232 _atomic_dec_and_lock
3301428 6648780 27.6388 55.6620 d_instantiate
2971130 9619910 24.8736 80.5355 d_alloc
241318 9861228 2.0203 82.5558 init_file
146190 10007418 1.2239 83.7797 __slab_free
144149 10151567 1.2068 84.9864 inotify_d_instantiate
143971 10295538 1.2053 86.1917 inet_create
137168 10432706 1.1483 87.3401 new_inode
117549 10550255 0.9841 88.3242 add_partial
110795 10661050 0.9275 89.2517 generic_drop_inode
107137 10768187 0.8969 90.1486 kmem_cache_alloc
94029 10862216 0.7872 90.9358 tcp_close
82837 10945053 0.6935 91.6293 dput
67486 11012539 0.5650 92.1943 dentry_iput
57751 11070290 0.4835 92.6778 iput
54327 11124617 0.4548 93.1326 tcp_v4_init_sock
49921 11174538 0.4179 93.5505 sysenter_past_esp
47616 11222154 0.3986 93.9491 kmem_cache_free
30792 11252946 0.2578 94.2069 clear_inode
27540 11280486 0.2306 94.4375 copy_from_user
26509 11306995 0.2219 94.6594 init_timer
26363 11333358 0.2207 94.8801 discard_slab
25284 11358642 0.2117 95.0918 __fput
22482 11381124 0.1882 95.2800 __percpu_counter_add
20369 11401493 0.1705 95.4505 sock_alloc
18501 11419994 0.1549 95.6054 inet_csk_destroy_sock
17923 11437917 0.1500 95.7555 sys_close
This patch series avoids all contended cache lines and makes this "bench"
pretty fast.
New cost if run on one cpu :
real 1.245s (instead of 1.561s)
user 0.074s
sys 1.161s
If run on 8 CPUS :
real 1.624s
user 0.580s
sys 12.296s
On oprofile, we finally can see network stuff coming at the front of
expensive stuff. (with the exception of kmem_cache_[z]alloc(), because
it has to clear 192 bytes of file structures, this takes half of the
time)
CPU: Core 2, speed 3000.09 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
176586 176586 10.9376 10.9376 kmem_cache_alloc
169838 346424 10.5196 21.4572 tcp_close
105331 451755 6.5241 27.9813 tcp_v4_init_sock
105146 556901 6.5126 34.4939 tcp_v4_destroy_sock
83307 640208 5.1600 39.6539 sysenter_past_esp
80241 720449 4.9701 44.6239 inet_csk_destroy_sock
74263 794712 4.5998 49.2237 kmem_cache_free
56806 851518 3.5185 52.7422 __percpu_counter_add
48619 900137 3.0114 55.7536 copy_from_user
44803 944940 2.7751 58.5287 init_timer
28539 973479 1.7677 60.2964 d_alloc
27795 1001274 1.7216 62.0180 alloc_fd
26747 1028021 1.6567 63.6747 __fput
24312 1052333 1.5059 65.1805 sys_close
24205 1076538 1.4992 66.6798 inet_create
22409 1098947 1.3880 68.0677 alloc_inode
21359 1120306 1.3230 69.3907 release_sock
19865 1140171 1.2304 70.6211 fd_install
19472 1159643 1.2061 71.8272 lock_sock_nested
18956 1178599 1.1741 73.0013 sock_init_data
17301 1195900 1.0716 74.0729 drop_file_write_access
17113 1213013 1.0600 75.1329 inotify_d_instantiate
16384 1229397 1.0148 76.1477 dput
15173 1244570 0.9398 77.0875 local_bh_enable_ip
15017 1259587 0.9301 78.0176 local_bh_enable
13354 1272941 0.8271 78.8448 __sock_create
13139 1286080 0.8138 79.6586 inet_release
13062 1299142 0.8090 80.4676 sysenter_do_call
11935 1311077 0.7392 81.2069 iput_single
This patch series contains 7 patches, against the linux-2.6 tree,
plus one patch in mm (fs: filp_cachep can be static in fs/file_table.c)
[PATCH 1/7] fs: Use a percpu_counter to track nr_dentry
Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no longer needed to protect dentry_stat.nr_dentry
We centralize nr_dentry updates at the right place :
- increments in d_alloc()
- decrements in d_free()
d_alloc() can avoid taking dcache_lock if parent is NULL
("socketallocbench -n 8" bench result : 27.5s to 25s)
[PATCH 2/7] fs: Use a percpu_counter to track nr_inodes
Avoids cache line ping pongs between cpus and prepares the next patch,
because updates of nr_inodes don't need inode_lock anymore.
("socketallocbench -n 8" bench result : no difference at this point)
[PATCH 3/7] fs: Introduce a per_cpu last_ino allocator
new_inode() dirties a contended cache line to get increasing
inode numbers.
Solve this problem by providing each cpu with a per_cpu variable,
fed by the shared last_ino, but only once every 1024 allocations.
This reduces contention on the shared last_ino, and gives the same
spread of ino numbers as before.
(same wraparound after 2^32 allocations)
("socketallocbench -n 8" result : no difference)
[PATCH 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd
Sockets, pipes and anonymous fds have interesting properties.
Like other files, they use a dentry and an inode.
But dentries for these kinds of files are not hashed into the dcache,
since there is no way someone can look up such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)
Still, allocating and freeing such dentries are expensive processes,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.
This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.
Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.
Differences between a SINGLE dentry and a normal one are :
1) SINGLE dentry has the DCACHE_SINGLE flag
2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
This is to avoid taking a reference on the sb 'root' dentry, shared
by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty
(socket8 bench result : from 25s to 19.9s)
[PATCH 5/7] fs: new_inode_single() and iput_single()
Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
inodes allocation/freeing.
SINGLE dentries are attached to inodes that don't need to be linked
in a list of inodes, be it "inode_in_use" or "sb->s_inodes".
As inode_lock was taken only to protect these lists, we avoid taking it
as well.
Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.
This patch has a very noticeable effect, because we avoid dirtying of
three contended cache lines in new_inode(), and five cache lines
in iput()
("socketallocbench -n 8" result : from 19.9s to 3.01s)
[PATH 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
From: Christoph Lameter <cl@xxxxxxxxxxxxxxxxxxxx>
Currently we schedule RCU frees for each file we free separately. That has
several drawbacks against the earlier file handling (in 2.6.5 f.e.), which
did not require RCU callbacks:
1. Excessive number of RCU callbacks can be generated causing long RCU
queues that in turn cause long latencies. We hit SLUB page allocation
more often than necessary.
2. The cache hot object is not preserved between free and realloc. A close
followed by another open is very fast with the RCUless approach because
the last freed object is returned by the slab allocator that is
still cache hot. RCU free means that the object is not immediately
available again. The new object is cache cold and therefore open/close
performance tests show a significant degradation with the RCU
implementation.
One solution to this problem is to move the RCU freeing into the slab
allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
time. The slab allocator will do RCU frees only when it is necessary
to dispose of slabs of objects (rare). So with that approach we can cut
out the RCU overhead significantly.
However, the slab allocator may return the object for another use even
before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
there is the (unlikely) possibility that the object is going to be
switched under us in sections protected by rcu_read_lock() and
rcu_read_unlock(). So we need to verify that we have acquired the correct
object after establishing a stable object reference (incrementing the
refcounter does that).
</quote>
http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-12/msg04475.html
(1/7 fs: Use a percpu_counter to track nr_dentry)
http://patchwork.ozlabs.org/patch/13602/
(2/7 Use a percpu_counter to track nr_inodes)
http://patchwork.ozlabs.org/patch/13603/
(3/7 Introduce a per_cpu last_ino allocator)
http://patchwork.ozlabs.org/patch/13605/
(4/7 Introduce SINGLE dentries for pipes, socket, anon fd)
...
* Re: [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter
2010-09-30 4:53 ` Andrew Morton
2010-09-30 6:10 ` Dave Chinner
@ 2010-10-02 16:02 ` Christoph Hellwig
1 sibling, 0 replies; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-02 16:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 09:53:22PM -0700, Andrew Morton wrote:
> > +extern int get_nr_inodes(void);
> > +extern int get_nr_inodes_unused(void);
>
> These are pretty cruddy names. Unfortunately we don't really have a vfs
> or "inode" subsystem name to prefix them with.
We don't really need to export these anyway. We have two callers for
each of them, and both are in the form of:
/* approximate dirty inodes */
nr_dirty_inodes = get_nr_inodes() - get_nr_inodes_unused();
if (nr_dirty_inodes < 0)
nr_dirty_inodes = 0;
which means we should just have a properly documented
get_nr_dirty_inodes() helper, the rest can stay private to inode.c
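Roughly something like this, kept in fs/inode.c so the raw counters stay
private (a sketch of the suggestion, not a tested patch):

/*
 * get_nr_dirty_inodes - approximate number of dirty inodes
 *
 * The per-cpu sums are sampled without locking, so the difference can
 * transiently go negative; clamp it to zero in that case.
 */
int get_nr_dirty_inodes(void)
{
	int nr_dirty = get_nr_inodes() - get_nr_inodes_unused();

	return nr_dirty > 0 ? nr_dirty : 0;
}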
* Re: [PATCH 0/17] fs: Inode cache scalability
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
` (18 preceding siblings ...)
2010-09-30 2:21 ` Christoph Hellwig
@ 2010-10-02 23:10 ` Carlos Carvalho
2010-10-04 7:22 ` Dave Chinner
19 siblings, 1 reply; 111+ messages in thread
From: Carlos Carvalho @ 2010-10-02 23:10 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel
We have serious problems with 34.6 in a machine with ~11TiB xfs, with
a lot of simultaneous IO, particularly hundreds of rm and a sync
afterwards. Maybe they're related to these issues.
The machine is a file server (almost all via http/apache) and has
several thousand connections all the time. It behaves quite well for
at most 4 days; from then on kswapd's start appearing on the display
of top consuming ever increasing percentages of cpu. This is no
problem, the machine has 16 nearly idle cores. However, after about
5-7 days there's an abrupt transition: in about 30s the load goes to
several thousand, apache shows up consuming all possible cpu and
downloads nearly stop. I have to reboot the machine to get service
back. It manages to unmount the filesystems and reboot properly.
Stopping/restarting apache restores the situation but only for
a short while; after about 2-3h the problem reappears. That's why I
have to reboot.
With 35.6 the behaviour seems to have changed: now often
CONFIG_DETECT_HUNG_TASK produces this kind of call trace in the log:
[<ffffffff81098578>] ? igrab+0x10/0x30
[<ffffffff811160fe>] ? xfs_sync_inode_valid+0x4c/0x76
[<ffffffff81116241>] ? xfs_sync_inode_data+0x1b/0xa8
[<ffffffff811163e0>] ? xfs_inode_ag_walk+0x96/0xe4
[<ffffffff811163dd>] ? xfs_inode_ag_walk+0x93/0xe4
[<ffffffff81116226>] ? xfs_sync_inode_data+0x0/0xa8
[<ffffffff81116495>] ? xfs_inode_ag_iterator+0x67/0xc4
[<ffffffff81116226>] ? xfs_sync_inode_data+0x0/0xa8
[<ffffffff810a48dd>] ? sync_one_sb+0x0/0x1e
[<ffffffff81116712>] ? xfs_sync_data+0x22/0x42
[<ffffffff810a48dd>] ? sync_one_sb+0x0/0x1e
[<ffffffff8111678b>] ? xfs_quiesce_data+0x2b/0x94
[<ffffffff81113f03>] ? xfs_fs_sync_fs+0x2d/0xd7
[<ffffffff810a48dd>] ? sync_one_sb+0x0/0x1e
[<ffffffff810a48c4>] ? __sync_filesystem+0x62/0x7b
[<ffffffff8108993e>] ? iterate_supers+0x60/0x9d
[<ffffffff810a493a>] ? sys_sync+0x3f/0x53
[<ffffffff81001dab>] ? system_call_fastpath+0x16/0x1b
It doesn't seem to cause service disruption (at least the flux graphs
don't show drops). I didn't see it happen while I was watching so it
may be that service degrades for short intervals. Uptime with 35.6 is
only 3d8h so it's still not sure that the breakdown of 34.6 is gone
but kswapd's cpu usages are very small, less than with 34.6 for a
similar uptime. There are only 2 filesystems, and the big one has 256
AGs. They're not mounted with delaylog.
* Re: [PATCH 0/17] fs: Inode cache scalability
2010-10-02 23:10 ` Carlos Carvalho
@ 2010-10-04 7:22 ` Dave Chinner
0 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2010-10-04 7:22 UTC (permalink / raw)
To: Carlos Carvalho; +Cc: linux-fsdevel, linux-kernel
On Sat, Oct 02, 2010 at 08:10:02PM -0300, Carlos Carvalho wrote:
> We have serious problems with 34.6 in a machine with ~11TiB xfs, with
> a lot of simultaneous IO, particularly hundreds of rm and a sync
> afterwards. Maybe they're related to these issues.
>
> The machine is a file server (almost all via http/apache) and has
> several thousand connections all the time. It behaves quite well for
> at most 4 days; from then on kswapd's start appearing on the display
> of top consuming ever increasing percentages of cpu. This is no
> problem, the machine has 16 nearly idle cores. However, after about
> 5-7 days there's an abrupt transition: in about 30s the load goes to
> several thousand, apache shows up consuming all possible cpu and
> downloads nearly stop. I have to reboot the machine to get service
> back. It manages to unmount the filesystems and reboot properly.
>
> Stopping/restarting apache restores the situation but only for
> a short while; after about 2-3h the problem reappears. That's why I
> have to reboot.
>
> With 35.6 the behaviour seems to have changed: now often
> CONFIG_DETECT_HUNG_TASK produces this kind of call trace in the log:
>
> [<ffffffff81098578>] ? igrab+0x10/0x30
> [<ffffffff811160fe>] ? xfs_sync_inode_valid+0x4c/0x76
> [<ffffffff81116241>] ? xfs_sync_inode_data+0x1b/0xa8
> [<ffffffff811163e0>] ? xfs_inode_ag_walk+0x96/0xe4
> [<ffffffff811163dd>] ? xfs_inode_ag_walk+0x93/0xe4
> [<ffffffff81116226>] ? xfs_sync_inode_data+0x0/0xa8
> [<ffffffff81116495>] ? xfs_inode_ag_iterator+0x67/0xc4
> [<ffffffff81116226>] ? xfs_sync_inode_data+0x0/0xa8
> [<ffffffff810a48dd>] ? sync_one_sb+0x0/0x1e
> [<ffffffff81116712>] ? xfs_sync_data+0x22/0x42
> [<ffffffff810a48dd>] ? sync_one_sb+0x0/0x1e
> [<ffffffff8111678b>] ? xfs_quiesce_data+0x2b/0x94
> [<ffffffff81113f03>] ? xfs_fs_sync_fs+0x2d/0xd7
> [<ffffffff810a48dd>] ? sync_one_sb+0x0/0x1e
> [<ffffffff810a48c4>] ? __sync_filesystem+0x62/0x7b
> [<ffffffff8108993e>] ? iterate_supers+0x60/0x9d
> [<ffffffff810a493a>] ? sys_sync+0x3f/0x53
> [<ffffffff81001dab>] ? system_call_fastpath+0x16/0x1b
>
> It doesn't seem to cause service disruption (at least the flux graphs
> don't show drops). I didn't see it happen while I was watching so it
> may be that service degrades for short intervals. Uptime with 35.6 is
> only 3d8h so it's still not sure that the breakdown of 34.6 is gone
> but kswapd's cpu usages are very small, less than with 34.6 for a
> similar uptime. There are only 2 filesystems, and the big one has 256
> AGs. They're not mounted with delaylog.
Apply this:
http://www.oss.sgi.com/archives/xfs/2010-10/msg00000.html
And in future, can you please report bugs in a new thread to the
appropriate lists (xfs@oss.sgi.com), not as a reply to a completely
unrelated development thread....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 06/17] fs: icache lock lru/writeback lists
2010-10-01 6:01 ` Christoph Hellwig
@ 2010-10-05 22:30 ` Dave Chinner
0 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2010-10-05 22:30 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-kernel
On Fri, Oct 01, 2010 at 02:01:03AM -0400, Christoph Hellwig wrote:
> On Wed, Sep 29, 2010 at 10:18:38PM +1000, Dave Chinner wrote:
> > From: Nick Piggin <npiggin@suse.de>
> >
> > The inode moves between different lists protected by the inode_lock. Introduce
> > a new lock that protects all of the lists (dirty, unused, in use, etc) that the
> > inode will move around as it changes state. As this is mostly a lock for
> > protecting the writeback lists, name it wb_inode_list_lock and nest all the
> > list manipulations in this lock inside the current inode_lock scope.
>
> As a band-aid to get rid of the inode_lock this might be fine, but I
> don't really like it. For one, all the lists are per-bdi_writeback, so
> the lock should be as well. Second, the lock is held over far too long
> periods during writeback, which leads to a lot of whacky trylock
> operations and unlock and sleep cycles inside it. In practice we only
> need it in the places where we manipulate the lists.
A per-bdi writeback lock won't work with the patch set as it stands -
it also protects the LRU, which is a global list. I'll have to pull
back another patch to split the LRU and IO lists to make this lock
per-bdi.
> Also it feels like it really should nest outside i_lock, not inside it,
> but I need to look more deeply to figure why that might not easily be
> possible.
Yeah, I'm trying to rework the patch series to not nest anything
inside i_lock. The more I look at all that trylock stuff, the more
my eyes bleed....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 15/17] fs: inode per-cpu last_ino allocator
2010-09-30 2:07 ` Christoph Hellwig
@ 2010-10-06 6:29 ` Dave Chinner
2010-10-06 8:51 ` Christoph Hellwig
0 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2010-10-06 6:29 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: dada1, linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 10:07:59PM -0400, Christoph Hellwig wrote:
> On Wed, Sep 29, 2010 at 10:18:47PM +1000, Dave Chinner wrote:
> > From: Eric Dumazet <dada1@cosmosbay.com>
> >
> > last_ino was converted to an atomic variable to allow the inode_lock
> > to go away. However, contended atomics do not scale on large
> > machines, and new_inode() triggers excessive contention in such
> > situations.
>
> And the good thing is most users of new_inode couldn't care less about
> the fake i_ino assigned because they have a real inode number. So
> the first step is to move the i_ino assignment into a separate helper
> and only use it in those filesystems that need it. Second step is
> to figure out why some filesystems need iunique() and some are fine
> with the incrementing counter and then we should find a scalable way
> to generate an inode number - preferably just one and not two, but if
> that's not possible we need some documentation on why which one is
> needed.
Sounds like a good plan, but I don't really have time right now to
understand the iget routines of every single filesystem to determine
which rely on the current new_inode() allocated inode number. I
think that is best left for a later cleanup, seeing as the
last_ino scalability problem is easily addressed....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 15/17] fs: inode per-cpu last_ino allocator
2010-10-06 6:29 ` Dave Chinner
@ 2010-10-06 8:51 ` Christoph Hellwig
0 siblings, 0 replies; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-06 8:51 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, dada1, linux-fsdevel, linux-kernel
On Wed, Oct 06, 2010 at 05:29:21PM +1100, Dave Chinner wrote:
> Sounds like a good plan, but I don't really have time right now to
> understand the iget routines of every single filesystem to determine
> which rely on the current new_inode() allocated inode number. I
> think that is best left for a later cleanup, seeing as the
> last_ino scalability problem is easily addressed...
It's fairly easy to do it pessimistically - no disk based filesystem
needs it. Anyway, I can do this on top of your series later. It just
seems a bit counter-intuitive to scale something we don't actually need in
most cases.
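The split could be as simple as the sketch below - new_inode() stops
assigning a fake inode number, and the few synthetic filesystems that want
one ask for it explicitly (the helper and function names here are
illustrative, not from the posted series):

/* fs/inode.c */
unsigned int get_next_ino(void)
{
	return last_ino_get();
}
EXPORT_SYMBOL(get_next_ino);

/* e.g. in a synthetic filesystem's inode allocation path */
static struct inode *example_get_inode(struct super_block *sb)
{
	struct inode *inode = new_inode(sb);

	if (inode)
		inode->i_ino = get_next_ino();
	return inode;
}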
* Re: [PATCH] fs: inode per-cpu last_ino allocator
2010-09-30 17:28 ` Eric Dumazet
2010-09-30 17:39 ` Andrew Morton
2010-10-01 6:12 ` Christoph Hellwig
@ 2010-10-16 6:36 ` Nick Piggin
2010-10-16 6:40 ` Nick Piggin
2 siblings, 1 reply; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 6:36 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Andrew Morton, Dave Chinner, linux-fsdevel, linux-kernel
On Thu, Sep 30, 2010 at 07:28:05PM +0200, Eric Dumazet wrote:
> On Thursday 30 September 2010 at 09:45 -0700, Andrew Morton wrote:
>
> > Could eliminate `p' I guess, but that would involve using
> > __get_cpu_var() as an lval, which looks vile and might generate worse
> > code.
> >
>
> Hmm, I see, please check this new patch, using the most modern stuff ;)
>
> > Readers of this code won't know why last_ino_get() was marked noinline.
> > It looks wrong, really.
>
> Oops sorry, this was a temporary hack of mine to ease disassembly
> analysis. Good catch !
>
> Here is the new generated code on i686 (with the noinline) :
> pretty good ;)
>
> c02e5930 <last_ino_get>:
> c02e5930: 55 push %ebp
> c02e5931: 89 e5 mov %esp,%ebp
> c02e5933: 64 a1 44 29 7d c0 mov %fs:0xc07d2944,%eax
> c02e5939: a9 ff 03 00 00 test $0x3ff,%eax
> c02e593e: 74 09 je c02e5949 <last_ino_get+0x19>
> c02e5940: 40 inc %eax
> c02e5941: 64 a3 44 29 7d c0 mov %eax,%fs:0xc07d2944
> c02e5947: c9 leave
> c02e5948: c3 ret
> c02e5949: b8 00 04 00 00 mov $0x400,%eax
> c02e594e: f0 0f c1 05 80 c8 92 c0 lock xadd %eax,0xc092c880
> c02e5956: eb e8 jmp c02e5940 <last_ino_get+0x10>
>
>
> Thanks
>
> [PATCH] fs: inode per-cpu last_ino allocator
Thanks Eric, this looks good. You didn't seem to add a comment about
preempt safety that Andrew wanted, but I'll add it.
>
> new_inode() dirties a contended cache line to get increasing
> inode numbers.
>
> Solve this problem by providing to each cpu a per_cpu variable,
> fed by the shared last_ino, but only once every 1024 allocations.
> This reduces contention on the shared last_ino, and gives the same
> spread of ino numbers as before (i.e. same wraparound after 2^32
> allocations).
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/inode.c | 47 ++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 40 insertions(+), 7 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 8646433..5c233f0 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -624,6 +624,45 @@ void inode_add_to_lists(struct super_block *sb, struct inode *inode)
> }
> EXPORT_SYMBOL_GPL(inode_add_to_lists);
>
> +#define LAST_INO_BATCH 1024
> +
> +/*
> + * Each cpu owns a range of LAST_INO_BATCH numbers.
> + * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
> + * to renew the exhausted range.
> + *
> + * This does not significantly increase overflow rate because every CPU can
> + * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
> + * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
> + * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
> + * overflow rate by 2x, which does not seem too significant.
> + *
> + * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
> + * error if st_ino won't fit in target struct field. Use 32bit counter
> + * here to attempt to avoid that.
> + */
> +static DEFINE_PER_CPU(unsigned int, last_ino);
> +
> +static unsigned int last_ino_get(void)
> +{
> + unsigned int res;
> +
> + get_cpu();
> + res = __this_cpu_read(last_ino);
> +#ifdef CONFIG_SMP
> + if (unlikely((res & (LAST_INO_BATCH - 1)) == 0)) {
> + static atomic_t shared_last_ino;
> + int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
> +
> + res = next - LAST_INO_BATCH;
> + }
> +#endif
> + res++;
> + __this_cpu_write(last_ino, res);
> + put_cpu();
> + return res;
> +}
> +
> /**
> * new_inode - obtain an inode
> * @sb: superblock
> @@ -638,12 +677,6 @@ EXPORT_SYMBOL_GPL(inode_add_to_lists);
> */
> struct inode *new_inode(struct super_block *sb)
> {
> - /*
> - * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
> - * error if st_ino won't fit in target struct field. Use 32bit counter
> - * here to attempt to avoid that.
> - */
> - static unsigned int last_ino;
> struct inode *inode;
>
> spin_lock_prefetch(&inode_lock);
> @@ -652,7 +685,7 @@ struct inode *new_inode(struct super_block *sb)
> if (inode) {
> spin_lock(&inode_lock);
> __inode_add_to_lists(sb, NULL, inode);
> - inode->i_ino = ++last_ino;
> + inode->i_ino = last_ino_get();
> inode->i_state = 0;
> spin_unlock(&inode_lock);
> }
>
>
* Re: [PATCH] fs: inode per-cpu last_ino allocator
2010-10-16 6:36 ` Nick Piggin
@ 2010-10-16 6:40 ` Nick Piggin
0 siblings, 0 replies; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 6:40 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric Dumazet, Andrew Morton, Dave Chinner, linux-fsdevel,
linux-kernel
On Sat, Oct 16, 2010 at 05:36:04PM +1100, Nick Piggin wrote:
> On Thu, Sep 30, 2010 at 07:28:05PM +0200, Eric Dumazet wrote:
> > [PATCH] fs: inode per-cpu last_ino allocator
>
> Thanks Eric, this looks good. You didn't seem to add a comment about
> preempt safety that Andrew wanted, but I'll add it.
Oh I beg your pardon, you added preempt disable in the code. Much safer,
thanks.
* Re: [PATCH 02/17] fs: icache lock s_inodes list
2010-10-01 5:49 ` Christoph Hellwig
@ 2010-10-16 7:54 ` Nick Piggin
2010-10-16 16:12 ` Christoph Hellwig
0 siblings, 1 reply; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 7:54 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 01, 2010 at 01:49:09AM -0400, Christoph Hellwig wrote:
> On Wed, Sep 29, 2010 at 10:18:34PM +1000, Dave Chinner wrote:
> > From: Nick Piggin <npiggin@suse.de>
> >
> > To allow removal of the inode_lock, we first need to protect the
> > superblock inode list with its own lock instead of using the
> > inode_lock for this purpose. Nest the new sb_inode_list_lock inside
> > the inode_lock around the list operations it needs to protect.
>
> Is there any good reason not to make this lock per-superblock?
Because the first part of the inode lock series breaks the lock up
in steps as small and obvious as possible, by adding global locks
protecting the bits that inode_lock used to protect.
If we did want to make it per-superblock, that would come at the
last part of the series, where inode_lock is removed and steps are
being taken to improve scalability and locking.
But I don't see why we want to make it per-superblock really anyway.
We want to have scalability within a single superblock, so per CPU
locks are needed. Once we have those, per-superblock doesn't really
buy much.
* Re: [PATCH 04/17] fs: icache lock i_state
2010-10-01 5:54 ` Christoph Hellwig
@ 2010-10-16 7:54 ` Nick Piggin
0 siblings, 0 replies; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 7:54 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 01, 2010 at 01:54:33AM -0400, Christoph Hellwig wrote:
> > + spin_lock(&inode->i_lock);
> > + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)
> > + || inode->i_mapping->nrpages == 0) {
>
>
> This is some pretty strange formatting.
>
> if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> inode->i_mapping->nrpages == 0) {
>
> would be more standard.
>
> > list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > struct address_space *mapping;
> >
> > - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> > - continue;
> > mapping = inode->i_mapping;
> > if (mapping->nrpages == 0)
> > continue;
> > + spin_lock(&inode->i_lock);
> > + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> > + spin_unlock(&inode->i_lock);
> > + continue;
> > + }
>
> Can we access the mapping safely when the inode isn't actually fully
> setup? Even if we can I'd rather not introduce this change hidden
> inside an unrelated patch.
Good point, fixed.
* Re: [PATCH 08/17] fs: icache protect inode state
2010-10-01 6:02 ` Christoph Hellwig
@ 2010-10-16 7:54 ` Nick Piggin
0 siblings, 0 replies; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 7:54 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 01, 2010 at 02:02:27AM -0400, Christoph Hellwig wrote:
> On Wed, Sep 29, 2010 at 10:18:40PM +1000, Dave Chinner wrote:
> > From: Nick Piggin <npiggin@suse.de>
> >
> > Before removing the inode_lock, we need to protect the inode list
> > operations with the inode->i_lock. This ensures that all inode state
> > changes are serialised regardless of the fact that the lists they
> > are moving around might be protected by different locks. Hence we
> > can safely protect an inode in transit from one list to another
> > without needing to hold all the list locks at the same time.
>
> The subject does not seem to match the patch description and content.
It is adding i_lock around the remaining places where an inode can be moved
on or off the icache data structures. As I've described, this is quite
central to my locking design - isn't the changelog understandable?
* Re: [PATCH 09/17] fs: Make last_ino, iunique independent of inode_lock
2010-10-01 6:08 ` Christoph Hellwig
@ 2010-10-16 7:54 ` Nick Piggin
0 siblings, 0 replies; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 7:54 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 01, 2010 at 02:08:27AM -0400, Christoph Hellwig wrote:
> On Wed, Sep 29, 2010 at 10:18:41PM +1000, Dave Chinner wrote:
> > From: Nick Piggin <npiggin@suse.de>
> >
> > Before removing the inode_lock, we need to make the last_ino and iunique
> > counters independent of the inode_lock. last_ino can be trivially converted to
> > an atomic variable, while the iunique counter needs a new lock nested inside
> > the inode_lock to provide the same protection that the inode_lock previously
> > provided.
>
> Given that last_ino becomes a per-cpu construct only a few patches later
> I think there's no point to make it an atomic_t here - just reorder the
> per-cpu patch before the inode_lock removal.
I wanted to avoid doing any of that until inode_lock is gone, but
perhaps for this one it makes sense. At the very least, I'll merge
the latter two patches into one, and perhaps this one too.
* Re: [PATCH 11/17] fs: Factor inode hash operations into functions
2010-10-01 6:06 ` Christoph Hellwig
@ 2010-10-16 7:54 ` Nick Piggin
0 siblings, 0 replies; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 7:54 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 01, 2010 at 02:06:07AM -0400, Christoph Hellwig wrote:
> On Wed, Sep 29, 2010 at 10:18:43PM +1000, Dave Chinner wrote:
> > From: Nick Piggin <npiggin@suse.de>
> >
> > Before we can replace the inode hash locking with a more scalable
> > mechanism, we need to remove external users of the inode_hash_lock.
> > Make it private by adding a function __remove_inode_hash that can be
> > called by filesystems instead of open-coding their own inode hash
> > removal operations.
>
> I like the factoring, but this changelog is misleading. At least in
> this series no new user of __remove_inode_hash appears, and I'm not sure
> where it would appear anyway. Just making the function global without
> actually exporting it is not helping external filesystems anyway. For
> now it can simply be made static.
Yeah, hugetlbfs was using this a while back as I said, and I've missed
refactoring it. Will do.
* Re: [PATCH 13/17] fs: Implement lazy LRU updates for inodes.
2010-09-30 2:05 ` Christoph Hellwig
@ 2010-10-16 7:54 ` Nick Piggin
0 siblings, 0 replies; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 7:54 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 10:05:17PM -0400, Christoph Hellwig wrote:
> > @@ -1058,8 +1051,6 @@ static void wait_sb_inodes(struct super_block *sb)
> > */
> > WARN_ON(!rwsem_is_locked(&sb->s_umount));
> >
> > - spin_lock(&sb_inode_list_lock);
> > -
> > /*
> > * Data integrity sync. Must wait for all pages under writeback,
> > * because there may have been pages dirtied before our sync
> > @@ -1067,6 +1058,7 @@ static void wait_sb_inodes(struct super_block *sb)
> > * In which case, the inode may not be on the dirty list, but
> > * we still have to wait for that writeout.
> > */
> > + spin_lock(&sb_inode_list_lock);
>
> I think this should be folded back into the patch introducing
> sb_inode_list_lock.
>
> > @@ -1083,10 +1075,10 @@ static void wait_sb_inodes(struct super_block *sb)
> > spin_unlock(&sb_inode_list_lock);
> > /*
> > * We hold a reference to 'inode' so it couldn't have been
> > - * removed from s_inodes list while we dropped the
> > - * sb_inode_list_lock. We cannot iput the inode now as we can
> > - * be holding the last reference and we cannot iput it under
> > - * spinlock. So we keep the reference and iput it later.
> > + * removed from s_inodes list while we dropped the i_lock. We
> > + * cannot iput the inode now as we can be holding the last
> > + * reference and we cannot iput it under spinlock. So we keep
> > + * the reference and iput it later.
>
> This also looks like a hunk that got in by accident and should be merged
> into an earlier patch.
These two actually came from a patch to do rcu locking (which Dave has
changed a bit, but originally due to my fault), so I'll fix those, thanks.
> > @@ -431,11 +412,12 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
> > invalidate_inode_buffers(inode);
> > if (!inode->i_count) {
> > spin_lock(&wb_inode_list_lock);
> > - list_move(&inode->i_list, dispose);
> > + list_del(&inode->i_list);
> > spin_unlock(&wb_inode_list_lock);
> > WARN_ON(inode->i_state & I_NEW);
> > inode->i_state |= I_FREEING;
> > spin_unlock(&inode->i_lock);
> > + list_add(&inode->i_list, dispose);
>
> Moving the list_add out of the lock looks fine, but I can't really
> see how it's related to the rest of the patch.
Just helps show that dispose isn't being protected by
wb_inode_list_lock, I guess.
>
> > + if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
> > + list_del_init(&inode->i_list);
> > + spin_unlock(&inode->i_lock);
> > + atomic_dec(&inodes_stat.nr_unused);
> > + continue;
> > + }
> > + if (inode->i_state) {
>
> Slightly confusing but okay given the only i_state that will get us here
> is I_REFERENCED. Do we really care about the additional cycle or two a
> dumb compiler might generate when writing
>
> if (inode->i_state & I_REFERENCED)
Sure, why not.
>
> ?
>
> > if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> > + list_move(&inode->i_list, &inode_unused);
>
> Why are we now moving the inode to the front of the list?
It was always being moved to the front of the list, but with lazy LRU,
iput_final doesn't move it for us, hence the list_move here.
Without this, it busy-spins and locks badly under heavy reclaim load
when buffers or pagecache can't be invalidated.
Seeing as it wasn't obvious to you, I'll add a comment here.
I was thinking we should probably have a shortcut to go back to the
tail of the LRU in case of invalidation success, but that's out of the
scope of this patch and I never got around to testing such a change
yet.
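Something along these lines, say (a sketch of the comment, not the final
wording):

		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
			/*
			 * With lazy LRU updates, iput_final() no longer
			 * puts unused inodes back on the LRU for us, so
			 * move this one to the front here; otherwise
			 * prune_icache() can busy-spin on inodes whose
			 * pages or buffers cannot be invalidated.
			 */
			list_move(&inode->i_list, &inode_unused);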
* Re: [PATCH 12/17] fs: Introduce per-bucket inode hash locks
2010-09-30 1:52 ` Christoph Hellwig
2010-09-30 2:43 ` Dave Chinner
@ 2010-10-16 7:55 ` Nick Piggin
1 sibling, 0 replies; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 7:55 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 09:52:14PM -0400, Christoph Hellwig wrote:
>
> Instead of doing the lock overkill on a still fundamentally global data
How do you figure it is overkill? Actually, hash insertion/removal
scales *really* well with per-bucket locks, and it is a technique used
and proven in other parts of the kernel, like networking.
Having a global lock there is certainly a huge bottleneck when you
start increasing system size, so I don't know why you keep arguing
against this.
> structure what about replacing this with something better.
I won't be doing this until after the scalability work.
> you've already done this with the XFS icache, and while the per-AG
> concept obviously can't be generic at least some of the lessons could be
> applied.
>
> then again how much testing did this get anyway given that you
> benchmark ran mostly XFS which doesn't hit this at all?
>
> If it was up to me I'd dtop this (and the bl_list addition) from the
> series for now and wait for people who care about the scalability of
> the generic icache code to come up with a better data structure.
I do care about scalability of icache code. Given how simple this
is, and seeing as we're about to have the big locking rework, I
much prefer just fixing all the global locks now (which need to
be fixed anyway).
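To make it concrete, hash insertion with a per-bucket bit lock is
roughly this (simplified sketch; the helper names are illustrative, not
necessarily what the patch ends up using):

static struct hlist_bl_head *inode_hashtable __read_mostly;

static inline void spin_lock_bucket(struct hlist_bl_head *b)
{
	/* the lowest bit of b->first doubles as the bucket lock */
	bit_spin_lock(0, (unsigned long *)&b->first);
}

static inline void spin_unlock_bucket(struct hlist_bl_head *b)
{
	__bit_spin_unlock(0, (unsigned long *)&b->first);
}

void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
	struct hlist_bl_head *b = inode_hashtable + hash(inode->i_sb, hashval);

	spin_lock_bucket(b);
	hlist_bl_add_head(&inode->i_hash, b);
	spin_unlock_bucket(b);
}

No extra lock array, no growth in the hash footprint, and two different
buckets never contend with each other.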
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter
2010-09-30 6:10 ` Dave Chinner
@ 2010-10-16 7:55 ` Nick Piggin
2010-10-16 8:29 ` Eric Dumazet
0 siblings, 1 reply; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 7:55 UTC (permalink / raw)
To: Dave Chinner; +Cc: Andrew Morton, linux-fsdevel, linux-kernel
On Thu, Sep 30, 2010 at 04:10:39PM +1000, Dave Chinner wrote:
> On Wed, Sep 29, 2010 at 09:53:22PM -0700, Andrew Morton wrote:
> > On Wed, 29 Sep 2010 22:18:48 +1000 Dave Chinner <david@fromorbit.com> wrote:
> >
> > > From: Eric Dumazet <dada1@cosmosbay.com>
> > >
> > > The number of inodes allocated does not need to be tied to the
> > > addition or removal of an inode to/from a list. If we are not tied
> > > to a list lock, we could update the counters when inodes are
> > > initialised or destroyed, but to do that we need to convert the
> > > counters to be per-cpu (i.e. independent of a lock). This means that
> > > we have the freedom to change the list/locking implementation
> > > without needing to care about the counters.
> > >
> > >
> > > ...
> > >
> > > +int get_nr_inodes(void)
> > > +{
> > > + int i;
> > > + int sum = 0;
> > > + for_each_possible_cpu(i)
> > > + sum += per_cpu(nr_inodes, i);
> > > + return sum < 0 ? 0 : sum;
> > > +}
> >
> > This reimplements percpu_counter_sum_positive(), rather poorly
Why is it poorly?
> > If one never intends to use the approximate percpu_counter_read() then
> > one could initialise the counter with a really large batch value, for a
> > very small performance gain.
I did that to start with, and I was just looking to shave off cycles
and icache size. this_cpu_inc on x86 on a local variable is really
tiny and fast. percpu_counter does a function call which is large
and clobbers memory and registers, several branches, several loads and
stores, etc.
When it is a simple dumb statistics counter but with a critical
fastpath, this_cpu_inc just seems to be so much better.
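i.e. the whole thing collapses to something like this (rough sketch,
modulo naming):

static DEFINE_PER_CPU(unsigned int, nr_inodes);

static inline void inode_stat_inc(void)
{
	this_cpu_inc(nr_inodes);	/* a single instruction on x86 SMP */
}

static inline void inode_stat_dec(void)
{
	this_cpu_dec(nr_inodes);
}

int get_nr_inodes(void)
{
	int i, sum = 0;

	/* slow path, only for /proc/sys/fs/inode-nr and friends */
	for_each_possible_cpu(i)
		sum += per_cpu(nr_inodes, i);
	return sum < 0 ? 0 : sum;
}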
> > > +int get_nr_inodes_unused(void)
> > > +{
> > > + return inodes_stat.nr_unused;
> > > +}
> > >
> > > ...
> > >
> > > @@ -407,6 +407,8 @@ extern struct files_stat_struct files_stat;
> > > extern int get_max_files(void);
> > > extern int sysctl_nr_open;
> > > extern struct inodes_stat_t inodes_stat;
> > > +extern int get_nr_inodes(void);
> > > +extern int get_nr_inodes_unused(void);
> >
> > These are pretty cruddy names. Unfortunately we don't really have a vfs
> > or "inode" subsystem name to prefix them with.
Any ideas? inodes_stat_nr_unused()?
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 17/17] fs: Clean up inode reference counting
2010-09-30 2:15 ` Christoph Hellwig
@ 2010-10-16 7:55 ` Nick Piggin
2010-10-16 16:14 ` Christoph Hellwig
0 siblings, 1 reply; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 7:55 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 10:15:20PM -0400, Christoph Hellwig wrote:
> Besides moving this much earlier in the series as mentioned before
> the most important thing is giving the helpers a different name as
> the iget name is already used for a different purpose, even if
> got rid of the original iget and only have iget_locked.
It's a bit of a mess. We also have __iget. I don't care much about
the name though so I'll see what you've chosen in future posts.
I'll see how it looks to move earlier in the series, shouldn't
be a big problem.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 01/17] kernel: add bl_list
2010-09-30 4:52 ` Andrew Morton
@ 2010-10-16 7:55 ` Nick Piggin
2010-10-16 16:28 ` Christoph Hellwig
0 siblings, 1 reply; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 7:55 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 09:52:08PM -0700, Andrew Morton wrote:
> On Wed, 29 Sep 2010 22:18:33 +1000 Dave Chinner <david@fromorbit.com> wrote:
>
> > From: Nick Piggin <npiggin@suse.de>
> >
> > Introduce a type of hlist that can support the use of the lowest bit
> > in the hlist_head. This will be subsequently used to implement
> > per-bucket bit spinlock for inode hashes.
> >
> >
> > ...
> >
> > +static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
> > +{
> > + h->next = NULL;
> > + h->pprev = NULL;
> > +}
>
> No need to shout.
Just following the rest of the lists.
> > ...
> >
> > +static inline void hlist_bl_del(struct hlist_bl_node *n)
> > +{
> > + __hlist_bl_del(n);
> > + n->next = LIST_POISON1;
> > + n->pprev = LIST_POISON2;
> > +}
>
> I'd suggest creating new poison values for hlist_bl's, leave
> LIST_POISON1 and LIST_POISON2 for list_head (and any other list
> variants which went and used them :()
I guess they're used for lists, hlists, nulls lists. Would it really
help much seeing as you have so many lists anyway? I guess we could
do an incremental patch but I'll postpone it for now.
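If we did do it, it would presumably just be a couple of new defines
along these lines (the values are made up purely for illustration):

/* in include/linux/poison.h, next to LIST_POISON1/2 */
#define HLIST_BL_POISON1 ((void *) 0x00300300 + POISON_POINTER_DELTA)
#define HLIST_BL_POISON2 ((void *) 0x00300400 + POISON_POINTER_DELTA)

static inline void hlist_bl_del(struct hlist_bl_node *n)
{
	__hlist_bl_del(n);
	n->next = HLIST_BL_POISON1;
	n->pprev = HLIST_BL_POISON2;
}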
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 06/17] fs: icache lock lru/writeback lists
2010-09-30 4:52 ` Andrew Morton
2010-09-30 6:16 ` Dave Chinner
@ 2010-10-16 7:55 ` Nick Piggin
1 sibling, 0 replies; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 7:55 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 09:52:40PM -0700, Andrew Morton wrote:
> On Wed, 29 Sep 2010 22:18:38 +1000 Dave Chinner <david@fromorbit.com> wrote:
>
> > The inode moves between different lists protected by the inode_lock. Introduce
> > a new lock that protects all of the lists (dirty, unused, in use, etc) that the
> > inode will move around as it changes state. As this is mostly a list for
> > protecting the writeback lists, name it wb_inode_list_lock and nest all the
> > list manipulations in this lock inside the current inode_lock scope.
>
> All those spin_trylock()s are real ugly. They're unexplained in the
> changelog and unexplained in code comments.
>
> I'd suggest that each such site have a comment explaining why we're
> resorting to this.
They're really a side effect of how I'm building up the locking in steps
and then streamlining it in steps. Most of them disappear or get much
improved as inode removal, rcu, etc greatly help with lock ordering.
The intermediate steps are not supposed to be pretty so much as an
easily verifiable "ok, we have enough locking to cover what inode_lock
used to be protecting".
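Most of the trylock sites boil down to the same pattern (an illustrative
sketch, not a verbatim hunk; which lock nominally comes first varies by
site): we hold one lock, need another that normally nests outside it, so
we trylock and back off on failure:

again:
	spin_lock(&wb_inode_list_lock);
	if (!spin_trylock(&inode->i_lock)) {
		/* i_lock nests outside the list lock here, so back off */
		spin_unlock(&wb_inode_list_lock);
		cpu_relax();
		goto again;
	}
	list_del_init(&inode->i_list);
	spin_unlock(&wb_inode_list_lock);
	spin_unlock(&inode->i_lock);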
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 07/17] fs: icache atomic inodes_stat
2010-09-30 4:52 ` Andrew Morton
2010-09-30 6:20 ` Dave Chinner
@ 2010-10-16 7:56 ` Nick Piggin
1 sibling, 0 replies; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 7:56 UTC (permalink / raw)
To: Andrew Morton; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Wed, Sep 29, 2010 at 09:52:53PM -0700, Andrew Morton wrote:
> On Wed, 29 Sep 2010 22:18:39 +1000 Dave Chinner <david@fromorbit.com> wrote:
> > From: Nick Piggin <npiggin@suse.de>
> >
> > The inode use statistics are currently protected by the inode_lock.
> > Before we can remove the inode_lock, we need to protect these
> > counters against races. Do this by converting them to atomic
> > counters so they ar enot dependent on any lock at all.
>
> typo
It's Dave, I swear :)
> > +struct inodes_stat_t {
> > + atomic_t nr_inodes;
> > + atomic_t nr_unused;
> > + int dummy[5]; /* padding for sysctl ABI compatibility */
> > +};
>
> OK, that's a hack. The first two "ints" are copied out to userspace.
> This change assumes that sizeof(atomic_t)=4 and that an atomic_t has
> the same layout, alignment and padding as an int.
>
> Probably that's true in current kernels and with current architectures
> but it's a hack and it's presumptive.
>
> It shouldn't be snuck into the tree unchangelogged and uncommented.
>
> (time passes)
>
> OK, I see that all of this gets reverted later on. Please update the
> changelog so the next reviewer doesn't get fooled.
Yeah it is. I might end up folding the per-cpu stuff back over it
and avoid the issue completely. Otherwise I'll add a comment.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 03/17] fs: icache lock inode hash
2010-10-01 6:06 ` Christoph Hellwig
@ 2010-10-16 7:57 ` Nick Piggin
0 siblings, 0 replies; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 7:57 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-kernel
On Fri, Oct 01, 2010 at 02:06:50AM -0400, Christoph Hellwig wrote:
> On Wed, Sep 29, 2010 at 10:18:35PM +1000, Dave Chinner wrote:
> > From: Nick Piggin <npiggin@suse.de>
> >
> > Currently the inode hash lists are protected by the inode_lock. To
> > allow removal of the inode_lock, we need to protect the inode hash
> > table lists with a new lock. Nest the new inode_hash_lock inside the
> > inode_lock to protect the hash lists.
>
> There is no reason to make inode_hash_lock global, it's only used inside
> inode.c
It was like that while hugetlbfs still used it. Fixed, thanks.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter
2010-10-16 7:55 ` Nick Piggin
@ 2010-10-16 8:29 ` Eric Dumazet
2010-10-16 9:07 ` Andrew Morton
0 siblings, 1 reply; 111+ messages in thread
From: Eric Dumazet @ 2010-10-16 8:29 UTC (permalink / raw)
To: Nick Piggin; +Cc: Dave Chinner, Andrew Morton, linux-fsdevel, linux-kernel
On Saturday 16 October 2010 at 18:55 +1100, Nick Piggin wrote:
> On Thu, Sep 30, 2010 at 04:10:39PM +1000, Dave Chinner wrote:
> > On Wed, Sep 29, 2010 at 09:53:22PM -0700, Andrew Morton wrote:
> > > On Wed, 29 Sep 2010 22:18:48 +1000 Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > > From: Eric Dumazet <dada1@cosmosbay.com>
> > > >
> > > > The number of inodes allocated does not need to be tied to the
> > > > addition or removal of an inode to/from a list. If we are not tied
> > > > to a list lock, we could update the counters when inodes are
> > > > initialised or destroyed, but to do that we need to convert the
> > > > counters to be per-cpu (i.e. independent of a lock). This means that
> > > > we have the freedom to change the list/locking implementation
> > > > without needing to care about the counters.
> > > >
> > > >
> > > > ...
> > > >
> > > > +int get_nr_inodes(void)
> > > > +{
> > > > + int i;
> > > > + int sum = 0;
> > > > + for_each_possible_cpu(i)
> > > > + sum += per_cpu(nr_inodes, i);
> > > > + return sum < 0 ? 0 : sum;
> > > > +}
> > >
> > > This reimplements percpu_counter_sum_positive(), rather poorly
>
> Why is it poorly?
Nick
Some people believe the percpu_counter object is the right answer to such
distributed counters, because the loop is done on 'online' cpus instead
of 'possible' cpus. "It must be better if the number of possible cpus is
4096 and only one or two cpus are online"...
But if we do this loop only on rare events, like
"cat /proc/sys/fs/inode-nr", then percpu_counter is more
expensive, because percpu_add() _is_ more expensive:
- It's a function call and a lot of instructions/cycles per call, while
this_cpu_inc(nr_inodes) is a single instruction, using no register on
x86.
- It possibly accesses a shared spinlock and counter when the percpu
counter reaches the batch limit.
To recap: nr_inodes is not a counter that needs to be estimated in real
time, since we have no limit on the number of inodes in the machine (the
limit is the memory allocator).
Unless someone can prove "cat /proc/sys/fs/inode-nr" must be performed
thousands of times per second on their setup, the choice I made to scale
nr_inodes is better than the 'obvious percpu_counter choice'.
This choice was made to scale some counters in the network stack some years
ago, and it rocks.
Thanks
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter
2010-10-16 8:29 ` Eric Dumazet
@ 2010-10-16 9:07 ` Andrew Morton
2010-10-16 9:31 ` Eric Dumazet
0 siblings, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-10-16 9:07 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, 16 Oct 2010 10:29:08 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Saturday 16 October 2010 at 18:55 +1100, Nick Piggin wrote:
> > On Thu, Sep 30, 2010 at 04:10:39PM +1000, Dave Chinner wrote:
> > > On Wed, Sep 29, 2010 at 09:53:22PM -0700, Andrew Morton wrote:
> > > > On Wed, 29 Sep 2010 22:18:48 +1000 Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > > From: Eric Dumazet <dada1@cosmosbay.com>
> > > > >
> > > > > The number of inodes allocated does not need to be tied to the
> > > > > addition or removal of an inode to/from a list. If we are not tied
> > > > > to a list lock, we could update the counters when inodes are
> > > > > initialised or destroyed, but to do that we need to convert the
> > > > > counters to be per-cpu (i.e. independent of a lock). This means that
> > > > > we have the freedom to change the list/locking implementation
> > > > > without needing to care about the counters.
> > > > >
> > > > >
> > > > > ...
> > > > >
> > > > > +int get_nr_inodes(void)
> > > > > +{
> > > > > + int i;
> > > > > + int sum = 0;
> > > > > + for_each_possible_cpu(i)
> > > > > + sum += per_cpu(nr_inodes, i);
> > > > > + return sum < 0 ? 0 : sum;
> > > > > +}
> > > >
> > > > This reimplements percpu_counter_sum_positive(), rather poorly
> >
> > Why is it poorly?
>
> Nick
>
> Some people believe percpu_counter object is the right answer to such
> distributed counters, because the loop is done on 'online' cpus instead
> of 'possible' cpus. "It must be better if number of possible cpus is
> 4096 and only one or two cpus are online"...
>
> But if we do this loop only on rare events, like
> "cat /proc/sys/fs/inode-nr", then the percpu_counter() is more
> expensive, because percpu_add() _is_ more expensive :
>
> - Its a function call and lot of instructions/cycles per call, while
> this_cpu_inc(nr_inodes) is a single instruction, using no register on
> x86.
You want an inlined percpu_counter_inc()? Then write one! Bonus points
for writing this_cpu_add_return() and doing it without a
preempt_disable(). It collapses to just a few instructions.
It's extremely poor form to say "oh X sucks so I'm going to implement
my own" without first addressing why X allegedly sucks.
> - Its possibly accessing a shared spinlock and counter when the percpu
> counter reaches the batch limit.
That's in the noise floor.
> To recap : nr_inodes is not a counter that needs to be estimated in real
> time, since we have not limit on number of inodes in the machine (limit
> is the memory allocator).
>
> Unless someone can prove "cat /proc/sys/fs/inode-nr" must be performed
> thousand of times per second on their setup, the choice I made to scale
> nr_inodes is better over the 'obvious percpu_counter choice'
>
> This choice was made to scale some counters in network stack some years
> ago, and this rocks.
And we get open-coded reimplementations of the same damn thing all over
the tree.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter
2010-10-16 9:07 ` Andrew Morton
@ 2010-10-16 9:31 ` Eric Dumazet
2010-10-16 14:19 ` [PATCH] percpu_counter : add percpu_counter_add_fast() Eric Dumazet
2010-10-21 22:31 ` [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter Andrew Morton
0 siblings, 2 replies; 111+ messages in thread
From: Eric Dumazet @ 2010-10-16 9:31 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Saturday 16 October 2010 at 02:07 -0700, Andrew Morton wrote:
> On Sat, 16 Oct 2010 10:29:08 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Some people believe percpu_counter object is the right answer to such
> > distributed counters, because the loop is done on 'online' cpus instead
> > of 'possible' cpus. "It must be better if number of possible cpus is
> > 4096 and only one or two cpus are online"...
> >
> > But if we do this loop only on rare events, like
> > "cat /proc/sys/fs/inode-nr", then the percpu_counter() is more
> > expensive, because percpu_add() _is_ more expensive :
> >
> > - Its a function call and lot of instructions/cycles per call, while
> > this_cpu_inc(nr_inodes) is a single instruction, using no register on
> > x86.
>
> You want an inlined percpu_counter_inc() then write one! Bonus points
> for writing this_cpu_add_return() and doing it without a
> preempt_disable(). It collapses to just a few instructions.
>
A few instructions, but no guarantee of avoiding false sharing.
Each time one cpu dirties the percpu_counter object, it slows down other
cpus because they need to fetch the cache line again.
Btw, I believe my previous patch against include/linux/percpu_counter.h
was lost. Are you sure I am the right guy to work on percpu_counter
infra ? If yes I can implement your inlined idea.
Thanks
^ permalink raw reply [flat|nested] 111+ messages in thread
* [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-16 9:31 ` Eric Dumazet
@ 2010-10-16 14:19 ` Eric Dumazet
2010-10-18 15:24 ` Christoph Lameter
` (2 more replies)
2010-10-21 22:31 ` [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter Andrew Morton
1 sibling, 3 replies; 111+ messages in thread
From: Eric Dumazet @ 2010-10-16 14:19 UTC (permalink / raw)
To: Andrew Morton
Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel,
Christoph Lameter
Andrew
I based the following patch against linux-2.6; I don't know if Christoph's
previous patch is in a git tree. I'll respin it eventually.
Thanks
[PATCH] percpu_counter : percpu_counter_add_fast()
The current way to change a percpu_counter is to call
percpu_counter_add(), which is a bit expensive.
(More than 40 instructions, possible false sharing, ...)
When we don't need to maintain the approximate value of the
percpu_counter (aka fbc->count), and don't need an "s64" wide counter but
a regular "int" or "long" one, we can use this new function:
percpu_counter_add_fast()
This function is pretty fast:
- One instruction on x86 SMP, no register pressure.
- It is safe in preempt-enabled contexts.
- No lock acquisition, less false sharing.
Users of this percpu_counter variant should not use
percpu_counter_read() or percpu_counter_read_positive() anymore, only
the percpu_counter_sum{_positive}() variants.
Note: we could add an irqsafe variant later, still one instruction on x86
SMP...
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Christoph Lameter <cl@linux-foundation.org>
CC: Nick Piggin <npiggin@kernel.dk>
CC: Dave Chinner <david@fromorbit.com>
---
include/linux/percpu_counter.h | 36 +++++++++++++++++++++++++++----
lib/percpu_counter.c | 12 +++++-----
2 files changed, 38 insertions(+), 10 deletions(-)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 8a7d510..b9f4cc1 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -3,7 +3,9 @@
/*
* A simple "approximate counter" for use in ext2 and ext3 superblocks.
*
- * WARNING: these things are HUGE. 4 kbytes per counter on 32-way P4.
+ * WARNING: these things are big. sizeof(long) bytes per possible cpu per counter.
+ * For a 64 cpus 64bit machine :
+ * 64*8 (512) bytes + sizeof(struct percpu_counter)
*/
#include <linux/spinlock.h>
@@ -21,7 +23,7 @@ struct percpu_counter {
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
#endif
- s32 __percpu *counters;
+ long __percpu *counters;
};
extern int percpu_counter_batch;
@@ -38,7 +40,7 @@ int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
void percpu_counter_destroy(struct percpu_counter *fbc);
void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
-void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
+void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, long batch);
s64 __percpu_counter_sum(struct percpu_counter *fbc);
int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs);
@@ -47,6 +49,24 @@ static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
__percpu_counter_add(fbc, amount, percpu_counter_batch);
}
+/**
+ * percpu_counter_add_fast - fast variant of percpu_counter_add
+ * @fbc: pointer to percpu_counter
+ * @amount: value to add to counter
+ *
+ * Add amount to a percpu_counter object, without approximate (fbc->count)
+ * estimation / correction.
+ * Notes :
+ * - This fast version is limited to "long" counters, not "s64".
+ * - It is preempt safe, but not IRQ safe (on UP)
+ * - Use of percpu_counter_read{_positive}() is discouraged.
+ * - fbc->count accumulates the counters from offlined cpus.
+ */
+static inline void percpu_counter_add_fast(struct percpu_counter *fbc, long amount)
+{
+ this_cpu_add(*fbc->counters, amount);
+}
+
static inline s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
{
s64 ret = __percpu_counter_sum(fbc);
@@ -118,7 +138,15 @@ percpu_counter_add(struct percpu_counter *fbc, s64 amount)
}
static inline void
-__percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
+percpu_counter_add_fast(struct percpu_counter *fbc, long amount)
+{
+ preempt_disable();
+ fbc->count += amount;
+ preempt_enable();
+}
+
+static inline void
+__percpu_counter_add(struct percpu_counter *fbc, s64 amount, long batch)
{
percpu_counter_add(fbc, amount);
}
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index ec9048e..93d50a5 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -18,7 +18,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
spin_lock(&fbc->lock);
for_each_possible_cpu(cpu) {
- s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+ long *pcount = per_cpu_ptr(fbc->counters, cpu);
*pcount = 0;
}
fbc->count = amount;
@@ -26,10 +26,10 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
}
EXPORT_SYMBOL(percpu_counter_set);
-void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
+void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, long batch)
{
s64 count;
- s32 *pcount;
+ long *pcount;
int cpu = get_cpu();
pcount = per_cpu_ptr(fbc->counters, cpu);
@@ -58,7 +58,7 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
spin_lock(&fbc->lock);
ret = fbc->count;
for_each_online_cpu(cpu) {
- s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+ long *pcount = per_cpu_ptr(fbc->counters, cpu);
ret += *pcount;
}
spin_unlock(&fbc->lock);
@@ -72,7 +72,7 @@ int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
spin_lock_init(&fbc->lock);
lockdep_set_class(&fbc->lock, key);
fbc->count = amount;
- fbc->counters = alloc_percpu(s32);
+ fbc->counters = alloc_percpu(long);
if (!fbc->counters)
return -ENOMEM;
#ifdef CONFIG_HOTPLUG_CPU
@@ -123,7 +123,7 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
cpu = (unsigned long)hcpu;
mutex_lock(&percpu_counters_lock);
list_for_each_entry(fbc, &percpu_counters, list) {
- s32 *pcount;
+ long *pcount;
unsigned long flags;
spin_lock_irqsave(&fbc->lock, flags);
^ permalink raw reply related [flat|nested] 111+ messages in thread
* Re: [PATCH 02/17] fs: icache lock s_inodes list
2010-10-16 7:54 ` Nick Piggin
@ 2010-10-16 16:12 ` Christoph Hellwig
2010-10-16 17:09 ` Nick Piggin
0 siblings, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-16 16:12 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 06:54:11PM +1100, Nick Piggin wrote:
> Because in the first part of the inode lock series, it is breaking
> locks in obvious small steps as possible, by adding global locks
> protecting bits of what inode_lock used to.
As seen in Dave's respin, making it per-sb was just as easy as making
it global. And it really is the logical synchronization domain.
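i.e. instead of one global lock, each super_block just carries its own,
roughly (a sketch; the field name is a guess, not necessarily what Dave
used):

	/* in struct super_block */
	spinlock_t		s_inodes_lock;	/* protects s_inodes */
	struct list_head	s_inodes;	/* all inodes for this sb */

	/* adding a new inode */
	spin_lock(&sb->s_inodes_lock);
	list_add(&inode->i_sb_list, &sb->s_inodes);
	spin_unlock(&sb->s_inodes_lock);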
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 17/17] fs: Clean up inode reference counting
2010-10-16 7:55 ` Nick Piggin
@ 2010-10-16 16:14 ` Christoph Hellwig
2010-10-16 17:09 ` Nick Piggin
0 siblings, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-16 16:14 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 06:55:15PM +1100, Nick Piggin wrote:
> On Wed, Sep 29, 2010 at 10:15:20PM -0400, Christoph Hellwig wrote:
> > Besides moving this much earlier in the series as mentioned before
> > the most important thing is giving the helpers a different name as
> > the iget name is already used for a different purpose, even if
> > got rid of the original iget and only have iget_locked.
>
> It's a bit of a mess. We also have __iget. I don't care much about
> the name though so I'll see what you've chosen in future posts.
>
> I'll see how it looks to move earlier in the series, shouldn't
> be a big problem.
Dave has sorted out the naming mess quite nicely in his repost.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 01/17] kernel: add bl_list
2010-10-16 7:55 ` Nick Piggin
@ 2010-10-16 16:28 ` Christoph Hellwig
0 siblings, 0 replies; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-16 16:28 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 06:55:21PM +1100, Nick Piggin wrote:
> > > +static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
> > > +{
> > > + h->next = NULL;
> > > + h->pprev = NULL;
> > > +}
> >
> > No need to shout.
>
> Just following the rest of the lists.
Yes, just following hlist is logical. Anyway, I don't think this
or the poisoning value really matter very much.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 02/17] fs: icache lock s_inodes list
2010-10-16 16:12 ` Christoph Hellwig
@ 2010-10-16 17:09 ` Nick Piggin
2010-10-17 0:42 ` Christoph Hellwig
0 siblings, 1 reply; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 17:09 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 12:12:10PM -0400, Christoph Hellwig wrote:
> On Sat, Oct 16, 2010 at 06:54:11PM +1100, Nick Piggin wrote:
> > Because in the first part of the inode lock series, it is breaking
> > locks in obvious small steps as possible, by adding global locks
> > protecting bits of what inode_lock used to.
>
> As seen by Dave's respin making it per-sb was just as easy as making
> it global. And it really is the logical synchronization domain.
If you want it to be scalable within a single sb, it needs to be
per cpu. If it is per-cpu it does not need to be per-sb as well
which just adds bloat.
And the entire idea of the first half of the inode series is that
it starts simple and just uses globals to demonstrate the locking
steps. It's obviously not supposed to be a "production" locking
model so I prefer it to be like that.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 17/17] fs: Clean up inode reference counting
2010-10-16 16:14 ` Christoph Hellwig
@ 2010-10-16 17:09 ` Nick Piggin
0 siblings, 0 replies; 111+ messages in thread
From: Nick Piggin @ 2010-10-16 17:09 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 12:14:11PM -0400, Christoph Hellwig wrote:
> On Sat, Oct 16, 2010 at 06:55:15PM +1100, Nick Piggin wrote:
> > On Wed, Sep 29, 2010 at 10:15:20PM -0400, Christoph Hellwig wrote:
> > > Besides moving this much earlier in the series as mentioned before
> > > the most important thing is giving the helpers a different name as
> > > the iget name is already used for a different purpose, even if
> > > got rid of the original iget and only have iget_locked.
> >
> > It's a bit of a mess. We also have __iget. I don't care much about
> > the name though so I'll see what you've chosen in future posts.
> >
> > I'll see how it looks to move earlier in the series, shouldn't
> > be a big problem.
>
> Dave has sorted out the naming mess quite nicely in his repost.
Pity we can't use get/put but as I said, no big deal. I'll pick
up those names.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 02/17] fs: icache lock s_inodes list
2010-10-16 17:09 ` Nick Piggin
@ 2010-10-17 0:42 ` Christoph Hellwig
2010-10-17 2:03 ` Nick Piggin
0 siblings, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2010-10-17 0:42 UTC (permalink / raw)
To: Nick Piggin; +Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-kernel
On Sun, Oct 17, 2010 at 04:09:11AM +1100, Nick Piggin wrote:
> If you want it to be scalable within a single sb, it needs to be
> per cpu. If it is per-cpu it does not need to be per-sb as well
> which just adds bloat.
Right now the patches split up the inode lock and do not add
per-cpu magic. It's not any more work to move from per-sb lists
to per-cpu locking if we eventually do it than moving from global
to per-cpu.
I'm not entirely convinced moving s_inodes to a per-cpu list is a good
idea. For now per-sb is just fine for disk filesystems as they have
many more fs-wide cachelines they touch for inode creation/deletion
anyway, and for sockets/pipes a variant of your patch to not ever
add them to s_inodes sounds like the better approach.
If we eventually hit the limit for disk filesystems I have some better
ideas to solve this. One is to abuse whatever data structure we use
for the inode hash also for iterating over all inodes - we only
iterate over them in very few places, and none of them is a fast path.
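Roughly: walk the hash buckets and filter on i_sb, along the lines of
this hand-waved sketch (assumed helpers, nothing of the sort exists in
the series yet):

static void for_each_sb_inode(struct super_block *sb,
			      void (*fn)(struct inode *, void *), void *data)
{
	int i;

	for (i = 0; i < (1 << i_hash_shift); i++) {
		struct hlist_bl_head *b = inode_hashtable + i;
		struct hlist_bl_node *node;
		struct inode *inode;

		spin_lock_bucket(b);
		hlist_bl_for_each_entry(inode, node, b, i_hash) {
			if (inode->i_sb == sb)
				fn(inode, data);
		}
		spin_unlock_bucket(b);
	}
}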
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 02/17] fs: icache lock s_inodes list
2010-10-17 0:42 ` Christoph Hellwig
@ 2010-10-17 2:03 ` Nick Piggin
0 siblings, 0 replies; 111+ messages in thread
From: Nick Piggin @ 2010-10-17 2:03 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Sat, Oct 16, 2010 at 08:42:57PM -0400, Christoph Hellwig wrote:
> On Sun, Oct 17, 2010 at 04:09:11AM +1100, Nick Piggin wrote:
> > If you want it to be scalable within a single sb, it needs to be
> > per cpu. If it is per-cpu it does not need to be per-sb as well
> > which just adds bloat.
>
> Right now the patches split up the inode lock and do not add
> per-cpu magic. It's not any more work to move from per-sb lists
> to per-cpu locking if we eventually do it than moving from global
> to per-cpu.
But it's more work to do per-sb lists than a global list, and as
I'm going to per-cpu locking anyway it's a strange transition
to go from per-sb to per-cpu (rather than per-sb, per-cpu). In short,
the fact that I build up the locking transformations starting with
global locks is just not something that can be held against my
patch set (unless you really disagree with the whole concept of
how the series is structured).
>
> I'm not entirely convinced moving s_inodes to a per-cpu list is a good
> idea. For now per-sb is just fine for disk filesystems as they have
> many more fs-wide cachelines they touch for inode creation/deletion
> anyway, and for sockets/pipes a variant of your patch to not ever
> add them to s_inodes sounds like the better approach.
Traditional filesystems on slow spinning disk are not the main
problem. It's very fast ssds and storage servers. XFS, even with
its per-AG lock splitting, can already hit per-sb scalability bottlenecks
on small servers with not-incredibly-fast storage.
And if the VFS is not scalable, then the contention doesn't even
get pushed into the filesystem so the fs developers never even _see_
the locking problems to fix them.
I'm telling you it will be increasingly a problem because cores and
storage speeds continue to increase, and also people want to manage
more storage with fewer filesystems. It's obvious that it will be a
problem.
I've already got per cpu locking in vfsmounts and files lock, so it's
not magic.
> If we eventually hit the limit for disk filesystems I have some better
> ideas to solve this. One is to abuse whatever data structure we use
> for the inode hash also for iterating over all inodes - we only
> iterate over them in very few places, and none of them is a fast path.
Doing your handwaving about changing data types and better ideas
is just not helpful. _If_ you do have some better ideas, and _if_ we
change the data structure, _then_ it's trivial to change from percpu
locking to your better idea. It just doesn't work as an argument to
slow progress.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-16 14:19 ` [PATCH] percpu_counter : add percpu_counter_add_fast() Eric Dumazet
@ 2010-10-18 15:24 ` Christoph Lameter
2010-10-18 15:39 ` Eric Dumazet
2010-10-21 22:37 ` Andrew Morton
2010-10-21 22:43 ` Andrew Morton
2 siblings, 1 reply; 111+ messages in thread
From: Christoph Lameter @ 2010-10-18 15:24 UTC (permalink / raw)
To: Eric Dumazet
Cc: Andrew Morton, Nick Piggin, Dave Chinner, linux-fsdevel,
linux-kernel
On Sat, 16 Oct 2010, Eric Dumazet wrote:
> I based following patch against linux-2.6, I dont know if previous
> Christoph patch is in a git tree. I'll respin it eventually.
The prior patch was accepted by Andrew.
> + * - It is preempt safe, but not IRQ safe (on UP)
The IRQ safeness depends on the arch. this_cpu_add() in general only
guarantees safety against preemption. It so happens that the x86
implementation is irq safe as well.
The IRQ safety for UP is therefore not an issue if you use this_cpu_add().
If you want to guarantee irqsafeness then use irqsafe_cpu_add() instead.
It generates the same code on x86 for SMP but takes care of the UP issues.
> +static inline void percpu_counter_add_fast(struct percpu_counter *fbc, long amount)
> +{
> + this_cpu_add(*fbc->counters, amount);
> +}
What happens in case of counter overflow?
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-18 15:24 ` Christoph Lameter
@ 2010-10-18 15:39 ` Eric Dumazet
2010-10-18 16:12 ` Christoph Lameter
0 siblings, 1 reply; 111+ messages in thread
From: Eric Dumazet @ 2010-10-18 15:39 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Nick Piggin, Dave Chinner, linux-fsdevel,
linux-kernel
On Monday 18 October 2010 at 10:24 -0500, Christoph Lameter wrote:
> On Sat, 16 Oct 2010, Eric Dumazet wrote:
>
> > I based following patch against linux-2.6, I dont know if previous
> > Christoph patch is in a git tree. I'll respin it eventually.
>
> The prior patch was accepted by Andrew.
>
> > + * - It is preempt safe, but not IRQ safe (on UP)
>
> The IRQ safeness depends on the arch. this_cpu_add() in general only
> guarantees safety against preemption. It so happens that the x86
> implementation is irq safe as well.
>
> The IRQ safety for UP is therefore not an issue if you use this_cpu_add().
>
Nope, on UP, we don't use a per_cpu field, just a "s64 count".
struct percpu_counter {
s64 count;
};
> If you want to guarantee irqsafeness then use irqsafe_cpu_add() instead.
> It generates the same code on x86 for SMP but takes care of the UP issues.
>
> > +static inline void percpu_counter_add_fast(struct percpu_counter *fbc, long amount)
> > +{
> > + this_cpu_add(*fbc->counters, amount);
> > +}
>
> What happens in case of counter overflow?
>
Nothing special; as I stated, the usable width of such counters would be
restricted to a long, not an s64.
It should be enough to count "number of inodes, of sockets, ..."
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-18 15:39 ` Eric Dumazet
@ 2010-10-18 16:12 ` Christoph Lameter
0 siblings, 0 replies; 111+ messages in thread
From: Christoph Lameter @ 2010-10-18 16:12 UTC (permalink / raw)
To: Eric Dumazet
Cc: Andrew Morton, Nick Piggin, Dave Chinner, linux-fsdevel,
linux-kernel
On Mon, 18 Oct 2010, Eric Dumazet wrote:
> Nope, on UP, we dont use a per_cpu field, just a "s64 count".
>
> struct percpu_counter {
> s64 count;
> };
Ok that is percpu counter specific then.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter
2010-10-16 9:31 ` Eric Dumazet
2010-10-16 14:19 ` [PATCH] percpu_counter : add percpu_counter_add_fast() Eric Dumazet
@ 2010-10-21 22:31 ` Andrew Morton
2010-10-21 22:58 ` Eric Dumazet
1 sibling, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-10-21 22:31 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
(working in time-reversed order again :()
On Sat, 16 Oct 2010 11:31:15 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Btw, I believe my previous patch against include/linux/percpu_counter.h
> was lost. Are you sure I am the right guy to work on percpu_counter
> infra ? If yes I can implement your inlined idea.
I don't know what "previous patch" you're referring to.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-16 14:19 ` [PATCH] percpu_counter : add percpu_counter_add_fast() Eric Dumazet
2010-10-18 15:24 ` Christoph Lameter
@ 2010-10-21 22:37 ` Andrew Morton
2010-10-21 23:10 ` Christoph Lameter
2010-10-21 22:43 ` Andrew Morton
2 siblings, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-10-21 22:37 UTC (permalink / raw)
To: Eric Dumazet
Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel,
Christoph Lameter
On Sat, 16 Oct 2010 16:19:14 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> The current way to change a percpu_counter is to call
> percpu_counter_add(), which is a bit expensive.
> (More than 40 instructions, possible false sharing, ...)
>
> When we dont need to maintain the approximate value of the
> percpu_counter (aka fbc->count), and dont need a "s64" wide counter but
> a regular "int" or "long" one, we can use this new function :
> percpu_counter_add_fast()
>
> This function is pretty fast :
> - One instruction on x86 SMP, no register pressure.
> - Is safe in preempt enabled contexts.
> - No lock acquisition, less false sharing.
>
> Users of this percpu_counter variant should not use
> percpu_counter_read() or percpu_counter_read_positive() anymore, only
> percpu_counter_sum{_positive}() variant.
That isn't actually what I was suggesting. I was suggesting the use of
an inlined, this_cpu_add()-using percpu_counter_add() variant which
still does the batched spilling into ->count. IOW, just speed up the
current implementation along the lines of
{
val = this_cpu_add_return(*fbc->counters, amount);
if (unlikely(abs(val) > fbc->batch))
out_of_line_stuff();
}
I suppose what you're proposing here is useful, although the name isn't
a good one. It's a different way of using the existing data structure.
I'd suggest that a better name is something like percpu_counter_add_local()?
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-16 14:19 ` [PATCH] percpu_counter : add percpu_counter_add_fast() Eric Dumazet
2010-10-18 15:24 ` Christoph Lameter
2010-10-21 22:37 ` Andrew Morton
@ 2010-10-21 22:43 ` Andrew Morton
2010-10-21 22:58 ` Eric Dumazet
2 siblings, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-10-21 22:43 UTC (permalink / raw)
To: Eric Dumazet
Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel,
Christoph Lameter
On Sat, 16 Oct 2010 16:19:14 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> The current way to change a percpu_counter is to call
> percpu_counter_add(), which is a bit expensive.
> (More than 40 instructions, possible false sharing, ...)
This is incorrect. With my compiler it's 25 instructions except in the
very rare case where a batch overflow occurs.
And more than half of that is function call entry/exit overhead.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-21 22:43 ` Andrew Morton
@ 2010-10-21 22:58 ` Eric Dumazet
2010-10-21 23:18 ` Andrew Morton
0 siblings, 1 reply; 111+ messages in thread
From: Eric Dumazet @ 2010-10-21 22:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel,
Christoph Lameter
On Thursday 21 October 2010 at 15:43 -0700, Andrew Morton wrote:
> On Sat, 16 Oct 2010 16:19:14 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > The current way to change a percpu_counter is to call
> > percpu_counter_add(), which is a bit expensive.
> > (More than 40 instructions, possible false sharing, ...)
>
> This is incorrect. With my compiler it's 25 instructions except in the
> very rare case where a batch overflow occurs.
>
Hmm
> And more than half of that is function call entry/exit overhead.
>
gcc version 4.5.1
count : 5 instructions to call function
c10cfbb5: a1 00 8f 53 c1 mov 0xc1538f00,%eax
c10cfbba: 31 c9 xor %ecx,%ecx
c10cfbbc: 89 04 24 mov %eax,(%esp)
c10cfbbf: ba 01 00 00 00 mov $0x1,%edx
c10cfbc4: b8 c0 5b 50 c1 mov $0xc1505bc0,%eax
c10cfbc9: e8 a2 64 0b 00 call c1186070 <__percpu_counter_add>
Then 39 instructions in hot path (no lock taken)
So it's more than 40, as I stated
c1186070 <__percpu_counter_add>:
c1186070: 55 push %ebp
c1186071: 89 e5 mov %esp,%ebp
c1186073: 83 ec 1c sub $0x1c,%esp
c1186076: 89 5d f4 mov %ebx,-0xc(%ebp)
c1186079: 89 75 f8 mov %esi,-0x8(%ebp)
c118607c: 89 7d fc mov %edi,-0x4(%ebp)
c118607f: 89 c3 mov %eax,%ebx
c1186081: 8b 73 20 mov 0x20(%ebx),%esi
c1186084: 64 a1 6c 10 59 c1 mov %fs:0xc159106c,%eax
c118608a: 8b 3c 85 a0 7d 53 c1 mov -0x3eac8260(,%eax,4),%edi
c1186091: 01 fe add %edi,%esi
c1186093: 89 75 e8 mov %esi,-0x18(%ebp)
c1186096: 8b 06 mov (%esi),%eax
c1186098: 8b 75 08 mov 0x8(%ebp),%esi
c118609b: 89 c7 mov %eax,%edi
c118609d: 89 45 ec mov %eax,-0x14(%ebp)
c11860a0: c1 ff 1f sar $0x1f,%edi
c11860a3: 01 55 ec add %edx,-0x14(%ebp)
c11860a6: 89 7d f0 mov %edi,-0x10(%ebp)
c11860a9: 89 f7 mov %esi,%edi
c11860ab: 11 4d f0 adc %ecx,-0x10(%ebp)
c11860ae: c1 ff 1f sar $0x1f,%edi
c11860b1: 39 7d f0 cmp %edi,-0x10(%ebp)
c11860b4: 7f 3a jg c11860f0 <__percpu_counter_add+0x80>
c11860b6: 7d 68 jge c1186120 <__percpu_counter_add+0xb0>
c11860b8: 8b 4d 08 mov 0x8(%ebp),%ecx
c11860bb: f7 d9 neg %ecx
c11860bd: 89 ca mov %ecx,%edx
c11860bf: c1 fa 1f sar $0x1f,%edx
c11860c2: 39 55 f0 cmp %edx,-0x10(%ebp)
c11860c5: 7e 19 jle c11860e0 <__percpu_counter_add+0x70>
c11860c7: 8b 7d ec mov -0x14(%ebp),%edi
c11860ca: 8b 75 e8 mov -0x18(%ebp),%esi
c11860cd: 89 3e mov %edi,(%esi)
c11860cf: 8b 5d f4 mov -0xc(%ebp),%ebx
c11860d2: 8b 75 f8 mov -0x8(%ebp),%esi
c11860d5: 8b 7d fc mov -0x4(%ebp),%edi
c11860d8: c9 leave
c11860d9: c3 ret
c11860da: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
c11860e0: 7c 0e jl c11860f0 <__percpu_counter_add+0x80>
c11860e2: 39 4d ec cmp %ecx,-0x14(%ebp)
c11860e5: 77 e0 ja c11860c7 <__percpu_counter_add+0x57>
c11860e7: 89 f6 mov %esi,%esi
c11860e9: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi
c11860f0: 89 d8 mov %ebx,%eax
c11860f2: e8 e9 41 1d 00 call c135a2e0 <_raw_spin_lock>
c11860f7: 8b 45 ec mov -0x14(%ebp),%eax
c11860fa: 8b 55 f0 mov -0x10(%ebp),%edx
c11860fd: 01 43 10 add %eax,0x10(%ebx)
c1186100: 89 d8 mov %ebx,%eax
c1186102: 11 53 14 adc %edx,0x14(%ebx)
c1186105: 8b 55 e8 mov -0x18(%ebp),%edx
c1186108: c7 02 00 00 00 00 movl $0x0,(%edx)
c118610e: e8 6d 41 1d 00 call c135a280 <_raw_spin_unlock>
c1186113: 8b 5d f4 mov -0xc(%ebp),%ebx
c1186116: 8b 75 f8 mov -0x8(%ebp),%esi
c1186119: 8b 7d fc mov -0x4(%ebp),%edi
c118611c: c9 leave
c118611d: c3 ret
c118611e: 66 90 xchg %ax,%ax
c1186120: 8b 7d 08 mov 0x8(%ebp),%edi
c1186123: 39 7d ec cmp %edi,-0x14(%ebp)
c1186126: 73 c8 jae c11860f0 <__percpu_counter_add+0x80>
c1186128: 8b 4d 08 mov 0x8(%ebp),%ecx
c118612b: f7 d9 neg %ecx
c118612d: 89 ca mov %ecx,%edx
c118612f: c1 fa 1f sar $0x1f,%edx
c1186132: 39 55 f0 cmp %edx,-0x10(%ebp)
c1186135: 7f 90 jg c11860c7 <__percpu_counter_add+0x57>
c1186137: eb a7 jmp c11860e0 <__percpu_counter_add+0x70>
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter
2010-10-21 22:31 ` [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter Andrew Morton
@ 2010-10-21 22:58 ` Eric Dumazet
0 siblings, 0 replies; 111+ messages in thread
From: Eric Dumazet @ 2010-10-21 22:58 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel
On Thursday 21 October 2010 at 15:31 -0700, Andrew Morton wrote:
> (working in time-reversed oreder again :()
> I don't know what "previous patch" you're referring to.
Nothing exciting, it was to update the comment that was slightly wrong
about size of percpu_counter...
(percpu_counter: change inaccurate comment)
https://patchwork.kernel.org/patch/238441/
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-21 22:37 ` Andrew Morton
@ 2010-10-21 23:10 ` Christoph Lameter
2010-10-22 0:45 ` Andrew Morton
0 siblings, 1 reply; 111+ messages in thread
From: Christoph Lameter @ 2010-10-21 23:10 UTC (permalink / raw)
To: Andrew Morton
Cc: Eric Dumazet, Nick Piggin, Dave Chinner, linux-fsdevel,
linux-kernel
On Thu, 21 Oct 2010, Andrew Morton wrote:
> That isn't actually what I was suggesting. I was suggesting the use of
> an inlined, this_cpu_add()-using percpu_counter_add() variant which
> still does the batched spilling into ->count. IOW, just speed up the
> current implementation along the lines of
>
> {
> val = this_cpu_add_return(*fbc->counters, amount);
this_cpu_add_return() is not in the kernel but could be realized using a
variant offshoot of cmpxchg_local. I had something like that initially but
omitted it since there was no use case.
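The generic fallback would be something along these lines (a sketch
only; an arch like x86 would instead do the whole thing as a single
percpu xadd and avoid the preempt_disable()/enable() pair):

#define this_cpu_add_return(pcp, val)				\
({								\
	typeof(pcp) __ret;					\
	preempt_disable();					\
	__this_cpu_add(pcp, val);				\
	__ret = __this_cpu_read(pcp);				\
	preempt_enable();					\
	__ret;							\
})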
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-21 22:58 ` Eric Dumazet
@ 2010-10-21 23:18 ` Andrew Morton
2010-10-21 23:22 ` Eric Dumazet
0 siblings, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-10-21 23:18 UTC (permalink / raw)
To: Eric Dumazet
Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel,
Christoph Lameter
On Fri, 22 Oct 2010 00:58:14 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Then 39 instructions in hot path (no lock taken)
Wow, your compiler sucks.
.globl __percpu_counter_add
.type __percpu_counter_add, @function
__percpu_counter_add:
pushq %rbp #
movq %rsp, %rbp #,
pushq %r13 #
movq %rdi, %r13 # fbc, fbc
pushq %r12 #
pushq %rbx #
pushq %r10 #
movq 96(%rdi), %rbx # <variable>.counters, pcount
#APP
movl %gs:cpu_number,%eax # cpu_number, pfo_ret__
#NO_APP
cltq
addq __per_cpu_offset(,%rax,8), %rbx # __per_cpu_offset, pcount
movslq (%rbx),%rax #* pcount, tmp70
leaq (%rax,%rsi), %r12 #, count
movslq %edx,%rax # batch, batch
cmpq %rax, %r12 # batch, count
jge .L43 #,
negl %edx # batch
movslq %edx,%rax # batch, tmp74
cmpq %rax, %r12 # tmp74, count
jg .L45 #,
.L43:
movq %r13, %rdi # fbc, D.14396
call _raw_spin_lock #
addq %r12, 72(%r13) # count, <variable>.count
movq %r13, %rdi # fbc, D.14396
movl $0, (%rbx) #,* pcount
call _raw_spin_unlock #
jmp .L47 #
.L45:
movl %r12d, (%rbx) # count,* pcount
.L47:
popq %r9 #
popq %rbx #
popq %r12 #
popq %r13 #
leave
ret
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-21 23:18 ` Andrew Morton
@ 2010-10-21 23:22 ` Eric Dumazet
0 siblings, 0 replies; 111+ messages in thread
From: Eric Dumazet @ 2010-10-21 23:22 UTC (permalink / raw)
To: Andrew Morton
Cc: Nick Piggin, Dave Chinner, linux-fsdevel, linux-kernel,
Christoph Lameter
On Thursday 21 October 2010 at 16:18 -0700, Andrew Morton wrote:
> On Fri, 22 Oct 2010 00:58:14 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > Then 39 instructions in hot path (no lock taken)
>
> Wow, your compiler sucks.
>
Actually my machines are mostly 32bit. Is that a problem?
Compiler is OK, really.
percpu_counter() is misnamed, it should be named percpu_counter64() or
something like that :-(
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-21 23:10 ` Christoph Lameter
@ 2010-10-22 0:45 ` Andrew Morton
2010-10-22 1:55 ` Andrew Morton
2010-10-22 4:12 ` Eric Dumazet
0 siblings, 2 replies; 111+ messages in thread
From: Andrew Morton @ 2010-10-22 0:45 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric Dumazet, Nick Piggin, Dave Chinner, linux-fsdevel,
linux-kernel
On Thu, 21 Oct 2010 18:10:13 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:
> On Thu, 21 Oct 2010, Andrew Morton wrote:
>
> > That isn't actually what I was suggesting. I was suggesting the use of
> > an inlined, this_cpu_add()-using percpu_counter_add() variant which
> > still does the batched spilling into ->count. IOW, just speed up the
> > current implementation along the lines of
> >
> > {
> > val = this_cpu_add_return(*fbc->counters, amount);
>
> this_cpu_add_return() is not in the kernel but could be realized using a
> variant offshoot of cmpxchg_local. I had something like that initially but
> omitted it since there was no use case.
this_cpu_add_return() isn't really needed in this application.
{
this_cpu_add(*fbc->counters, amount);
if (unlikely(abs(this_cpu_read(*fbc->counters)) > fbc->batch))
out_of_line_stuff();
}
will work just fine.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-22 0:45 ` Andrew Morton
@ 2010-10-22 1:55 ` Andrew Morton
2010-10-22 1:58 ` Nick Piggin
2010-10-22 4:12 ` Eric Dumazet
1 sibling, 1 reply; 111+ messages in thread
From: Andrew Morton @ 2010-10-22 1:55 UTC (permalink / raw)
To: Christoph Lameter, Eric Dumazet, Nick Piggin, Dave Chinner,
linux-fsdevel, linux-ker
On Thu, 21 Oct 2010 17:45:36 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> this_cpu_add_return() isn't really needed in this application.
>
> {
> this_cpu_add(*fbc->counters, amount);
> if (unlikely(abs(this_cpu_read(*fbc->counters)) > fbc->batch))
> out_of_line_stuff();
> }
>
> will work just fine.
Did that. Got alarmed at a few things.
The compiler cannot CSE the above code - it has to reload the percpu
base each time. Doing it by hand:
{
long *p;
p = this_cpu_ptr(fbc->counters);
*p += amount;
if (unlikely(abs(*p) > fbc->batch))
out_of_line_stuff();
}
generates better code.
So this:
static __always_inline void percpu_counter_add_batch(struct percpu_counter *fbc,
s64 amount, long batch)
{
long *pcounter;
preempt_disable();
pcounter = this_cpu_ptr(fbc->counters);
*pcounter += amount;
if (unlikely(abs(*pcounter) >= batch))
percpu_counter_handle_overflow(fbc);
preempt_enable();
}
when compiling this:
--- a/lib/proportions.c~b
+++ a/lib/proportions.c
@@ -263,6 +263,11 @@ void __prop_inc_percpu(struct prop_descr
prop_put_global(pd, pg);
}
+void foo(struct prop_local_percpu *pl)
+{
+ percpu_counter_add(&pl->events, 1);
+}
+
/*
* identical to __prop_inc_percpu, except that it limits this pl's fraction to
* @frac/PROP_FRAC_BASE by ignoring events when this limit has been exceeded.
comes down to
.globl foo
.type foo, @function
foo:
pushq %rbp #
movslq percpu_counter_batch(%rip),%rcx # percpu_counter_batch, batch
movq 96(%rdi), %rdx # <variable>.counters, tcp_ptr__
movq %rsp, %rbp #,
#APP
add %gs:this_cpu_off, %rdx # this_cpu_off, tcp_ptr__
#NO_APP
movq (%rdx), %rax #* tcp_ptr__, D.11817
incq %rax # D.11817
movq %rax, (%rdx) # D.11817,* tcp_ptr__
cqto
xorq %rdx, %rax # tmp67, D.11817
subq %rdx, %rax # tmp67, D.11817
cmpq %rcx, %rax # batch, D.11817
jl .L33 #,
call percpu_counter_handle_overflow #
.L33:
leave
ret
But what's really alarming is that the compiler (4.0.2) is cheerily
ignoring the inline directives and was generating out-of-line versions
of most of the percpu_counter.h functions into lib/proportions.s.
That's rather a worry.
lib/proportions.o got rather larger as a result of inlining things and
it's not obvious that it's all a net benefit.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-22 1:55 ` Andrew Morton
@ 2010-10-22 1:58 ` Nick Piggin
2010-10-22 2:14 ` Andrew Morton
0 siblings, 1 reply; 111+ messages in thread
From: Nick Piggin @ 2010-10-22 1:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Eric Dumazet, Nick Piggin, Dave Chinner,
linux-fsdevel, linux-kernel
On Thu, Oct 21, 2010 at 06:55:16PM -0700, Andrew Morton wrote:
> On Thu, 21 Oct 2010 17:45:36 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > this_cpu_add_return() isn't really needed in this application.
> >
> > {
> > this_cpu_add(*fbc->counters, amount);
> > if (unlikely(abs(this_cpu_read(*fbc->counters)) > fbc->batch))
> > out_of_line_stuff();
> > }
> >
> > will work just fine.
>
> Did that. Got alarmed at a few things.
>
> The compiler cannot CSE the above code - it has to reload the percpu
> base each time. Doing it by hand:
>
>
> {
> long *p;
>
> p = this_cpu_ptr(fbc->counters);
> *p += amount;
> if (unlikely(abs(*p) > fbc->batch))
> out_of_line_stuff();
> }
>
> generates better code.
>
> So this:
>
> static __always_inline void percpu_counter_add_batch(struct percpu_counter *fbc,
> s64 amount, long batch)
> {
> long *pcounter;
>
> preempt_disable();
> pcounter = this_cpu_ptr(fbc->counters);
> *pcounter += amount;
> if (unlikely(abs(*pcounter) >= batch))
> percpu_counter_handle_overflow(fbc);
> preempt_enable();
> }
>
> when compiling this:
>
> --- a/lib/proportions.c~b
> +++ a/lib/proportions.c
> @@ -263,6 +263,11 @@ void __prop_inc_percpu(struct prop_descr
> prop_put_global(pd, pg);
> }
>
> +void foo(struct prop_local_percpu *pl)
> +{
> + percpu_counter_add(&pl->events, 1);
> +}
> +
> /*
> * identical to __prop_inc_percpu, except that it limits this pl's fraction to
> * @frac/PROP_FRAC_BASE by ignoring events when this limit has been exceeded.
>
> comes down to
>
> .globl foo
> .type foo, @function
> foo:
> pushq %rbp #
> movslq percpu_counter_batch(%rip),%rcx # percpu_counter_batch, batch
> movq 96(%rdi), %rdx # <variable>.counters, tcp_ptr__
> movq %rsp, %rbp #,
> #APP
> add %gs:this_cpu_off, %rdx # this_cpu_off, tcp_ptr__
> #NO_APP
> movq (%rdx), %rax #* tcp_ptr__, D.11817
> incq %rax # D.11817
> movq %rax, (%rdx) # D.11817,* tcp_ptr__
> cqto
> xorq %rdx, %rax # tmp67, D.11817
> subq %rdx, %rax # tmp67, D.11817
> cmpq %rcx, %rax # batch, D.11817
> jl .L33 #,
> call percpu_counter_handle_overflow #
> .L33:
> leave
> ret
>
>
> But what's really alarming is that the compiler (4.0.2) is cheerily
> ignoring the inline directives and was generating out-of-line versions
> of most of the percpu_counter.h functions into lib/proportions.s.
> That's rather a worry.
Do you have the "ignore inlining" and "compile for size" turned
on? They often suck.
^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-22 1:58 ` Nick Piggin
@ 2010-10-22 2:14 ` Andrew Morton
0 siblings, 0 replies; 111+ messages in thread
From: Andrew Morton @ 2010-10-22 2:14 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, Eric Dumazet, Dave Chinner, linux-fsdevel,
linux-kernel
On Fri, 22 Oct 2010 12:58:45 +1100 Nick Piggin <npiggin@kernel.dk> wrote:
> > But what's really alarming is that the compiler (4.0.2) is cheerily
> > ignoring the inline directives and was generating out-of-line versions
> > of most of the percpu_counter.h functions into lib/proportions.s.
> > That's rather a worry.
>
> Do you have the "ignore inlining"
# CONFIG_OPTIMIZE_INLINING is not set
> and "compile for size" turned on?
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
> They often suck.
Everything sucks. Are they any use?
With
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_OPTIMIZE_INLINING=y
my kernel/built-in.o text went from 563638 bytes to 659852.
That's rather a lot.
I haven't looked at this stuff in years. Has anyone dug into it?
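When gcc's heuristics refuse the hint, particularly under -Os, the usual
escape hatch is __always_inline, which the percpu_counter_add_batch()
sketch above already uses. A minimal, standalone illustration of the
difference (the helpers here are made up for the example, not taken from
any header):

#include <linux/compiler.h>

/* A plain 'inline' is only a request: with CONFIG_OPTIMIZE_INLINING=y
 * or -Os, gcc may still emit an out-of-line copy of this helper in
 * every object file that uses it. */
static inline long bump_hint(long *p, long amount)
{
	*p += amount;
	return *p;
}

/* __always_inline removes that freedom from the compiler. */
static __always_inline long bump_forced(long *p, long amount)
{
	*p += amount;
	return *p;
}

Whether forcing inlining everywhere is worth the extra text is exactly the
size-versus-speed trade-off that the numbers above quantify.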
* Re: [PATCH] percpu_counter : add percpu_counter_add_fast()
2010-10-22 0:45 ` Andrew Morton
2010-10-22 1:55 ` Andrew Morton
@ 2010-10-22 4:12 ` Eric Dumazet
1 sibling, 0 replies; 111+ messages in thread
From: Eric Dumazet @ 2010-10-22 4:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Nick Piggin, Dave Chinner, linux-fsdevel,
linux-kernel
On Thursday, 21 October 2010 at 17:45 -0700, Andrew Morton wrote:
> this_cpu_add_return() isn't really needed in this application.
>
> {
> this_cpu_add(*fbc->counters, amount);
> if (unlikely(abs(this_cpu_read(*fbc->counters)) > fbc->batch))
> out_of_line_stuff();
> }
>
> will work just fine.
Hmm, you cannot do this on 32-bit machines because "amount" is 64 bits
wide.
Switching the per-cpu counters to s64 is not an option (it makes the
summation racy and increases memory use).
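To make the width mismatch concrete: in the upstream percpu_counter the
aggregate count is 64-bit while the per-cpu slots are only 32-bit, so a
64-bit amount cannot be applied with a single per-cpu add on 32-bit
machines. A rough sketch of a fast add whose delta is narrow enough to
always fit the per-cpu slot (the struct layout is simplified and all names
here are illustrative, not the interface that was eventually merged):

#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/* Simplified layout, mirroring include/linux/percpu_counter.h:
 * a 64-bit aggregate plus small 32-bit per-cpu deltas. */
struct pcc_sketch {
	spinlock_t	lock;		/* protects 'count' during folds */
	s64		count;		/* aggregate value */
	s32 __percpu	*counters;	/* per-cpu deltas */
};

#define PCC_SKETCH_BATCH	32	/* illustrative batch size */

/* Rare slow path: fold this CPU's delta into the 64-bit total.
 * spin_lock() also disables preemption, so the read and the write-back
 * of zero hit the same CPU's slot.  Like the percpu_counter of this
 * era, it is not IRQ safe. */
static void pcc_sketch_fold(struct pcc_sketch *fbc)
{
	spin_lock(&fbc->lock);
	fbc->count += this_cpu_read(*fbc->counters);
	this_cpu_write(*fbc->counters, 0);
	spin_unlock(&fbc->lock);
}

/* Fast path: one per-cpu add and no preempt_disable(), because the
 * 32-bit 'amount' always fits the per-cpu slot.  A migration between
 * the add and the read only makes the batch check approximate; it
 * never loses a count. */
static inline void pcc_sketch_add_fast(struct pcc_sketch *fbc, s32 amount)
{
	this_cpu_add(*fbc->counters, amount);
	if (unlikely(abs(this_cpu_read(*fbc->counters)) >= PCC_SKETCH_BATCH))
		pcc_sketch_fold(fbc);
}

For a caller like the per-cpu nr_inodes accounting in this series the delta
is only ever plus or minus one, so the narrow interface costs nothing.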
end of thread
Thread overview: 111+ messages
2010-09-29 12:18 [PATCH 0/17] fs: Inode cache scalability Dave Chinner
2010-09-29 12:18 ` [PATCH 01/17] kernel: add bl_list Dave Chinner
2010-09-30 4:52 ` Andrew Morton
2010-10-16 7:55 ` Nick Piggin
2010-10-16 16:28 ` Christoph Hellwig
2010-10-01 5:48 ` Christoph Hellwig
2010-09-29 12:18 ` [PATCH 02/17] fs: icache lock s_inodes list Dave Chinner
2010-10-01 5:49 ` Christoph Hellwig
2010-10-16 7:54 ` Nick Piggin
2010-10-16 16:12 ` Christoph Hellwig
2010-10-16 17:09 ` Nick Piggin
2010-10-17 0:42 ` Christoph Hellwig
2010-10-17 2:03 ` Nick Piggin
2010-09-29 12:18 ` [PATCH 03/17] fs: icache lock inode hash Dave Chinner
2010-09-30 4:52 ` Andrew Morton
2010-09-30 6:13 ` Dave Chinner
2010-10-01 6:06 ` Christoph Hellwig
2010-10-16 7:57 ` Nick Piggin
2010-09-29 12:18 ` [PATCH 04/17] fs: icache lock i_state Dave Chinner
2010-10-01 5:54 ` Christoph Hellwig
2010-10-16 7:54 ` Nick Piggin
2010-09-29 12:18 ` [PATCH 05/17] fs: icache lock i_count Dave Chinner
2010-09-30 4:52 ` Andrew Morton
2010-10-01 5:55 ` Christoph Hellwig
2010-10-01 6:04 ` Andrew Morton
2010-10-01 6:16 ` Christoph Hellwig
2010-10-01 6:23 ` Andrew Morton
2010-09-29 12:18 ` [PATCH 06/17] fs: icache lock lru/writeback lists Dave Chinner
2010-09-30 4:52 ` Andrew Morton
2010-09-30 6:16 ` Dave Chinner
2010-10-16 7:55 ` Nick Piggin
2010-10-01 6:01 ` Christoph Hellwig
2010-10-05 22:30 ` Dave Chinner
2010-09-29 12:18 ` [PATCH 07/17] fs: icache atomic inodes_stat Dave Chinner
2010-09-30 4:52 ` Andrew Morton
2010-09-30 6:20 ` Dave Chinner
2010-09-30 6:37 ` Andrew Morton
2010-10-16 7:56 ` Nick Piggin
2010-09-29 12:18 ` [PATCH 08/17] fs: icache protect inode state Dave Chinner
2010-10-01 6:02 ` Christoph Hellwig
2010-10-16 7:54 ` Nick Piggin
2010-09-29 12:18 ` [PATCH 09/17] fs: Make last_ino, iunique independent of inode_lock Dave Chinner
2010-09-30 4:53 ` Andrew Morton
2010-10-01 6:08 ` Christoph Hellwig
2010-10-16 7:54 ` Nick Piggin
2010-09-29 12:18 ` [PATCH 10/17] fs: icache remove inode_lock Dave Chinner
2010-09-29 12:18 ` [PATCH 11/17] fs: Factor inode hash operations into functions Dave Chinner
2010-10-01 6:06 ` Christoph Hellwig
2010-10-16 7:54 ` Nick Piggin
2010-09-29 12:18 ` [PATCH 12/17] fs: Introduce per-bucket inode hash locks Dave Chinner
2010-09-30 1:52 ` Christoph Hellwig
2010-09-30 2:43 ` Dave Chinner
2010-10-16 7:55 ` Nick Piggin
2010-09-29 12:18 ` [PATCH 13/17] fs: Implement lazy LRU updates for inodes Dave Chinner
2010-09-30 2:05 ` Christoph Hellwig
2010-10-16 7:54 ` Nick Piggin
2010-09-29 12:18 ` [PATCH 14/17] fs: Inode counters do not need to be atomic Dave Chinner
2010-09-29 12:18 ` [PATCH 15/17] fs: inode per-cpu last_ino allocator Dave Chinner
2010-09-30 2:07 ` Christoph Hellwig
2010-10-06 6:29 ` Dave Chinner
2010-10-06 8:51 ` Christoph Hellwig
2010-09-30 4:53 ` Andrew Morton
2010-09-30 5:36 ` Eric Dumazet
2010-09-30 7:53 ` Eric Dumazet
2010-09-30 8:14 ` Andrew Morton
2010-09-30 10:22 ` [PATCH] " Eric Dumazet
2010-09-30 16:45 ` Andrew Morton
2010-09-30 17:28 ` Eric Dumazet
2010-09-30 17:39 ` Andrew Morton
2010-09-30 18:05 ` Eric Dumazet
2010-10-01 6:12 ` Christoph Hellwig
2010-10-01 6:45 ` Eric Dumazet
2010-10-16 6:36 ` Nick Piggin
2010-10-16 6:40 ` Nick Piggin
2010-09-29 12:18 ` [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter Dave Chinner
2010-09-30 2:12 ` Christoph Hellwig
2010-09-30 4:53 ` Andrew Morton
2010-09-30 6:10 ` Dave Chinner
2010-10-16 7:55 ` Nick Piggin
2010-10-16 8:29 ` Eric Dumazet
2010-10-16 9:07 ` Andrew Morton
2010-10-16 9:31 ` Eric Dumazet
2010-10-16 14:19 ` [PATCH] percpu_counter : add percpu_counter_add_fast() Eric Dumazet
2010-10-18 15:24 ` Christoph Lameter
2010-10-18 15:39 ` Eric Dumazet
2010-10-18 16:12 ` Christoph Lameter
2010-10-21 22:37 ` Andrew Morton
2010-10-21 23:10 ` Christoph Lameter
2010-10-22 0:45 ` Andrew Morton
2010-10-22 1:55 ` Andrew Morton
2010-10-22 1:58 ` Nick Piggin
2010-10-22 2:14 ` Andrew Morton
2010-10-22 4:12 ` Eric Dumazet
2010-10-21 22:43 ` Andrew Morton
2010-10-21 22:58 ` Eric Dumazet
2010-10-21 23:18 ` Andrew Morton
2010-10-21 23:22 ` Eric Dumazet
2010-10-21 22:31 ` [PATCH 16/17] fs: Convert nr_inodes to a per-cpu counter Andrew Morton
2010-10-21 22:58 ` Eric Dumazet
2010-10-02 16:02 ` Christoph Hellwig
2010-09-29 12:18 ` [PATCH 17/17] fs: Clean up inode reference counting Dave Chinner
2010-09-30 2:15 ` Christoph Hellwig
2010-10-16 7:55 ` Nick Piggin
2010-10-16 16:14 ` Christoph Hellwig
2010-10-16 17:09 ` Nick Piggin
2010-09-30 4:53 ` Andrew Morton
2010-09-29 23:57 ` [PATCH 0/17] fs: Inode cache scalability Christoph Hellwig
2010-09-30 0:24 ` Dave Chinner
2010-09-30 2:21 ` Christoph Hellwig
2010-10-02 23:10 ` Carlos Carvalho
2010-10-04 7:22 ` Dave Chinner