public inbox for linux-xfs@vger.kernel.org
* [PATCH 0/18] xfs: metadata scalability V3
@ 2010-09-24 12:30 Dave Chinner
  2010-09-24 12:30 ` [PATCH 01/18] xfs: force background CIL push under sustained load Dave Chinner
                   ` (17 more replies)
  0 siblings, 18 replies; 29+ messages in thread
From: Dave Chinner @ 2010-09-24 12:30 UTC
  To: xfs

Version 3:

o added CIL background push fixup. While it is a correctness bug fix,
  it also significantly speeds up sustained workloads. This version
  of the patch has addressed the review comments.
o cleaned up some typos and removed useless comments around timestamp
  changes
o changed xfs_buf_get_uncached() parameters to pass the buftarg first.
o split inode walk batch lookup in two patches to separate out grabbing and
  releasing inodes from the batch lookups.

Version 2:
o dropped inode cache RCU/spinlock conversion (needs more testing)
o dropped buffer cache LRU/no page cache conversion (needs more testing)
o added CIL item insertion cleanup as suggested by Christoph.
o added flags to xfs_buf_get_uncached() and xfs_buf_read_uncached()
  to control memory allocation flags.
o cleaned up buffer page allocation failure path
o reworked inode reclaim shrinker scalability
	- separated reclaim AG walk from sync walks
	- implemented batch lookups for both sync and reclaim walks
	- added per-ag reclaim serialisation locks and traversal
	  cursors

This patchset started out as a "convert the buffer cache to rbtrees"
patch, and just grew from there as I peeled the onion from one
bottleneck to another. The second version of this patch does not go
as far as the first version - it drops the more radical changes as
they are not ready for integration yet.

The lock contention reductions allowed by the RCU inode cache
lookups are replaced by more efficient lookup mechanisms during
inode cache walking - using batching mechanisms as originally
suggested by Nick Piggin. The code is a lot more efficient than
Nick's proof of concept as it uses batched gang lookups on the radix
trees. These batched lookups show almost the same performance
improvement as the RCU lookup did but without changing the locking
algorithms at all.  This batching would be necessary for efficient
reclaim walks regardless of whether the sync walk is protected by
RCU or the current rwlock.
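
To make the batching idea concrete, here is a minimal userspace sketch,
not the code from these patches. A sorted array stands in for the per-AG
radix tree, a counter stands in for the lock, and every name used
(ino_index, gang_lookup, walk_batched, BATCH_SIZE) is hypothetical. The
only point it illustrates is that the lock is taken once per batch
rather than once per inode.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BATCH_SIZE 4	/* hypothetical; kept small so the example is easy to trace */

/* Stand-in for a per-AG inode index: a sorted array of inode numbers. */
struct ino_index {
	const unsigned long	*inos;			/* sorted ascending */
	size_t			count;
	unsigned long		lock_round_trips;	/* times the "lock" was taken */
};

/*
 * Copy up to max inode numbers >= first into batch[] and return how many
 * were found. Only the calling shape resembles a gang lookup; the search
 * itself is a plain linear scan.
 */
static size_t gang_lookup(const struct ino_index *idx, unsigned long first,
			  unsigned long *batch, size_t max)
{
	size_t found = 0;

	for (size_t i = 0; i < idx->count && found < max; i++) {
		if (idx->inos[i] >= first)
			batch[found++] = idx->inos[i];
	}
	return found;
}

/* Visit every inode, taking the lock once per batch. Returns inodes visited. */
static unsigned long walk_batched(struct ino_index *idx,
				  void (*visit)(unsigned long ino))
{
	unsigned long batch[BATCH_SIZE];
	unsigned long first = 0;
	unsigned long visited = 0;

	assert(idx != NULL);
	for (;;) {
		size_t nr;

		idx->lock_round_trips++;	/* lock would be taken here */
		nr = gang_lookup(idx, first, batch, BATCH_SIZE);
		/* lock would be dropped here, before the per-inode work */
		if (nr == 0)
			break;

		for (size_t i = 0; i < nr; i++) {
			if (visit)
				visit(batch[i]);
			visited++;
		}
		if (batch[nr - 1] == ULONG_MAX)
			break;			/* cannot advance past the last index */
		first = batch[nr - 1] + 1;	/* resume after the last inode seen */
	}
	return visited;
}
```

With ten inodes and a batch size of four, this walk takes the lock four
times (three batches plus the empty lookup that ends the walk) instead
of ten.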

The shrinker rework improves parallel unlink performance
substantially more than just single threading the shrinker execution
and does not have the OOM problems that single threading the
shrinker had. It avoids the OOM problems by ensuring that every
shrinker call does some work or sleeps while waiting for an AG to do
some work on. The lookup optimisations done for gang lookups ensure
that the scanning is as efficient as possible, so overall shrinker
overhead has gone down significantly.
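
The "does some work or sleeps" rule can be pictured with the sketch
below. It is an assumption about the general shape only, not the reclaim
code from the patches: a boolean stands in for a per-AG reclaim lock,
and the names ag_reclaim, reclaim_pick and pick_reclaim_ag are
hypothetical. A first pass behaves like a trylock over every AG; only if
all of them are busy does the caller fall back to waiting on one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for per-AG reclaim state; "locked" models a held reclaim lock. */
struct ag_reclaim {
	bool	locked;
};

struct reclaim_pick {
	size_t	agno;		/* AG this shrinker call should work on */
	bool	must_wait;	/* true: every AG was busy, so sleep on this one */
};

/*
 * Starting at AG "start", return the first AG whose lock is free. If all
 * AGs are busy, return the starting AG with must_wait set, so the caller
 * sleeps on that lock instead of returning without doing any work.
 */
static struct reclaim_pick pick_reclaim_ag(const struct ag_reclaim *ags,
					   size_t nr_ags, size_t start)
{
	struct reclaim_pick pick;

	assert(ags != NULL && nr_ags > 0);
	pick.agno = start % nr_ags;
	pick.must_wait = true;

	for (size_t i = 0; i < nr_ags; i++) {
		size_t agno = (start + i) % nr_ags;

		if (!ags[agno].locked) {	/* a trylock would succeed here */
			pick.agno = agno;
			pick.must_wait = false;
			break;
		}
	}
	return pick;
}
```

Either outcome makes progress: the caller reclaims from a free AG
straight away, or it blocks until a busy AG is released and then
reclaims from that one.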

Performance numbers here are 8-way fs_mark create to 50M files, and
8-way rm -rf to remove the files created.

			wall time	fs_mark rate
2.6.36-rc4:
	create:		13m10s		65k files/s
	unlink:		23m58s		N/A

2.6.36-rc4 + v1-patchset:
	create:		 9m47s		95k files/s
	unlink:		14m16s		N/A

2.6.36-rc3 + v2-patchset:
	create:		10m32s		85k files/s
	unlink:		11m49s		N/A

2.6.36-rc4 + v3-patchset:
	create:		10m03s		90k files/s
	unlink:		11m29s		N/A

The patches are available in the following git tree. The branch is
based on the current OSS xfs tree, and as such is based on
2.6.36-rc4. This is a rebase of the previous branch.

The following changes since commit e89318c670af3959db3aa483da509565f5a2536c:

  xfs: eliminate some newly-reported gcc warnings (2010-09-16 12:56:42 -0500)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git metadata-scale

Dave Chinner (18):
      xfs: force background CIL push under sustained load
      xfs: reduce the number of CIL lock round trips during commit
      xfs: remove debug assert for per-ag reference counting
      xfs: lockless per-ag lookups
      xfs: don't use vfs writeback for pure metadata modifications
      xfs: rename xfs_buf_get_nodaddr to be more appropriate
      xfs: introduced uncached buffer read primitve
      xfs: store xfs_mount in the buftarg instead of in the xfs_buf
      xfs: kill XBF_FS_MANAGED buffers
      xfs: use unhashed buffers for size checks
      xfs: remove buftarg hash for external devices
      xfs: split inode AG walking into separate code for reclaim
      xfs: split out inode walk inode grabbing
      xfs: implement batched inode lookups for AG walking
      xfs: batch inode reclaim lookup
      xfs: serialise inode reclaim within an AG
      xfs: convert buffer cache hash to rbtree
      xfs: pack xfs_buf structure more tightly

 fs/xfs/linux-2.6/xfs_buf.c     |  200 +++++++++++---------
 fs/xfs/linux-2.6/xfs_buf.h     |   50 +++---
 fs/xfs/linux-2.6/xfs_ioctl.c   |    2 +-
 fs/xfs/linux-2.6/xfs_iops.c    |   55 ++++--
 fs/xfs/linux-2.6/xfs_super.c   |   15 +-
 fs/xfs/linux-2.6/xfs_sync.c    |  407 +++++++++++++++++++++++-----------------
 fs/xfs/linux-2.6/xfs_sync.h    |    4 +-
 fs/xfs/linux-2.6/xfs_trace.h   |    4 +-
 fs/xfs/quota/xfs_qm_syscalls.c |   14 +--
 fs/xfs/xfs_ag.h                |    9 +
 fs/xfs/xfs_attr.c              |   31 +--
 fs/xfs/xfs_buf_item.c          |    3 +-
 fs/xfs/xfs_fsops.c             |   11 +-
 fs/xfs/xfs_inode.h             |    1 +
 fs/xfs/xfs_inode_item.c        |    9 -
 fs/xfs/xfs_log.c               |    3 +-
 fs/xfs/xfs_log_cil.c           |  244 +++++++++++++-----------
 fs/xfs/xfs_log_priv.h          |   37 ++--
 fs/xfs/xfs_log_recover.c       |   19 +-
 fs/xfs/xfs_mount.c             |  152 ++++++++-------
 fs/xfs/xfs_mount.h             |    2 +
 fs/xfs/xfs_rename.c            |   12 +-
 fs/xfs/xfs_rtalloc.c           |   29 ++--
 fs/xfs/xfs_utils.c             |    4 +-
 fs/xfs/xfs_vnodeops.c          |   16 +-
 25 files changed, 730 insertions(+), 603 deletions(-)

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* [PATCH 0/18] xfs: metadata scalability V4
@ 2010-09-27  1:47 Dave Chinner
  2010-09-27  1:47 ` [PATCH 03/18] xfs: remove debug assert for per-ag reference counting Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2010-09-27  1:47 UTC
  To: xfs

xfs: Metadata scalability patchset V4

Version 4:
o removed xfs_ichgtime by open coding the only unlogged time change
  and moved xfs_trans_ichgtime() to xfs_trans_inode.c
o cleaned up trylock semantics in per-ag reclaim locking algorithm.
o made xfs_inode_ag_walk_grab() STATIC.

Version 3:
o added CIL background push fixup. While it is a correctness bug fix,
  it also significantly speeds up sustained workloads. This version
  of the patch has addressed the review comments.
o cleaned up some typos and removed useless comments around timestamp
  changes
o changed xfs_buf_get_uncached() parameters to pass the buftarg first.
o split inode walk batch lookup in two patches to separate out grabbing and
  releasing inodes from the batch lookups.

Version 2:
o dropped inode cache RCU/spinlock conversion (needs more testing)
o dropped buffer cache LRU/no page cache conversion (needs more testing)
o added CIL item insertion cleanup as suggested by Christoph.
o added flags to xfs_buf_get_uncached() and xfs_buf_read_uncached()
  to control memory allocation flags.
o cleaned up buffer page allocation failure path
o reworked inode reclaim shrinker scalability
	- separated reclaim AG walk from sync walks
	- implemented batch lookups for both sync and reclaim walks
	- added per-ag reclaim serialisation locks and traversal
	  cursors

This patchset started out as a "convert the buffer cache to rbtrees"
patch, and just grew from there as I peeled the onion from one
bottleneck to another. The second version of this patch does not go
as far as the first version - it drops the more radical changes as
they are not ready for integration yet.

The lock contention reductions allowed by the RCU inode cache
lookups are replaced by more efficient lookup mechanisms during
inode cache walking - using batching mechanisms as originally
suggested by Nick Piggin. The code is a lot more efficient than
Nick's proof of concept as it uses batched gang lookups on the radix
trees. These batched lookups show almost the same performance
improvement as the RCU lookup did but without changing the locking
algorithms at all.  This batching would be necessary for efficient
reclaim walks regardless of whether the sync walk is protected by
RCU or the current rwlock.

I dropped the no-page-cache conversion patches for the buffer cache
as well, as they need more work and testing before they are ready.

The shrinker rework improves parallel unlink performance
substantially more than just single threading the shrinker execution
and does not have the OOM problems that single threading the
shrinker had. It avoids the OOM problems by ensuring that every
shrinker call does some work or sleeps while waiting for an AG to do
some work on. The lookup optimisations done for gang lookups ensure
that the scanning is as efficient as possible, so overall shrinker
overhead has gone down significantly.

Performance numbers here are 8-way fs_mark create to 50M files, and
8-way rm -rf to remove the files created.

			wall time	fs_mark rate
2.6.36-rc4:
	create:		13m10s		65k files/s
	unlink:		23m58s		N/A

2.6.36-rc4 + v1-patchset:
	create:		 9m47s		95k files/s
	unlink:		14m16s		N/A

2.6.36-rc3 + v2-patchset:
	create:		10m32s		85k files/s
	unlink:		11m49s		N/A

2.6.36-rc4 + v3-patchset:
	create:		10m03s		90k files/s
	unlink:		11m29s		N/A

Also, the new CIL push patch has greatly improved 8-way 1 billion inode create
and unlink times, with create dropping from 4h38m to 3h41m, and 8-way unlink
dropping from 5h36m to 4h28m.
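
Working the quoted times through as percentages, both runs come out
roughly 20% shorter. The helpers below are only there to show the
arithmetic; to_minutes and pct_reduction are hypothetical names, not
part of any tool used for these measurements.

```c
#include <assert.h>

/* Convert an "XhYYm" wall time to whole minutes. */
static unsigned int to_minutes(unsigned int hours, unsigned int minutes)
{
	return hours * 60 + minutes;
}

/* Percentage by which "after" is shorter than "before". */
static double pct_reduction(unsigned int before, unsigned int after)
{
	assert(before > 0);
	return 100.0 * ((double)before - (double)after) / (double)before;
}
```

For create, pct_reduction(to_minutes(4, 38), to_minutes(3, 41)) is
57/278, about 20.5%. For unlink, pct_reduction(to_minutes(5, 36),
to_minutes(4, 28)) is 68/336, about 20.2%.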

The patches are available in the following git tree. The branch is
based on the current OSS xfs tree, and as such is based on
2.6.36-rc4. This is a rebase of the previous branch.

The following changes since commit e89318c670af3959db3aa483da509565f5a2536c:

  xfs: eliminate some newly-reported gcc warnings (2010-09-16 12:56:42 -0500)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git metadata-scale

Dave Chinner (18):
      xfs: force background CIL push under sustained load
      xfs: reduce the number of CIL lock round trips during commit
      xfs: remove debug assert for per-ag reference counting
      xfs: lockless per-ag lookups
      xfs: don't use vfs writeback for pure metadata modifications
      xfs: rename xfs_buf_get_nodaddr to be more appropriate
      xfs: introduced uncached buffer read primitve
      xfs: store xfs_mount in the buftarg instead of in the xfs_buf
      xfs: kill XBF_FS_MANAGED buffers
      xfs: use unhashed buffers for size checks
      xfs: remove buftarg hash for external devices
      xfs: split inode AG walking into separate code for reclaim
      xfs: split out inode walk inode grabbing
      xfs: implement batched inode lookups for AG walking
      xfs: batch inode reclaim lookup
      xfs: serialise inode reclaim within an AG
      xfs: convert buffer cache hash to rbtree
      xfs: pack xfs_buf structure more tightly

 fs/xfs/linux-2.6/xfs_buf.c     |  200 +++++++++++---------
 fs/xfs/linux-2.6/xfs_buf.h     |   50 +++---
 fs/xfs/linux-2.6/xfs_ioctl.c   |    2 +-
 fs/xfs/linux-2.6/xfs_iops.c    |   35 ----
 fs/xfs/linux-2.6/xfs_super.c   |   15 +-
 fs/xfs/linux-2.6/xfs_sync.c    |  413 +++++++++++++++++++++++-----------------
 fs/xfs/linux-2.6/xfs_sync.h    |    4 +-
 fs/xfs/linux-2.6/xfs_trace.h   |    4 +-
 fs/xfs/quota/xfs_qm_syscalls.c |   14 +--
 fs/xfs/xfs_ag.h                |    9 +
 fs/xfs/xfs_attr.c              |   31 +--
 fs/xfs/xfs_buf_item.c          |    3 +-
 fs/xfs/xfs_fsops.c             |   11 +-
 fs/xfs/xfs_inode.h             |    1 -
 fs/xfs/xfs_inode_item.c        |    9 -
 fs/xfs/xfs_log.c               |    3 +-
 fs/xfs/xfs_log_cil.c           |  244 +++++++++++++-----------
 fs/xfs/xfs_log_priv.h          |   37 ++--
 fs/xfs/xfs_log_recover.c       |   19 +-
 fs/xfs/xfs_mount.c             |  152 ++++++++-------
 fs/xfs/xfs_mount.h             |    2 +
 fs/xfs/xfs_rename.c            |   12 +-
 fs/xfs/xfs_rtalloc.c           |   29 ++--
 fs/xfs/xfs_trans.h             |    1 +
 fs/xfs/xfs_trans_inode.c       |   30 +++
 fs/xfs/xfs_utils.c             |    4 +-
 fs/xfs/xfs_vnodeops.c          |   23 ++-
 27 files changed, 732 insertions(+), 625 deletions(-)

* [PATCH 0/18] xfs: metadata and buffer cache scalability improvements
@ 2010-09-14 10:55 Dave Chinner
  2010-09-14 10:56 ` [PATCH 03/18] xfs: remove debug assert for per-ag reference counting Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2010-09-14 10:55 UTC
  To: xfs

This patchset has grown quite a bit - it started out as a "convert
the buffer cache to rbtrees" patch, and has gotten bigger as I
peeled the onion from one bottleneck to another.

Performance numbers here are 8-way fs_mark create to 50M files, and
8-way rm -rf to remove the files created.

			wall time	fs_mark rate
2.6.36-rc4:
	create:		13m10s		65k files/s
	unlink:		23m58s		N/A

The first set of patches are generic infrastructure changes that
address pain points the rbtree based buffer cache introduces. I've
put them first because they are simpler to review and have immediate
impact on performance. These patches address lock contention as
measured by the kernel lockstat infrastructure.

xfs: single thread inode cache shrinking.
	- prevents per-ag contention during cache shrinking

xfs: reduce the number of CIL lock round trips during commit
	- reduces lock traffic on the xc_cil_lock by two orders of
	  magnitude

xfs: remove debug assert for per-ag reference counting
xfs: lockless per-ag lookups
	- hottest lock in the system with buffer cache rbtree path
	- converted to use RCU.

xfs: convert inode cache lookups to use RCU locking
xfs: convert pag_ici_lock to a spin lock
	- addresses lookup vs reclaim contention on pag_ici_lock
	- converted to use RCU.

xfs: don't use vfs writeback for pure metadata modifications
	- inode writeback does not keep up with dirtying 100,000
	  inodes a second. Avoids the superblock dirty list where
	  possible by using the AIL as the age-order flusher.

Performance with these patches:

2.6.36-rc4 + shrinker + CIL + RCU:
	create:		11m38s		80k files/s
	unlink:		14m29s		N/A

Create rate has improved by 20%, unlink time has almost halved. On
large numbers of inodes, the unlink rate improves even more
dramatically.

The buffer cache to rbtree series currently stands at:

xfs: rename xfs_buf_get_nodaddr to be more appropriate
xfs: introduced uncached buffer read primitve
xfs: store xfs_mount in the buftarg instead of in the xfs_buf
xfs: kill XBF_FS_MANAGED buffers
xfs: use unhashed buffers for size checks
xfs: remove buftarg hash for external devices
	- preparatory buffer cache API cleanup patches

xfs: convert buffer cache hash to rbtree
	- what it says ;)
	- includes changes based on Alex's review.

xfs: pack xfs_buf structure more tightly
	- memory usage reduction, means adding the LRU list head is
	  effectively memory usage neutral.

xfs: convert xfsbufd shrinker to a per-buftarg shrinker.
xfs: add a lru to the XFS buffer cache
	- Add an LRU for reclaim

xfs: stop using the page cache to back the buffer cache
	- kill all the page cache code

2.6.36-rc4 + shrinker + CIL + RCU + rbtree:
	create:		 9m47s		95k files/s
	unlink:		14m16s		N/A

Create rate has improved by another 20%, unlink rate has improved
marginally (noise, really).
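
For readers who want a picture of what the hash to rbtree conversion
means for a lookup, here is a hedged userspace sketch. It uses a plain,
unbalanced binary search tree keyed by block number as a stand-in; a
real red-black tree also rebalances on insert, which is omitted here.
The names buf_node, buf_find, buf_insert and buf_free_all are
hypothetical and are not the identifiers used in xfs_buf.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One cached buffer, indexed by its starting block number. */
struct buf_node {
	uint64_t	blkno;
	struct buf_node	*left;
	struct buf_node	*right;
};

/* Walk down the ordered tree; returns NULL on a cache miss. */
static struct buf_node *buf_find(struct buf_node *root, uint64_t blkno)
{
	while (root) {
		if (blkno < root->blkno)
			root = root->left;
		else if (blkno > root->blkno)
			root = root->right;
		else
			return root;
	}
	return NULL;
}

/*
 * Insert blkno if it is not already cached. Returns the existing or new
 * node, or NULL if allocation fails. No rebalancing is done.
 */
static struct buf_node *buf_insert(struct buf_node **rootp, uint64_t blkno)
{
	struct buf_node **link = rootp;

	assert(rootp != NULL);
	while (*link) {
		if (blkno < (*link)->blkno)
			link = &(*link)->left;
		else if (blkno > (*link)->blkno)
			link = &(*link)->right;
		else
			return *link;	/* already cached */
	}
	*link = calloc(1, sizeof(**link));
	if (*link)
		(*link)->blkno = blkno;
	return *link;
}

/* Free every node in the tree. */
static void buf_free_all(struct buf_node *root)
{
	if (!root)
		return;
	buf_free_all(root->left);
	buf_free_all(root->right);
	free(root);
}
```

A lookup follows a single path down an ordered tree instead of scanning
a hash chain, and inserting a block that is already present hands back
the existing node.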

There are two remaining parts to the buffer cache conversions:

	1. work out how to efficiently support block size smaller
	than page size. The current code works, but uses a page per
	sub-page buffer.  A set of slab caches would be perfect for
	this use, but I'm not sure that we are allowed to use them
	for IO anymore. Christoph?

	2. Connect up the buffer type specific reclaim priority
	reference counting and convert the LRU reclaim to a cursor
	based walk that simply drops reclaim reference counts and
	frees anything that has a zero reclaim reference.
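
One way to read item 2 is sketched below. This is an interpretation
under stated assumptions, not code that exists in the series: each
buffer carries a reclaim reference count, the scan resumes from a saved
cursor, a buffer found at zero is freed, and anything else loses one
reference. The names lru_buf, buf_lru and lru_scan are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct lru_buf {
	unsigned int	reclaim_refs;	/* more references = survives more passes */
	bool		freed;
};

struct buf_lru {
	struct lru_buf	*bufs;
	size_t		nr;
	size_t		cursor;		/* where the next scan resumes */
};

/*
 * Scan nr_to_scan slots starting at the cursor. A live buffer with no
 * reclaim references left is freed; otherwise one reference is dropped.
 * Returns the number of buffers freed by this call.
 */
static size_t lru_scan(struct buf_lru *lru, size_t nr_to_scan)
{
	size_t freed = 0;

	assert(lru != NULL && (lru->nr == 0 || lru->bufs != NULL));
	if (lru->nr == 0)
		return 0;

	for (size_t i = 0; i < nr_to_scan; i++) {
		struct lru_buf *bp = &lru->bufs[lru->cursor];

		lru->cursor = (lru->cursor + 1) % lru->nr;
		if (bp->freed)
			continue;
		if (bp->reclaim_refs == 0) {
			bp->freed = true;
			freed++;
		} else {
			bp->reclaim_refs--;
		}
	}
	return freed;
}
```

Starting three buffers at 0, 2 and 1 references and scanning all three
slots each time frees exactly one buffer per pass, in the order of their
initial counts, which is how a buffer type given a higher starting count
would outlive lower priority ones.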

Overall, I can swap the order of the two patch sets, and the
incremental performance increases for create are pretty much
identical. For unlink, the benefit comes from the shrinker
modification. For those that care, the rbtree patch set in isolation
results in a time of 4h38m to create 1 billion inodes on my 8p/4GB
RAM test VM. I haven't run this test with the RCU and writeback
modifications yet.

The next step from this point is to start testing against Nick Piggin's
VFS scalability tree, as the inode_lock and dcache_lock are now the
performance limiting factors. That will, without doubt, bring new
hotspots out in XFS so I'll be starting this cycle over again soon.

Overall diffstat at this point is:

 fs/xfs/linux-2.6/kmem.h        |    1 +
 fs/xfs/linux-2.6/xfs_buf.c     |  588 ++++++++++++++--------------------------
 fs/xfs/linux-2.6/xfs_buf.h     |   61 +++--
 fs/xfs/linux-2.6/xfs_iops.c    |   18 +-
 fs/xfs/linux-2.6/xfs_super.c   |   11 +-
 fs/xfs/linux-2.6/xfs_sync.c    |   49 +++-
 fs/xfs/linux-2.6/xfs_trace.h   |    2 +-
 fs/xfs/quota/xfs_qm_syscalls.c |    4 +-
 fs/xfs/xfs_ag.h                |    9 +-
 fs/xfs/xfs_buf_item.c          |    3 +-
 fs/xfs/xfs_fsops.c             |   11 +-
 fs/xfs/xfs_iget.c              |   46 +++-
 fs/xfs/xfs_inode.c             |   22 +-
 fs/xfs/xfs_inode_item.c        |    9 -
 fs/xfs/xfs_log.c               |    3 +-
 fs/xfs/xfs_log_cil.c           |  116 +++++----
 fs/xfs/xfs_log_recover.c       |   18 +-
 fs/xfs/xfs_mount.c             |  126 ++++-----
 fs/xfs/xfs_mount.h             |    2 +
 fs/xfs/xfs_rtalloc.c           |   29 +-
 fs/xfs/xfs_vnodeops.c          |    2 +-
 21 files changed, 502 insertions(+), 628 deletions(-)

So it is improving performance, removing code and fixing
longstanding bugs all at the same time. ;)


end of thread, other threads:[~2010-09-27  1:47 UTC | newest]

Thread overview: 29+ messages
2010-09-24 12:30 [PATCH 0/18] xfs: metadata scalability V3 Dave Chinner
2010-09-24 12:30 ` [PATCH 01/18] xfs: force background CIL push under sustained load Dave Chinner
2010-09-24 12:31 ` [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit Dave Chinner
2010-09-24 12:31 ` [PATCH 03/18] xfs: remove debug assert for per-ag reference counting Dave Chinner
2010-09-24 12:31 ` [PATCH 04/18] xfs: lockless per-ag lookups Dave Chinner
2010-09-24 12:31 ` [PATCH 05/18] xfs: don't use vfs writeback for pure metadata modifications Dave Chinner
2010-09-25 23:42   ` Christoph Hellwig
2010-09-27  1:09     ` Dave Chinner
2010-09-24 12:31 ` [PATCH 06/18] xfs: rename xfs_buf_get_nodaddr to be more appropriate Dave Chinner
2010-09-24 12:31 ` [PATCH 07/18] xfs: introduced uncached buffer read primitve Dave Chinner
2010-09-24 12:31 ` [PATCH 08/18] xfs: store xfs_mount in the buftarg instead of in the xfs_buf Dave Chinner
2010-09-24 12:31 ` [PATCH 09/18] xfs: kill XBF_FS_MANAGED buffers Dave Chinner
2010-09-24 12:31 ` [PATCH 10/18] xfs: use unhashed buffers for size checks Dave Chinner
2010-09-24 12:31 ` [PATCH 11/18] xfs: remove buftarg hash for external devices Dave Chinner
2010-09-24 12:31 ` [PATCH 12/18] xfs: split inode AG walking into separate code for reclaim Dave Chinner
2010-09-24 12:31 ` [PATCH 13/18] xfs: split out inode walk inode grabbing Dave Chinner
2010-09-25 16:31   ` Christoph Hellwig
2010-09-24 12:31 ` [PATCH 14/18] xfs: implement batched inode lookups for AG walking Dave Chinner
2010-09-25 16:32   ` Christoph Hellwig
2010-09-24 12:31 ` [PATCH 15/18] xfs: batch inode reclaim lookup Dave Chinner
2010-09-24 12:31 ` [PATCH 16/18] xfs: serialise inode reclaim within an AG Dave Chinner
2010-09-25 23:49   ` Christoph Hellwig
2010-09-27  0:56     ` Dave Chinner
2010-09-24 12:31 ` [PATCH 17/18] xfs: convert buffer cache hash to rbtree Dave Chinner
2010-09-24 12:31 ` [PATCH 18/18] xfs: pack xfs_buf structure more tightly Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2010-09-27  1:47 [PATCH 0/18] xfs: metadata scalability V4 Dave Chinner
2010-09-27  1:47 ` [PATCH 03/18] xfs: remove debug assert for per-ag reference counting Dave Chinner
2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
2010-09-14 10:56 ` [PATCH 03/18] xfs: remove debug assert for per-ag reference counting Dave Chinner
2010-09-14 14:48   ` Christoph Hellwig
2010-09-14 17:22   ` Alex Elder
