From: Nick Piggin <npiggin@kernel.dk>
To: Dave Chinner <david@fromorbit.com>
Cc: npiggin@kernel.dk, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [patch 31/35] fs: icache per-zone inode LRU
Date: Wed, 20 Oct 2010 13:35:56 +1100	[thread overview]
Message-ID: <20101020023556.GC3740@amd> (raw)
In-Reply-To: <20101019123852.GA12506@dastard>

[I should have cc'ed this one to linux-mm as well, so I quote your
reply in full here]

On Tue, Oct 19, 2010 at 11:38:52PM +1100, Dave Chinner wrote:
> On Tue, Oct 19, 2010 at 02:42:47PM +1100, npiggin@kernel.dk wrote:
> > Per-zone LRUs and shrinkers for inode cache.
> 
> Regardless of whether this is the right way to scale or not, I don't
> like the fact that this moves the cache LRUs into the memory
> management structures, and expands the use of MM specific structures
> throughout the code.

The zone structure really is the basic unit of memory abstraction
in the whole zoned VM concept (which covers different properties of
both physical address ranges and NUMA cost).

The zone contains structures for memory management that aren't
otherwise directly related to one another: generic page waitqueues,
page allocator structures, pagecache reclaim structures, memory
model data, and various statistics.

Structures to reclaim inodes from a particular zone belong in the
zone struct just as much as those to reclaim pagecache or anonymous
memory from that zone do. It actually fits far better here than
globally, because all our allocation, reclaim, and watermark logic
is driven per-zone.

Nor is the structure numerous -- there are only a couple of zones
per NUMA node, so the extra fields cost little.
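
To make that concrete, here is a minimal sketch of what per-zone
inode LRU state could look like when embedded in struct zone. The
field names are illustrative only, not necessarily the ones the
patch uses:

	/*
	 * Hypothetical per-zone inode reclaim state, living next to
	 * the existing per-zone page reclaim state.  Field names are
	 * illustrative.
	 */
	struct zone {
		/* ... existing allocator/reclaim/statistics fields ... */

		spinlock_t		inode_lru_lock;   /* protects inode_lru */
		struct list_head	inode_lru;        /* unused inodes, LRU order */
		unsigned long		nr_inodes_unused; /* length of inode_lru */
	};

The shrinker for a given zone then only ever walks that zone's
inode_lru, so it frees memory in exactly the zone that is short.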


> It ties the cache implementation to the current
> VM implementation. That, IMO, goes against all the principles of
> modularisation at the source code level, and it means we have to tie
> all shrinker implementations to the current internal implementation
> of the VM. I don't think that is a wise thing to do because of the
> dependencies and impedance mismatches it introduces.

It's very fundamental. We allocate memory from, and have to reclaim
memory from -- zones. Memory reclaim is driven by how the VM wants
to reclaim memory: there is nothing you can do to avoid some linkage
between the two.

Look at it this way. The dumb global shrinker is also tied to an
MM implementation detail, but that detail in fact does *not* match
the reality of the MM, and so it has all these problems interacting
with real reclaim.

What problems? OK, on an N-zone system (assuming equal zones and an
even distribution of objects around memory), if there is a shortage
in one particular zone, slabs from _all_ zones are reclaimed, so we
reclaim a factor of N too many objects. On NUMA, each object we
touch is in remote memory with probability (N-1)/N.

As the number of nodes grows beyond 2, this quickly goes downhill:
with 4 nodes, a shortage in one zone frees roughly 4x the objects
needed, and 3 out of every 4 objects touched live on remote nodes.

In summary, memory reclaim shrinkers need some knowledge of how the
MM reclaims memory -- you simply can't do a good implementation
without it. And if the zone concept ever changes, the MM gets turned
upside down and all those assumptions would need to be revisited
anyway.
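
To illustrate, the kind of interface this implies is roughly the
following (a hypothetical signature for the sake of the argument,
not necessarily the exact API from patch 03/35):

	/*
	 * Hypothetical per-zone shrinker callback: the VM names the
	 * zone under pressure and how hard it is scanning, and the
	 * shrinker frees objects from that zone only.  Returns the
	 * number of objects freed.
	 */
	struct shrinker {
		long (*shrink_zone)(struct shrinker *shrink,
				    struct zone *zone,
				    unsigned long scanned,
				    unsigned long total,
				    gfp_t gfp_mask);
		struct list_head list;	/* registered shrinkers */
	};

With something like that, a zone shortage scans only objects whose
memory actually lives in that zone, instead of a factor of N more.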


> As an example: XFS inodes to be reclaimed are simply tagged in a
> radix tree so the shrinker can reclaim inodes in optimal IO order
> rather than strict LRU order. It simply does not match a zone-based

This is another problem, similar to what we have in pagecache. In
the pagecache, we need to clean pages in optimal IO order, but we
still reclaim them according to some LRU order.

If you reclaim them in optimal IO order, cache efficiency goes down
because you sacrifice recency/frequency information. If you do IO
in reclaim order, IO efficiency goes down. The solution is to
decouple the two, as we do with writeout versus reclaim.

But anyway, that's something of an aside: inode caches are reclaimed
in LRU, IO-suboptimal order today anyway. A per-zone LRU doesn't
change that in the slightest.
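
For illustration, placing an unused inode on its zone's LRU could
look something like this (a sketch built on the hypothetical fields
above; i_lru is likewise an illustrative list head in struct inode):

	/*
	 * Sketch: an unused inode goes onto the LRU of the zone its
	 * struct inode was allocated from, so zone reclaim finds
	 * exactly the objects whose memory it would free.
	 */
	static void inode_lru_add(struct inode *inode)
	{
		struct zone *zone = page_zone(virt_to_page(inode));

		spin_lock(&zone->inode_lru_lock);
		list_add(&inode->i_lru, &zone->inode_lru);
		zone->nr_inodes_unused++;
		spin_unlock(&zone->inode_lru_lock);
	}

Note the list discipline is unchanged: inodes are still added at the
head and pruned from the tail in LRU order, just on a per-zone list.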

> shrinker implementation in any way, shape or form, nor does its
> inherent parallelism match that of the way shrinkers are called.
> 
> Any change in shrinker infrastructure needs to be able to handle
> these sorts of impedance mismatches between the VM and the cache
> subsystem. The current API doesn't handle this very well, either,
> so it's something that we need to fix so that scalability is easy
> for everyone.
> 
> Anyway, my main point is that tying the LRU and shrinker scaling to
> the implementation of the VM is a one-off solution that doesn't work
> for generic infrastructure.

No, it isn't. It worked for the pagecache, and it works for the
dcache.


> Other subsystems need the same
> large-machine scaling treatment, and there's no way we should be
> tying them all into the struct zone. It needs further abstraction.

An abstraction? Other than the zone? What do you suggest? Invent
something that the VM has no concept of and try to use that?

No. The zone is the right thing to base it on.



Thread overview: 70+ messages
2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
2010-10-19  3:42 ` [patch 01/35] bit_spinlock: add required includes npiggin
2010-10-19  3:42 ` [patch 02/35] kernel: add bl_list npiggin
2010-10-19  3:42 ` [patch 03/35] mm: implement per-zone shrinker npiggin
2010-10-19  4:49   ` KOSAKI Motohiro
2010-10-19  5:33     ` Nick Piggin
2010-10-19  5:40       ` KOSAKI Motohiro
2010-10-19  3:42 ` [patch 04/35] vfs: convert inode and dentry caches to " npiggin
2010-10-19  3:42 ` [patch 05/35] fs: icache lock s_inodes list npiggin
2010-10-19  3:42 ` [patch 06/35] fs: icache lock inode hash npiggin
2010-10-19  3:42 ` [patch 07/35] fs: icache lock i_state npiggin
2010-10-19 10:47   ` Miklos Szeredi
2010-10-19 17:06     ` Peter Zijlstra
2010-10-19  3:42 ` [patch 08/35] fs: icache lock i_count npiggin
2010-10-19 10:16   ` Boaz Harrosh
2010-10-20  2:14     ` Nick Piggin
2010-10-19  3:42 ` [patch 09/35] fs: icache lock lru/writeback lists npiggin
2010-10-19  3:42 ` [patch 10/35] fs: icache atomic inodes_stat npiggin
2010-10-19  3:42 ` [patch 11/35] fs: icache lock inode state npiggin
2010-10-19  3:42 ` [patch 12/35] fs: inode atomic last_ino, iunique lock npiggin
2010-10-19  3:42 ` [patch 13/35] fs: icache remove inode_lock npiggin
2010-10-19  3:42 ` [patch 14/35] fs: icache factor hash lock into functions npiggin
2010-10-19  3:42 ` [patch 15/35] fs: icache per-bucket inode hash locks npiggin
2010-10-19  3:42 ` [patch 16/35] fs: icache lazy inode lru npiggin
2010-10-19  3:42 ` [patch 17/35] fs: icache RCU free inodes npiggin
2010-10-19  3:42 ` [patch 18/35] fs: avoid inode RCU freeing for pseudo fs npiggin
2010-10-19  3:42 ` [patch 19/35] fs: icache remove redundant i_sb_list umount locking npiggin
2010-10-20 12:46   ` Al Viro
2010-10-20 13:03     ` Nick Piggin
2010-10-20 13:27       ` Al Viro
2010-10-19  3:42 ` [patch 20/35] fs: icache rcu walk for i_sb_list npiggin
2010-10-19  3:42 ` [patch 21/35] fs: icache per-cpu nr_inodes, non-atomic nr_unused counters npiggin
2010-10-19  3:42 ` [patch 22/35] fs: icache per-cpu last_ino allocator npiggin
2010-10-19  3:42 ` [patch 23/35] fs: icache use per-CPU lists and locks for sb inode lists npiggin
2010-10-19 15:33   ` Miklos Szeredi
2010-10-20  2:37     ` Nick Piggin
2010-10-19  3:42 ` [patch 24/35] fs: icache use RCU to avoid locking in hash lookups npiggin
2010-10-19  3:42 ` [patch 25/35] fs: icache reduce some locking overheads npiggin
2010-10-19  3:42 ` [patch 26/35] fs: icache alloc anonymous inode allocation npiggin
2010-10-19 15:50   ` Miklos Szeredi
2010-10-20  2:38     ` Nick Piggin
2010-10-19 16:33   ` Christoph Hellwig
2010-10-20  3:07     ` Nick Piggin
2010-10-19  3:42 ` [patch 27/35] fs: icache split IO and LRU lists npiggin
2010-10-19 16:12   ` Miklos Szeredi
2010-10-20  2:41     ` Nick Piggin
2010-10-19  3:42 ` [patch 28/35] fs: icache split writeback and lru locks npiggin
2010-10-19  3:42 ` [patch 29/35] fs: icache per-bdi writeback list locking npiggin
2010-10-19  3:42 ` [patch 30/35] fs: icache lazy LRU avoid LRU locking after IO operation npiggin
2010-10-19  3:42 ` [patch 31/35] fs: icache per-zone inode LRU npiggin
2010-10-19 12:38   ` Dave Chinner
2010-10-20  2:35     ` Nick Piggin [this message]
2010-10-20  3:12       ` Nick Piggin
2010-10-20  9:43         ` Dave Chinner
2010-10-20 10:02           ` Nick Piggin
2010-10-20  3:14     ` KOSAKI Motohiro
2010-10-20  3:20       ` Nick Piggin
2010-10-20  3:29         ` KOSAKI Motohiro
2010-10-20 10:19         ` Dave Chinner
2010-10-20 10:41           ` Nick Piggin
2010-10-19  3:42 ` [patch 32/35] fs: icache minimise I_FREEING latency npiggin
2010-10-19  3:42 ` [patch 33/35] fs: icache introduce inode_get/inode_get_ilock npiggin
2010-10-19 10:17   ` Boaz Harrosh
2010-10-20  2:17     ` Nick Piggin
2010-10-19  3:42 ` [patch 34/35] fs: inode rename i_count to i_refs npiggin
2010-10-19  3:42 ` [patch 35/35] fs: icache document more lock orders npiggin
2010-10-19 16:22 ` [patch 00/35] my inode scaling series for review Christoph Hellwig
2010-10-20  3:05   ` Nick Piggin
2010-10-20 13:14 ` Al Viro
2010-10-20 13:59   ` Nick Piggin
