* [PATCH] Avoid useless inodes and dentries reclamation
  From: Tim Chen @ 2013-08-28 21:52 UTC
  To: Alexander Viro, Jan Kara, Dave Chinner
  Cc: Dave Hansen, Andi Kleen, Matthew Wilcox, linux-fsdevel, linux-kernel

This patch detects when the counts of free inodes and dentries are very
low and skips their reclamation, so that we do not contend uselessly on
the global sb_lock under memory pressure.  Otherwise we create a log jam
of threads trying to acquire the sb_lock in prune_super(), with little
or no freed memory to show for the effort.

The profile below is from a multi-threaded large-file read exerting
memory pressure through page cache usage.  The cpu-cycles profile is
dominated by sb_lock contention.  The patch eliminates the sb_lock
contention in prune_super() almost entirely.

    43.94% usemem  [kernel.kallsyms]  [k] _raw_spin_lock
           |
           --- _raw_spin_lock
              |
              |--32.44%-- grab_super_passive
              |           prune_super
              |           shrink_slab
              |           do_try_to_free_pages
              |           try_to_free_pages
              |           __alloc_pages_nodemask
              |           alloc_pages_current
              |
              |--32.18%-- put_super
              |           drop_super
              |           prune_super
              |           shrink_slab
              |           do_try_to_free_pages
              |           try_to_free_pages
              |           __alloc_pages_nodemask
              |           alloc_pages_current

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 fs/super.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 68307c0..70fa26c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
+#define SB_CACHE_LOW 5
 static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct super_block *sb;
@@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
 		return -1;
 
+	/*
+	 * Don't prune if we have few cached objects to reclaim to
+	 * avoid useless sb_lock contention
+	 */
+	if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
+		return -1;
+
 	if (!grab_super_passive(sb))
 		return -1;

--
1.7.11.7
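
[Context: the two hot paths in the profile are the entry and exit of
prune_super()'s passive superblock reference.  The sketch below is
paraphrased from fs/super.c of roughly this era, not verbatim, and
details vary between kernel versions; it shows why every shrinker
callout touches the global sb_lock on the way in, and again via
drop_super()->put_super() on the way out:]

    /*
     * Sketch (paraphrased): take a passive reference to the superblock
     * under the global sb_lock.  This is hot-path #1 in the profile.
     */
    bool grab_super_passive(struct super_block *sb)
    {
            spin_lock(&sb_lock);                    /* global lock, hit #1 */
            if (hlist_unhashed(&sb->s_instances)) {
                    spin_unlock(&sb_lock);
                    return false;
            }
            sb->s_count++;                          /* passive ref, no s_active */
            spin_unlock(&sb_lock);

            if (down_read_trylock(&sb->s_umount)) {
                    if (sb->s_root && (sb->s_flags & MS_BORN))
                            return true;
                    up_read(&sb->s_umount);
            }

            put_super(sb);                          /* retakes sb_lock on failure */
            return false;
    }

[On success, the caller's later drop_super() calls put_super(), which
takes sb_lock again to drop s_count -- hot-path #2 in the profile.]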
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Kirill A. Shutemov @ 2013-08-28 21:19 UTC
  To: Tim Chen
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Wed, Aug 28, 2013 at 02:52:12PM -0700, Tim Chen wrote:
> This patch detects when the counts of free inodes and dentries are very
> low and skips their reclamation, so that we do not contend uselessly on
> the global sb_lock under memory pressure.  Otherwise we create a log jam
> of threads trying to acquire the sb_lock in prune_super(), with little
> or no freed memory to show for the effort.
>
> [... profile snipped ...]
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  fs/super.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/fs/super.c b/fs/super.c
> index 68307c0..70fa26c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
>   * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
>   * take a passive reference to the superblock to avoid this from occurring.
>   */
> +#define SB_CACHE_LOW 5
>  static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  {
>  	struct super_block *sb;
> @@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
>  		return -1;
>
> +	/*
> +	 * Don't prune if we have few cached objects to reclaim to
> +	 * avoid useless sb_lock contention
> +	 */
> +	if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
> +		return -1;

I don't think it's correct: you don't account fs_objects here, and
prune_icache_sb() calls invalidate_mapping_pages(), which can free a
lot of memory.  It's too naive an approach; you can easily miss a
memory hog this way.

> +
>  	if (!grab_super_passive(sb))
>  		return -1;

--
 Kirill A. Shutemov
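
[Context for Kirill's point: in this era's fs/inode.c, prune_icache_sb()
drops the page cache attached to unused inodes as it isolates them, so
even a short inode LRU can release far more memory than the raw object
count suggests.  The fragment below is paraphrased from the LRU walk in
prune_icache_sb() of that period, not verbatim:]

    /*
     * Inside prune_icache_sb()'s LRU walk (paraphrased): an unused
     * inode that still has buffers or page cache attached gets those
     * pages invalidated before the inode itself is reclaimed.
     */
    if (inode_has_buffers(inode) || inode->i_data.nrpages) {
            __iget(inode);
            spin_unlock(&inode->i_lock);
            spin_unlock(&sb->s_inode_lru_lock);
            if (remove_inode_buffers(inode))
                    reap += invalidate_mapping_pages(&inode->i_data, 0, -1);
            iput(inode);
            spin_lock(&sb->s_inode_lru_lock);
            continue;
    }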
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Tim Chen @ 2013-08-28 22:54 UTC
  To: Kirill A. Shutemov
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Thu, 2013-08-29 at 00:19 +0300, Kirill A. Shutemov wrote:
> > @@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
> >  	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
> >  		return -1;
> >
> > +	/*
> > +	 * Don't prune if we have few cached objects to reclaim to
> > +	 * avoid useless sb_lock contention
> > +	 */
> > +	if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
> > +		return -1;
>
> I don't think it's correct: you don't account fs_objects here, and
> prune_icache_sb() calls invalidate_mapping_pages(), which can free a
> lot of memory.  It's too naive an approach; you can easily miss a
> memory hog this way.

Is it safe to call sb->s_op->nr_cached_objects(sb) (assuming a non-NULL
s_op) without holding sb_lock to take a reference count on the sb?  I
think it is safe: we hold the shrinker_rwsem, so we cannot unregister
the shrinker, and the s_op and sb structures should still be there.
However, I'm not totally sure.

Tim

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
diff --git a/fs/super.c b/fs/super.c
index 68307c0..173d0d9 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
+#define SB_CACHE_LOW 5
 static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct super_block *sb;
@@ -68,6 +69,17 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
 		return -1;
 
+	total_objects = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused;
+	if (sb->s_op && sb->s_op->nr_cached_objects)
+		total_objects += sb->s_op->nr_cached_objects(sb);
+
+	/*
+	 * Don't prune if we have few cached objects to reclaim to
+	 * avoid useless sb_lock contention
+	 */
+	if (total_objects <= SB_CACHE_LOW)
+		return -1;
+
 	if (!grab_super_passive(sb))
 		return -1;
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Dave Chinner @ 2013-08-29 11:07 UTC
  To: Tim Chen
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Wed, Aug 28, 2013 at 02:52:12PM -0700, Tim Chen wrote:
> This patch detects when the counts of free inodes and dentries are very
> low and skips their reclamation, so that we do not contend uselessly on
> the global sb_lock under memory pressure.  Otherwise we create a log jam
> of threads trying to acquire the sb_lock in prune_super(), with little
> or no freed memory to show for the effort.
>
> [... profile and diff header snipped ...]
>
> @@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
>  		return -1;
>
> +	/*
> +	 * Don't prune if we have few cached objects to reclaim to
> +	 * avoid useless sb_lock contention
> +	 */
> +	if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
> +		return -1;

Those counters no longer exist in the current mmotm tree and the
shrinker infrastructure is somewhat different, so this patch isn't the
right way to solve this problem.

Given that superblock LRUs and shrinkers in mmotm are node aware, there
may even be more pressure on the sb_lock in such a workload.  I think
the right way to deal with this is to give the shrinker itself a
"minimum call count", so that we can avoid even attempting to shrink
caches that don't have enough entries in them to be worthwhile
shrinking.

That said, the memcg guys have been saying that even small numbers of
items per cache can be meaningful in terms of memory reclaim (e.g. when
there are lots of memcgs), so such a threshold might only be
appropriate for caches that are not memcg controlled.  In that case,
handling it in the shrinker infrastructure itself is a much better idea
than hacking thresholds into individual shrinker callouts.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Tim Chen @ 2013-08-29 18:07 UTC
  To: Dave Chinner
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

> > +	/*
> > +	 * Don't prune if we have few cached objects to reclaim to
> > +	 * avoid useless sb_lock contention
> > +	 */
> > +	if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
> > +		return -1;
>
> Those counters no longer exist in the current mmotm tree and the
> shrinker infrastructure is somewhat different, so this patch isn't the
> right way to solve this problem.

These changes in the mmotm tree do complicate solutions for this
scalability issue.

> Given that superblock LRUs and shrinkers in mmotm are node aware, there
> may even be more pressure on the sb_lock in such a workload.  I think
> the right way to deal with this is to give the shrinker itself a
> "minimum call count", so that we can avoid even attempting to shrink
> caches that don't have enough entries in them to be worthwhile
> shrinking.

By "minimum call count", do you mean tracking the number of free
entries per node in the shrinker, and invoking the shrinker only when
the number of free entries exceeds the "minimum call count"?

There is some cost in list_lru_count_node to get the number of free
entries, as we need to acquire the node's lru lock.  Alternatively, we
could have list_add/list_del set a per-node flag when the count goes
above/below a threshold, and invoke the shrinker based on that flag.

Or do you mean that if we do not reap any memory in a shrink operation,
we back off a certain number of subsequent shrink operations, until the
"minimum call count" is reached?

> That said, the memcg guys have been saying that even small numbers of
> items per cache can be meaningful in terms of memory reclaim (e.g. when
> there are lots of memcgs), so such a threshold might only be
> appropriate for caches that are not memcg controlled.

I've done some experiments with the cache thresholds.  Even setting the
threshold at 0 (i.e. pruning is skipped only when there are no free
entries at all) removes almost all of the needless contention.  That
should make the memcg guys happy by not holding on to any extra free
entries.

> In that case, handling it in the shrinker infrastructure itself is a
> much better idea than hacking thresholds into individual shrinker
> callouts.

Currently the problem is mostly with the sb shrinker, due to the
sb_lock.  If we can have a general solution, that will be even better.

Thanks.

Tim
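
[To make the "minimum call count" idea concrete: one possible shape for
it in the mmotm shrink_slab_node() path is sketched below.  This is
purely illustrative -- the ->min_count field and its placement are
invented for the example and are not from any actual tree:]

    /*
     * Hypothetical sketch: skip the scan callout entirely when the
     * cheap ->count_objects() result is below a per-shrinker floor.
     * ->min_count is an invented field, shown only for illustration.
     */
    static unsigned long
    shrink_slab_node(struct shrink_control *shrinkctl,
                     struct shrinker *shrinker)
    {
            unsigned long max_pass;

            max_pass = shrinker->count_objects(shrinker, shrinkctl);
            if (max_pass <= shrinker->min_count)    /* hypothetical */
                    return 0;       /* not worth the locking cost of a scan */

            /* ... existing batching and ->scan_objects() loop ... */
    }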
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Dave Hansen @ 2013-08-29 18:36 UTC
  To: Tim Chen
  Cc: Dave Chinner, Alexander Viro, Jan Kara, Dave Chinner, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

The new shrinker infrastructure in mmotm looks like it will make this
problem worse.

Old code:

	shrink_slab()
	{
		for_each_shrinker {
			do_shrinker_shrink();	// one per batch
				prune_super()
					grab_super_passive()
		}
	}

Which means we've got at _most_ one grab_super_passive() per batch.

The new code is something like this:

	shrink_slab()
	{
		list_for_each_entry(shrinker, &shrinker_list, list) {
			for_each_node_mask(... shrinkctl->nodes_to_scan) {
				shrink_slab_node()
			}
		}
	}

	shrink_slab_node()
	{
		max_pass = shrinker->count_objects(shrinker, shrinkctl);
		// ^^ does grab_super_passive()
		...
		while (total_scan >= batch_size) {
			ret = shrinker->scan_objects(shrinker, shrinkctl);
			// ^^ does grab_super_passive()
		}
	}

We've got extra grab_super_passive()s in the case where we are actually
doing a scan, plus we've got the extra for_each_node_mask() loop.  That
means even more lock acquisitions in the multi-node NUMA case, which is
exactly where we want to get rid of global lock acquisitions.
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Dave Chinner @ 2013-08-30 1:56 UTC
  To: Dave Hansen
  Cc: Tim Chen, Alexander Viro, Jan Kara, Dave Chinner, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Thu, Aug 29, 2013 at 11:36:10AM -0700, Dave Hansen wrote:
> The new shrinker infrastructure in mmotm looks like it will make this
> problem worse.
>
> Old code:
>
> 	shrink_slab()
> 	{
> 		for_each_shrinker {
> 			do_shrinker_shrink();	// one per batch
> 				prune_super()
> 					grab_super_passive()
> 		}
> 	}

I think you've simplified it down too far.  The current code does:

	for_each_shrinker {
		max_pass = do_shrinker_shrink(0);
		// ^^ does grab_super_passive()

		while (total_scan >= batch_size) {
			do_shrinker_shrink(0);
			// ^^ does grab_super_passive()
			do_shrinker_shrink(batch_size);
			// ^^ does grab_super_passive()
		}
	}

> Which means we've got at _most_ one grab_super_passive() per batch.

No, there are two: one count and one scan per batch.

> The new code is something like this:
>
> 	shrink_slab()
> 	{
> 		list_for_each_entry(shrinker, &shrinker_list, list) {
> 			for_each_node_mask(... shrinkctl->nodes_to_scan) {
> 				shrink_slab_node()
> 			}
> 		}
> 	}

Right, but what you are missing here is that the nodemask passed in to
shrink_slab() only has a single node bit set during reclaim - the bit
that matches the zone being reclaimed from.  drop_slab(), OTOH, does:

	nodes_setall(shrink.nodes_to_scan);

before calling shrink_slab() in a loop, because it's trying to free
*everything*, and that's why the shrink_slab() code handles that case.

> 	shrink_slab_node()
> 	{
> 		max_pass = shrinker->count_objects(shrinker, shrinkctl);
> 		// ^^ does grab_super_passive()
> 		...
> 		while (total_scan >= batch_size) {
> 			ret = shrinker->scan_objects(shrinker, shrinkctl);
> 			// ^^ does grab_super_passive()
> 		}
> 	}
>
> We've got extra grab_super_passive()s in the case where we are actually
> doing a scan, plus we've got the extra for_each_node_mask() loop.  That
> means even more lock acquisitions in the multi-node NUMA case, which is
> exactly where we want to get rid of global lock acquisitions.

I disagree.  With direct memory reclaim, we have an identical number of
calls to shrink_slab() occurring, and each targets a single node; hence
there is typically a 1:1 call ratio of shrink_slab() to
shrink_slab_node().  And because shrink_slab_node() has one less
callout per batch iteration, there is an overall reduction in the
number of grab_super_passive() calls from the shrinker.  The worst case
is no change; the best case is a 50% reduction in the number of calls.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Dave Chinner @ 2013-08-30 1:40 UTC
  To: Tim Chen
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Thu, Aug 29, 2013 at 11:07:56AM -0700, Tim Chen wrote:
> > Given that superblock LRUs and shrinkers in mmotm are node aware, there
> > may even be more pressure on the sb_lock in such a workload.  I think
> > the right way to deal with this is to give the shrinker itself a
> > "minimum call count", so that we can avoid even attempting to shrink
> > caches that don't have enough entries in them to be worthwhile
> > shrinking.
>
> By "minimum call count", do you mean tracking the number of free
> entries per node in the shrinker, and invoking the shrinker only when
> the number of free entries exceeds the "minimum call count"?

The new shrinker infrastructure has a ->count_objects() callout to
specifically return the number of objects in the cache.
shrink_slab_node() can check that return value against the "minimum
call count" and determine whether it needs to call ->scan_objects() at
all.

Actually, the shrinker already behaves like this with the batch_size
variable - the shrinker has to be asking for more items to be scanned
than the batch size.  That means it is the counting callouts that are
causing the problem, not the scanning callouts.

With the new code in the mmotm tree, for counting purposes we probably
don't need to grab a passive superblock reference at all - the
superblock and LRUs are guaranteed to be valid if the shrinker is
currently running, but we don't really care if the superblock is being
shut down and the values that come back are invalid, because the
->scan_objects() callout will fail to grab the superblock to do
anything with the calculated values.

Indeed, I've made no attempt to optimise the code in the mmotm tree -
I've been concerned with correctness.  The fact that, without any
optimisation, it significantly lessens contention in my testing has
been sufficient to move forward with.

> There is some cost in list_lru_count_node to get the number of free
> entries, as we need to acquire the node's lru lock.

Right, but we really don't need the node's lru lock to get the count -
reading the count is racy from an external perspective anyway, so we
can do lockless counting here.  See this patch I proposed a while back,
for example:

https://lkml.org/lkml/2013/7/31/7

> > That said, the memcg guys have been saying that even small numbers of
> > items per cache can be meaningful in terms of memory reclaim (e.g. when
> > there are lots of memcgs), so such a threshold might only be
> > appropriate for caches that are not memcg controlled.
>
> I've done some experiments with the cache thresholds.  Even setting the
> threshold at 0 (i.e. pruning is skipped only when there are no free
> entries at all) removes almost all of the needless contention.  That
> should make the memcg guys happy by not holding on to any extra free
> entries.

Ok, that's good to know, and it further indicates that it is the
->count_objects() callout that is the issue, not the scanning/freeing.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
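
[The lockless counting Dave points to amounts to reading the per-node
item count without taking the node's lru lock.  A sketch in the spirit
of the linked patch -- paraphrased, not the exact code; field names
follow the list_lru code of that period:]

    /*
     * Sketch: return the per-node LRU count without taking nlru->lock.
     * The racy read is acceptable because the count is only an
     * advisory input to reclaim heuristics, and it could change the
     * instant a lock would have been released anyway.
     */
    unsigned long list_lru_count_node(struct list_lru *lru, int nid)
    {
            struct list_lru_node *nlru = &lru->node[nid];

            return ACCESS_ONCE(nlru->nr_items);
    }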
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Tim Chen @ 2013-08-30 16:21 UTC
  To: Dave Chinner
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Fri, 2013-08-30 at 11:40 +1000, Dave Chinner wrote:
> The new shrinker infrastructure has a ->count_objects() callout to
> specifically return the number of objects in the cache.
> shrink_slab_node() can check that return value against the "minimum
> call count" and determine whether it needs to call ->scan_objects() at
> all.
>
> Actually, the shrinker already behaves like this with the batch_size
> variable - the shrinker has to be asking for more items to be scanned
> than the batch size.  That means it is the counting callouts that are
> causing the problem, not the scanning callouts.
>
> With the new code in the mmotm tree, for counting purposes we probably
> don't need to grab a passive superblock reference at all - the
> superblock and LRUs are guaranteed to be valid if the shrinker is
> currently running, but we don't really care if the superblock is being
> shut down and the values that come back are invalid, because the
> ->scan_objects() callout will fail to grab the superblock to do
> anything with the calculated values.

If that's the case, then we should remove grab_super_passive from the
super_cache_count code.  That should remove the bottleneck in
reclamation.

Thanks for your detailed explanation.

Tim

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
diff --git a/fs/super.c b/fs/super.c
index 73d0952..4df1fab 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -112,9 +112,6 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
-	if (!grab_super_passive(sb))
-		return 0;
-
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb,
 						 sc->nid);
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Dave Chinner @ 2013-08-31 9:00 UTC
  To: Tim Chen
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel

On Fri, Aug 30, 2013 at 09:21:34AM -0700, Tim Chen wrote:
> If that's the case, then we should remove grab_super_passive from the
> super_cache_count code.  That should remove the bottleneck in
> reclamation.
>
> Thanks for your detailed explanation.
>
> Tim
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> diff --git a/fs/super.c b/fs/super.c
> index 73d0952..4df1fab 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -112,9 +112,6 @@ static unsigned long super_cache_count(struct shrinker *shrink,
>  
>  	sb = container_of(shrink, struct super_block, s_shrink);
>  
> -	if (!grab_super_passive(sb))
> -		return 0;
> -

I think the function needs a comment explaining why we aren't grabbing
the sb here, otherwise people are going to read the code and ask why
it's different to the scanning callout.

>  	if (sb->s_op && sb->s_op->nr_cached_objects)
>  		total_objects = sb->s_op->nr_cached_objects(sb,
>  						 sc->nid);

But seeing this triggered further thought on my part.  Being called
during unmount means that ->nr_cached_objects() implementations need to
be robust against unmount tearing down private filesystem structures.
Right now, grab_super_passive() protects us from that, because it won't
be able to get the sb->s_umount lock while generic_shutdown_super() is
doing its work.

IOWs, the superblock-based shrinker operations are safe because the
structures don't get torn down until after the shrinker is
unregistered.  That's not true for the structures that
->nr_cached_objects() uses: ->put_super() tears them down before the
shrinker is unregistered, and only grab_super_passive() protects us
from that.

Let me have a bit more of a think about this - the solution may simply
be unregistering the shrinker before we call ->kill_sb(), so the
shrinker can't get called while we are tearing down the fs.  First,
though, I need to go back and remind myself of why I put that after
->kill_sb() in the first place.  If we unregister the shrinker before
->kill_sb() is called, then we can probably get rid of
grab_super_passive() in both shrinker callouts, because they will no
longer need to handle running concurrently with ->kill_sb()....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
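
[To illustrate the hazard Dave describes: ->nr_cached_objects() is a
super_operations hook whose implementation typically dereferences
per-filesystem state that ->put_super() frees.  The sketch below is a
hypothetical filesystem, not any in-tree implementation -- myfs_mount
and m_nr_reclaimable are invented names:]

    /*
     * Hypothetical ->nr_cached_objects() implementation.  If the
     * shrinker can still run after ->put_super() has freed s_fs_info,
     * this read becomes a use-after-free -- exactly the window that
     * grab_super_passive(), or unregistering the shrinker before
     * ->kill_sb(), is meant to close.
     */
    static long myfs_nr_cached_objects(struct super_block *sb, int nid)
    {
            struct myfs_mount *mp = sb->s_fs_info; /* freed by ->put_super() */

            return mp->m_nr_reclaimable;
    }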
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Tim Chen @ 2013-09-03 18:38 UTC
  To: Dave Chinner
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel, Kirill A. Shutemov

On Sat, 2013-08-31 at 19:00 +1000, Dave Chinner wrote:
> I think the function needs a comment explaining why we aren't grabbing
> the sb here, otherwise people are going to read the code and ask why
> it's different to the scanning callout.

Yes, those comments are needed.  I also need to remove the
corresponding drop_super(sb).  So probably something like:

---
diff --git a/fs/super.c b/fs/super.c
index 73d0952..7b5a6e5 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -112,9 +112,14 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
-	if (!grab_super_passive(sb))
-		return 0;
-
+	/*
+	 * Don't call grab_super_passive as it is a potential
+	 * scalability bottleneck. The counts could get updated
+	 * between super_cache_count and super_cache_scan anyway.
+	 * Call to super_cache_count with shrinker_rwsem held
+	 * ensures the safety of call to list_lru_count_node() and
+	 * s_op->nr_cached_objects().
+	 */
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb,
 						 sc->nid);
@@ -125,7 +130,6 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 						 sc->nid);
 
 	total_objects = vfs_pressure_ratio(total_objects);
-	drop_super(sb);
 	return total_objects;
 }

> Let me have a bit more of a think about this - the solution may simply
> be unregistering the shrinker before we call ->kill_sb(), so the
> shrinker can't get called while we are tearing down the fs.  First,
> though, I need to go back and remind myself of why I put that after
> ->kill_sb() in the first place.

Seems very reasonable.  I haven't found a case where the shrinker is
touched in ->kill_sb() yet, so it looks like unregistering the shrinker
before ->kill_sb() should be okay.

> If we unregister the shrinker before ->kill_sb() is called, then we can
> probably get rid of grab_super_passive() in both shrinker callouts,
> because they will no longer need to handle running concurrently with
> ->kill_sb()....

Thanks.

Tim
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Dave Chinner @ 2013-09-06 0:55 UTC
  To: Tim Chen
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel, Kirill A. Shutemov

On Tue, Sep 03, 2013 at 11:38:27AM -0700, Tim Chen wrote:
> Yes, those comments are needed.  I also need to remove the
> corresponding drop_super(sb).  So probably something like:
>
> [... diff header snipped ...]
>
> +	/*
> +	 * Don't call grab_super_passive as it is a potential
> +	 * scalability bottleneck. The counts could get updated
> +	 * between super_cache_count and super_cache_scan anyway.
> +	 * Call to super_cache_count with shrinker_rwsem held
> +	 * ensures the safety of call to list_lru_count_node() and
> +	 * s_op->nr_cached_objects().
> +	 */

Well, that's not true of s_op->nr_cached_objects() right now.  It's
only going to be true if the shrinker deregistration is moved before
->kill_sb()....

> > Let me have a bit more of a think about this - the solution may simply
> > be unregistering the shrinker before we call ->kill_sb(), so the
> > shrinker can't get called while we are tearing down the fs.  First,
> > though, I need to go back and remind myself of why I put that after
> > ->kill_sb() in the first place.
>
> Seems very reasonable.  I haven't found a case where the shrinker is
> touched in ->kill_sb() yet, so it looks like unregistering the shrinker
> before ->kill_sb() should be okay.

Having looked at it some more, I have to agree.  I think the original
reason for unregistering the shrinker there was to avoid problems with
locking - the shrinker callouts are run holding the shrinker_rwsem in
read mode, and then we take the sb->s_umount lock in read mode.  In the
unmount case, we currently take the sb->s_umount lock in write mode
(thereby locking out the shrinker), but we drop it before deregistering
the shrinker, and so there is no inverted locking order.

The thing is, grab_super_passive() does a try-lock on the sb->s_umount
now, and so if we are in the unmount process, it won't ever block.
That means what used to be a deadlock and the races we were avoiding by
using grab_super_passive() are now:

	shrinker				umount

	down_read(shrinker_rwsem)
						down_write(sb->s_umount)
						shrinker_unregister
						  down_write(shrinker_rwsem)
						  <blocks>
	grab_super_passive(sb)
	  down_read_trylock(sb->s_umount)
	  <fails>
	<shrinker aborts>
	....
	<shrinkers finish running>
	up_read(shrinker_rwsem)
						  <unblocks>
						  <removes shrinker>
						  up_write(shrinker_rwsem)
						->kill_sb()
						....

And so it appears to be safe to deregister the shrinker before
->kill_sb().

Can you do this as two patches?  The first moves the shrinker
deregistration to before ->kill_sb(); the second is the above patch
that drops the grab_super_passive() calls from the ->count_objects()
function.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] Avoid useless inodes and dentries reclamation
  From: Tim Chen @ 2013-09-06 18:26 UTC
  To: Dave Chinner
  Cc: Alexander Viro, Jan Kara, Dave Chinner, Dave Hansen, Andi Kleen,
      Matthew Wilcox, linux-fsdevel, linux-kernel, Kirill A. Shutemov

On Fri, 2013-09-06 at 10:55 +1000, Dave Chinner wrote:
> And so it appears to be safe to deregister the shrinker before
> ->kill_sb().
>
> Can you do this as two patches?  The first moves the shrinker
> deregistration to before ->kill_sb(); the second is the above patch
> that drops the grab_super_passive() calls from the ->count_objects()
> function.

I've sent the patches in a separate mail thread.  Please add your
sign-off if the patches look okay.

Thanks.

Tim