* [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues @ 2019-02-19 7:13 Roman Gushchin 2019-02-20 2:47 ` Dave Chinner 0 siblings, 1 reply; 11+ messages in thread From: Roman Gushchin @ 2019-02-19 7:13 UTC (permalink / raw) To: lsf-pc@lists.linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, riel@surriel.com, dchinner@redhat.com, guroan@gmail.com, Kernel Team, hannes@cmpxchg.org

Sorry, once more, now with fsdevel@ in cc, as asked by Dave.
--

Recent reverts of memcg leak fixes [1, 2] reintroduced the problem of accumulating dying memory cgroups. This is a serious problem: on most of our machines we've seen thousands of dying cgroups, and the corresponding memory footprint was measured in hundreds of megabytes. The problem was also independently discovered by other companies.

The fixes were reverted due to an xfs regression investigated by Dave Chinner. Simultaneously we've seen a very small (0.18%) cpu regression on some hosts, which prompted Rik van Riel to propose a patch [3] aimed at fixing that regression. The idea is to accumulate small amounts of memory pressure and apply them periodically, so that we don't overscan small shrinker lists. According to Jan Kara's data [4], Rik's patch partially fixed the regression, but not entirely.

The path forward isn't entirely clear now, and the status quo isn't acceptable due to the memcg leak bug. Dave and Michal's position is to focus on the dying memory cgroup case and apply some artificial memory pressure on the corresponding slabs (probably during the cgroup deletion process). This approach can theoretically be less harmful to the subtle scanning balance, and not cause any regressions.

In my opinion, that's not necessarily true. Slab objects can be shared between cgroups, and often can't be reclaimed on cgroup removal without an impact on the rest of the system.
Applying constant artificial memory pressure precisely and only on objects accounted to dying cgroups is challenging and will likely cause quite significant overhead. Also, by "forgetting" some slab objects under light or even moderate memory pressure, we're wasting memory which could be used for something useful. Dying cgroups just make this problem more obvious because of their size.

So, using "natural" memory pressure in a way that all slab objects are scanned periodically seems to me the best solution. The devil is in the details, and how to do it without causing any regressions is an open question now.

Also, completely re-parenting slabs to the parent cgroup (not only shrinker lists) is a potential option to consider.

It will be nice to discuss the problem at LSF/MM, agree on a general path, and make a list of benchmarks which can be used to prove the solution.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9a238e83fbb0df31c3b9b67003f8f9d1d1b6c96
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=69056ee6a8a3d576ed31e38b3b14c70d6c74edcc
[3] https://lkml.org/lkml/2019/1/28/1865
[4] https://lkml.org/lkml/2019/2/8/336

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues 2019-02-19 7:13 [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues Roman Gushchin @ 2019-02-20 2:47 ` Dave Chinner 2019-02-20 5:50 ` Dave Chinner 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2019-02-20 2:47 UTC (permalink / raw) To: Roman Gushchin Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, riel@surriel.com, dchinner@redhat.com, guroan@gmail.com, Kernel Team, hannes@cmpxchg.org On Tue, Feb 19, 2019 at 07:13:33AM +0000, Roman Gushchin wrote: > Sorry, once more, now with fsdevel@ in cc, asked by Dave. > -- > > Recent reverts of memcg leak fixes [1, 2] reintroduced the problem > with accumulating of dying memory cgroups. This is a serious problem: > on most of our machines we've seen thousands on dying cgroups, and > the corresponding memory footprint was measured in hundreds of megabytes. > The problem was also independently discovered by other companies. > > The fixes were reverted due to xfs regression investigated by Dave Chinner. Context: it wasn't one regression that I investigated. We had multiple bug reports with different regressions, and I saw evidence on my own machines that something wasn't right because of the change in the IO patterns in certain benchmarks. Some of the problems were caused by the first patch, some were caused by the second patch. This also affects ext4 (i.e. it's a general problem, not an XFS problem) as has been reported a couple of times, including this one overnight: https://lore.kernel.org/lkml/4113759.4IQ3NfHFaI@stwm.de/ > Simultaneously we've seen a very small (0.18%) cpu regression on some hosts, > which caused Rik van Riel to propose a patch [3], which aimed to fix the > regression. The idea is to accumulate small memory pressure and apply it > periodically, so that we don't overscan small shrinker lists. 
> According to Jan Kara's data [4], Rik's patch partially fixed the regression,
> but not entirely.

Rik's patch was buggy and made invalid assumptions about how a cache with a small number of freeable objects is a "small cache", so any comparisons made with it are essentially worthless. More details about the problems with the patch and approach here:

https://lore.kernel.org/stable/20190131224905.GN31397@rh/

> The path forward isn't entirely clear now, and the status quo isn't acceptable
> due to the memcg leak bug. Dave and Michal's position is to focus on the dying
> memory cgroup case and apply some artificial memory pressure on the corresponding
> slabs (probably during the cgroup deletion process). This approach can
> theoretically be less harmful to the subtle scanning balance, and not cause any
> regressions.

I outlined the dying memcg problem in patch[0] of the revert series:

https://lore.kernel.org/linux-mm/20190130041707.27750-1-david@fromorbit.com/

It basically documents the solution I proposed for dying memcg cleanup:

dgc> e.g. add a garbage collector via a background workqueue that sits on
dgc> the dying memcg calling something like:
dgc>
dgc> void drop_slab_memcg(struct mem_cgroup *dying_memcg)
dgc> {
dgc> 	unsigned long freed;
dgc>
dgc> 	do {
dgc> 		struct mem_cgroup *memcg = NULL;
dgc>
dgc> 		freed = 0;
dgc> 		memcg = mem_cgroup_iter(dying_memcg, NULL, NULL);
dgc> 		do {
dgc> 			freed += shrink_slab_memcg(GFP_KERNEL, 0, memcg, 0);
dgc> 		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
dgc> 	} while (freed > 0);
dgc> }

This is a pretty trivial piece of code and doesn't require changing the core memory reclaim code at all.

> In my opinion, that's not necessarily true. Slab objects can be shared between
> cgroups, and often can't be reclaimed on cgroup removal without an impact on the
> rest of the system.
I've already pointed out that it is preferable for shared objects to stay in cache, not face expedited reclaim:

https://lore.kernel.org/linux-mm/20190131221904.GL4205@dastard/

dgc> However, the memcg reaper *doesn't need to be perfect* to solve the
dgc> "takes too long to clean up dying memcgs". Even if it leaves shared
dgc> objects behind (which we want to do!), it still trims those memcgs
dgc> down to /just the shared objects still in use/. And given that
dgc> objects shared by memcgs are in the minority (according to past
dgc> discussions about the difficulties of accounting them correctly) I
dgc> think this is just fine.
dgc>
dgc> Besides, those remaining shared objects are the ones we want to
dgc> naturally age out under memory pressure, but otherwise the memcgs
dgc> will have been shaken clean of all other objects accounted to them.
dgc> i.e. the "dying memcg" memory footprint goes down massively and the
dgc> "long term buildup" of dying memcgs basically goes away.

This all seems like pretty desirable cross-memcg working set maintenance behaviour to me...

> Applying constant artificial memory pressure precisely and only
> on objects accounted to dying cgroups is challenging and will likely
> cause quite significant overhead.

I don't know where you got that from - the above example is clearly a once-off cleanup. And executing it via a workqueue in the async memcg cleanup path (which already runs through multiple work queues to run and wait for different stages of cleanup) is not complex or challenging, nor is it likely to add additional overhead, because it means we will avoid the long term shrinker scanning overhead that cleanup currently requires.

> Also, by "forgetting" some slab objects
> under light or even moderate memory pressure, we're wasting memory which can be
> used for something useful.

Cached memory is not "forgotten" or "wasted memory". If the scan is too small and not used, it is deferred to the next shrinker invocation.
This batching behaviour is intentionally done for scan efficiency purposes. Don't take my word for it, read the discussion that went along with commit 0b1fb40a3b12 ("mm: vmscan: shrink all slab objects if tight on memory"):

https://lore.kernel.org/lkml/20140115012541.ad302526.akpm@linux-foundation.org/

From Andrew:

akpm> Actually, the intent of batching is to limit the number of calls to
akpm> ->scan(). At least, that was the intent when I wrote it! This is a
akpm> good principle and we should keep doing it. If we're going to send the
akpm> CPU away to tread on a pile of cold cachelines, we should make sure
akpm> that it does a good amount of work while it's there.

IOWs, the "small scan" proposals defeat existing shrinker efficiency optimisations. This change in behaviour is where the CPU usage regressions in "small cache" scanning come from. As Andrew said: scan batching is a good principle and we should keep doing it.

> Dying cgroups are just making this problem more
> obvious because of their size.

Dying cgroups see this as a problem only because they have extremely poor life cycle management. Expediting dying memcg cache cleanup is the way to fix this, and that does not need us to change global memory reclaim behaviour.

> So, using "natural" memory pressure in a way that all slab objects are scanned
> periodically seems to me the best solution. The devil is in the details, and how
> to do it without causing any regressions is an open question now.
>
> Also, completely re-parenting slabs to the parent cgroup (not only shrinker lists)
> is a potential option to consider.

That should be done once the memcg gc thread has shrunk the caches down to just the shared objects (which we want to keep in cache!) that reference the dying memcg. That will get rid of all the remaining references and allow the memcg to be reclaimed completely.
> It will be nice to discuss the problem at LSF/MM, agree on a general path and
> make a list of benchmarks which can be used to prove the solution.

In reality, it comes down to this - should we:

a) add a small amount of code into the subsystem to perform expedited reaping of subsystem-owned objects, and test against the known, specific reproducing workload; or

b) change global memory reclaim algorithms in a way that affects every Linux machine and workload in some way, resulting in us having to revalidate and rebalance memory reclaim for a large number of common workloads across all filesystems and subsystems that use shrinkers, on a wide range of different storage hardware and on both headless and desktop machines.

And when we look at it this way, if we end up with option b) as the preferred solution then we've well and truly jumped the shark. The validation effort required for option b) is way out of proportion with the small niche of machines and environments affected by the dying memcg problem, and the risk of regressions for users outside these memcg-heavy environments is extremely high (as has already been proven).

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues 2019-02-20 2:47 ` Dave Chinner @ 2019-02-20 5:50 ` Dave Chinner 2019-02-20 7:27 ` Dave Chinner 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2019-02-20 5:50 UTC (permalink / raw) To: Roman Gushchin Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, riel@surriel.com, dchinner@redhat.com, guroan@gmail.com, Kernel Team, hannes@cmpxchg.org On Wed, Feb 20, 2019 at 01:47:23PM +1100, Dave Chinner wrote: > On Tue, Feb 19, 2019 at 07:13:33AM +0000, Roman Gushchin wrote: > > Sorry, once more, now with fsdevel@ in cc, asked by Dave. > > -- > > > > Recent reverts of memcg leak fixes [1, 2] reintroduced the problem > > with accumulating of dying memory cgroups. This is a serious problem: > > on most of our machines we've seen thousands on dying cgroups, and > > the corresponding memory footprint was measured in hundreds of megabytes. > > The problem was also independently discovered by other companies. > > > > The fixes were reverted due to xfs regression investigated by Dave Chinner. > > Context: it wasn't one regression that I investigated. We had > multiple bug reports with different regressions, and I saw evidence > on my own machines that something wasn't right because of the change > in the IO patterns in certain benchmarks. Some of the problems were > caused by the first patch, some were caused by the second patch. > > This also affects ext4 (i.e. it's a general problem, not an XFS > problem) as has been reported a couple of times, including this one > overnight: > > https://lore.kernel.org/lkml/4113759.4IQ3NfHFaI@stwm.de/ > > > Simultaneously we've seen a very small (0.18%) cpu regression on some hosts, > > which caused Rik van Riel to propose a patch [3], which aimed to fix the > > regression. The idea is to accumulate small memory pressure and apply it > > periodically, so that we don't overscan small shrinker lists. 
> > According to Jan Kara's data [4], Rik's patch partially fixed the regression,
> > but not entirely.
>
> Rik's patch was buggy and made invalid assumptions about how a
> cache with a small number of freeable objects is a "small cache", so
> any comparisons made with it are essentially worthless.
>
> More details about the problems with the patch and approach here:
>
> https://lore.kernel.org/stable/20190131224905.GN31397@rh/

So, long story short, the dying memcg problem is actually a regression caused by previous shrinker changes; the change in 4.18-rc1 was an attempt to fix that regression (which caused even more widespread problems), and Rik's patch is yet another attempt to fix the original regression. The original regression broke the small scan accumulation algorithm in the shrinker, but I don't think anyone actually understood how this was supposed to work, and so the attempts to fix the regression haven't actually restored the original behaviour.

The problematic commit is 9092c71bb724 ("mm: use sc->priority for slab shrink targets"), which was included in 4.16-rc1. This changed the delta calculation so that any cache with fewer than 4096 freeable objects would now end up with a zero delta count. This means caches with few freeable objects had no scan pressure at all, and nothing would get accumulated for later scanning. Prior to this change, such scans would result in single digit scan counts, which would get deferred and accumulated until the overall delta + deferred count went over the batch size and it would scan the cache. IOWs, the above commit prevented accumulation of light pressure on caches, and they'd only get scanned when extreme memory pressure occurred.

The fix that went into 4.18-rc1 changed this to make the minimum scan pressure the batch size, so instead of 0 pressure, it went to having extreme pressure on small caches.
What wasn't used in a scan got deferred, and so the shrinker would wind up and keep heavy pressure on the cache even when there was only light memory pressure. IOWs, instead of having a scan count in the single digits under light memory pressure, those caches now had continual scan counts 1-2 orders of magnitude larger, i.e. way more aggressive than in 4.15 and older kernels. Hence it introduced a different, more severe set of regressions than the one it was trying to fix.

IOWs, the dying memcg issue is irrelevant here. The real problem that needs fixing is a shrinker regression that occurred in 4.16-rc1, not 4.18-rc1.

I'm just going to fix the original regression in the shrinker algorithm by restoring the gradual accumulation behaviour, and this whole series of problems can be put to bed.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues 2019-02-20 5:50 ` Dave Chinner @ 2019-02-20 7:27 ` Dave Chinner 2019-02-20 16:20 ` Johannes Weiner 2019-02-21 22:46 ` Roman Gushchin 0 siblings, 2 replies; 11+ messages in thread From: Dave Chinner @ 2019-02-20 7:27 UTC (permalink / raw) To: Roman Gushchin Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, riel@surriel.com, dchinner@redhat.com, guroan@gmail.com, Kernel Team, hannes@cmpxchg.org

On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> I'm just going to fix the original regression in the shrinker
> algorithm by restoring the gradual accumulation behaviour, and this
> whole series of problems can be put to bed.

Something like the lightly smoke tested patch below. It may be slightly more aggressive than the original code for really small freeable values (i.e. < 100), but otherwise should be roughly equivalent to historic accumulation behaviour.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

mm: fix shrinker scan accumulation regression

From: Dave Chinner <dchinner@redhat.com>

Commit 9092c71bb724 ("mm: use sc->priority for slab shrink targets") in 4.16-rc1 broke the shrinker scan accumulation algorithm for small freeable caches. This is active when there isn't enough work to run a full batch scan - the shrinker is supposed to defer that work until a future shrinker call. That work is then fed back into the work to do on the next call, and if the work is larger than a batch it will run the scan. This is an efficiency mechanism that prevents repeated small scans of caches from consuming too much CPU. It also has the effect of ensuring that caches with small numbers of freeable objects are slowly scanned. While an individual shrinker scan may not result in work to do, if the cache is queried enough times then the work will accumulate and the cache will be scanned and freed.
This protects small and otherwise in-use caches from excessive scanning under light memory pressure, but keeps cross-cache reclaim amounts fairly balanced over time.

The change in the above commit broke all this with the way it calculates the delta value. Instead of it being calculated to keep the freeable:scanned shrinker count in the same ratio as the previous page cache freeable:scanned pass, it calculates the delta from the reclaim priority based on a logarithmic scale and applies this to the freeable count before anything else is done. This means that the resolution of the delta calculation is (1 << priority), and so for low priority reclaim the calculated delta does not go above zero unless there are at least 4096 freeable objects. This completely defeats the accumulation of work for caches with few freeable objects.

Old code (ignoring seeks scaling):

	delta ~= (pages_scanned * freeable) / pages_freeable

	Accumulation resolution: pages_scanned / pages_freeable

4.16 code:

	delta ~= freeable >> priority

	Accumulation resolution: (1 << priority)

IOWs, the old code would almost always result in delta being non-zero when freeable was non-zero, and hence it would always accumulate scan work even on the smallest of freeable caches, regardless of the reclaim pressure being applied. The new code won't accumulate or scan the smallest of freeable caches until it reaches priority 1. This is extreme memory pressure, just before the OOM killer is kicked.

We want to retain the priority mechanism to scale the work the shrinker does, but we also want to ensure it accumulates appropriately, too. In this case, offset the delta by ilog2(freeable) so that there is a slow accumulation of work. Use this regardless of the delta calculated so that we don't decrease the amount of work as the priority increases past the point where delta is non-zero.
New code:

	delta ~= ilog2(freeable) + (freeable >> priority)

	Accumulation resolution: ilog2(freeable)

Typical delta calculations from the different code versions (ignoring seek scaling), keeping in mind that batch size is 128 by default and 1024 for superblock shrinkers:

freeable = 1

ratio    4.15  priority  4.16  4.18             new
1:100       1        12     0  batch              1
1:32        1         9     0  batch              1
1:12        1         6     0  batch              1
1:6         1         3     0  batch              1
1:1         1         1     1  batch              1

freeable = 10

ratio    4.15  priority  4.16  4.18             new
1:100       1        12     0  batch              3
1:32        1         9     0  batch              3
1:12        1         6     0  batch              3
1:6         2         3     0  batch              3
1:1        10         1    10  batch             10

freeable = 100

ratio    4.15  priority  4.16  4.18             new
1:100       1        12     0  batch              6
1:32        3         9     0  batch              6
1:12        6         6     1  batch              7
1:6        16         3    12  batch             18
1:1       100         1   100  batch            100

freeable = 1000

ratio    4.15  priority  4.16  4.18             new
1:100      10        12     0  batch              9
1:32       32         9     1  batch             10
1:12       60         6    16  batch             26
1:6       160         3   120  batch            130
1:1      1000         1  1000  max(1000,batch) 1000

freeable = 10000

ratio    4.15  priority  4.16  4.18             new
1:100     100        12     2  batch             16
1:32      320         9    19  batch             35
1:12      600         6   160  max(160,batch)  175
1:6      1600         3  1250  1250            1265
1:1     10000         1 10000  10000          10000

It's pretty clear why the 4.18 algorithm caused such a problem - it massively changed the balance of reclaim, when all that was actually required was a small tweak to always accumulate a small delta for caches with very small freeable counts.

Fixes: 9092c71bb724 ("mm: use sc->priority for slab shrink targets")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/vmscan.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e979705bbf32..9cc58e9f1f54 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -479,7 +479,16 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 
 	total_scan = nr;
 	if (shrinker->seeks) {
-		delta = freeable >> priority;
+		/*
+		 * Use a small non-zero offset for delta so that if the scan
+		 * priority is low we always accumulate some pressure on caches
+		 * that have few freeable objects in them. This allows light
+		 * memory pressure to turn over caches with few freeable objects
+		 * slowly without the need for memory pressure priority to wind
+		 * up to the point where (freeable >> priority) is non-zero.
+		 */
+		delta = ilog2(freeable);
+		delta += freeable >> priority;
 		delta *= 4;
 		do_div(delta, shrinker->seeks);
 	} else {

^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues 2019-02-20 7:27 ` Dave Chinner @ 2019-02-20 16:20 ` Johannes Weiner 2019-02-21 22:46 ` Roman Gushchin 1 sibling, 0 replies; 11+ messages in thread From: Johannes Weiner @ 2019-02-20 16:20 UTC (permalink / raw) To: Dave Chinner Cc: Roman Gushchin, lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, riel@surriel.com, dchinner@redhat.com, guroan@gmail.com, Kernel Team

On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote:
> freeable = 1
>
> ratio    4.15  priority  4.16  4.18   new
> 1:100       1        12     0  batch    1
> 1:32        1         9     0  batch    1
> 1:12        1         6     0  batch    1
> 1:6         1         3     0  batch    1
> 1:1         1         1     1  batch    1

> @@ -479,7 +479,16 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>
>  	total_scan = nr;
>  	if (shrinker->seeks) {
> -		delta = freeable >> priority;
> +		/*
> +		 * Use a small non-zero offset for delta so that if the scan
> +		 * priority is low we always accumulate some pressure on caches
> +		 * that have few freeable objects in them. This allows light
> +		 * memory pressure to turn over caches with few freeable objects
> +		 * slowly without the need for memory pressure priority to wind
> +		 * up to the point where (freeable >> priority) is non-zero.
> +		 */
> +		delta = ilog2(freeable);

The idea makes sense to me, but ilog2() fails us when freeable is 1. fls() should work, though.

> +		delta += freeable >> priority;
> 		delta *= 4;
> 		do_div(delta, shrinker->seeks);
> 	} else {

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues 2019-02-20 7:27 ` Dave Chinner 2019-02-20 16:20 ` Johannes Weiner @ 2019-02-21 22:46 ` Roman Gushchin 2019-02-22 1:48 ` Rik van Riel 2019-02-28 20:30 ` Roman Gushchin 1 sibling, 2 replies; 11+ messages in thread From: Roman Gushchin @ 2019-02-21 22:46 UTC (permalink / raw) To: Dave Chinner Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, riel@surriel.com, dchinner@redhat.com, guroan@gmail.com, Kernel Team, hannes@cmpxchg.org On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote: > On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote: > > I'm just going to fix the original regression in the shrinker > > algorithm by restoring the gradual accumulation behaviour, and this > > whole series of problems can be put to bed. > > Something like this lightly smoke tested patch below. It may be > slightly more agressive than the original code for really small > freeable values (i.e. < 100) but otherwise should be roughly > equivalent to historic accumulation behaviour. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > mm: fix shrinker scan accumulation regression > > From: Dave Chinner <dchinner@redhat.com> JFYI: I'm testing this patch in our environment for fixing the memcg memory leak. It will take a couple of days to get reliable results. Thanks! ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues 2019-02-21 22:46 ` Roman Gushchin @ 2019-02-22 1:48 ` Rik van Riel 2019-02-22 1:57 ` Roman Gushchin 2019-02-28 20:30 ` Roman Gushchin 1 sibling, 1 reply; 11+ messages in thread From: Rik van Riel @ 2019-02-22 1:48 UTC (permalink / raw) To: Roman Gushchin, Dave Chinner Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, dchinner@redhat.com, guroan@gmail.com, Kernel Team, hannes@cmpxchg.org

On Thu, 2019-02-21 at 17:46 -0500, Roman Gushchin wrote:
> On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote:
> > On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> > > I'm just going to fix the original regression in the shrinker
> > > algorithm by restoring the gradual accumulation behaviour, and
> > > this
> > > whole series of problems can be put to bed.
> >
> > Something like this lightly smoke tested patch below. It may be
> > slightly more aggressive than the original code for really small
> > freeable values (i.e. < 100) but otherwise should be roughly
> > equivalent to historic accumulation behaviour.
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> >
> > mm: fix shrinker scan accumulation regression
> >
> > From: Dave Chinner <dchinner@redhat.com>
>
> JFYI: I'm testing this patch in our environment for fixing
> the memcg memory leak.
>
> It will take a couple of days to get reliable results.

Just to clarify, is this test with fls instead of ilog2, so the last item in a slab cache can get reclaimed as well?

--
All Rights Reversed.

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues 2019-02-22 1:48 ` Rik van Riel @ 2019-02-22 1:57 ` Roman Gushchin 0 siblings, 0 replies; 11+ messages in thread From: Roman Gushchin @ 2019-02-22 1:57 UTC (permalink / raw) To: Rik van Riel Cc: Dave Chinner, lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, dchinner@redhat.com, guroan@gmail.com, Kernel Team, hannes@cmpxchg.org

On Thu, Feb 21, 2019 at 08:48:27PM -0500, Rik van Riel wrote:
> On Thu, 2019-02-21 at 17:46 -0500, Roman Gushchin wrote:
> > On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote:
> > > On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> > > > I'm just going to fix the original regression in the shrinker
> > > > algorithm by restoring the gradual accumulation behaviour, and
> > > > this
> > > > whole series of problems can be put to bed.
> > >
> > > Something like this lightly smoke tested patch below. It may be
> > > slightly more aggressive than the original code for really small
> > > freeable values (i.e. < 100) but otherwise should be roughly
> > > equivalent to historic accumulation behaviour.
> > >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > david@fromorbit.com
> > >
> > > mm: fix shrinker scan accumulation regression
> > >
> > > From: Dave Chinner <dchinner@redhat.com>
> >
> > JFYI: I'm testing this patch in our environment for fixing
> > the memcg memory leak.
> >
> > It will take a couple of days to get reliable results.
>
> Just to clarify, is this test with fls instead of ilog2,
> so the last item in a slab cache can get reclaimed as
> well?

I'm testing both versions. Thanks!

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues 2019-02-21 22:46 ` Roman Gushchin 2019-02-22 1:48 ` Rik van Riel @ 2019-02-28 20:30 ` Roman Gushchin 2019-02-28 21:30 ` Dave Chinner 1 sibling, 1 reply; 11+ messages in thread From: Roman Gushchin @ 2019-02-28 20:30 UTC (permalink / raw) To: Dave Chinner Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, riel@surriel.com, dchinner@redhat.com, guroan@gmail.com, Kernel Team, hannes@cmpxchg.org

On Thu, Feb 21, 2019 at 02:46:17PM -0800, Roman Gushchin wrote:
> On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote:
> > On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> > > I'm just going to fix the original regression in the shrinker
> > > algorithm by restoring the gradual accumulation behaviour, and this
> > > whole series of problems can be put to bed.
> >
> > Something like this lightly smoke tested patch below. It may be
> > slightly more aggressive than the original code for really small
> > freeable values (i.e. < 100) but otherwise should be roughly
> > equivalent to historic accumulation behaviour.
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> >
> > mm: fix shrinker scan accumulation regression
> >
> > From: Dave Chinner <dchinner@redhat.com>
>
> JFYI: I'm testing this patch in our environment for fixing
> the memcg memory leak.
>
> It will take a couple of days to get reliable results.

So unfortunately the proposed patch does not solve the dying memcg reclaim issue. I've tested it as is, with s/ilog2()/fls()/ as suggested by Johannes, and also with a more aggressive zero-seek slab reclaim (always scanning at least SHRINK_BATCH for zero-seek shrinkers). In all cases the number of outstanding memory cgroups grew almost linearly with time and didn't show any signs of plateauing.

Thanks!

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-28 20:30 ` Roman Gushchin
@ 2019-02-28 21:30 ` Dave Chinner
  2019-02-28 22:29 ` Roman Gushchin
  0 siblings, 1 reply; 11+ messages in thread

From: Dave Chinner @ 2019-02-28 21:30 UTC (permalink / raw)
To: Roman Gushchin
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, riel@surriel.com, dchinner@redhat.com,
    guroan@gmail.com, Kernel Team, hannes@cmpxchg.org

On Thu, Feb 28, 2019 at 08:30:49PM +0000, Roman Gushchin wrote:
> On Thu, Feb 21, 2019 at 02:46:17PM -0800, Roman Gushchin wrote:
> > On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote:
> > > On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> > > > I'm just going to fix the original regression in the shrinker
> > > > algorithm by restoring the gradual accumulation behaviour, and this
> > > > whole series of problems can be put to bed.
> > >
> > > Something like this lightly smoke-tested patch below. It may be
> > > slightly more aggressive than the original code for really small
> > > freeable values (i.e. < 100) but otherwise should be roughly
> > > equivalent to the historic accumulation behaviour.
> > >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > david@fromorbit.com
> > >
> > > mm: fix shrinker scan accumulation regression
> > >
> > > From: Dave Chinner <dchinner@redhat.com>
> >
> > JFYI: I'm testing this patch in our environment for fixing
> > the memcg memory leak.
> >
> > It will take a couple of days to get reliable results.
>
> So unfortunately the proposed patch is not solving the dying memcg reclaim
> issue. I've tested it as is, with s/ilog2()/fls()/ as suggested by Johannes,
> and also with a more aggressive zero-seek slab reclaim (always scanning
> at least SHRINK_BATCH for zero-seek shrinkers).

Which makes sense if it's inodes and/or dentries shared across
multiple memcgs and actively referenced by non-owner memcgs that
prevent dying memcg reclaim. i.e. the shrinkers will not reclaim
frequently referenced objects unless there is extreme memory
pressure put on them.

> In all cases the number
> of outstanding memory cgroups grew almost linearly with time and didn't show
> any signs of plateauing.

What happened to the amount of memory pinned by those dying memcgs?
Did that change in any way? Did the rate of reclaim of objects
referencing dying memcgs improve? What type of objects are still
pinning those dying memcgs? Did you run any traces to see how big
those pinned caches were and how much deferral and scanning work was
actually being done on them?

i.e. if all you measured is the number of memcgs over time, then we
don't have any information that tells us whether this patch has had
any effect on the reclaimable memory footprint of those dying memcgs
or what is actually pinning them in memory.

IOWs, we need to know if this patch reduces the dying memcg
references down to just the objects that non-owner memcgs are
keeping active in cache and hence preventing the dying memcgs from
being freed. If this patch does that, then the shrinkers are doing
exactly what they should be doing, and the remaining problem to
solve is reparenting actively referenced objects pinning the dying
memcgs...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-28 21:30 ` Dave Chinner
@ 2019-02-28 22:29 ` Roman Gushchin
  0 siblings, 0 replies; 11+ messages in thread

From: Roman Gushchin @ 2019-02-28 22:29 UTC (permalink / raw)
To: Dave Chinner
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, riel@surriel.com, dchinner@redhat.com,
    guroan@gmail.com, Kernel Team, hannes@cmpxchg.org

On Fri, Mar 01, 2019 at 08:30:32AM +1100, Dave Chinner wrote:
> On Thu, Feb 28, 2019 at 08:30:49PM +0000, Roman Gushchin wrote:
> > On Thu, Feb 21, 2019 at 02:46:17PM -0800, Roman Gushchin wrote:
> > > On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> > > > > I'm just going to fix the original regression in the shrinker
> > > > > algorithm by restoring the gradual accumulation behaviour, and this
> > > > > whole series of problems can be put to bed.
> > > >
> > > > Something like this lightly smoke-tested patch below. It may be
> > > > slightly more aggressive than the original code for really small
> > > > freeable values (i.e. < 100) but otherwise should be roughly
> > > > equivalent to the historic accumulation behaviour.
> > > >
> > > > Cheers,
> > > >
> > > > Dave.
> > > > --
> > > > Dave Chinner
> > > > david@fromorbit.com
> > > >
> > > > mm: fix shrinker scan accumulation regression
> > > >
> > > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > JFYI: I'm testing this patch in our environment for fixing
> > > the memcg memory leak.
> > >
> > > It will take a couple of days to get reliable results.
> >
> > So unfortunately the proposed patch is not solving the dying memcg reclaim
> > issue. I've tested it as is, with s/ilog2()/fls()/ as suggested by Johannes,
> > and also with a more aggressive zero-seek slab reclaim (always scanning
> > at least SHRINK_BATCH for zero-seek shrinkers).
>
> Which makes sense if it's inodes and/or dentries shared across
> multiple memcgs and actively referenced by non-owner memcgs that
> prevent dying memcg reclaim. i.e. the shrinkers will not reclaim
> frequently referenced objects unless there is extreme memory
> pressure put on them.
>
> > In all cases the number
> > of outstanding memory cgroups grew almost linearly with time and didn't show
> > any signs of plateauing.
>
> What happened to the amount of memory pinned by those dying memcgs?
> Did that change in any way? Did the rate of reclaim of objects
> referencing dying memcgs improve? What type of objects are still
> pinning those dying memcgs? Did you run any traces to see how big
> those pinned caches were and how much deferral and scanning work was
> actually being done on them?

The amount of pinned memory is approximately proportional to the
number of dying cgroups; in other words, it also grows almost
linearly. The rate of reclaim is better than without any patches,
and it's approximately on par with the version with Rik's patches.

> i.e. if all you measured is the number of memcgs over time, then we
> don't have any information that tells us whether this patch has had
> any effect on the reclaimable memory footprint of those dying memcgs
> or what is actually pinning them in memory.

I'm not saying that the patch is bad; I'm saying it's not sufficient
in our environment.

> IOWs, we need to know if this patch reduces the dying memcg
> references down to just the objects that non-owner memcgs are
> keeping active in cache and hence preventing the dying memcgs from
> being freed. If this patch does that, then the shrinkers are doing
> exactly what they should be doing, and the remaining problem to
> solve is reparenting actively referenced objects pinning the dying
> memcgs...

Yes, I agree. I'll take a look. Thanks!

^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2019-02-28 22:30 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed;
links below jump to the message on this page)
2019-02-19  7:13 [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues Roman Gushchin
2019-02-20  2:47 ` Dave Chinner
2019-02-20  5:50 ` Dave Chinner
2019-02-20  7:27 ` Dave Chinner
2019-02-20 16:20 ` Johannes Weiner
2019-02-21 22:46 ` Roman Gushchin
2019-02-22  1:48 ` Rik van Riel
2019-02-22  1:57 ` Roman Gushchin
2019-02-28 20:30 ` Roman Gushchin
2019-02-28 21:30 ` Dave Chinner
2019-02-28 22:29 ` Roman Gushchin