* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Andrew Morton @ 2018-07-18  4:23 UTC
To: Bruce Merry; +Cc: linux-kernel, linux-mm

(cc linux-mm)

On Tue, 3 Jul 2018 08:43:23 +0200 Bruce Merry <bmerry@ska.ac.za> wrote:

> Hi
>
> I've run into an odd performance issue in the kernel, and not being a
> kernel dev or knowing terribly much about cgroups, am looking for
> advice on diagnosing the problem further (I discovered this while
> trying to pin down high CPU load in cadvisor).
>
> On some machines in our production system, cat
> /sys/fs/cgroup/memory/memory.stat is extremely slow (500ms on one
> machine), while on other nominally identical machines it is fast
> (2ms).
>
> One other thing I've noticed is that the affected machines generally
> have much larger values for SUnreclaim in /proc/meminfo (up to several
> GB), and slabtop reports >1GB of dentry.
>
> Before I tracked the original problem (high CPU usage in cadvisor)
> down to this, I rebooted one of the machines and the original problem
> went away, so it seems to be cleared by a reboot; I'm reluctant to
> reboot more machines to confirm since I don't have a sure-fire way to
> reproduce the problem again to debug it.
>
> The machines are running Ubuntu 16.04 with kernel 4.13.0-41-generic.
> They're running Docker, which creates a bunch of cgroups, but not an
> excessive number: there are 106 memory.stat files in
> /sys/fs/cgroup/memory.
>
> Digging a bit further, cat
> /sys/fs/cgroup/memory/system.slice/memory.stat also takes ~500ms, but
> "find /sys/fs/cgroup/memory/system.slice -mindepth 2 -name memory.stat
> | xargs cat" takes only 8ms.
>
> Any thoughts, particularly on what I should compare between the good
> and bad machines to narrow down the cause, or even better, how to
> prevent it happening?
>
> Thanks
> Bruce
> --
> Bruce Merry
> Senior Science Processing Developer
> SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Michal Hocko @ 2018-07-18 10:42 UTC
To: Bruce Merry
Cc: Andrew Morton, linux-kernel, linux-mm, Johannes Weiner, Vladimir Davydov

[CC some more people]

On Tue 17-07-18 21:23:07, Andrew Morton wrote:
> (cc linux-mm)
>
> On Tue, 3 Jul 2018 08:43:23 +0200 Bruce Merry <bmerry@ska.ac.za> wrote:
>
> > Hi
> >
> > I've run into an odd performance issue in the kernel, and not being a
> > kernel dev or knowing terribly much about cgroups, am looking for
> > advice on diagnosing the problem further (I discovered this while
> > trying to pin down high CPU load in cadvisor).
> >
> > On some machines in our production system, cat
> > /sys/fs/cgroup/memory/memory.stat is extremely slow (500ms on one
> > machine), while on other nominally identical machines it is fast
> > (2ms).

Could you try to use ftrace to see where the time is spent?
memory_stat_show should only scale with the depth of the cgroup
hierarchy for memory.stat to get cumulative numbers. All the rest should
be simply reads of gathered counters. There is no locking involved in
the current kernel. What is the kernel version you are using, btw?

Keeping the rest of the email for new people on the CC

> > One other thing I've noticed is that the affected machines generally
> > have much larger values for SUnreclaim in /proc/meminfo (up to several
> > GB), and slabtop reports >1GB of dentry.
> >
> > [...]

--
Michal Hocko
SUSE Labs
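(A minimal sketch of the kind of ftrace run Michal suggests, using the
function_graph tracer. It assumes tracefs is mounted at
/sys/kernel/debug/tracing and that memcg_stat_show has not been inlined
on the running kernel; it only times the function as a whole, it does
not show where inside it the time goes.)

  cd /sys/kernel/debug/tracing
  echo memcg_stat_show > set_graph_function   # trace only this function
  echo function_graph > current_tracer
  echo 1 > tracing_on
  cat /sys/fs/cgroup/memory/memory.stat > /dev/null
  echo 0 > tracing_on
  head -n 40 trace                            # per-call duration appears here
  echo nop > current_tracer                   # reset when done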
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-18 14:29 UTC
To: Michal Hocko
Cc: Andrew Morton, linux-kernel, linux-mm, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 12:42, Michal Hocko <mhocko@kernel.org> wrote:
> [CC some more people]
>
> On Tue 17-07-18 21:23:07, Andrew Morton wrote:
>> (cc linux-mm)
>>
>> On Tue, 3 Jul 2018 08:43:23 +0200 Bruce Merry <bmerry@ska.ac.za> wrote:
>>
>> > Hi
>> >
>> > I've run into an odd performance issue in the kernel, and not being a
>> > kernel dev or knowing terribly much about cgroups, am looking for
>> > advice on diagnosing the problem further (I discovered this while
>> > trying to pin down high CPU load in cadvisor).
>> >
>> > On some machines in our production system, cat
>> > /sys/fs/cgroup/memory/memory.stat is extremely slow (500ms on one
>> > machine), while on other nominally identical machines it is fast
>> > (2ms).
>
> Could you try to use ftrace to see where the time is spent?

Thanks for looking into this. I'm not familiar with ftrace. Can you
give me a specific command line to run? Based on "perf record cat
/sys/fs/cgroup/memory/memory.stat"/"perf report", I see the following:

  42.09%  cat  [kernel.kallsyms]  [k] memcg_stat_show
  29.19%  cat  [kernel.kallsyms]  [k] memcg_sum_events.isra.22
  12.41%  cat  [kernel.kallsyms]  [k] mem_cgroup_iter
   5.42%  cat  [kernel.kallsyms]  [k] _find_next_bit
   4.14%  cat  [kernel.kallsyms]  [k] css_next_descendant_pre
   3.44%  cat  [kernel.kallsyms]  [k] find_next_bit
   2.84%  cat  [kernel.kallsyms]  [k] mem_cgroup_node_nr_lru_pages

> memory_stat_show should only scale with the depth of the cgroup
> hierarchy for memory.stat to get cumulative numbers. All the rest should
> be simply reads of gathered counters. There is no locking involved in
> the current kernel. What is the kernel version you are using, btw?

Ubuntu 16.04 with kernel 4.13.0-41-generic (so presumably includes
some Ubuntu special sauce).

Some new information: when this occurred on another machine I ran
"echo 2 > /proc/sys/vm/drop_caches" to drop the dentry cache, and
performance immediately improved.

Unfortunately, I've not been able to deliberately reproduce the issue.
I've tried doing the following 10^7 times in a loop, and while it
inflates the dentry cache, it doesn't cause any significant slowdown:
1. Create a temporary cgroup: mkdir /sys/fs/cgroup/memory/<name>.
2. stat /sys/fs/cgroup/memory/<name>/memory.stat
3. rmdir /sys/fs/cgroup/memory/<name>

I've also tried inflating the dentry cache just by stat-ing millions of
non-existent files, and again, no slowdown. So I'm not sure exactly how
the dentry cache is related.

Regards
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Michal Hocko @ 2018-07-18 14:47 UTC
To: Bruce Merry
Cc: Andrew Morton, linux-kernel, linux-mm, Johannes Weiner, Vladimir Davydov

On Wed 18-07-18 16:29:20, Bruce Merry wrote:
> Thanks for looking into this. I'm not familiar with ftrace. Can you
> give me a specific command line to run? Based on "perf record cat
> /sys/fs/cgroup/memory/memory.stat"/"perf report", I see the following:
>
>   42.09%  cat  [kernel.kallsyms]  [k] memcg_stat_show
>   29.19%  cat  [kernel.kallsyms]  [k] memcg_sum_events.isra.22
>   12.41%  cat  [kernel.kallsyms]  [k] mem_cgroup_iter
>    5.42%  cat  [kernel.kallsyms]  [k] _find_next_bit
>    4.14%  cat  [kernel.kallsyms]  [k] css_next_descendant_pre
>    3.44%  cat  [kernel.kallsyms]  [k] find_next_bit
>    2.84%  cat  [kernel.kallsyms]  [k] mem_cgroup_node_nr_lru_pages

I would just use perf record as you did. How long did the call take?
Also, is the excessive time an outlier or a more consistent thing? If
the former, does perf record show any difference?

> > memory_stat_show should only scale with the depth of the cgroup
> > hierarchy for memory.stat to get cumulative numbers. All the rest should
> > be simply reads of gathered counters. There is no locking involved in
> > the current kernel. What is the kernel version you are using, btw?
>
> Ubuntu 16.04 with kernel 4.13.0-41-generic (so presumably includes
> some Ubuntu special sauce).

Do you see the same when running with the vanilla kernel?
--
Michal Hocko
SUSE Labs
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-18 15:27 UTC
To: Michal Hocko
Cc: Andrew Morton, linux-kernel, linux-mm, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 16:47, Michal Hocko <mhocko@kernel.org> wrote:
>> Thanks for looking into this. I'm not familiar with ftrace. Can you
>> give me a specific command line to run? Based on "perf record cat
>> /sys/fs/cgroup/memory/memory.stat"/"perf report", I see the following:
>>
>>   42.09%  cat  [kernel.kallsyms]  [k] memcg_stat_show
>>   29.19%  cat  [kernel.kallsyms]  [k] memcg_sum_events.isra.22
>>   12.41%  cat  [kernel.kallsyms]  [k] mem_cgroup_iter
>>    5.42%  cat  [kernel.kallsyms]  [k] _find_next_bit
>>    4.14%  cat  [kernel.kallsyms]  [k] css_next_descendant_pre
>>    3.44%  cat  [kernel.kallsyms]  [k] find_next_bit
>>    2.84%  cat  [kernel.kallsyms]  [k] mem_cgroup_node_nr_lru_pages
>
> I would just use perf record as you did. How long did the call take?
> Also, is the excessive time an outlier or a more consistent thing? If
> the former, does perf record show any difference?

I didn't note the exact time for that particular run, but it's pretty
consistently 372-377ms on the machine that has that perf report. The
times differ between machines showing the symptom (anywhere from
200-500ms), but are consistent (within a few ms) in back-to-back runs
on each machine.

>> Ubuntu 16.04 with kernel 4.13.0-41-generic (so presumably includes
>> some Ubuntu special sauce).
>
> Do you see the same when running with the vanilla kernel?

We don't currently have any boxes running vanilla kernels. While I
could install a test box with a vanilla kernel, I don't know how to
reproduce the problem, what piece of our production environment is
triggering it, or even why some machines are unaffected, so if the
problem didn't re-occur on the test box I wouldn't be able to conclude
anything useful.

Do you have suggestions on things I could try that might trigger this?
e.g. are there cases where a cgroup no longer shows up in the
filesystem but is still lingering while waiting for its refcount to
hit zero? Does every child cgroup contribute to the stat_show cost of
its parent, or does it have to have some non-trivial variation from its
parent?

Thanks
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Shakeel Butt @ 2018-07-18 15:33 UTC
To: bmerry
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed, Jul 18, 2018 at 8:27 AM Bruce Merry <bmerry@ska.ac.za> wrote:
>
> [...]
>
> Do you have suggestions on things I could try that might trigger this?
> e.g. are there cases where a cgroup no longer shows up in the
> filesystem but is still lingering while waiting for its refcount to
> hit zero? Does every child cgroup contribute to the stat_show cost of
> its parent, or does it have to have some non-trivial variation from its
> parent?

The memcg tree does include all zombie memcgs, and these zombies do
contribute to the memcg_stat_show cost.

Shakeel
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Shakeel Butt @ 2018-07-18 15:26 UTC
To: bmerry
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed, Jul 18, 2018 at 7:29 AM Bruce Merry <bmerry@ska.ac.za> wrote:
>
> [...]
>
>   42.09%  cat  [kernel.kallsyms]  [k] memcg_stat_show
>   29.19%  cat  [kernel.kallsyms]  [k] memcg_sum_events.isra.22
>   12.41%  cat  [kernel.kallsyms]  [k] mem_cgroup_iter
>    5.42%  cat  [kernel.kallsyms]  [k] _find_next_bit
>    4.14%  cat  [kernel.kallsyms]  [k] css_next_descendant_pre
>    3.44%  cat  [kernel.kallsyms]  [k] find_next_bit
>    2.84%  cat  [kernel.kallsyms]  [k] mem_cgroup_node_nr_lru_pages

It seems like you are using cgroup-v1. How many nodes are there in
your memcg tree, and also how many cpus does the system have?

Please note that memcg_stat_show or reading memory.stat in cgroup-v1
is not as optimized as cgroup-v2. The function memcg_stat_show() in 4.13
does ~17 tree walks, and then for ~12 of those tree walks it goes
through all cpus for each node in the memcg tree. In 4.16,
a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat
reporting") optimizes away the cpu traversal at the expense of some
accuracy. The next optimization would be to do just one memcg tree
traversal, similar to cgroup-v2's memory_stat_show().

Anyways, is it possible for you to try a 4.16 kernel?

Shakeel
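(Both questions can be answered from a shell; the sketch below assumes
the v1 memory controller is mounted at /sys/fs/cgroup/memory, and, as
discussed later in the thread, it only counts cgroups still visible in
the filesystem, not zombies.)

  find /sys/fs/cgroup/memory -name memory.stat | wc -l   # visible memcg nodes
  nproc                                                  # online CPUs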
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-18 15:37 UTC
To: Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 17:26, Shakeel Butt <shakeelb@google.com> wrote:
> It seems like you are using cgroup-v1. How many nodes are there in
> your memcg tree, and also how many cpus does the system have?

From my original email: "there are 106 memory.stat files in
/sys/fs/cgroup/memory." - is that what you mean by the number of
nodes?

The affected systems all have 8 CPU cores (hyperthreading is disabled).

> Please note that memcg_stat_show or reading memory.stat in cgroup-v1
> is not as optimized as cgroup-v2. The function memcg_stat_show() in 4.13
> does ~17 tree walks, and then for ~12 of those tree walks it goes
> through all cpus for each node in the memcg tree. In 4.16,
> a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat
> reporting") optimizes away the cpu traversal at the expense of some
> accuracy. The next optimization would be to do just one memcg tree
> traversal, similar to cgroup-v2's memory_stat_show().

On most machines it is still fast (1-2ms), and there is no difference
in the number of CPUs and only very small differences in the number of
live memory cgroups, so presumably something else is going on.

> The memcg tree does include all zombie memcgs, and these zombies do
> contribute to the memcg_stat_show cost.

That sounds promising. Is there any way to tell how many zombies there
are, and is there any way to deliberately create zombies? If I can
produce zombies that might give me a reliable way to reproduce the
problem, which could then sensibly be tested against newer kernel
versions.

Thanks
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Shakeel Butt @ 2018-07-18 15:49 UTC
To: bmerry
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed, Jul 18, 2018 at 8:37 AM Bruce Merry <bmerry@ska.ac.za> wrote:
>
> From my original email: "there are 106 memory.stat files in
> /sys/fs/cgroup/memory." - is that what you mean by the number of
> nodes?

Yes, but it seems like your system might be suffering with zombies.

> The affected systems all have 8 CPU cores (hyperthreading is disabled).
>
> [...]
>
> That sounds promising. Is there any way to tell how many zombies there
> are, and is there any way to deliberately create zombies? If I can
> produce zombies that might give me a reliable way to reproduce the
> problem, which could then sensibly be tested against newer kernel
> versions.

Yes, very easy to produce zombies, though I don't think the kernel
provides any way to tell how many zombies exist on the system.

To create a zombie, first create a memcg node, enter that memcg,
create a tmpfs file of a few KiBs, exit the memcg and rmdir the memcg.
That memcg will be a zombie until you delete that tmpfs file.

Shakeel
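(A rough shell sketch of the zombie-creation sequence Shakeel describes,
run as root. The cgroup name and tmpfs path are only placeholders, and
it assumes cgroup v1 with the memory controller mounted at
/sys/fs/cgroup/memory and a tmpfs mounted at /dev/shm.)

  mkdir /sys/fs/cgroup/memory/zombie-test                  # create a memcg node
  echo $$ > /sys/fs/cgroup/memory/zombie-test/tasks        # enter it
  dd if=/dev/zero of=/dev/shm/zombie-test bs=1K count=8    # charge a few KiB of tmpfs to it
  echo $$ > /sys/fs/cgroup/memory/tasks                    # exit back to the root memcg
  rmdir /sys/fs/cgroup/memory/zombie-test                  # the memcg is now offline...
  # ...but remains a zombie until the tmpfs file is removed:
  # rm /dev/shm/zombie-test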
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-18 17:40 UTC
To: Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 17:49, Shakeel Butt <shakeelb@google.com> wrote:
> On Wed, Jul 18, 2018 at 8:37 AM Bruce Merry <bmerry@ska.ac.za> wrote:
>> That sounds promising. Is there any way to tell how many zombies there
>> are, and is there any way to deliberately create zombies? If I can
>> produce zombies that might give me a reliable way to reproduce the
>> problem, which could then sensibly be tested against newer kernel
>> versions.
>
> Yes, very easy to produce zombies, though I don't think the kernel
> provides any way to tell how many zombies exist on the system.
>
> To create a zombie, first create a memcg node, enter that memcg,
> create a tmpfs file of a few KiBs, exit the memcg and rmdir the memcg.
> That memcg will be a zombie until you delete that tmpfs file.

Thanks, that makes sense. I'll see if I can reproduce the issue. Do
you expect the same thing to happen with normal (non-tmpfs) files that
are sitting in the page cache, and/or dentries?

Cheers
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Shakeel Butt @ 2018-07-18 17:48 UTC
To: bmerry
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed, Jul 18, 2018 at 10:40 AM Bruce Merry <bmerry@ska.ac.za> wrote:
>
> Thanks, that makes sense. I'll see if I can reproduce the issue. Do
> you expect the same thing to happen with normal (non-tmpfs) files that
> are sitting in the page cache, and/or dentries?

Normal files and their dentries can get reclaimed while tmpfs will
stick, and even if the data of tmpfs goes to swap, the kmem related to
tmpfs files will remain in memory.

Shakeel
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-18 17:58 UTC
To: Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 19:48, Shakeel Butt <shakeelb@google.com> wrote:
>> Thanks, that makes sense. I'll see if I can reproduce the issue. Do
>> you expect the same thing to happen with normal (non-tmpfs) files that
>> are sitting in the page cache, and/or dentries?
>
> Normal files and their dentries can get reclaimed while tmpfs will
> stick, and even if the data of tmpfs goes to swap, the kmem related to
> tmpfs files will remain in memory.

Sure, page cache and dentries are reclaimable given memory pressure.
These machines all have more memory than they need though (64GB+) and
generally don't come under any memory pressure. I'm just wondering if
the behaviour we're seeing can be explained as a result of a lot of
dentries sticking around (because there is no memory pressure) and in
turn causing a lot of zombie cgroups to stay present until something
forces reclamation of dentries.

Cheers
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Shakeel Butt @ 2018-07-18 18:13 UTC
To: bmerry
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed, Jul 18, 2018 at 10:58 AM Bruce Merry <bmerry@ska.ac.za> wrote:
>
> Sure, page cache and dentries are reclaimable given memory pressure.
> These machines all have more memory than they need though (64GB+) and
> generally don't come under any memory pressure. I'm just wondering if
> the behaviour we're seeing can be explained as a result of a lot of
> dentries sticking around (because there is no memory pressure) and in
> turn causing a lot of zombie cgroups to stay present until something
> forces reclamation of dentries.

Yes, if there is no memory pressure such memory can stay around.

On your production machine, before deleting memory containers, you can
try force_empty to reclaim such memory from them. See if that helps.

Shakeel
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-18 18:43 UTC
To: Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 20:13, Shakeel Butt <shakeelb@google.com> wrote:
> Yes, if there is no memory pressure such memory can stay around.
>
> On your production machine, before deleting memory containers, you can
> try force_empty to reclaim such memory from them. See if that helps.

Thanks. At the moment the cgroups are all managed by systemd and
docker, but I'll keep that in mind while experimenting.

Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-24 10:05 UTC
To: Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 19:40, Bruce Merry <bmerry@ska.ac.za> wrote:
>> Yes, very easy to produce zombies, though I don't think the kernel
>> provides any way to tell how many zombies exist on the system.
>>
>> To create a zombie, first create a memcg node, enter that memcg,
>> create a tmpfs file of a few KiBs, exit the memcg and rmdir the memcg.
>> That memcg will be a zombie until you delete that tmpfs file.
>
> Thanks, that makes sense. I'll see if I can reproduce the issue.

Hi

I've had some time to experiment with this issue, and I've now got a
way to reproduce it fairly reliably, including with a stock 4.17.8
kernel. However, it's very phase-of-the-moon stuff, and even
apparently trivial changes (like switching the order in which the
files are statted) make the issue disappear.

To reproduce:
1. Start cadvisor running. I use the 0.30.2 binary from Github, and
   run it with sudo ./cadvisor-0.30.2 --logtostderr=true
2. Run the Python 3 script below, which repeatedly creates a cgroup,
   enters it, stats some files in it, and leaves it again (and removes
   it). It takes a few minutes to run.
3. time cat /sys/fs/cgroup/memory/memory.stat. It now takes about 20ms
   for me.
4. sudo sysctl vm.drop_caches=2
5. time cat /sys/fs/cgroup/memory/memory.stat. It is back to 1-2ms.

I've also added some code to memcg_stat_show to report the number of
cgroups in the hierarchy (iterations in for_each_mem_cgroup_tree).
Running the script increases it from ~700 to ~41000. The script
iterates 250,000 times, so only some fraction of the cgroups become
zombies.

I also tried the suggestion of force_empty: it makes the problem go
away, but is also very, very slow (about 0.5s per iteration), and
given the sensitivity of the test to small changes I don't know how
meaningful that is.

Reproduction code (if you have tqdm installed you get a nice progress
bar, but not required). Hopefully Gmail doesn't do any format
mangling:

#!/usr/bin/env python3
import os

try:
    from tqdm import trange as range
except ImportError:
    pass


def clean():
    try:
        os.rmdir(name)
    except FileNotFoundError:
        pass


def move_to(cgroup):
    with open(cgroup + '/tasks', 'w') as f:
        print(pid, file=f)


pid = os.getpid()
os.chdir('/sys/fs/cgroup/memory')
name = 'dummy'
N = 250000
clean()
try:
    for i in range(N):
        os.mkdir(name)
        move_to(name)
        for filename in ['memory.stat', 'memory.swappiness']:
            os.stat(os.path.join(name, filename))
        move_to('user.slice')
        os.rmdir(name)
finally:
    move_to('user.slice')
    clean()

Regards
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Marinko Catovic @ 2018-07-24 10:50 UTC
To: linux-mm

hello guys

excuse me please for dropping in, but I can not ignore the fact that all
this sounds 99%+ the same as the issue I have been going nuts with for
the past 2 months, since I switched kernels from version 3 to 4. Please
look at the topic `Caching/buffers become useless after some time`.

What I did not mention there is that cgroups are also mounted and used,
but not actively, since I have some scripting issue with setting them up
correctly; still, there is active data in
/sys/fs/cgroup/memory/memory.stat, so it might be related to cgroups - I
did not think of that until now.

Same story here as well: echoing 2 into drop_caches solves the issue
temporarily, for maybe 2-4 days with lots of I/O.

I can however test and play around with cgroups - if one may want to
suggest to disable them I'd gladly monitor the behavior (please tell me
what and how to do it, if necessary). Also I am curious: could you
disable cgroups as well, just to see whether it helps and is actually
associated with cgroups?

My sysctl settings regarding vm are:

vm.dirty_ratio = 15
vm.dirty_background_ratio = 3
vm.vfs_cache_pressure = 1

I may tell (though not for sure) that this issue is less significant
since I lowered these values; previously I had 90/80 for dirty_ratio and
dirty_background_ratio, and I am not sure about the cache pressure any
more. Still there is lots of RAM unallocated, usually at least half,
mostly even more totally unused; these hosts have 64GB of RAM as well.

I hope this is kinda related, so we can work together on pinpointing
this. The issue is not going away for me and causes lots of headache,
slowing down my entire business.

2018-07-24 12:05 GMT+02:00 Bruce Merry <bmerry@ska.ac.za>:
> I've had some time to experiment with this issue, and I've now got a
> way to reproduce it fairly reliably, including with a stock 4.17.8
> kernel.
>
> [...]
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Michal Hocko @ 2018-07-25 12:29 UTC
To: Marinko Catovic; +Cc: linux-mm

On Tue 24-07-18 12:50:27, Marinko Catovic wrote:
> I hope this is kinda related, so we can work together on pinpointing
> this. The issue is not going away for me and causes lots of headache,
> slowing down my entire business.

I think your problem is not really related. I still didn't get to your
collected data, but this issue is more related to the laziness of the
cgroup objects teardown.
--
Michal Hocko
SUSE Labs
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Michal Hocko @ 2018-07-25 12:32 UTC
To: Bruce Merry
Cc: Shakeel Butt, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Tue 24-07-18 12:05:35, Bruce Merry wrote:
[...]
> I've also added some code to memcg_stat_show to report the number of
> cgroups in the hierarchy (iterations in for_each_mem_cgroup_tree).
> Running the script increases it from ~700 to ~41000. The script
> iterates 250,000 times, so only some fraction of the cgroups become
> zombies.

So this is definitely "too many zombies" to delay the collecting of
cumulative stats. Maybe we need to limit the number of zombies and
reclaim them more actively. I have seen Shakeel has posted something,
but it looked more on the accounting side from a quick glance.

I can see you are using cgroup v1, so your workaround would be to
memory.force_empty before you remove the group.
--
Michal Hocko
SUSE Labs
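(A sketch of that workaround for cgroup v1; <group> is a placeholder for
the cgroup about to be removed, and the write can take a while since it
forces reclaim of the group's charged pages before the rmdir.)

  echo 0 > /sys/fs/cgroup/memory/<group>/memory.force_empty
  rmdir /sys/fs/cgroup/memory/<group>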
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-26 12:35 UTC
To: Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 24 July 2018 at 12:05, Bruce Merry <bmerry@ska.ac.za> wrote:
> To reproduce:
> 1. Start cadvisor running. I use the 0.30.2 binary from Github, and
>    run it with sudo ./cadvisor-0.30.2 --logtostderr=true
> 2. Run the Python 3 script below, which repeatedly creates a cgroup,
>    enters it, stats some files in it, and leaves it again (and removes
>    it). It takes a few minutes to run.
> 3. time cat /sys/fs/cgroup/memory/memory.stat. It now takes about 20ms
>    for me.
> 4. sudo sysctl vm.drop_caches=2
> 5. time cat /sys/fs/cgroup/memory/memory.stat. It is back to 1-2ms.
>
> I've also added some code to memcg_stat_show to report the number of
> cgroups in the hierarchy (iterations in for_each_mem_cgroup_tree).
> Running the script increases it from ~700 to ~41000. The script
> iterates 250,000 times, so only some fraction of the cgroups become
> zombies.

I've discovered that I'd messed up that instrumentation code (it was
incrementing inside a loop, so it counted 5x too many cgroups), so some
of the things I said turn out to be wrong. Let me try again:

- Running the script generates about 8000 zombies (not 40000), with or
  without Shakeel's patch, for 250,000 cgroups created/destroyed - so
  possibly there is some timing condition that makes them into zombies.
  I've only measured it with 4.17, but based on timing results I have no
  particular reason to think it's wildly different on older kernels.
- After running the script 5 times (to generate 40K zombies), getting
  the stats takes 20ms with Shakeel's patch and 80ms without it (on
  4.17.9) - which is a speedup of the same order of magnitude as Shakeel
  observed with non-zombies.
- 4.17.9 already seems to be an improvement over 4.15: with 40K
  (non-zombie) cgroups, memory.stat time decreases from 200ms to 75ms.

So with 4.15 -> 4.17.9 plus Shakeel's patch, the effects are reduced
by an order of magnitude, which is good news. Of course, that doesn't
solve the fundamental issue of why the zombies get generated in the
first place. I'm not a kernel developer and I very much doubt I'll
have the time to try to debug what may turn out to be a race
condition, but let me know if I can help with testing things.

Regards
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Michal Hocko @ 2018-07-26 12:48 UTC
To: Bruce Merry
Cc: Shakeel Butt, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Thu 26-07-18 14:35:34, Bruce Merry wrote:
> I've discovered that I'd messed up that instrumentation code (it was
> incrementing inside a loop, so it counted 5x too many cgroups), so some
> of the things I said turn out to be wrong. Let me try again:
>
> [...]
>
> So with 4.15 -> 4.17.9 plus Shakeel's patch, the effects are reduced
> by an order of magnitude, which is good news. Of course, that doesn't
> solve the fundamental issue of why the zombies get generated in the
> first place. I'm not a kernel developer and I very much doubt I'll
> have the time to try to debug what may turn out to be a race
> condition, but let me know if I can help with testing things.

As already explained, this is not a race. We simply keep pages charged
to a memcg we are removing and rely on memory reclaim to free them when
we need that memory for something else. The problem you are seeing is a
side effect of this, because a large number of zombies adds up when we
need to get cumulative stats for their parent.
--
Michal Hocko
SUSE Labs
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Singh, Balbir @ 2018-07-26  0:55 UTC
To: Bruce Merry, Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 7/19/18 3:40 AM, Bruce Merry wrote:
> On 18 July 2018 at 17:49, Shakeel Butt <shakeelb@google.com> wrote:
>> Yes, very easy to produce zombies, though I don't think the kernel
>> provides any way to tell how many zombies exist on the system.
>>
>> To create a zombie, first create a memcg node, enter that memcg,
>> create a tmpfs file of a few KiBs, exit the memcg and rmdir the memcg.
>> That memcg will be a zombie until you delete that tmpfs file.
>
> Thanks, that makes sense. I'll see if I can reproduce the issue. Do
> you expect the same thing to happen with normal (non-tmpfs) files that
> are sitting in the page cache, and/or dentries?

Do you by any chance have use_hierarchy=1? memcg_stat_show should just
rely on counters inside the memory cgroup and the LRU sizes for each
node.

Balbir Singh.
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-26  6:41 UTC
To: Singh, Balbir
Cc: Shakeel Butt, Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 26 July 2018 at 02:55, Singh, Balbir <bsingharora@gmail.com> wrote:
> Do you by any chance have use_hierarchy=1? memcg_stat_show should just
> rely on counters inside the memory cgroup and the LRU sizes for each
> node.

Yes, /sys/fs/cgroup/memory/memory.use_hierarchy is 1. I assume systemd
is doing that.

Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Michal Hocko @ 2018-07-26  8:19 UTC
To: Bruce Merry
Cc: Singh, Balbir, Shakeel Butt, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Thu 26-07-18 08:41:35, Bruce Merry wrote:
> On 26 July 2018 at 02:55, Singh, Balbir <bsingharora@gmail.com> wrote:
> > Do you by any chance have use_hierarchy=1? memcg_stat_show should just
> > rely on counters inside the memory cgroup and the LRU sizes for each
> > node.
>
> Yes, /sys/fs/cgroup/memory/memory.use_hierarchy is 1. I assume systemd
> is doing that.

And this is actually good. Non-hierarchical behavior is discouraged.

The real problem is that we are keeping way too many zombie memcgs
around, waiting for memory pressure to reclaim them so they go away on
their own. As I've tried to explain in another email, force_empty before
removing the memcg should help.

Fixing this properly would require quite some heavy lifting AFAICS. We
would basically have to move zombies out of the way, which is not hard,
but we do not want to hide their current memory consumption, so we would
have to somehow move their stats to the parent. And then we are back to
reparenting, which has been removed by b2052564e66d ("mm: memcontrol:
continue cache reclaim from offlined groups").
--
Michal Hocko
SUSE Labs