* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Andrew Morton @ 2018-07-18  4:23 UTC
To: Bruce Merry; +Cc: linux-kernel, linux-mm

(cc linux-mm)

On Tue, 3 Jul 2018 08:43:23 +0200 Bruce Merry <bmerry@ska.ac.za> wrote:

> Hi
>
> I've run into an odd performance issue in the kernel, and not being a
> kernel dev or knowing terribly much about cgroups, am looking for
> advice on diagnosing the problem further (I discovered this while
> trying to pin down high CPU load in cadvisor).
>
> On some machines in our production system, cat
> /sys/fs/cgroup/memory/memory.stat is extremely slow (500ms on one
> machine), while on other nominally identical machines it is fast
> (2ms).
>
> One other thing I've noticed is that the affected machines generally
> have much larger values for SUnreclaim in /proc/meminfo (up to several
> GB), and slabtop reports >1GB of dentry.
>
> Before I tracked the original problem (high CPU usage in cadvisor)
> down to this, I rebooted one of the machines and the original problem
> went away, so it seems to be cleared by a reboot; I'm reluctant to
> reboot more machines to confirm since I don't have a sure-fire way to
> reproduce the problem again to debug it.
>
> The machines are running Ubuntu 16.04 with kernel 4.13.0-41-generic.
> They're running Docker, which creates a bunch of cgroups, but not an
> excessive number: there are 106 memory.stat files in
> /sys/fs/cgroup/memory.
>
> Digging a bit further, cat
> /sys/fs/cgroup/memory/system.slice/memory.stat also takes ~500ms, but
> "find /sys/fs/cgroup/memory/system.slice -mindepth 2 -name memory.stat
> | xargs cat" takes only 8ms.
>
> Any thoughts, particularly on what I should compare between the good
> and bad machines to narrow down the cause, or even better, how to
> prevent it happening?
>
> Thanks
> Bruce
> --
> Bruce Merry
> Senior Science Processing Developer
> SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Michal Hocko @ 2018-07-18 10:42 UTC
To: Bruce Merry
Cc: Andrew Morton, linux-kernel, linux-mm, Johannes Weiner, Vladimir Davydov

[CC some more people]

On Tue 17-07-18 21:23:07, Andrew Morton wrote:
> (cc linux-mm)
>
> On Tue, 3 Jul 2018 08:43:23 +0200 Bruce Merry <bmerry@ska.ac.za> wrote:
>
> > Hi
> >
> > I've run into an odd performance issue in the kernel, and not being a
> > kernel dev or knowing terribly much about cgroups, am looking for
> > advice on diagnosing the problem further (I discovered this while
> > trying to pin down high CPU load in cadvisor).
> >
> > On some machines in our production system, cat
> > /sys/fs/cgroup/memory/memory.stat is extremely slow (500ms on one
> > machine), while on other nominally identical machines it is fast
> > (2ms).

Could you try to use ftrace to see where the time is spent?
memory_stat_show should only scale with the depth of the cgroup
hierarchy for memory.stat to get cumulative numbers. All the rest should
be simply reads of gathered counters. There is no locking involved in
the current kernel. What is the kernel version you are using, btw?

Keeping the rest of the email for new people on the CC

> > One other thing I've noticed is that the affected machines generally
> > have much larger values for SUnreclaim in /proc/meminfo (up to several
> > GB), and slabtop reports >1GB of dentry.
> >
> > [...]

--
Michal Hocko
SUSE Labs
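(A minimal sketch of the kind of ftrace run Michal suggests, using the
function_graph tracer. It assumes tracefs is mounted at
/sys/kernel/debug/tracing and that memcg_stat_show has not been inlined
on the running kernel; it only times the function as a whole, it does
not show where inside it the time goes.)

  cd /sys/kernel/debug/tracing
  echo memcg_stat_show > set_graph_function   # trace only this function
  echo function_graph > current_tracer
  echo 1 > tracing_on
  cat /sys/fs/cgroup/memory/memory.stat > /dev/null
  echo 0 > tracing_on
  head -n 40 trace                            # per-call duration appears here
  echo nop > current_tracer                   # reset when done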
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-18 14:29 UTC
To: Michal Hocko
Cc: Andrew Morton, linux-kernel, linux-mm, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 12:42, Michal Hocko <mhocko@kernel.org> wrote:
> [CC some more people]
>
> On Tue 17-07-18 21:23:07, Andrew Morton wrote:
>> (cc linux-mm)
>>
>> On Tue, 3 Jul 2018 08:43:23 +0200 Bruce Merry <bmerry@ska.ac.za> wrote:
>>
>> > Hi
>> >
>> > I've run into an odd performance issue in the kernel, and not being a
>> > kernel dev or knowing terribly much about cgroups, am looking for
>> > advice on diagnosing the problem further (I discovered this while
>> > trying to pin down high CPU load in cadvisor).
>> >
>> > On some machines in our production system, cat
>> > /sys/fs/cgroup/memory/memory.stat is extremely slow (500ms on one
>> > machine), while on other nominally identical machines it is fast
>> > (2ms).
>
> Could you try to use ftrace to see where the time is spent?

Thanks for looking into this. I'm not familiar with ftrace. Can you
give me a specific command line to run? Based on "perf record cat
/sys/fs/cgroup/memory/memory.stat"/"perf report", I see the following:

  42.09%  cat  [kernel.kallsyms]  [k] memcg_stat_show
  29.19%  cat  [kernel.kallsyms]  [k] memcg_sum_events.isra.22
  12.41%  cat  [kernel.kallsyms]  [k] mem_cgroup_iter
   5.42%  cat  [kernel.kallsyms]  [k] _find_next_bit
   4.14%  cat  [kernel.kallsyms]  [k] css_next_descendant_pre
   3.44%  cat  [kernel.kallsyms]  [k] find_next_bit
   2.84%  cat  [kernel.kallsyms]  [k] mem_cgroup_node_nr_lru_pages

> memory_stat_show should only scale with the depth of the cgroup
> hierarchy for memory.stat to get cumulative numbers. All the rest should
> be simply reads of gathered counters. There is no locking involved in
> the current kernel. What is the kernel version you are using, btw?

Ubuntu 16.04 with kernel 4.13.0-41-generic (so presumably includes
some Ubuntu special sauce).

Some new information: when this occurred on another machine I ran
"echo 2 > /proc/sys/vm/drop_caches" to drop the dentry cache, and
performance immediately improved.

Unfortunately, I've not been able to deliberately reproduce the issue.
I've tried doing the following 10^7 times in a loop, and while it
inflates the dentry cache, it doesn't cause any significant slowdown:
1. Create a temporary cgroup: mkdir /sys/fs/cgroup/memory/<name>.
2. stat /sys/fs/cgroup/memory/<name>/memory.stat
3. rmdir /sys/fs/cgroup/memory/<name>

I've also tried inflating the dentry cache just by stat-ing millions of
non-existent files, and again, no slowdown. So I'm not sure exactly how
the dentry cache is related.

Regards
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Michal Hocko @ 2018-07-18 14:47 UTC
To: Bruce Merry
Cc: Andrew Morton, linux-kernel, linux-mm, Johannes Weiner, Vladimir Davydov

On Wed 18-07-18 16:29:20, Bruce Merry wrote:
> Thanks for looking into this. I'm not familiar with ftrace. Can you
> give me a specific command line to run? Based on "perf record cat
> /sys/fs/cgroup/memory/memory.stat"/"perf report", I see the following:
>
>   42.09%  cat  [kernel.kallsyms]  [k] memcg_stat_show
>   29.19%  cat  [kernel.kallsyms]  [k] memcg_sum_events.isra.22
>   12.41%  cat  [kernel.kallsyms]  [k] mem_cgroup_iter
>    5.42%  cat  [kernel.kallsyms]  [k] _find_next_bit
>    4.14%  cat  [kernel.kallsyms]  [k] css_next_descendant_pre
>    3.44%  cat  [kernel.kallsyms]  [k] find_next_bit
>    2.84%  cat  [kernel.kallsyms]  [k] mem_cgroup_node_nr_lru_pages

I would just use perf record as you did. How long did the call take?
Also, is the excessive time an outlier or a more consistent thing? If
the former, does perf record show any difference?

> > memory_stat_show should only scale with the depth of the cgroup
> > hierarchy for memory.stat to get cumulative numbers. All the rest should
> > be simply reads of gathered counters. There is no locking involved in
> > the current kernel. What is the kernel version you are using, btw?
>
> Ubuntu 16.04 with kernel 4.13.0-41-generic (so presumably includes
> some Ubuntu special sauce).

Do you see the same when running with the vanilla kernel?
--
Michal Hocko
SUSE Labs
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-18 15:27 UTC
To: Michal Hocko
Cc: Andrew Morton, linux-kernel, linux-mm, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 16:47, Michal Hocko <mhocko@kernel.org> wrote:
>> Thanks for looking into this. I'm not familiar with ftrace. Can you
>> give me a specific command line to run? Based on "perf record cat
>> /sys/fs/cgroup/memory/memory.stat"/"perf report", I see the following:
>>
>>   42.09%  cat  [kernel.kallsyms]  [k] memcg_stat_show
>>   29.19%  cat  [kernel.kallsyms]  [k] memcg_sum_events.isra.22
>>   12.41%  cat  [kernel.kallsyms]  [k] mem_cgroup_iter
>>    5.42%  cat  [kernel.kallsyms]  [k] _find_next_bit
>>    4.14%  cat  [kernel.kallsyms]  [k] css_next_descendant_pre
>>    3.44%  cat  [kernel.kallsyms]  [k] find_next_bit
>>    2.84%  cat  [kernel.kallsyms]  [k] mem_cgroup_node_nr_lru_pages
>
> I would just use perf record as you did. How long did the call take?
> Also, is the excessive time an outlier or a more consistent thing? If
> the former, does perf record show any difference?

I didn't note the exact time for that particular run, but it's pretty
consistently 372-377ms on the machine that has that perf report. The
times differ between machines showing the symptom (anywhere from
200-500ms), but are consistent (within a few ms) in back-to-back runs
on each machine.

>> Ubuntu 16.04 with kernel 4.13.0-41-generic (so presumably includes
>> some Ubuntu special sauce).
>
> Do you see the same when running with the vanilla kernel?

We don't currently have any boxes running vanilla kernels. While I
could install a test box with a vanilla kernel, I don't know how to
reproduce the problem, what piece of our production environment is
triggering it, or even why some machines are unaffected, so if the
problem didn't re-occur on the test box I wouldn't be able to conclude
anything useful.

Do you have suggestions on things I could try that might trigger this?
e.g. are there cases where a cgroup no longer shows up in the
filesystem but is still lingering while waiting for its refcount to
hit zero? Does every child cgroup contribute to the stat_show cost of
its parent, or does it have to have some non-trivial variation from its
parent?

Thanks
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Shakeel Butt @ 2018-07-18 15:33 UTC
To: bmerry
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed, Jul 18, 2018 at 8:27 AM Bruce Merry <bmerry@ska.ac.za> wrote:
>
> [...]
>
> Do you have suggestions on things I could try that might trigger this?
> e.g. are there cases where a cgroup no longer shows up in the
> filesystem but is still lingering while waiting for its refcount to
> hit zero? Does every child cgroup contribute to the stat_show cost of
> its parent, or does it have to have some non-trivial variation from its
> parent?

The memcg tree does include all zombie memcgs, and these zombies do
contribute to the memcg_stat_show cost.

Shakeel
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Shakeel Butt @ 2018-07-18 15:26 UTC
To: bmerry
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed, Jul 18, 2018 at 7:29 AM Bruce Merry <bmerry@ska.ac.za> wrote:
>
> [...]
>
>   42.09%  cat  [kernel.kallsyms]  [k] memcg_stat_show
>   29.19%  cat  [kernel.kallsyms]  [k] memcg_sum_events.isra.22
>   12.41%  cat  [kernel.kallsyms]  [k] mem_cgroup_iter
>    5.42%  cat  [kernel.kallsyms]  [k] _find_next_bit
>    4.14%  cat  [kernel.kallsyms]  [k] css_next_descendant_pre
>    3.44%  cat  [kernel.kallsyms]  [k] find_next_bit
>    2.84%  cat  [kernel.kallsyms]  [k] mem_cgroup_node_nr_lru_pages

It seems like you are using cgroup-v1. How many nodes are there in
your memcg tree, and also how many cpus does the system have?

Please note that memcg_stat_show or reading memory.stat in cgroup-v1
is not as optimized as cgroup-v2. The function memcg_stat_show() in 4.13
does ~17 tree walks, and then for ~12 of those tree walks it goes
through all cpus for each node in the memcg tree. In 4.16,
a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat
reporting") optimizes away the cpu traversal at the expense of some
accuracy. The next optimization would be to do just one memcg tree
traversal, similar to cgroup-v2's memory_stat_show().

Anyways, is it possible for you to try a 4.16 kernel?

Shakeel
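(Both questions can be answered from a shell; the sketch below assumes
the v1 memory controller is mounted at /sys/fs/cgroup/memory, and, as
discussed later in the thread, it only counts cgroups still visible in
the filesystem, not zombies.)

  find /sys/fs/cgroup/memory -name memory.stat | wc -l   # visible memcg nodes
  nproc                                                  # online CPUs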
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-18 15:37 UTC
To: Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 17:26, Shakeel Butt <shakeelb@google.com> wrote:
> It seems like you are using cgroup-v1. How many nodes are there in
> your memcg tree, and also how many cpus does the system have?

From my original email: "there are 106 memory.stat files in
/sys/fs/cgroup/memory." - is that what you mean by the number of
nodes?

The affected systems all have 8 CPU cores (hyperthreading is disabled).

> Please note that memcg_stat_show or reading memory.stat in cgroup-v1
> is not as optimized as cgroup-v2. The function memcg_stat_show() in 4.13
> does ~17 tree walks, and then for ~12 of those tree walks it goes
> through all cpus for each node in the memcg tree. In 4.16,
> a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat
> reporting") optimizes away the cpu traversal at the expense of some
> accuracy. The next optimization would be to do just one memcg tree
> traversal, similar to cgroup-v2's memory_stat_show().

On most machines it is still fast (1-2ms), and there is no difference
in the number of CPUs and only very small differences in the number of
live memory cgroups, so presumably something else is going on.

> The memcg tree does include all zombie memcgs, and these zombies do
> contribute to the memcg_stat_show cost.

That sounds promising. Is there any way to tell how many zombies there
are, and is there any way to deliberately create zombies? If I can
produce zombies that might give me a reliable way to reproduce the
problem, which could then sensibly be tested against newer kernel
versions.

Thanks
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Shakeel Butt @ 2018-07-18 15:49 UTC
To: bmerry
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed, Jul 18, 2018 at 8:37 AM Bruce Merry <bmerry@ska.ac.za> wrote:
>
> From my original email: "there are 106 memory.stat files in
> /sys/fs/cgroup/memory." - is that what you mean by the number of
> nodes?

Yes, but it seems like your system might be suffering with zombies.

> The affected systems all have 8 CPU cores (hyperthreading is disabled).
>
> [...]
>
> That sounds promising. Is there any way to tell how many zombies there
> are, and is there any way to deliberately create zombies? If I can
> produce zombies that might give me a reliable way to reproduce the
> problem, which could then sensibly be tested against newer kernel
> versions.

Yes, very easy to produce zombies, though I don't think the kernel
provides any way to tell how many zombies exist on the system.

To create a zombie, first create a memcg node, enter that memcg,
create a tmpfs file of a few KiBs, exit the memcg and rmdir the memcg.
That memcg will be a zombie until you delete that tmpfs file.

Shakeel
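(A rough shell sketch of the zombie-creation sequence Shakeel describes,
run as root. The cgroup name and tmpfs path are only placeholders, and
it assumes cgroup v1 with the memory controller mounted at
/sys/fs/cgroup/memory and a tmpfs mounted at /dev/shm.)

  mkdir /sys/fs/cgroup/memory/zombie-test                  # create a memcg node
  echo $$ > /sys/fs/cgroup/memory/zombie-test/tasks        # enter it
  dd if=/dev/zero of=/dev/shm/zombie-test bs=1K count=8    # charge a few KiB of tmpfs to it
  echo $$ > /sys/fs/cgroup/memory/tasks                    # exit back to the root memcg
  rmdir /sys/fs/cgroup/memory/zombie-test                  # the memcg is now offline...
  # ...but remains a zombie until the tmpfs file is removed:
  # rm /dev/shm/zombie-test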
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-18 17:40 UTC
To: Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 17:49, Shakeel Butt <shakeelb@google.com> wrote:
> On Wed, Jul 18, 2018 at 8:37 AM Bruce Merry <bmerry@ska.ac.za> wrote:
>> That sounds promising. Is there any way to tell how many zombies there
>> are, and is there any way to deliberately create zombies? If I can
>> produce zombies that might give me a reliable way to reproduce the
>> problem, which could then sensibly be tested against newer kernel
>> versions.
>
> Yes, very easy to produce zombies, though I don't think the kernel
> provides any way to tell how many zombies exist on the system.
>
> To create a zombie, first create a memcg node, enter that memcg,
> create a tmpfs file of a few KiBs, exit the memcg and rmdir the memcg.
> That memcg will be a zombie until you delete that tmpfs file.

Thanks, that makes sense. I'll see if I can reproduce the issue. Do
you expect the same thing to happen with normal (non-tmpfs) files that
are sitting in the page cache, and/or dentries?

Cheers
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Shakeel Butt @ 2018-07-18 17:48 UTC
To: bmerry
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed, Jul 18, 2018 at 10:40 AM Bruce Merry <bmerry@ska.ac.za> wrote:
>
> Thanks, that makes sense. I'll see if I can reproduce the issue. Do
> you expect the same thing to happen with normal (non-tmpfs) files that
> are sitting in the page cache, and/or dentries?

Normal files and their dentries can get reclaimed while tmpfs will
stick, and even if the data of tmpfs goes to swap, the kmem related to
tmpfs files will remain in memory.

Shakeel
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-18 17:58 UTC
To: Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 19:48, Shakeel Butt <shakeelb@google.com> wrote:
>> Thanks, that makes sense. I'll see if I can reproduce the issue. Do
>> you expect the same thing to happen with normal (non-tmpfs) files that
>> are sitting in the page cache, and/or dentries?
>
> Normal files and their dentries can get reclaimed while tmpfs will
> stick, and even if the data of tmpfs goes to swap, the kmem related to
> tmpfs files will remain in memory.

Sure, page cache and dentries are reclaimable given memory pressure.
These machines all have more memory than they need though (64GB+) and
generally don't come under any memory pressure. I'm just wondering if
the behaviour we're seeing can be explained as a result of a lot of
dentries sticking around (because there is no memory pressure) and in
turn causing a lot of zombie cgroups to stay present until something
forces reclamation of dentries.

Cheers
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Shakeel Butt @ 2018-07-18 18:13 UTC
To: bmerry
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed, Jul 18, 2018 at 10:58 AM Bruce Merry <bmerry@ska.ac.za> wrote:
>
> Sure, page cache and dentries are reclaimable given memory pressure.
> These machines all have more memory than they need though (64GB+) and
> generally don't come under any memory pressure. I'm just wondering if
> the behaviour we're seeing can be explained as a result of a lot of
> dentries sticking around (because there is no memory pressure) and in
> turn causing a lot of zombie cgroups to stay present until something
> forces reclamation of dentries.

Yes, if there is no memory pressure such memory can stay around.

On your production machine, before deleting memory containers, you can
try force_empty to reclaim such memory from them. See if that helps.

Shakeel
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-18 18:43 UTC
To: Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 20:13, Shakeel Butt <shakeelb@google.com> wrote:
> Yes, if there is no memory pressure such memory can stay around.
>
> On your production machine, before deleting memory containers, you can
> try force_empty to reclaim such memory from them. See if that helps.

Thanks. At the moment the cgroups are all managed by systemd and
docker, but I'll keep that in mind while experimenting.

Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-24 10:05 UTC
To: Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 18 July 2018 at 19:40, Bruce Merry <bmerry@ska.ac.za> wrote:
>> Yes, very easy to produce zombies, though I don't think the kernel
>> provides any way to tell how many zombies exist on the system.
>>
>> To create a zombie, first create a memcg node, enter that memcg,
>> create a tmpfs file of a few KiBs, exit the memcg and rmdir the memcg.
>> That memcg will be a zombie until you delete that tmpfs file.
>
> Thanks, that makes sense. I'll see if I can reproduce the issue.

Hi

I've had some time to experiment with this issue, and I've now got a
way to reproduce it fairly reliably, including with a stock 4.17.8
kernel. However, it's very phase-of-the-moon stuff, and even
apparently trivial changes (like switching the order in which the
files are statted) make the issue disappear.

To reproduce:
1. Start cadvisor running. I use the 0.30.2 binary from Github, and
   run it with sudo ./cadvisor-0.30.2 --logtostderr=true
2. Run the Python 3 script below, which repeatedly creates a cgroup,
   enters it, stats some files in it, and leaves it again (and removes
   it). It takes a few minutes to run.
3. time cat /sys/fs/cgroup/memory/memory.stat. It now takes about 20ms
   for me.
4. sudo sysctl vm.drop_caches=2
5. time cat /sys/fs/cgroup/memory/memory.stat. It is back to 1-2ms.

I've also added some code to memcg_stat_show to report the number of
cgroups in the hierarchy (iterations in for_each_mem_cgroup_tree).
Running the script increases it from ~700 to ~41000. The script
iterates 250,000 times, so only some fraction of the cgroups become
zombies.

I also tried the suggestion of force_empty: it makes the problem go
away, but is also very, very slow (about 0.5s per iteration), and
given the sensitivity of the test to small changes I don't know how
meaningful that is.

Reproduction code (if you have tqdm installed you get a nice progress
bar, but not required). Hopefully Gmail doesn't do any format
mangling:

#!/usr/bin/env python3
import os

try:
    from tqdm import trange as range
except ImportError:
    pass


def clean():
    try:
        os.rmdir(name)
    except FileNotFoundError:
        pass


def move_to(cgroup):
    with open(cgroup + '/tasks', 'w') as f:
        print(pid, file=f)


pid = os.getpid()
os.chdir('/sys/fs/cgroup/memory')
name = 'dummy'
N = 250000
clean()
try:
    for i in range(N):
        os.mkdir(name)
        move_to(name)
        for filename in ['memory.stat', 'memory.swappiness']:
            os.stat(os.path.join(name, filename))
        move_to('user.slice')
        os.rmdir(name)
finally:
    move_to('user.slice')
    clean()

Regards
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Marinko Catovic @ 2018-07-24 10:50 UTC
To: linux-mm

hello guys

excuse me please for dropping in, but I can not ignore the fact that all
this sounds 99%+ the same as the issue I have been going nuts with for
the past 2 months, since I switched kernels from version 3 to 4. Please
look at the topic `Caching/buffers become useless after some time`.

What I did not mention there is that cgroups are also mounted and used,
but not actively, since I have some scripting issue with setting them up
correctly; still, there is active data in
/sys/fs/cgroup/memory/memory.stat, so it might be related to cgroups - I
did not think of that until now.

Same story here as well: echoing 2 into drop_caches solves the issue
temporarily, for maybe 2-4 days with lots of I/O.

I can however test and play around with cgroups - if one may want to
suggest to disable them I'd gladly monitor the behavior (please tell me
what and how to do it, if necessary). Also I am curious: could you
disable cgroups as well, just to see whether it helps and is actually
associated with cgroups?

My sysctl settings regarding vm are:

vm.dirty_ratio = 15
vm.dirty_background_ratio = 3
vm.vfs_cache_pressure = 1

I may tell (though not for sure) that this issue is less significant
since I lowered these values; previously I had 90/80 for dirty_ratio and
dirty_background_ratio, and I am not sure about the cache pressure any
more. Still there is lots of RAM unallocated, usually at least half,
mostly even more totally unused; these hosts have 64GB of RAM as well.

I hope this is kinda related, so we can work together on pinpointing
this. The issue is not going away for me and causes lots of headache,
slowing down my entire business.

2018-07-24 12:05 GMT+02:00 Bruce Merry <bmerry@ska.ac.za>:
> I've had some time to experiment with this issue, and I've now got a
> way to reproduce it fairly reliably, including with a stock 4.17.8
> kernel.
>
> [...]
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Michal Hocko @ 2018-07-25 12:29 UTC
To: Marinko Catovic; +Cc: linux-mm

On Tue 24-07-18 12:50:27, Marinko Catovic wrote:
> I hope this is kinda related, so we can work together on pinpointing
> this. The issue is not going away for me and causes lots of headache,
> slowing down my entire business.

I think your problem is not really related. I still didn't get to your
collected data, but this issue is more related to the laziness of the
cgroup objects teardown.
--
Michal Hocko
SUSE Labs
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Michal Hocko @ 2018-07-25 12:32 UTC
To: Bruce Merry
Cc: Shakeel Butt, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Tue 24-07-18 12:05:35, Bruce Merry wrote:
[...]
> I've also added some code to memcg_stat_show to report the number of
> cgroups in the hierarchy (iterations in for_each_mem_cgroup_tree).
> Running the script increases it from ~700 to ~41000. The script
> iterates 250,000 times, so only some fraction of the cgroups become
> zombies.

So this is definitely "too many zombies" to delay the collecting of
cumulative stats. Maybe we need to limit the number of zombies and
reclaim them more actively. I have seen Shakeel has posted something,
but it looked more on the accounting side from a quick glance.

I can see you are using cgroup v1, so your workaround would be to
memory.force_empty before you remove the group.
--
Michal Hocko
SUSE Labs
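(A sketch of that workaround for cgroup v1; <group> is a placeholder for
the cgroup about to be removed, and the write can take a while since it
forces reclaim of the group's charged pages before the rmdir.)

  echo 0 > /sys/fs/cgroup/memory/<group>/memory.force_empty
  rmdir /sys/fs/cgroup/memory/<group>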
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-26 12:35 UTC
To: Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 24 July 2018 at 12:05, Bruce Merry <bmerry@ska.ac.za> wrote:
> To reproduce:
> 1. Start cadvisor running. I use the 0.30.2 binary from Github, and
>    run it with sudo ./cadvisor-0.30.2 --logtostderr=true
> 2. Run the Python 3 script below, which repeatedly creates a cgroup,
>    enters it, stats some files in it, and leaves it again (and removes
>    it). It takes a few minutes to run.
> 3. time cat /sys/fs/cgroup/memory/memory.stat. It now takes about 20ms
>    for me.
> 4. sudo sysctl vm.drop_caches=2
> 5. time cat /sys/fs/cgroup/memory/memory.stat. It is back to 1-2ms.
>
> I've also added some code to memcg_stat_show to report the number of
> cgroups in the hierarchy (iterations in for_each_mem_cgroup_tree).
> Running the script increases it from ~700 to ~41000. The script
> iterates 250,000 times, so only some fraction of the cgroups become
> zombies.

I've discovered that I'd messed up that instrumentation code (it was
incrementing inside a loop, so it counted 5x too many cgroups), so some
of the things I said turn out to be wrong. Let me try again:

- Running the script generates about 8000 zombies (not 40000), with or
  without Shakeel's patch, for 250,000 cgroups created/destroyed - so
  possibly there is some timing condition that makes them into zombies.
  I've only measured it with 4.17, but based on timing results I have no
  particular reason to think it's wildly different on older kernels.
- After running the script 5 times (to generate 40K zombies), getting
  the stats takes 20ms with Shakeel's patch and 80ms without it (on
  4.17.9) - which is a speedup of the same order of magnitude as Shakeel
  observed with non-zombies.
- 4.17.9 already seems to be an improvement over 4.15: with 40K
  (non-zombie) cgroups, memory.stat time decreases from 200ms to 75ms.

So with 4.15 -> 4.17.9 plus Shakeel's patch, the effects are reduced
by an order of magnitude, which is good news. Of course, that doesn't
solve the fundamental issue of why the zombies get generated in the
first place. I'm not a kernel developer and I very much doubt I'll
have the time to try to debug what may turn out to be a race
condition, but let me know if I can help with testing things.

Regards
Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Michal Hocko @ 2018-07-26 12:48 UTC
To: Bruce Merry
Cc: Shakeel Butt, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Thu 26-07-18 14:35:34, Bruce Merry wrote:
> I've discovered that I'd messed up that instrumentation code (it was
> incrementing inside a loop, so it counted 5x too many cgroups), so some
> of the things I said turn out to be wrong. Let me try again:
>
> [...]
>
> So with 4.15 -> 4.17.9 plus Shakeel's patch, the effects are reduced
> by an order of magnitude, which is good news. Of course, that doesn't
> solve the fundamental issue of why the zombies get generated in the
> first place. I'm not a kernel developer and I very much doubt I'll
> have the time to try to debug what may turn out to be a race
> condition, but let me know if I can help with testing things.

As already explained, this is not a race. We simply keep pages charged
to a memcg we are removing and rely on memory reclaim to free them when
we need that memory for something else. The problem you are seeing is a
side effect of this, because a large number of zombies adds up when we
need to get cumulative stats for their parent.
--
Michal Hocko
SUSE Labs
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Singh, Balbir @ 2018-07-26  0:55 UTC
To: Bruce Merry, Shakeel Butt
Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 7/19/18 3:40 AM, Bruce Merry wrote:
> On 18 July 2018 at 17:49, Shakeel Butt <shakeelb@google.com> wrote:
>> Yes, very easy to produce zombies, though I don't think the kernel
>> provides any way to tell how many zombies exist on the system.
>>
>> To create a zombie, first create a memcg node, enter that memcg,
>> create a tmpfs file of a few KiBs, exit the memcg and rmdir the memcg.
>> That memcg will be a zombie until you delete that tmpfs file.
>
> Thanks, that makes sense. I'll see if I can reproduce the issue. Do
> you expect the same thing to happen with normal (non-tmpfs) files that
> are sitting in the page cache, and/or dentries?

Do you by any chance have use_hierarchy=1? memcg_stat_show should just
rely on counters inside the memory cgroup and the LRU sizes for each
node.

Balbir Singh.
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Bruce Merry @ 2018-07-26  6:41 UTC
To: Singh, Balbir
Cc: Shakeel Butt, Michal Hocko, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On 26 July 2018 at 02:55, Singh, Balbir <bsingharora@gmail.com> wrote:
> Do you by any chance have use_hierarchy=1? memcg_stat_show should just
> rely on counters inside the memory cgroup and the LRU sizes for each
> node.

Yes, /sys/fs/cgroup/memory/memory.use_hierarchy is 1. I assume systemd
is doing that.

Bruce
--
Bruce Merry
Senior Science Processing Developer
SKA South Africa
* Re: Showing /sys/fs/cgroup/memory/memory.stat very slow on some machines
From: Michal Hocko @ 2018-07-26  8:19 UTC
To: Bruce Merry
Cc: Singh, Balbir, Shakeel Butt, Andrew Morton, LKML, Linux MM, Johannes Weiner, Vladimir Davydov

On Thu 26-07-18 08:41:35, Bruce Merry wrote:
> On 26 July 2018 at 02:55, Singh, Balbir <bsingharora@gmail.com> wrote:
> > Do you by any chance have use_hierarchy=1? memcg_stat_show should just
> > rely on counters inside the memory cgroup and the LRU sizes for each
> > node.
>
> Yes, /sys/fs/cgroup/memory/memory.use_hierarchy is 1. I assume systemd
> is doing that.

And this is actually good. Non-hierarchical behavior is discouraged.

The real problem is that we are keeping way too many zombie memcgs
around, waiting for memory pressure to reclaim them so they go away on
their own. As I've tried to explain in another email, force_empty before
removing the memcg should help.

Fixing this properly would require quite some heavy lifting AFAICS. We
would basically have to move zombies out of the way, which is not hard,
but we do not want to hide their current memory consumption, so we would
have to somehow move their stats to the parent. And then we are back to
reparenting, which has been removed by b2052564e66d ("mm: memcontrol:
continue cache reclaim from offlined groups").
--
Michal Hocko
SUSE Labs