All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Bruno Prémont" <bonbons-ud5FBsm0p/xEiooADzr8i9i2O/JbrIOy@public.gmane.org>
To: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Chris Down <chris-6Bi1550iOqEnzZ6mRAm98g@public.gmane.org>,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>,
	Vladimir Davydov
	<vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Subject: Re: Memory CG and 5.1 to 5.6 uprade slows backup
Date: Wed, 15 Apr 2020 12:17:53 +0200	[thread overview]
Message-ID: <20200415121753.3c8d700b@hemera.lan.sysophe.eu> (raw)
In-Reply-To: <20200414163134.GQ4629-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>

Hi Michal,

On Tue, 14 Apr 2020 18:31:34 Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Fri 10-04-20 11:50:10, Bruno Prémont wrote:
> > Hi Michal, Chris,
> > 
> > Sending ephemeral link to (now properly) captured cgroup details off-list.

Re-adding the list and other readers.

> > It contains:
> >   snapshots of the cgroup contents at 1s interval
> > backup was running through the full captured period.
> > 
> > You can see memory.max changes at
> >   26-21  (to high + 128M)
> > and
> >   30-24  (back to high)  
> 
> OK, so you have started with high = max and do not see any stalls. As
> soon as you activate the high limit reclaim by increasing the max limit
> you get your stalls because those tasks are put into sleep. There are
> only 3 tasks in the cgroup and it seems to be shell, tar and some
> subshell or is there anything else that could charge any memory?

On production system it's just backup software (3-4 processes)

On basic reproducer it's bash, tar and tar's gzip subprocess.

> Let's just focus on the time prior to switch and after
> * prior
> $ for i in $(seq -w 21); do cat 26-$i/memory.current ; done | calc_min_max.awk
> min: 2371895296.00 max: 2415874048.00 avg: 2404059136.00 std: 13469809.78 nr: 20
> 
> high = hard limit = 2415919104
> 
> * after
> $ for i in $(seq -w 22 59); do cat 26-$i/memory.current ; done | calc_min_max.awk
> min: 2409172992.00 max: 2415828992.00 avg: 2415420793.26 std: 1475181.24 nr: 38
> 
> high limit = 2415919104
> hard limit = 2550136832
> 
> Nothing interesting here. The charged memory stays below the high limit.
> This might be a matter of timing of course because your snapshot might
> hit the window when the situation was ok. But 90K is a larger margin
> than I would expect in such a case.
> 
> $ cat 27-*/memory.current | calc_min_max.awk 
> min: 2408161280.00 max: 2415910912.00 avg: 2415709583.19 std: 993983.28 nr: 59
> 
> Still under the high limit but closer 8K so this looks much more like
> seeing high limit reclaim in action.
> 
> $ cat 28-*/memory.current | calc_min_max.awk
> min: 2409123840.00 max: 2415910912.00 avg: 2415633019.59 std: 870311.11 nr: 58
> 
> same here.
> 
> $ cat 29-*/memory.current | calc_min_max.awk
> min: 2400464896.00 max: 2415828992.00 avg: 2414819015.59 std: 3133978.89 nr: 59
> 
> quite below high limit.
> 
> So I do not see any large high limit excess here but it could be a
> matter of timing as mentioned above. Let's have a look at the reclaim
> activity.
> 
> 27-00/memory.stat:pgscan 82811883
> 27-00/memory.stat:pgsteal 80223813
> 27-01/memory.stat:pgscan 82811883
> 27-01/memory.stat:pgsteal 80223813
> 
> No scanning 
> 
> 27-02/memory.stat:pgscan 82811947
> 27-02/memory.stat:pgsteal 80223877
> 
> 64 pages scanned and reclaimed
> 
> 27-03/memory.stat:pgscan 82811947
> 27-03/memory.stat:pgsteal 80223877
> 27-04/memory.stat:pgscan 82811947
> 27-04/memory.stat:pgsteal 80223877
> 27-05/memory.stat:pgscan 82811947
> 27-05/memory.stat:pgsteal 80223877
> 
> No scanning
> 
> 27-06/memory.stat:pgscan 82812011
> 27-06/memory.stat:pgsteal 80223941
> 
> 64 pages scanned and reclaimed
> 
> 27-07/memory.stat:pgscan 82812011
> 27-07/memory.stat:pgsteal 80223941
> 27-08/memory.stat:pgscan 82812011
> 27-08/memory.stat:pgsteal 80223941
> 27-09/memory.stat:pgscan 82812011
> 27-09/memory.stat:pgsteal 80223941
> 
> No scanning
> 
> 27-11/memory.stat:pgscan 82812075
> 27-11/memory.stat:pgsteal 80224005
> 
> 64 pages scanned
> 
> 27-12/memory.stat:pgscan 82812075
> 27-12/memory.stat:pgsteal 80224005
> 27-13/memory.stat:pgscan 82812075
> 27-13/memory.stat:pgsteal 80224005
> 27-14/memory.stat:pgscan 82812075
> 27-14/memory.stat:pgsteal 80224005
> 
> No scanning. etc...
> 
> So it seems there were two rounds of scanning (we usually do 32pages per
> batch and the reclaim was really effective at reclaiming that memory but
> then the task is put into sleep for 2-3s. This is quite unexpected
> because collected stats do not show the high limit excess during the
> sleeping time.
> 
> It would be interesting to see more detailed information on the
> throttling itself. Which kernel version are you testing this on?
> 5.6+ kernel needs http://lkml.kernel.org/r/20200331152424.GA1019937-6Bi1550iOqEnzZ6mRAm98g@public.gmane.org
> but please note that e26733e0d0ec ("mm, memcg: throttle allocators based
> on ancestral memory.high") has been marked for stable so 5.4+ kernels
> might have it as well and you would need the same fix for them as well.
> I wouldn't be really surprised if this was the actual problem that you
> are hitting because the reclaim could simply made usage < high and the
> math in calculate_high_delay doesn't work properly.

I'm on 5.6.2. Seems neither e26733e0d0ec nor the fix hit 5.6.2 (nor
current 5.6.4).

> Anyway the following simply tracing patch should give a better clue.
> The output will appear in the trace buffer (mount tracefs and read
> trace_pipe file).

This is the output I get on 5.6.4 with simple tar -zc call (max=high+4096):
  tar-16943 [000] ....  1098.796955: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144
  tar-16943 [000] ....  1100.876794: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144
  tar-16943 [000] ....  1102.956636: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262120 high:262144
  tar-16943 [000] ....  1105.037388: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262121 high:262144
  tar-16943 [000] ....  1107.117246: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144

With 5.7-rc1 it runs just fine, pressure remains zero and no output in trace_pipe or throttling.

So the fixes that went in there do fix it.
Now matter of cherry-picking the right ones... e26733e0d0ec and its fixe's fix,
maybe some others (will start with those tagged for stable).


Thanks,
Bruno

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 05b4ec2c6499..dcee3030309d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2417,6 +2417,10 @@ void mem_cgroup_handle_over_high(void)
>  	if (penalty_jiffies <= HZ / 100)
>  		goto out;
>  
> +	trace_printk("memcg_nr_pages_over_high:%d penalty_jiffies:%ld current:%lu high:%lu\n",
> +			nr_pages, penalty_jiffies,
> +			page_counter_read(&memcg->memory), READ_ONCE(memcg->high));
> +
>  	/*
>  	 * If we exit early, we're guaranteed to die (since
>  	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't


WARNING: multiple messages have this Message-ID (diff)
From: "Bruno Prémont" <bonbons@linux-vserver.org>
To: Michal Hocko <mhocko@kernel.org>
Cc: Chris Down <chris@chrisdown.name>,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	Johannes Weiner  <hannes@cmpxchg.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>
Subject: Re: Memory CG and 5.1 to 5.6 uprade slows backup
Date: Wed, 15 Apr 2020 12:17:53 +0200	[thread overview]
Message-ID: <20200415121753.3c8d700b@hemera.lan.sysophe.eu> (raw)
In-Reply-To: <20200414163134.GQ4629@dhcp22.suse.cz>

Hi Michal,

On Tue, 14 Apr 2020 18:31:34 Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 10-04-20 11:50:10, Bruno Prémont wrote:
> > Hi Michal, Chris,
> > 
> > Sending ephemeral link to (now properly) captured cgroup details off-list.

Re-adding the list and other readers.

> > It contains:
> >   snapshots of the cgroup contents at 1s interval
> > backup was running through the full captured period.
> > 
> > You can see memory.max changes at
> >   26-21  (to high + 128M)
> > and
> >   30-24  (back to high)  
> 
> OK, so you have started with high = max and do not see any stalls. As
> soon as you activate the high limit reclaim by increasing the max limit
> you get your stalls because those tasks are put into sleep. There are
> only 3 tasks in the cgroup and it seems to be shell, tar and some
> subshell or is there anything else that could charge any memory?

On production system it's just backup software (3-4 processes)

On basic reproducer it's bash, tar and tar's gzip subprocess.

> Let's just focus on the time prior to switch and after
> * prior
> $ for i in $(seq -w 21); do cat 26-$i/memory.current ; done | calc_min_max.awk
> min: 2371895296.00 max: 2415874048.00 avg: 2404059136.00 std: 13469809.78 nr: 20
> 
> high = hard limit = 2415919104
> 
> * after
> $ for i in $(seq -w 22 59); do cat 26-$i/memory.current ; done | calc_min_max.awk
> min: 2409172992.00 max: 2415828992.00 avg: 2415420793.26 std: 1475181.24 nr: 38
> 
> high limit = 2415919104
> hard limit = 2550136832
> 
> Nothing interesting here. The charged memory stays below the high limit.
> This might be a matter of timing of course because your snapshot might
> hit the window when the situation was ok. But 90K is a larger margin
> than I would expect in such a case.
> 
> $ cat 27-*/memory.current | calc_min_max.awk 
> min: 2408161280.00 max: 2415910912.00 avg: 2415709583.19 std: 993983.28 nr: 59
> 
> Still under the high limit but closer 8K so this looks much more like
> seeing high limit reclaim in action.
> 
> $ cat 28-*/memory.current | calc_min_max.awk
> min: 2409123840.00 max: 2415910912.00 avg: 2415633019.59 std: 870311.11 nr: 58
> 
> same here.
> 
> $ cat 29-*/memory.current | calc_min_max.awk
> min: 2400464896.00 max: 2415828992.00 avg: 2414819015.59 std: 3133978.89 nr: 59
> 
> quite below high limit.
> 
> So I do not see any large high limit excess here but it could be a
> matter of timing as mentioned above. Let's have a look at the reclaim
> activity.
> 
> 27-00/memory.stat:pgscan 82811883
> 27-00/memory.stat:pgsteal 80223813
> 27-01/memory.stat:pgscan 82811883
> 27-01/memory.stat:pgsteal 80223813
> 
> No scanning 
> 
> 27-02/memory.stat:pgscan 82811947
> 27-02/memory.stat:pgsteal 80223877
> 
> 64 pages scanned and reclaimed
> 
> 27-03/memory.stat:pgscan 82811947
> 27-03/memory.stat:pgsteal 80223877
> 27-04/memory.stat:pgscan 82811947
> 27-04/memory.stat:pgsteal 80223877
> 27-05/memory.stat:pgscan 82811947
> 27-05/memory.stat:pgsteal 80223877
> 
> No scanning
> 
> 27-06/memory.stat:pgscan 82812011
> 27-06/memory.stat:pgsteal 80223941
> 
> 64 pages scanned and reclaimed
> 
> 27-07/memory.stat:pgscan 82812011
> 27-07/memory.stat:pgsteal 80223941
> 27-08/memory.stat:pgscan 82812011
> 27-08/memory.stat:pgsteal 80223941
> 27-09/memory.stat:pgscan 82812011
> 27-09/memory.stat:pgsteal 80223941
> 
> No scanning
> 
> 27-11/memory.stat:pgscan 82812075
> 27-11/memory.stat:pgsteal 80224005
> 
> 64 pages scanned
> 
> 27-12/memory.stat:pgscan 82812075
> 27-12/memory.stat:pgsteal 80224005
> 27-13/memory.stat:pgscan 82812075
> 27-13/memory.stat:pgsteal 80224005
> 27-14/memory.stat:pgscan 82812075
> 27-14/memory.stat:pgsteal 80224005
> 
> No scanning. etc...
> 
> So it seems there were two rounds of scanning (we usually do 32pages per
> batch and the reclaim was really effective at reclaiming that memory but
> then the task is put into sleep for 2-3s. This is quite unexpected
> because collected stats do not show the high limit excess during the
> sleeping time.
> 
> It would be interesting to see more detailed information on the
> throttling itself. Which kernel version are you testing this on?
> 5.6+ kernel needs http://lkml.kernel.org/r/20200331152424.GA1019937@chrisdown.name
> but please note that e26733e0d0ec ("mm, memcg: throttle allocators based
> on ancestral memory.high") has been marked for stable so 5.4+ kernels
> might have it as well and you would need the same fix for them as well.
> I wouldn't be really surprised if this was the actual problem that you
> are hitting because the reclaim could simply made usage < high and the
> math in calculate_high_delay doesn't work properly.

I'm on 5.6.2. Seems neither e26733e0d0ec nor the fix hit 5.6.2 (nor
current 5.6.4).

> Anyway the following simply tracing patch should give a better clue.
> The output will appear in the trace buffer (mount tracefs and read
> trace_pipe file).

This is the output I get on 5.6.4 with simple tar -zc call (max=high+4096):
  tar-16943 [000] ....  1098.796955: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144
  tar-16943 [000] ....  1100.876794: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144
  tar-16943 [000] ....  1102.956636: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262120 high:262144
  tar-16943 [000] ....  1105.037388: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262121 high:262144
  tar-16943 [000] ....  1107.117246: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144

With 5.7-rc1 it runs just fine, pressure remains zero and no output in trace_pipe or throttling.

So the fixes that went in there do fix it.
Now matter of cherry-picking the right ones... e26733e0d0ec and its fixe's fix,
maybe some others (will start with those tagged for stable).


Thanks,
Bruno

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 05b4ec2c6499..dcee3030309d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2417,6 +2417,10 @@ void mem_cgroup_handle_over_high(void)
>  	if (penalty_jiffies <= HZ / 100)
>  		goto out;
>  
> +	trace_printk("memcg_nr_pages_over_high:%d penalty_jiffies:%ld current:%lu high:%lu\n",
> +			nr_pages, penalty_jiffies,
> +			page_counter_read(&memcg->memory), READ_ONCE(memcg->high));
> +
>  	/*
>  	 * If we exit early, we're guaranteed to die (since
>  	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't



  parent reply	other threads:[~2020-04-15 10:17 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-09  9:25 Memory CG and 5.1 to 5.6 uprade slows backup Bruno Prémont
2020-04-09  9:25 ` Bruno Prémont
     [not found] ` <20200409112505.2e1fc150-pDZhbqX7CfkoGc32E1+a2S4z1YicLaQ4@public.gmane.org>
2020-04-09  9:46   ` Michal Hocko
2020-04-09  9:46     ` Michal Hocko
     [not found]     ` <20200409094615.GE18386-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2020-04-09 10:17       ` Bruno Prémont
2020-04-09 10:17         ` Bruno Prémont
     [not found]         ` <20200409121733.1a5ba17c-pDZhbqX7CfkoGc32E1+a2S4z1YicLaQ4@public.gmane.org>
2020-04-09 10:34           ` Michal Hocko
2020-04-09 10:34             ` Michal Hocko
     [not found]             ` <20200409103400.GF18386-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2020-04-09 15:09               ` Bruno Prémont
2020-04-09 15:09                 ` Bruno Prémont
     [not found]                 ` <20200409170926.182354c3-pDZhbqX7CfkoGc32E1+a2S4z1YicLaQ4@public.gmane.org>
2020-04-09 15:24                   ` Chris Down
2020-04-09 15:24                     ` Chris Down
     [not found]                     ` <20200409152417.GB1040020-6Bi1550iOqEnzZ6mRAm98g@public.gmane.org>
2020-04-09 15:40                       ` Bruno Prémont
2020-04-09 15:40                         ` Bruno Prémont
     [not found]                         ` <20200409174042.2a3389ba-pDZhbqX7CfkoGc32E1+a2S4z1YicLaQ4@public.gmane.org>
2020-04-09 17:50                           ` Chris Down
2020-04-09 17:50                             ` Chris Down
     [not found]                             ` <20200409175044.GC1040020-6Bi1550iOqEnzZ6mRAm98g@public.gmane.org>
2020-04-09 17:56                               ` Chris Down
2020-04-09 17:56                                 ` Chris Down
2020-04-09 15:25                   ` Michal Hocko
2020-04-09 15:25                     ` Michal Hocko
2020-04-10  7:15                     ` Bruno Prémont
     [not found]                       ` <20200410091525.287062fa-pDZhbqX7CfkoGc32E1+a2S4z1YicLaQ4@public.gmane.org>
2020-04-10  8:43                         ` Bruno Prémont
2020-04-10  8:43                           ` Bruno Prémont
     [not found]                           ` <20200410115010.1d9f6a3f@hemera.lan.sysophe.eu>
     [not found]                             ` <20200414163134.GQ4629@dhcp22.suse.cz>
     [not found]                               ` <20200414163134.GQ4629-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2020-04-15 10:17                                 ` Bruno Prémont [this message]
2020-04-15 10:17                                   ` Bruno Prémont
     [not found]                                   ` <20200415121753.3c8d700b-pDZhbqX7CfkoGc32E1+a2S4z1YicLaQ4@public.gmane.org>
2020-04-15 10:24                                     ` Michal Hocko
2020-04-15 10:24                                       ` Michal Hocko
     [not found]                                       ` <20200415102442.GE4629-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2020-04-15 11:37                                         ` Bruno Prémont
2020-04-15 11:37                                           ` Bruno Prémont
     [not found]                     ` <20200409152540.GP18386-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2020-04-14 15:09                       ` Bruno Prémont
2020-04-14 15:09                         ` Bruno Prémont
2020-04-09 10:50   ` Chris Down
2020-04-09 10:50     ` Chris Down
     [not found]     ` <20200409105048.GA1040020-6Bi1550iOqEnzZ6mRAm98g@public.gmane.org>
2020-04-09 11:58       ` Bruno Prémont
2020-04-09 11:58         ` Bruno Prémont

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200415121753.3c8d700b@hemera.lan.sysophe.eu \
    --to=bonbons-ud5fbsm0p/xeiooadzr8i9i2o/jbrioy@public.gmane.org \
    --cc=cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=chris-6Bi1550iOqEnzZ6mRAm98g@public.gmane.org \
    --cc=hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org \
    --cc=linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org \
    --cc=mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org \
    --cc=vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.