From: "Bruno Prémont" <bonbons-ud5FBsm0p/xEiooADzr8i9i2O/JbrIOy@public.gmane.org>
To: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>,
	Vladimir Davydov
	<vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	Chris Down <chris-6Bi1550iOqEnzZ6mRAm98g@public.gmane.org>
Subject: Re: Memory CG and 5.1 to 5.6 uprade slows backup
Date: Tue, 14 Apr 2020 17:09:03 +0200
Message-ID: <20200414170903.4f28c29f@hemera.lan.sysophe.eu>
In-Reply-To: <20200409152540.GP18386-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>

Hi Michal, Chris,

I can reproduce very easily with basic commands on an idle system with
just a reasonably filled partition and lots of (free) RAM and running:
  bash -c 'echo $$ > $path/to/cgroup/cgroup.procs; tar -zc -C /export . > /dev/null'
where tar is running all alone in its cgroup with
  memory.high = 1024M
  memory.max  = 1152M   (high + 128M)
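
A minimal sketch of how such limits could be applied, assuming the cgroup
directory already exists and the values are given in bytes. The helper
name set_memcg_limits and the path in the usage comment are hypothetical,
not taken from the system above:

```shell
# Hypothetical helper: write memory.high and memory.max (= high + headroom)
# into an existing cgroup v2 directory. All values are in bytes.
set_memcg_limits() {
    cg_dir=$1
    high_bytes=$2
    headroom_bytes=$3
    max_bytes=$((high_bytes + headroom_bytes))
    printf '%s\n' "$high_bytes" > "$cg_dir/memory.high" || return 1
    printf '%s\n' "$max_bytes" > "$cg_dir/memory.max" || return 1
}

# Usage (hypothetical path): 1024M high, 128M headroom up to max
# set_memcg_limits /sys/fs/cgroup/backup $((1024 * 1024 * 1024)) $((128 * 1024 * 1024))
```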

At the start
  memory.stat:pgscan 0
  memory.stat:pgsteal 0
Once pressure is "high" and tar gets throttled, both values increase
together by 64 once every 2 seconds.
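
To put that rate in perspective, here is a small arithmetic sketch. It
assumes pgscan/pgsteal count pages and a page size of 4096 bytes; the
function name is made up for illustration:

```shell
# Convert "N pages every T seconds" into KiB per second.
# The page size defaults to 4096 bytes (an assumption; check getconf PAGESIZE).
reclaim_rate_kib_per_sec() {
    pages=$1
    interval_sec=$2
    page_size=${3:-4096}
    echo $(( pages * page_size / interval_sec / 1024 ))
}

# 64 pages every 2 seconds:
# reclaim_rate_kib_per_sec 64 2    -> 128
```

Under those assumptions reclaim frees only about 128 KiB per second while
tar is throttled.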

The cgroup's memory.current starts at 0 and grows up to memory.high, and
then pressure starts.
  memory.stat:inactive_file 910192640
  memory.stat:active_file 61501440
active_file remains low (64M) while inactive_file is high (most of the
1024M allowed)
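
For reference, the byte counters can be converted to MiB with a short
helper like the one below. It is only a sketch: the function name is
hypothetical, and it expects the plain "key value" lines of memory.stat
(without the "memory.stat:" grep prefix shown above):

```shell
# Print the value of one memory.stat key in whole MiB (rounded down).
memcg_stat_mib() {
    stat_file=$1
    key=$2
    awk -v k="$key" '$1 == k { printf "%d\n", $2 / 1048576 }' "$stat_file"
}

# Usage (hypothetical path):
# memcg_stat_mib /sys/fs/cgroup/backup/memory.stat inactive_file
```

With the values quoted above this gives 868 MiB for inactive_file and
58 MiB for active_file.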

Somehow reclaim either does not consider inactive_file or tries to
reclaim in pieces that are too small compared to the memory turnover in
the cgroup.


Even having memory.max just a single page (4096 bytes) larger than
memory.high brings the same throttling behavior.
Changing memory.max to match memory.high gets reclaim to work without
throttling.


Bruno


On Thu, 9 Apr 2020 17:25:40 Michal Hocko wrote:
> On Thu 09-04-20 17:09:26, Bruno Prémont wrote:
> > On Thu, 9 Apr 2020 12:34:00 +0200 Michal Hocko wrote:
> >   
> > > On Thu 09-04-20 12:17:33, Bruno Prémont wrote:  
> > > > On Thu, 9 Apr 2020 11:46:15 Michal Hocko wrote:    
> > > > > [Cc Chris]
> > > > > 
> > > > > On Thu 09-04-20 11:25:05, Bruno Prémont wrote:    
> > > > > > Hi,
> > > > > > 
> > > > > > Upgrading from 5.1 kernel to 5.6 kernel on a production system using
> > > > > > cgroups (v2) and having backup process in a memory.high=2G cgroup
> > > > > > > sees backup being highly throttled (there are about 1.5T to be
> > > > > > > backed up).
> > > > > 
> > > > > What does /proc/sys/vm/dirty_* say?    
> > > > 
> > > > /proc/sys/vm/dirty_background_bytes:0
> > > > /proc/sys/vm/dirty_background_ratio:10
> > > > /proc/sys/vm/dirty_bytes:0
> > > > /proc/sys/vm/dirty_expire_centisecs:3000
> > > > /proc/sys/vm/dirty_ratio:20
> > > > /proc/sys/vm/dirty_writeback_centisecs:500    
> > > 
> > > > Sorry, but I forgot to ask for the total amount of memory. But it seems
> > > > this is 64GB and 10% dirty ratio might mean a lot of dirty memory.
> > > Does the same happen if you reduce those knobs to something smaller than
> > > 2G? _bytes alternatives should be useful for that purpose.  
> > 
> > Well, tuning it to /proc/sys/vm/dirty_background_bytes:268435456
> > /proc/sys/vm/dirty_background_ratio:0
> > /proc/sys/vm/dirty_bytes:536870912
> > /proc/sys/vm/dirty_expire_centisecs:3000
> > /proc/sys/vm/dirty_ratio:0
> > /proc/sys/vm/dirty_writeback_centisecs:500
> > does not make any difference.  
> 
> OK, it was a wild guess because cgroup v2 should be able to throttle
> heavy writers and be memcg aware AFAIR. But good to have it confirmed.
> 
> [...]
> 
> > > > > Is it possible that the reclaim is not making progress on too many
> > > > > dirty pages and that triggers the back off mechanism that has been
> > > > > > implemented recently in 5.4 (have a look at 0e4b01df8659 ("mm,
> > > > > > memcg: throttle allocators when failing reclaim over memory.high")
> > > > > > and e26733e0d0ec ("mm, memcg: throttle allocators based on
> > > > > > ancestral memory.high"))?
> > > > 
> > > > Could be though in that case it's throttling the wrong task/cgroup
> > > > as far as I can see (at least from cgroup's memory stats) or being
> > > > blocked by state external to the cgroup.
> > > > > Will have a look at those patches to get a better idea of what they
> > > > > change.
> > > 
> > > Could you check where is the task of your interest throttled?
> > > /proc/<pid>/stack should give you a clue.  
> > 
> > As guessed by Chris, it's
> > [<0>] mem_cgroup_handle_over_high+0x121/0x170
> > [<0>] exit_to_usermode_loop+0x67/0xa0
> > [<0>] do_syscall_64+0x149/0x170
> > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > 
> > And I know of no way to tell the kernel to "drop all caches" for a
> > specific cgroup, nor how to list the inactive files assigned to a given
> > cgroup (knowing which ones they are and their idle state could help
> > in understanding why they aren't being reclaimed).
> > 
> > 
> > 
> > Could it be that cache is being prevented from being reclaimed by a task
> > in another cgroup?
> > 
> > e.g.
> >   cgroup/system/backup
> >     first reads $files (reads each once)
> >   cgroup/workload/bla
> >     second&more reads $files
> > 
> > Would $files remain associated to cgroup/system/backup and not
> > reclaimed there instead of being reassigned to cgroup/workload/bla?  
> 
> No, page cache is first-touch-gets-charged. But there is certainly
> interference possible if the memory is somehow pinned - e.g. mlock - by
> a task from another cgroup or internally by the FS.
> 
> Your earlier stat snapshot doesn't indicate a big problem with the
> reclaim though:
> 
> memory.stat:pgscan 47519855
> memory.stat:pgsteal 44933838
> 
> This tells us the overall reclaim effectiveness was 94%. Could you try to
> gather snapshots with a 1s granularity starting before you run your
> backup to see how those numbers evolve? Ideally with timestamps to
> compare with the actual stall information.
> 
> Another option would be to enable vmscan tracepoints but let's try with
> stats first.
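
One possible way to gather the requested snapshots is sketched below. The
function names and the cgroup path in the usage comment are hypothetical;
the only assumption about the kernel interface is that memory.stat
contains "pgscan" and "pgsteal" lines, as in the snapshot quoted above:

```shell
# Print one "epoch-seconds pgscan pgsteal" line from <cgroup-dir>/memory.stat.
memcg_reclaim_sample() {
    stat_file=$1/memory.stat
    awk -v now="$(date +%s)" '
        BEGIN           { scan = 0; steal = 0 }
        $1 == "pgscan"  { scan = $2 }
        $1 == "pgsteal" { steal = $2 }
        END             { printf "%s %s %s\n", now, scan, steal }
    ' "$stat_file"
}

# pgsteal as an integer percentage of pgscan (rounded down, 0 if nothing scanned).
reclaim_efficiency_pct() {
    pgscan=$1
    pgsteal=$2
    [ "$pgscan" -gt 0 ] || { echo 0; return; }
    echo $(( pgsteal * 100 / pgscan ))
}

# Sample once per second until interrupted (hypothetical path):
# while :; do memcg_reclaim_sample /sys/fs/cgroup/backup; sleep 1; done
```

For the counters quoted above, reclaim_efficiency_pct 47519855 44933838
gives 94, matching the 94% figure.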

