* [PATCH mm v2 0/9] memcg: accounting for objects allocated by mkdir cgroup
       [not found] <Yn6aL3cO7VdrmHHp@carbon>
@ 2022-05-21 16:37 ` Vasily Averin
  2022-05-30 11:25 ` [PATCH mm v3 " Vasily Averin
       [not found] ` <cover.1653899364.git.vvs@openvz.org>
  2022-05-21 16:37 ` [PATCH mm v2 1/9] memcg: enable accounting for struct cgroup Vasily Averin
                    ` (8 subsequent siblings)
  9 siblings, 2 replies; 65+ messages in thread
From: Vasily Averin @ 2022-05-21 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
      Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups

Below are the tracing results of mkdir /sys/fs/cgroup/vvs.test on a
4-CPU VM with Fedora and a self-compiled upstream kernel. The
calculations are not precise: they depend on kernel config options,
the number of CPUs, the enabled controllers, ignore possible page
allocations, etc. However, this is enough to clarify the general
situation. All allocations are split into:
- the common part, always executed for each cgroup type
- per-cgroup allocations

In each group we consider 2 corner cases:
- usual allocations, important for 1-2 CPU nodes/VMs
- percpu allocations, important for 'big irons'

common part:         ~11Kb  +  318 bytes percpu
memcg:               ~17Kb  + 4692 bytes percpu
cpu:                 ~2.5Kb + 1036 bytes percpu
cpuset:              ~3Kb   +   12 bytes percpu
blkcg:               ~3Kb   +   12 bytes percpu
pid:                 ~1.5Kb +   12 bytes percpu
perf:                ~320b  +   60 bytes percpu
-------------------------------------------
total:               ~38Kb  + 6142 bytes percpu
currently accounted:          4668 bytes percpu

- It is important to account the usual allocations made in the common
  part, because almost all cgroup-specific allocations are small. One
  exception here is the memory cgroup: it allocates a few huge objects
  that should be accounted.
- Percpu allocations made in the common part and in the memcg and cpu
  cgroups should be accounted; the rest are small and can be ignored.
- KERNFS objects are allocated both in the common part and in most
  cgroups.

Details can be found here:
https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/

I checked the other cgroup types and found that they can all be
ignored. Additionally, I found an allocation of struct rt_rq made in
the cpu cgroup when CONFIG_RT_GROUP_SCHED is enabled; it allocates a
huge (~1700 bytes) percpu structure and should be accounted too.

v2:
1) re-split to simplify possible bisect, re-ordered
2) added accounting for percpu psi_group_cpu and cgroup_rstat_cpu,
   allocated in the common part
3) added accounting for the percpu allocation of struct rt_rq
   (relevant if CONFIG_RT_GROUP_SCHED is enabled)
4) improved patch descriptions

Vasily Averin (9):
  memcg: enable accounting for struct cgroup
  memcg: enable accounting for kernfs nodes
  memcg: enable accounting for kernfs iattrs
  memcg: enable accounting for struct simple_xattr
  memcg: enable accounting for percpu allocation of struct psi_group_cpu
  memcg: enable accounting for percpu allocation of struct
    cgroup_rstat_cpu
  memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  memcg: enable accounting for allocations in alloc_fair_sched_group
  memcg: enable accounting for percpu allocation of struct rt_rq

 fs/kernfs/mount.c      | 6 ++++--
 fs/xattr.c             | 2 +-
 kernel/cgroup/cgroup.c | 2 +-
 kernel/cgroup/rstat.c  | 3 ++-
 kernel/sched/fair.c    | 4 ++--
 kernel/sched/psi.c     | 3 ++-
 kernel/sched/rt.c      | 2 +-
 mm/memcontrol.c        | 4 ++--
 8 files changed, 15 insertions(+), 11 deletions(-)

-- 
2.36.1

^ permalink raw reply	[flat|nested] 65+ messages in thread
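The mechanism behind all nine patches is the kernel's standard kmem
accounting flag. A minimal sketch of the pattern (the function below is
illustrative, not an exact hunk from the series; the real changes touch
the files in the diffstat above):

#include <linux/slab.h>

/*
 * Sketch of the pattern applied across this series: __GFP_ACCOUNT
 * (folded into GFP_KERNEL_ACCOUNT) charges the allocation to the
 * memory cgroup of the task performing the mkdir, so the memory
 * shows up against the creator's memcg limit.
 */
static void *example_cgroup_alloc(size_t size)
{
	/* before the series: kzalloc(size, GFP_KERNEL), charged to nobody */
	return kzalloc(size, GFP_KERNEL_ACCOUNT);
}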
* [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-21 16:37 ` [PATCH mm v2 0/9] memcg: accounting for objects allocated by mkdir cgroup Vasily Averin
@ 2022-05-30 11:25 ` Vasily Averin
  2022-05-30 11:55   ` Michal Hocko
                      ` (4 more replies)
       [not found] ` <cover.1653899364.git.vvs@openvz.org>
  1 sibling, 5 replies; 65+ messages in thread
From: Vasily Averin @ 2022-05-30 11:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
      Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Below are the tracing results of mkdir /sys/fs/cgroup/vvs.test on a
4-CPU VM with Fedora and a self-compiled upstream kernel. The
calculations are not precise: they depend on kernel config options,
the number of CPUs, the enabled controllers, ignore possible page
allocations, etc. However, this is enough to clarify the general
situation. All allocations are split into:
- the common part, always executed for each cgroup type
- per-cgroup allocations

In each group we consider 2 corner cases:
- usual allocations, important for 1-2 CPU nodes/VMs
- percpu allocations, important for 'big irons'

common part:         ~11Kb  +  318 bytes percpu
memcg:               ~17Kb  + 4692 bytes percpu
cpu:                 ~2.5Kb + 1036 bytes percpu
cpuset:              ~3Kb   +   12 bytes percpu
blkcg:               ~3Kb   +   12 bytes percpu
pid:                 ~1.5Kb +   12 bytes percpu
perf:                ~320b  +   60 bytes percpu
-------------------------------------------
total:               ~38Kb  + 6142 bytes percpu
currently accounted:          4668 bytes percpu

- It is important to account the usual allocations made in the common
  part, because almost all cgroup-specific allocations are small. One
  exception here is the memory cgroup: it allocates a few huge objects
  that should be accounted.
- Percpu allocations made in the common part and in the memcg and cpu
  cgroups should be accounted; the rest are small and can be ignored.
- KERNFS objects are allocated both in the common part and in most
  cgroups.

Details can be found here:
https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/

I checked the other cgroup types and found that they can all be
ignored. Additionally, I found an allocation of struct rt_rq made in
the cpu cgroup when CONFIG_RT_GROUP_SCHED is enabled; it allocates a
huge (~1700 bytes) percpu structure and should be accounted too.

v3:
1) rebased to current upstream (v5.18-11267-gb00ed48bb0a7)
2) fixed a few typos
3) added received approvals

v2:
1) re-split to simplify possible bisect, re-ordered
2) added accounting for percpu psi_group_cpu and cgroup_rstat_cpu,
   allocated in the common part
3) added accounting for the percpu allocation of struct rt_rq
   (relevant if CONFIG_RT_GROUP_SCHED is enabled)
4) improved patch descriptions

Vasily Averin (9):
  memcg: enable accounting for struct cgroup
  memcg: enable accounting for kernfs nodes
  memcg: enable accounting for kernfs iattrs
  memcg: enable accounting for struct simple_xattr
  memcg: enable accounting for percpu allocation of struct psi_group_cpu
  memcg: enable accounting for percpu allocation of struct
    cgroup_rstat_cpu
  memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  memcg: enable accounting for allocations in alloc_fair_sched_group
  memcg: enable accounting for percpu allocation of struct rt_rq

 fs/kernfs/mount.c      | 6 ++++--
 fs/xattr.c             | 2 +-
 kernel/cgroup/cgroup.c | 2 +-
 kernel/cgroup/rstat.c  | 3 ++-
 kernel/sched/fair.c    | 4 ++--
 kernel/sched/psi.c     | 3 ++-
 kernel/sched/rt.c      | 2 +-
 mm/memcontrol.c        | 4 ++--
 8 files changed, 15 insertions(+), 11 deletions(-)

-- 
2.36.1

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-30 11:25 ` [PATCH mm v3 " Vasily Averin
@ 2022-05-30 11:55   ` Michal Hocko
  2022-05-30 13:09     ` Vasily Averin
  2022-06-13  5:34   ` [PATCH mm v4 " Vasily Averin
                      ` (3 subsequent siblings)
  4 siblings, 1 reply; 65+ messages in thread
From: Michal Hocko @ 2022-05-30 11:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
      Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, cgroups

On Mon 30-05-22 14:25:45, Vasily Averin wrote:
> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on
> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
> are not precise, it depends on kernel config options, number of cpus,
> enabled controllers, ignores possible page allocations etc.
> However this is enough to clarify the general situation.
> All allocations are splited into:
> - common part, always called for each cgroup type
> - per-cgroup allocations
>
> In each group we consider 2 corner cases:
> - usual allocations, important for 1-2 CPU nodes/Vms
> - percpu allocations, important for 'big irons'
>
> common part: ~11Kb + 318 bytes percpu
> memcg: ~17Kb + 4692 bytes percpu
> cpu: ~2.5Kb + 1036 bytes percpu
> cpuset: ~3Kb + 12 bytes percpu
> blkcg: ~3Kb + 12 bytes percpu
> pid: ~1.5Kb + 12 bytes percpu
> perf: ~320b + 60 bytes percpu
> -------------------------------------------
> total: ~38Kb + 6142 bytes percpu
> currently accounted: 4668 bytes percpu
>
> - it's important to account usual allocations called
> in common part, because almost all of cgroup-specific allocations
> are small. One exception here is memory cgroup, it allocates a few
> huge objects that should be accounted.
> - Percpu allocation called in common part, in memcg and cpu cgroups
> should be accounted, rest ones are small an can be ignored.
> - KERNFS objects are allocated both in common part and in most of
> cgroups
>
> Details can be found here:
> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
>
> I checked other cgroups types was found that they all can be ignored.
> Additionally I found allocation of struct rt_rq called in cpu cgroup
> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
> percpu structure and should be accounted too.

One thing the changelog is missing is an explanation of why we need
to account those objects. Users are usually not empowered to create
cgroups arbitrarily. Or at least they shouldn't be, because we can
expect more problems to happen.

Could you clarify this please?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-30 11:55   ` Michal Hocko
@ 2022-05-30 13:09     ` Vasily Averin
  2022-05-30 14:22       ` Michal Hocko
  0 siblings, 1 reply; 65+ messages in thread
From: Vasily Averin @ 2022-05-30 13:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
      Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, cgroups

On 5/30/22 14:55, Michal Hocko wrote:
> On Mon 30-05-22 14:25:45, Vasily Averin wrote:
>> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on
>> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
>> are not precise, it depends on kernel config options, number of cpus,
>> enabled controllers, ignores possible page allocations etc.
>> However this is enough to clarify the general situation.
>> All allocations are splited into:
>> - common part, always called for each cgroup type
>> - per-cgroup allocations
>>
>> In each group we consider 2 corner cases:
>> - usual allocations, important for 1-2 CPU nodes/Vms
>> - percpu allocations, important for 'big irons'
>>
>> common part: ~11Kb + 318 bytes percpu
>> memcg: ~17Kb + 4692 bytes percpu
>> cpu: ~2.5Kb + 1036 bytes percpu
>> cpuset: ~3Kb + 12 bytes percpu
>> blkcg: ~3Kb + 12 bytes percpu
>> pid: ~1.5Kb + 12 bytes percpu
>> perf: ~320b + 60 bytes percpu
>> -------------------------------------------
>> total: ~38Kb + 6142 bytes percpu
>> currently accounted: 4668 bytes percpu
>>
>> - it's important to account usual allocations called
>> in common part, because almost all of cgroup-specific allocations
>> are small. One exception here is memory cgroup, it allocates a few
>> huge objects that should be accounted.
>> - Percpu allocation called in common part, in memcg and cpu cgroups
>> should be accounted, rest ones are small an can be ignored.
>> - KERNFS objects are allocated both in common part and in most of
>> cgroups
>>
>> Details can be found here:
>> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
>>
>> I checked other cgroups types was found that they all can be ignored.
>> Additionally I found allocation of struct rt_rq called in cpu cgroup
>> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
>> percpu structure and should be accounted too.
>
> One thing that the changelog is missing is an explanation why do we need
> to account those objects. Users are usually not empowered to create
> cgroups arbitrarily. Or at least they shouldn't because we can expect
> more problems to happen.
>
> Could you clarify this please?

The problem is real for OS-level containers: LXC or OpenVz.
They are widely used for hosting and allow containers to be run
by untrusted end users. Root inside such a container is able
to create cgroups inside its own container and consume host memory
without proper accounting.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-30 13:09     ` Vasily Averin
@ 2022-05-30 14:22       ` Michal Hocko
  2022-05-30 19:58         ` Vasily Averin
  0 siblings, 1 reply; 65+ messages in thread
From: Michal Hocko @ 2022-05-30 14:22 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
      Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, cgroups

On Mon 30-05-22 16:09:00, Vasily Averin wrote:
> On 5/30/22 14:55, Michal Hocko wrote:
> > On Mon 30-05-22 14:25:45, Vasily Averin wrote:
> >> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on
> >> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
> >> are not precise, it depends on kernel config options, number of cpus,
> >> enabled controllers, ignores possible page allocations etc.
> >> However this is enough to clarify the general situation.
> >> All allocations are splited into:
> >> - common part, always called for each cgroup type
> >> - per-cgroup allocations
> >>
> >> In each group we consider 2 corner cases:
> >> - usual allocations, important for 1-2 CPU nodes/Vms
> >> - percpu allocations, important for 'big irons'
> >>
> >> common part: ~11Kb + 318 bytes percpu
> >> memcg: ~17Kb + 4692 bytes percpu
> >> cpu: ~2.5Kb + 1036 bytes percpu
> >> cpuset: ~3Kb + 12 bytes percpu
> >> blkcg: ~3Kb + 12 bytes percpu
> >> pid: ~1.5Kb + 12 bytes percpu
> >> perf: ~320b + 60 bytes percpu
> >> -------------------------------------------
> >> total: ~38Kb + 6142 bytes percpu
> >> currently accounted: 4668 bytes percpu
> >>
> >> - it's important to account usual allocations called
> >> in common part, because almost all of cgroup-specific allocations
> >> are small. One exception here is memory cgroup, it allocates a few
> >> huge objects that should be accounted.
> >> - Percpu allocation called in common part, in memcg and cpu cgroups
> >> should be accounted, rest ones are small an can be ignored.
> >> - KERNFS objects are allocated both in common part and in most of
> >> cgroups
> >>
> >> Details can be found here:
> >> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
> >>
> >> I checked other cgroups types was found that they all can be ignored.
> >> Additionally I found allocation of struct rt_rq called in cpu cgroup
> >> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
> >> percpu structure and should be accounted too.
> >
> > One thing that the changelog is missing is an explanation why do we need
> > to account those objects. Users are usually not empowered to create
> > cgroups arbitrarily. Or at least they shouldn't because we can expect
> > more problems to happen.
> >
> > Could you clarify this please?
>
> The problem is actual for OS-level containers: LXC or OpenVz.
> They are widely used for hosting and allow to run containers
> by untrusted end-users. Root inside such containers is able
> to create groups inside own container and consume host memory
> without its proper accounting.

Is the unaccounted memory really the biggest problem here?
IIRC having really huge cgroup trees can hurt quite a few controllers.
E.g. how does the cpu controller deal with too many or too deep
hierarchies?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-30 14:22       ` Michal Hocko
@ 2022-05-30 19:58         ` Vasily Averin
  2022-05-31  7:16           ` Michal Hocko
  0 siblings, 1 reply; 65+ messages in thread
From: Vasily Averin @ 2022-05-30 19:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
      Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, cgroups

On 5/30/22 17:22, Michal Hocko wrote:
> On Mon 30-05-22 16:09:00, Vasily Averin wrote:
>> On 5/30/22 14:55, Michal Hocko wrote:
>>> On Mon 30-05-22 14:25:45, Vasily Averin wrote:
>>>> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on
>>>> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
>>>> are not precise, it depends on kernel config options, number of cpus,
>>>> enabled controllers, ignores possible page allocations etc.
>>>> However this is enough to clarify the general situation.
>>>> All allocations are splited into:
>>>> - common part, always called for each cgroup type
>>>> - per-cgroup allocations
>>>>
>>>> In each group we consider 2 corner cases:
>>>> - usual allocations, important for 1-2 CPU nodes/Vms
>>>> - percpu allocations, important for 'big irons'
>>>>
>>>> common part: ~11Kb + 318 bytes percpu
>>>> memcg: ~17Kb + 4692 bytes percpu
>>>> cpu: ~2.5Kb + 1036 bytes percpu
>>>> cpuset: ~3Kb + 12 bytes percpu
>>>> blkcg: ~3Kb + 12 bytes percpu
>>>> pid: ~1.5Kb + 12 bytes percpu
>>>> perf: ~320b + 60 bytes percpu
>>>> -------------------------------------------
>>>> total: ~38Kb + 6142 bytes percpu
>>>> currently accounted: 4668 bytes percpu
>>>>
>>>> - it's important to account usual allocations called
>>>> in common part, because almost all of cgroup-specific allocations
>>>> are small. One exception here is memory cgroup, it allocates a few
>>>> huge objects that should be accounted.
>>>> - Percpu allocation called in common part, in memcg and cpu cgroups
>>>> should be accounted, rest ones are small an can be ignored.
>>>> - KERNFS objects are allocated both in common part and in most of
>>>> cgroups
>>>>
>>>> Details can be found here:
>>>> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
>>>>
>>>> I checked other cgroups types was found that they all can be ignored.
>>>> Additionally I found allocation of struct rt_rq called in cpu cgroup
>>>> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
>>>> percpu structure and should be accounted too.
>>>
>>> One thing that the changelog is missing is an explanation why do we need
>>> to account those objects. Users are usually not empowered to create
>>> cgroups arbitrarily. Or at least they shouldn't because we can expect
>>> more problems to happen.
>>>
>>> Could you clarify this please?
>>
>> The problem is actual for OS-level containers: LXC or OpenVz.
>> They are widely used for hosting and allow to run containers
>> by untrusted end-users. Root inside such containers is able
>> to create groups inside own container and consume host memory
>> without its proper accounting.
>
> Is the unaccounted memory really the biggest problem here?
> IIRC having really huge cgroup trees can hurt quite some controllers.
> E.g. how does the cpu controller deal with too many or too deep
> hierarchies?

Could you please describe it in more detail?
Maybe it passed me by, or maybe I missed or forgot something,
but I cannot recall any other practical cgroup-related issues.

Maybe deep hierarchies do not work well; however, I have not heard
that the internal cgroup configuration of a container can affect the
upper level too. Please let me know if this can happen, as it is very
interesting for us.

In our case, the hoster configures only the top level of the cgroup
hierarchy and does not worry about possible misconfiguration inside
containers as long as it does not affect other containers or the host
itself. Unaccounted memory, by contrast, can affect both neighboring
containers and the host system; we have seen it many times, and
therefore we pay special attention to such issues.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-30 19:58         ` Vasily Averin
@ 2022-05-31  7:16           ` Michal Hocko
  2022-06-01  3:43             ` Vasily Averin
  0 siblings, 1 reply; 65+ messages in thread
From: Michal Hocko @ 2022-05-31 7:16 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
      Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, cgroups

On Mon 30-05-22 22:58:30, Vasily Averin wrote:
> On 5/30/22 17:22, Michal Hocko wrote:
> > On Mon 30-05-22 16:09:00, Vasily Averin wrote:
> >> On 5/30/22 14:55, Michal Hocko wrote:
> >>> On Mon 30-05-22 14:25:45, Vasily Averin wrote:
> >>>> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on
> >>>> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
> >>>> are not precise, it depends on kernel config options, number of cpus,
> >>>> enabled controllers, ignores possible page allocations etc.
> >>>> However this is enough to clarify the general situation.
> >>>> All allocations are splited into:
> >>>> - common part, always called for each cgroup type
> >>>> - per-cgroup allocations
> >>>>
> >>>> In each group we consider 2 corner cases:
> >>>> - usual allocations, important for 1-2 CPU nodes/Vms
> >>>> - percpu allocations, important for 'big irons'
> >>>>
> >>>> common part: ~11Kb + 318 bytes percpu
> >>>> memcg: ~17Kb + 4692 bytes percpu
> >>>> cpu: ~2.5Kb + 1036 bytes percpu
> >>>> cpuset: ~3Kb + 12 bytes percpu
> >>>> blkcg: ~3Kb + 12 bytes percpu
> >>>> pid: ~1.5Kb + 12 bytes percpu
> >>>> perf: ~320b + 60 bytes percpu
> >>>> -------------------------------------------
> >>>> total: ~38Kb + 6142 bytes percpu
> >>>> currently accounted: 4668 bytes percpu
> >>>>
> >>>> - it's important to account usual allocations called
> >>>> in common part, because almost all of cgroup-specific allocations
> >>>> are small. One exception here is memory cgroup, it allocates a few
> >>>> huge objects that should be accounted.
> >>>> - Percpu allocation called in common part, in memcg and cpu cgroups
> >>>> should be accounted, rest ones are small an can be ignored.
> >>>> - KERNFS objects are allocated both in common part and in most of
> >>>> cgroups
> >>>>
> >>>> Details can be found here:
> >>>> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
> >>>>
> >>>> I checked other cgroups types was found that they all can be ignored.
> >>>> Additionally I found allocation of struct rt_rq called in cpu cgroup
> >>>> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
> >>>> percpu structure and should be accounted too.
> >>>
> >>> One thing that the changelog is missing is an explanation why do we need
> >>> to account those objects. Users are usually not empowered to create
> >>> cgroups arbitrarily. Or at least they shouldn't because we can expect
> >>> more problems to happen.
> >>>
> >>> Could you clarify this please?
> >>
> >> The problem is actual for OS-level containers: LXC or OpenVz.
> >> They are widely used for hosting and allow to run containers
> >> by untrusted end-users. Root inside such containers is able
> >> to create groups inside own container and consume host memory
> >> without its proper accounting.
> >
> > Is the unaccounted memory really the biggest problem here?
> > IIRC having really huge cgroup trees can hurt quite some controllers.
> > E.g. how does the cpu controller deal with too many or too deep
> > hierarchies?
>
> Could you please describe it in more details?
> Maybe it was passed me by, maybe I messed or forgot something,
> however I cannot remember any other practical cgroup-related issues.
>
> Maybe deep hierarchies does not work well.
> however, I have not heard that the internal configuration of cgroup
> can affect the upper level too.

My first thought was any controller with fixed math constraints, like
the cpu controller. But I have to admit that I haven't really checked
whether imprecision can accumulate and propagate outside of the
hierarchy.

Another concern I would have is id space depletion. At least the memory
controller depends on idr ids, which have a rather limited space:
#define MEM_CGROUP_ID_MAX	USHRT_MAX

Also the runtime overhead would increase with a large number of cgroups.
Take a global memory reclaim as an example. All the cgroups have to be
iterated. This will have an impact outside of the said hierarchy. One
could argue that limiting untrusted top-level cgroups would be a certain
mitigation, but I can imagine this could easily get very non-trivial.

Anyway, let me just be explicit. I am not against these patches. In fact
I cannot really judge their overhead. But right now I am not really sure
they are going to help much against untrusted users.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread
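For reference, a hedged sketch of the id allocation Michal refers to,
modeled on mm/memcontrol.c of that era (the define matches the one he
quotes; the surrounding code is simplified for illustration):

#include <linux/idr.h>
#include <linux/kernel.h>

#define MEM_CGROUP_ID_MAX	USHRT_MAX

static DEFINE_IDR(mem_cgroup_idr);

/* ids come from a 16-bit space shared by live and dying memcgs; once
 * it is depleted, creating another memory cgroup fails regardless of
 * how much memory is available */
static int example_memcg_id_alloc(void)
{
	return idr_alloc(&mem_cgroup_idr, NULL, 1,
			 MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
}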
* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-31  7:16           ` Michal Hocko
@ 2022-06-01  3:43             ` Vasily Averin
  2022-06-01  9:15               ` Michal Koutný
  2022-06-01  9:26               ` Michal Hocko
  0 siblings, 2 replies; 65+ messages in thread
From: Vasily Averin @ 2022-06-01 3:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
      Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, cgroups

On 5/31/22 10:16, Michal Hocko wrote:
> On Mon 30-05-22 22:58:30, Vasily Averin wrote:
>> On 5/30/22 17:22, Michal Hocko wrote:
>>> On Mon 30-05-22 16:09:00, Vasily Averin wrote:
>>>> On 5/30/22 14:55, Michal Hocko wrote:
>>>>> On Mon 30-05-22 14:25:45, Vasily Averin wrote:
>>>>>> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on
>>>>>> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
>>>>>> are not precise, it depends on kernel config options, number of cpus,
>>>>>> enabled controllers, ignores possible page allocations etc.
>>>>>> However this is enough to clarify the general situation.
>>>>>> All allocations are splited into:
>>>>>> - common part, always called for each cgroup type
>>>>>> - per-cgroup allocations
>>>>>>
>>>>>> In each group we consider 2 corner cases:
>>>>>> - usual allocations, important for 1-2 CPU nodes/Vms
>>>>>> - percpu allocations, important for 'big irons'
>>>>>>
>>>>>> common part: ~11Kb + 318 bytes percpu
>>>>>> memcg: ~17Kb + 4692 bytes percpu
>>>>>> cpu: ~2.5Kb + 1036 bytes percpu
>>>>>> cpuset: ~3Kb + 12 bytes percpu
>>>>>> blkcg: ~3Kb + 12 bytes percpu
>>>>>> pid: ~1.5Kb + 12 bytes percpu
>>>>>> perf: ~320b + 60 bytes percpu
>>>>>> -------------------------------------------
>>>>>> total: ~38Kb + 6142 bytes percpu
>>>>>> currently accounted: 4668 bytes percpu
>>>>>>
>>>>>> - it's important to account usual allocations called
>>>>>> in common part, because almost all of cgroup-specific allocations
>>>>>> are small. One exception here is memory cgroup, it allocates a few
>>>>>> huge objects that should be accounted.
>>>>>> - Percpu allocation called in common part, in memcg and cpu cgroups
>>>>>> should be accounted, rest ones are small an can be ignored.
>>>>>> - KERNFS objects are allocated both in common part and in most of
>>>>>> cgroups
>>>>>>
>>>>>> Details can be found here:
>>>>>> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
>>>>>>
>>>>>> I checked other cgroups types was found that they all can be ignored.
>>>>>> Additionally I found allocation of struct rt_rq called in cpu cgroup
>>>>>> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
>>>>>> percpu structure and should be accounted too.
>>>>>
>>>>> One thing that the changelog is missing is an explanation why do we need
>>>>> to account those objects. Users are usually not empowered to create
>>>>> cgroups arbitrarily. Or at least they shouldn't because we can expect
>>>>> more problems to happen.
>>>>>
>>>>> Could you clarify this please?
>>>>
>>>> The problem is actual for OS-level containers: LXC or OpenVz.
>>>> They are widely used for hosting and allow to run containers
>>>> by untrusted end-users. Root inside such containers is able
>>>> to create groups inside own container and consume host memory
>>>> without its proper accounting.
>>>
>>> Is the unaccounted memory really the biggest problem here?
>>> IIRC having really huge cgroup trees can hurt quite some controllers.
>>> E.g. how does the cpu controller deal with too many or too deep
>>> hierarchies?
>>
>> Could you please describe it in more details?
>> Maybe it was passed me by, maybe I messed or forgot something,
>> however I cannot remember any other practical cgroup-related issues.
>>
>> Maybe deep hierarchies does not work well.
>> however, I have not heard that the internal configuration of cgroup
>> can affect the upper level too.
>
> My first thought was any controller with a fixed math constrains like
> cpu controller. But I have to admit that I haven't really checked
> whether imprecision can accumulate and propagate outside of the
> hierarchy.
>
> Another concern I would have is a id space depletion. At least memory
> controller depends on idr ids which have a space that is rather limited
> #define MEM_CGROUP_ID_MAX USHRT_MAX
>
> Also the runtime overhead would increase with a large number of cgroups.
> Take a global memory reclaim as an example. All the cgroups have to be
> iterated. This will have an impact outside of the said hierarchy. One
> could argue that limiting untrusted top level cgroups would be a certain
> mitigation but I can imagine this could get very non trivial easily.
>
> Anyway, let me just be explicit. I am not against these patches. In fact
> I cannot really judge their overhead. But right now I am not really sure
> they are going to help much against untrusted users.

Thank you very much, this information is very valuable for us.

I understand your skepticism; the problem looks critical for
upstream-based LXC, and I don't yet see how to protect against it
properly. However, it isn't critical for OpenVz: our kernel does not
allow changing cgroup.subgroups_limit from inside containers.

CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit
512
CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
-bash: echo: write error: Operation not permitted
CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
-bash: echo: write error: Operation not permitted

I doubt this approach can be accepted upstream; however, for OpenVz
something like this is mandatory because it is much better than
nothing.

The number can be adjusted by the host admin. The current default
limit looks too small to me; however, it is not difficult to increase
it to a more reasonable 10,000.

My experiments show that ~10000 cgroups consume 0.5 Gb of memory on a
4-CPU VM. On 'big irons' it can easily grow to several Gb. This is too
much to leave unaccounted.

I agree that highly qualified people like you can find many other ways
to abuse the system anyway. However, OpenVz is trying to prevent this
somehow; not in upstream, unfortunately, but at least in our own
kernel.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-06-01  3:43             ` Vasily Averin
@ 2022-06-01  9:15               ` Michal Koutný
  2022-06-01  9:32                 ` Michal Hocko
  1 sibling, 1 reply; 65+ messages in thread
From: Michal Koutný @ 2022-06-01 9:15 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Hocko, Andrew Morton, kernel, linux-kernel, linux-mm,
      Shakeel Butt, Roman Gushchin, Vlastimil Babka, Muchun Song, cgroups

On Wed, Jun 01, 2022 at 06:43:27AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit
> 512
> CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
> -bash: echo: write error: Operation not permitted
> CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
> -bash: echo: write error: Operation not permitted
>
> I doubt this way can be accepted in upstream, however for OpenVz
> something like this it is mandatory because it much better
> than nothing.

Is this customization of yours something like cgroup.max.descendants on
the unified (v2) hierarchy? (Just curious.)

(It can be made inaccessible from within the subtree either with cgroup
ns or good old FS permissions.)

Michal

^ permalink raw reply	[flat|nested] 65+ messages in thread
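For context, a small userspace sketch of driving the upstream v2 knob
he mentions (the cgroup path is illustrative; as noted above, writes
from inside the subtree can be blocked by delegation or plain FS
permissions, and the -EAGAIN behaviour on overflow is described by
Roman later in the thread):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* path is an assumption; adjust to the actual v2 mount point */
	const char *knob = "/sys/fs/cgroup/ct901/cgroup.max.descendants";
	int fd = open(knob, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* once 512 live descendants exist, further mkdir calls in the
	 * subtree fail */
	if (write(fd, "512", 3) < 0)
		perror("write");
	close(fd);
	return 0;
}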
* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-06-01  9:15               ` Michal Koutný
@ 2022-06-01  9:32                 ` Michal Hocko
  2022-06-01 13:05                   ` Michal Hocko
  0 siblings, 1 reply; 65+ messages in thread
From: Michal Hocko @ 2022-06-01 9:32 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Vasily Averin, Andrew Morton, kernel, linux-kernel, linux-mm,
      Shakeel Butt, Roman Gushchin, Vlastimil Babka, Muchun Song, cgroups

On Wed 01-06-22 11:15:43, Michal Koutny wrote:
> On Wed, Jun 01, 2022 at 06:43:27AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> > CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit
> > 512
> > CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
> > -bash: echo: write error: Operation not permitted
> > CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
> > -bash: echo: write error: Operation not permitted
> >
> > I doubt this way can be accepted in upstream, however for OpenVz
> > something like this it is mandatory because it much better
> > than nothing.
>
> Is this customization of yours something like cgroup.max.descendants on
> the unified (v2) hierarchy? (Just curious.)
>
> (It can be made inaccessible from within the subtree either with cgroup
> ns or good old FS permissions.)

So we already do have a limit to prevent somebody from running away with
the number of cgroups. Nice! I was not aware of that, and I guess this
looks like the right thing to do. So do we need more control and
accounting than this?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-06-01  9:32                 ` Michal Hocko
@ 2022-06-01 13:05                   ` Michal Hocko
  2022-06-01 14:22                     ` Roman Gushchin
  0 siblings, 1 reply; 65+ messages in thread
From: Michal Hocko @ 2022-06-01 13:05 UTC (permalink / raw)
  To: Michal Koutný, Roman Gushchin
  Cc: Vasily Averin, Andrew Morton, kernel, linux-kernel, linux-mm,
      Shakeel Butt, Vlastimil Babka, Muchun Song, cgroups

On Wed 01-06-22 11:32:26, Michal Hocko wrote:
> On Wed 01-06-22 11:15:43, Michal Koutny wrote:
> > On Wed, Jun 01, 2022 at 06:43:27AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> > > CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit
> > > 512
> > > CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
> > > -bash: echo: write error: Operation not permitted
> > > CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
> > > -bash: echo: write error: Operation not permitted
> > >
> > > I doubt this way can be accepted in upstream, however for OpenVz
> > > something like this it is mandatory because it much better
> > > than nothing.
> >
> > Is this customization of yours something like cgroup.max.descendants on
> > the unified (v2) hierarchy? (Just curious.)
> >
> > (It can be made inaccessible from within the subtree either with cgroup
> > ns or good old FS permissions.)
>
> So we already do have a limit to prevent somebody from running away with
> the number of cgroups. Nice! I was not aware of that and I guess this
> looks like the right thing to do. So do we need more control and
> accounting that this?

I have checked the actual implementation and noticed that cgroups are
uncharged when offlined (rmdir-ed), which means that an adversary could
still evade the limit and run away while still consuming resources.

Roman, I guess the reason for this implementation was to avoid the limit
triggering on setups with memcgs that can take quite some time to die?
Would it make sense to make the implementation more strict, so it really
acts as a gate against potential cgroup count runaways?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread
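A simplified, hedged sketch of the check being discussed, modeled on
cgroup_check_hierarchy_limits() in kernel/cgroup/cgroup.c (the real
code differs in detail): dying cgroups are tracked in a separate
nr_dying_descendants counter and do not count against max_descendants,
which is exactly the loophole Michal points out.

#include <linux/cgroup.h>

static bool example_check_hierarchy_limits(struct cgroup *parent)
{
	struct cgroup *cgrp;
	int level = 1;

	for (cgrp = parent; cgrp; cgrp = cgroup_parent(cgrp), level++) {
		/* rmdir decrements nr_descendants immediately, even
		 * though the dying cgroup may keep pinning memory */
		if (cgrp->nr_descendants >= cgrp->max_descendants)
			return false;
		if (level > cgrp->max_depth)
			return false;
	}
	return true;
}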
* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-06-01 13:05                   ` Michal Hocko
@ 2022-06-01 14:22                     ` Roman Gushchin
  2022-06-01 15:24                       ` Michal Hocko
  0 siblings, 1 reply; 65+ messages in thread
From: Roman Gushchin @ 2022-06-01 14:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michal Koutný, Vasily Averin, Andrew Morton, kernel,
      linux-kernel, linux-mm, Shakeel Butt, Vlastimil Babka,
      Muchun Song, cgroups

On Wed, Jun 01, 2022 at 03:05:34PM +0200, Michal Hocko wrote:
> On Wed 01-06-22 11:32:26, Michal Hocko wrote:
> > On Wed 01-06-22 11:15:43, Michal Koutny wrote:
> > > On Wed, Jun 01, 2022 at 06:43:27AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> > > > CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit
> > > > 512
> > > > CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
> > > > -bash: echo: write error: Operation not permitted
> > > > CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
> > > > -bash: echo: write error: Operation not permitted
> > > >
> > > > I doubt this way can be accepted in upstream, however for OpenVz
> > > > something like this it is mandatory because it much better
> > > > than nothing.
> > >
> > > Is this customization of yours something like cgroup.max.descendants on
> > > the unified (v2) hierarchy? (Just curious.)
> > >
> > > (It can be made inaccessible from within the subtree either with cgroup
> > > ns or good old FS permissions.)
> >
> > So we already do have a limit to prevent somebody from running away with
> > the number of cgroups. Nice!

Yes, we do!

> > I was not aware of that and I guess this
> > looks like the right thing to do. So do we need more control and
> > accounting that this?
>
> I have checked the actual implementation and noticed that cgroups are
> uncharged when offlined (rmdir-ed) which means that an adversary could
> still trick the limit and runaway while still consuming resources.
>
> Roman, I guess the reason for this implementation was to avoid limit to
> trigger on setups with memcgs which can take quite some time to die?
> Would it make sense to make the implementation more strict to really act
> as gate against potential cgroups count runways?

The reasoning was that in many cases a user can't do much about dying
cgroups, so it's not clear how they should/would handle getting -EAGAIN
on creating a new cgroup (retrying will not help, obviously). Live
cgroups can be easily deleted; dying cgroups, not always.

I'm not sure about switching the semantics. I'd wait till Muchun's LRU
page reparenting lands (could be within 1-2 releases, I guess) and then
we can check whether the whole problem is mostly gone. Honestly, I think
we might need to fix a few other things, but it might be not that hard
(in comparison to what we already did).

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-06-01 14:22                     ` Roman Gushchin
@ 2022-06-01 15:24                       ` Michal Hocko
  0 siblings, 0 replies; 65+ messages in thread
From: Michal Hocko @ 2022-06-01 15:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Koutný, Vasily Averin, Andrew Morton, kernel,
      linux-kernel, linux-mm, Shakeel Butt, Vlastimil Babka,
      Muchun Song, cgroups

On Wed 01-06-22 07:22:05, Roman Gushchin wrote:
> On Wed, Jun 01, 2022 at 03:05:34PM +0200, Michal Hocko wrote:
> > On Wed 01-06-22 11:32:26, Michal Hocko wrote:
> > > On Wed 01-06-22 11:15:43, Michal Koutny wrote:
> > > > On Wed, Jun 01, 2022 at 06:43:27AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> > > > > CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit
> > > > > 512
> > > > > CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
> > > > > -bash: echo: write error: Operation not permitted
> > > > > CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
> > > > > -bash: echo: write error: Operation not permitted
> > > > >
> > > > > I doubt this way can be accepted in upstream, however for OpenVz
> > > > > something like this it is mandatory because it much better
> > > > > than nothing.
> > > >
> > > > Is this customization of yours something like cgroup.max.descendants on
> > > > the unified (v2) hierarchy? (Just curious.)
> > > >
> > > > (It can be made inaccessible from within the subtree either with cgroup
> > > > ns or good old FS permissions.)
> > >
> > > So we already do have a limit to prevent somebody from running away with
> > > the number of cgroups. Nice!
>
> Yes, we do!
>
> > > I was not aware of that and I guess this
> > > looks like the right thing to do. So do we need more control and
> > > accounting that this?
> >
> > I have checked the actual implementation and noticed that cgroups are
> > uncharged when offlined (rmdir-ed) which means that an adversary could
> > still trick the limit and runaway while still consuming resources.
> >
> > Roman, I guess the reason for this implementation was to avoid limit to
> > trigger on setups with memcgs which can take quite some time to die?
> > Would it make sense to make the implementation more strict to really act
> > as gate against potential cgroups count runways?
>
> The reasoning was that in many cases a user can't do much about dying cgroups,
> so it's not clear how they should/would handle getting -EAGAIN on creating a
> new cgroup (retrying will not help, obviously). Live cgroups can be easily
> deleted, dying cgroups - not always.
>
> I'm not sure about switching the semantics. I'd wait till Muchun's lru page
> reparenting will be landed (could be within 1-2 releases, I guess) and then we
> can check whether the whole problem is mostly gone. Honestly, I think we might
> need to fix few another things, but it might be not that hard (in comparison
> to what we already did).

OK, thanks for the confirmation! Say we end up mitigating the
long-standing too-easy-to-linger memcgs issue. Do we still need the
extended cgroup data structure accounting?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-06-01  3:43             ` Vasily Averin
  2022-06-01  9:15               ` Michal Koutný
@ 2022-06-01  9:26               ` Michal Hocko
  1 sibling, 0 replies; 65+ messages in thread
From: Michal Hocko @ 2022-06-01 9:26 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
      Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, cgroups

On Wed 01-06-22 06:43:27, Vasily Averin wrote:
[...]
> However, it isn't critical for OpenVz. Our kernel does not allow
> to change of cgroup.subgroups_limit from inside containers.

What are the semantics of this limit?

> CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit
> 512
> CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
> -bash: echo: write error: Operation not permitted
> CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit
> -bash: echo: write error: Operation not permitted
>
> I doubt this way can be accepted in upstream, however for OpenVz
> something like this it is mandatory because it much better
> than nothing.
>
> The number can be adjusted by host admin. The current default limit
> looks too small for me, however it is not difficult to increase it
> to a reasonable 10,000.
>
> My experiments show that ~10000 cgroups consumes 0.5 Gb memory on 4cpu VM.
> On "big irons" it can easily grow up to several Gb. This is quite a lot
> to ignore its accounting.

Too many cgroups can certainly have a high memory footprint. I guess
this is quite clear. The question is whether trying to limit them by
their memory footprint is really the right way to go. I would be
especially worried about smaller machines, where a smaller footprint
would allow the id space to be depleted faster. Maybe we need some sort
of limit on the number of cgroups in a subtree, so that any potential
runaway can be prevented regardless of the cgroups' memory footprint.
One potentially big problem with that is that cgroups can live quite
long after being offlined (e.g. memcg), so I can imagine such a limit
could easily trigger.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread
* [PATCH mm v4 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-30 11:25 ` [PATCH mm v3 " Vasily Averin
  2022-05-30 11:55   ` Michal Hocko
@ 2022-06-13  5:34   ` Vasily Averin
  2022-06-23 14:50     ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Vasily Averin
                        ` (3 more replies)
  2022-06-13  5:34   ` [PATCH mm v4 1/9] memcg: enable accounting for struct cgroup Vasily Averin
                      ` (2 subsequent siblings)
  4 siblings, 4 replies; 65+ messages in thread
From: Vasily Averin @ 2022-06-13 5:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
      Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

In some cases, creating a cgroup allocates a noticeable amount of
memory. This operation can be executed from inside a memory-limited
container, but currently this memory is not accounted to the memcg and
can be misused. This allows a container to exceed its assigned memory
limit and avoid memcg OOM. Moreover, in case of a global memory
shortage on the host, the OOM-killer may not find the real memory eater
and may start killing random processes on the host.

This is especially important for OpenVZ and LXC used on hosting, where
containers are used by untrusted end users.

Below are the tracing results of mkdir /sys/fs/cgroup/vvs.test on a
4-CPU VM with Fedora and a self-compiled upstream kernel. The
calculations are not precise: they depend on kernel config options,
the number of CPUs, the enabled controllers, ignore possible page
allocations, etc. However, this is enough to clarify the general
situation. All allocations are split into:
- the common part, always executed for each cgroup type
- per-cgroup allocations

In each group we consider 2 corner cases:
- usual allocations, important for 1-2 CPU nodes/VMs
- percpu allocations, important for 'big irons'

common part:         ~11Kb  +  318 bytes percpu
memcg:               ~17Kb  + 4692 bytes percpu
cpu:                 ~2.5Kb + 1036 bytes percpu
cpuset:              ~3Kb   +   12 bytes percpu
blkcg:               ~3Kb   +   12 bytes percpu
pid:                 ~1.5Kb +   12 bytes percpu
perf:                ~320b  +   60 bytes percpu
-------------------------------------------
total:               ~38Kb  + 6142 bytes percpu
currently accounted:          4668 bytes percpu

- It is important to account the usual allocations made in the common
  part, because almost all cgroup-specific allocations are small. One
  exception here is the memory cgroup: it allocates a few huge objects
  that should be accounted.
- Percpu allocations made in the common part and in the memcg and cpu
  cgroups should be accounted; the rest are small and can be ignored.
- KERNFS objects are allocated both in the common part and in most
  cgroups.

Details can be found here:
https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/

I checked the other cgroup types and found that they can all be
ignored. Additionally, I found an allocation of struct rt_rq made in
the cpu cgroup when CONFIG_RT_GROUP_SCHED is enabled; it allocates a
huge (~1700 bytes) percpu structure and should be accounted too.

v4:
1) rebased to linux-next (next-20220610);
   psi_group is now not a part of struct cgroup and is allocated on demand
2) added received approval from Muchun Song
3) improved cover letter description according to akpm@ request

v3:
1) rebased to current upstream (v5.18-11267-gb00ed48bb0a7)
2) fixed a few typos
3) added received approvals

v2:
1) re-split to simplify possible bisect, re-ordered
2) added accounting for percpu psi_group_cpu and cgroup_rstat_cpu,
   allocated in the common part
3) added accounting for the percpu allocation of struct rt_rq
   (relevant if CONFIG_RT_GROUP_SCHED is enabled)
4) improved patch descriptions

Vasily Averin (9):
  memcg: enable accounting for struct cgroup
  memcg: enable accounting for kernfs nodes
  memcg: enable accounting for kernfs iattrs
  memcg: enable accounting for struct simple_xattr
  memcg: enable accounting for percpu allocation of struct psi_group_cpu
  memcg: enable accounting for percpu allocation of struct
    cgroup_rstat_cpu
  memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  memcg: enable accounting for allocations in alloc_fair_sched_group
  memcg: enable accounting for percpu allocation of struct rt_rq

 fs/kernfs/mount.c      | 6 ++++--
 fs/xattr.c             | 2 +-
 kernel/cgroup/cgroup.c | 2 +-
 kernel/cgroup/rstat.c  | 3 ++-
 kernel/sched/fair.c    | 4 ++--
 kernel/sched/psi.c     | 5 +++--
 kernel/sched/rt.c      | 2 +-
 mm/memcontrol.c        | 4 ++--
 8 files changed, 16 insertions(+), 12 deletions(-)

-- 
2.36.1

^ permalink raw reply	[flat|nested] 65+ messages in thread
* [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-13  5:34   ` [PATCH mm v4 " Vasily Averin
@ 2022-06-23 14:50     ` Vasily Averin
  2022-06-23 15:03       ` Vasily Averin
  2022-06-23 14:50     ` [PATCH mm v5 1/9] memcg: enable accounting for struct cgroup Vasily Averin
                        ` (2 subsequent siblings)
  3 siblings, 1 reply; 65+ messages in thread
From: Vasily Averin @ 2022-06-23 14:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
      Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

In some cases, creating a cgroup allocates a noticeable amount of
memory. This operation can be executed from inside a memory-limited
container, but currently this memory is not accounted to the memcg and
can be misused. This allows a container to exceed its assigned memory
limit and avoid memcg OOM. Moreover, in case of a global memory
shortage on the host, the OOM-killer may not find the real memory eater
and may start killing random processes on the host.

This is especially important for OpenVZ and LXC used on hosting, where
containers are used by untrusted end users.

Below are the tracing results of mkdir /sys/fs/cgroup/vvs.test on a
4-CPU VM with Fedora and a self-compiled upstream kernel. The
calculations are not precise: they depend on kernel config options,
the number of CPUs, the enabled controllers, ignore possible page
allocations, etc. However, this is enough to clarify the general
situation. All allocations are split into:
- the common part, always executed for each cgroup type
- per-cgroup allocations

In each group we consider 2 corner cases:
- usual allocations, important for 1-2 CPU nodes/VMs
- percpu allocations, important for 'big irons'

common part:         ~11Kb  +  318 bytes percpu
memcg:               ~17Kb  + 4692 bytes percpu
cpu:                 ~2.5Kb + 1036 bytes percpu
cpuset:              ~3Kb   +   12 bytes percpu
blkcg:               ~3Kb   +   12 bytes percpu
pid:                 ~1.5Kb +   12 bytes percpu
perf:                ~320b  +   60 bytes percpu
-------------------------------------------
total:               ~38Kb  + 6142 bytes percpu
currently accounted:          4668 bytes percpu

- It is important to account the usual allocations made in the common
  part, because almost all cgroup-specific allocations are small. One
  exception here is the memory cgroup: it allocates a few huge objects
  that should be accounted.
- Percpu allocations made in the common part and in the memcg and cpu
  cgroups should be accounted; the rest are small and can be ignored.
- KERNFS objects are allocated both in the common part and in most
  cgroups.

Details can be found here:
https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/

I checked the other cgroup types and found that they can all be
ignored. Additionally, I found an allocation of struct rt_rq made in
the cpu cgroup when CONFIG_RT_GROUP_SCHED is enabled; it allocates a
huge (~1700 bytes) percpu structure and should be accounted too.

v5:
1) rebased to linux-mm (mm-everything-2022-06-22-20-36)

v4:
1) rebased to linux-next (next-20220610);
   psi_group is now not a part of struct cgroup and is allocated on demand
2) added received approval from Muchun Song
3) improved cover letter description according to akpm@ request

v3:
1) rebased to current upstream (v5.18-11267-gb00ed48bb0a7)
2) fixed a few typos
3) added received approvals

v2:
1) re-split to simplify possible bisect, re-ordered
2) added accounting for percpu psi_group_cpu and cgroup_rstat_cpu,
   allocated in the common part
3) added accounting for the percpu allocation of struct rt_rq
   (relevant if CONFIG_RT_GROUP_SCHED is enabled)
4) improved patch descriptions

Vasily Averin (9):
  memcg: enable accounting for struct cgroup
  memcg: enable accounting for kernfs nodes
  memcg: enable accounting for kernfs iattrs
  memcg: enable accounting for struct simple_xattr
  memcg: enable accounting for percpu allocation of struct psi_group_cpu
  memcg: enable accounting for percpu allocation of struct
    cgroup_rstat_cpu
  memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  memcg: enable accounting for allocations in alloc_fair_sched_group
  memcg: enable accounting for percpu allocation of struct rt_rq

 fs/kernfs/mount.c      | 6 ++++--
 fs/xattr.c             | 2 +-
 kernel/cgroup/cgroup.c | 2 +-
 kernel/cgroup/rstat.c  | 3 ++-
 kernel/sched/fair.c    | 4 ++--
 kernel/sched/psi.c     | 2 +-
 kernel/sched/rt.c      | 2 +-
 mm/memcontrol.c        | 4 ++--
 8 files changed, 14 insertions(+), 11 deletions(-)

-- 
2.36.1

^ permalink raw reply	[flat|nested] 65+ messages in thread
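The percpu half of the series follows the same pattern as the regular
allocations; a hedged sketch of the kind of hunk applied in
kernel/cgroup/rstat.c (struct and field names taken from the diffstat;
the exact upstream call site may differ slightly):

#include <linux/cgroup.h>
#include <linux/percpu.h>

static int example_rstat_alloc(struct cgroup *cgrp)
{
	/* GFP_KERNEL_ACCOUNT charges the per-cpu chunk to the
	 * creator's memcg; with plain GFP_KERNEL it was invisible
	 * to the memory controller */
	cgrp->rstat_cpu = alloc_percpu_gfp(struct cgroup_rstat_cpu,
					   GFP_KERNEL_ACCOUNT);
	if (!cgrp->rstat_cpu)
		return -ENOMEM;
	return 0;
}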
* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-23 14:50     ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Vasily Averin
@ 2022-06-23 15:03       ` Vasily Averin
  2022-06-23 16:07         ` Michal Hocko
  0 siblings, 1 reply; 65+ messages in thread
From: Vasily Averin @ 2022-06-23 15:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kernel, Andrew Morton, linux-kernel, linux-mm, Shakeel Butt,
      Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, cgroups

Dear Michal,
do you still have any concerns about this patch set?

Thank you,
	Vasily Averin

On 6/23/22 17:50, Vasily Averin wrote:
> In some cases, creating a cgroup allocates a noticeable amount of memory.
> This operation can be executed from inside memory-limited container,
> but currently this memory is not accounted to memcg and can be misused.
> This allow container to exceed the assigned memory limit and avoid
> memcg OOM. Moreover, in case of global memory shortage on the host,
> the OOM-killer may not find a real memory eater and start killing
> random processes on the host.
>
> This is especially important for OpenVZ and LXC used on hosting,
> where containers are used by untrusted end users.
>
> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on
> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
> are not precise, it depends on kernel config options, number of cpus,
> enabled controllers, ignores possible page allocations etc.
> However this is enough to clarify the general situation.
> All allocations are splitted into:
> - common part, always called for each cgroup type
> - per-cgroup allocations
>
> In each group we consider 2 corner cases:
> - usual allocations, important for 1-2 CPU nodes/Vms
> - percpu allocations, important for 'big irons'
>
> common part: ~11Kb + 318 bytes percpu
> memcg: ~17Kb + 4692 bytes percpu
> cpu: ~2.5Kb + 1036 bytes percpu
> cpuset: ~3Kb + 12 bytes percpu
> blkcg: ~3Kb + 12 bytes percpu
> pid: ~1.5Kb + 12 bytes percpu
> perf: ~320b + 60 bytes percpu
> -------------------------------------------
> total: ~38Kb + 6142 bytes percpu
> currently accounted: 4668 bytes percpu
>
> - it's important to account usual allocations called
> in common part, because almost all of cgroup-specific allocations
> are small. One exception here is memory cgroup, it allocates a few
> huge objects that should be accounted.
> - Percpu allocation called in common part, in memcg and cpu cgroups
> should be accounted, rest ones are small an can be ignored.
> - KERNFS objects are allocated both in common part and in most of
> cgroups
>
> Details can be found here:
> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
>
> I checked other cgroups types was found that they all can be ignored.
> Additionally I found allocation of struct rt_rq called in cpu cgroup
> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
> percpu structure and should be accounted too.
>
> v5:
> 1) re-based to linux-mm (mm-everything-2022-06-22-20-36)
>
> v4:
> 1) re-based to linux-next (next-20220610)
> now psi_group is not a part of struct cgroup and is allocated on demand
> 2) added received approval from Muchun Song
> 3) improved cover letter description according to akpm@ request
>
> v3:
> 1) re-based to current upstream (v5.18-11267-gb00ed48bb0a7)
> 2) fixed few typos
> 3) added received approvals
>
> v2:
> 1) re-split to simplify possible bisect, re-ordered
> 2) added accounting for percpu psi_group_cpu and cgroup_rstat_cpu,
> allocated in common part
> 3) added accounting for percpu allocation of struct rt_rq
> (actual if CONFIG_RT_GROUP_SCHED is enabled)
> 4) improved patches descriptions
>
> Vasily Averin (9):
>   memcg: enable accounting for struct cgroup
>   memcg: enable accounting for kernfs nodes
>   memcg: enable accounting for kernfs iattrs
>   memcg: enable accounting for struct simple_xattr
>   memcg: enable accounting for percpu allocation of struct psi_group_cpu
>   memcg: enable accounting for percpu allocation of struct
>     cgroup_rstat_cpu
>   memcg: enable accounting for large allocations in mem_cgroup_css_alloc
>   memcg: enable accounting for allocations in alloc_fair_sched_group
>   memcg: enable accounting for perpu allocation of struct rt_rq
>
>  fs/kernfs/mount.c      | 6 ++++--
>  fs/xattr.c             | 2 +-
>  kernel/cgroup/cgroup.c | 2 +-
>  kernel/cgroup/rstat.c  | 3 ++-
>  kernel/sched/fair.c    | 4 ++--
>  kernel/sched/psi.c     | 2 +-
>  kernel/sched/rt.c      | 2 +-
>  mm/memcontrol.c        | 4 ++--
>  8 files changed, 14 insertions(+), 11 deletions(-)
>

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-23 15:03 ` Vasily Averin
@ 2022-06-23 16:07   ` Michal Hocko
  2022-06-23 16:55     ` Shakeel Butt
  0 siblings, 1 reply; 65+ messages in thread
From: Michal Hocko @ 2022-06-23 16:07 UTC (permalink / raw)
  To: Vasily Averin
  Cc: kernel, Andrew Morton, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, cgroups

On Thu 23-06-22 18:03:31, Vasily Averin wrote:
> Dear Michal,
> do you still have any concerns about this patch set?

Yes, I do not think we have concluded this to be really necessary. IIRC
Roman would like to see lingering cgroups addressed in the not-so-distant
future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
have a limit for the number of cgroups in the tree. So why should we
chase after allocations that correspond to the cgroups and somehow try
to cap their number via memory consumption? This looks like something
that will get out of sync eventually, and it also doesn't seem like the
best control to me (compared to an explicit limit to prevent runaways).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 65+ messages in thread
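For context, the explicit limit Michal refers to is cgroup v2's cgroup.max.descendants knob; a minimal illustration of how an admin would use it (the subtree path is hypothetical):

  # cap a delegated subtree at 512 descendant cgroups
  echo 512 > /sys/fs/cgroup/ct1/cgroup.max.descendants
  # once the cap is hit, further cgroup creation in the subtree fails
  mkdir /sys/fs/cgroup/ct1/one-too-many   # fails with EAGAIN

This bounds the number of cgroups directly, independent of how much memory each cgroup happens to cost on a given kernel version.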
* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-23 16:07   ` Michal Hocko
@ 2022-06-23 16:55     ` Shakeel Butt
  2022-06-24 10:40       ` Vasily Averin
  2022-06-24 13:59       ` Michal Hocko
  0 siblings, 2 replies; 65+ messages in thread
From: Shakeel Butt @ 2022-06-23 16:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vasily Averin, kernel, Andrew Morton, LKML, Linux MM,
	Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, Cgroups

On Thu, Jun 23, 2022 at 9:07 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 23-06-22 18:03:31, Vasily Averin wrote:
> > Dear Michal,
> > do you still have any concerns about this patch set?
>
> Yes, I do not think we have concluded this to be really necessary. IIRC
> Roman would like to see lingering cgroups addressed in the not-so-distant
> future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
> have a limit for the number of cgroups in the tree. So why should we
> chase after allocations that correspond to the cgroups and somehow try
> to cap their number via memory consumption? This looks like something
> that will get out of sync eventually, and it also doesn't seem like the
> best control to me (compared to an explicit limit to prevent runaways).
> --

Let me give a counter-argument to that. On a system running multiple
workloads, how can the admin come up with a sensible limit for the
number of cgroups? There will definitely be jobs that require a much
larger number of sub-cgroups. Asking the admins to dynamically tune
another tuneable is just asking for more complications. In the end all
the users would just set it to max.

I would recommend looking at commit ac7b79fd190b ("inotify, memcg:
account inotify instances to kmemcg"): there was already a sysctl
(inotify/max_user_instances) to limit the number of instances, but
there was no sensible way to set that limit on a multi-tenant system.

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-23 16:55     ` Shakeel Butt
@ 2022-06-24 10:40       ` Vasily Averin
  2022-06-24 12:26         ` Michal Koutný
  0 siblings, 1 reply; 65+ messages in thread
From: Vasily Averin @ 2022-06-24 10:40 UTC (permalink / raw)
  To: Shakeel Butt, Michal Hocko
  Cc: kernel, Andrew Morton, LKML, Linux MM, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Muchun Song, Cgroups

On 6/23/22 19:55, Shakeel Butt wrote:
> On Thu, Jun 23, 2022 at 9:07 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Thu 23-06-22 18:03:31, Vasily Averin wrote:
>>> Dear Michal,
>>> do you still have any concerns about this patch set?
>>
>> Yes, I do not think we have concluded this to be really necessary. IIRC
>> Roman would like to see lingering cgroups addressed in the not-so-distant
>> future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
>> have a limit for the number of cgroups in the tree. So why should we
>> chase after allocations that correspond to the cgroups and somehow try
>> to cap their number via memory consumption? This looks like something
>> that will get out of sync eventually, and it also doesn't seem like the
>> best control to me (compared to an explicit limit to prevent runaways).
>> --
>
> Let me give a counter-argument to that. On a system running multiple
> workloads, how can the admin come up with a sensible limit for the
> number of cgroups? There will definitely be jobs that require a much
> larger number of sub-cgroups. Asking the admins to dynamically tune
> another tuneable is just asking for more complications. In the end all
> the users would just set it to max.
>
> I would recommend looking at commit ac7b79fd190b ("inotify, memcg:
> account inotify instances to kmemcg"): there was already a sysctl
> (inotify/max_user_instances) to limit the number of instances, but
> there was no sensible way to set that limit on a multi-tenant system.

I've found that MEM_CGROUP_ID_MAX limits memory cgroups only.
Other types of cgroups do not have similar restrictions.
Yes, we can set some per-container limit for all cgroups, but to me it
looks like a workaround, while proper memory accounting looks like the
real solution.

Btw. could you please explain why memory cgroups have the
MEM_CGROUP_ID_MAX limit at all, and why it was set to USHRT_MAX?
I believe that in the future it may really be reachable:
let's set the per-container cgroup limit to some small number, for
example to 512, as OpenVz does right now. On a real node with 300
containers we can easily get 100*300 = 30000 cgroups and consume ~3Gb
of memory, without any misuse. I think that is too much to leave
unaccounted.

Thank you,
	Vasily Averin

^ permalink raw reply [flat|nested] 65+ messages in thread
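A rough cross-check of that ~3Gb figure (editorial arithmetic based on the per-cgroup costs from the cover letter, not numbers from the thread): 30000 cgroups at ~38Kb each is ~1.1Gb of ordinary allocations, and the ~6Kb of percpu data per cgroup adds roughly another 180Mb per CPU, so a node with around 8-10 CPUs already lands in the ~3Gb range.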
* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-24 10:40       ` Vasily Averin
@ 2022-06-24 12:26         ` Michal Koutný
  0 siblings, 0 replies; 65+ messages in thread
From: Michal Koutný @ 2022-06-24 12:26 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Michal Hocko, kernel, Andrew Morton, LKML, Linux MM,
	Roman Gushchin, Vlastimil Babka, Muchun Song, Cgroups

On Fri, Jun 24, 2022 at 01:40:14PM +0300, Vasily Averin <vvs@openvz.org> wrote:
> Btw. could you please explain why memory cgroups have the
> MEM_CGROUP_ID_MAX limit at all, and why it was set to USHRT_MAX?
> I believe that in the future it may really be reachable:

IIRC, one reason is the 2B * nr_swap_pages memory overhead (in
swap_cgroup_swapon()); that's ~0.05% of the swap space occupied
additionally in RAM (fortunately, swap needn't cover the whole of RAM).

HTH,
Michal

^ permalink raw reply [flat|nested] 65+ messages in thread
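To make that overhead concrete (editorial arithmetic, not from the thread): the swap_cgroup map stores one memcg id per swap page, so 2-byte ids cost 2/4096 ≈ 0.05% of the swap size in RAM, e.g. ~2Mb for 4Gb of swap; widening the id to 4 bytes, as the RFC later in the thread does, doubles that to ≈ 0.1%.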
* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-23 16:55     ` Shakeel Butt
  2022-06-24 10:40       ` Vasily Averin
@ 2022-06-24 13:59       ` Michal Hocko
  2022-06-25  9:43         ` [PATCH RFC] memcg: avoid idr ids space depletion Vasily Averin
                           ` (2 more replies)
  1 sibling, 3 replies; 65+ messages in thread
From: Michal Hocko @ 2022-06-24 13:59 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vasily Averin, kernel, Andrew Morton, LKML, Linux MM,
	Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, Cgroups

On Thu 23-06-22 09:55:33, Shakeel Butt wrote:
> On Thu, Jun 23, 2022 at 9:07 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 23-06-22 18:03:31, Vasily Averin wrote:
> > > Dear Michal,
> > > do you still have any concerns about this patch set?
> >
> > Yes, I do not think we have concluded this to be really necessary. IIRC
> > Roman would like to see lingering cgroups addressed in the not-so-distant
> > future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
> > have a limit for the number of cgroups in the tree. So why should we
> > chase after allocations that correspond to the cgroups and somehow try
> > to cap their number via memory consumption? This looks like something
> > that will get out of sync eventually, and it also doesn't seem like the
> > best control to me (compared to an explicit limit to prevent runaways).
> > --
>
> Let me give a counter-argument to that. On a system running multiple
> workloads, how can the admin come up with a sensible limit for the
> number of cgroups?

How is that any easier through memory consumption? Something that might
change between kernel versions? Is it even possible to prevent id
depletion via memory consumption? Any medium-sized memcg can easily
consume all the ids AFAICS.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 65+ messages in thread
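A back-of-envelope check of that concern (editorial arithmetic from the cover letter's numbers): at ~38Kb plus ~6Kb-per-CPU per cgroup, one cgroup costs on the order of 64Kb on a small node, so exhausting the USHRT_MAX = 65535 id space corresponds to only ~4Gb of accounted memory. A memcg with a limit of a few Gb could therefore deplete the global id space even with the proposed accounting in place.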
* [PATCH RFC] memcg: avoid idr ids space depletion
  2022-06-24 13:59       ` Michal Hocko
@ 2022-06-25  9:43         ` Vasily Averin
  [not found]             ` <c53e1df0-5174-66de-23cc-18797f0b512d@openvz.org>
  2022-06-27 16:37         ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Shakeel Butt
  2 siblings, 0 replies; 65+ messages in thread
From: Vasily Averin @ 2022-06-25 9:43 UTC (permalink / raw)
  To: Shakeel Butt, Michal Koutný, Michal Hocko
  Cc: kernel, linux-kernel, Andrew Morton, linux-mm, Roman Gushchin,
	Vlastimil Babka, Muchun Song, cgroups

I tried to increase MEM_CGROUP_ID_MAX to INT_MAX and found no
significant difficulties. What do you think about the following patch?
I have not tested it, only checked that it compiles.

I hope it allows:
- to avoid memcg id space depletion on normal nodes
- to set the per-container cgroup limit to USHRT_MAX to prevent
  possible misuse, and in general to use memcg accounting for
  allocated resources.

Thank you,
	Vasily Averin
---
Michal Hocko pointed out that the memory controller depends on idr ids
which have a rather limited space:
	#define MEM_CGROUP_ID_MAX	USHRT_MAX
The limit can be reached on nodes hosting several hundred OS containers,
with new distributions running hundreds of services in their own memory
cgroups.

This patch increases the space up to INT_MAX.
---
 include/linux/memcontrol.h  | 15 +++++++++------
 include/linux/swap_cgroup.h | 14 +++++---------
 mm/memcontrol.c             |  6 +++---
 mm/swap_cgroup.c            | 10 ++++------
 4 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 744cde2b2368..e3468550ba20 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -59,10 +59,13 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
-
+#ifdef CONFIG_64BIT
+#define MEM_CGROUP_ID_SHIFT	31
+#define MEM_CGROUP_ID_MAX	(INT_MAX - 1)
+#else
 #define MEM_CGROUP_ID_SHIFT	16
 #define MEM_CGROUP_ID_MAX	USHRT_MAX
-
+#endif
 struct mem_cgroup_id {
 	int id;
 	refcount_t ref;
@@ -852,14 +855,14 @@ void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 int mem_cgroup_scan_tasks(struct mem_cgroup *,
 			  int (*)(struct task_struct *, void *), void *);
 
-static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+static inline int mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
 		return 0;
 
 	return memcg->id.id;
 }
-struct mem_cgroup *mem_cgroup_from_id(unsigned short id);
+struct mem_cgroup *mem_cgroup_from_id(int id);
 
 #ifdef CONFIG_SHRINKER_DEBUG
 static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
@@ -1374,12 +1377,12 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return 0;
 }
 
-static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+static inline int mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	return 0;
 }
 
-static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+static inline struct mem_cgroup *mem_cgroup_from_id(int id)
 {
 	WARN_ON_ONCE(id);
 	/* XXX: This should always return root_mem_cgroup */
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index a12dd1c3966c..711dd18380ed 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -6,25 +6,21 @@
 
 #ifdef CONFIG_MEMCG_SWAP
 
-extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
-					unsigned short old, unsigned short new);
-extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-					 unsigned int nr_ents);
-extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
+extern int swap_cgroup_cmpxchg(swp_entry_t ent, int old, int new);
+extern int swap_cgroup_record(swp_entry_t ent, int id, unsigned int nr_ents);
+extern int lookup_swap_cgroup_id(swp_entry_t ent);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
 
 #else
 
 static inline
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-				  unsigned int nr_ents)
+int swap_cgroup_record(swp_entry_t ent, int id, unsigned int nr_ents)
 {
 	return 0;
 }
 
-static inline
-unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
+static inline int lookup_swap_cgroup_id(swp_entry_t ent)
 {
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 275d0c847f05..d4c606a06bcd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5224,7 +5224,7 @@ static inline void mem_cgroup_id_put(struct mem_cgroup *memcg)
  *
  * Caller must hold rcu_read_lock().
  */
-struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+struct mem_cgroup *mem_cgroup_from_id(int id)
 {
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	return idr_find(&mem_cgroup_idr, id);
@@ -7021,7 +7021,7 @@ int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
 {
 	struct folio *folio = page_folio(page);
 	struct mem_cgroup *memcg;
-	unsigned short id;
+	int id;
 	int ret;
 
 	if (mem_cgroup_disabled())
@@ -7541,7 +7541,7 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 {
 	struct mem_cgroup *memcg;
-	unsigned short id;
+	int id;
 
 	id = swap_cgroup_record(entry, 0, nr_pages);
 	rcu_read_lock();
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 5a9442979a18..76fa5c42e03f 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -15,7 +15,7 @@ struct swap_cgroup_ctrl {
 static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
 
 struct swap_cgroup {
-	unsigned short		id;
+	int			id;
 };
 #define SC_PER_PAGE	(PAGE_SIZE/sizeof(struct swap_cgroup))
 
@@ -94,8 +94,7 @@ static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
  * Returns old id at success, 0 at failure.
  * (There is no mem_cgroup using 0 as its id)
  */
-unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
-				   unsigned short old, unsigned short new)
+int swap_cgroup_cmpxchg(swp_entry_t ent, int old, int new)
 {
 	struct swap_cgroup_ctrl *ctrl;
 	struct swap_cgroup *sc;
@@ -123,8 +122,7 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
  * Returns old value at success, 0 at failure.
  * (Of course, old value can be 0.)
  */
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-				  unsigned int nr_ents)
+int swap_cgroup_record(swp_entry_t ent, int id, unsigned int nr_ents)
 {
 	struct swap_cgroup_ctrl *ctrl;
 	struct swap_cgroup *sc;
@@ -159,7 +157,7 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
  *
  * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
  */
-unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
+int lookup_swap_cgroup_id(swp_entry_t ent)
 {
 	return lookup_swap_cgroup(ent, NULL)->id;
 }
-- 
2.36.1

^ permalink raw reply related [flat|nested] 65+ messages in thread
[parent not found: <c53e1df0-5174-66de-23cc-18797f0b512d@openvz.org>]
* Re: [PATCH RFC] memcg: notify about global mem_cgroup_id space depletion
  [not found] ` <c53e1df0-5174-66de-23cc-18797f0b512d@openvz.org>
@ 2022-06-26  1:56   ` Roman Gushchin
  [not found]     ` <97bed1fd-f230-c2ea-1cb6-8230825a9a64@openvz.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Roman Gushchin @ 2022-06-26 1:56 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Michal Koutný, Michal Hocko, kernel, linux-kernel,
	Andrew Morton, linux-mm, Vlastimil Babka, Muchun Song, cgroups

On Sat, Jun 25, 2022 at 05:04:27PM +0300, Vasily Averin wrote:
> Currently the host owner is not informed about the exhaustion of the
> global mem_cgroup_id space. When this happens, systemd cannot
> start a new service, but nothing points to the real cause of
> the failure.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
>  mm/memcontrol.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d4c606a06bcd..5229321636f2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5317,6 +5317,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  					 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
>  	if (memcg->id.id < 0) {
>  		error = memcg->id.id;
> +		pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
>  		goto fail;
>  	}

Hm, in this case it should return -ENOSPC, and that's a quite distinctive
return code. If it's not returned from the mkdir() call, we should fix
this. Otherwise it's up to systemd to handle it properly.

I'm not opposed to adding a warning, but parsing dmesg is not how the
error handling should be done.

Thanks!

^ permalink raw reply [flat|nested] 65+ messages in thread
[parent not found: <97bed1fd-f230-c2ea-1cb6-8230825a9a64@openvz.org>]
* Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
  [not found]   ` <97bed1fd-f230-c2ea-1cb6-8230825a9a64@openvz.org>
@ 2022-06-27  3:23     ` Muchun Song
  [not found]       ` <f3e4059c-69ea-eccd-a22f-9f6c6780f33a@openvz.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Muchun Song @ 2022-06-27 3:23 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Roman Gushchin, Michal Koutný, Michal Hocko, kernel,
	LKML, Andrew Morton, Linux Memory Management List, Vlastimil Babka, Cgroups

On Mon, Jun 27, 2022 at 10:11 AM Vasily Averin <vvs@openvz.org> wrote:
>
> Currently, the host owner is not informed about the exhaustion of the
> global mem_cgroup_id space. When this happens, systemd cannot start a
> new service and receives a unique -ENOSPC error code. However, this
> can happen inside a container, be recorded only in that container's
> local log file, and go unnoticed by the host owner if he does not try
> to start any new services himself.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
> v2: Roman Gushchin pointed out that idr_alloc() should return a unique -ENOSPC

If the caller can know that -ENOSPC was returned by mkdir(), then I
think userspace (perhaps systemd) is the best place to emit the error
message, rather than the kernel log. Right?

Thanks.

>     if no free IDs could be found, but can also return -ENOMEM.
>     Therefore an error code check was added before the message output,
>     and the patch description was adjusted.
> ---
>  mm/memcontrol.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d4c606a06bcd..ffc6b5d6b95e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5317,6 +5317,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  					 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
>  	if (memcg->id.id < 0) {
>  		error = memcg->id.id;
> +		if (error == -ENOSPC)
> +			pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
>  		goto fail;
>  	}
>
> --
> 2.36.1
>

^ permalink raw reply [flat|nested] 65+ messages in thread
[parent not found: <f3e4059c-69ea-eccd-a22f-9f6c6780f33a@openvz.org>]
* Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
  [not found]     ` <f3e4059c-69ea-eccd-a22f-9f6c6780f33a@openvz.org>
@ 2022-06-28  1:11       ` Roman Gushchin
  2022-06-28  9:08         ` Michal Koutný
  0 siblings, 1 reply; 65+ messages in thread
From: Roman Gushchin @ 2022-06-28 1:11 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Muchun Song, Shakeel Butt, Michal Koutný, Michal Hocko, kernel,
	LKML, Andrew Morton, Linux Memory Management List, Vlastimil Babka, Cgroups

On Mon, Jun 27, 2022 at 09:49:18AM +0300, Vasily Averin wrote:
> On 6/27/22 06:23, Muchun Song wrote:
> > If the caller can know that -ENOSPC was returned by mkdir(), then I
> > think userspace (perhaps systemd) is the best place to emit the error
> > message, rather than the kernel log. Right?
>
> Such an incident may occur inside a container.
> OpenVZ nodes can host 300-400 containers, and the host admin cannot
> monitor guest logs. The dmesg message is necessary to inform the host
> owner that the global limit has been reached; otherwise he can
> continue to believe that there are no problems on the node.

Why is this happening? It's hard to believe someone really needs that
many cgroups. Is this a case of somebody failing to delete old cgroups?

I wanted to say that it's better to introduce a memcg event, but then
I realized it's probably not worth the wasted space. Is this a common
scenario?

I think a better approach will be to add a cgroup event (displayed via
cgroup.events) about reaching the maximum limit of cgroups, e.g.
cgroup.events::max_nr_reached. Then you can set cgroup.max.descendants
to some value below the memcg_id space size. It's more work, but IMO
it's a better way to communicate this event. As a bonus, you can easily
get an idea of which cgroup depletes the limit.

Thanks!

^ permalink raw reply [flat|nested] 65+ messages in thread
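A sketch of the workflow Roman proposes (note that max_nr_reached does not exist at this point; the field name is only a suggestion from the mail above, and the path is hypothetical):

  # keep the subtree cap safely below the 16-bit memcg id space
  echo 60000 > /sys/fs/cgroup/machine.slice/cgroup.max.descendants
  # cgroup.events currently carries 'populated' and 'frozen'; the proposal
  # would add a third line that can be watched via poll()/inotify
  cat /sys/fs/cgroup/machine.slice/cgroup.events

An agent could then watch the events file instead of parsing dmesg, and the cgroup whose file flips the flag identifies the subtree that hit its limit.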
* Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
  2022-06-28  1:11       ` Roman Gushchin
@ 2022-06-28  9:08         ` Michal Koutný
  0 siblings, 0 replies; 65+ messages in thread
From: Michal Koutný @ 2022-06-28 9:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vasily Averin, Muchun Song, Shakeel Butt, Michal Hocko, kernel,
	LKML, Andrew Morton, Linux Memory Management List, Vlastimil Babka, Cgroups

On Mon, Jun 27, 2022 at 06:11:27PM -0700, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> I think a better approach will be to add a cgroup event (displayed via
> cgroup.events) about reaching the maximum limit of cgroups, e.g.
> cgroup.events::max_nr_reached.

This sounds like a good generalization.

> Then you can set cgroup.max.descendants to some value below the memcg_id
> space size. It's more work, but IMO it's a better way to communicate
> this event. As a bonus, you can easily get an idea of which cgroup
> depletes the limit.

Just mind that there is a difference between two events: which cgroup's
limit was hit, and which cgroup was affected by the limit [1] (the
former is more useful for calibration, if I understand the situation).

Michal

[1] https://lore.kernel.org/all/20200205134426.10570-2-mkoutny@suse.com/

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-24 13:59       ` Michal Hocko
  2022-06-25  9:43         ` [PATCH RFC] memcg: avoid idr ids space depletion Vasily Averin
  [not found]             ` <c53e1df0-5174-66de-23cc-18797f0b512d@openvz.org>
@ 2022-06-27 16:37       ` Shakeel Butt
  2022-07-01 11:03         ` Michal Hocko
  2 siblings, 1 reply; 65+ messages in thread
From: Shakeel Butt @ 2022-06-27 16:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vasily Averin, kernel, Andrew Morton, LKML, Linux MM,
	Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, Cgroups

On Fri, Jun 24, 2022 at 6:59 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 23-06-22 09:55:33, Shakeel Butt wrote:
> > On Thu, Jun 23, 2022 at 9:07 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 23-06-22 18:03:31, Vasily Averin wrote:
> > > > Dear Michal,
> > > > do you still have any concerns about this patch set?
> > >
> > > Yes, I do not think we have concluded this to be really necessary. IIRC
> > > Roman would like to see lingering cgroups addressed in the not-so-distant
> > > future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
> > > have a limit for the number of cgroups in the tree. So why should we
> > > chase after allocations that correspond to the cgroups and somehow try
> > > to cap their number via memory consumption? This looks like something
> > > that will get out of sync eventually, and it also doesn't seem like the
> > > best control to me (compared to an explicit limit to prevent runaways).
> > > --
> >
> > Let me give a counter-argument to that. On a system running multiple
> > workloads, how can the admin come up with a sensible limit for the
> > number of cgroups?
>
> How is that any easier through memory consumption? Something that might
> change between kernel versions?

In v2, we do provide a way for admins to right-size the containers
without killing them. We are actually trying to use memory.high for
right-sizing jobs. (It is not ideal, but it is workable, and there are
opportunities to improve it.) Similar mechanisms for other types of
limits are lacking: usually the application just gets an error it can
do nothing about.

> Is it even possible to prevent id depletion via memory consumption?
> Any medium-sized memcg can easily consume all the ids AFAICS.

Though the patch series is pitched as protection against OOMs, I think
it is beneficial regardless. Protection against an adversarial actor
should not be the aim here. IMO this patch series improves the
attribution of memory to its actual user, which is better than
unattributed memory being treated as system overhead.

^ permalink raw reply [flat|nested] 65+ messages in thread
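A minimal illustration of the right-sizing mechanism Shakeel mentions (path and value are hypothetical):

  # reclaim from and throttle the job above 4G instead of OOM-killing it
  echo 4G > /sys/fs/cgroup/job.slice/memory.high

Above memory.high the kernel applies reclaim pressure and throttling rather than the OOM killer, giving the admin a correctable signal; the point here is that count-based limits, such as a cgroup id space, have no analogous soft mode.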
* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-27 16:37       ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Shakeel Butt
@ 2022-07-01 11:03         ` Michal Hocko
  2022-07-10 18:53           ` Vasily Averin
  0 siblings, 1 reply; 65+ messages in thread
From: Michal Hocko @ 2022-07-01 11:03 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vasily Averin, kernel, Andrew Morton, LKML, Linux MM,
	Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, Cgroups

On Mon 27-06-22 09:37:14, Shakeel Butt wrote:
> On Fri, Jun 24, 2022 at 6:59 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Is it even possible to prevent id depletion via memory consumption?
> > Any medium-sized memcg can easily consume all the ids AFAICS.
>
> Though the patch series is pitched as protection against OOMs, I think
> it is beneficial regardless. Protection against an adversarial actor
> should not be the aim here. IMO this patch series improves the
> attribution of memory to its actual user, which is better than
> unattributed memory being treated as system overhead.

Considering the amount of memory involved and "normal" cgroup usage (I
guess we can agree that delegated subtrees do not count their cgroups
in thousands), is this really something worth bothering with?

I mean, these patches are really small and not really disruptive, so I
do not really see any problem with them. Except that they clearly add
maintenance overhead. Not directly with the memory they track, but any
future cgroup/memcg metadata related objects would need to be tracked
as well, and I am worried this will quickly get out of sync. So we
would have a half-assed solution in place that doesn't really help any
containment, nor does it provide good and robust consumption tracking.

All that being said, I find these changes to be of rather little value
or use.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-07-01 11:03         ` Michal Hocko
@ 2022-07-10 18:53           ` Vasily Averin
  2022-07-11 16:24             ` Michal Hocko
  0 siblings, 1 reply; 65+ messages in thread
From: Vasily Averin @ 2022-07-10 18:53 UTC (permalink / raw)
  To: Michal Hocko, Shakeel Butt
  Cc: kernel, Andrew Morton, LKML, Linux MM, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Muchun Song, Cgroups

On 7/1/22 14:03, Michal Hocko wrote:
> On Mon 27-06-22 09:37:14, Shakeel Butt wrote:
>> On Fri, Jun 24, 2022 at 6:59 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
>>> Is it even possible to prevent id depletion via memory consumption?
>>> Any medium-sized memcg can easily consume all the ids AFAICS.
>>
>> Though the patch series is pitched as protection against OOMs, I think
>> it is beneficial regardless. Protection against an adversarial actor
>> should not be the aim here. IMO this patch series improves the
>> attribution of memory to its actual user, which is better than
>> unattributed memory being treated as system overhead.
>
> Considering the amount of memory involved and "normal" cgroup usage (I
> guess we can agree that delegated subtrees do not count their cgroups
> in thousands), is this really something worth bothering with?
>
> I mean, these patches are really small and not really disruptive, so I
> do not really see any problem with them. Except that they clearly add
> maintenance overhead. Not directly with the memory they track, but any
> future cgroup/memcg metadata related objects would need to be tracked
> as well, and I am worried this will quickly get out of sync. So we
> would have a half-assed solution in place that doesn't really help any
> containment, nor does it provide good and robust consumption tracking.
>
> All that being said, I find these changes to be of rather little value
> or use.

Dear Michal,
I still have 2 questions:
1) if you do not want to account any memory allocated for cgroup
objects, shouldn't you perhaps revert commit 3e38e0aaca9e ("mm: memcg:
charge memcg percpu memory to the parent cgroup")? Or is it an
exception? (In fact I hope you will not revert this patch; I just
would like to understand your reasoning about this accounting.)
* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-07-10 18:53           ` Vasily Averin
@ 2022-07-11 16:24             ` Michal Hocko
  0 siblings, 0 replies; 65+ messages in thread
From: Michal Hocko @ 2022-07-11 16:24 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, kernel, Andrew Morton, LKML, Linux MM,
	Roman Gushchin, Michal Koutný, Vlastimil Babka, Muchun Song, Cgroups

On Sun 10-07-22 21:53:34, Vasily Averin wrote:
> On 7/1/22 14:03, Michal Hocko wrote:
> > On Mon 27-06-22 09:37:14, Shakeel Butt wrote:
> >> On Fri, Jun 24, 2022 at 6:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> >>> Is it even possible to prevent id depletion via memory consumption?
> >>> Any medium-sized memcg can easily consume all the ids AFAICS.
> >>
> >> Though the patch series is pitched as protection against OOMs, I think
> >> it is beneficial regardless. Protection against an adversarial actor
> >> should not be the aim here. IMO this patch series improves the
> >> attribution of memory to its actual user, which is better than
> >> unattributed memory being treated as system overhead.
> >
> > Considering the amount of memory involved and "normal" cgroup usage (I
> > guess we can agree that delegated subtrees do not count their cgroups
> > in thousands), is this really something worth bothering with?
> >
> > I mean, these patches are really small and not really disruptive, so I
> > do not really see any problem with them. Except that they clearly add
> > maintenance overhead. Not directly with the memory they track, but any
> > future cgroup/memcg metadata related objects would need to be tracked
> > as well, and I am worried this will quickly get out of sync. So we
> > would have a half-assed solution in place that doesn't really help any
> > containment, nor does it provide good and robust consumption tracking.
> >
> > All that being said, I find these changes to be of rather little value
> > or use.
>
> Dear Michal,
> I still have 2 questions:
> 1) if you do not want to account any memory allocated for cgroup
> objects, shouldn't you perhaps revert commit 3e38e0aaca9e ("mm: memcg:
> charge memcg percpu memory to the parent cgroup")? Or is it an
> exception? (In fact I hope you will not revert this patch; I just
> would like to understand your reasoning about this accounting.)

Well, I have to say I was not a great fan of this patch when it was
proposed, but I didn't really have strong arguments against it to nack
it. It was simple enough and rather self-contained in a few places.
Just to give you an insight into my thinking here.

Your patch series is also not something I would nack (nor have I done
that). I am not a super fan of it either. I voiced against it because
it just hit my internal threshold of how many different places are
patched without any systemic approach. If we consider that it doesn't
really help with the initial intention of protecting against
adversaries, then what is the point of all the churn? Others might
think differently, and if you can get acks from other maintainers then
I won't stand in the way. I have voiced my concerns and I hope my
thinking is clear now.
> 2) my patch set includes the kernfs accounting required for proper
>    netdevice accounting:
>
>    Allocs  Alloc  Allocation
>    number  size
>    --------------------------------------------
>    1  +    128   (__kernfs_new_node+0x4d)   kernfs node
>    1  +     88   (__kernfs_iattrs+0x57)     kernfs iattrs
>    1  +     96   (simple_xattr_alloc+0x28)  simple_xattr, can grow over 4Kb
>    1         32   (simple_xattr_set+0x59)
>    1          8   (__kernfs_new_node+0x30)
>
>    2/9] memcg: enable accounting for kernfs nodes
>    3/9] memcg: enable accounting for kernfs iattrs
>    4/9] memcg: enable accounting for struct simple_xattr
>
> What do you think about them? Should I resend them as a new separate
> patch set?

kernfs is not really my area so I cannot really comment on those.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH mm v5 1/9] memcg: enable accounting for struct cgroup
  2022-06-13  5:34 ` [PATCH mm v4 " Vasily Averin
  2022-06-23 14:50   ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Vasily Averin
@ 2022-06-23 14:50   ` Vasily Averin
  2022-06-23 14:50   ` [PATCH mm v5 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
  2022-06-23 14:51   ` [PATCH mm v5 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
  3 siblings, 0 replies; 65+ messages in thread
From: Vasily Averin @ 2022-06-23 14:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Creating each new cgroup allocates 4Kb for struct cgroup. This is the
largest memory allocation in this scenario and is especially important
for small VMs with 1-2 CPUs.

Common part of the cgroup creation:

Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
16  ~    352    5632    5632    KERNFS
 1  +   4096    4096    9728    (cgroup_mkdir+0xe4)
 1       584     584   10312    (radix_tree_node_alloc.constprop.0+0x89)
 1       192     192   10504    (__d_alloc+0x29)
 2        72     144   10648    (avc_alloc_node+0x27)
 2        64     128   10776    (percpu_ref_init+0x6a)
 1        64      64   10840    (memcg_list_lru_alloc+0x21a)

percpu:
 1  +    192     192     192    call_site=psi_cgroup_alloc+0x1e
 1  +     96      96     288    call_site=cgroup_rstat_init+0x5f
 2        12      24     312    call_site=percpu_ref_init+0x23
 1         6       6     318    call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Accounting of this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..1be0f81fe8e1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5353,7 +5353,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 
 	/* allocate the cgroup and its ID, 0 is reserved for the root */
 	cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)),
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.36.1

^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH mm v5 2/9] memcg: enable accounting for kernfs nodes
  2022-06-13  5:34 ` [PATCH mm v4 " Vasily Averin
  2022-06-23 14:50   ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Vasily Averin
  2022-06-23 14:50   ` [PATCH mm v5 1/9] memcg: enable accounting for struct cgroup Vasily Averin
@ 2022-06-23 14:50   ` Vasily Averin
  2022-06-23 14:51   ` [PATCH mm v5 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
  3 siblings, 0 replies; 65+ messages in thread
From: Vasily Averin @ 2022-06-23 14:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects, however there are a few
scenarios where they consume a significant portion of all allocated
memory:

1) creating a new netdevice allocates ~50Kb of memory, ~10Kb of which
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of which are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s of
   subcontainers and they have observed high system overhead without
   memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc  Allocation
number  size
--------------------------------------------
1   +   128    (__kernfs_new_node+0x4d)   kernfs node
1   +    88    (__kernfs_iattrs+0x57)     kernfs iattrs
1   +    96    (simple_xattr_alloc+0x28)  simple_xattr, can grow over 4Kb
1        32    (simple_xattr_set+0x59)
1         8    (__kernfs_new_node+0x30)

 '+' -- to be accounted

This patch enables accounting for the kernfs node slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a..3ac4191b1c40 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -391,7 +391,8 @@ void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
-- 
2.36.1

^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH mm v5 3/9] memcg: enable accounting for kernfs iattrs
  2022-06-13  5:34 ` [PATCH mm v4 " Vasily Averin
                     ` (2 preceding siblings ...)
  2022-06-23 14:50   ` [PATCH mm v5 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
@ 2022-06-23 14:51   ` Vasily Averin
  3 siblings, 0 replies; 65+ messages in thread
From: Vasily Averin @ 2022-06-23 14:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects, however there are a few
scenarios where they consume a significant portion of all allocated
memory:

1) creating a new netdevice allocates ~50Kb of memory, ~10Kb of which
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of which are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s of
   subcontainers and they have observed high system overhead without
   memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc  Allocation
number  size
--------------------------------------------
1   +   128    (__kernfs_new_node+0x4d)   kernfs node
1   +    88    (__kernfs_iattrs+0x57)     kernfs iattrs
1   +    96    (simple_xattr_alloc+0x28)  simple_xattr, can grow over 4Kb
1        32    (simple_xattr_set+0x59)
1         8    (__kernfs_new_node+0x30)

 '+' -- to be accounted

This patch enables accounting for the kernfs_iattrs_cache slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 3ac4191b1c40..40e896c7c86b 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -397,5 +397,6 @@ void __init kernfs_init(void)
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
 					      sizeof(struct kernfs_iattrs),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 }
-- 
2.36.1

^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH mm v4 1/9] memcg: enable accounting for struct cgroup
  2022-05-30 11:25 ` [PATCH mm v3 " Vasily Averin
  2022-05-30 11:55   ` Michal Hocko
  2022-06-13  5:34   ` [PATCH mm v4 " Vasily Averin
@ 2022-06-13  5:34   ` Vasily Averin
  2022-06-13  5:34   ` [PATCH mm v4 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
  2022-06-13  5:34   ` [PATCH mm v4 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
  4 siblings, 0 replies; 65+ messages in thread
From: Vasily Averin @ 2022-06-13 5:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Creating each new cgroup allocates 4Kb for struct cgroup. This is the
largest memory allocation in this scenario and is especially important
for small VMs with 1-2 CPUs.

Common part of the cgroup creation:

Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
16  ~    352    5632    5632    KERNFS
 1  +   4096    4096    9728    (cgroup_mkdir+0xe4)
 1       584     584   10312    (radix_tree_node_alloc.constprop.0+0x89)
 1       192     192   10504    (__d_alloc+0x29)
 2        72     144   10648    (avc_alloc_node+0x27)
 2        64     128   10776    (percpu_ref_init+0x6a)
 1        64      64   10840    (memcg_list_lru_alloc+0x21a)

percpu:
 1  +    192     192     192    call_site=psi_cgroup_alloc+0x1e
 1  +     96      96     288    call_site=cgroup_rstat_init+0x5f
 2        12      24     312    call_site=percpu_ref_init+0x23
 1         6       6     318    call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Accounting of this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 90a654cb8a1e..9adf4ad4b623 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5353,7 +5353,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 
 	/* allocate the cgroup and its ID, 0 is reserved for the root */
 	cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)),
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.36.1

^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH mm v4 2/9] memcg: enable accounting for kernfs nodes
  2022-05-30 11:25 ` [PATCH mm v3 " Vasily Averin
                     ` (2 preceding siblings ...)
  2022-06-13  5:34   ` [PATCH mm v4 1/9] memcg: enable accounting for struct cgroup Vasily Averin
@ 2022-06-13  5:34   ` Vasily Averin
  2022-06-13  5:34   ` [PATCH mm v4 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
  4 siblings, 0 replies; 65+ messages in thread
From: Vasily Averin @ 2022-06-13 5:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects, however there are a few
scenarios where they consume a significant portion of all allocated
memory:

1) creating a new netdevice allocates ~50Kb of memory, ~10Kb of which
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of which are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s of
   subcontainers and they have observed high system overhead without
   memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc  Allocation
number  size
--------------------------------------------
1   +   128    (__kernfs_new_node+0x4d)   kernfs node
1   +    88    (__kernfs_iattrs+0x57)     kernfs iattrs
1   +    96    (simple_xattr_alloc+0x28)  simple_xattr, can grow over 4Kb
1        32    (simple_xattr_set+0x59)
1         8    (__kernfs_new_node+0x30)

 '+' -- to be accounted

This patch enables accounting for the kernfs node slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a..3ac4191b1c40 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -391,7 +391,8 @@ void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
-- 
2.36.1

^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH mm v4 3/9] memcg: enable accounting for kernfs iattrs
  2022-05-30 11:25 ` [PATCH mm v3 " Vasily Averin
                     ` (3 preceding siblings ...)
  2022-06-13  5:34   ` [PATCH mm v4 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
@ 2022-06-13  5:34   ` Vasily Averin
  4 siblings, 0 replies; 65+ messages in thread
From: Vasily Averin @ 2022-06-13 5:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects, however there are a few
scenarios where they consume a significant portion of all allocated
memory:

1) creating a new netdevice allocates ~50Kb of memory, ~10Kb of which
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of which are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s of
   subcontainers and they have observed high system overhead without
   memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc  Allocation
number  size
--------------------------------------------
1   +   128    (__kernfs_new_node+0x4d)   kernfs node
1   +    88    (__kernfs_iattrs+0x57)     kernfs iattrs
1   +    96    (simple_xattr_alloc+0x28)  simple_xattr, can grow over 4Kb
1        32    (simple_xattr_set+0x59)
1         8    (__kernfs_new_node+0x30)

 '+' -- to be accounted

This patch enables accounting for the kernfs_iattrs_cache slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 3ac4191b1c40..40e896c7c86b 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -397,5 +397,6 @@ void __init kernfs_init(void)
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
 					      sizeof(struct kernfs_iattrs),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 }
-- 
2.36.1

^ permalink raw reply related [flat|nested] 65+ messages in thread
[parent not found: <cover.1653899364.git.vvs@openvz.org>]
* [PATCH mm v3 1/9] memcg: enable accounting for struct cgroup
  [not found] ` <cover.1653899364.git.vvs@openvz.org>
@ 2022-05-30 11:25   ` Vasily Averin
  2022-05-30 11:26   ` [PATCH mm v3 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 65+ messages in thread
From: Vasily Averin @ 2022-05-30 11:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Creating each new cgroup allocates 4Kb for struct cgroup. This is the
largest memory allocation in this scenario and is especially important
for small VMs with 1-2 CPUs.

Common part of the cgroup creation:

Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
16  ~    352    5632    5632    KERNFS
 1  +   4096    4096    9728    (cgroup_mkdir+0xe4)
 1       584     584   10312    (radix_tree_node_alloc.constprop.0+0x89)
 1       192     192   10504    (__d_alloc+0x29)
 2        72     144   10648    (avc_alloc_node+0x27)
 2        64     128   10776    (percpu_ref_init+0x6a)
 1        64      64   10840    (memcg_list_lru_alloc+0x21a)

percpu:
 1  +    192     192     192    call_site=psi_cgroup_alloc+0x1e
 1  +     96      96     288    call_site=cgroup_rstat_init+0x5f
 2        12      24     312    call_site=percpu_ref_init+0x23
 1         6       6     318    call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Accounting of this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..1be0f81fe8e1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5353,7 +5353,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 
 	/* allocate the cgroup and its ID, 0 is reserved for the root */
 	cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)),
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.36.1

^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH mm v3 2/9] memcg: enable accounting for kernfs nodes
  [not found] ` <cover.1653899364.git.vvs@openvz.org>
  2022-05-30 11:25   ` [PATCH mm v3 1/9] memcg: enable accounting for struct cgroup Vasily Averin
@ 2022-05-30 11:26   ` Vasily Averin
  2022-05-30 11:26   ` [PATCH mm v3 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 65+ messages in thread
From: Vasily Averin @ 2022-05-30 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects, however there are a few
scenarios where they consume a significant portion of all allocated
memory:

1) creating a new netdevice allocates ~50Kb of memory, ~10Kb of which
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of which are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s of
   subcontainers and they have observed high system overhead without
   memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc  Allocation
number  size
--------------------------------------------
1   +   128    (__kernfs_new_node+0x4d)   kernfs node
1   +    88    (__kernfs_iattrs+0x57)     kernfs iattrs
1   +    96    (simple_xattr_alloc+0x28)  simple_xattr, can grow over 4Kb
1        32    (simple_xattr_set+0x59)
1         8    (__kernfs_new_node+0x30)

 '+' -- to be accounted

This patch enables accounting for the kernfs node slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a..3ac4191b1c40 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -391,7 +391,8 @@ void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
-- 
2.36.1

^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH mm v3 3/9] memcg: enable accounting for kernfs iattrs
  [not found] ` <cover.1653899364.git.vvs@openvz.org>
  2022-05-30 11:25   ` [PATCH mm v3 1/9] memcg: enable accounting for struct cgroup Vasily Averin
  2022-05-30 11:26   ` [PATCH mm v3 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
@ 2022-05-30 11:26   ` Vasily Averin
  2022-05-30 11:26   ` [PATCH mm v3 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 65+ messages in thread
From: Vasily Averin @ 2022-05-30 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects, however there are a few
scenarios where they consume a significant portion of all allocated
memory:

1) creating a new netdevice allocates ~50Kb of memory, ~10Kb of which
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of which are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s of
   subcontainers and they have observed high system overhead without
   memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc  Allocation
number  size
--------------------------------------------
1   +   128    (__kernfs_new_node+0x4d)   kernfs node
1   +    88    (__kernfs_iattrs+0x57)     kernfs iattrs
1   +    96    (simple_xattr_alloc+0x28)  simple_xattr, can grow over 4Kb
1        32    (simple_xattr_set+0x59)
1         8    (__kernfs_new_node+0x30)

 '+' -- to be accounted

This patch enables accounting for the kernfs_iattrs_cache slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 3ac4191b1c40..40e896c7c86b 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -397,5 +397,6 @@ void __init kernfs_init(void)
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
 					      sizeof(struct kernfs_iattrs),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 }
-- 
2.36.1

^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH mm v3 4/9] memcg: enable accounting for struct simple_xattr
  [not found] ` <cover.1653899364.git.vvs@openvz.org>
                     ` (2 preceding siblings ...)
  2022-05-30 11:26   ` [PATCH mm v3 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
@ 2022-05-30 11:26   ` Vasily Averin
  2022-05-30 11:26   ` [PATCH mm v3 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 65+ messages in thread
From: Vasily Averin @ 2022-05-30 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects, however there are a few
scenarios where they consume a significant portion of all allocated
memory:

1) creating a new netdevice allocates ~50Kb of memory, ~10Kb of which
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of which are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s of
   subcontainers and they have observed high system overhead without
   memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc  Allocation
number  size
--------------------------------------------
1   +   128    (__kernfs_new_node+0x4d)   kernfs node
1   +    88    (__kernfs_iattrs+0x57)     kernfs iattrs
1   +    96    (simple_xattr_alloc+0x28)  simple_xattr
1        32    (simple_xattr_set+0x59)
1         8    (__kernfs_new_node+0x30)

 '+' -- to be accounted

This patch enables accounting for struct simple_xattr. The size of this
structure depends on userspace and can grow over 4Kb.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/xattr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index e8dd03e4561e..98dcf6600bd9 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -1001,7 +1001,7 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
 	if (len < sizeof(*new_xattr))
 		return NULL;
 
-	new_xattr = kvmalloc(len, GFP_KERNEL);
+	new_xattr = kvmalloc(len, GFP_KERNEL_ACCOUNT);
 	if (!new_xattr)
 		return NULL;
 
-- 
2.36.1

^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH mm v3 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu
  [not found] ` <cover.1653899364.git.vvs@openvz.org>
  ` (3 preceding siblings ...)
  2022-05-30 11:26   ` [PATCH mm v3 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
@ 2022-05-30 11:26   ` Vasily Averin
  2022-05-30 11:26   ` [PATCH mm v3 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
  [not found]         ` <a1fcdab2-a208-0fad-3f4e-233317ab828f@openvz.org>
  6 siblings, 0 replies; 65+ messages in thread
From: Vasily Averin @ 2022-05-30 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

struct psi_group_cpu is percpu allocated for each new cgroup and can
consume a significant portion of all allocated memory on nodes with
a large number of CPUs.

Common part of the cgroup creation:
Allocs  Alloc  $1*$2    Sum   Allocation
number  size
--------------------------------------------
16   ~   352   5632    5632   KERNFS
 1   +  4096   4096    9728   (cgroup_mkdir+0xe4)
 1       584    584   10312   (radix_tree_node_alloc.constprop.0+0x89)
 1       192    192   10504   (__d_alloc+0x29)
 2        72    144   10648   (avc_alloc_node+0x27)
 2        64    128   10776   (percpu_ref_init+0x6a)
 1        64     64   10840   (memcg_list_lru_alloc+0x21a)
percpu:
 1   +   192    192     192   call_site=psi_cgroup_alloc+0x1e
 1   +    96     96     288   call_site=cgroup_rstat_init+0x5f
 2        12     24     312   call_site=percpu_ref_init+0x23
 1         6      6     318   call_site=__percpu_counter_init+0x22

'+' -- to be accounted,
'~' -- partially accounted

Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 kernel/sched/psi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a337f3e35997..f3ec8553283e 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -957,7 +957,8 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	if (static_branch_likely(&psi_disabled))
 		return 0;
 
-	cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
+	cgroup->psi.pcpu = alloc_percpu_gfp(struct psi_group_cpu,
+					    GFP_KERNEL_ACCOUNT);
 	if (!cgroup->psi.pcpu)
 		return -ENOMEM;
 	group_init(&cgroup->psi);
-- 
2.36.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread
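The switch from alloc_percpu() to alloc_percpu_gfp() changes nothing but the
gfp mask: in include/linux/percpu.h both macros funnel into the same allocator,
with alloc_percpu() hard-coding GFP_KERNEL. A simplified sketch of that
relationship (paraphrased from the v5.18-era header, not a verbatim copy):

/* Paraphrased from include/linux/percpu.h; simplified. */
#define alloc_percpu_gfp(type, gfp)					\
	(typeof(type) __percpu *)__alloc_percpu_gfp(sizeof(type),	\
						    __alignof__(type), gfp)
#define alloc_percpu(type)						\
	(typeof(type) __percpu *)__alloc_percpu(sizeof(type),		\
						__alignof__(type))
/* ...where __alloc_percpu(size, align) behaves like
 * __alloc_percpu_gfp(size, align, GFP_KERNEL). */

So the patch is equivalent to the old call plus __GFP_ACCOUNT: size, alignment
and failure behavior are unchanged, only the charging differs.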
* [PATCH mm v3 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu [not found] ` <cover.1653899364.git.vvs@openvz.org> ` (4 preceding siblings ...) 2022-05-30 11:26 ` [PATCH mm v3 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin @ 2022-05-30 11:26 ` Vasily Averin 2022-05-30 15:04 ` Muchun Song [not found] ` <a1fcdab2-a208-0fad-3f4e-233317ab828f@openvz.org> 6 siblings, 1 reply; 65+ messages in thread From: Vasily Averin @ 2022-05-30 11:26 UTC (permalink / raw) To: Andrew Morton Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin, Michal Koutný, Vlastimil Babka, Michal Hocko, Muchun Song, cgroups struct cgroup_rstat_cpu is percpu allocated for each new cgroup and can consume a significant portion of all allocated memory on nodes with a large number of CPUs. Common part of the cgroup creation: Allocs Alloc $1*$2 Sum Allocation number size -------------------------------------------- 16 ~ 352 5632 5632 KERNFS 1 + 4096 4096 9728 (cgroup_mkdir+0xe4) 1 584 584 10312 (radix_tree_node_alloc.constprop.0+0x89) 1 192 192 10504 (__d_alloc+0x29) 2 72 144 10648 (avc_alloc_node+0x27) 2 64 128 10776 (percpu_ref_init+0x6a) 1 64 64 10840 (memcg_list_lru_alloc+0x21a) percpu: 1 + 192 192 192 call_site=psi_cgroup_alloc+0x1e 1 + 96 96 288 call_site=cgroup_rstat_init+0x5f 2 12 24 312 call_site=percpu_ref_init+0x23 1 6 6 318 call_site=__percpu_counter_init+0x22 '+' -- to be accounted, '~' -- partially accounted Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> --- kernel/cgroup/rstat.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index 24b5c2ab5598..2904b185b01b 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -257,7 +257,8 @@ int cgroup_rstat_init(struct cgroup *cgrp) /* the root cgrp has rstat_cpu preallocated */ if (!cgrp->rstat_cpu) { - cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu); + cgrp->rstat_cpu = alloc_percpu_gfp(struct cgroup_rstat_cpu, + GFP_KERNEL_ACCOUNT); if (!cgrp->rstat_cpu) return -ENOMEM; } -- 2.36.1 ^ permalink raw reply related [flat|nested] 65+ messages in thread
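For context, the allocation patched above is paired with per-CPU initialization
in the same function, visible in the build-robot excerpt further down this
thread. The general shape of an accounted per-object per-CPU structure looks
like this (a minimal sketch with illustrative names, not the rstat code
itself):

#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>

/* Illustrative per-CPU state, not a structure from this series. */
struct demo_pcpu {
	u64 count;
};

static struct demo_pcpu __percpu *demo_state;

static int demo_state_init(void)
{
	int cpu;

	/* Charged to the creating task's memcg. The charge covers one
	 * copy per possible CPU, which is why the percpu footprint of
	 * cgroup creation dominates on large machines. */
	demo_state = alloc_percpu_gfp(struct demo_pcpu,
				      GFP_KERNEL_ACCOUNT);
	if (!demo_state)
		return -ENOMEM;

	for_each_possible_cpu(cpu) {
		struct demo_pcpu *ps = per_cpu_ptr(demo_state, cpu);

		ps->count = 0;
	}
	return 0;
}

static void demo_state_exit(void)
{
	free_percpu(demo_state);
}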
* Re: [PATCH mm v3 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu 2022-05-30 11:26 ` [PATCH mm v3 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin @ 2022-05-30 15:04 ` Muchun Song 0 siblings, 0 replies; 65+ messages in thread From: Muchun Song @ 2022-05-30 15:04 UTC (permalink / raw) To: Vasily Averin Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin, Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups On Mon, May 30, 2022 at 02:26:36PM +0300, Vasily Averin wrote: > struct cgroup_rstat_cpu is percpu allocated for each new cgroup and > can consume a significant portion of all allocated memory on nodes > with a large number of CPUs. > > Common part of the cgroup creation: > Allocs Alloc $1*$2 Sum Allocation > number size > -------------------------------------------- > 16 ~ 352 5632 5632 KERNFS > 1 + 4096 4096 9728 (cgroup_mkdir+0xe4) > 1 584 584 10312 (radix_tree_node_alloc.constprop.0+0x89) > 1 192 192 10504 (__d_alloc+0x29) > 2 72 144 10648 (avc_alloc_node+0x27) > 2 64 128 10776 (percpu_ref_init+0x6a) > 1 64 64 10840 (memcg_list_lru_alloc+0x21a) > percpu: > 1 + 192 192 192 call_site=psi_cgroup_alloc+0x1e > 1 + 96 96 288 call_site=cgroup_rstat_init+0x5f > 2 12 24 312 call_site=percpu_ref_init+0x23 > 1 6 6 318 call_site=__percpu_counter_init+0x22 > > '+' -- to be accounted, > '~' -- partially accounted > > Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Muchun Song <songmuchun@bytedance.com> Thanks. ^ permalink raw reply [flat|nested] 65+ messages in thread
[parent not found: <a1fcdab2-a208-0fad-3f4e-233317ab828f@openvz.org>]
* Re: [PATCH mm v3 9/9] memcg: enable accounting for percpu allocation of struct rt_rq
  [not found] ` <a1fcdab2-a208-0fad-3f4e-233317ab828f@openvz.org>
@ 2022-05-30 15:06   ` Muchun Song
  0 siblings, 0 replies; 65+ messages in thread
From: Muchun Song @ 2022-05-30 15:06 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, LKML, Linux Memory Management List,
	Shakeel Butt, Roman Gushchin, Michal Koutný, Vlastimil Babka,
	Michal Hocko, Cgroups

On Mon, May 30, 2022 at 7:27 PM Vasily Averin <vvs@openvz.org> wrote:
>
> If enabled in config, alloc_rt_sched_group() is called for each new
> cpu cgroup and allocates a huge (~1700 bytes) percpu struct rt_rq.
> This significantly exceeds the size of the percpu allocation in the
> common part of cgroup creation.
>
> Memory allocated during new cpu cgroup creation
> (with enabled RT_GROUP_SCHED):
> common part: ~11Kb + 318 bytes percpu
> cpu cgroup:  ~2.5Kb + ~2800 bytes percpu
>
> Accounting for this memory helps to avoid misuse inside memcg-limited
> containers.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Muchun Song <songmuchun@bytedance.com>

Thanks.

^ permalink raw reply	[flat|nested] 65+ messages in thread
* [PATCH mm v2 1/9] memcg: enable accounting for struct cgroup
  [not found] <Yn6aL3cO7VdrmHHp@carbon>
  2022-05-21 16:37 ` [PATCH mm v2 0/9] memcg: accounting for objects allocated by mkdir cgroup Vasily Averin
@ 2022-05-21 16:37 ` Vasily Averin
  2022-05-22  6:37   ` Muchun Song
  2022-05-21 16:37 ` [PATCH mm v2 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
  ` (7 subsequent siblings)
  9 siblings, 1 reply; 65+ messages in thread
From: Vasily Averin @ 2022-05-21 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups

Creating each new cgroup allocates 4Kb for struct cgroup. This is the
largest memory allocation in this scenario and is especially important
for small VMs with 1-2 CPUs.

Common part of the cgroup creation:
Allocs  Alloc  $1*$2    Sum   Allocation
number  size
--------------------------------------------
16   ~   352   5632    5632   KERNFS
 1   +  4096   4096    9728   (cgroup_mkdir+0xe4)
 1       584    584   10312   (radix_tree_node_alloc.constprop.0+0x89)
 1       192    192   10504   (__d_alloc+0x29)
 2        72    144   10648   (avc_alloc_node+0x27)
 2        64    128   10776   (percpu_ref_init+0x6a)
 1        64     64   10840   (memcg_list_lru_alloc+0x21a)
percpu:
 1   +   192    192     192   call_site=psi_cgroup_alloc+0x1e
 1   +    96     96     288   call_site=cgroup_rstat_init+0x5f
 2        12     24     312   call_site=percpu_ref_init+0x23
 1         6      6     318   call_site=__percpu_counter_init+0x22

'+' -- to be accounted,
'~' -- partially accounted

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..7595127c5b3a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5353,7 +5353,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 
 	/* allocate the cgroup and its ID, 0 is reserved for the root */
 	cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)),
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.36.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread
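The ~11-17Kb per-cgroup cost quoted in these messages becomes observable from
userspace once the allocations are accounted: create a child cgroup and watch
memory.current of the creating task's cgroup grow. A rough test sketch under
cgroup v2 (the mount point and the "test" cgroup name are illustrative, and
the test assumes the process itself runs inside /sys/fs/cgroup/test; readings
are noisy, so treat the delta as approximate):

#include <stdio.h>
#include <sys/stat.h>

/* Reads a single integer from a cgroup control file. */
static long read_current(const char *path)
{
	long val = -1;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

int main(void)
{
	const char *cur = "/sys/fs/cgroup/test/memory.current";
	long before, after;

	before = read_current(cur);
	if (mkdir("/sys/fs/cgroup/test/child", 0755) != 0) {
		perror("mkdir");
		return 1;
	}
	after = read_current(cur);

	/* With this series applied, the delta includes the kernel
	 * objects allocated on mkdir (struct cgroup, kernfs nodes,
	 * percpu state), not only user pages. */
	printf("memory.current delta: %ld bytes\n", after - before);
	return 0;
}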
* Re: [PATCH mm v2 1/9] memcg: enable accounting for struct cgroup
  2022-05-21 16:37 ` [PATCH mm v2 1/9] memcg: enable accounting for struct cgroup Vasily Averin
@ 2022-05-22  6:37   ` Muchun Song
  0 siblings, 0 replies; 65+ messages in thread
From: Muchun Song @ 2022-05-22 6:37 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:37:36PM +0300, Vasily Averin wrote:
> Creating each new cgroup allocates 4Kb for struct cgroup. This is the
> largest memory allocation in this scenario and is especially important
> for small VMs with 1-2 CPUs.
>
> Common part of the cgroup creation:
> Allocs  Alloc  $1*$2    Sum   Allocation
> number  size
> --------------------------------------------
> 16   ~   352   5632    5632   KERNFS
>  1   +  4096   4096    9728   (cgroup_mkdir+0xe4)
>  1       584    584   10312   (radix_tree_node_alloc.constprop.0+0x89)
>  1       192    192   10504   (__d_alloc+0x29)
>  2        72    144   10648   (avc_alloc_node+0x27)
>  2        64    128   10776   (percpu_ref_init+0x6a)
>  1        64     64   10840   (memcg_list_lru_alloc+0x21a)
> percpu:
>  1   +   192    192     192   call_site=psi_cgroup_alloc+0x1e
>  1   +    96     96     288   call_site=cgroup_rstat_init+0x5f
>  2        12     24     312   call_site=percpu_ref_init+0x23
>  1         6      6     318   call_site=__percpu_counter_init+0x22
>
> '+' -- to be accounted,
> '~' -- partially accounted
>
> Accounting for this memory helps to avoid misuse inside memcg-limited
> containers.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread
* [PATCH mm v2 2/9] memcg: enable accounting for kernfs nodes
  [not found] <Yn6aL3cO7VdrmHHp@carbon>
  2022-05-21 16:37 ` [PATCH mm v2 0/9] memcg: accounting for objects allocated by mkdir cgroup Vasily Averin
  2022-05-21 16:37 ` [PATCH mm v2 1/9] memcg: enable accounting for struct cgroup Vasily Averin
@ 2022-05-21 16:37 ` Vasily Averin
  2022-05-22  6:37   ` Muchun Song
  2022-05-21 16:37 ` [PATCH mm v2 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
  ` (6 subsequent siblings)
  9 siblings, 1 reply; 65+ messages in thread
From: Vasily Averin @ 2022-05-21 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups

kernfs nodes are quite small kernel objects, however there are a few
scenarios where they consume a significant share of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, of which ~10Kb
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Creating a new kernfs node usually allocates a few other objects:

Allocs  Alloc  Allocation
number  size
--------------------------------------------
1  +    128   (__kernfs_new_node+0x4d)    kernfs node
1  +     88   (__kernfs_iattrs+0x57)      kernfs iattrs
1  +     96   (simple_xattr_alloc+0x28)   simple_xattr, can grow over 4Kb
1        32   (simple_xattr_set+0x59)
1         8   (__kernfs_new_node+0x30)

'+' -- to be accounted

This patch enables accounting for the kernfs_node_cache slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a..3ac4191b1c40 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -391,7 +391,8 @@ void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache = kmem_cache_create("kernfs_iattrs_cache",
-- 
2.36.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v2 2/9] memcg: enable accounting for kernfs nodes
  2022-05-21 16:37 ` [PATCH mm v2 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
@ 2022-05-22  6:37   ` Muchun Song
  0 siblings, 0 replies; 65+ messages in thread
From: Muchun Song @ 2022-05-22 6:37 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:37:49PM +0300, Vasily Averin wrote:
> kernfs nodes are quite small kernel objects, however there are a few
> scenarios where they consume a significant share of all allocated memory:
>
> 1) creating a new netdevice allocates ~50Kb of memory, of which ~10Kb
>    is allocated for 80+ kernfs nodes.
>
> 2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
>    structures.
>
> 3) Shakeel Butt reports that Google has workloads which create 100s
>    of subcontainers and they have observed high system overhead
>    without memcg accounting of kernfs.
>
> Creating a new kernfs node usually allocates a few other objects:
>
> Allocs  Alloc  Allocation
> number  size
> --------------------------------------------
> 1  +    128   (__kernfs_new_node+0x4d)    kernfs node
> 1  +     88   (__kernfs_iattrs+0x57)      kernfs iattrs
> 1  +     96   (simple_xattr_alloc+0x28)   simple_xattr, can grow over 4Kb
> 1        32   (simple_xattr_set+0x59)
> 1         8   (__kernfs_new_node+0x30)
>
> '+' -- to be accounted
>
> This patch enables accounting for the kernfs_node_cache slab cache.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread
* [PATCH mm v2 3/9] memcg: enable accounting for kernfs iattrs
  [not found] <Yn6aL3cO7VdrmHHp@carbon>
  ` (2 preceding siblings ...)
  2022-05-21 16:37 ` [PATCH mm v2 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
@ 2022-05-21 16:37 ` Vasily Averin
  2022-05-22  6:38   ` Muchun Song
  2022-05-21 16:38 ` [PATCH mm v2 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
  ` (5 subsequent siblings)
  9 siblings, 1 reply; 65+ messages in thread
From: Vasily Averin @ 2022-05-21 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups

kernfs nodes are quite small kernel objects, however there are a few
scenarios where they consume a significant share of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, of which ~10Kb
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Creating a new kernfs node usually allocates a few other objects:

Allocs  Alloc  Allocation
number  size
--------------------------------------------
1  +    128   (__kernfs_new_node+0x4d)    kernfs node
1  +     88   (__kernfs_iattrs+0x57)      kernfs iattrs
1  +     96   (simple_xattr_alloc+0x28)   simple_xattr, can grow over 4Kb
1        32   (simple_xattr_set+0x59)
1         8   (__kernfs_new_node+0x30)

'+' -- to be accounted

This patch enables accounting for the kernfs_iattrs_cache slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 3ac4191b1c40..40e896c7c86b 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -397,5 +397,6 @@ void __init kernfs_init(void)
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache = kmem_cache_create("kernfs_iattrs_cache",
 					      sizeof(struct kernfs_iattrs),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 }
-- 
2.36.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v2 3/9] memcg: enable accounting for kernfs iattrs
  2022-05-21 16:37 ` [PATCH mm v2 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
@ 2022-05-22  6:38   ` Muchun Song
  0 siblings, 0 replies; 65+ messages in thread
From: Muchun Song @ 2022-05-22 6:38 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:37:59PM +0300, Vasily Averin wrote:
> kernfs nodes are quite small kernel objects, however there are a few
> scenarios where they consume a significant share of all allocated memory:
>
> 1) creating a new netdevice allocates ~50Kb of memory, of which ~10Kb
>    is allocated for 80+ kernfs nodes.
>
> 2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
>    structures.
>
> 3) Shakeel Butt reports that Google has workloads which create 100s
>    of subcontainers and they have observed high system overhead
>    without memcg accounting of kernfs.
>
> Creating a new kernfs node usually allocates a few other objects:
>
> Allocs  Alloc  Allocation
> number  size
> --------------------------------------------
> 1  +    128   (__kernfs_new_node+0x4d)    kernfs node
> 1  +     88   (__kernfs_iattrs+0x57)      kernfs iattrs
> 1  +     96   (simple_xattr_alloc+0x28)   simple_xattr, can grow over 4Kb
> 1        32   (simple_xattr_set+0x59)
> 1         8   (__kernfs_new_node+0x30)
>
> '+' -- to be accounted
>
> This patch enables accounting for the kernfs_iattrs_cache slab cache.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread
* [PATCH mm v2 4/9] memcg: enable accounting for struct simple_xattr
  [not found] <Yn6aL3cO7VdrmHHp@carbon>
  ` (3 preceding siblings ...)
  2022-05-21 16:37 ` [PATCH mm v2 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
@ 2022-05-21 16:38 ` Vasily Averin
  2022-05-22  6:38   ` Muchun Song
  2022-05-21 16:38 ` [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
  ` (4 subsequent siblings)
  9 siblings, 1 reply; 65+ messages in thread
From: Vasily Averin @ 2022-05-21 16:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups

kernfs nodes are quite small kernel objects, however there are a few
scenarios where they consume a significant share of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, of which ~10Kb
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Creating a new kernfs node usually allocates a few other objects:

Allocs  Alloc  Allocation
number  size
--------------------------------------------
1  +    128   (__kernfs_new_node+0x4d)    kernfs node
1  +     88   (__kernfs_iattrs+0x57)      kernfs iattrs
1  +     96   (simple_xattr_alloc+0x28)   simple_xattr
1        32   (simple_xattr_set+0x59)
1         8   (__kernfs_new_node+0x30)

'+' -- to be accounted

This patch enables accounting for struct simple_xattr. The size of this
structure depends on userspace usage and can grow beyond 4Kb.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
---
 fs/xattr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 998045165916..31305b941756 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -950,7 +950,7 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
 	if (len < sizeof(*new_xattr))
 		return NULL;
 
-	new_xattr = kvmalloc(len, GFP_KERNEL);
+	new_xattr = kvmalloc(len, GFP_KERNEL_ACCOUNT);
 	if (!new_xattr)
 		return NULL;
-- 
2.36.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v2 4/9] memcg: enable accounting for struct simple_xattr
  2022-05-21 16:38 ` [PATCH mm v2 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
@ 2022-05-22  6:38   ` Muchun Song
  0 siblings, 0 replies; 65+ messages in thread
From: Muchun Song @ 2022-05-22 6:38 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:38:11PM +0300, Vasily Averin wrote:
> kernfs nodes are quite small kernel objects, however there are a few
> scenarios where they consume a significant share of all allocated memory:
>
> 1) creating a new netdevice allocates ~50Kb of memory, of which ~10Kb
>    is allocated for 80+ kernfs nodes.
>
> 2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
>    structures.
>
> 3) Shakeel Butt reports that Google has workloads which create 100s
>    of subcontainers and they have observed high system overhead
>    without memcg accounting of kernfs.
>
> Creating a new kernfs node usually allocates a few other objects:
>
> Allocs  Alloc  Allocation
> number  size
> --------------------------------------------
> 1  +    128   (__kernfs_new_node+0x4d)    kernfs node
> 1  +     88   (__kernfs_iattrs+0x57)      kernfs iattrs
> 1  +     96   (simple_xattr_alloc+0x28)   simple_xattr
> 1        32   (simple_xattr_set+0x59)
> 1         8   (__kernfs_new_node+0x30)
>
> '+' -- to be accounted
>
> This patch enables accounting for struct simple_xattr. The size of this
> structure depends on userspace usage and can grow beyond 4Kb.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread
* [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu
  [not found] <Yn6aL3cO7VdrmHHp@carbon>
  ` (4 preceding siblings ...)
  2022-05-21 16:38 ` [PATCH mm v2 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
@ 2022-05-21 16:38 ` Vasily Averin
  2022-05-21 21:34   ` Shakeel Butt
  ` (2 more replies)
  [not found] ` <c0d01d6e-530c-9be3-1c9b-67a7f8ea09be@openvz.org>
  ` (3 subsequent siblings)
  9 siblings, 3 replies; 65+ messages in thread
From: Vasily Averin @ 2022-05-21 16:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups

struct psi_group_cpu is percpu allocated for each new cgroup and can
consume a significant portion of all allocated memory on nodes with
a large number of CPUs.

Common part of the cgroup creation:
Allocs  Alloc  $1*$2    Sum   Allocation
number  size
--------------------------------------------
16   ~   352   5632    5632   KERNFS
 1   +  4096   4096    9728   (cgroup_mkdir+0xe4)
 1       584    584   10312   (radix_tree_node_alloc.constprop.0+0x89)
 1       192    192   10504   (__d_alloc+0x29)
 2        72    144   10648   (avc_alloc_node+0x27)
 2        64    128   10776   (percpu_ref_init+0x6a)
 1        64     64   10840   (memcg_list_lru_alloc+0x21a)
percpu:
 1   +   192    192     192   call_site=psi_cgroup_alloc+0x1e
 1   +    96     96     288   call_site=cgroup_rstat_init+0x5f
 2        12     24     312   call_site=percpu_ref_init+0x23
 1         6      6     318   call_site=__percpu_counter_init+0x22

'+' -- to be accounted,
'~' -- partially accounted

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
 kernel/sched/psi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a4fa3aadfcba..f0b25380cb12 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -957,7 +957,8 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	if (static_branch_likely(&psi_disabled))
 		return 0;
 
-	cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
+	cgroup->psi.pcpu = alloc_percpu_gfp(struct psi_group_cpu,
+					    GFP_KERNEL_ACCOUNT);
 	if (!cgroup->psi.pcpu)
 		return -ENOMEM;
 	group_init(&cgroup->psi);
-- 
2.36.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread
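Once percpu allocations carry __GFP_ACCOUNT, they surface in the "percpu" row
of a cgroup v2 memory.stat file, which is the easiest place to confirm that
the charging works. A small reader sketch (the cgroup path is illustrative;
the "percpu" key is present in memory.stat on kernels of this era):

#include <stdio.h>
#include <string.h>

/* Prints the accounted percpu bytes of a hypothetical cgroup. */
int main(void)
{
	const char *path = "/sys/fs/cgroup/test/memory.stat";
	char key[64];
	long long val;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fscanf(f, "%63s %lld", key, &val) == 2) {
		if (strcmp(key, "percpu") == 0)
			printf("percpu: %lld bytes\n", val);
	}
	fclose(f);
	return 0;
}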
* Re: [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu
  2022-05-21 16:38 ` [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
@ 2022-05-21 21:34   ` Shakeel Butt
  2022-05-22  6:40   ` Muchun Song
  2022-05-25  1:30   ` Roman Gushchin
  2 siblings, 0 replies; 65+ messages in thread
From: Shakeel Butt @ 2022-05-21 21:34 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, LKML, Linux MM, Roman Gushchin,
	Michal Koutný, Vlastimil Babka, Michal Hocko, Cgroups

On Sat, May 21, 2022 at 9:38 AM Vasily Averin <vvs@openvz.org> wrote:
>
> struct psi_group_cpu is percpu allocated for each new cgroup and can
> consume a significant portion of all allocated memory on nodes with
> a large number of CPUs.
>
> Common part of the cgroup creation:
> Allocs  Alloc  $1*$2    Sum   Allocation
> number  size
> --------------------------------------------
> 16   ~   352   5632    5632   KERNFS
>  1   +  4096   4096    9728   (cgroup_mkdir+0xe4)
>  1       584    584   10312   (radix_tree_node_alloc.constprop.0+0x89)
>  1       192    192   10504   (__d_alloc+0x29)
>  2        72    144   10648   (avc_alloc_node+0x27)
>  2        64    128   10776   (percpu_ref_init+0x6a)
>  1        64     64   10840   (memcg_list_lru_alloc+0x21a)
> percpu:
>  1   +   192    192     192   call_site=psi_cgroup_alloc+0x1e
>  1   +    96     96     288   call_site=cgroup_rstat_init+0x5f
>  2        12     24     312   call_site=percpu_ref_init+0x23
>  1         6      6     318   call_site=__percpu_counter_init+0x22
>
> '+' -- to be accounted,
> '~' -- partially accounted
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu
  2022-05-21 16:38 ` [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
  2022-05-21 21:34   ` Shakeel Butt
@ 2022-05-22  6:40   ` Muchun Song
  2022-05-25  1:30   ` Roman Gushchin
  2 siblings, 0 replies; 65+ messages in thread
From: Muchun Song @ 2022-05-22 6:40 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:38:21PM +0300, Vasily Averin wrote:
> struct psi_group_cpu is percpu allocated for each new cgroup and can
> consume a significant portion of all allocated memory on nodes with
> a large number of CPUs.
>
> Common part of the cgroup creation:
> Allocs  Alloc  $1*$2    Sum   Allocation
> number  size
> --------------------------------------------
> 16   ~   352   5632    5632   KERNFS
>  1   +  4096   4096    9728   (cgroup_mkdir+0xe4)
>  1       584    584   10312   (radix_tree_node_alloc.constprop.0+0x89)
>  1       192    192   10504   (__d_alloc+0x29)
>  2        72    144   10648   (avc_alloc_node+0x27)
>  2        64    128   10776   (percpu_ref_init+0x6a)
>  1        64     64   10840   (memcg_list_lru_alloc+0x21a)
> percpu:
>  1   +   192    192     192   call_site=psi_cgroup_alloc+0x1e
>  1   +    96     96     288   call_site=cgroup_rstat_init+0x5f
>  2        12     24     312   call_site=percpu_ref_init+0x23
>  1         6      6     318   call_site=__percpu_counter_init+0x22
>
> '+' -- to be accounted,
> '~' -- partially accounted
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu
  2022-05-21 16:38 ` [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
  2022-05-21 21:34   ` Shakeel Butt
  2022-05-22  6:40   ` Muchun Song
@ 2022-05-25  1:30   ` Roman Gushchin
  2 siblings, 0 replies; 65+ messages in thread
From: Roman Gushchin @ 2022-05-25 1:30 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:38:21PM +0300, Vasily Averin wrote:
> struct psi_group_cpu is percpu allocated for each new cgroup and can
> consume a significant portion of all allocated memory on nodes with
> a large number of CPUs.
>
> Common part of the cgroup creation:
> Allocs  Alloc  $1*$2    Sum   Allocation
> number  size
> --------------------------------------------
> 16   ~   352   5632    5632   KERNFS
>  1   +  4096   4096    9728   (cgroup_mkdir+0xe4)
>  1       584    584   10312   (radix_tree_node_alloc.constprop.0+0x89)
>  1       192    192   10504   (__d_alloc+0x29)
>  2        72    144   10648   (avc_alloc_node+0x27)
>  2        64    128   10776   (percpu_ref_init+0x6a)
>  1        64     64   10840   (memcg_list_lru_alloc+0x21a)
> percpu:
>  1   +   192    192     192   call_site=psi_cgroup_alloc+0x1e
>  1   +    96     96     288   call_site=cgroup_rstat_init+0x5f
>  2        12     24     312   call_site=percpu_ref_init+0x23
>  1         6      6     318   call_site=__percpu_counter_init+0x22
>
> '+' -- to be accounted,
> '~' -- partially accounted
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

^ permalink raw reply	[flat|nested] 65+ messages in thread
[parent not found: <c0d01d6e-530c-9be3-1c9b-67a7f8ea09be@openvz.org>]
* Re: [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu [not found] ` <c0d01d6e-530c-9be3-1c9b-67a7f8ea09be@openvz.org> @ 2022-05-21 17:58 ` Vasily Averin 2022-05-21 21:35 ` Shakeel Butt ` (2 subsequent siblings) 3 siblings, 0 replies; 65+ messages in thread From: Vasily Averin @ 2022-05-21 17:58 UTC (permalink / raw) To: Andrew Morton Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin, Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups On 5/21/22 19:38, Vasily Averin wrote: > struct cgroup_rstat_cpu is percpu allocated for each new cgroup and > can consume a significant portion of all allocated memory on nodes > with a large number of CPUs. > > Common part of the cgroup creation: > Allocs Alloc $1*$2 Sum Allocation > number size > -------------------------------------------- > 16 ~ 352 5632 5632 KERNFS > 1 + 4096 4096 9728 (cgroup_mkdir+0xe4) > 1 584 584 10312 (radix_tree_node_alloc.constprop.0+0x89) > 1 192 192 10504 (__d_alloc+0x29) > 2 72 144 10648 (avc_alloc_node+0x27) > 2 64 128 10776 (percpu_ref_init+0x6a) > 1 64 64 10840 (memcg_list_lru_alloc+0x21a) > percpu: > 1 + 192 192 192 call_site=psi_cgroup_alloc+0x1e > 1 + 96 96 288 call_site=cgroup_rstat_init+0x5f > 2 12 24 312 call_site=percpu_ref_init+0x23 > 1 6 6 318 call_site=__percpu_counter_init+0x22 > > '+' -- to be accounted, > '~' -- partially accounted > > Signed-off-by: Vasily Averin <vvs@openvz.org> > --- > kernel/cgroup/rstat.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c > index 24b5c2ab5598..f76cb63ae2e0 100644 > --- a/kernel/cgroup/rstat.c > +++ b/kernel/cgroup/rstat.c > @@ -257,7 +257,8 @@ int cgroup_rstat_init(struct cgroup *cgrp) > > /* the root cgrp has rstat_cpu preallocated */ > if (!cgrp->rstat_cpu) { > - cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu); > + cgrp->rstat_cpu = alloc_percpu_gfp(struct cgroup_rstat_cpu "," was lost here > + GFP_KERNEL_ACCOUNT); > if (!cgrp->rstat_cpu) > return -ENOMEM; > } ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu [not found] ` <c0d01d6e-530c-9be3-1c9b-67a7f8ea09be@openvz.org> 2022-05-21 17:58 ` [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin @ 2022-05-21 21:35 ` Shakeel Butt 2022-05-21 22:05 ` kernel test robot 2022-05-25 1:31 ` Roman Gushchin 3 siblings, 0 replies; 65+ messages in thread From: Shakeel Butt @ 2022-05-21 21:35 UTC (permalink / raw) To: Vasily Averin Cc: Andrew Morton, kernel, LKML, Linux MM, Roman Gushchin, Michal Koutný, Vlastimil Babka, Michal Hocko, Cgroups On Sat, May 21, 2022 at 9:38 AM Vasily Averin <vvs@openvz.org> wrote: > > struct cgroup_rstat_cpu is percpu allocated for each new cgroup and > can consume a significant portion of all allocated memory on nodes > with a large number of CPUs. > > Common part of the cgroup creation: > Allocs Alloc $1*$2 Sum Allocation > number size > -------------------------------------------- > 16 ~ 352 5632 5632 KERNFS > 1 + 4096 4096 9728 (cgroup_mkdir+0xe4) > 1 584 584 10312 (radix_tree_node_alloc.constprop.0+0x89) > 1 192 192 10504 (__d_alloc+0x29) > 2 72 144 10648 (avc_alloc_node+0x27) > 2 64 128 10776 (percpu_ref_init+0x6a) > 1 64 64 10840 (memcg_list_lru_alloc+0x21a) > percpu: > 1 + 192 192 192 call_site=psi_cgroup_alloc+0x1e > 1 + 96 96 288 call_site=cgroup_rstat_init+0x5f > 2 12 24 312 call_site=percpu_ref_init+0x23 > 1 6 6 318 call_site=__percpu_counter_init+0x22 > > '+' -- to be accounted, > '~' -- partially accounted > > Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Shakeel Butt <shakeelb@google.com> ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu [not found] ` <c0d01d6e-530c-9be3-1c9b-67a7f8ea09be@openvz.org> 2022-05-21 17:58 ` [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin 2022-05-21 21:35 ` Shakeel Butt @ 2022-05-21 22:05 ` kernel test robot 2022-05-25 1:31 ` Roman Gushchin 3 siblings, 0 replies; 65+ messages in thread From: kernel test robot @ 2022-05-21 22:05 UTC (permalink / raw) To: Vasily Averin, Andrew Morton Cc: kbuild-all, Linux Memory Management List, kernel, linux-kernel, Shakeel Butt, Roman Gushchin, Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups Hi Vasily, Thank you for the patch! Yet something to improve: [auto build test ERROR on tip/sched/core] [also build test ERROR on tj-cgroup/for-next driver-core/driver-core-testing linus/master v5.18-rc7 next-20220520] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/intel-lab-lkp/linux/commits/Vasily-Averin/memcg-enable-accounting-for-struct-cgroup/20220522-004124 base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 991d8d8142cad94f9c5c05db25e67fa83d6f772a config: arm-imxrt_defconfig (https://download.01.org/0day-ci/archive/20220522/202205220531.AVnBFrgq-lkp@intel.com/config) compiler: arm-linux-gnueabi-gcc (GCC) 11.3.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/intel-lab-lkp/linux/commit/c1b7edf1635aaef50d25ba8246a5e5c997a6bf44 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Vasily-Averin/memcg-enable-accounting-for-struct-cgroup/20220522-004124 git checkout c1b7edf1635aaef50d25ba8246a5e5c997a6bf44 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash kernel/cgroup/ If you fix the issue, kindly add following tag where applicable Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): kernel/cgroup/rstat.c: In function 'cgroup_rstat_init': >> kernel/cgroup/rstat.c:261:70: error: macro "alloc_percpu_gfp" requires 2 arguments, but only 1 given 261 | GFP_KERNEL_ACCOUNT); | ^ In file included from include/linux/hrtimer.h:19, from include/linux/sched.h:19, from include/linux/cgroup.h:12, from kernel/cgroup/cgroup-internal.h:5, from kernel/cgroup/rstat.c:2: include/linux/percpu.h:133: note: macro "alloc_percpu_gfp" defined here 133 | #define alloc_percpu_gfp(type, gfp) \ | >> kernel/cgroup/rstat.c:260:35: error: 'alloc_percpu_gfp' undeclared (first use in this function) 260 | cgrp->rstat_cpu = alloc_percpu_gfp(struct cgroup_rstat_cpu | ^~~~~~~~~~~~~~~~ kernel/cgroup/rstat.c:260:35: note: each undeclared identifier is reported only once for each function it appears in vim +/alloc_percpu_gfp +261 kernel/cgroup/rstat.c 253 254 int cgroup_rstat_init(struct cgroup *cgrp) 255 { 256 int cpu; 257 258 /* the root cgrp has rstat_cpu preallocated */ 259 if (!cgrp->rstat_cpu) { > 260 cgrp->rstat_cpu = alloc_percpu_gfp(struct cgroup_rstat_cpu > 261 GFP_KERNEL_ACCOUNT); 262 if (!cgrp->rstat_cpu) 263 return -ENOMEM; 264 } 265 266 /* ->updated_children list is self terminated */ 267 for_each_possible_cpu(cpu) { 268 struct 
cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu); 269 270 rstatc->updated_children = cgrp; 271 u64_stats_init(&rstatc->bsync); 272 } 273 274 return 0; 275 } 276 -- 0-DAY CI Kernel Test Service https://01.org/lkp ^ permalink raw reply [flat|nested] 65+ messages in thread
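The missing comma the robot flags here is the same one Vasily pointed out in
his self-reply above; for reference, the corrected call as it appears in v3 of
this patch (quoted from patch 6/9 earlier on this page):

	cgrp->rstat_cpu = alloc_percpu_gfp(struct cgroup_rstat_cpu,
					   GFP_KERNEL_ACCOUNT);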
* Re: [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu [not found] ` <c0d01d6e-530c-9be3-1c9b-67a7f8ea09be@openvz.org> ` (2 preceding siblings ...) 2022-05-21 22:05 ` kernel test robot @ 2022-05-25 1:31 ` Roman Gushchin 3 siblings, 0 replies; 65+ messages in thread From: Roman Gushchin @ 2022-05-25 1:31 UTC (permalink / raw) To: Vasily Averin Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt, Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups On Sat, May 21, 2022 at 07:38:31PM +0300, Vasily Averin wrote: > struct cgroup_rstat_cpu is percpu allocated for each new cgroup and > can consume a significant portion of all allocated memory on nodes > with a large number of CPUs. > > Common part of the cgroup creation: > Allocs Alloc $1*$2 Sum Allocation > number size > -------------------------------------------- > 16 ~ 352 5632 5632 KERNFS > 1 + 4096 4096 9728 (cgroup_mkdir+0xe4) > 1 584 584 10312 (radix_tree_node_alloc.constprop.0+0x89) > 1 192 192 10504 (__d_alloc+0x29) > 2 72 144 10648 (avc_alloc_node+0x27) > 2 64 128 10776 (percpu_ref_init+0x6a) > 1 64 64 10840 (memcg_list_lru_alloc+0x21a) > percpu: > 1 + 192 192 192 call_site=psi_cgroup_alloc+0x1e > 1 + 96 96 288 call_site=cgroup_rstat_init+0x5f > 2 12 24 312 call_site=percpu_ref_init+0x23 > 1 6 6 318 call_site=__percpu_counter_init+0x22 > > '+' -- to be accounted, > '~' -- partially accounted > > Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> ^ permalink raw reply [flat|nested] 65+ messages in thread
[parent not found: <d7094aa2-1cd0-835c-9fb7-d76003c47dad@openvz.org>]
* Re: [PATCH mm v2 9/9] memcg: enable accounting for percpu allocation of struct rt_rq [not found] ` <d7094aa2-1cd0-835c-9fb7-d76003c47dad@openvz.org> @ 2022-05-21 21:37 ` Shakeel Butt 2022-05-25 1:31 ` Roman Gushchin 1 sibling, 0 replies; 65+ messages in thread From: Shakeel Butt @ 2022-05-21 21:37 UTC (permalink / raw) To: Vasily Averin Cc: Andrew Morton, kernel, LKML, Linux MM, Roman Gushchin, Michal Koutný, Vlastimil Babka, Michal Hocko, Cgroups On Sat, May 21, 2022 at 9:39 AM Vasily Averin <vvs@openvz.org> wrote: > > If enabled in config, alloc_rt_sched_group() is called for each new > cpu cgroup and allocates a huge (~1700 bytes) percpu struct rt_rq. > This significantly exceeds the size of the percpu allocation in the > common part of cgroup creation. > > Memory allocated during new cpu cgroup creation > (with enabled RT_GROUP_SCHED): > common part: ~11Kb + 318 bytes percpu > cpu cgroup: ~2.5Kb + ~2800 bytes percpu > > Accounting for this memory helps to avoid misuse inside memcg-limited > contianers. *containers > > Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Shakeel Butt <shakeelb@google.com> ^ permalink raw reply [flat|nested] 65+ messages in thread
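The rt.c patch body itself is not archived in this view ("parent not found"
above), but the one-line diffstat for kernel/sched/rt.c in the cover letter
together with the description quoted here suggests the per-CPU rt_rq
allocation in alloc_rt_sched_group() gains the accounted gfp mask. A hedged
reconstruction of what such a change looks like (paraphrased from v5.18-era
kernel/sched/rt.c, not the literal patch hunk):

	/* Hedged reconstruction, paraphrased; not the literal patch. */
	for_each_possible_cpu(i) {
		rt_rq = kzalloc_node(sizeof(struct rt_rq),
				     GFP_KERNEL_ACCOUNT, cpu_to_node(i));
		if (!rt_rq)
			goto err;
		/* ...followed by the usual rt_rq/rt_se initialization... */
	}

In mainline of this era the group rt_rq objects are allocated with
kzalloc_node() once per possible CPU rather than as an alloc_percpu() region,
so the reported ~1700-byte cost still scales with the CPU count.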
* Re: [PATCH mm v2 9/9] memcg: enable accounting for percpu allocation of struct rt_rq
  [not found] ` <d7094aa2-1cd0-835c-9fb7-d76003c47dad@openvz.org>
  2022-05-21 21:37   ` [PATCH mm v2 9/9] memcg: enable accounting for percpu allocation of struct rt_rq Shakeel Butt
@ 2022-05-25  1:31   ` Roman Gushchin
  1 sibling, 0 replies; 65+ messages in thread
From: Roman Gushchin @ 2022-05-25 1:31 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Michal Koutný, Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:39:03PM +0300, Vasily Averin wrote:
> If enabled in config, alloc_rt_sched_group() is called for each new
> cpu cgroup and allocates a huge (~1700 bytes) percpu struct rt_rq.
> This significantly exceeds the size of the percpu allocation in the
> common part of cgroup creation.
>
> Memory allocated during new cpu cgroup creation
> (with enabled RT_GROUP_SCHED):
> common part: ~11Kb + 318 bytes percpu
> cpu cgroup:  ~2.5Kb + ~2800 bytes percpu
>
> Accounting for this memory helps to avoid misuse inside memcg-limited
> containers.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

^ permalink raw reply	[flat|nested] 65+ messages in thread
[parent not found: <9925d0ba-40d7-e3a8-1fef-054968b26ce6@openvz.org>]
* Re: [PATCH mm v2 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  [not found] ` <9925d0ba-40d7-e3a8-1fef-054968b26ce6@openvz.org>
@ 2022-05-22  6:47   ` Muchun Song
  0 siblings, 0 replies; 65+ messages in thread
From: Muchun Song @ 2022-05-22 6:47 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, LKML, Linux Memory Management List,
	Shakeel Butt, Roman Gushchin, Michal Koutný, Vlastimil Babka,
	Michal Hocko, Cgroups

On Sun, May 22, 2022 at 12:38 AM Vasily Averin <vvs@openvz.org> wrote:
>
> Creation of each memory cgroup allocates a few huge objects in
> mem_cgroup_css_alloc(). Their size exceeds the amount of memory
> accounted in the common part of cgroup creation:
>
> common part: ~11Kb + 318 bytes percpu
> memcg:       ~17Kb + 4692 bytes percpu
>
> memory:
> ------
> Allocs  Alloc  $1*$2    Sum   Allocation
> number  size
> --------------------------------------------
>  1   +  8192   8192    8192   (mem_cgroup_css_alloc+0x4a)  <NB
> 14   ~   352   4928   13120   KERNFS
>  1   +  2048   2048   15168   (mem_cgroup_css_alloc+0xdd)  <NB
>  1      1024   1024   16192   (alloc_shrinker_info+0x79)
>  1       584    584   16776   (radix_tree_node_alloc.constprop.0+0x89)
>  2        64    128   16904   (percpu_ref_init+0x6a)
>  1        64     64   16968   (mem_cgroup_css_online+0x32)
>
>  1   =  3684   3684    3684   call_site=mem_cgroup_css_alloc+0x9e
>  1   =   984    984    4668   call_site=mem_cgroup_css_alloc+0xfd
>  2        12     24    4692   call_site=percpu_ref_init+0x23
>
> '=' -- already accounted,
> '+' -- to be accounted,
> '~' -- partially accounted
>
> Accounting for this memory helps to avoid misuse inside memcg-limited
> containers.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread
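The two "<NB" lines in the trace (8192 and 2048 bytes at mem_cgroup_css_alloc)
appear to correspond to the struct mem_cgroup itself and its per-node info,
both plain kzalloc-family allocations before this patch. The patch body is not
archived here, so the following is only a hedged sketch of the kind of change
the 4-line mm/memcontrol.c diffstat implies (paraphrased from v5.18-era code;
names and exact form are illustrative, not the literal hunks):

	/* Hedged sketch, paraphrased from mm/memcontrol.c of this era. */
	memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids),
			GFP_KERNEL_ACCOUNT);	/* likely the 8192-byte object */

	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL_ACCOUNT, node);
						/* likely the 2048-byte object */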
[parent not found: <46bbde64-7290-cabb-8fef-6f4a30263d8c@openvz.org>]
* Re: [PATCH mm v2 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group
  [not found] ` <46bbde64-7290-cabb-8fef-6f4a30263d8c@openvz.org>
@ 2022-05-22  6:49   ` Muchun Song
  0 siblings, 0 replies; 65+ messages in thread
From: Muchun Song @ 2022-05-22 6:49 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, LKML, Linux Memory Management List,
	Shakeel Butt, Roman Gushchin, Michal Koutný, Vlastimil Babka,
	Michal Hocko, Cgroups

On Sun, May 22, 2022 at 12:39 AM Vasily Averin <vvs@openvz.org> wrote:
>
> Creating each new cpu cgroup allocates two 512-byte kernel objects
> per CPU. This is especially important for cgroups sharing a parent
> memory cgroup. In this scenario, on nodes with multiple processors,
> these allocations become one of the main memory consumers.
>
> Memory allocated during new cpu cgroup creation:
> common part: ~11Kb + 318 bytes percpu
> cpu cgroup:  ~2.5Kb + 1036 bytes percpu
>
> Accounting for this memory helps to avoid misuse inside memcg-limited
> containers.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread
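The two 512-byte objects per CPU that the description mentions are most likely
the group's cfs_rq and sched_entity, allocated node-locally for every possible
CPU, and the 4-line kernel/sched/fair.c diffstat matches switching both call
sites to an accounted gfp mask. A hedged reconstruction (paraphrased from
v5.18-era alloc_fair_sched_group(), not the literal patch hunks):

	/* Hedged reconstruction, paraphrased; not the literal patch. */
	for_each_possible_cpu(i) {
		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
				      GFP_KERNEL_ACCOUNT, cpu_to_node(i));
		if (!cfs_rq)
			goto err;

		se = kzalloc_node(sizeof(struct sched_entity),
				  GFP_KERNEL_ACCOUNT, cpu_to_node(i));
		if (!se)
			goto err_free_rq;
		/* ...followed by init_cfs_rq()/init_tg_cfs_entry()... */
	}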
Thread overview: 65+ messages
[not found] <Yn6aL3cO7VdrmHHp@carbon>
2022-05-21 16:37 ` [PATCH mm v2 0/9] memcg: accounting for objects allocated by mkdir cgroup Vasily Averin
2022-05-30 11:25 ` [PATCH mm v3 " Vasily Averin
2022-05-30 11:55 ` Michal Hocko
2022-05-30 13:09 ` Vasily Averin
2022-05-30 14:22 ` Michal Hocko
2022-05-30 19:58 ` Vasily Averin
2022-05-31 7:16 ` Michal Hocko
2022-06-01 3:43 ` Vasily Averin
2022-06-01 9:15 ` Michal Koutný
2022-06-01 9:32 ` Michal Hocko
2022-06-01 13:05 ` Michal Hocko
2022-06-01 14:22 ` Roman Gushchin
2022-06-01 15:24 ` Michal Hocko
2022-06-01 9:26 ` Michal Hocko
2022-06-13 5:34 ` [PATCH mm v4 " Vasily Averin
2022-06-23 14:50 ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Vasily Averin
2022-06-23 15:03 ` Vasily Averin
2022-06-23 16:07 ` Michal Hocko
2022-06-23 16:55 ` Shakeel Butt
2022-06-24 10:40 ` Vasily Averin
2022-06-24 12:26 ` Michal Koutný
2022-06-24 13:59 ` Michal Hocko
2022-06-25 9:43 ` [PATCH RFC] memcg: avoid idr ids space depletion Vasily Averin
[not found] ` <c53e1df0-5174-66de-23cc-18797f0b512d@openvz.org>
2022-06-26 1:56 ` [PATCH RFC] memcg: notify about global mem_cgroup_id " Roman Gushchin
[not found] ` <97bed1fd-f230-c2ea-1cb6-8230825a9a64@openvz.org>
2022-06-27 3:23 ` [PATCH mm v2] " Muchun Song
[not found] ` <f3e4059c-69ea-eccd-a22f-9f6c6780f33a@openvz.org>
2022-06-28 1:11 ` Roman Gushchin
2022-06-28 9:08 ` Michal Koutný
2022-06-27 16:37 ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Shakeel Butt
2022-07-01 11:03 ` Michal Hocko
2022-07-10 18:53 ` Vasily Averin
2022-07-11 16:24 ` Michal Hocko
2022-06-23 14:50 ` [PATCH mm v5 1/9] memcg: enable accounting for struct cgroup Vasily Averin
2022-06-23 14:50 ` [PATCH mm v5 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
2022-06-23 14:51 ` [PATCH mm v5 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
2022-06-13 5:34 ` [PATCH mm v4 1/9] memcg: enable accounting for struct cgroup Vasily Averin
2022-06-13 5:34 ` [PATCH mm v4 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
2022-06-13 5:34 ` [PATCH mm v4 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
[not found] ` <cover.1653899364.git.vvs@openvz.org>
2022-05-30 11:25 ` [PATCH mm v3 1/9] memcg: enable accounting for struct cgroup Vasily Averin
2022-05-30 11:26 ` [PATCH mm v3 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
2022-05-30 11:26 ` [PATCH mm v3 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
2022-05-30 11:26 ` [PATCH mm v3 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
2022-05-30 11:26 ` [PATCH mm v3 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
2022-05-30 11:26 ` [PATCH mm v3 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
2022-05-30 15:04 ` Muchun Song
[not found] ` <a1fcdab2-a208-0fad-3f4e-233317ab828f@openvz.org>
2022-05-30 15:06     ` [PATCH mm v3 9/9] memcg: enable accounting for percpu allocation of struct rt_rq Muchun Song
2022-05-21 16:37 ` [PATCH mm v2 1/9] memcg: enable accounting for struct cgroup Vasily Averin
2022-05-22 6:37 ` Muchun Song
2022-05-21 16:37 ` [PATCH mm v2 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
2022-05-22 6:37 ` Muchun Song
2022-05-21 16:37 ` [PATCH mm v2 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
2022-05-22 6:38 ` Muchun Song
2022-05-21 16:38 ` [PATCH mm v2 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
2022-05-22 6:38 ` Muchun Song
2022-05-21 16:38 ` [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
2022-05-21 21:34 ` Shakeel Butt
2022-05-22 6:40 ` Muchun Song
2022-05-25 1:30 ` Roman Gushchin
[not found] ` <c0d01d6e-530c-9be3-1c9b-67a7f8ea09be@openvz.org>
2022-05-21 17:58 ` [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
2022-05-21 21:35 ` Shakeel Butt
2022-05-21 22:05 ` kernel test robot
2022-05-25 1:31 ` Roman Gushchin
[not found] ` <d7094aa2-1cd0-835c-9fb7-d76003c47dad@openvz.org>
2022-05-21 21:37 ` [PATCH mm v2 9/9] memcg: enable accounting for percpu allocation of struct rt_rq Shakeel Butt
2022-05-25 1:31 ` Roman Gushchin
[not found] ` <9925d0ba-40d7-e3a8-1fef-054968b26ce6@openvz.org>
2022-05-22 6:47 ` [PATCH mm v2 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Muchun Song
[not found] ` <46bbde64-7290-cabb-8fef-6f4a30263d8c@openvz.org>
2022-05-22 6:49 ` [PATCH mm v2 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Muchun Song