* [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18 8:13 Vasily Averin
[not found] ` <9d10df01-0127-fb40-81c3-cc53c9733c3e-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
0 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-18 8:13 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-GEFAQzZX7r8dnm+yROfE0A
[-- Attachment #1: Type: text/plain, Size: 2770 bytes --]
While checking the patches that fixed broken memcg accounting in vmalloc,
I found another issue: a false global OOM triggered by a memcg-limited
user-space task.

I ran a vmalloc-eater in a loop inside a memcg-limited LXC container and
checked that it does not consume host memory beyond the assigned limit,
triggers a memcg OOM and generates "Memory cgroup out of memory" messages.
Everything was as expected.

However, I was surprised to find quite rare global OOM messages too.
I set sysctl vm.panic_on_oom to 1, repeated the test and successfully
crashed the node.

Dmesg showed that a global OOM was detected on a 16 GB node with ~10 GB
of free memory.
syz-executor invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=1000
CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
dump_header+0x4a/0x2c1
out_of_memory.cold+0xa/0x7e
pagefault_out_of_memory+0x46/0x60
exc_page_fault+0x79/0x2b0
asm_exc_page_fault+0x1e/0x30
...
Mem-Info:
Node 0 DMA: 0*4kB 0*8kB <...> = 13296kB
Node 0 DMA32: 705*4kB (UM) <...> = 2586964kB
Node 0 Normal: 2743*4kB (UME) <...> = 6904828kB
...
4095866 pages RAM
...
Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
Full dmesg can be found in attached file.
How could this happen?
A user-space task inside the memcg-limited container generated a page
fault, its handler do_user_addr_fault() called handle_mm_fault(), which
could not allocate the page because the memcg limit was exceeded and
returned VM_FAULT_OOM. Then do_user_addr_fault() called
pagefault_out_of_memory(), which executed out_of_memory() without any
memcg set.
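The sequence can be modeled in a few lines of userspace C (a sketch only:
the function names mirror the kernel's, but the bodies are simplified
stand-ins, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

#define VM_FAULT_OOM 0x1u

/* Illustrative model: the container has exhausted its memcg limit. */
static bool memcg_over_limit = true;
static bool global_oom_invoked = false;

/* handle_mm_fault(): the memcg charge fails, but the fault handler can
 * only report a generic VM_FAULT_OOM -- the "why" is lost here. */
static unsigned int handle_mm_fault(void)
{
	if (memcg_over_limit)
		return VM_FAULT_OOM;
	return 0;
}

/* pagefault_out_of_memory(): runs out_of_memory() with no memcg set,
 * i.e. a *global* OOM, even though only the memcg was exhausted. */
static void pagefault_out_of_memory(void)
{
	global_oom_invoked = true;
}

static void do_user_addr_fault(void)
{
	if (handle_mm_fault() & VM_FAULT_OOM)
		pagefault_out_of_memory();
}
```

This shows how a purely memcg-local failure ends up invoking the global
OOM path.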
This problem partially depends on one of my recent patches, which
disabled unlimited memory allocation for dying tasks. However, I think
the problem can also happen for non-killed tasks, for example because of
the kmem limit.
At present do_user_addr_fault() does not know why the page allocation
failed, i.e. whether it was a global or a memcg OOM. I propose to save
this information in a new flag on task_struct. It can be set in case of
memcg restrictions in try_charge_memcg() (for the memory controller) and
in obj_cgroup_charge_pages() (for the kmem controller). Then it can be
used in mem_cgroup_oom_synchronize(), called inside
pagefault_out_of_memory(): in case of memcg-related restrictions it will
not trigger a fake global OOM, but return to user space, which will
retry the fault or kill the process if it got a fatal signal.
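The proposed logic can be sketched like this (a minimal userspace model;
the flag name `memcg_oom_hit` is hypothetical and the bodies are
simplified stand-ins for the kernel functions of the same names):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task_struct flag proposed by the patch. */
struct task_model {
	bool memcg_oom_hit;	/* last failed charge was memcg-constrained */
};

static struct task_model current_task;
static bool global_oom_invoked;

/* Charge path (try_charge_memcg()/obj_cgroup_charge_pages()): on a
 * failure against a memcg limit, record that fact on the task. */
static int charge_pages(bool over_memcg_limit)
{
	if (over_memcg_limit) {
		current_task.memcg_oom_hit = true;
		return -1;	/* -ENOMEM in the kernel */
	}
	return 0;
}

/* mem_cgroup_oom_synchronize() as called from pagefault_out_of_memory():
 * if the failure was memcg-constrained, do NOT declare a global OOM;
 * just return so the task retries the fault (or dies on a fatal signal). */
static void pagefault_out_of_memory(void)
{
	if (current_task.memcg_oom_hit) {
		current_task.memcg_oom_hit = false;
		return;
	}
	global_oom_invoked = true;
}
```

A memcg-constrained failure now returns quietly, while a genuine global
shortage (no flag set) still reaches the global OOM path.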
Thank you,
Vasily Averin
Vasily Averin (1):
memcg: prevent false global OOM triggered by memcg-limited task.
include/linux/sched.h | 1 +
mm/memcontrol.c | 12 +++++++++---
2 files changed, 10 insertions(+), 3 deletions(-)
--
2.32.0
[-- Attachment #2: dmesg-oom.txt --]
[-- Type: text/plain, Size: 25650 bytes --]
[59622.176098] syz-executor invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=1000
[59622.178633] CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
[59622.180840] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
[59622.182562] Call Trace:
[59622.183525] dump_stack_lvl+0x57/0x72
[59622.184782] dump_header+0x4a/0x2c1
[59622.185929] oom_kill_process.cold+0xb/0x10
[59622.187203] out_of_memory+0x229/0x5b0
[59622.188399] mem_cgroup_out_of_memory+0x111/0x130
[59622.189773] try_charge_memcg+0x693/0x720
[59622.191013] ? kvm_sched_clock_read+0x14/0x30
[59622.192318] charge_memcg+0x57/0x170
[59622.193482] mem_cgroup_swapin_charge_page+0x99/0x1d0
[59622.194932] do_swap_page+0x916/0xbf0
[59622.196110] ? __lock_acquire+0x3b3/0x1e00
[59622.197377] __handle_mm_fault+0xa5f/0x14e0
[59622.198649] ? lock_acquire+0xc4/0x2e0
[59622.199847] handle_mm_fault+0x149/0x3f0
[59622.201071] do_user_addr_fault+0x1f4/0x6c0
[59622.202346] exc_page_fault+0x79/0x2b0
[59622.203545] ? asm_exc_page_fault+0x8/0x30
[59622.204802] asm_exc_page_fault+0x1e/0x30
[59622.206043] RIP: 0033:0x41a948
[59622.207115] Code: 64 83 0c 25 08 03 00 00 10 64 48 8b 3c 25 00 03 00 00 e8 8b fe ff ff f4 66 2e 0f 1f 84 00 00 00 00 00 f7 c7 02 00 00 00 75 27 <64> 8b 04 25 08 03 00 00 41 89 c3 41 83 e3 fd f0 64 44 0f b1 1c 25
[59622.211557] RSP: 002b:00007ffd10ebf6e8 EFLAGS: 00010246
[59622.213058] RAX: 0000000000000000 RBX: 000000000119cf40 RCX: 000000000041b331
[59622.214919] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[59622.216772] RBP: 000000000119d940 R08: 0000000000000000 R09: 0000000000000000
[59622.218623] R10: 00007ffd10ebf7c0 R11: 0000000000000293 R12: 00000000038dc0e1
[59622.220467] R13: 00000000038dbdc3 R14: 20c49ba5e353f7cf R15: ffffffffffffffff
[59622.222367] memory: usage 524268kB, limit 524288kB, failcnt 28608
[59622.224112] memory+swap: usage 554776kB, limit 9007199254740988kB, failcnt 0
[59622.225982] kmem: usage 524168kB, limit 9007199254740988kB, failcnt 0
[59622.227740] Memory cgroup stats for /lxc.payload.test:
[59622.234613] anon 0
file 122880
kernel_stack 1294336
pagetables 3665920
percpu 490896
sock 0
shmem 0
file_mapped 12288
file_dirty 0
file_writeback 0
swapcached 49008640
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 0
active_anon 0
inactive_file 110592
active_file 0
unevictable 0
slab_reclaimable 1513744
slab_unreclaimable 14117008
slab 15630752
workingset_refault_anon 3800
workingset_refault_file 24974
workingset_activate_anon 923
workingset_activate_file 414
workingset_restore_anon 422
workingset_restore_file 190
workingset_nodereclaim 74
pgfault 120040
pgmajfault 4766
pgrefill 15999
pgscan 486263
pgsteal 44646
pgactivate 13066
pgdeactivate 14277
pglazyfree 60
pglazyfreed 34
thp_fault_alloc 0
thp_collapse_alloc 0
[59622.265206] Tasks state (memory values in pages):
[59622.266454] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[59622.268394] [ 14696] 0 14696 56226 690 90112 145 0 bash
[59622.270246] [ 14746] 0 14746 262198 0 266240 2900 0 syz-execprog
[59622.272217] [ 14764] 0 14764 12552 0 36864 16 0 syz-executor
[59622.274217] [ 14765] 0 14765 12552 0 36864 16 0 syz-executor
[59622.276165] [ 14769] 0 14769 12552 0 36864 16 0 syz-executor
[59622.278091] [ 14772] 0 14772 12552 0 36864 17 0 syz-executor
[59622.280023] [ 14775] 0 14775 12552 0 36864 17 0 syz-executor
[59622.281949] [ 14777] 0 14777 12552 0 36864 16 0 syz-executor
[59622.283847] [ 14781] 0 14781 12552 0 36864 16 0 syz-executor
[59622.285737] [ 14783] 0 14783 12552 0 36864 17 0 syz-executor
[59622.287673] [ 14787] 0 14787 12552 0 36864 16 0 syz-executor
[59622.287968] systemd-journald[648]: Compressed data object 720 -> 393 using ZSTD
[59622.289564] [ 14800] 0 14800 12552 0 36864 16 0 syz-executor
[59622.293144] [ 14803] 0 14803 12552 0 36864 16 0 syz-executor
[59622.295054] [ 14805] 0 14805 12552 0 36864 16 0 syz-executor
[59622.296928] [ 14812] 0 14812 12552 0 36864 17 0 syz-executor
[59622.298796] [ 14814] 0 14814 12552 0 36864 16 0 syz-executor
[59622.300663] [ 14817] 0 14817 12552 0 36864 17 0 syz-executor
[59622.302528] [ 14820] 0 14820 12552 0 36864 16 0 syz-executor
[59622.304399] [ 14779] 0 14779 12551 0 49152 37 0 syz-executor
[59622.306271] [ 15280] 0 15279 12584 0 69632 0 1000 syz-executor
[59622.308144] [ 14767] 0 14767 12551 0 49152 41 0 syz-executor
[59622.310046] [ 15282] 0 15281 12584 0 69632 0 1000 syz-executor
[59622.311926] [ 14766] 0 14766 12551 0 49152 44 0 syz-executor
[59622.313812] [ 15286] 0 15285 12584 0 65536 0 1000 syz-executor
[59622.315704] [ 14782] 0 14782 12551 0 49152 35 0 syz-executor
[59622.317621] [ 15284] 0 15283 12584 0 65536 0 1000 syz-executor
[59622.319570] [ 14784] 0 14784 12551 0 49152 36 0 syz-executor
[59622.321496] [ 15288] 0 15287 12617 0 65536 0 1000 syz-executor
[59622.323454] [ 14776] 0 14776 12551 0 49152 34 0 syz-executor
[59622.325374] [ 15291] 0 15289 12584 0 69632 4 1000 syz-executor
[59622.327293] [ 14813] 0 14813 12551 0 49152 33 0 syz-executor
[59622.329216] [ 15295] 0 15294 12584 0 65536 0 1000 syz-executor
[59622.331139] [ 14819] 0 14819 12551 0 49152 32 0 syz-executor
[59622.333068] [ 15293] 0 15292 12584 0 65536 0 1000 syz-executor
[59622.334991] [ 14821] 0 14821 12551 0 49152 31 0 syz-executor
[59622.336921] [ 15297] 0 15296 12584 0 65536 0 1000 syz-executor
[59622.338875] [ 14831] 0 14831 12551 0 49152 42 0 syz-executor
[59622.340811] [ 15299] 0 15298 12617 0 65536 0 1000 syz-executor
[59622.342752] [ 14804] 0 14804 12551 0 49152 38 0 syz-executor
[59622.344700] [ 15301] 0 15300 12584 0 69632 0 1000 syz-executor
[59622.346649] [ 14816] 0 14816 12551 0 49152 27 0 syz-executor
[59622.348604] [ 15306] 0 15305 12584 0 69632 0 1000 syz-executor
[59622.350615] [ 14774] 0 14774 12551 0 49152 30 0 syz-executor
[59622.352606] [ 15304] 0 15303 12584 0 65536 0 1000 syz-executor
[59622.354575] [ 14807] 0 14807 12551 0 49152 25 0 syz-executor
[59622.356531] [ 15307] 0 15307 12584 0 69632 37 1000 syz-executor
[59622.358481] [ 14801] 0 14801 12551 0 49152 23 0 syz-executor
[59622.360418] [ 15261] 0 15260 12618 0 65536 31 1000 syz-executor
[59622.362355] [ 14788] 0 14788 12551 0 49152 24 0 syz-executor
[59622.364291] [ 15263] 0 15262 12618 0 69632 0 1000 syz-executor
[59622.366269] [ 14503] 0 14503 25205 1712 241664 235 0 systemd-journal
[59622.368288] [ 14736] 0 14736 1637 398 61440 32 0 agetty
[59622.370161] [ 14737] 0 14737 1637 383 61440 31 0 agetty
[59622.372037] [ 14743] 0 14743 1637 398 61440 31 0 agetty
[59622.373939] [ 14744] 0 14744 1637 387 57344 31 0 agetty
[59622.375849] [ 14512] 0 14512 22648 1593 221184 233 0 systemd-logind
[59622.377858] [ 14513] 81 14513 13516 801 155648 138 -900 dbus-daemon
[59622.379867] [ 14734] 0 14734 52899 856 180224 440 0 rsyslogd
[59622.381813] [ 14735] 0 14735 1637 343 61440 31 0 agetty
[59622.383701] [ 14741] 0 14741 5724 572 94208 239 0 crond
[59622.385568] [ 14086] 0 14086 25345 2120 233472 520 0 systemd
[59622.387470] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=lxc.payload.test,mems_allowed=0,oom_memcg=/lxc.payload.test,task_memcg=/lxc.payload.test,task=syz-executor,pid=15307,uid=0
[59622.391402] Memory cgroup out of memory: Killed process 15307 (syz-executor) total-vm:50336kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:68kB oom_score_adj:1000
[59622.395544] oom_reaper: reaped process 15307 (syz-executor), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[59622.411181] syz-executor invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=1000
[59622.414272] CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
[59622.416247] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
[59622.417950] Call Trace:
[59622.418901] dump_stack_lvl+0x57/0x72
[59622.420067] dump_header+0x4a/0x2c1
[59622.421200] out_of_memory.cold+0xa/0x7e
[59622.422415] pagefault_out_of_memory+0x46/0x60
[59622.423784] exc_page_fault+0x79/0x2b0
[59622.424966] ? asm_exc_page_fault+0x8/0x30
[59622.426207] asm_exc_page_fault+0x1e/0x30
[59622.427426] RIP: 0033:0x41a948
[59622.428485] Code: Unable to access opcode bytes at RIP 0x41a91e.
[59622.430064] RSP: 002b:00007ffd10ebf6e8 EFLAGS: 00010246
[59622.431513] RAX: 0000000000000000 RBX: 000000000119cf40 RCX: 000000000041b331
[59622.433299] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[59622.435090] RBP: 000000000119d940 R08: 0000000000000000 R09: 0000000000000000
[59622.436875] R10: 00007ffd10ebf7c0 R11: 0000000000000293 R12: 00000000038dc0e1
[59622.438691] R13: 00000000038dbdc3 R14: 20c49ba5e353f7cf R15: ffffffffffffffff
[59622.440547] Mem-Info:
[59622.441516] active_anon:279 inactive_anon:24996 isolated_anon:0
active_file:395743 inactive_file:848017 isolated_file:0
unevictable:0 dirty:159 writeback:21
slab_reclaimable:40993 slab_unreclaimable:45236
mapped:66928 shmem:286 pagetables:2477 bounce:0
kernel_misc_reclaimable:0
free:2376272 free_pcp:7579 free_cma:0
[59622.451730] Node 0 active_anon:1116kB inactive_anon:99984kB active_file:1582972kB inactive_file:3392068kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:267712kB dirty:832kB writeback:84kB shmem:1144kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:5488kB pagetables:9908kB all_unreclaimable? no
[59622.458826] Node 0 DMA free:13296kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[59622.465131] lowmem_reserve[]: 0 2706 15265 15265 15265
[59622.466746] Node 0 DMA32 free:2586964kB min:11968kB low:14960kB high:17952kB reserved_highatomic:0KB active_anon:0kB inactive_anon:20kB active_file:27084kB inactive_file:172344kB unevictable:0kB writepending:0kB present:3129200kB managed:2801520kB mlocked:0kB bounce:0kB free_pcp:6120kB local_pcp:1004kB free_cma:0kB
[59622.473718] lowmem_reserve[]: 0 0 12559 12559 12559
[59622.475311] Node 0 Normal free:6904828kB min:55544kB low:69428kB high:83312kB reserved_highatomic:0KB active_anon:1116kB inactive_anon:99964kB active_file:1555888kB inactive_file:3219724kB unevictable:0kB writepending:916kB present:13238272kB managed:12871060kB mlocked:0kB bounce:0kB free_pcp:24068kB local_pcp:692kB free_cma:0kB
[59622.482647] lowmem_reserve[]: 0 0 0 0 0
[59622.484096] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 0*1024kB 2*2048kB (UM) 2*4096kB (M) = 13296kB
[59622.488044] Node 0 DMA32: 705*4kB (UM) 624*8kB (UME) 497*16kB (UME) 594*32kB (UME) 532*64kB (UME) 49*128kB (UME) 36*256kB (UE) 36*512kB (UE) 20*1024kB (U) 17*2048kB (UE) 593*4096kB (U) = 2586964kB
[59622.492687] Node 0 Normal: 2743*4kB (UME) 2084*8kB (UME) 536*16kB (UME) 1496*32kB (UME) 1148*64kB (ME) 171*128kB (UME) 115*256kB (M) 94*512kB (M) 98*1024kB (UME) 163*2048kB (UME) 1517*4096kB (UM) = 6904828kB
[59622.497553] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[59622.499898] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[59622.502197] 1244068 total pagecache pages
[59622.503688] 4 pages in swap cache
[59622.505046] Swap cache stats: add 54921, delete 54922, find 137/9740
[59622.506956] Free swap = 8357116kB
[59622.508352] Total swap = 8388604kB
[59622.510564] 4095866 pages RAM
[59622.512379] 0 pages HighMem/MovableOnly
[59622.513853] 173881 pages reserved
[59622.515220] 0 pages cma reserved
[59622.516581] 0 pages hwpoisoned
[59622.517897] Tasks state (memory values in pages):
[59622.519524] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[59622.521895] [ 648] 0 648 71491 45126 552960 0 -250 systemd-journal
[59622.524550] [ 666] 0 666 11976 3110 102400 0 -1000 systemd-udevd
[59622.526944] [ 797] 193 797 10732 5087 122880 0 0 systemd-resolve
[59622.529352] [ 798] 0 798 26617 605 69632 0 -1000 auditd
[59622.531635] [ 800] 0 800 2077 1056 49152 0 0 sedispatch
[59622.533963] [ 821] 0 821 19874 761 57344 0 0 irqbalance
[59622.536297] [ 823] 0 823 2921 533 45056 0 0 mcelog
[59622.538590] [ 824] 996 824 666787 5873 217088 0 0 polkitd
[59622.540882] [ 825] 0 825 20273 951 53248 0 0 qemu-ga
[59622.543158] [ 826] 0 826 205022 10950 1179648 0 0 rsyslogd
[59622.545449] [ 827] 0 827 12186 2965 102400 0 0 sssd
[59622.547678] [ 828] 0 828 4502 2175 73728 0 0 systemd-homed
[59622.550047] [ 829] 0 829 4466 2118 69632 0 0 systemd-machine
[59622.552466] [ 830] 81 830 2530 1083 69632 0 -900 dbus-broker-lau
[59622.554852] [ 835] 0 835 68167 3994 139264 0 0 abrtd
[59622.557060] [ 839] 982 839 23747 967 69632 0 0 chronyd
[59622.559289] [ 840] 81 840 1525 919 49152 0 -900 dbus-broker
[59622.561576] [ 843] 0 843 12718 3366 110592 0 0 sssd_be
[59622.563803] [ 847] 0 847 177006 5860 655360 0 0 abrt-dump-journ
[59622.566207] [ 849] 0 849 174956 6978 827392 0 0 abrt-dump-journ
[59622.568518] [ 850] 0 850 15570 9663 143360 0 0 sssd_nss
[59622.570712] [ 857] 0 857 4966 2724 81920 0 0 systemd-logind
[59622.572980] [ 860] 0 860 79033 2793 118784 0 0 ModemManager
[59622.575247] [ 861] 0 861 34022 10210 155648 0 0 firewalld
[59622.577428] [ 867] 0 867 66866 5087 147456 0 0 NetworkManager
[59622.579690] [ 881] 0 881 7636 2072 73728 0 -1000 sshd
[59622.581790] [ 887] 0 887 13477 801 77824 0 0 gssproxy
[59622.583916] [ 941] 0 941 5246 715 61440 0 0 atd
[59622.585957] [ 943] 0 943 4474 892 65536 0 0 crond
[59622.588007] [ 961] 0 961 2408 455 49152 0 0 agetty
[59622.590051] [ 962] 0 962 3050 458 57344 0 0 agetty
[59622.592069] [ 1018] 991 1018 6645 533 65536 0 0 dnsmasq
[59622.594114] [ 1060] 0 1060 11399 2828 90112 0 0 sshd
[59622.596075] [ 1459] 0 1459 4387 1924 73728 0 0 systemd-userdbd
[59622.598197] [ 1488] 1000 1488 7862 3570 90112 0 0 systemd
[59622.600204] [ 1490] 1000 1490 12058 1458 102400 0 0 (sd-pam)
[59622.602218] [ 1555] 1000 1555 11399 1367 81920 0 0 sshd
[59622.604170] [ 1561] 1000 1561 5188 1443 61440 0 0 bash
[59622.606168] [ 2078] 991 2078 6645 532 61440 0 0 dnsmasq
[59622.608154] [ 2080] 0 2080 6638 100 61440 0 0 dnsmasq
[59622.610161] [ 6747] 1000 6747 10102 2101 94208 0 0 su
[59622.612074] [ 6751] 0 6751 4504 1424 65536 0 0 bash
[59622.614022] [ 14077] 0 14077 4470 1927 73728 0 0 systemd-userwor
[59622.616135] [ 14078] 0 14078 4470 1908 77824 0 0 systemd-userwor
[59622.618252] [ 14080] 0 14080 4470 1929 69632 0 0 systemd-userwor
[59622.620358] [ 14085] 0 14085 2069 642 57344 0 0 lxc-start
[59622.622379] [ 14086] 0 14086 25345 2120 233472 520 0 systemd
[59622.624401] [ 14503] 0 14503 25205 1712 241664 235 0 systemd-journal
[59622.626610] [ 14512] 0 14512 22648 1593 221184 233 0 systemd-logind
[59622.628684] [ 14513] 81 14513 13516 801 155648 138 -900 dbus-daemon
[59622.630714] [ 14694] 0 14694 2072 1104 49152 0 0 3
[59622.632591] [ 14696] 0 14696 56226 690 90112 145 0 bash
[59622.634510] [ 14734] 0 14734 52899 856 180224 440 0 rsyslogd
[59622.636496] [ 14735] 0 14735 1637 343 61440 31 0 agetty
[59622.638468] [ 14736] 0 14736 1637 398 61440 32 0 agetty
[59622.640417] [ 14737] 0 14737 1637 383 61440 31 0 agetty
[59622.642356] [ 14741] 0 14741 5724 572 94208 239 0 crond
[59622.644286] [ 14743] 0 14743 1637 398 61440 31 0 agetty
[59622.646226] [ 14744] 0 14744 1637 387 57344 31 0 agetty
[59622.648164] [ 14746] 0 14746 262198 0 266240 2900 0 syz-execprog
[59622.650200] [ 14764] 0 14764 12552 0 36864 16 0 syz-executor
[59622.652255] [ 14765] 0 14765 12552 0 36864 16 0 syz-executor
[59622.654283] [ 14766] 0 14766 12551 0 49152 44 0 syz-executor
[59622.656306] [ 14767] 0 14767 12551 0 49152 41 0 syz-executor
[59622.658318] [ 14769] 0 14769 12552 0 36864 16 0 syz-executor
[59622.660328] [ 14772] 0 14772 12552 0 36864 17 0 syz-executor
[59622.662330] [ 14774] 0 14774 12551 0 49152 30 0 syz-executor
[59622.664337] [ 14775] 0 14775 12552 0 36864 17 0 syz-executor
[59622.666456] [ 14776] 0 14776 12551 0 49152 34 0 syz-executor
[59622.668449] [ 14777] 0 14777 12552 0 36864 16 0 syz-executor
[59622.670459] [ 14779] 0 14779 12551 0 49152 37 0 syz-executor
[59622.672445] [ 14781] 0 14781 12552 0 36864 16 0 syz-executor
[59622.674449] [ 14782] 0 14782 12551 0 49152 35 0 syz-executor
[59622.676423] [ 14783] 0 14783 12552 0 36864 17 0 syz-executor
[59622.678418] [ 14784] 0 14784 12551 0 49152 36 0 syz-executor
[59622.680430] [ 14787] 0 14787 12552 0 36864 16 0 syz-executor
[59622.682398] [ 14788] 0 14788 12551 0 49152 24 0 syz-executor
[59622.684359] [ 14800] 0 14800 12552 0 36864 16 0 syz-executor
[59622.686357] [ 14801] 0 14801 12551 0 49152 23 0 syz-executor
[59622.688449] [ 14803] 0 14803 12552 0 36864 16 0 syz-executor
[59622.690434] [ 14804] 0 14804 12551 0 49152 38 0 syz-executor
[59622.692466] [ 14805] 0 14805 12552 0 36864 16 0 syz-executor
[59622.694436] [ 14807] 0 14807 12551 0 49152 25 0 syz-executor
[59622.696378] [ 14812] 0 14812 12552 0 36864 17 0 syz-executor
[59622.698313] [ 14813] 0 14813 12551 0 49152 33 0 syz-executor
[59622.700252] [ 14814] 0 14814 12552 0 36864 16 0 syz-executor
[59622.702186] [ 14816] 0 14816 12551 0 49152 27 0 syz-executor
[59622.704126] [ 14817] 0 14817 12552 0 36864 17 0 syz-executor
[59622.706053] [ 14819] 0 14819 12551 0 49152 32 0 syz-executor
[59622.707978] [ 14820] 0 14820 12552 0 36864 16 0 syz-executor
[59622.709923] [ 14821] 0 14821 12551 0 49152 31 0 syz-executor
[59622.711847] [ 14831] 0 14831 12551 0 49152 42 0 syz-executor
[59622.713773] [ 15261] 0 15260 12618 0 65536 31 1000 syz-executor
[59622.715691] [ 15263] 0 15262 12618 0 69632 0 1000 syz-executor
[59622.717604] [ 15280] 0 15279 12584 0 69632 0 1000 syz-executor
[59622.719506] [ 15282] 0 15281 12584 0 69632 0 1000 syz-executor
[59622.721401] [ 15284] 0 15283 12584 0 65536 0 1000 syz-executor
[59622.723316] [ 15286] 0 15285 12584 0 65536 0 1000 syz-executor
[59622.725243] [ 15288] 0 15287 12617 0 65536 0 1000 syz-executor
[59622.727121] [ 15291] 0 15289 12584 0 69632 4 1000 syz-executor
[59622.728993] [ 15293] 0 15292 12584 0 65536 0 1000 syz-executor
[59622.730857] [ 15295] 0 15294 12584 0 65536 0 1000 syz-executor
[59622.732722] [ 15297] 0 15296 12584 0 65536 0 1000 syz-executor
[59622.734587] [ 15299] 0 15298 12617 0 65536 0 1000 syz-executor
[59622.736441] [ 15301] 0 15300 12584 0 69632 0 1000 syz-executor
[59622.738319] [ 15304] 0 15303 12584 0 65536 0 1000 syz-executor
[59622.740174] [ 15306] 0 15305 12584 0 69632 0 1000 syz-executor
[59622.742022] [ 15307] 0 15307 12584 0 69632 7 1000 syz-executor
[59622.743869] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[59622.745603] CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
[59622.747339] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
[59622.748817] Call Trace:
[59622.749557] dump_stack_lvl+0x57/0x72
[59622.750520] panic+0xff/0x2ea
[59622.751390] out_of_memory.cold+0x2f/0x7e
[59622.752413] pagefault_out_of_memory+0x46/0x60
[59622.753522] exc_page_fault+0x79/0x2b0
[59622.754504] ? asm_exc_page_fault+0x8/0x30
[59622.755549] asm_exc_page_fault+0x1e/0x30
[59622.756579] RIP: 0033:0x41a948
[59622.757441] Code: Unable to access opcode bytes at RIP 0x41a91e.
[59622.758831] RSP: 002b:00007ffd10ebf6e8 EFLAGS: 00010246
[59622.760086] RAX: 0000000000000000 RBX: 000000000119cf40 RCX: 000000000041b331
[59622.761687] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[59622.763287] RBP: 000000000119d940 R08: 0000000000000000 R09: 0000000000000000
[59622.764888] R10: 00007ffd10ebf7c0 R11: 0000000000000293 R12: 00000000038dc0e1
[59622.766512] R13: 00000000038dbdc3 R14: 20c49ba5e353f7cf R15: ffffffffffffffff
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18  9:04 ` Michal Hocko
  [not found] ` <YW04jWSv6pQb2Goe-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-18 9:04 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
    Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On Mon 18-10-21 11:13:52, Vasily Averin wrote:
[...]
> How could this happen?
>
> A user-space task inside the memcg-limited container generated a page
> fault, its handler do_user_addr_fault() called handle_mm_fault(), which
> could not allocate the page because the memcg limit was exceeded and
> returned VM_FAULT_OOM. Then do_user_addr_fault() called
> pagefault_out_of_memory(), which executed out_of_memory() without any
> memcg set.
>
> This problem partially depends on one of my recent patches, which
> disabled unlimited memory allocation for dying tasks. However, I think
> the problem can also happen for non-killed tasks, for example because
> of the kmem limit.

Could you be more specific on how this can happen without your patch? I
have to say I hadn't realized this side effect when discussing it.

I will be honest that I am not really happy about
pagefault_out_of_memory. I have tried to remove it in the past, without
much success back then, unfortunately [1]. Maybe we should get rid of it
finally. The OOM is always triggered from inside the allocator, where we
have much more information about the allocation context. A first step
would be to skip pagefault_out_of_memory for killed or exiting
processes.

[1] I do not have the msg-id so I cannot provide a lore link, but google
pointed me to
https://www.mail-archive.com/linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg1400402.html
--
Michal Hocko
SUSE Labs
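Michal's suggested first step — bailing out of pagefault_out_of_memory()
for dying tasks — could look roughly like this (a userspace sketch;
PF_EXITING matches the kernel's flag value, but the task struct and
function body are simplified models, not the real implementation):

```c
#include <assert.h>
#include <stdbool.h>

#define PF_EXITING 0x00000004	/* value matches the kernel flag */

struct task_model {
	unsigned int flags;
	bool fatal_signal_pending;
};

static bool global_oom_invoked;

/* First-step sketch: return early from pagefault_out_of_memory() for
 * tasks that are already dying, before considering a global OOM. */
static void pagefault_out_of_memory(struct task_model *current)
{
	if (current->fatal_signal_pending ||
	    (current->flags & PF_EXITING))
		return;	/* dying task: let it exit, no OOM declaration */
	global_oom_invoked = true;
}
```

A killed or exiting task would then simply return to finish dying,
while a healthy task still falls through to the OOM handling.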
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18 10:05 ` Vasily Averin
  [not found] ` <6b751abe-aa52-d1d8-2631-ec471975cc3a-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  2021-10-21  8:03 ` [PATCH memcg 0/1] false global OOM triggered by memcg-limited task Vasily Averin
  1 sibling, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-18 10:05 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
    Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On 18.10.2021 12:04, Michal Hocko wrote:
> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> [...]
>> How could this happen?
>>
>> A user-space task inside the memcg-limited container generated a page
>> fault, its handler do_user_addr_fault() called handle_mm_fault(),
>> which could not allocate the page because the memcg limit was exceeded
>> and returned VM_FAULT_OOM. Then do_user_addr_fault() called
>> pagefault_out_of_memory(), which executed out_of_memory() without any
>> memcg set.
>>
>> This problem partially depends on one of my recent patches, which
>> disabled unlimited memory allocation for dying tasks. However, I think
>> the problem can also happen for non-killed tasks, for example because
>> of the kmem limit.
>
> Could you be more specific on how this can happen without your patch? I
> have to say I hadn't realized this side effect when discussing it.

We can reach obj_cgroup_charge_pages() for example via

do_user_addr_fault
 handle_mm_fault
  __handle_mm_fault
   p4d_alloc
    __p4d_alloc
     p4d_alloc_one
      get_zeroed_page
       __get_free_pages
        alloc_pages
         __alloc_pages
          __memcg_kmem_charge_page
           obj_cgroup_charge_pages

Here we call try_charge_memcg(), which returns success and approves the
allocation; however, we then hit the kmem limit and fail the allocation.

If required, I can try to work out how try_charge_memcg() can reject a
page allocation of a non-dying task too.

> I will be honest that I am not really happy about
> pagefault_out_of_memory. I have tried to remove it in the past, without
> much success back then, unfortunately [1]. Maybe we should get rid of
> it finally. The OOM is always triggered from inside the allocator,
> where we have much more information about the allocation context. A
> first step would be to skip pagefault_out_of_memory for killed or
> exiting processes.

I like this idea; however, it may not be enough, at least in the
scenario described above.

> [1] I do not have the msg-id so I cannot provide a lore link, but
> google pointed me to
> https://www.mail-archive.com/linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg1400402.html

Thank you, I'll read this discussion.

	Vasily Averin
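The two-step failure described here — the memory counter accepting the
charge, then the separate v1 kmem counter rejecting it — can be modeled
as follows (a sketch with made-up limits; the counters and uncharge-on-
failure behavior are simplified stand-ins for the kernel's page_counter
logic):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy counters: memory limit is generous, kmem limit is tight. */
static long memory_usage, memory_limit = 100;
static long kmem_usage,   kmem_limit   = 10;	/* kmem.limit_in_bytes */

static int try_charge_memcg(long pages)
{
	if (memory_usage + pages > memory_limit)
		return -1;		/* -ENOMEM */
	memory_usage += pages;
	return 0;
}

/* obj_cgroup_charge_pages(): first the memory counter accepts the
 * charge, then the kmem counter can still reject it -- even for a task
 * that is not being killed. */
static int obj_cgroup_charge_pages(long pages)
{
	if (try_charge_memcg(pages))
		return -1;
	if (kmem_usage + pages > kmem_limit) {
		memory_usage -= pages;	/* uncharge on failure */
		return -1;	/* surfaces as VM_FAULT_OOM in the fault path */
	}
	kmem_usage += pages;
	return 0;
}
```

The second charge below fails purely on the kmem limit, which is exactly
the case where the fault path cannot tell a memcg rejection from a real
global shortage.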
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18 10:12 ` Vasily Averin
  2021-10-18 11:53 ` Michal Hocko
  1 sibling, 0 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-18 10:12 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
    Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On 18.10.2021 13:05, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> This problem partially depends on one of my recent patches, which
>>> disabled unlimited memory allocation for dying tasks. However, I
>>> think the problem can also happen for non-killed tasks, for example
>>> because of the kmem limit.
>>
>> Could you be more specific on how this can happen without your patch?
>> I have to say I hadn't realized this side effect when discussing it.
>
> We can reach obj_cgroup_charge_pages() for example via
>
> do_user_addr_fault
>  handle_mm_fault
>   __handle_mm_fault
>    p4d_alloc
>     __p4d_alloc
>      p4d_alloc_one
>       get_zeroed_page
>        __get_free_pages
>         alloc_pages
>          __alloc_pages
>           __memcg_kmem_charge_page
>            obj_cgroup_charge_pages
>
> Here we call try_charge_memcg(), which returns success and approves the
> allocation; however, we then hit the kmem limit and fail the
> allocation.

Btw. in OpenVZ kernels we try to clean up memory when a task hits the
kmem limit; therefore we moved the kmem limit check into
try_charge_memcg(). Are any improvements for the kmem controller
interesting for upstream?

Thank you,
	Vasily Averin
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18 11:53 ` Michal Hocko
  [not found] ` <27dc0c49-a0d6-875b-49c6-0ef5c0cc3ac8@virtuozzo.com>
  [not found] ` <YW1gRz0rTkJrvc4L-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  1 sibling, 2 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-18 11:53 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
    Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>> [...]
>>
>> Could you be more specific on how this can happen without your patch?
>> I have to say I hadn't realized this side effect when discussing it.
>
> We can reach obj_cgroup_charge_pages() for example via
>
> do_user_addr_fault
>  handle_mm_fault
>   __handle_mm_fault
>    p4d_alloc
>     __p4d_alloc
>      p4d_alloc_one
>       get_zeroed_page
>        __get_free_pages
>         alloc_pages
>          __alloc_pages
>           __memcg_kmem_charge_page
>            obj_cgroup_charge_pages
>
> Here we call try_charge_memcg(), which returns success and approves the
> allocation; however, we then hit the kmem limit and fail the
> allocation.

Just to make sure I understand: this would be for the v1 explicit kmem
limit, correct?

> If required, I can try to work out how try_charge_memcg() can reject a
> page allocation of a non-dying task too.

Yes.

>> I will be honest that I am not really happy about
>> pagefault_out_of_memory. I have tried to remove it in the past,
>> without much success back then, unfortunately [1]. Maybe we should
>> get rid of it finally. The OOM is always triggered from inside the
>> allocator, where we have much more information about the allocation
>> context. A first step would be to skip pagefault_out_of_memory for
>> killed or exiting processes.
>
> I like this idea; however, it may not be enough, at least in the
> scenario described above.

My original patch removed the oom killer completely.
--
Michal Hocko
SUSE Labs
[parent not found: <27dc0c49-a0d6-875b-49c6-0ef5c0cc3ac8@virtuozzo.com>]
[parent not found: <27dc0c49-a0d6-875b-49c6-0ef5c0cc3ac8-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
  [not found] ` <27dc0c49-a0d6-875b-49c6-0ef5c0cc3ac8-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-10-18 12:27 ` Michal Hocko
  [not found] ` <YW1oMxNkUCaAimmg-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-18 12:27 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

[restore the cc list]

On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> On 18.10.2021 14:53, Michal Hocko wrote:
> > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >> On 18.10.2021 12:04, Michal Hocko wrote:
> >> Here we call try_charge_memcg() that return success and approve the allocation,
> >> however then we hit into kmem limit and fail the allocation.
> >
> > Just to make sure I understand this would be for the v1 kmem explicit
> > limit, correct?
>
> yes, I mean this limit.

OK, thanks for the clarification. This is a known problem. Have a look
at 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have
been reporting the deprecated and to-be-removed status since 2019 without
any actual report through the channel suggested by the kernel message.
Maybe we should try to remove it and see whether that prompts some
pushback.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread
[parent not found: <YW1oMxNkUCaAimmg-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <YW1oMxNkUCaAimmg-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2021-10-18 15:07 ` Shakeel Butt [not found] ` <CALvZod42uwgrg83CCKn6JgYqAQtR1RLJSuybNYjtkFo4wVgT1w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Shakeel Butt @ 2021-10-18 15:07 UTC (permalink / raw) To: Michal Hocko Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman, Cgroups, Linux MM, LKML, kernel-GEFAQzZX7r8dnm+yROfE0A On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > [restore the cc list] > > On Mon 18-10-21 15:14:26, Vasily Averin wrote: > > On 18.10.2021 14:53, Michal Hocko wrote: > > > On Mon 18-10-21 13:05:35, Vasily Averin wrote: > > >> On 18.10.2021 12:04, Michal Hocko wrote: > > >> Here we call try_charge_memcg() that return success and approve the allocation, > > >> however then we hit into kmem limit and fail the allocation. > > > > > > Just to make sure I understand this would be for the v1 kmem explicit > > > limit, correct? > > > > yes, I mean this limit. > > OK, thanks for the clarification. This is a known problem. Have a look > at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate > kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed > status since 2019 without any actual report sugested by the kernel > message. Maybe we should try and remove it and see whether that prompts > some pushback. > Yes, I think now should be the right time to take the next step for deprecation of kmem limits: https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org/ ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <CALvZod42uwgrg83CCKn6JgYqAQtR1RLJSuybNYjtkFo4wVgT1w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <CALvZod42uwgrg83CCKn6JgYqAQtR1RLJSuybNYjtkFo4wVgT1w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2021-10-18 16:51 ` Michal Hocko 2021-10-18 17:13 ` Shakeel Butt 2021-10-18 18:52 ` Vasily Averin 1 sibling, 1 reply; 66+ messages in thread From: Michal Hocko @ 2021-10-18 16:51 UTC (permalink / raw) To: Shakeel Butt Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman, Cgroups, Linux MM, LKML, kernel-GEFAQzZX7r8dnm+yROfE0A On Mon 18-10-21 08:07:20, Shakeel Butt wrote: > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > > > [restore the cc list] > > > > On Mon 18-10-21 15:14:26, Vasily Averin wrote: > > > On 18.10.2021 14:53, Michal Hocko wrote: > > > > On Mon 18-10-21 13:05:35, Vasily Averin wrote: > > > >> On 18.10.2021 12:04, Michal Hocko wrote: > > > >> Here we call try_charge_memcg() that return success and approve the allocation, > > > >> however then we hit into kmem limit and fail the allocation. > > > > > > > > Just to make sure I understand this would be for the v1 kmem explicit > > > > limit, correct? > > > > > > yes, I mean this limit. > > > > OK, thanks for the clarification. This is a known problem. Have a look > > at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate > > kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed > > status since 2019 without any actual report sugested by the kernel > > message. Maybe we should try and remove it and see whether that prompts > > some pushback. > > > > Yes, I think now should be the right time to take the next step for > deprecation of kmem limits: > https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org/ I completely forgot about your patch. 
Anyway, it usually takes us years to deprecate something so let's stick with it and consider 2 years as years ;) -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task 2021-10-18 16:51 ` Michal Hocko @ 2021-10-18 17:13 ` Shakeel Butt 0 siblings, 0 replies; 66+ messages in thread From: Shakeel Butt @ 2021-10-18 17:13 UTC (permalink / raw) To: Michal Hocko Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman, Cgroups, Linux MM, LKML, kernel On Mon, Oct 18, 2021 at 9:51 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 18-10-21 08:07:20, Shakeel Butt wrote: > > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > [restore the cc list] > > > > > > On Mon 18-10-21 15:14:26, Vasily Averin wrote: > > > > On 18.10.2021 14:53, Michal Hocko wrote: > > > > > On Mon 18-10-21 13:05:35, Vasily Averin wrote: > > > > >> On 18.10.2021 12:04, Michal Hocko wrote: > > > > >> Here we call try_charge_memcg() that return success and approve the allocation, > > > > >> however then we hit into kmem limit and fail the allocation. > > > > > > > > > > Just to make sure I understand this would be for the v1 kmem explicit > > > > > limit, correct? > > > > > > > > yes, I mean this limit. > > > > > > OK, thanks for the clarification. This is a known problem. Have a look > > > at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate > > > kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed > > > status since 2019 without any actual report sugested by the kernel > > > message. Maybe we should try and remove it and see whether that prompts > > > some pushback. > > > > > > > Yes, I think now should be the right time to take the next step for > > deprecation of kmem limits: > > https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/ > > I completely forgot about your patch. Anyway, it usually takes us years > to deprecate something so let's stick with it and consider 2 years as > years ;) > Sure, I will rebase and resend. 
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <CALvZod42uwgrg83CCKn6JgYqAQtR1RLJSuybNYjtkFo4wVgT1w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2021-10-18 16:51 ` Michal Hocko @ 2021-10-18 18:52 ` Vasily Averin [not found] ` <153f7aa6-39ef-f064-8745-a9489e088239-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 1 sibling, 1 reply; 66+ messages in thread From: Vasily Averin @ 2021-10-18 18:52 UTC (permalink / raw) To: Shakeel Butt, Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman, Cgroups, Linux MM, LKML, kernel-GEFAQzZX7r8dnm+yROfE0A On 18.10.2021 18:07, Shakeel Butt wrote: > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: >> >> [restore the cc list] >> >> On Mon 18-10-21 15:14:26, Vasily Averin wrote: >>> On 18.10.2021 14:53, Michal Hocko wrote: >>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote: >>>>> On 18.10.2021 12:04, Michal Hocko wrote: >>>>> Here we call try_charge_memcg() that return success and approve the allocation, >>>>> however then we hit into kmem limit and fail the allocation. >>>> >>>> Just to make sure I understand this would be for the v1 kmem explicit >>>> limit, correct? >>> >>> yes, I mean this limit. >> >> OK, thanks for the clarification. This is a known problem. Have a look >> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate >> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed >> status since 2019 without any actual report sugested by the kernel >> message. Maybe we should try and remove it and see whether that prompts >> some pushback. >> > > Yes, I think now should be the right time to take the next step for > deprecation of kmem limits: > https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org/ Are you going to push it to stable kernels too? 
Thank you, Vasily Averin ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <153f7aa6-39ef-f064-8745-a9489e088239-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <153f7aa6-39ef-f064-8745-a9489e088239-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-18 19:18 ` Vasily Averin [not found] ` <4a30aa18-e2a2-693c-8237-b75fffac9838-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 2021-10-19 5:33 ` Shakeel Butt 1 sibling, 1 reply; 66+ messages in thread From: Vasily Averin @ 2021-10-18 19:18 UTC (permalink / raw) To: Shakeel Butt, Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman, Cgroups, Linux MM, LKML, kernel-GEFAQzZX7r8dnm+yROfE0A On 18.10.2021 21:52, Vasily Averin wrote: > On 18.10.2021 18:07, Shakeel Butt wrote: >> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: >>> >>> [restore the cc list] >>> >>> On Mon 18-10-21 15:14:26, Vasily Averin wrote: >>>> On 18.10.2021 14:53, Michal Hocko wrote: >>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote: >>>>>> On 18.10.2021 12:04, Michal Hocko wrote: >>>>>> Here we call try_charge_memcg() that return success and approve the allocation, >>>>>> however then we hit into kmem limit and fail the allocation. >>>>> >>>>> Just to make sure I understand this would be for the v1 kmem explicit >>>>> limit, correct? >>>> >>>> yes, I mean this limit. >>> >>> OK, thanks for the clarification. This is a known problem. Have a look >>> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate >>> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed >>> status since 2019 without any actual report sugested by the kernel >>> message. Maybe we should try and remove it and see whether that prompts >>> some pushback. 
>>>
>>
>> Yes, I think now should be the right time to take the next step for
>> deprecation of kmem limits:
>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org/
>
> Are you going to push it to stable kernels too?

Btw CONFIG_MEMCG_KMEM=y is set both in RHEL8 kernels and in the
Ubuntu 20.04 LTS kernel 5.11.0-37.

^ permalink raw reply	[flat|nested] 66+ messages in thread
[parent not found: <4a30aa18-e2a2-693c-8237-b75fffac9838-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <4a30aa18-e2a2-693c-8237-b75fffac9838-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-19 5:34 ` Shakeel Butt 0 siblings, 0 replies; 66+ messages in thread From: Shakeel Butt @ 2021-10-19 5:34 UTC (permalink / raw) To: Vasily Averin Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman, Cgroups, Linux MM, LKML, kernel-GEFAQzZX7r8dnm+yROfE0A On Mon, Oct 18, 2021 at 12:19 PM Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> wrote: > > On 18.10.2021 21:52, Vasily Averin wrote: > > On 18.10.2021 18:07, Shakeel Butt wrote: > >> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > >>> > >>> [restore the cc list] > >>> > >>> On Mon 18-10-21 15:14:26, Vasily Averin wrote: > >>>> On 18.10.2021 14:53, Michal Hocko wrote: > >>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote: > >>>>>> On 18.10.2021 12:04, Michal Hocko wrote: > >>>>>> Here we call try_charge_memcg() that return success and approve the allocation, > >>>>>> however then we hit into kmem limit and fail the allocation. > >>>>> > >>>>> Just to make sure I understand this would be for the v1 kmem explicit > >>>>> limit, correct? > >>>> > >>>> yes, I mean this limit. > >>> > >>> OK, thanks for the clarification. This is a known problem. Have a look > >>> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate > >>> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed > >>> status since 2019 without any actual report sugested by the kernel > >>> message. Maybe we should try and remove it and see whether that prompts > >>> some pushback. 
> >>> > >> > >> Yes, I think now should be the right time to take the next step for > >> deprecation of kmem limits: > >> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org/ > > > > Are you going to push it to stable kernels too? > > Btw CONFIG_MEMCG_KMEM=y is set both in RHEL8 kernels and in ubuntu 20.04 LTS kernel 5.11.0-37. > CONFIG_MEMCG_KMEM is orthogonal to setting kmem limits. We are not disabling the kmem accounting. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <153f7aa6-39ef-f064-8745-a9489e088239-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 2021-10-18 19:18 ` Vasily Averin @ 2021-10-19 5:33 ` Shakeel Butt [not found] ` <CALvZod5Kut63MLVfCkEW5XemqN4Jnd1iEQD_Gk0w5=fPffL8Bg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 66+ messages in thread From: Shakeel Butt @ 2021-10-19 5:33 UTC (permalink / raw) To: Vasily Averin Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman, Cgroups, Linux MM, LKML, kernel-GEFAQzZX7r8dnm+yROfE0A On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> wrote: > > On 18.10.2021 18:07, Shakeel Butt wrote: > > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > >> > >> [restore the cc list] > >> > >> On Mon 18-10-21 15:14:26, Vasily Averin wrote: > >>> On 18.10.2021 14:53, Michal Hocko wrote: > >>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote: > >>>>> On 18.10.2021 12:04, Michal Hocko wrote: > >>>>> Here we call try_charge_memcg() that return success and approve the allocation, > >>>>> however then we hit into kmem limit and fail the allocation. > >>>> > >>>> Just to make sure I understand this would be for the v1 kmem explicit > >>>> limit, correct? > >>> > >>> yes, I mean this limit. > >> > >> OK, thanks for the clarification. This is a known problem. Have a look > >> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate > >> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed > >> status since 2019 without any actual report sugested by the kernel > >> message. Maybe we should try and remove it and see whether that prompts > >> some pushback. 
> >> > > > > Yes, I think now should be the right time to take the next step for > > deprecation of kmem limits: > > https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org/ > > Are you going to push it to stable kernels too? > Not really. Is there a reason I should? More exposure to catch breakage? ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <CALvZod5Kut63MLVfCkEW5XemqN4Jnd1iEQD_Gk0w5=fPffL8Bg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <CALvZod5Kut63MLVfCkEW5XemqN4Jnd1iEQD_Gk0w5=fPffL8Bg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2021-10-19 6:42 ` Vasily Averin [not found] ` <25120323-d222-cc5e-fe08-6471bce13bd6-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Vasily Averin @ 2021-10-19 6:42 UTC (permalink / raw) To: Shakeel Butt Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman, Cgroups, Linux MM, LKML, kernel-GEFAQzZX7r8dnm+yROfE0A On 19.10.2021 08:33, Shakeel Butt wrote: > On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> wrote: >> >> On 18.10.2021 18:07, Shakeel Butt wrote: >>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: >>>> >>>> [restore the cc list] >>>> >>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote: >>>>> On 18.10.2021 14:53, Michal Hocko wrote: >>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote: >>>>>>> On 18.10.2021 12:04, Michal Hocko wrote: >>>>>>> Here we call try_charge_memcg() that return success and approve the allocation, >>>>>>> however then we hit into kmem limit and fail the allocation. >>>>>> >>>>>> Just to make sure I understand this would be for the v1 kmem explicit >>>>>> limit, correct? >>>>> >>>>> yes, I mean this limit. >>>> >>>> OK, thanks for the clarification. This is a known problem. Have a look >>>> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate >>>> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed >>>> status since 2019 without any actual report sugested by the kernel >>>> message. Maybe we should try and remove it and see whether that prompts >>>> some pushback. 
>>>
>>> Yes, I think now should be the right time to take the next step for
>>> deprecation of kmem limits:
>>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org/
>>
>> Are you going to push it to stable kernels too?
>
> Not really. Is there a reason I should? More exposure to catch breakage?

There is a problem: a kmem limit can trigger a fake global OOM.
To fix it upstream you can remove the kmem controller.

However, how do we handle this problem in stable and LTS kernels?
My current patch resolves the problem too, and it can be backported.
However, I am afraid nobody will do it if the patch is not approved upstream.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 66+ messages in thread
[parent not found: <25120323-d222-cc5e-fe08-6471bce13bd6-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <25120323-d222-cc5e-fe08-6471bce13bd6-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-19 8:47 ` Michal Hocko 0 siblings, 0 replies; 66+ messages in thread From: Michal Hocko @ 2021-10-19 8:47 UTC (permalink / raw) To: Vasily Averin Cc: Shakeel Butt, Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman, Cgroups, Linux MM, LKML, kernel-GEFAQzZX7r8dnm+yROfE0A On Tue 19-10-21 09:42:38, Vasily Averin wrote: > On 19.10.2021 08:33, Shakeel Butt wrote: > > On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> wrote: > >> > >> On 18.10.2021 18:07, Shakeel Butt wrote: > >>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > >>>> > >>>> [restore the cc list] > >>>> > >>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote: > >>>>> On 18.10.2021 14:53, Michal Hocko wrote: > >>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote: > >>>>>>> On 18.10.2021 12:04, Michal Hocko wrote: > >>>>>>> Here we call try_charge_memcg() that return success and approve the allocation, > >>>>>>> however then we hit into kmem limit and fail the allocation. > >>>>>> > >>>>>> Just to make sure I understand this would be for the v1 kmem explicit > >>>>>> limit, correct? > >>>>> > >>>>> yes, I mean this limit. > >>>> > >>>> OK, thanks for the clarification. This is a known problem. Have a look > >>>> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate > >>>> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed > >>>> status since 2019 without any actual report sugested by the kernel > >>>> message. Maybe we should try and remove it and see whether that prompts > >>>> some pushback. 
> >>> Yes, I think now should be the right time to take the next step for
> >>> deprecation of kmem limits:
> >>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org/
> >>
> >> Are you going to push it to stable kernels too?
> >
> > Not really. Is there a reason I should? More exposure to catch breakage?
>
> There is a problem: a kmem limit can trigger a fake global OOM.
> To fix it upstream you can remove the kmem controller.
>
> However, how do we handle this problem in stable and LTS kernels?

I do not see any bug reports coming in and I strongly suspect this is
because the functionality is simply not used widely enough or in the
mode where it would matter (U != 0, K < U from our documentation). If
there are relevant usecases for this setup then we really want to hear
about those because we do not want to break userspace. Handling that
setup would be far from trivial and the global oom killer is not the
only one of those.

> My current patch resolves the problem too, and it can be backported.
> However, I am afraid nobody will do it if the patch is not approved upstream.

I do not think your current approach is the right one. We do not really
want yet another flag to tell us we are in a memcg OOM. We already have
one. A proper way is to deal with the pagefault oom killer trigger but
that might be just too risky for stable kernels.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread
[parent not found: <YW1gRz0rTkJrvc4L-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <YW1gRz0rTkJrvc4L-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2021-10-19 6:30 ` Vasily Averin [not found] ` <339ae4b5-6efd-8fc2-33f1-2eb3aee71cb2-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Vasily Averin @ 2021-10-19 6:30 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On 18.10.2021 14:53, Michal Hocko wrote: > On Mon 18-10-21 13:05:35, Vasily Averin wrote: >> On 18.10.2021 12:04, Michal Hocko wrote: >>> On Mon 18-10-21 11:13:52, Vasily Averin wrote: >>> [...] >>>> How could this happen? >>>> >>>> User-space task inside the memcg-limited container generated a page fault, >>>> its handler do_user_addr_fault() called handle_mm_fault which could not >>>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM. >>>> Then do_user_addr_fault() called pagefault_out_of_memory() which executed >>>> out_of_memory() without set of memcg. >>>> >>>> Partially this problem depends on one of my recent patches, disabled unlimited >>>> memory allocation for dying tasks. However I think the problem can happen >>>> on non-killed tasks too, for example because of kmem limit. >>> >>> Could you be more specific on how this can happen without your patch? I >>> have to say I haven't realized this side effect when discussing it. >> If required I can try to search how try_charge_memcg() can reject page allocation >> of non-dying task too. > > Yes. Now I think that such failure was very unlikely (w/o my patch and kmem limit). I cannot exclude it completely, because I did not finished this review and perhaps I missed something, but I checked most part of code and found nothing. 
With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
a) due to a fatal signal
b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)

To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory().
To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.

However all these cases can be successfully handled by my new patch
"memcg: prevent false global OOM triggered by memcg limited task"
and I think it is a better solution.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 66+ messages in thread
[parent not found: <339ae4b5-6efd-8fc2-33f1-2eb3aee71cb2-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <339ae4b5-6efd-8fc2-33f1-2eb3aee71cb2-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-19 8:49 ` Michal Hocko [not found] ` <YW6GoZhFUJc1uLYr-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2021-10-19 8:49 UTC (permalink / raw) To: Vasily Averin Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On Tue 19-10-21 09:30:18, Vasily Averin wrote: [...] > With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail: > a) due to fatal signal > b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing) > > To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory() > To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED. How is b) possible without current being killed? Do we allow remote charging? > However all these cases can be successfully handled by my new patch > "memcg: prevent false global OOM triggered by memcg limited task" > and I think it is better solution. I have already replied to your approach in other email. Sorry our replies have crossed. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <YW6GoZhFUJc1uLYr-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
  [not found] ` <YW6GoZhFUJc1uLYr-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2021-10-19 10:30 ` Vasily Averin
  [not found] ` <687bf489-f7a7-5604-25c5-0c1a09e0905b-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-19 10:30 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On 19.10.2021 11:49, Michal Hocko wrote:
> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> [...]
>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
>> a) due to a fatal signal
>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
>>
>> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory().
>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>
> How is b) possible without current being killed? Do we allow remote
> charging?

out_of_memory for memcg_oom
 select_bad_process
  mem_cgroup_scan_tasks
   oom_evaluate_task
    oom_badness

	/*
	 * Do not even consider tasks which are explicitly marked oom
	 * unkillable or have been already oom reaped or the are in
	 * the middle of vfork
	 */
	adj = (long)p->signal->oom_score_adj;
	if (adj == OOM_SCORE_ADJ_MIN ||
			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
			in_vfork(p)) {
		task_unlock(p);
		return LONG_MIN;
	}

This time we handle a userspace page fault, so we cannot be a kernel thread,
and cannot be in_vfork().
However the task can be marked as oom unkillable,
i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
It can be set via oom_score_adj_write().

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 66+ messages in thread
[parent not found: <687bf489-f7a7-5604-25c5-0c1a09e0905b-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <687bf489-f7a7-5604-25c5-0c1a09e0905b-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-19 11:54 ` Michal Hocko [not found] ` <YW6yAeAO+TeS3OdB-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2021-10-19 11:54 UTC (permalink / raw) To: Vasily Averin Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On Tue 19-10-21 13:30:06, Vasily Averin wrote: > On 19.10.2021 11:49, Michal Hocko wrote: > > On Tue 19-10-21 09:30:18, Vasily Averin wrote: > > [...] > >> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail: > >> a) due to fatal signal > >> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing) > >> > >> To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory() > >> To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED. > > > How is b) possible without current being killed? Do we allow remote > > charging? > > out_of_memory for memcg_oom > select_bad_process > mem_cgroup_scan_tasks > oom_evaluate_task > oom_badness > > /* > * Do not even consider tasks which are explicitly marked oom > * unkillable or have been already oom reaped or the are in > * the middle of vfork > */ > adj = (long)p->signal->oom_score_adj; > if (adj == OOM_SCORE_ADJ_MIN || > test_bit(MMF_OOM_SKIP, &p->mm->flags) || > in_vfork(p)) { > task_unlock(p); > return LONG_MIN; > } > > This time we handle userspace page fault, so we cannot be kenrel thread, > and cannot be in_vfork(). > However task can be marked as oom unkillable, > i.e. 
have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN

You are right. I am not sure there is a way out of this though. The task
can only retry forever in this case. There is nothing actionable here.
We cannot kill the task and there is no other way to release the memory.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread
[parent not found: <YW6yAeAO+TeS3OdB-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <YW6yAeAO+TeS3OdB-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2021-10-19 12:04 ` Michal Hocko [not found] ` <YW60Rs1mi24sJmp4-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2021-10-19 12:04 UTC (permalink / raw) To: Vasily Averin Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On Tue 19-10-21 13:54:42, Michal Hocko wrote: > On Tue 19-10-21 13:30:06, Vasily Averin wrote: > > On 19.10.2021 11:49, Michal Hocko wrote: > > > On Tue 19-10-21 09:30:18, Vasily Averin wrote: > > > [...] > > >> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail: > > >> a) due to fatal signal > > >> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing) > > >> > > >> To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory() > > >> To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED. > > > > > How is b) possible without current being killed? Do we allow remote > > > charging? 
> > > > out_of_memory for memcg_oom > > select_bad_process > > mem_cgroup_scan_tasks > > oom_evaluate_task > > oom_badness > > > > /* > > * Do not even consider tasks which are explicitly marked oom > > * unkillable or have been already oom reaped or the are in > > * the middle of vfork > > */ > > adj = (long)p->signal->oom_score_adj; > > if (adj == OOM_SCORE_ADJ_MIN || > > test_bit(MMF_OOM_SKIP, &p->mm->flags) || > > in_vfork(p)) { > > task_unlock(p); > > return LONG_MIN; > > } > > > > This time we handle userspace page fault, so we cannot be kenrel thread, > > and cannot be in_vfork(). > > However task can be marked as oom unkillable, > > i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN > > You are right. I am not sure there is a way out of this though. The task > can only retry for ever in this case. There is nothing actionable here. > We cannot kill the task and there is no other way to release the memory. Btw. don't we force the charge in that case? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <YW60Rs1mi24sJmp4-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <YW60Rs1mi24sJmp4-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2021-10-19 13:26 ` Vasily Averin [not found] ` <6c422150-593f-f601-8f91-914c6c5e82f4-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Vasily Averin @ 2021-10-19 13:26 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On 19.10.2021 15:04, Michal Hocko wrote: > On Tue 19-10-21 13:54:42, Michal Hocko wrote: >> On Tue 19-10-21 13:30:06, Vasily Averin wrote: >>> On 19.10.2021 11:49, Michal Hocko wrote: >>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote: >>>> [...] >>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail: >>>>> a) due to fatal signal >>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing) >>>>> >>>>> To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory() >>>>> To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED. >>> >>>> How is b) possible without current being killed? Do we allow remote >>>> charging? 
>>> >>> out_of_memory for memcg_oom >>> select_bad_process >>> mem_cgroup_scan_tasks >>> oom_evaluate_task >>> oom_badness >>> >>> /* >>> * Do not even consider tasks which are explicitly marked oom >>> * unkillable or have been already oom reaped or the are in >>> * the middle of vfork >>> */ >>> adj = (long)p->signal->oom_score_adj; >>> if (adj == OOM_SCORE_ADJ_MIN || >>> test_bit(MMF_OOM_SKIP, &p->mm->flags) || >>> in_vfork(p)) { >>> task_unlock(p); >>> return LONG_MIN; >>> } >>> >>> This time we handle userspace page fault, so we cannot be kenrel thread, >>> and cannot be in_vfork(). >>> However task can be marked as oom unkillable, >>> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN >> >> You are right. I am not sure there is a way out of this though. The task >> can only retry for ever in this case. There is nothing actionable here. >> We cannot kill the task and there is no other way to release the memory. > > Btw. don't we force the charge in that case? We should force the charge for allocations from inside the page fault handler, to prevent an endless cycle in retried page faults. However we should not do it for allocations from task context, to prevent memcg-limited vmalloc-eaters from consuming all host memory. Also I would like to return to the following hunk. @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, * A few threads which were not waiting at mutex_lock_killable() can * fail to bail out. Therefore, check again after holding oom_lock. */ - ret = should_force_charge() || out_of_memory(&oc); + ret = task_is_dying() || out_of_memory(&oc); unlock: mutex_unlock(&oom_lock); Now I think it's better to keep the task_is_dying() check here. If the task is dying, it is not necessary to push another task to free the memory. We broke the vmalloc cycle already, so it looks like nothing should prevent us from returning to userspace, handling the fatal signal, exiting and freeing the memory. 
Thank you, Vasily Averin ^ permalink raw reply [flat|nested] 66+ messages in thread
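The hunk under discussion can be modeled as a tiny decision function (a sketch with boolean flags standing in for the real task state; the helper names mirror the patch but the code is not kernel code): keeping `task_is_dying()` in `mem_cgroup_out_of_memory()` means a dying task reports "progress" without any victim being selected, on the theory that its own exit will free the charged memory.

```c
#include <assert.h>
#include <stdbool.h>

/* mirrors the renamed helper: any of the three conditions marks
 * the task as dying */
static bool task_is_dying(bool oom_victim, bool fatal_signal, bool pf_exiting)
{
	return oom_victim || fatal_signal || pf_exiting;
}

/* condensed form of the quoted line:
 *   ret = task_is_dying() || out_of_memory(&oc);
 * a dying task short-circuits and no other task is pushed to free memory */
static bool memcg_oom_sim(bool dying, bool out_of_memory_found_victim)
{
	return dying || out_of_memory_found_victim;
}
```

So the open question is only whether the short-circuit for dying tasks should survive once forced charging is gone.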
[parent not found: <6c422150-593f-f601-8f91-914c6c5e82f4-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <6c422150-593f-f601-8f91-914c6c5e82f4-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-19 14:13 ` Michal Hocko [not found] ` <YW7SfkZR/ZsabkXV-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2021-10-19 14:13 UTC (permalink / raw) To: Vasily Averin Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On Tue 19-10-21 16:26:50, Vasily Averin wrote: > On 19.10.2021 15:04, Michal Hocko wrote: > > On Tue 19-10-21 13:54:42, Michal Hocko wrote: > >> On Tue 19-10-21 13:30:06, Vasily Averin wrote: > >>> On 19.10.2021 11:49, Michal Hocko wrote: > >>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote: > >>>> [...] > >>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail: > >>>>> a) due to fatal signal > >>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing) > >>>>> > >>>>> To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory() > >>>>> To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED. > >>> > >>>> How is b) possible without current being killed? Do we allow remote > >>>> charging? 
> >>> > >>> out_of_memory for memcg_oom > >>> select_bad_process > >>> mem_cgroup_scan_tasks > >>> oom_evaluate_task > >>> oom_badness > >>> > >>> /* > >>> * Do not even consider tasks which are explicitly marked oom > >>> * unkillable or have been already oom reaped or the are in > >>> * the middle of vfork > >>> */ > >>> adj = (long)p->signal->oom_score_adj; > >>> if (adj == OOM_SCORE_ADJ_MIN || > >>> test_bit(MMF_OOM_SKIP, &p->mm->flags) || > >>> in_vfork(p)) { > >>> task_unlock(p); > >>> return LONG_MIN; > >>> } > >>> > >>> This time we handle userspace page fault, so we cannot be kenrel thread, > >>> and cannot be in_vfork(). > >>> However task can be marked as oom unkillable, > >>> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN > >> > >> You are right. I am not sure there is a way out of this though. The task > >> can only retry for ever in this case. There is nothing actionable here. > >> We cannot kill the task and there is no other way to release the memory. > > > > Btw. don't we force the charge in that case? > > We should force charge for allocation from inside page fault handler, > to prevent endless cycle in retried page faults. > However we should not do it for allocations from task context, > to prevent memcg-limited vmalloc-eaters from to consume all host memory. I don't see a big difference between those two. Because the #PF could result into the very same situation depleting all the memory by overcharging. A different behavior just leads to a confusion and unexpected behavior. E.g. in the past we only triggered memcg OOM killer from the #PF path and failed the charge otherwise. That is something different but it shows problems we haven't anticipated and had user visible problems. See 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path"). > Also I would like to return to the following hunk. 
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > * A few threads which were not waiting at mutex_lock_killable() can > * fail to bail out. Therefore, check again after holding oom_lock. > */ > - ret = should_force_charge() || out_of_memory(&oc); > + ret = task_is_dying() || out_of_memory(&oc); > > unlock: > mutex_unlock(&oom_lock); > > Now I think it's better to keep task_is_dying() check here. > if task is dying, it is not necessary to push other task to free the memory. > We broke vmalloc cycle already, so it looks like nothing should prevent us > from returning to userspace, handle fatal signal, exit and free the memory. That patch has to be discussed in its full length. There were other details I have brought up AFAIU. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <YW7SfkZR/ZsabkXV-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <YW7SfkZR/ZsabkXV-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2021-10-19 14:19 ` Michal Hocko 2021-10-19 19:09 ` Vasily Averin 1 sibling, 0 replies; 66+ messages in thread From: Michal Hocko @ 2021-10-19 14:19 UTC (permalink / raw) To: Vasily Averin Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On Tue 19-10-21 16:13:19, Michal Hocko wrote: [...] > That patch has to be discuss in its full length. There were other > details I have brought up AFAIU. And btw. thanks for walking through that maze! It is far from trivial with side effects all over the place which are far from obvious and heavily underdocumented. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <YW7SfkZR/ZsabkXV-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2021-10-19 14:19 ` Michal Hocko @ 2021-10-19 19:09 ` Vasily Averin [not found] ` <3c76e2d7-e545-ef34-b2c3-a5f63b1eff51-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 1 sibling, 1 reply; 66+ messages in thread From: Vasily Averin @ 2021-10-19 19:09 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On 19.10.2021 17:13, Michal Hocko wrote: > On Tue 19-10-21 16:26:50, Vasily Averin wrote: >> On 19.10.2021 15:04, Michal Hocko wrote: >>> On Tue 19-10-21 13:54:42, Michal Hocko wrote: >>>> On Tue 19-10-21 13:30:06, Vasily Averin wrote: >>>>> On 19.10.2021 11:49, Michal Hocko wrote: >>>>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote: >>>>>> [...] >>>>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail: >>>>>>> a) due to fatal signal >>>>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing) >>>>>>> >>>>>>> To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory() >>>>>>> To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED. >>>>> >>>>>> How is b) possible without current being killed? Do we allow remote >>>>>> charging? 
>>>>> >>>>> out_of_memory for memcg_oom >>>>> select_bad_process >>>>> mem_cgroup_scan_tasks >>>>> oom_evaluate_task >>>>> oom_badness >>>>> >>>>> /* >>>>> * Do not even consider tasks which are explicitly marked oom >>>>> * unkillable or have been already oom reaped or the are in >>>>> * the middle of vfork >>>>> */ >>>>> adj = (long)p->signal->oom_score_adj; >>>>> if (adj == OOM_SCORE_ADJ_MIN || >>>>> test_bit(MMF_OOM_SKIP, &p->mm->flags) || >>>>> in_vfork(p)) { >>>>> task_unlock(p); >>>>> return LONG_MIN; >>>>> } >>>>> >>>>> This time we handle userspace page fault, so we cannot be kenrel thread, >>>>> and cannot be in_vfork(). >>>>> However task can be marked as oom unkillable, >>>>> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN >>>> >>>> You are right. I am not sure there is a way out of this though. The task >>>> can only retry for ever in this case. There is nothing actionable here. >>>> We cannot kill the task and there is no other way to release the memory. >>> >>> Btw. don't we force the charge in that case? >> >> We should force charge for allocation from inside page fault handler, >> to prevent endless cycle in retried page faults. >> However we should not do it for allocations from task context, >> to prevent memcg-limited vmalloc-eaters from to consume all host memory. > > I don't see a big difference between those two. Because the #PF could > result into the very same situation depleting all the memory by > overcharging. A different behavior just leads to a confusion and > unexpected behavior. E.g. in the past we only triggered memcg OOM killer > from the #PF path and failed the charge otherwise. That is something > different but it shows problems we haven't anticipated and had user > visible problems. See 29ef680ae7c2 ("memcg, oom: move out_of_memory back > to the charge path"). In this case I think we should fail this allocation. It's better do not allow overcharge, neither in #PF not in regular allocations. 
However, this failure will trigger a false global OOM in pagefault_out_of_memory(), and we need to find some way to prevent it. Thank you, Vasily Averin ^ permalink raw reply [flat|nested] 66+ messages in thread
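The failure chain being described can be sketched in userspace C (the function and flag names here are illustrative stand-ins, not the real kernel signatures): a memcg charge failure during a page fault surfaces as VM_FAULT_OOM, and the unpatched `pagefault_out_of_memory()` — apart from `mem_cgroup_oom_synchronize()` — has no memcg awareness, so it falls through to the global OOM killer even though the host has plenty of free memory.

```c
#include <assert.h>
#include <stdbool.h>

enum fault_result { FAULT_OK, FAULT_OOM };

/* a #PF whose memcg charge fails surfaces as VM_FAULT_OOM */
static enum fault_result handle_fault(bool memcg_charge_fails)
{
	return memcg_charge_fails ? FAULT_OOM : FAULT_OK;
}

/* unpatched pagefault_out_of_memory(): only a pending memcg OOM is
 * recognized; a bare charge failure falls through to the global killer */
static bool global_oom_invoked(enum fault_result r, bool memcg_oom_pending)
{
	if (r != FAULT_OOM)
		return false;
	if (memcg_oom_pending)	/* mem_cgroup_oom_synchronize(true) handles it */
		return false;
	return true;		/* out_of_memory(&oc): the false global OOM */
}
```

The case `global_oom_invoked(FAULT_OOM, false)` is precisely the false global OOM the thread is trying to prevent.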
[parent not found: <3c76e2d7-e545-ef34-b2c3-a5f63b1eff51-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks [not found] ` <3c76e2d7-e545-ef34-b2c3-a5f63b1eff51-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-20 8:07 ` Vasily Averin [not found] ` <f40cd82c-f03a-4d36-e953-f89399cb8f58-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Vasily Averin @ 2021-10-20 8:07 UTC (permalink / raw) To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A Memory cgroup charging allows killed or exiting tasks to exceed the hard limit. It is assumed that the amount of the memory charged by those tasks is bound and most of the memory will get released while the task is exiting. This is resembling a heuristic for the global OOM situation when tasks get access to memory reserves. There is no global memory shortage at the memcg level so the memcg heuristic is more relieved. The above assumption is overly optimistic though. E.g. vmalloc can scale to really large requests and the heuristic would allow that. We used to have an early break in the vmalloc allocator for killed tasks but this has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when the current task is killed""). There are likely other similar code paths which do not check for fatal signals in an allocation&charge loop. Also there are some kernel objects charged to a memcg which are not bound to a process life time. It has been observed that it is not really hard to trigger these bypasses and cause global OOM situation. One potential way to address these runaways would be to limit the amount of excess (similar to the global OOM with limited oom reserves). 
This is certainly possible but it is not really clear how much of an excess is desirable and still protects from global OOMs as that would have to consider the overall memcg configuration. This patch is addressing the problem by removing the heuristic altogether. Bypass is only allowed for requests which either cannot fail or where the failure is not desirable while excess should be still limited (e.g. atomic requests). Implementation-wise a killed or dying task fails to charge if it has passed the OOM killer stage. That should give all forms of reclaim a chance to restore the limit before the failure (ENOMEM) and tell the caller to back off. In addition, this patch renames the should_force_charge() helper to task_is_dying() because now its use is not associated with forced charging. If try_charge_memcg() is called from #PF, its new failures can force pagefault_out_of_memory() to execute the global OOM. To prevent it, pagefault_out_of_memory() was updated to properly handle memcg-related restrictions. 
Suggested-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> --- v4: updated pagefault_out_of_memory() to properly handle new memcg-related restrictions and not allow false global OOM v3: no functional changes, just improved patch description v2: swicthed to patch version proposed by mhocko@ mm/memcontrol.c | 52 ++++++++++++++++++++++++++++--------------------- mm/oom_kill.c | 3 +++ 2 files changed, 33 insertions(+), 22 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6da5020a8656..b09d3c64f63f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -239,7 +239,7 @@ enum res_type { iter != NULL; \ iter = mem_cgroup_iter(NULL, iter, NULL)) -static inline bool should_force_charge(void) +static inline bool task_is_dying(void) { return tsk_is_oom_victim(current) || fatal_signal_pending(current) || (current->flags & PF_EXITING); @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, * A few threads which were not waiting at mutex_lock_killable() can * fail to bail out. Therefore, check again after holding oom_lock. */ - ret = should_force_charge() || out_of_memory(&oc); + ret = task_is_dying() || out_of_memory(&oc); unlock: mutex_unlock(&oom_lock); @@ -1810,11 +1810,21 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int mem_cgroup_oom_notify(memcg); mem_cgroup_unmark_under_oom(memcg); - if (mem_cgroup_out_of_memory(memcg, mask, order)) + if (mem_cgroup_out_of_memory(memcg, mask, order)) { ret = OOM_SUCCESS; - else + } else { ret = OOM_FAILED; - + /* + * In some rare cases mem_cgroup_out_of_memory() can return false. + * If it was called from #PF it forces handle_mm_fault() + * return VM_FAULT_OOM and executes pagefault_out_of_memory(). + * memcg_in_oom is set here to notify pagefault_out_of_memory() + * that it was a memcg-related failure and not allow to run + * global OOM. 
+ */ + if (current->in_user_fault) + current->memcg_in_oom = (struct mem_cgroup *)ret; + } if (locked) mem_cgroup_oom_unlock(memcg); @@ -1848,6 +1858,15 @@ bool mem_cgroup_oom_synchronize(bool handle) if (!memcg) return false; + /* OOM is memcg, however out_of_memory() found no victim */ + if (memcg == (struct mem_cgroup *)OOM_FAILED) { + /* + * Should be called from pagefault_out_of_memory() only, + * where it is used to prevent false global OOM. + */ + current->memcg_in_oom = NULL; + return true; + } if (!handle) goto cleanup; @@ -2530,6 +2549,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, struct page_counter *counter; enum oom_status oom_status; unsigned long nr_reclaimed; + bool passed_oom = false; bool may_swap = true; bool drained = false; unsigned long pflags; @@ -2564,15 +2584,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, if (gfp_mask & __GFP_ATOMIC) goto force; - /* - * Unlike in global OOM situations, memcg is not in a physical - * memory shortage. Allow dying and OOM-killed tasks to - * bypass the last charges so that they can exit quickly and - * free their memory. - */ - if (unlikely(should_force_charge())) - goto force; - /* * Prevent unbounded recursion when reclaim operations need to * allocate memory. 
This might exceed the limits temporarily, @@ -2630,8 +2641,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, if (gfp_mask & __GFP_RETRY_MAYFAIL) goto nomem; - if (fatal_signal_pending(current)) - goto force; + /* Avoid endless loop for tasks bypassed by the oom killer */ + if (passed_oom && task_is_dying()) + goto nomem; /* * keep retrying as long as the memcg oom killer is able to make @@ -2640,14 +2652,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, */ oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages * PAGE_SIZE)); - switch (oom_status) { - case OOM_SUCCESS: + if (oom_status == OOM_SUCCESS) { + passed_oom = true; nr_retries = MAX_RECLAIM_RETRIES; goto retry; - case OOM_FAILED: - goto force; - default: - goto nomem; } nomem: if (!(gfp_mask & __GFP_NOFAIL)) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 831340e7ad8b..1deef8c7a71b 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void) if (mem_cgroup_oom_synchronize(true)) return; + if (fatal_signal_pending(current)) + return; + if (!mutex_trylock(&oom_lock)) return; out_of_memory(&oc); -- 2.32.0 ^ permalink raw reply related [flat|nested] 66+ messages in thread
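The retry logic the v4 patch introduces in try_charge_memcg() can be simulated in a few lines of userspace C (parameter names like `failing_attempts` and `self_killed` are invented for this sketch; the loop structure follows the patch): after a successful memcg OOM kill the charge is retried with a fresh retry budget, but a task that has already passed the OOM stage and is itself dying now gets -ENOMEM, where the old code would "goto force" and bypass the limit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simulation of the patched retry loop:
 *  - failing_attempts: how many charge attempts fail before reclaim or
 *    an OOM kill frees enough room under the limit (0 = first try fits)
 *  - oom_succeeds: whether mem_cgroup_oom() can find a victim
 *  - self_killed: whether the victim picked was the caller itself
 * Returns 0 on success, -ENOMEM on failure. */
static int try_charge_sim(int failing_attempts, bool oom_succeeds,
			  bool self_killed)
{
	bool passed_oom = false;
	bool dying = false;
	int attempt = 0;

	for (;;) {
		if (attempt++ >= failing_attempts)
			return 0;	/* charge now fits under the limit */
		if (passed_oom && dying)
			return -ENOMEM;	/* no more bypass for dying tasks */
		if (!oom_succeeds)
			return -ENOMEM;	/* OOM_FAILED: fail the charge */
		passed_oom = true;	/* OOM_SUCCESS: retry the charge */
		dying = self_killed;	/* the OOM killer may have picked us */
	}
}
```

The interesting case is the last one below: the killer succeeded, but it picked the charging task, which then fails with -ENOMEM instead of overcharging forever.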
[parent not found: <f40cd82c-f03a-4d36-e953-f89399cb8f58-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks [not found] ` <f40cd82c-f03a-4d36-e953-f89399cb8f58-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-20 8:43 ` Michal Hocko [not found] ` <YW/WoJDFM3ddHn7Y-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> [not found] ` <cover.1634730787.git.vvs@virtuozzo.com> 0 siblings, 2 replies; 66+ messages in thread From: Michal Hocko @ 2021-10-20 8:43 UTC (permalink / raw) To: Vasily Averin Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On Wed 20-10-21 11:07:02, Vasily Averin wrote: [...] I haven't read through the changelog and only focused on the patch this time. [...] > @@ -1810,11 +1810,21 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int > mem_cgroup_oom_notify(memcg); > > mem_cgroup_unmark_under_oom(memcg); > - if (mem_cgroup_out_of_memory(memcg, mask, order)) > + if (mem_cgroup_out_of_memory(memcg, mask, order)) { > ret = OOM_SUCCESS; > - else > + } else { > ret = OOM_FAILED; > - > + /* > + * In some rare cases mem_cgroup_out_of_memory() can return false. > + * If it was called from #PF it forces handle_mm_fault() > + * return VM_FAULT_OOM and executes pagefault_out_of_memory(). > + * memcg_in_oom is set here to notify pagefault_out_of_memory() > + * that it was a memcg-related failure and not allow to run > + * global OOM. 
> + */ > + if (current->in_user_fault) > + current->memcg_in_oom = (struct mem_cgroup *)ret; > + } > if (locked) > mem_cgroup_oom_unlock(memcg); > > @@ -1848,6 +1858,15 @@ bool mem_cgroup_oom_synchronize(bool handle) > if (!memcg) > return false; > > + /* OOM is memcg, however out_of_memory() found no victim */ > + if (memcg == (struct mem_cgroup *)OOM_FAILED) { > + /* > + * Should be called from pagefault_out_of_memory() only, > + * where it is used to prevent false global OOM. > + */ > + current->memcg_in_oom = NULL; > + return true; > + } > if (!handle) > goto cleanup; I have to say I am not a great fan of this but this belongs to a separate patch on top of all the previous ones. [...] > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 831340e7ad8b..1deef8c7a71b 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void) > if (mem_cgroup_oom_synchronize(true)) > return; > > + if (fatal_signal_pending(current)) > + return; > + This belongs to its own patch as well. All that being said I would go with pagefault_out_of_memory as the first patch because it is necessary to handle charge failures. Then go with a patch to remove charge forcing when OOM killer succeeds but the retry still fails and finally go with one that tries to handle oom failures. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <YW/WoJDFM3ddHn7Y-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* [PATCH memcg RFC 0/3] memcg: prohibit unconditional exceeding the limit of dying tasks [not found] ` <YW/WoJDFM3ddHn7Y-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2021-10-20 12:11 ` Vasily Averin 0 siblings, 0 replies; 66+ messages in thread From: Vasily Averin @ 2021-10-20 12:11 UTC (permalink / raw) To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A Dear Michal, as you requested, I split the v4 patch into 3 separate parts. Let's discuss each of them. Open questions/ToDo: - Update patch descriptions (and add some comments) - in part 2 aka "memcg: remove charge forcinig for dying tasks": should we keep task_is_dying() in mem_cgroup_out_of_memory() ? - in part 3 aka "memcg: handle memcg oom failures": what is the best way to notify pagefault_out_of_memory() about mem_cgroup_out_of_memory failure ? - what is the best way to handle a memcg failure due to the kmem limit? it can trigger a false global OOM Vasily Averin (3): mm: do not firce global OOM from inside dying tasks memcg: remove charge forcinig for dying tasks memcg: handle memcg oom failures mm/memcontrol.c | 52 ++++++++++++++++++++++++++++--------------------- mm/oom_kill.c | 3 +++ 2 files changed, 33 insertions(+), 22 deletions(-) -- 2.32.0 ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <cover.1634730787.git.vvs@virtuozzo.com>]
[parent not found: <cover.1634730787.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* [PATCH memcg 1/3] mm: do not firce global OOM from inside dying tasks [not found] ` <cover.1634730787.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-20 12:12 ` Vasily Averin [not found] ` <2c13c739-7282-e6f4-da0a-c0b69e68581e-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 2021-10-20 12:13 ` [PATCH memcg 2/3] memcg: remove charge forcinig for " Vasily Averin 2021-10-20 12:14 ` [PATCH memcg 3/3] memcg: handle memcg oom failures Vasily Averin 2 siblings, 1 reply; 66+ messages in thread From: Vasily Averin @ 2021-10-20 12:12 UTC (permalink / raw) To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A There is no sense to force global OOM if current task is dying. Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> --- mm/oom_kill.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 831340e7ad8b..1deef8c7a71b 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void) if (mem_cgroup_oom_synchronize(true)) return; + if (fatal_signal_pending(current)) + return; + if (!mutex_trylock(&oom_lock)) return; out_of_memory(&oc); -- 2.32.0 ^ permalink raw reply related [flat|nested] 66+ messages in thread
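The decision order patch 1/3 establishes in pagefault_out_of_memory() can be sketched as a small userspace function (the `pf_action` enum and `pagefault_oom_sim` name are invented for illustration; only the ordering of checks follows the patch): a pending memcg OOM is synchronized first, a task with a pending fatal signal is now left to die, and only otherwise does the global OOM killer run.

```c
#include <assert.h>
#include <stdbool.h>

enum pf_action {
	ACT_MEMCG_HANDLED,	/* mem_cgroup_oom_synchronize(true) handled it */
	ACT_LET_TASK_DIE,	/* new early return: pending fatal signal */
	ACT_GLOBAL_OOM		/* out_of_memory(&oc) */
};

/* decision order in pagefault_out_of_memory() with the patch applied */
static enum pf_action pagefault_oom_sim(bool memcg_oom_pending,
					bool fatal_signal_pending)
{
	if (memcg_oom_pending)
		return ACT_MEMCG_HANDLED;
	if (fatal_signal_pending)
		return ACT_LET_TASK_DIE;	/* die, no false global OOM */
	return ACT_GLOBAL_OOM;
}
```

Note the fatal-signal check sits after the memcg synchronization, so a genuine memcg OOM is still accounted before the early return takes effect.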
[parent not found: <2c13c739-7282-e6f4-da0a-c0b69e68581e-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg 1/3] mm: do not firce global OOM from inside dying tasks [not found] ` <2c13c739-7282-e6f4-da0a-c0b69e68581e-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-20 12:33 ` Michal Hocko [not found] ` <YXAMpxjuV/h2awqG-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2021-10-20 12:33 UTC (permalink / raw) To: Vasily Averin Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A s@firce@force@ On Wed 20-10-21 15:12:19, Vasily Averin wrote: > There is no sense to force global OOM if current task is dying. This really begs for much more information. Feel free to get an inspiration from my previous attempt to achieve something similar. In minimum it is important to mention that the OOM killer is already handled at the page allocator level for the global OOM and at the charging level for the memcg one. Both have much more information about the scope of allocation/charge request. This means that either the OOM killer has been invoked properly and didn't lead to the allocation success or it has been skipped because it couldn't have been invoked. In both cases triggering it from here is pointless and even harmful. Another argument is that it is more reasonable to let killed task die rather than hit the oom killer and retry the allocation. 
> Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> > --- > mm/oom_kill.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 831340e7ad8b..1deef8c7a71b 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void) > if (mem_cgroup_oom_synchronize(true)) > return; > > + if (fatal_signal_pending(current)) > + return; > + > if (!mutex_trylock(&oom_lock)) > return; > out_of_memory(&oc); > -- > 2.32.0 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <YXAMpxjuV/h2awqG-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH memcg 1/3] mm: do not firce global OOM from inside dying tasks
  [not found] ` <YXAMpxjuV/h2awqG-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2021-10-20 13:52   ` Vasily Averin
  0 siblings, 0 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-20 13:52 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
    Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
    Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On 20.10.2021 15:33, Michal Hocko wrote:
> s@firce@force@
>
> On Wed 20-10-21 15:12:19, Vasily Averin wrote:
>> There is no sense to force global OOM if current task is dying.
>
> This really begs for much more information. Feel free to get
> inspiration from my previous attempt to achieve something similar.
> At minimum it is important to mention that the OOM killer is already
> handled at the page allocator level for the global OOM and at the
> charging level for the memcg one. Both have much more information
> about the scope of the allocation/charge request. This means that
> either the OOM killer has been invoked properly and didn't lead to
> allocation success, or it has been skipped because it couldn't have
> been invoked. In both cases triggering it from here is pointless and
> even harmful.
>
> Another argument is that it is more reasonable to let a killed task die
> rather than hit the OOM killer and retry the allocation.

Thank you, I'll update the patch description later;
for now I would like to clarify the patch content.

>> Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
>> ---
>>  mm/oom_kill.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 831340e7ad8b..1deef8c7a71b 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
>>  	if (mem_cgroup_oom_synchronize(true))
>>  		return;
>>
>> +	if (fatal_signal_pending(current))
>> +		return;
>> +
>>  	if (!mutex_trylock(&oom_lock))
>>  		return;
>>  	out_of_memory(&oc);
>> --
>> 2.32.0
>
* [PATCH memcg 2/3] memcg: remove charge forcing for dying tasks
  [not found] ` <cover.1634730787.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  2021-10-20 12:12 ` [PATCH memcg 1/3] mm: do not firce global OOM from inside " Vasily Averin
@ 2021-10-20 12:13 ` Vasily Averin
  [not found]   ` <56180e53-b705-b1be-9b60-75e141c8560c-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  2021-10-20 12:14 ` [PATCH memcg 3/3] memcg: handle memcg oom failures Vasily Averin
  2 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-20 12:13 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
    Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

ToDo: should we keep task_is_dying() in mem_cgroup_out_of_memory()?

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 mm/memcontrol.c | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..74a7379dbac1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,7 +239,7 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))

-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
 		(current->flags & PF_EXITING);
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	ret = task_is_dying() || out_of_memory(&oc);

 unlock:
 	mutex_unlock(&oom_lock);
@@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
+	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
 	unsigned long pflags;
@@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;

-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage. Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;

-	if (fatal_signal_pending(current))
-		goto force;
+	/* Avoid endless loop for tasks bypassed by the oom killer */
+	if (passed_oom && task_is_dying())
+		goto nomem;

 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
@@ -2642,6 +2635,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			get_order(nr_pages * PAGE_SIZE));
 	switch (oom_status) {
 	case OOM_SUCCESS:
+		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
 	case OOM_FAILED:
-- 
2.32.0
* Re: [PATCH memcg 2/3] memcg: remove charge forcing for dying tasks
  [not found]   ` <56180e53-b705-b1be-9b60-75e141c8560c-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-10-20 12:41     ` Michal Hocko
  [not found]       ` <YXAOjQO5r1g/WKmn-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-20 12:41 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
    Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
    Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On Wed 20-10-21 15:13:46, Vasily Averin wrote:
> ToDo: should we keep task_is_dying() in mem_cgroup_out_of_memory()?
>
> Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
> ---
>  mm/memcontrol.c | 20 +++++++-------------
>  1 file changed, 7 insertions(+), 13 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6da5020a8656..74a7379dbac1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -239,7 +239,7 @@ enum res_type {
>  	     iter != NULL;				\
>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
>
> -static inline bool should_force_charge(void)
> +static inline bool task_is_dying(void)
>  {
>  	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
>  		(current->flags & PF_EXITING);
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = should_force_charge() || out_of_memory(&oc);
> +	ret = task_is_dying() || out_of_memory(&oc);

Why are you keeping the task_is_dying check here? IIRC I have already
pointed out that out_of_memory already has some means to do a bypass
when needed.
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH memcg 2/3] memcg: remove charge forcing for dying tasks
  [not found]       ` <YXAOjQO5r1g/WKmn-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2021-10-20 14:21         ` Vasily Averin
  [not found]           ` <cbda9b6b-3ee5-06ab-9a3b-debf361b55bb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-20 14:21 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
    Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
    Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On 20.10.2021 15:41, Michal Hocko wrote:
> On Wed 20-10-21 15:13:46, Vasily Averin wrote:
>> ToDo: should we keep task_is_dying() in mem_cgroup_out_of_memory()?
>>
>> Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
>> ---
>>  mm/memcontrol.c | 20 +++++++-------------
>>  1 file changed, 7 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 6da5020a8656..74a7379dbac1 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -239,7 +239,7 @@ enum res_type {
>>  	     iter != NULL;				\
>>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
>>
>> -static inline bool should_force_charge(void)
>> +static inline bool task_is_dying(void)
>>  {
>>  	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
>>  		(current->flags & PF_EXITING);
>> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>  	 * A few threads which were not waiting at mutex_lock_killable() can
>>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>>  	 */
>> -	ret = should_force_charge() || out_of_memory(&oc);
>> +	ret = task_is_dying() || out_of_memory(&oc);
>
> Why are you keeping the task_is_dying check here? IIRC I have already
> pointed out that out_of_memory already has some means to do a bypass
> when needed.

It was a misunderstanding. I've been waiting for your final decision.

I have no good arguments "pro" or strong objections "contra".
However, I prefer to keep task_is_dying() so as not to touch other
tasks unnecessarily. I can't justify why its removal is necessary, but
on the other hand, it shouldn't break anything.
I can drop it if you think it's necessary.

Thank you,
	Vasily Averin
* Re: [PATCH memcg 2/3] memcg: remove charge forcing for dying tasks
  [not found]           ` <cbda9b6b-3ee5-06ab-9a3b-debf361b55bb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-10-20 14:57             ` Michal Hocko
  [not found]               ` <YXAubuMMgNDeguNx-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-20 14:57 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
    Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
    Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On Wed 20-10-21 17:21:33, Vasily Averin wrote:
> On 20.10.2021 15:41, Michal Hocko wrote:
> > On Wed 20-10-21 15:13:46, Vasily Averin wrote:
> >> ToDo: should we keep task_is_dying() in mem_cgroup_out_of_memory()?
> >>
> >> Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
> >> ---
> >>  mm/memcontrol.c | 20 +++++++-------------
> >>  1 file changed, 7 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 6da5020a8656..74a7379dbac1 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -239,7 +239,7 @@ enum res_type {
> >>  	     iter != NULL;				\
> >>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
> >>
> >> -static inline bool should_force_charge(void)
> >> +static inline bool task_is_dying(void)
> >>  {
> >>  	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
> >>  		(current->flags & PF_EXITING);
> >> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>  	 * A few threads which were not waiting at mutex_lock_killable() can
> >>  	 * fail to bail out. Therefore, check again after holding oom_lock.
> >>  	 */
> >> -	ret = should_force_charge() || out_of_memory(&oc);
> >> +	ret = task_is_dying() || out_of_memory(&oc);
> >
> > Why are you keeping the task_is_dying check here? IIRC I have already
> > pointed out that out_of_memory already has some means to do a bypass
> > when needed.
>
> It was a misunderstanding.

Sorry if I made you confused.

> I've been waiting for your final decision.
>
> I have no good arguments "pro" or strong objections "contra".
> However, I prefer to keep task_is_dying() so as not to touch other
> tasks unnecessarily.

One argument for removing it from here is maintainability. Now you
have a memcg-specific check which is not in sync with the OOM path.
E.g. out_of_memory does task_will_free_mem as the very first thing.
You are also automatically excluding the OOM killer for cases where
that might make sense.
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH memcg 2/3] memcg: remove charge forcing for dying tasks
  [not found]               ` <YXAubuMMgNDeguNx-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2021-10-20 15:20                 ` Tetsuo Handa
  [not found]                   ` <dee26724-3ead-24d4-0c1b-23905bfcdae9-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Tetsuo Handa @ 2021-10-20 15:20 UTC (permalink / raw)
To: Michal Hocko, Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
    Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On 2021/10/20 23:57, Michal Hocko wrote:
> One argument for removing it from here is maintainability. Now you
> have a memcg-specific check which is not in sync with the OOM path.
> E.g. out_of_memory does task_will_free_mem as the very first thing.
> You are also automatically excluding the OOM killer for cases where
> that might make sense.

What makes it possible to remove this check?
This check was added because task_will_free_mem() in out_of_memory()
does NOT work. See commit 7775face207922ea ("memcg: killed threads
should not invoke memcg OOM killer").
* Re: [PATCH memcg 2/3] memcg: remove charge forcing for dying tasks
  [not found]                   ` <dee26724-3ead-24d4-0c1b-23905bfcdae9-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org>
@ 2021-10-21 10:03                     ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-21 10:03 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Andrew Morton,
    Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
    Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On Thu 21-10-21 00:20:02, Tetsuo Handa wrote:
> On 2021/10/20 23:57, Michal Hocko wrote:
> > One argument for removing it from here is maintainability. Now you
> > have a memcg-specific check which is not in sync with the OOM path.
> > E.g. out_of_memory does task_will_free_mem as the very first thing.
> > You are also automatically excluding the OOM killer for cases where
> > that might make sense.
>
> What makes it possible to remove this check?
> This check was added because task_will_free_mem() in out_of_memory()
> does NOT work. See commit 7775face207922ea ("memcg: killed threads
> should not invoke memcg OOM killer").

You are right. I'd forgotten about this and should have checked git
blame. Thanks for bringing this up!
-- 
Michal Hocko
SUSE Labs
* [PATCH memcg 3/3] memcg: handle memcg oom failures
  [not found] ` <cover.1634730787.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  2021-10-20 12:12 ` [PATCH memcg 1/3] mm: do not firce global OOM from inside " Vasily Averin
  2021-10-20 12:13 ` [PATCH memcg 2/3] memcg: remove charge forcing for " Vasily Averin
@ 2021-10-20 12:14 ` Vasily Averin
  2021-10-20 13:02   ` Michal Hocko
  2 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-20 12:14 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
    Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

mem_cgroup_oom() can fail if the current task was marked unkillable
and the OOM killer cannot find any victim.

Currently we force the memcg charge for such allocations; however, this
allows a memcg-limited userspace task to overuse its assigned limits
and potentially trigger a global memory shortage.

Let's fail the memory charge in such cases.

This failure should be somehow recognised in #PF context,
so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED

ToDo: what is the best way to notify pagefault_out_of_memory() about
mem_cgroup_out_of_memory() failure?

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 mm/memcontrol.c | 32 +++++++++++++++++++++++---------
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74a7379dbac1..b09d3c64f63f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1810,11 +1810,21 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 	mem_cgroup_oom_notify(memcg);

 	mem_cgroup_unmark_under_oom(memcg);
-	if (mem_cgroup_out_of_memory(memcg, mask, order))
+	if (mem_cgroup_out_of_memory(memcg, mask, order)) {
 		ret = OOM_SUCCESS;
-	else
+	} else {
 		ret = OOM_FAILED;
-
+		/*
+		 * In some rare cases mem_cgroup_out_of_memory() can return false.
+		 * If it was called from #PF it forces handle_mm_fault()
+		 * return VM_FAULT_OOM and executes pagefault_out_of_memory().
+		 * memcg_in_oom is set here to notify pagefault_out_of_memory()
+		 * that it was a memcg-related failure and not allow to run
+		 * global OOM.
+		 */
+		if (current->in_user_fault)
+			current->memcg_in_oom = (struct mem_cgroup *)ret;
+	}
 	if (locked)
 		mem_cgroup_oom_unlock(memcg);
@@ -1848,6 +1858,15 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	if (!memcg)
 		return false;

+	/* OOM is memcg, however out_of_memory() found no victim */
+	if (memcg == (struct mem_cgroup *)OOM_FAILED) {
+		/*
+		 * Should be called from pagefault_out_of_memory() only,
+		 * where it is used to prevent false global OOM.
+		 */
+		current->memcg_in_oom = NULL;
+		return true;
+	}
 	if (!handle)
 		goto cleanup;
@@ -2633,15 +2652,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
 		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
-		goto force;
-	default:
-		goto nomem;
 	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
-- 
2.32.0
* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
  2021-10-20 12:14 ` [PATCH memcg 3/3] memcg: handle memcg oom failures Vasily Averin
@ 2021-10-20 13:02   ` Michal Hocko
  [not found]     ` <YXATW7KsUZzbbGHy-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-20 13:02 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
    Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
    Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On Wed 20-10-21 15:14:27, Vasily Averin wrote:
> mem_cgroup_oom() can fail if the current task was marked unkillable
> and the OOM killer cannot find any victim.
>
> Currently we force the memcg charge for such allocations; however, this
> allows a memcg-limited userspace task to overuse its assigned limits
> and potentially trigger a global memory shortage.

You should really go into more detail about whether that is a practical
problem to handle. OOM_FAILED means that the memcg OOM killer couldn't
find any OOM victim, so it cannot help with forward progress. There are
not that many situations when that can happen. Naming those would be
really useful.

> Let's fail the memory charge in such cases.
>
> This failure should be somehow recognised in #PF context,

explain why

> so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
>
> ToDo: what is the best way to notify pagefault_out_of_memory() about
> mem_cgroup_out_of_memory() failure?

why don't you simply remove out_of_memory from pagefault_out_of_memory
and leave it only with the blocking memcg OOM handling? Wouldn't that
be a more generic solution? Your first patch already goes that way
partially.

This change is more risky than the first one. If somebody returns
VM_FAULT_OOM without invoking the allocator then it can loop forever,
but invoking the OOM killer in such a situation is equally wrong, as
the OOM killer cannot really help, right?
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
  [not found]     ` <YXATW7KsUZzbbGHy-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2021-10-20 15:46       ` Vasily Averin
  [not found]         ` <d3b32c72-6375-f755-7599-ab804719e1f6-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-20 15:46 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
    Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
    Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On 20.10.2021 16:02, Michal Hocko wrote:
> On Wed 20-10-21 15:14:27, Vasily Averin wrote:
>> mem_cgroup_oom() can fail if the current task was marked unkillable
>> and the OOM killer cannot find any victim.
>>
>> Currently we force the memcg charge for such allocations; however,
>> this allows a memcg-limited userspace task to overuse its assigned
>> limits and potentially trigger a global memory shortage.
>
> You should really go into more detail about whether that is a practical
> problem to handle. OOM_FAILED means that the memcg OOM killer couldn't
> find any OOM victim, so it cannot help with forward progress. There are
> not that many situations when that can happen. Naming those would be
> really useful.

I pointed it out above:
"if the current task was marked unkillable and the OOM killer cannot
find any victim."
This may happen when the current task cannot be OOM-killed because it
is marked unkillable, i.e. it has
p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN,
and the other processes in the memcg are either dying, are kernel
threads, or are marked unkillable the same way. Or when the memcg
contains only this process.

If we always approve such allocations, this can be misused:
a process can mmap a lot of memory, then touch it, generate page
faults and make overcharged memory allocations.
Finally it can consume all node memory and trigger a global memory
shortage on the host.

>> Let's fail the memory charge in such cases.
>>
>> This failure should be somehow recognised in #PF context,
>
> explain why

When #PF cannot allocate memory (for the reason described above),
handle_mm_fault() returns VM_FAULT_OOM, and its caller executes
pagefault_out_of_memory(). If the latter cannot recognize the real
reason for the failure, it assumes a global memory shortage and
executes the global out_of_memory(), which can kill a random process
or even crash the node if sysctl vm.panic_on_oom is set to 1.

Currently pagefault_out_of_memory() knows about a possible async memcg
OOM and handles it correctly. However, it is not aware that the memcg
can reject some other allocations; it does not recognize the fault as
memcg-related and allows the global OOM to run.

>> so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
>>
>> ToDo: what is the best way to notify pagefault_out_of_memory() about
>> mem_cgroup_out_of_memory() failure?
>
> why don't you simply remove out_of_memory from pagefault_out_of_memory
> and leave it only with the blocking memcg OOM handling? Wouldn't that
> be a more generic solution? Your first patch already goes that way
> partially.

I clearly understand that the global out_of_memory should not be
triggered by memcg restrictions. I clearly understand that a dying
task will release some memory soon, so we need not run the global OOM
if the current task is dying.

However, I'm not sure that we can remove out_of_memory entirely.
At least I do not have good arguments for doing so.

> This change is more risky than the first one. If somebody returns
> VM_FAULT_OOM without invoking the allocator then it can loop forever,
> but invoking the OOM killer in such a situation is equally wrong, as
> the OOM killer cannot really help, right?

I'm not ready to comment on this right now; let me take some time to
think about it.

Thank you,
	Vasily Averin
* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
  [not found]         ` <d3b32c72-6375-f755-7599-ab804719e1f6-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-10-21 11:49           ` Michal Hocko
  [not found]             ` <YXFPSvGFV539OcEk-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-21 11:49 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
    Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
    Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On Wed 20-10-21 18:46:56, Vasily Averin wrote:
> On 20.10.2021 16:02, Michal Hocko wrote:
> > On Wed 20-10-21 15:14:27, Vasily Averin wrote:
> >> mem_cgroup_oom() can fail if the current task was marked unkillable
> >> and the OOM killer cannot find any victim.
> >>
> >> Currently we force the memcg charge for such allocations; however,
> >> this allows a memcg-limited userspace task to overuse its assigned
> >> limits and potentially trigger a global memory shortage.
> >
> > You should really go into more detail about whether that is a
> > practical problem to handle. OOM_FAILED means that the memcg OOM
> > killer couldn't find any OOM victim, so it cannot help with forward
> > progress. There are not that many situations when that can happen.
> > Naming those would be really useful.
>
> I pointed it out above:
> "if the current task was marked unkillable and the OOM killer cannot
> find any victim."
> This may happen when the current task cannot be OOM-killed because it
> is marked unkillable, i.e. it has
> p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN,
> and the other processes in the memcg are either dying, are kernel
> threads, or are marked unkillable the same way. Or when the memcg
> contains only this process.
>
> If we always approve such allocations, this can be misused:
> a process can mmap a lot of memory, then touch it, generate page
> faults and make overcharged memory allocations.
> Finally it can consume all node memory and trigger a global memory
> shortage on the host.

Yes, this is true, but a) OOM_SCORE_ADJ_MIN tasks are excluded from the
OOM handling so they have to be careful with their memory consumption,
and b) is this a theoretical or a practical concern? This is mostly
what I wanted to make sure you describe in the changelog.

> >> Let's fail the memory charge in such cases.
> >>
> >> This failure should be somehow recognised in #PF context,
> >
> > explain why
>
> When #PF cannot allocate memory (for the reason described above),
> handle_mm_fault() returns VM_FAULT_OOM, and its caller executes
> pagefault_out_of_memory(). If the latter cannot recognize the real
> reason for the failure, it assumes a global memory shortage and
> executes the global out_of_memory(), which can kill a random process
> or even crash the node if sysctl vm.panic_on_oom is set to 1.
>
> Currently pagefault_out_of_memory() knows about a possible async memcg
> OOM and handles it correctly. However, it is not aware that the memcg
> can reject some other allocations; it does not recognize the fault as
> memcg-related and allows the global OOM to run.

Again, something to be added to the changelog.

> >> so let's use
> >> current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
> >>
> >> ToDo: what is the best way to notify pagefault_out_of_memory()
> >> about mem_cgroup_out_of_memory() failure?
> >
> > why don't you simply remove out_of_memory from
> > pagefault_out_of_memory and leave it only with the blocking memcg
> > OOM handling? Wouldn't that be a more generic solution? Your first
> > patch already goes that way partially.
>
> I clearly understand that the global out_of_memory should not be
> triggered by memcg restrictions. I clearly understand that a dying
> task will release some memory soon, so we need not run the global OOM
> if the current task is dying.
>
> However, I'm not sure that we can remove out_of_memory entirely.
> At least I do not have good arguments for doing so.

I do understand that handling a very specific case sounds easier, but
it would be better to have a robust fix even if that requires some more
head scratching. So far we have collected several reasons why it is bad
to trigger the OOM killer from the #PF path. There is no single
argument to keep it, so it sounds like a viable path to pursue. Maybe
there are some very well hidden reasons, but those should be
documented, and this is a great opportunity to do either of those
steps.

Moreover, if it turns out that there is a regression, then this can be
easily reverted and a different, maybe memcg-specific, solution can be
implemented.
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
  [not found]             ` <YXFPSvGFV539OcEk-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2021-10-21 15:05               ` Vasily Averin
  [not found]                 ` <b618ac5c-e982-c4af-ecf3-564b8de52c8c-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-21 15:05 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
    Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
    Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A

On 21.10.2021 14:49, Michal Hocko wrote:
> I do understand that handling a very specific case sounds easier, but
> it would be better to have a robust fix even if that requires some
> more head scratching. So far we have collected several reasons why it
> is bad to trigger the OOM killer from the #PF path. There is no single
> argument to keep it, so it sounds like a viable path to pursue. Maybe
> there are some very well hidden reasons, but those should be
> documented, and this is a great opportunity to do either of those
> steps.
>
> Moreover, if it turns out that there is a regression, then this can be
> easily reverted and a different, maybe memcg-specific, solution can be
> implemented.

Now I agree; however, I still have a few open questions.

1) VM_FAULT_OOM may be triggered without execution of out_of_memory():
for example, it can be caused by incorrect vm fault operations,
(a) which can return this error without calling the allocator at all,
(b) or which can provide incorrect gfp flags, so that the allocator
fails without executing out_of_memory,
(c) this may also happen on stable/LTS kernels when a successful
allocation is failed by hitting the limit of the legacy memcg-kmem
controller. We'll drop that in upstream kernels; however, how do we
handle it in old kernels?

We can make sure that out_of_memory() or the allocator was called by
setting some per-task flags.

Can pagefault_out_of_memory() send itself a SIGKILL in all these cases?
If not, the task will loop. That is much better than execution of the
global OOM; however, it would be even better to avoid it somehow.

You said: "We cannot really kill the task if we could we would have
done it by the oom killer already".
However, what do we do if we never even tried to use the OOM killer?
(see (b) and (c)) Or if we did not use the allocator at all? (see (a))

2) In your patch we just exit from pagefault_out_of_memory() and
restart a new #PF. We could call schedule_timeout() and wait some time
before the new #PF restart. Additionally, we could increase this delay
in each new cycle. It would help to save CPU time for other tasks.
What do you think?

Thank you,
	Vasily Averin
* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures [not found] ` <b618ac5c-e982-c4af-ecf3-564b8de52c8c-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-21 16:47 ` Michal Hocko [not found] ` <YXGZoVhROdFG2Wym-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> [not found] ` <cover.1634889066.git.vvs@virtuozzo.com> 0 siblings, 2 replies; 66+ messages in thread From: Michal Hocko @ 2021-10-21 16:47 UTC (permalink / raw) To: Vasily Averin Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On Thu 21-10-21 18:05:28, Vasily Averin wrote: > On 21.10.2021 14:49, Michal Hocko wrote: > > I do understand that handling a very specific case sounds easier but it > > would be better to have a robust fix even if that requires some more > > head scratching. So far we have collected several reasons why the it is > > bad to trigger oom killer from the #PF path. There is no single argument > > to keep it so it sounds like a viable path to pursue. Maybe there are > > some very well hidden reasons but those should be documented and this is > > a great opportunity to do either of the step. > > > > Moreover if it turns out that there is a regression then this can be > > easily reverted and a different, maybe memcg specific, solution can be > > implemented. > > Now I'm agree, > however I still have a few open questions. > > 1) VM_FAULT_OOM may be triggered w/o execution of out_of_memory() > for exampel it can be caused by incorrect vm fault operations, > (a) which can return this error without calling allocator at all. I would argue this to be a bug. How can that particular code tell whether the system is OOM and the oom killer is the a reasonable measure to take? > (b) or which can provide incorrect gfp flags and allocator can fail without execution of out_of_memory. 
I am not sure I can see any sensible scenario where the pagefault oom
killer would be an appropriate fix for that.

> (c) This may happen on stable/LTS kernels when a successful allocation
> failed due to hitting the limit of the legacy memcg-kmem controller.
> We'll drop it in upstream kernels, however how to handle it in old kernels?

Triggering the global oom killer for a legacy kmem charge failure is
clearly wrong. Removing the oom killer from #PF would fix that problem.

> We can make sure that out_of_memory or the allocator was called by setting
> some per-task flags.

I am not sure I see how that would be useful other than reporting a
dubious VM_FAULT_OOM usage. I am also not sure how that would be
implemented, as the allocator can be called several times, not to mention
that the allocation itself could have been done from a different context
- e.g. WQ.

> Can pagefault_out_of_memory() send itself a SIGKILL in all these cases?

In principle it can, as sending a signal is not prohibited. I would argue
it should not though, because it is just the wrong thing to do in all
those cases.

> If not -- the task will loop.

Yes, but it will be killable from userspace. So this is not an
unrecoverable situation.

> It is much better than execution of a global OOM, however it would be
> even better to avoid it somehow.

How?

> You said: "We cannot really kill the task if we could we would have done
> it by the oom killer already".
> However what to do if we have not even tried to use the oom-killer? (see (b) and (c))
> or if we did not use the allocator at all (see (a))

See above

> 2) in your patch we just exit from pagefault_out_of_memory() and restart
> a new #PF.
> We can call schedule_timeout() and wait some time before a new #PF restart.
> Additionally we can increase this delay in each new cycle.
> It helps to save CPU time for other tasks.
> What do you think about this?

I do not have a strong opinion on this. A short sleep makes sense. I am
not sure a more complex implementation is really needed.
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
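The growing-delay idea discussed above (sleep before restarting the #PF, increasing the wait on each cycle) can be sketched outside the kernel. Everything below is illustrative: the function name, the base delay, and the cap are assumptions for the sketch, not kernel API.

```c
/* Hypothetical helper: compute the sleep (in jiffy-like units) before the
 * N-th page-fault retry.  The delay doubles on every failed attempt and is
 * capped so a long-looping task never sleeps for an unbounded time. */
static unsigned long pf_retry_delay(unsigned int attempt)
{
    const unsigned long base = 1;  /* assumed initial delay */
    const unsigned long cap  = 64; /* assumed upper bound on the delay */
    unsigned long d = base;

    while (attempt-- > 0 && d < cap)
        d *= 2;
    return d < cap ? d : cap;
}
```

This matches Michal's point that a short sleep suffices: the schedule stays short for the first few retries and the cap keeps the "more complex implementation" from mattering much.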
[parent not found: <YXGZoVhROdFG2Wym-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* [PATCH memcg v2 0/2] memcg: prohibit unconditional exceeding the limit of dying tasks
[not found] ` <YXGZoVhROdFG2Wym-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2021-10-22 8:10 ` Vasily Averin
0 siblings, 0 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-22 8:10 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-GEFAQzZX7r8dnm+yROfE0A

Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. This can be misused to trigger a global OOM from inside a
memcg-limited container. On the other hand, if memcg fails an allocation
called from inside the #PF handler, it triggers a global OOM from inside
pagefault_out_of_memory().

To prevent these problems this patch set:
1) removes execution of out_of_memory() from pagefault_out_of_memory(),
because nobody can explain why it is necessary.
2) allows memcg to fail the allocations of dying/killed tasks.

v2: resplit, use old patch from Michal Hocko removing out_of_memory()
from pagefault_out_of_memory()

Michal Hocko (1):
  mm, oom: do not trigger out_of_memory from the #PF

Vasily Averin (1):
  memcg: prohibit unconditional exceeding the limit of dying tasks

 mm/memcontrol.c | 27 ++++++++-------------------
 mm/oom_kill.c   | 23 ++++++++++-------------
 2 files changed, 18 insertions(+), 32 deletions(-)

-- 
2.32.0
[parent not found: <cover.1634889066.git.vvs@virtuozzo.com>]
[parent not found: <cover.1634889066.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF
[not found] ` <cover.1634889066.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-10-22 8:11 ` Vasily Averin
[not found] ` <91d9196e-842a-757f-a3f2-caeb4a89a0d8-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-22 8:11 ` [PATCH memcg v2 2/2] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
1 sibling, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-22 8:11 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-GEFAQzZX7r8dnm+yROfE0A

From: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>

Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory. This can happen for
2 different reasons. a) Memcg is out of memory and we rely on
mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
normal allocation fails.

The latter is quite problematic because allocation paths already trigger
out_of_memory and the page allocator tries really hard to not fail
allocations. Anyway, if the OOM killer has already been invoked there
is no reason to invoke it again from the #PF path. Especially when the
OOM condition might be gone by that time and we have no way to find out
other than allocate.

Moreover, if the allocation failed and the OOM killer hasn't been
invoked then we are unlikely to do the right thing from the #PF context
because we have already lost the allocation context and restrictions and
therefore might oom kill a task from a different NUMA domain.

An allocation might also fail when the current task is the oom victim
and there are no memory reserves left, and we should simply bail out
from the #PF rather than invoking out_of_memory.
This all suggests that there is no legitimate reason to trigger
out_of_memory from pagefault_out_of_memory, so drop it. Just to be sure
that no #PF path returns with VM_FAULT_OOM without allocation, print a
warning that this is happening before we restart the #PF.

[VvS: a #PF allocation can hit the limit of the cgroup v1 kmem
controller. This is a local problem related to memcg, however, it causes
unnecessary global OOM kills that are repeated over and over again and
escalate into a real disaster. It was broken a long time ago, most
likely since 6a1a0d3b625a ("mm: allocate kernel pages to the right memcg").
In upstream the problem will be fixed by removing the outdated kmem
limit, however stable and LTS kernels cannot do it and are still
affected. This patch fixes the problem and should be backported into
stable.]

Fixes: 6a1a0d3b625a ("mm: allocate kernel pages to the right memcg")
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
Reviewed-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 mm/oom_kill.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..f98954befafb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1120,27 +1120,24 @@ bool out_of_memory(struct oom_control *oc)
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context without
+ * any locks held but let the oom killer triggered from the allocation context care
+ * about the global OOM.
  */
 void pagefault_out_of_memory(void)
 {
-	struct oom_control oc = {
-		.zonelist = NULL,
-		.nodemask = NULL,
-		.memcg = NULL,
-		.gfp_mask = 0,
-		.order = 0,
-	};
+	static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
-	if (!mutex_trylock(&oom_lock))
+	if (fatal_signal_pending(current))
 		return;
-	out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
+
+	if (__ratelimit(&pfoom_rs))
+		pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
 }
 
 SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
-- 
2.32.0
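The DEFINE_RATELIMIT_STATE/__ratelimit pair used in this patch lets only a burst of warnings through per interval, so a task looping on VM_FAULT_OOM cannot flood the log. A minimal userspace model of that behaviour follows; the struct and function here are simplified stand-ins for illustration, not the kernel implementation.

```c
/* Simplified model of the kernel's ratelimit logic: allow up to `burst`
 * events per `interval` time units, then suppress everything until a new
 * interval window opens. */
struct ratelimit {
    long interval;   /* window length, in arbitrary time units */
    int  burst;      /* events allowed per window */
    long begin;      /* start of the current window */
    int  printed;    /* events let through in this window */
};

static int ratelimit_ok(struct ratelimit *rs, long now)
{
    if (now - rs->begin >= rs->interval) {  /* window expired: reset it */
        rs->begin = now;
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return 1;                           /* message may be emitted */
    }
    return 0;                               /* suppressed */
}
```

With the kernel defaults (5 s interval, 10-message burst) this is why the leaked-VM_FAULT_OOM warning stays readable even if the #PF restarts in a tight loop.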
[parent not found: <91d9196e-842a-757f-a3f2-caeb4a89a0d8-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF
[not found] ` <91d9196e-842a-757f-a3f2-caeb4a89a0d8-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-10-22 8:55 ` Michal Hocko
0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-22 8:55 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-GEFAQzZX7r8dnm+yROfE0A

On Fri 22-10-21 11:11:09, Vasily Averin wrote:
> From: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
>
> Any allocation failure during the #PF path will return with VM_FAULT_OOM
> which in turn results in pagefault_out_of_memory. This can happen for
> 2 different reasons. a) Memcg is out of memory and we rely on
> mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> normal allocation fails.
>
> The latter is quite problematic because allocation paths already trigger
> out_of_memory and the page allocator tries really hard to not fail
> allocations. Anyway, if the OOM killer has already been invoked there
> is no reason to invoke it again from the #PF path. Especially when the
> OOM condition might be gone by that time and we have no way to find out
> other than allocate.
>
> Moreover, if the allocation failed and the OOM killer hasn't been
> invoked then we are unlikely to do the right thing from the #PF context
> because we have already lost the allocation context and restrictions and
> therefore might oom kill a task from a different NUMA domain.
>
> An allocation might also fail when the current task is the oom victim
> and there are no memory reserves left, and we should simply bail out
> from the #PF rather than invoking out_of_memory.
>
> This all suggests that there is no legitimate reason to trigger
> out_of_memory from pagefault_out_of_memory, so drop it. Just to be sure
> that no #PF path returns with VM_FAULT_OOM without allocation, print a
> warning that this is happening before we restart the #PF.

Thanks for reviving my old patch! I still do think this is the right
direction. I would argue that we should split this into two stages as
already mentioned. First to add a bail out on fatal_signal_pending
because that is a trivial thing to do and it should be obviously safer.
Then we can remove out_of_memory completely. That should be obviously
the right thing to do, but if it turns out we have missed something we
would need to revert that part.

> [VvS: a #PF allocation can hit the limit of the cgroup v1 kmem
> controller. This is a local problem related to memcg, however, it causes
> unnecessary global OOM kills that are repeated over and over again and
> escalate into a real disaster. It was broken a long time ago, most
> likely since 6a1a0d3b625a ("mm: allocate kernel pages to the right memcg").
> In upstream the problem will be fixed by removing the outdated kmem
> limit, however stable and LTS kernels cannot do it and are still
> affected. This patch fixes the problem and should be backported into
> stable.]
>
> Fixes: 6a1a0d3b625a ("mm: allocate kernel pages to the right memcg")

I am not sure the sha1 is the right one, but the issue has been there
since kmem accounting was introduced so it should be around that time.
Maybe do not name the specific commit but stick to something like "This
has been broken since kmem accounting has been introduced for cgroup v1
(3.8). There was no kmem specific reclaim for the separate limit so the
only way to handle kmem hard limit was to return with ENOMEM."

> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

I am still not sure this really needs a stable backport.
There are no bug reports for years so I strongly suspect people are simply not using kmem hard limit with cgroup v1. > Signed-off-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> > Reviewed-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> Btw. you should be adding your s-o-b here. > --- > mm/oom_kill.c | 23 ++++++++++------------- > 1 file changed, 10 insertions(+), 13 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 831340e7ad8b..f98954befafb 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -1120,27 +1120,24 @@ bool out_of_memory(struct oom_control *oc) > } > > /* > - * The pagefault handler calls here because it is out of memory, so kill a > - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom > - * killing is already in progress so do nothing. > + * The pagefault handler calls here because some allocation has failed. We have > + * to take care of the memcg OOM here because this is the only safe context without > + * any locks held but let the oom killer triggered from the allocation context care > + * about the global OOM. > */ > void pagefault_out_of_memory(void) > { > - struct oom_control oc = { > - .zonelist = NULL, > - .nodemask = NULL, > - .memcg = NULL, > - .gfp_mask = 0, > - .order = 0, > - }; > + static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL, > + DEFAULT_RATELIMIT_BURST); > > if (mem_cgroup_oom_synchronize(true)) > return; > > - if (!mutex_trylock(&oom_lock)) > + if (fatal_signal_pending(current)) > return; > - out_of_memory(&oc); > - mutex_unlock(&oom_lock); > + > + if (__ratelimit(&pfoom_rs)) > + pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n"); > } > > SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) > -- > 2.32.0 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH memcg v2 2/2] memcg: prohibit unconditional exceeding the limit of dying tasks [not found] ` <cover.1634889066.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 2021-10-22 8:11 ` [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin @ 2021-10-22 8:11 ` Vasily Averin [not found] ` <4b315938-5600-b7f5-bde9-82f638a2e595-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 1 sibling, 1 reply; 66+ messages in thread From: Vasily Averin @ 2021-10-22 8:11 UTC (permalink / raw) To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A Memory cgroup charging allows killed or exiting tasks to exceed the hard limit. It is assumed that the amount of the memory charged by those tasks is bound and most of the memory will get released while the task is exiting. This is resembling a heuristic for the global OOM situation when tasks get access to memory reserves. There is no global memory shortage at the memcg level so the memcg heuristic is more relieved. The above assumption is overly optimistic though. E.g. vmalloc can scale to really large requests and the heuristic would allow that. We used to have an early break in the vmalloc allocator for killed tasks but this has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when the current task is killed""). There are likely other similar code paths which do not check for fatal signals in an allocation&charge loop. Also there are some kernel objects charged to a memcg which are not bound to a process life time. It has been observed that it is not really hard to trigger these bypasses and cause global OOM situation. One potential way to address these runaways would be to limit the amount of excess (similar to the global OOM with limited oom reserves). 
This is certainly possible, but it is not really clear how much of an
excess is desirable and still protects from global OOMs, as that would
have to consider the overall memcg configuration.

This patch is addressing the problem by removing the heuristic
altogether. Bypass is only allowed for requests which either cannot fail
or where the failure is not desirable while excess should still be
limited (e.g. atomic requests). Implementation wise, a killed or dying
task fails to charge if it has passed the OOM killer stage. That should
give all forms of reclaim a chance to restore the limit before the
failure (ENOMEM) and tell the caller to back off.

In addition, this patch renames the should_force_charge() helper
to task_is_dying() because now its use is not associated with forced
charging.

Fixes: a636b327f731 ("memcg: avoid unnecessary system-wide-oom-killer")
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Suggested-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 mm/memcontrol.c | 27 ++++++++-------------------
 1 file changed, 8 insertions(+), 19 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..87e41c3cac10 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,7 +239,7 @@ enum res_type {
 	 iter != NULL;				\
 	 iter = mem_cgroup_iter(NULL, iter, NULL))
 
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
 		(current->flags & PF_EXITING);
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	ret = task_is_dying() || out_of_memory(&oc);
 
 unlock:
 	mutex_unlock(&oom_lock);
@@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
+	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
 	unsigned long pflags;
@@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage. Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
+	/* Avoid endless loop for tasks bypassed by the oom killer */
+	if (passed_oom && task_is_dying())
+		goto nomem;
 
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
@@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
+		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
-		goto force;
-	default:
-		goto nomem;
 	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
-- 
2.32.0
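The behavioural change in try_charge_memcg() above can be modelled in isolation: a dying task is no longer force-charged up front; it only fails with ENOMEM once the memcg OOM killer stage has already been passed, while healthy tasks keep retrying as long as the OOM killer makes progress. Everything in this toy userspace model — the function name, the attempt cap, the `charge_succeeds` oracle — is an illustrative assumption, not kernel code.

```c
enum { CHARGE_OK, CHARGE_ENOMEM };

/* Toy model of the patched retry loop: `charge_succeeds(attempt)` stands
 * in for the real page_counter charge, `oom_frees_memory` for whether the
 * memcg OOM killer can still make progress. */
static int try_charge_model(int task_is_dying, int oom_frees_memory,
                            int (*charge_succeeds)(int attempt))
{
    int passed_oom = 0;

    for (int attempt = 0; attempt < 16; attempt++) {
        if (charge_succeeds(attempt))
            return CHARGE_OK;
        /* the new rule: a dying task that already went through the OOM
         * stage fails instead of bypassing the hard limit */
        if (passed_oom && task_is_dying)
            return CHARGE_ENOMEM;
        if (!oom_frees_memory)
            return CHARGE_ENOMEM;
        passed_oom = 1;  /* the memcg OOM killer ran once */
    }
    return CHARGE_ENOMEM;
}
```

Note the old code would have hit the removed `goto force` in the first case below and over-charged the limit; the model returns ENOMEM instead, which is exactly the point of the patch.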
[parent not found: <4b315938-5600-b7f5-bde9-82f638a2e595-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg v2 2/2] memcg: prohibit unconditional exceeding the limit of dying tasks [not found] ` <4b315938-5600-b7f5-bde9-82f638a2e595-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-22 9:10 ` Michal Hocko [not found] ` <YXJ/63kIpTq8AOlD-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> [not found] ` <cover.1634994605.git.vvs@virtuozzo.com> 0 siblings, 2 replies; 66+ messages in thread From: Michal Hocko @ 2021-10-22 9:10 UTC (permalink / raw) To: Vasily Averin Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On Fri 22-10-21 11:11:29, Vasily Averin wrote: > Memory cgroup charging allows killed or exiting tasks to exceed the hard > limit. It is assumed that the amount of the memory charged by those > tasks is bound and most of the memory will get released while the task > is exiting. This is resembling a heuristic for the global OOM situation > when tasks get access to memory reserves. There is no global memory > shortage at the memcg level so the memcg heuristic is more relieved. > > The above assumption is overly optimistic though. E.g. vmalloc can scale > to really large requests and the heuristic would allow that. We used to > have an early break in the vmalloc allocator for killed tasks but this > has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when > the current task is killed""). There are likely other similar code paths > which do not check for fatal signals in an allocation&charge loop. > Also there are some kernel objects charged to a memcg which are not > bound to a process life time. > > It has been observed that it is not really hard to trigger these > bypasses and cause global OOM situation. 
>
> One potential way to address these runaways would be to limit the amount
> of excess (similar to the global OOM with limited oom reserves). This is
> certainly possible, but it is not really clear how much of an excess is
> desirable and still protects from global OOMs, as that would have to
> consider the overall memcg configuration.
>
> This patch is addressing the problem by removing the heuristic
> altogether. Bypass is only allowed for requests which either cannot fail
> or where the failure is not desirable while excess should still be
> limited (e.g. atomic requests). Implementation wise, a killed or dying
> task fails to charge if it has passed the OOM killer stage. That should
> give all forms of reclaim a chance to restore the limit before the
> failure (ENOMEM) and tell the caller to back off.
>
> In addition, this patch renames the should_force_charge() helper
> to task_is_dying() because now its use is not associated with forced
> charging.

I would explicitly mention that this depends on pagefault_out_of_memory
not triggering out_of_memory, because then a memcg failure can unwind to
VM_FAULT_OOM and cause a global OOM killer. Maybe it would be even
better to fold the removal into this patch so the dependency is more
obvious. I will leave that to you.

> Fixes: a636b327f731 ("memcg: avoid unnecessary system-wide-oom-killer")

A Fixes tag would be quite hard here. For example, you certainly didn't
have a practically unbound vector to go over the hard limit - like
vmalloc. At least not after __GFP_ACCOUNT has been introduced. So I
would just not bother with a Fixes tag at all rather than have it raise
more questions than answers.

> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Suggested-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>

Other than that looks good to me.
Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>

Thanks!
> --- > mm/memcontrol.c | 27 ++++++++------------------- > 1 file changed, 8 insertions(+), 19 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 6da5020a8656..87e41c3cac10 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -239,7 +239,7 @@ enum res_type { > iter != NULL; \ > iter = mem_cgroup_iter(NULL, iter, NULL)) > > -static inline bool should_force_charge(void) > +static inline bool task_is_dying(void) > { > return tsk_is_oom_victim(current) || fatal_signal_pending(current) || > (current->flags & PF_EXITING); > @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > * A few threads which were not waiting at mutex_lock_killable() can > * fail to bail out. Therefore, check again after holding oom_lock. > */ > - ret = should_force_charge() || out_of_memory(&oc); > + ret = task_is_dying() || out_of_memory(&oc); > > unlock: > mutex_unlock(&oom_lock); > @@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > struct page_counter *counter; > enum oom_status oom_status; > unsigned long nr_reclaimed; > + bool passed_oom = false; > bool may_swap = true; > bool drained = false; > unsigned long pflags; > @@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > if (gfp_mask & __GFP_ATOMIC) > goto force; > > - /* > - * Unlike in global OOM situations, memcg is not in a physical > - * memory shortage. Allow dying and OOM-killed tasks to > - * bypass the last charges so that they can exit quickly and > - * free their memory. > - */ > - if (unlikely(should_force_charge())) > - goto force; > - > /* > * Prevent unbounded recursion when reclaim operations need to > * allocate memory. 
This might exceed the limits temporarily, > @@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > if (gfp_mask & __GFP_RETRY_MAYFAIL) > goto nomem; > > - if (fatal_signal_pending(current)) > - goto force; > + /* Avoid endless loop for tasks bypassed by the oom killer */ > + if (passed_oom && task_is_dying()) > + goto nomem; > > /* > * keep retrying as long as the memcg oom killer is able to make > @@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > */ > oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask, > get_order(nr_pages * PAGE_SIZE)); > - switch (oom_status) { > - case OOM_SUCCESS: > + if (oom_status == OOM_SUCCESS) { > + passed_oom = true; > nr_retries = MAX_RECLAIM_RETRIES; > goto retry; > - case OOM_FAILED: > - goto force; > - default: > - goto nomem; > } > nomem: > if (!(gfp_mask & __GFP_NOFAIL)) > -- > 2.32.0 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <YXJ/63kIpTq8AOlD-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* [PATCH memcg v3 0/3] memcg: prohibit unconditional exceeding the limit of dying tasks
[not found] ` <YXJ/63kIpTq8AOlD-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2021-10-23 13:18 ` Vasily Averin
0 siblings, 0 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-23 13:18 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-GEFAQzZX7r8dnm+yROfE0A

Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. This can be misused to trigger a global OOM from inside a
memcg-limited container. On the other hand, if memcg fails an allocation
called from inside the #PF handler, it triggers a global OOM from inside
pagefault_out_of_memory().

To prevent these problems this patch set:
a) removes execution of out_of_memory() from pagefault_out_of_memory(),
because nobody can explain why it is necessary.
b) allows memcg to fail allocations of dying/killed tasks.

v3: resplit, improved patch descriptions
v2: resplit, use old patch from Michal Hocko removing out_of_memory()
from pagefault_out_of_memory()

Michal Hocko (1):
  mm, oom: do not trigger out_of_memory from the #PF

Vasily Averin (2):
  mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
  memcg: prohibit unconditional exceeding the limit of dying tasks

 mm/memcontrol.c | 27 ++++++++-------------------
 mm/oom_kill.c   | 23 ++++++++++-------------
 2 files changed, 18 insertions(+), 32 deletions(-)

-- 
2.32.0
[parent not found: <cover.1634994605.git.vvs@virtuozzo.com>]
[parent not found: <cover.1634994605.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
[not found] ` <cover.1634994605.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-10-23 13:19 ` Vasily Averin
[not found] ` <0828a149-786e-7c06-b70a-52d086818ea3-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-23 13:20 ` [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin
2021-10-23 13:20 ` [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
2 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-23 13:19 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-GEFAQzZX7r8dnm+yROfE0A

Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory(), which in its turn
executes out_of_memory() and can kill a random task.

An allocation might fail when the current task is the oom victim
and there are no memory reserves left. The OOM killer is already
handled at the page allocator level for the global OOM and at the
charging level for the memcg one. Both have much more information
about the scope of the allocation/charge request. This means that
either the OOM killer has been invoked properly and didn't lead
to the allocation success or it has been skipped because it couldn't
have been invoked. In both cases triggering it from here is pointless
and even harmful.

It makes much more sense to let the killed task die rather than to
wake up an eternally hungry oom-killer and send him to choose a fatter
victim for breakfast.
Suggested-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 mm/oom_kill.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..1deef8c7a71b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
+	if (fatal_signal_pending(current))
+		return;
+
 	if (!mutex_trylock(&oom_lock))
 		return;
 	out_of_memory(&oc);
-- 
2.32.0
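The resulting check order in pagefault_out_of_memory() — memcg OOM handling first, then an early bail-out for already-killed tasks, and only then the global OOM killer — can be written down as a pure decision function. The enum and helper below are an illustrative sketch, not kernel API; each branch is annotated with the real call it stands for.

```c
enum pf_action { PF_MEMCG_HANDLED, PF_BAIL_DYING, PF_GLOBAL_OOM };

/* Model of the patched pagefault_out_of_memory(): the memcg path is tried
 * first; a task with a fatal signal pending simply returns and is reaped
 * on its way back to userspace; only otherwise may the global OOM killer
 * run. */
static enum pf_action pf_oom_model(int memcg_oom_pending, int fatal_signal)
{
    if (memcg_oom_pending)
        return PF_MEMCG_HANDLED;  /* mem_cgroup_oom_synchronize(true) */
    if (fatal_signal)
        return PF_BAIL_DYING;     /* new fatal_signal_pending() bail-out */
    return PF_GLOBAL_OOM;         /* out_of_memory(&oc), still present at
                                   * this stage of the series */
}
```

The ordering matters: placing the dying-task check before the global path is exactly what stops an oom victim's failed allocation from electing a second, unrelated victim.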
[parent not found: <0828a149-786e-7c06-b70a-52d086818ea3-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
[not found] ` <0828a149-786e-7c06-b70a-52d086818ea3-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-10-25 9:27 ` Michal Hocko
0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-25 9:27 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-GEFAQzZX7r8dnm+yROfE0A

On Sat 23-10-21 16:19:28, Vasily Averin wrote:
> Any allocation failure during the #PF path will return with VM_FAULT_OOM
> which in turn results in pagefault_out_of_memory(), which in its turn
> executes out_of_memory() and can kill a random task.
>
> An allocation might fail when the current task is the oom victim
> and there are no memory reserves left. The OOM killer is already
> handled at the page allocator level for the global OOM and at the
> charging level for the memcg one. Both have much more information
> about the scope of the allocation/charge request. This means that
> either the OOM killer has been invoked properly and didn't lead
> to the allocation success or it has been skipped because it couldn't
> have been invoked. In both cases triggering it from here is pointless
> and even harmful.
>
> It makes much more sense to let the killed task die rather than to
> wake up an eternally hungry oom-killer and send him to choose a fatter
> victim for breakfast.
>
> Suggested-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>

Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>

Thanks!
> --- > mm/oom_kill.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 831340e7ad8b..1deef8c7a71b 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void) > if (mem_cgroup_oom_synchronize(true)) > return; > > + if (fatal_signal_pending(current)) > + return; > + > if (!mutex_trylock(&oom_lock)) > return; > out_of_memory(&oc); > -- > 2.32.0 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF [not found] ` <cover.1634994605.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 2021-10-23 13:19 ` [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for " Vasily Averin @ 2021-10-23 13:20 ` Vasily Averin [not found] ` <f5fd8dd8-0ad4-c524-5f65-920b01972a42-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 2021-10-23 13:20 ` [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin 2 siblings, 1 reply; 66+ messages in thread From: Vasily Averin @ 2021-10-23 13:20 UTC (permalink / raw) To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A From: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restrictions and therefore might oom kill a task from a different NUMA domain. 
This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit the limit of the cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> --- v2: hook with fatal_signal_pending() has been moved into a separate patch, improved patch description, removed "Fixed" mark. --- mm/oom_kill.c | 22 ++++++++-------------- 1 file changed, 8 insertions(+), 14 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 1deef8c7a71b..f98954befafb 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1120,19 +1120,15 @@ bool out_of_memory(struct oom_control *oc) } /* - * The pagefault handler calls here because it is out of memory, so kill a - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom - * killing is already in progress so do nothing. + * The pagefault handler calls here because some allocation has failed. 
We have + * to take care of the memcg OOM here because this is the only safe context without + * any locks held but let the oom killer triggered from the allocation context care + * about the global OOM. */ void pagefault_out_of_memory(void) { - struct oom_control oc = { - .zonelist = NULL, - .nodemask = NULL, - .memcg = NULL, - .gfp_mask = 0, - .order = 0, - }; + static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); if (mem_cgroup_oom_synchronize(true)) return; @@ -1140,10 +1136,8 @@ void pagefault_out_of_memory(void) if (fatal_signal_pending(current)) return; - if (!mutex_trylock(&oom_lock)) - return; - out_of_memory(&oc); - mutex_unlock(&oom_lock); + if (__ratelimit(&pfoom_rs)) + pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n"); } SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) -- 2.32.0 ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF [not found] ` <f5fd8dd8-0ad4-c524-5f65-920b01972a42-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-23 15:01 ` Tetsuo Handa 2021-10-23 19:15 ` Vasily Averin [not found] ` <e2a847a2-a414-2535-e3d1-b100a023b9d1-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org> 2021-10-25 9:34 ` Michal Hocko 1 sibling, 2 replies; 66+ messages in thread From: Tetsuo Handa @ 2021-10-23 15:01 UTC (permalink / raw) To: Vasily Averin, Michal Hocko Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A, Johannes Weiner, Vladimir Davydov, Andrew Morton On 2021/10/23 22:20, Vasily Averin wrote: > /* > - * The pagefault handler calls here because it is out of memory, so kill a > - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom > - * killing is already in progress so do nothing. > + * The pagefault handler calls here because some allocation has failed. We have > + * to take care of the memcg OOM here because this is the only safe context without > + * any locks held but let the oom killer triggered from the allocation context care > + * about the global OOM. > */ Excuse me for a stupid question. I consider if (!mutex_trylock(&oom_lock)) return; out_of_memory(&oc); mutex_unlock(&oom_lock); here as the last resort (safeguard) when neither __alloc_pages_may_oom() nor mem_cgroup_out_of_memory() can make progress. This patch says let the oom killer triggered from the allocation context care about the global OOM. but what if the OOM killer cannot be invoked from the allocation context? Is there a guarantee that all memory allocations which might result in VM_FAULT_OOM can invoke the OOM killer? ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF 2021-10-23 15:01 ` Tetsuo Handa @ 2021-10-23 19:15 ` Vasily Averin [not found] ` <e2a847a2-a414-2535-e3d1-b100a023b9d1-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org> 1 sibling, 0 replies; 66+ messages in thread From: Vasily Averin @ 2021-10-23 19:15 UTC (permalink / raw) To: Tetsuo Handa, Michal Hocko Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups, linux-mm, linux-kernel, kernel, Johannes Weiner, Vladimir Davydov, Andrew Morton On 23.10.2021 18:01, Tetsuo Handa wrote: > On 2021/10/23 22:20, Vasily Averin wrote: >> /* >> - * The pagefault handler calls here because it is out of memory, so kill a >> - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom >> - * killing is already in progress so do nothing. >> + * The pagefault handler calls here because some allocation has failed. We have >> + * to take care of the memcg OOM here because this is the only safe context without >> + * any locks held but let the oom killer triggered from the allocation context care >> + * about the global OOM. >> */ > > Excuse me for a stupid question. I consider > > if (!mutex_trylock(&oom_lock)) > return; > out_of_memory(&oc); > mutex_unlock(&oom_lock); > > here as the last resort (safeguard) when neither __alloc_pages_may_oom() > nor mem_cgroup_out_of_memory() can make progress. This patch says > > let the oom killer triggered from the allocation context care > about the global OOM. > > but what if the OOM killer cannot be invoked from the allocation context? > Is there a guarantee that all memory allocations which might result in > VM_FAULT_OOM can invoke the OOM killer? 
I don't think this question is stupid, since I asked it myself :) You can find this discussion here: https://lkml.org/lkml/2021/10/21/900 Let me quote it here too :> 1) VM_FAULT_OOM may be triggered w/o execution of out_of_memory() :> for example it can be caused by incorrect vm fault operations, :> (a) which can return this error without calling allocator at all. : :I would argue this to be a bug. How can that particular code tell :whether the system is OOM and the oom killer is a reasonable measure :to take? : :> (b) or which can provide incorrect gfp flags and allocator can fail without execution of out_of_memory. : : I am not sure I can see any sensible scenario where pagefault oom killer : would be an appropriate fix for that. : :> (c) This may happen on stable/LTS kernels when an otherwise successful allocation fails by hitting the limit of the legacy memcg-kmem controller. :> We'll drop it in upstream kernels, however how to handle it in old kernels? : :Triggering the global oom killer for legacy kmem charge failure is :clearly wrong. Removing oom killer from #PF would fix that problem. I would note: (c) is not a theoretical but a real-life problem; in this case the allocation failed without invoking the OOM killer, however, it is in this case that executing out_of_memory() from pagefault_out_of_memory() leads to a disaster. Thank you, Vasily Averin ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF [not found] ` <e2a847a2-a414-2535-e3d1-b100a023b9d1-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org> @ 2021-10-25 8:04 ` Michal Hocko [not found] ` <YXZk9Lr217e+saSM-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2021-10-25 8:04 UTC (permalink / raw) To: Tetsuo Handa Cc: Vasily Averin, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A, Johannes Weiner, Vladimir Davydov, Andrew Morton On Sun 24-10-21 00:01:07, Tetsuo Handa wrote: > On 2021/10/23 22:20, Vasily Averin wrote: > > /* > > - * The pagefault handler calls here because it is out of memory, so kill a > > - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom > > - * killing is already in progress so do nothing. > > + * The pagefault handler calls here because some allocation has failed. We have > > + * to take care of the memcg OOM here because this is the only safe context without > > + * any locks held but let the oom killer triggered from the allocation context care > > + * about the global OOM. > > */ > > Excuse me for a stupid question. I consider > > if (!mutex_trylock(&oom_lock)) > return; > out_of_memory(&oc); > mutex_unlock(&oom_lock); > > here as the last resort (safeguard) when neither __alloc_pages_may_oom() > nor mem_cgroup_out_of_memory() can make progress. This patch says > > let the oom killer triggered from the allocation context care > about the global OOM. > > but what if the OOM killer cannot be invoked from the allocation context? > Is there a guarantee that all memory allocations which might result in > VM_FAULT_OOM can invoke the OOM killer? I do not think there is any guarantee. This code has meant to be a safeguard but it turns out to be adding more harm than a safety. 
There are several scenarios mentioned in this thread where this would be counter productive or outright wrong thing to do. On the other hand it is hard to imagine any legitimate situation where this would be a right thing to do. Maybe you have something more specific in mind? What would be the legit code to rely on OOM handling out of the line (where the details about the allocation scope is lost)? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF [not found] ` <YXZk9Lr217e+saSM-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2021-10-26 13:56 ` Tetsuo Handa [not found] ` <62a326bc-37d2-b8c9-ddbf-7adaeaadf341-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Tetsuo Handa @ 2021-10-26 13:56 UTC (permalink / raw) To: Michal Hocko Cc: Vasily Averin, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A, Johannes Weiner, Vladimir Davydov, Andrew Morton On 2021/10/25 17:04, Michal Hocko wrote: > I do not think there is any guarantee. This code has meant to be a > safeguard but it turns out to be adding more harm than a safety. There > are several scenarios mentioned in this thread where this would be > counter productive or outright wrong thing to do. Setting PR_IO_FLUSHER via prctl(PR_SET_IO_FLUSHER) + hitting legacy kmem charge limit might be an unexpected combination? > > On the other hand it is hard to imagine any legitimate situation where > this would be a right thing to do. Maybe you have something more > specific in mind? What would be the legit code to rely on OOM handling > out of the line (where the details about the allocation scope is lost)? I don't have specific scenario, but I feel that it might be a chance to retry killable vmalloc(). Commit b8c8a338f75e ("Revert "vmalloc: back off when the current task is killed"") was 4.5 years ago, and fuzz testing found many bugs triggered by memory allocation fault injection. Thus, I think that the direction is going towards "we can fail memory allocation upon SIGKILL (rather than worrying about depleting memory reserves and/or escalating to global OOM killer invocations)". Most memory allocation requests which allocate memory for userspace process are willing to give up upon SIGKILL. 
Like you are trying to add NOFS, NOIO, NOFAIL support to vmalloc(), you could consider KILLABLE support as well. Of course, direct reclaim makes it difficult to immediately give up upon SIGKILL, but killable allocation still sounds nice even on a best-effort basis. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF [not found] ` <62a326bc-37d2-b8c9-ddbf-7adaeaadf341-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org> @ 2021-10-26 14:07 ` Michal Hocko 0 siblings, 0 replies; 66+ messages in thread From: Michal Hocko @ 2021-10-26 14:07 UTC (permalink / raw) To: Tetsuo Handa Cc: Vasily Averin, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A, Johannes Weiner, Vladimir Davydov, Andrew Morton On Tue 26-10-21 22:56:44, Tetsuo Handa wrote: > On 2021/10/25 17:04, Michal Hocko wrote: > > I do not think there is any guarantee. This code has meant to be a > > safeguard but it turns out to be adding more harm than a safety. There > > are several scenarios mentioned in this thread where this would be > > counter productive or outright wrong thing to do. > > Setting PR_IO_FLUSHER via prctl(PR_SET_IO_FLUSHER) + hitting legacy kmem > charge limit might be an unexpected combination? I am not sure I follow or why PR_SET_IO_FLUSHER should be relevant. But triggering the global OOM killer on kmem charge limit failure is certainly not the right thing to do. Quite the opposite, because this would be effectively a global DoS as a result of a local memory constraint. 
Thus, I think that > the direction is going towards "we can fail memory allocation upon SIGKILL > (rather than worrying about depleting memory reserves and/or escalating to > global OOM killer invocations)". Most memory allocation requests which > allocate memory for userspace process are willing to give up upon SIGKILL. > > Like you are trying to add NOFS, NOIO, NOFAIL support to vmalloc(), you could > consider KILLABLE support as well. Of course, direct reclaim makes it difficult > to immediately give up upon SIGKILL, but killable allocation still sounds nice > even on a best-effort basis. This is all fine but I am not sure how this is related to this patch. The previous patch already gives up in pagefault_out_of_memory on fatal signal pending. So this code is not really reachable. Also allowing more allocations to fail doesn't really suggest that we should trigger OOM killer from #PF. I would argue that the opposite is the case actually. Or I just haven't understood your concern? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF [not found] ` <f5fd8dd8-0ad4-c524-5f65-920b01972a42-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 2021-10-23 15:01 ` Tetsuo Handa @ 2021-10-25 9:34 ` Michal Hocko 1 sibling, 0 replies; 66+ messages in thread From: Michal Hocko @ 2021-10-25 9:34 UTC (permalink / raw) To: Vasily Averin Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On Sat 23-10-21 16:20:18, Vasily Averin wrote: > From: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> > > Any allocation failure during the #PF path will return with VM_FAULT_OOM > which in turn results in pagefault_out_of_memory. This can happen for > 2 different reasons. a) Memcg is out of memory and we rely on > mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) > normal allocation fails. > > The latter is quite problematic because allocation paths already trigger > out_of_memory and the page allocator tries really hard to not fail > allocations. Anyway, if the OOM killer has been already invoked there > is no reason to invoke it again from the #PF path. Especially when the > OOM condition might be gone by that time and we have no way to find out > other than allocate. > > Moreover if the allocation failed and the OOM killer hasn't been > invoked then we are unlikely to do the right thing from the #PF context > because we have already lost the allocation context and restrictions and > therefore might oom kill a task from a different NUMA domain. > > This all suggests that there is no legitimate reason to trigger > out_of_memory from pagefault_out_of_memory so drop it. Just to be sure > that no #PF path returns with VM_FAULT_OOM without allocation print a > warning that this is happening before we restart the #PF. 
> > [VvS: #PF allocation can hit the limit of the cgroup v1 kmem controller. > This is a local problem related to memcg, however, it causes unnecessary > global OOM kills that are repeated over and over again and escalate into > a real disaster. This has been broken since kmem accounting has been > introduced for cgroup v1 (3.8). There was no kmem specific reclaim > for the separate limit so the only way to handle kmem hard limit > was to return with ENOMEM. > In upstream the problem will be fixed by removing the outdated kmem limit, > however stable and LTS kernels cannot do it and are still affected. > This patch fixes the problem and should be backported into stable/LTS.] > > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org I would still be careful about backporting to stable trees. At least wait for a release cycle to catch potential problems before backporting. The problem with kmem has been documented for quite a long time and we haven't received a single bug report IIRC, so this is likely not a real problem people are facing out there. > Signed-off-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> > Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> I think this is the right thing to do. Hopefully we won't find some tricky code which would depend on this behavior. If this turns out to be the case then we will clearly learn about it by the kernel message and we can handle that situation more gracefully. Maybe we will want to update the changelog to be more specific based on the review comments but this one should describe the problem quite well already. Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> Thanks! 
> --- > mm/oom_kill.c | 22 ++++++++-------------- > 1 file changed, 8 insertions(+), 14 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 1deef8c7a71b..f98954befafb 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -1120,19 +1120,15 @@ bool out_of_memory(struct oom_control *oc) > } > > /* > - * The pagefault handler calls here because it is out of memory, so kill a > - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom > - * killing is already in progress so do nothing. > + * The pagefault handler calls here because some allocation has failed. We have > + * to take care of the memcg OOM here because this is the only safe context without > + * any locks held but let the oom killer triggered from the allocation context care > + * about the global OOM. > */ > void pagefault_out_of_memory(void) > { > - struct oom_control oc = { > - .zonelist = NULL, > - .nodemask = NULL, > - .memcg = NULL, > - .gfp_mask = 0, > - .order = 0, > - }; > + static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL, > + DEFAULT_RATELIMIT_BURST); > > if (mem_cgroup_oom_synchronize(true)) > return; > @@ -1140,10 +1136,8 @@ void pagefault_out_of_memory(void) > if (fatal_signal_pending(current)) > return; > > - if (!mutex_trylock(&oom_lock)) > - return; > - out_of_memory(&oc); > - mutex_unlock(&oom_lock); > + if (__ratelimit(&pfoom_rs)) > + pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n"); > } > > SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) > -- > 2.32.0 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks [not found] ` <cover.1634994605.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 2021-10-23 13:19 ` [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for " Vasily Averin 2021-10-23 13:20 ` [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin @ 2021-10-23 13:20 ` Vasily Averin [not found] ` <8f5cebbb-06da-4902-91f0-6566fc4b4203-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 2 siblings, 1 reply; 66+ messages in thread From: Vasily Averin @ 2021-10-23 13:20 UTC (permalink / raw) To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A Memory cgroup charging allows killed or exiting tasks to exceed the hard limit. It is assumed that the amount of the memory charged by those tasks is bound and most of the memory will get released while the task is exiting. This is resembling a heuristic for the global OOM situation when tasks get access to memory reserves. There is no global memory shortage at the memcg level so the memcg heuristic is more relieved. The above assumption is overly optimistic though. E.g. vmalloc can scale to really large requests and the heuristic would allow that. We used to have an early break in the vmalloc allocator for killed tasks but this has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when the current task is killed""). There are likely other similar code paths which do not check for fatal signals in an allocation&charge loop. Also there are some kernel objects charged to a memcg which are not bound to a process life time. It has been observed that it is not really hard to trigger these bypasses and cause global OOM situation. 
One potential way to address these runaways would be to limit the amount of excess (similar to the global OOM with limited oom reserves). This is certainly possible but it is not really clear how much of an excess is desirable and still protects from global OOMs as that would have to consider the overall memcg configuration. This patch is addressing the problem by removing the heuristic altogether. Bypass is only allowed for requests which either cannot fail or where the failure is not desirable while excess should be still limited (e.g. atomic requests). Implementation-wise a killed or dying task fails to charge if it has passed the OOM killer stage. That should give all forms of reclaim chance to restore the limit before the failure (ENOMEM) and tell the caller to back off. In addition, this patch renames should_force_charge() helper to task_is_dying() because now its use is not associated with forced charging. This patch depends on pagefault_out_of_memory() to not trigger out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM and trigger the global OOM killer. 
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Suggested-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> --- mm/memcontrol.c | 27 ++++++++------------------- 1 file changed, 8 insertions(+), 19 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6da5020a8656..87e41c3cac10 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -239,7 +239,7 @@ enum res_type { iter != NULL; \ iter = mem_cgroup_iter(NULL, iter, NULL)) -static inline bool should_force_charge(void) +static inline bool task_is_dying(void) { return tsk_is_oom_victim(current) || fatal_signal_pending(current) || (current->flags & PF_EXITING); @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, * A few threads which were not waiting at mutex_lock_killable() can * fail to bail out. Therefore, check again after holding oom_lock. */ - ret = should_force_charge() || out_of_memory(&oc); + ret = task_is_dying() || out_of_memory(&oc); unlock: mutex_unlock(&oom_lock); @@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, struct page_counter *counter; enum oom_status oom_status; unsigned long nr_reclaimed; + bool passed_oom = false; bool may_swap = true; bool drained = false; unsigned long pflags; @@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, if (gfp_mask & __GFP_ATOMIC) goto force; - /* - * Unlike in global OOM situations, memcg is not in a physical - * memory shortage. Allow dying and OOM-killed tasks to - * bypass the last charges so that they can exit quickly and - * free their memory. - */ - if (unlikely(should_force_charge())) - goto force; - /* * Prevent unbounded recursion when reclaim operations need to * allocate memory. 
This might exceed the limits temporarily, @@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, if (gfp_mask & __GFP_RETRY_MAYFAIL) goto nomem; - if (fatal_signal_pending(current)) - goto force; + /* Avoid endless loop for tasks bypassed by the oom killer */ + if (passed_oom && task_is_dying()) + goto nomem; /* * keep retrying as long as the memcg oom killer is able to make @@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, */ oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages * PAGE_SIZE)); - switch (oom_status) { - case OOM_SUCCESS: + if (oom_status == OOM_SUCCESS) { + passed_oom = true; nr_retries = MAX_RECLAIM_RETRIES; goto retry; - case OOM_FAILED: - goto force; - default: - goto nomem; } nomem: if (!(gfp_mask & __GFP_NOFAIL)) -- 2.32.0 ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks [not found] ` <8f5cebbb-06da-4902-91f0-6566fc4b4203-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-25 9:36 ` Michal Hocko [not found] ` <YXZ6qaMJBomVfV8O-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2021-10-25 9:36 UTC (permalink / raw) To: Vasily Averin Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On Sat 23-10-21 16:20:51, Vasily Averin wrote: > Memory cgroup charging allows killed or exiting tasks to exceed the hard > limit. It is assumed that the amount of the memory charged by those > tasks is bound and most of the memory will get released while the task > is exiting. This is resembling a heuristic for the global OOM situation > when tasks get access to memory reserves. There is no global memory > shortage at the memcg level so the memcg heuristic is more relieved. > > The above assumption is overly optimistic though. E.g. vmalloc can scale > to really large requests and the heuristic would allow that. We used to > have an early break in the vmalloc allocator for killed tasks but this > has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when > the current task is killed""). There are likely other similar code paths > which do not check for fatal signals in an allocation&charge loop. > Also there are some kernel objects charged to a memcg which are not > bound to a process life time. > > It has been observed that it is not really hard to trigger these > bypasses and cause global OOM situation. > > One potential way to address these runaways would be to limit the amount > of excess (similar to the global OOM with limited oom reserves). 
This is > certainly possible but it is not really clear how much of an excess is > desirable and still protects from global OOMs as that would have to > consider the overall memcg configuration. > > This patch is addressing the problem by removing the heuristic > altogether. Bypass is only allowed for requests which either cannot fail > or where the failure is not desirable while excess should be still > limited (e.g. atomic requests). Implementation-wise a killed or dying > task fails to charge if it has passed the OOM killer stage. That should > give all forms of reclaim chance to restore the limit before the > failure (ENOMEM) and tell the caller to back off. > > In addition, this patch renames should_force_charge() helper > to task_is_dying() because now its use is not associated with forced > charging. > > This patch depends on pagefault_out_of_memory() to not trigger > out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM > and trigger the global OOM killer. > > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org My view on stable backport is similar to the previous patch. If we want to have it there then let's wait for some time to see whether there are any fallouts as this patch depends on the PF_OOM change. 
> --- > mm/memcontrol.c | 27 ++++++++------------------- > 1 file changed, 8 insertions(+), 19 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 6da5020a8656..87e41c3cac10 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -239,7 +239,7 @@ enum res_type { > iter != NULL; \ > iter = mem_cgroup_iter(NULL, iter, NULL)) > > -static inline bool should_force_charge(void) > +static inline bool task_is_dying(void) > { > return tsk_is_oom_victim(current) || fatal_signal_pending(current) || > (current->flags & PF_EXITING); > @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > * A few threads which were not waiting at mutex_lock_killable() can > * fail to bail out. Therefore, check again after holding oom_lock. > */ > - ret = should_force_charge() || out_of_memory(&oc); > + ret = task_is_dying() || out_of_memory(&oc); > > unlock: > mutex_unlock(&oom_lock); > @@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > struct page_counter *counter; > enum oom_status oom_status; > unsigned long nr_reclaimed; > + bool passed_oom = false; > bool may_swap = true; > bool drained = false; > unsigned long pflags; > @@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > if (gfp_mask & __GFP_ATOMIC) > goto force; > > - /* > - * Unlike in global OOM situations, memcg is not in a physical > - * memory shortage. Allow dying and OOM-killed tasks to > - * bypass the last charges so that they can exit quickly and > - * free their memory. > - */ > - if (unlikely(should_force_charge())) > - goto force; > - > /* > * Prevent unbounded recursion when reclaim operations need to > * allocate memory. 
This might exceed the limits temporarily, > @@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > if (gfp_mask & __GFP_RETRY_MAYFAIL) > goto nomem; > > - if (fatal_signal_pending(current)) > - goto force; > + /* Avoid endless loop for tasks bypassed by the oom killer */ > + if (passed_oom && task_is_dying()) > + goto nomem; > > /* > * keep retrying as long as the memcg oom killer is able to make > @@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > */ > oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask, > get_order(nr_pages * PAGE_SIZE)); > - switch (oom_status) { > - case OOM_SUCCESS: > + if (oom_status == OOM_SUCCESS) { > + passed_oom = true; > nr_retries = MAX_RECLAIM_RETRIES; > goto retry; > - case OOM_FAILED: > - goto force; > - default: > - goto nomem; > } > nomem: > if (!(gfp_mask & __GFP_NOFAIL)) > -- > 2.32.0 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
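For readers skimming the diff, the retry policy it converges on can be condensed into a small user-space model. All state below is a hypothetical stand-in; `try_charge()` models only the decision logic of `try_charge_memcg()`, not the real kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum oom_status { OOM_SUCCESS, OOM_FAILED };

/* Simulation knobs -- hypothetical stand-ins for kernel state. */
static bool dying;             /* fatal signal pending / OOM victim / PF_EXITING */
static int reclaimable_tries;  /* how many more retries until the charge fits    */

static bool task_is_dying(void)
{
    return dying;
}

static enum oom_status mem_cgroup_oom(void)
{
    /* Model: the memcg OOM killer picks a victim; assume the charging
     * task itself is chosen, so it is "dying" from now on. */
    dying = true;
    return OOM_SUCCESS;
}

/* Sketch of the post-patch retry policy: a dying task that has already
 * been through the OOM killer stage gets -ENOMEM instead of being
 * allowed to exceed the limit without bound. */
static int try_charge(void)
{
    bool passed_oom = false;

    for (;;) {
        if (reclaimable_tries-- > 0)
            return 0;                 /* reclaim made the charge fit */
        /* Pre-patch: fatal_signal_pending() -> force the charge.
         * Post-patch: fail once the OOM stage has been passed. */
        if (passed_oom && task_is_dying())
            return -ENOMEM;
        if (mem_cgroup_oom() == OOM_SUCCESS) {
            passed_oom = true;
            continue;                 /* one more round of retries */
        }
        return -ENOMEM;               /* OOM_FAILED: back off */
    }
}
```

The point of the model: the charge loop is now bounded — the task goes through the OOM stage at most once, and a dying task then sees ENOMEM rather than a forced bypass of the limit.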
[parent not found: <YXZ6qaMJBomVfV8O-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks [not found] ` <YXZ6qaMJBomVfV8O-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2021-10-27 22:36 ` Andrew Morton [not found] ` <20211027153608.9910f7db99d5ef574045370e-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Andrew Morton @ 2021-10-27 22:36 UTC (permalink / raw) To: Michal Hocko Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A, Greg Kroah-Hartman On Mon, 25 Oct 2021 11:36:41 +0200 Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > My view on stable backport is similar to the previous patch. If we want > to have it there then let's wait for some time to see whether there are > any fallouts as this patch depends on the PF_OOM change. It's strange that [1/3] doesn't have cc:stable, but [2/3] and [3/3] do not. What is the thinking here? I expect we'd be OK with merging these into 5.16-rc1. This still gives another couple of months in -rc to shake out any problems. But I suspect the -stable maintainers will merge and release the patches before they are released in 5.16. In which case an alternative would be not to mark these patches cc:stable and to somehow remember to ask the -stable maintainers to merge them after 5.16 has been on the streets for a suitable period. Greg, thoughts? ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <20211027153608.9910f7db99d5ef574045370e-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]
* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks [not found] ` <20211027153608.9910f7db99d5ef574045370e-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> @ 2021-10-28 7:22 ` Vasily Averin [not found] ` <ea14200f-ad2c-6901-25da-54900fe2ce14-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 2021-10-29 7:58 ` Michal Hocko 1 sibling, 1 reply; 66+ messages in thread From: Vasily Averin @ 2021-10-28 7:22 UTC (permalink / raw) To: Andrew Morton, Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A, Greg Kroah-Hartman On 28.10.2021 01:36, Andrew Morton wrote: > On Mon, 25 Oct 2021 11:36:41 +0200 Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > >> My view on stable backport is similar to the previous patch. If we want >> to have it there then let's wait for some time to see whether there are >> any fallouts as this patch depends on the PF_OOM change. > > It's strange that [1/3] doesn't have cc:stable, but [2/3] and [3/3] do > not. What is the thinking here? My fault, I missed it. All 3 patches should be backported; I have already backported them to stable kernels since 4.4 and I'm ready to submit them on demand. > I expect we'd be OK with merging these into 5.16-rc1. This still gives > another couple of months in -rc to shake out any problems. But I > suspect the -stable maintainers will merge and release the patches > before they are released in 5.16. > > In which case an alternative would be not to mark these patches > cc:stable and to somehow remember to ask the -stable maintainers to > merge them after 5.16 has been on the streets for a suitable period. > > Greg, thoughts? If you wish, I can remind Greg in a month or even after the 5.17 release. Thank you, Vasily Averin ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <ea14200f-ad2c-6901-25da-54900fe2ce14-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks [not found] ` <ea14200f-ad2c-6901-25da-54900fe2ce14-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-29 7:46 ` Greg Kroah-Hartman 0 siblings, 0 replies; 66+ messages in thread From: Greg Kroah-Hartman @ 2021-10-29 7:46 UTC (permalink / raw) To: Vasily Averin Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Vladimir Davydov, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On Thu, Oct 28, 2021 at 10:22:56AM +0300, Vasily Averin wrote: > On 28.10.2021 01:36, Andrew Morton wrote: > > On Mon, 25 Oct 2021 11:36:41 +0200 Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > > >> My view on stable backport is similar to the previous patch. If we want > >> to have it there then let's wait for some time to see whether there are > >> any fallouts as this patch depends on the PF_OOM change. > > > > It's strange that [1/3] doesn't have cc:stable, but [2/3] and [3/3] do > > not. What is the thinking here? > > My fault, I missed it. > All 3 patches should be backported, > I did it already to stables kernels since 4.4 and I'm ready to submit it in demand. > > > I expect we'd be OK with merging these into 5.16-rc1. This still gives > > another couple of months in -rc to shake out any problems. But I > > suspect the -stable maintainers will merge and release the patches > > before they are released in 5.16. > > > > In which case an alternative would be not to mark these patches > > cc:stable and to somehow remember to ask the -stable maintainers to > > merge them after 5.16 has been on the streets for a suitable period. > > > > Greg, thoughts? > > If you wish I can remind Greg in a month or even after 5.17 release. 
Please remind us then, otherwise I will not remember :) thanks, greg k-h ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks [not found] ` <20211027153608.9910f7db99d5ef574045370e-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> 2021-10-28 7:22 ` Vasily Averin @ 2021-10-29 7:58 ` Michal Hocko 1 sibling, 0 replies; 66+ messages in thread From: Michal Hocko @ 2021-10-29 7:58 UTC (permalink / raw) To: Andrew Morton Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A, Greg Kroah-Hartman On Wed 27-10-21 15:36:08, Andrew Morton wrote: > On Mon, 25 Oct 2021 11:36:41 +0200 Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> wrote: > > > My view on stable backport is similar to the previous patch. If we want > > to have it there then let's wait for some time to see whether there are > > any fallouts as this patch depends on the PF_OOM change. > > It's strange that [1/3] doesn't have cc:stable, but [2/3] and [3/3] do > not. What is the thinking here? > > I expect we'd be OK with merging these into 5.16-rc1. This still gives > another couple of months in -rc to shake out any problems. But I > suspect the -stable maintainers will merge and release the patches > before they are released in 5.16. > > In which case an alternative would be not to mark these patches > cc:stable and to somehow remember to ask the -stable maintainers to > merge them after 5.16 has been on the streets for a suitable period. My take on stable backports is http://lkml.kernel.org/r/YXZ6FMzJLEz4TA2d-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <YW04jWSv6pQb2Goe-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2021-10-18 10:05 ` Vasily Averin @ 2021-10-21 8:03 ` Vasily Averin [not found] ` <496ed57e-61c6-023a-05fd-4ef21b0294cf-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> 1 sibling, 1 reply; 66+ messages in thread From: Vasily Averin @ 2021-10-21 8:03 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On 18.10.2021 12:04, Michal Hocko wrote: > On Mon 18-10-21 11:13:52, Vasily Averin wrote: > [...] >> How could this happen? >> >> User-space task inside the memcg-limited container generated a page fault, >> its handler do_user_addr_fault() called handle_mm_fault which could not >> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM. >> Then do_user_addr_fault() called pagefault_out_of_memory() which executed >> out_of_memory() without set of memcg. > I will be honest that I am not really happy about pagefault_out_of_memory. > I have tried to remove it in the past. Without much success back then, > unfortunately[1]. > > [1] I do not have msg-id so I cannot provide a lore link but google pointed me to https://www.mail-archive.com/linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg1400402.html I re-read this discussion and in general I support your position. As far as I understand, your opponents cannot explain why the "random kill" is mandatory here; they are just afraid that it might be useful and do not want to remove it completely. OK, let's let it stay; moreover, I'm ready to keep it as the default behavior. However, I would like to have some choice at this point. In general we can: - continue to use the "random kill" and rely on the wisdom of the ancestors. 
- do nothing, repeat the #PF and rely on fate: "nothing bad will happen if we do it again". - add some (progressive) killable delay, rely on the good will of (unkillable) neighbors and wait for them to release the required memory. - mark the current task as cycled in the #PF and somehow use this mark in the allocator - make sure that the current task is really cycling and making no progress, then send it a fatal signal to kill it and break the cycle. - implement any better ideas, - use any combination of the previous points We can select the required strategy, for example, via sysctl. For me the "random kill" is the worst choice. Why can't we just kill the looping process instead? It can be marked as oom-unkillable, so that the OOM killer is unable to select it. However, I doubt that means "never kill it"; for me it is something like a "last possible victim" priority. Thank you, Vasily Averin ^ permalink raw reply [flat|nested] 66+ messages in thread
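The failing call chain quoted at the top of this message (do_user_addr_fault() → handle_mm_fault() → VM_FAULT_OOM → pagefault_out_of_memory() → out_of_memory()) can be modeled in a few lines. The function names come from the kernel; the bodies are hypothetical one-line stand-ins meant only to show why a purely memcg-local failure escalates into a global OOM.

```c
#include <assert.h>
#include <stdbool.h>

#define VM_FAULT_OOM 0x1

/* Hypothetical stand-ins for kernel state. */
static bool memcg_over_limit;      /* the container hit its memory limit     */
static bool global_oom_triggered;  /* out_of_memory() ran without any memcg  */

static int handle_mm_fault(void)
{
    /* The page charge fails against the memcg limit, so the fault
     * handler reports VM_FAULT_OOM -- even though the host itself
     * still has gigabytes of free memory. */
    return memcg_over_limit ? VM_FAULT_OOM : 0;
}

static void pagefault_out_of_memory(void)
{
    /* Pre-fix behavior: no memcg context is available here, so the
     * *global* OOM killer is invoked (and, with vm.panic_on_oom=1,
     * the node panics). */
    global_oom_triggered = true;
}

static void do_user_addr_fault(void)
{
    if (handle_mm_fault() & VM_FAULT_OOM)
        pagefault_out_of_memory();
}
```

Running the model with `memcg_over_limit` set reproduces the reported symptom in miniature: a global OOM decision is reached although nothing global is wrong.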
[parent not found: <496ed57e-61c6-023a-05fd-4ef21b0294cf-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>]
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task [not found] ` <496ed57e-61c6-023a-05fd-4ef21b0294cf-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org> @ 2021-10-21 11:49 ` Michal Hocko 2021-10-21 13:24 ` Vasily Averin 0 siblings, 1 reply; 66+ messages in thread From: Michal Hocko @ 2021-10-21 11:49 UTC (permalink / raw) To: Vasily Averin Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On Thu 21-10-21 11:03:43, Vasily Averin wrote: > On 18.10.2021 12:04, Michal Hocko wrote: > > On Mon 18-10-21 11:13:52, Vasily Averin wrote: > > [...] > >> How could this happen? > >> > >> User-space task inside the memcg-limited container generated a page fault, > >> its handler do_user_addr_fault() called handle_mm_fault which could not > >> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM. > >> Then do_user_addr_fault() called pagefault_out_of_memory() which executed > >> out_of_memory() without set of memcg. > > > I will be honest that I am not really happy about pagefault_out_of_memory. > > I have tried to remove it in the past. Without much success back then, > > unfortunately[1]. > > > > [1] I do not have msg-id so I cannot provide a lore link but google > > pointed me to https://www.mail-archive.com/linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg1400402.html > > I re-read this discussion and in general I support your position. > As far as I understand your opponents cannot explain why "random kill" is mandatory here, > they are just afraid that it might be useful here and do not want to remove it completely. That aligns with my recollection. > Ok, let's allow him to do it. Moreover I'm ready to keep it as default behavior. > > However I would like to have some choice in this point. 
> > In general we can: > - continue to use "random kill" and rely on the wisdom of the ancestors. I do not follow. Does that mean to preserve existing oom killer from #PF? > - do nothing, repeat #PF and rely on fate: "nothing bad will happen if we do it again". > - add some (progressive) killable delay, rely on good will of (unkillable) neighbors and wait for them to release required memory. Again, not really sure what you mean > - mark the current task as cycled in #PF and somehow use this mark in allocator How? > - make sure that the current task is really cycled, have no progress, send him fatal signal to kill it and break the cycle. No! We cannot really kill the task; if we could, we would have done it via the oom killer already > - implement any better ideas, > - use any combination of previous points > > We can select required strategy for example via sysctl. Absolutely no! How can admin know any better than the kernel? > For me "random kill" is worst choice, > Why can't we just kill the looped process instead? See above. > It can be marked as oom-unkillable, so OOM-killer was unable to select it. > However I doubt it means "never kill it", for me it is something like "last possible victim" priority. It means never kill it because of OOM. If it is retrying because of OOM then it is effectively the same thing. The oom killer from the #PF doesn't really provide any clear advantage these days AFAIK. On the other hand it allows for a very disruptive behavior. In the worst case it can lead to a system panic if the VM_FAULT_OOM is not really caused by a memory shortage but rather by wrong error handling. If a task is looping there without any progress then it is still killable, which is a much saner behavior IMHO. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 66+ messages in thread
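The outcome argued for here — no global OOM from the #PF path, just a fault that repeats while the task stays killable — can be sketched as a toy model. All names below are illustrative, not kernel code; the `budget` parameter only exists so the simulation terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state. */
static bool fatal_signal;   /* e.g. an admin or the memcg OOM killer sent SIGKILL */
static int faults_retried;  /* how many times the fault was re-executed           */

/* Toy model of the post-fix behavior: on VM_FAULT_OOM from a
 * memcg-constrained allocation, no global OOM killer runs; the task
 * simply retries the fault.  It may loop without forward progress,
 * but it remains killable the whole time. */
static bool retry_fault_loop(int budget)
{
    while (budget-- > 0) {
        if (fatal_signal)
            return false;   /* task leaves the loop via the signal path */
        faults_retried++;   /* fault repeats; nothing else is disturbed */
    }
    return true;            /* still looping -- and still killable */
}
```

The contrast with the pre-fix behavior is the failure mode: a stuck task here costs CPU time on retries but never escalates to killing an unrelated process or panicking the host.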
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task 2021-10-21 11:49 ` Michal Hocko @ 2021-10-21 13:24 ` Vasily Averin 0 siblings, 0 replies; 66+ messages in thread From: Vasily Averin @ 2021-10-21 13:24 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-GEFAQzZX7r8dnm+yROfE0A On 21.10.2021 14:49, Michal Hocko wrote: > On Thu 21-10-21 11:03:43, Vasily Averin wrote: >> On 18.10.2021 12:04, Michal Hocko wrote: >>> On Mon 18-10-21 11:13:52, Vasily Averin wrote: >>> [...] >>>> How could this happen? >>>> >>>> User-space task inside the memcg-limited container generated a page fault, >>>> its handler do_user_addr_fault() called handle_mm_fault which could not >>>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM. >>>> Then do_user_addr_fault() called pagefault_out_of_memory() which executed >>>> out_of_memory() without set of memcg. >> >>> I will be honest that I am not really happy about pagefault_out_of_memory. >>> I have tried to remove it in the past. Without much success back then, >>> unfortunately[1]. >>> >>> [1] I do not have msg-id so I cannot provide a lore link but google >>> pointed me to https://www.mail-archive.com/linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg1400402.html >> >> I re-read this discussion and in general I support your position. >> As far as I understand your opponents cannot explain why "random kill" is mandatory here, >> they are just afraid that it might be useful here and do not want to remove it completely. > > That aligns with my recollection. > >> Ok, let's allow him to do it. Moreover I'm ready to keep it as default behavior. >> >> However I would like to have some choice in this point. 
>> >> In general we can: >> - continue to use "random kill" and rely on the wisdom of the ancestors. > > I do not follow. Does that mean to preserve existing oom killer from > #PF? > >> - do nothing, repeat #PF and rely on fate: "nothing bad will happen if we do it again". >> - add some (progressive) killable delay, rely on good will of (unkillable) neighbors and wait for them to release required memory. > > Again, not really sure what you mean > >> - mark the current task as cycled in #PF and somehow use this mark in allocator > > How? > >> - make sure that the current task is really cycled, have no progress, send him fatal signal to kill it and break the cycle. > > No! We cannot really kill the task if we could we would have done it by > the oom killer already > >> - implement any better ideas, >> - use any combination of previous points >> >> We can select required strategy for example via sysctl. > > Absolutely no! How can admin know any better than the kernel? > >> For me "random kill" is worst choice, >> Why can't we just kill the looped process instead? > > See above. > >> It can be marked as oom-unkillable, so OOM-killer was unable to select it. >> However I doubt it means "never kill it", for me it is something like "last possible victim" priority. > > It means never kill it because of OOM. If it is retrying because of OOM > then it is effectively the same thing. > > The oom killer from the #PF doesn't really provide any clear advantage > these days AFAIK. On the other hand it allows for a very disruptive > behavior. In a worst case it can lead to a system panic if the > VM_FAULT_OOM is not really caused by a memory shortage but rather a > wrong error handling. If a task is looping there without any progress > then it is still kilallable which is a much saner behavior IMHO. Let's continue this discussion in "Re: [PATCH memcg 3/3] memcg: handle memcg oom failures" thread. . ^ permalink raw reply [flat|nested] 66+ messages in thread
end of thread (newest message: 2021-10-29 7:58 UTC)
Thread overview: 66+ messages
2021-10-18 8:13 [PATCH memcg 0/1] false global OOM triggered by memcg-limited task Vasily Averin
[not found] ` <9d10df01-0127-fb40-81c3-cc53c9733c3e-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-18 9:04 ` Michal Hocko
[not found] ` <YW04jWSv6pQb2Goe-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-18 10:05 ` Vasily Averin
[not found] ` <6b751abe-aa52-d1d8-2631-ec471975cc3a-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-18 10:12 ` Vasily Averin
2021-10-18 11:53 ` Michal Hocko
[not found] ` <27dc0c49-a0d6-875b-49c6-0ef5c0cc3ac8@virtuozzo.com>
[not found] ` <27dc0c49-a0d6-875b-49c6-0ef5c0cc3ac8-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-18 12:27 ` Michal Hocko
[not found] ` <YW1oMxNkUCaAimmg-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-18 15:07 ` Shakeel Butt
[not found] ` <CALvZod42uwgrg83CCKn6JgYqAQtR1RLJSuybNYjtkFo4wVgT1w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2021-10-18 16:51 ` Michal Hocko
2021-10-18 17:13 ` Shakeel Butt
2021-10-18 18:52 ` Vasily Averin
[not found] ` <153f7aa6-39ef-f064-8745-a9489e088239-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-18 19:18 ` Vasily Averin
[not found] ` <4a30aa18-e2a2-693c-8237-b75fffac9838-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-19 5:34 ` Shakeel Butt
2021-10-19 5:33 ` Shakeel Butt
[not found] ` <CALvZod5Kut63MLVfCkEW5XemqN4Jnd1iEQD_Gk0w5=fPffL8Bg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2021-10-19 6:42 ` Vasily Averin
[not found] ` <25120323-d222-cc5e-fe08-6471bce13bd6-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-19 8:47 ` Michal Hocko
[not found] ` <YW1gRz0rTkJrvc4L-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-19 6:30 ` Vasily Averin
[not found] ` <339ae4b5-6efd-8fc2-33f1-2eb3aee71cb2-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-19 8:49 ` Michal Hocko
[not found] ` <YW6GoZhFUJc1uLYr-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-19 10:30 ` Vasily Averin
[not found] ` <687bf489-f7a7-5604-25c5-0c1a09e0905b-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-19 11:54 ` Michal Hocko
[not found] ` <YW6yAeAO+TeS3OdB-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-19 12:04 ` Michal Hocko
[not found] ` <YW60Rs1mi24sJmp4-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-19 13:26 ` Vasily Averin
[not found] ` <6c422150-593f-f601-8f91-914c6c5e82f4-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-19 14:13 ` Michal Hocko
[not found] ` <YW7SfkZR/ZsabkXV-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-19 14:19 ` Michal Hocko
2021-10-19 19:09 ` Vasily Averin
[not found] ` <3c76e2d7-e545-ef34-b2c3-a5f63b1eff51-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-20 8:07 ` [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
[not found] ` <f40cd82c-f03a-4d36-e953-f89399cb8f58-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-20 8:43 ` Michal Hocko
[not found] ` <YW/WoJDFM3ddHn7Y-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-20 12:11 ` [PATCH memcg RFC 0/3] " Vasily Averin
[not found] ` <cover.1634730787.git.vvs@virtuozzo.com>
[not found] ` <cover.1634730787.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-20 12:12 ` [PATCH memcg 1/3] mm: do not force global OOM from inside " Vasily Averin
[not found] ` <2c13c739-7282-e6f4-da0a-c0b69e68581e-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-20 12:33 ` Michal Hocko
[not found] ` <YXAMpxjuV/h2awqG-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-20 13:52 ` Vasily Averin
2021-10-20 12:13 ` [PATCH memcg 2/3] memcg: remove charge forcing for " Vasily Averin
[not found] ` <56180e53-b705-b1be-9b60-75e141c8560c-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-20 12:41 ` Michal Hocko
[not found] ` <YXAOjQO5r1g/WKmn-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-20 14:21 ` Vasily Averin
[not found] ` <cbda9b6b-3ee5-06ab-9a3b-debf361b55bb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-20 14:57 ` Michal Hocko
[not found] ` <YXAubuMMgNDeguNx-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-20 15:20 ` Tetsuo Handa
[not found] ` <dee26724-3ead-24d4-0c1b-23905bfcdae9-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org>
2021-10-21 10:03 ` Michal Hocko
2021-10-20 12:14 ` [PATCH memcg 3/3] memcg: handle memcg oom failures Vasily Averin
2021-10-20 13:02 ` Michal Hocko
[not found] ` <YXATW7KsUZzbbGHy-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-20 15:46 ` Vasily Averin
[not found] ` <d3b32c72-6375-f755-7599-ab804719e1f6-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-21 11:49 ` Michal Hocko
[not found] ` <YXFPSvGFV539OcEk-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-21 15:05 ` Vasily Averin
[not found] ` <b618ac5c-e982-c4af-ecf3-564b8de52c8c-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-21 16:47 ` Michal Hocko
[not found] ` <YXGZoVhROdFG2Wym-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-22 8:10 ` [PATCH memcg v2 0/2] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
[not found] ` <cover.1634889066.git.vvs@virtuozzo.com>
[not found] ` <cover.1634889066.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-22 8:11 ` [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin
[not found] ` <91d9196e-842a-757f-a3f2-caeb4a89a0d8-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-22 8:55 ` Michal Hocko
2021-10-22 8:11 ` [PATCH memcg v2 2/2] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
[not found] ` <4b315938-5600-b7f5-bde9-82f638a2e595-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-22 9:10 ` Michal Hocko
[not found] ` <YXJ/63kIpTq8AOlD-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-23 13:18 ` [PATCH memcg v3 0/3] " Vasily Averin
[not found] ` <cover.1634994605.git.vvs@virtuozzo.com>
[not found] ` <cover.1634994605.git.vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-23 13:19 ` [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for " Vasily Averin
[not found] ` <0828a149-786e-7c06-b70a-52d086818ea3-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-25 9:27 ` Michal Hocko
2021-10-23 13:20 ` [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin
[not found] ` <f5fd8dd8-0ad4-c524-5f65-920b01972a42-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-23 15:01 ` Tetsuo Handa
2021-10-23 19:15 ` Vasily Averin
[not found] ` <e2a847a2-a414-2535-e3d1-b100a023b9d1-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org>
2021-10-25 8:04 ` Michal Hocko
[not found] ` <YXZk9Lr217e+saSM-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-26 13:56 ` Tetsuo Handa
[not found] ` <62a326bc-37d2-b8c9-ddbf-7adaeaadf341-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org>
2021-10-26 14:07 ` Michal Hocko
2021-10-25 9:34 ` Michal Hocko
2021-10-23 13:20 ` [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
[not found] ` <8f5cebbb-06da-4902-91f0-6566fc4b4203-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-25 9:36 ` Michal Hocko
[not found] ` <YXZ6qaMJBomVfV8O-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2021-10-27 22:36 ` Andrew Morton
[not found] ` <20211027153608.9910f7db99d5ef574045370e-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2021-10-28 7:22 ` Vasily Averin
[not found] ` <ea14200f-ad2c-6901-25da-54900fe2ce14-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-29 7:46 ` Greg Kroah-Hartman
2021-10-29 7:58 ` Michal Hocko
2021-10-21 8:03 ` [PATCH memcg 0/1] false global OOM triggered by memcg-limited task Vasily Averin
[not found] ` <496ed57e-61c6-023a-05fd-4ef21b0294cf-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2021-10-21 11:49 ` Michal Hocko
2021-10-21 13:24 ` Vasily Averin