* Re: memory-cgroup bug  (168+ messages in thread)
  In reply to: <20121121200207.01068046@pobox.sk> (not found in archive)
  From: Michal Hocko @ 2012-11-22 15:24 UTC
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Wed 21-11-12 20:02:07, azurIt wrote:
> Hi,
>
> i'm using memory cgroup for limiting our users and having a really
> strange problem when a cgroup gets out of its memory limit. It's very
> strange because it happens only sometimes (about once per week on a
> random user); out of memory is usually handled ok.

What is your memcg configuration? Do you use deeper hierarchies, is
use_hierarchy enabled? Is the memcg oom (aka memory.oom_control)
enabled? Do you use soft limits for those groups? Is memcg swap
accounting enabled and are memsw limits in place? Is the machine under
global memory pressure as well? Could you post sysrq+t or sysrq+w?

> This happens when the problem occurs:
> - no new processes can be started for this cgroup
> - current processes are frozen and taking 100% of CPU
> - when i try to 'strace' any of the current processes, the whole
>   strace freezes until the process is killed (strace cannot be
>   terminated by CTRL-c)
> - the problem can be resolved by raising the memory limit for the
>   cgroup or killing a few processes inside the cgroup so some memory
>   is freed
>
> I also grabbed the content of /proc/<pid>/stack of a frozen process:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0

Hmm, what is this?

> [<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
> [<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
> [<ffffffff810ec54e>] do_wp_page+0x14e/0x800
> [<ffffffff810eda34>] handle_pte_fault+0x264/0x940
> [<ffffffff810ee248>] handle_mm_fault+0x138/0x260
> [<ffffffff810270ed>] do_page_fault+0x13d/0x460
> [<ffffffff815b53ff>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff

How many tasks are hung in mem_cgroup_handle_oom? If there were many of
them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg: make
oom_lock 0 and 1 based rather than counter) and its follow-up fix
23751be00940 (memcg: fix hierarchical oom locking), but you are saying
that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would
make more sense.

> I'm currently using kernel 3.2.34 but i've had this problem since 2.6.32.

I guess this is a clean vanilla (stable) kernel, right? Are you able to
reproduce with the latest Linus tree?
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majordomo@kvack.org. For more info on Linux MM, see
http://www.linux-mm.org/
* Re: memory-cgroup bug
  From: azurIt @ 2012-11-22 18:05 UTC
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>> i'm using memory cgroup for limiting our users and having a really
>> strange problem when a cgroup gets out of its memory limit. [...]
>
>What is your memcg configuration? Do you use deeper hierarchies, is
>use_hierarchy enabled? Is the memcg oom (aka memory.oom_control)
>enabled? Do you use soft limits for those groups? Is memcg swap
>accounting enabled and are memsw limits in place?
>Is the machine under global memory pressure as well?
>Could you post sysrq+t or sysrq+w?

My cgroups hierarchy:

/cgroups/<user_id>/uid/

where '<user_id>' is a system user id and 'uid' is just the word 'uid'.

Memory limits are set in /cgroups/<user_id>/ and hierarchy is enabled.
Processes are inside /cgroups/<user_id>/uid/. I'm using hard limits for
memory and swap BUT the system has no swap at all (it has 'only' 16 GB
of real RAM). memory.oom_control is set to 'oom_kill_disable 0'. The
server has enough free memory when the problem occurs.

>> I also grabbed the content of /proc/<pid>/stack of a frozen process:
>> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
>
>Hmm what is this?

Really don't know; i will get the stacks of all frozen processes next
time so we can compare them.

>How many tasks are hung in mem_cgroup_handle_oom? If there were many
>of them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg:
>make oom_lock 0 and 1 based rather than counter) and its follow-up fix
>23751be00940 (memcg: fix hierarchical oom locking), but you are saying
>that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would
>make more sense.

Usually a maximum of several tens of processes, but i will check it
next time. I had much worse problems on 2.6.32 - when the freezing
happened, the whole server was affected (i wasn't able to do anything
and had to wait until my scripts took care of it and killed apache, so
i don't have any detailed info). On 3.2 only the target cgroup is
affected.

>I guess this is a clean vanilla (stable) kernel, right? Are you able to
>reproduce with the latest Linus tree?

Well, no. I'm using, for example, the newest stable grsecurity patch.
I'm also using a few of Andrea Righi's cgroup subsystems but i don't
believe these are causing the problems:
- cgroup-uid, which moves processes into cgroups based on UID
- cgroup-task, which can limit the number of tasks in a cgroup (i
  already tried disabling this one, it didn't help)
http://www.develer.com/~arighi/linux/patches/

Unfortunately i cannot just install a new and untested kernel version
cos i'm not able to reproduce this problem on demand (it happens
randomly in our production environment).

Could it be that OOM cannot start killing processes because there's no
free memory left in the cgroup?

Thank you!

azur
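The layout described above can be sketched as follows. This is only an
illustration against a scratch directory: on the real machine CGROOT
would be the cgroupfs mount point (/cgroups in this thread) and the
files would be live memcg control files; the user id 1234 is made up.

```shell
# Recreate the reported per-user layout in a scratch directory.
CGROOT=${CGROOT:-/tmp/cgroup-demo}
USER_DIR="$CGROOT/1234"            # stands in for /cgroups/<user_id>/
mkdir -p "$USER_DIR/uid"           # /cgroups/<user_id>/uid/ holds the tasks

# On a real memcg mount, use_hierarchy must be enabled before child
# groups are created.
echo 1 > "$USER_DIR/memory.use_hierarchy"

# Hard limits; memsw equals mem because the box has no swap.
echo 157286400 > "$USER_DIR/memory.limit_in_bytes"
echo 157286400 > "$USER_DIR/memory.memsw.limit_in_bytes"

# Keep the memcg OOM killer enabled (oom_kill_disable 0).
echo 0 > "$USER_DIR/memory.oom_control"
```

On the real mount these writes would be rejected unless the memory
controller is actually mounted there, so treat the paths as assumptions.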
* Re: memory-cgroup bug
  From: Michal Hocko @ 2012-11-22 21:42 UTC
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Thu 22-11-12 19:05:26, azurIt wrote:
[...]
> My cgroups hierarchy:
> /cgroups/<user_id>/uid/
>
> where '<user_id>' is a system user id and 'uid' is just the word 'uid'.
>
> Memory limits are set in /cgroups/<user_id>/ and hierarchy is
> enabled. Processes are inside /cgroups/<user_id>/uid/. I'm using
> hard limits for memory and swap BUT the system has no swap at all
> (it has 'only' 16 GB of real RAM). memory.oom_control is set to
> 'oom_kill_disable 0'. The server has enough free memory when the
> problem occurs.

OK, so the global reclaim shouldn't be active. This is definitely good
to know.

> >> I also grabbed the content of /proc/<pid>/stack of a frozen process:
> >> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> >> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> >
> >Hmm what is this?
>
> Really don't know; i will get the stacks of all frozen processes next
> time so we can compare them.
>
> >> [<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
> >> [<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
> >> [<ffffffff810ec54e>] do_wp_page+0x14e/0x800
> >> [...]

Btw. is this stack stable or is the task bouncing in some loop? And
finally, could you post the disassembly of your version of
mem_cgroup_handle_oom, please?

> Usually a maximum of several tens of processes, but i will check it
> next time. I had much worse problems on 2.6.32 - when the freezing
> happened, the whole server was affected [...]

Hmm, maybe the issue fixed by 1d65f86d (mm: preallocate page before
lock_page() at filemap COW) which was merged in 3.1.

> On 3.2 only the target cgroup is affected.
>
> >I guess this is a clean vanilla (stable) kernel, right? Are you able
> >to reproduce with the latest Linus tree?
>
> Well, no. I'm using, for example, the newest stable grsecurity patch.

That shouldn't be related.

> I'm also using a few of Andrea Righi's cgroup subsystems but i don't
> believe these are causing the problems:
> - cgroup-uid, which moves processes into cgroups based on UID
> - cgroup-task, which can limit the number of tasks in a cgroup (i
>   already tried disabling this one, it didn't help)
> http://www.develer.com/~arighi/linux/patches/

I am not familiar with those patches but I will double check.

> Unfortunately i cannot just install a new and untested kernel version
> cos i'm not able to reproduce this problem on demand (it happens
> randomly in our production environment).

This will make it a bit harder to debug, but let's see - maybe the new
traces will help...

> Could it be that OOM cannot start killing processes because there's
> no free memory left in the cgroup?

That shouldn't happen.
--
Michal Hocko
SUSE Labs
* Re: memory-cgroup bug
  From: azurIt @ 2012-11-22 22:34 UTC
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>Btw. is this stack stable or is the task bouncing in some loop?

Not sure, will check it next time.

>And finally, could you post the disassembly of your version of
>mem_cgroup_handle_oom, please?

How can i do this?

>What does your kernel log say while this is happening? Are there any
>memcg OOM messages showing up?

I will get the logs next time.

Thank you!

azur
* Re: memory-cgroup bug
  From: Michal Hocko @ 2012-11-23 7:40 UTC
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Thu 22-11-12 23:34:34, azurIt wrote:
[...]
> >And finally, could you post the disassembly of your version of
> >mem_cgroup_handle_oom, please?
>
> How can i do this?

Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom, or
use objdump -d YOUR_VMLINUX and copy out only the mem_cgroup_handle_oom
function.
--
Michal Hocko
SUSE Labs
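Concretely, the two approaches above could look like this; "vmlinux" is
assumed to be the uncompressed, unstripped image from the kernel build
tree (not the bzImage installed in /boot), and the awk filter is just
one way to cut a single function out of the full disassembly.

```shell
# Assumes "vmlinux" is the uncompressed, unstripped build-tree image.
if [ -r vmlinux ]; then
    # 1) gdb: disassemble one function non-interactively
    gdb -batch -ex 'disassemble mem_cgroup_handle_oom' vmlinux

    # 2) objdump: keep the lines from the function's label up to the
    #    blank line that ends each function's listing in -d output
    objdump -d vmlinux |
        awk '/<mem_cgroup_handle_oom>:/{p=1} p{print} p&&/^$/{exit}'
fi
```

Both need the kernel to have been built with symbols; a stripped image
gives the "No symbol table is loaded" failure seen later in the thread.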
* Re: memory-cgroup bug
  From: azurIt @ 2012-11-23 9:21 UTC
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom, or
>use objdump -d YOUR_VMLINUX and copy out only the mem_cgroup_handle_oom
>function.

If 'YOUR_VMLINUX' is supposed to be my kernel image:

# gdb vmlinuz-3.2.34-grsec-1
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
"/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized

# objdump -d vmlinuz-3.2.34-grsec-1
objdump: vmlinuz-3.2.34-grsec-1: File format not recognized

# file vmlinuz-3.2.34-grsec-1
vmlinuz-3.2.34-grsec-1: Linux kernel x86 boot executable bzImage, version 3.2.34-grsec (root@server01) #1, RO-rootFS, swap_dev 0x3, Normal VGA

I'm probably doing something wrong :)

It, luckily, happened again so i have more info:

- there weren't any OOM messages in the kernel log for that cgroup
- there were 16 processes in the cgroup
- the processes in the cgroup were together taking 100% of CPU (the
  cgroup was allowed to use only one core, so 100% of that core)
- memory.failcnt was growing fast
- oom_control:
  oom_kill_disable 0
  under_oom 0 (this was flipping between 0 and 1)
- limit_in_bytes was set to 157286400
- content of stat (as you can see, the whole memory limit was used):

cache 0
rss 0
mapped_file 0
pgpgin 0
pgpgout 0
swap 0
pgfault 0
pgmajfault 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 157286400
hierarchical_memsw_limit 157286400
total_cache 0
total_rss 157286400
total_mapped_file 0
total_pgpgin 10326454
total_pgpgout 10288054
total_swap 0
total_pgfault 12939677
total_pgmajfault 4283
total_inactive_anon 0
total_active_anon 157286400
total_inactive_file 0
total_active_file 0
total_unevictable 0

I also grabbed oom_adj, oom_score_adj and the stack of all processes,
here it is:
http://www.watchdog.sk/lkml/memcg-bug.tar

Notice that the stack is different for a few processes. The stacks of
all processes were NOT changing; they stayed the same the whole time.

Btw, don't know if it matters, but i have several cgroup subsystems
mounted and i'm also using them (i was not activating the freezer in
this case - don't know if it can be activated automatically by the
kernel or what; i didn't check whether the cgroup was frozen but i
suppose it wasn't):

none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0

Thank you.

azur
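Grabbing the stacks a few times in a row (which gets asked for later in
the thread) could be scripted along these lines. A minimal sketch: the
tasks-file path reflects this setup's layout, reading another task's
/proc/<pid>/stack needs root, and tasks that exit in between are simply
skipped.

```shell
# snapshot_stacks TASKS_FILE OUT_DIR ROUNDS
# Dump /proc/<pid>/stack for every pid listed in TASKS_FILE, ROUNDS
# times, so the stacks can be compared over time.
snapshot_stacks() {
    tasks=$1; out=$2; rounds=${3:-3}
    mkdir -p "$out"
    i=1
    while [ "$i" -le "$rounds" ]; do
        for pid in $(cat "$tasks"); do
            # empty file is left behind if the stack is unreadable
            cat "/proc/$pid/stack" > "$out/round$i.$pid" 2>/dev/null || true
        done
        i=$((i + 1))
        if [ "$i" -le "$rounds" ]; then sleep 5; fi
    done
}
```

Usage on this setup would be something like
`snapshot_stacks /cgroups/<user_id>/uid/tasks /tmp/memcg-stacks 3`,
with the group path filled in.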
* Re: memory-cgroup bug
  From: Michal Hocko @ 2012-11-23 9:28 UTC
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Fri 23-11-12 10:21:37, azurIt wrote:
> >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom, or
> >use objdump -d YOUR_VMLINUX and copy out only the
> >mem_cgroup_handle_oom function.
>
> If 'YOUR_VMLINUX' is supposed to be my kernel image:
>
> # gdb vmlinuz-3.2.34-grsec-1
> [...]
> "/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized
>
> # objdump -d vmlinuz-3.2.34-grsec-1

You need vmlinux, not vmlinuz...
--
Michal Hocko
SUSE Labs
* Re: memory-cgroup bug
  From: azurIt @ 2012-11-23 9:44 UTC
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>On Fri 23-11-12 10:21:37, azurIt wrote:
>> >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom, or
>> >use objdump -d YOUR_VMLINUX and copy out only the
>> >mem_cgroup_handle_oom function.
>> [...]
>
>You need vmlinux, not vmlinuz...

ok, got it, but still no luck:

# gdb vmlinux
GNU gdb (GDB) 7.0.1-debian
[...]
Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done.
(gdb) disassemble mem_cgroup_handle_oom
No symbol table is loaded.  Use the "file" command.

# objdump -d vmlinux | grep mem_cgroup_handle_oom
<no output>

i can recompile the kernel if anything needs to be added into it.

azur
* Re: memory-cgroup bug
  From: Michal Hocko @ 2012-11-23 10:10 UTC
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Fri 23-11-12 10:44:23, azurIt wrote:
[...]
> # gdb vmlinux
> [...]
> Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done.
> (gdb) disassemble mem_cgroup_handle_oom
> No symbol table is loaded.  Use the "file" command.
>
> # objdump -d vmlinux | grep mem_cgroup_handle_oom
> <no output>

Hmm, strange - so the function is on the stack but it has been inlined?
Doesn't make much sense to me.

> i can recompile the kernel if anything needs to be added into it.

If you could instrument mem_cgroup_handle_oom with some printks (before
we take the memcg_oom_lock, before we schedule, and in
mem_cgroup_out_of_memory), that would help.
--
Michal Hocko
SUSE Labs
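What such instrumentation could look like, sketched as a patch against
the 3.2-era mem_cgroup_handle_oom. The surrounding code is elided and
the local variable name (need_to_kill) is an assumption about that
kernel's source, so the hunks would need adjusting before applying;
Michal's actual patch comes later in the thread.

```diff
 static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 {
 	...
+	pr_info("memcg-oom: pid %d waiting for memcg_oom_lock\n",
+		task_pid_nr(current));
 	spin_lock(&memcg_oom_lock);
 	...
+	pr_info("memcg-oom: pid %d going to sleep\n", task_pid_nr(current));
 	schedule();
 	...
 	if (need_to_kill) {
+		pr_info("memcg-oom: pid %d invoking the memcg OOM killer\n",
+			task_pid_nr(current));
 		mem_cgroup_out_of_memory(memcg, mask);
 	}
```

The idea is just to log which tasks reach each stage, so the stuck
point (lock acquisition vs. sleeping on the oom waitqueue) becomes
visible in dmesg.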
* Re: memory-cgroup bug
  From: Glauber Costa @ 2012-11-23 9:34 UTC
  To: azurIt; +Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist

On 11/23/2012 01:21 PM, azurIt wrote:
>> Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>> use objdump -d YOUR_VMLINUX and copy out only the
>> mem_cgroup_handle_oom function.
>
> If 'YOUR_VMLINUX' is supposed to be my kernel image:
>
> # gdb vmlinuz-3.2.34-grsec-1

This is vmlinuz, not vmlinux. This is the compressed image.

> # file vmlinuz-3.2.34-grsec-1
> vmlinuz-3.2.34-grsec-1: Linux kernel x86 boot executable bzImage, version 3.2.34-grsec (root@server01) #1, RO-rootFS, swap_dev 0x3, Normal VGA
>
> I'm probably doing something wrong :)

You need this:

[glauber@straightjacket linux-glommer]$ file vmlinux
vmlinux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=0xba936ee6b6096f9bc4c663f2a2ee0c2d2481c408, not stripped

instead of the bzImage.
* Re: memory-cgroup bug
  From: Michal Hocko @ 2012-11-23 10:04 UTC
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Fri 23-11-12 10:21:37, azurIt wrote:
[...]
> It, luckily, happened again so i have more info:
>
> - there weren't any OOM messages in the kernel log for that cgroup
> - there were 16 processes in the cgroup
> - the processes in the cgroup were together taking 100% of CPU (the
>   cgroup was allowed to use only one core, so 100% of that core)
> - memory.failcnt was growing fast
> - oom_control:
>   oom_kill_disable 0
>   under_oom 0 (this was flipping between 0 and 1)

So there was an OOM going on but no messages in the log? Really
strange. Kame already asked about the oom_score_adj of the processes in
the group, but it didn't look like all the processes would have oom
disabled, right?

> - limit_in_bytes was set to 157286400
> - content of stat (as you can see, the whole memory limit was used):
> cache 0
> rss 0

This looks like a top-level group for your user.

> mapped_file 0
> pgpgin 0
> pgpgout 0
> swap 0
> pgfault 0
> pgmajfault 0
> inactive_anon 0
> active_anon 0
> inactive_file 0
> active_file 0
> unevictable 0
> hierarchical_memory_limit 157286400
> hierarchical_memsw_limit 157286400
> total_cache 0
> total_rss 157286400

OK, so all the memory is anonymous and you have no swap, so the oom is
the only thing to do.

> total_mapped_file 0
> total_pgpgin 10326454
> total_pgpgout 10288054
> total_swap 0
> total_pgfault 12939677
> total_pgmajfault 4283
> total_inactive_anon 0
> total_active_anon 157286400
> total_inactive_file 0
> total_active_file 0
> total_unevictable 0
>
> I also grabbed oom_adj, oom_score_adj and the stack of all processes,
> here it is:
> http://www.watchdog.sk/lkml/memcg-bug.tar

Hmm, all processes waiting for oom are stuck at the very same place:

$ grep mem_cgroup_handle_oom -r [0-9]*
30858/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30859/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30860/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30892/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30898/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
31588/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
32044/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
32358/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
6031/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
6534/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
7020/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0

We take the memcg_oom_lock spinlock twice in that function, plus we can
schedule. As none of the tasks is scheduled, this would suggest that
you are blocked at the first lock. But who got the lock then? This is
really strange. Btw. is sysrq+t resp. sysrq+w showing the same traces
as /proc/<pid>/stack?

> Notice that the stack is different for a few processes.

Yes, the others are in VFS resp. ext3. ext3_write_begin looks a bit
dangerous, but it grabs the page before it really starts a transaction.

> The stacks of all processes were NOT changing; they stayed the same.

Could you take a few snapshots over time?

> Btw, don't know if it matters, but i have several cgroup subsystems
> mounted and i'm also using them (i was not activating the freezer in
> this case - don't know if it can be activated automatically by the
> kernel or what,

No.

> i didn't check whether the cgroup was frozen but i suppose it wasn't):
> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0

Do you see the same issue if only the memory controller is mounted
(resp. cpuset, which you seem to use as well, judging from your
description)? I know you said booting into a vanilla kernel would be
problematic, but could you at least rule out the cgroup patches that
you have mentioned? If you need to move a task to a group based on a
uid you can use the cgrules daemon (libcgroup1 package) for that as
well.
--
Michal Hocko
SUSE Labs
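For the cgrules route, the daemon reads one rule per line from its
config file. The fragment below is only a guess at a matching setup
(general syntax: <user> <controllers> <destination>); whether the
libcgroup version in question supports the %u substitution in the
destination would need checking against its cgrules.conf man page, and
the @users group name is made up.

```
# /etc/cgrules.conf (hypothetical): route each login user's tasks
# into a per-user group under the memory and cpuset controllers.
# <user>      <controllers>      <destination>
@users        memory,cpuset      %u/uid/
```

As azur notes in his reply, this only places tasks after they start, so
it cannot emulate cgroup-task's behaviour of refusing to start a
process once the group's task limit is hit.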
* Re: memory-cgroup bug [not found] ` <20121123100438.GF24698-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-11-23 14:59 ` azurIt [not found] ` <20121123155904.490039C5-Rm0zKEqwvD4@public.gmane.org> 2012-11-25 0:10 ` azurIt 1 sibling, 1 reply; 168+ messages in thread From: azurIt @ 2012-11-23 14:59 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist >If you could instrument mem_cgroup_handle_oom with some printks (before >we take the memcg_oom_lock, before we schedule and into >mem_cgroup_out_of_memory) If you send me patch i can do it. I'm, unfortunately, not able to code it. >> It, luckily, happend again so i have more info. >> >> - there wasn't any logs in kernel from OOM for that cgroup >> - there were 16 processes in cgroup >> - processes in cgroup were taking togather 100% of CPU (it >> was allowed to use only one core, so 100% of that core) >> - memory.failcnt was groving fast >> - oom_control: >> oom_kill_disable 0 >> under_oom 0 (this was looping from 0 to 1) > >So there was an OOM going on but no messages in the log? Really strange. >Kame already asked about oom_score_adj of the processes in the group but >it didn't look like all the processes would have oom disabled, right? There were no messages telling that some processes were killed because of OOM. >> - limit_in_bytes was set to 157286400 >> - content of stat (as you can see, the whole memory limit was used): >> cache 0 >> rss 0 > >This looks like a top-level group for your user. Yes, it was from /cgroup/<user-id>/ >> mapped_file 0 >> pgpgin 0 >> pgpgout 0 >> swap 0 >> pgfault 0 >> pgmajfault 0 >> inactive_anon 0 >> active_anon 0 >> inactive_file 0 >> active_file 0 >> unevictable 0 >> hierarchical_memory_limit 157286400 >> hierarchical_memsw_limit 157286400 >> total_cache 0 >> total_rss 157286400 > >OK, so all the memory is anonymous and you have no swap so the oom is >the only thing to do. 
What will happen if the same situation occurs globally? No swap, every bit of memory used. Will kernel be able to start OOM killer? Maybe the same thing is happening in cgroup - there's simply no space to run OOM killer. And maybe this is why it's happening rarely - usually there are still at least few KBs of memory left to start OOM killer. >Hmm, all processes waiting for oom are stuck at the very same place: >$ grep mem_cgroup_handle_oom -r [0-9]* >30858/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >30859/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >30860/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >30892/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >30898/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >31588/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >32044/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >32358/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >6031/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >6534/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >7020/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > >We are taking memcg_oom_lock spinlock twice in that function + we can >schedule. As none of the tasks is scheduled this would suggest that you >are blocked at the first lock. But who got the lock then? >This is really strange. >Btw. is sysrq+t resp. sysrq+w showing the same traces as >/proc/<pid>/stat? Unfortunately i'm connecting remotely to the servers (SSH). >> Notice that stack is different for few processes. > >Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous >but it grabs the page before it really starts a transaction. Maybe these processes were throttled by cgroup-blkio at the same time and are still keeping the lock? So the problem occurs when there are low on memory and cgroup is doing IO out of it's limits. Only guessing and telling my thoughts. 
>> Stack for all processes were NOT chaging and was still the same.
>
>Could you take few snapshots over time?

Will do next time, but I can't keep services frozen for long or customers will get angry.

>> didn't checked if cgroup was freezed but i suppose it wasn't):
>> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0
>
>Do you see the same issue if only memory controller was mounted (resp.
>cpuset which you seem to use as well from your description).

Uh, we are using all of the mounted subsystems :( I would only be able to umount freezer, and maybe blkio, for some time. Will that help?

>I know you said booting into a vanilla kernel would be problematic but
>could you at least rule out te cgroup patches that you have mentioned?
>If you need to move a task to a group based by an uid you can use
>cgrules daemon (libcgroup1 package) for that as well.

We are using cgroup-uid because it's MUCH more effective. For example, I don't believe that cgroup-task will work with that daemon. What happens if cgrules isn't able to add a process to a cgroup because of the task limit? The process will probably continue and run outside of any cgroup, which is wrong. With cgroup-task + cgroup-uid, such processes cannot even be started (and this is what we need).

^ permalink raw reply	[flat|nested] 168+ messages in thread
[parent not found: <20121123155904.490039C5-Rm0zKEqwvD4@public.gmane.org>]
* Re: memory-cgroup bug [not found] ` <20121123155904.490039C5-Rm0zKEqwvD4@public.gmane.org> @ 2012-11-25 10:17 ` Michal Hocko 2012-11-25 12:39 ` azurIt 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2012-11-25 10:17 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist On Fri 23-11-12 15:59:04, azurIt wrote: > >If you could instrument mem_cgroup_handle_oom with some printks (before > >we take the memcg_oom_lock, before we schedule and into > >mem_cgroup_out_of_memory) > > > If you send me patch i can do it. I'm, unfortunately, not able to code it. Inlined at the end of the email. Please note I have compile tested it. It might produce a lot of output. > >> It, luckily, happend again so i have more info. > >> > >> - there wasn't any logs in kernel from OOM for that cgroup > >> - there were 16 processes in cgroup > >> - processes in cgroup were taking togather 100% of CPU (it > >> was allowed to use only one core, so 100% of that core) > >> - memory.failcnt was groving fast > >> - oom_control: > >> oom_kill_disable 0 > >> under_oom 0 (this was looping from 0 to 1) > > > >So there was an OOM going on but no messages in the log? Really strange. > >Kame already asked about oom_score_adj of the processes in the group but > >it didn't look like all the processes would have oom disabled, right? > > > There were no messages telling that some processes were killed because of OOM. dmesg | grep "Out of memory" doesn't tell anything, right? > >> - limit_in_bytes was set to 157286400 > >> - content of stat (as you can see, the whole memory limit was used): > >> cache 0 > >> rss 0 > > > >This looks like a top-level group for your user. 
> > > Yes, it was from /cgroup/<user-id>/
> >
> > >> mapped_file 0
> > >> pgpgin 0
> > >> pgpgout 0
> > >> swap 0
> > >> pgfault 0
> > >> pgmajfault 0
> > >> inactive_anon 0
> > >> active_anon 0
> > >> inactive_file 0
> > >> active_file 0
> > >> unevictable 0
> > >> hierarchical_memory_limit 157286400
> > >> hierarchical_memsw_limit 157286400
> > >> total_cache 0
> > >> total_rss 157286400
> > >
> > >OK, so all the memory is anonymous and you have no swap so the oom is
> > >the only thing to do.
> >
> > What will happen if the same situation occurs globally? No swap, every
> > bit of memory used. Will kernel be able to start OOM killer?

The OOM killer is not a task. It doesn't allocate any memory; it just walks the process list and picks the task with the highest score. If the global OOM is not able to find any such task (e.g. because all of them have oom disabled) then the system panics.

> Maybe the same thing is happening in cgroup

The cgroup OOM differs only in that the system doesn't panic if there is no suitable task to kill.

[...]

> >> Notice that stack is different for few processes.
> >
> >Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous
> >but it grabs the page before it really starts a transaction.
>
> Maybe these processes were throttled by cgroup-blkio at the same time
> and are still keeping the lock?

If you are thinking about memcg_oom_lock then this is not possible, because that lock is held only for short periods. There is no other lock that the memcg OOM path holds.

> So the problem occurs when there are low on memory and cgroup is doing
> IO out of it's limits. Only guessing and telling my thoughts.

The lockup (if this is what happens) still might be related to the IO controller if the killed task cannot finish due to pending IO, though.

[...]

> >> didn't checked if cgroup was freezed but i suppose it wasn't):
> >> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0
> >
> >Do you see the same issue if only memory controller was mounted (resp.
> >cpuset which you seem to use as well from your description). > > > Uh, we are using all mounted subsystems :( I will be able to umount > only freezer and maybe blkio for some time. Will it help? Not sure about that without further data. > >I know you said booting into a vanilla kernel would be problematic but > >could you at least rule out te cgroup patches that you have mentioned? > >If you need to move a task to a group based by an uid you can use > >cgrules daemon (libcgroup1 package) for that as well. > > > We are using cgroup-uid cos it's MUCH MUCH MUCH more efective and > better. For example, i don't believe that cgroup-task will work with > that daemon. What will happen if cgrules won't be able to add process > into cgroup because of task limit? Process will probably continue and > will run outside of any cgroup which is wrong. With cgroup-task + > cgroup-uid, such processes cannot be even started (and this is what we > need). I am not familiar with cgroup-task controller so I cannot comment on that. 
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..7f26ec8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1863,6 +1863,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 {
 	struct oom_wait_info owait;
 	bool locked, need_to_kill;
+	int ret = false;
 
 	owait.mem = memcg;
 	owait.wait.flags = 0;
@@ -1873,6 +1874,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	mem_cgroup_mark_under_oom(memcg);
 
 	/* At first, try to OOM lock hierarchy under memcg.*/
+	printk("XXX: %d waiting for memcg_oom_lock\n", current->pid);
 	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(memcg);
 	/*
@@ -1887,12 +1889,14 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 		mem_cgroup_oom_notify(memcg);
 	spin_unlock(&memcg_oom_lock);
 
+	printk("XXX: %d need_to_kill:%d locked:%d\n", current->pid, need_to_kill, locked);
 	if (need_to_kill) {
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, mask);
 	} else {
 		schedule();
 		finish_wait(&memcg_oom_waitq, &owait.wait);
+		printk("XXX: %d woken up\n", current->pid);
 	}
 	spin_lock(&memcg_oom_lock);
 	if (locked)
@@ -1903,10 +1907,13 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	mem_cgroup_unmark_under_oom(memcg);
 
 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
-		return false;
+		goto out;
 	/* Give chance to dying process */
 	schedule_timeout_uninterruptible(1);
-	return true;
+	ret = true;
+out:
+	printk("XXX: %d done with %d\n", current->pid, ret);
+	return ret;
 }
 
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..a7db813 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -568,6 +568,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	 */
 	if (fatal_signal_pending(current)) {
 		set_thread_flag(TIF_MEMDIE);
+		printk("XXX: %d skipping task with fatal signal pending\n", current->pid);
 		return;
 	}
 
@@ -576,8 +577,10 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	read_lock(&tasklist_lock);
retry:
 	p = select_bad_process(&points, limit, mem, NULL);
-	if (!p || PTR_ERR(p) == -1UL)
+	if (!p || PTR_ERR(p) == -1UL) {
+		printk("XXX: %d nothing to kill\n", current->pid);
 		goto out;
+	}
 
 	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL,
 				"Memory cgroup out of memory"))

--
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 168+ messages in thread
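Michal's earlier description of victim selection (walk the process list, skip tasks with OOM disabled, pick the highest score, and panic only in the global case) can be modeled in a few lines of purely illustrative Python. The field names and scores below are invented, not the kernel's real badness heuristic:

```python
# Toy model of OOM victim selection as described above. Illustration
# only; the real kernel computes a badness score per task.

def select_victim(tasks, global_oom):
    candidates = [t for t in tasks if not t["oom_disabled"]]
    if not candidates:
        if global_oom:
            # No killable task at global OOM time means a panic.
            raise RuntimeError("panic: no killable task")
        return None  # memcg case: no panic, simply no victim
    return max(candidates, key=lambda t: t["score"])

tasks = [
    {"pid": 100, "score": 50, "oom_disabled": False},
    {"pid": 101, "score": 900, "oom_disabled": False},
    {"pid": 102, "score": 999, "oom_disabled": True},  # never picked
]
print(select_victim(tasks, global_oom=False)["pid"])  # 101
```

The memcg/global distinction is just the `global_oom` branch: a cgroup with every task OOM-disabled yields no victim and no panic.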
* Re: memory-cgroup bug 2012-11-25 10:17 ` Michal Hocko @ 2012-11-25 12:39 ` azurIt [not found] ` <20121125133953.AD1B2F0A-Rm0zKEqwvD4@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2012-11-25 12:39 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>Inlined at the end of the email. Please note I have compile tested
>it. It might produce a lot of output.

Thank you very much, I will install it ASAP (probably tonight).

>dmesg | grep "Out of memory"
>doesn't tell anything, right?

Only messages for other cgroups, but none for the frozen one (neither before nor after the freeze).

azur

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 168+ messages in thread
[parent not found: <20121125133953.AD1B2F0A-Rm0zKEqwvD4@public.gmane.org>]
* Re: memory-cgroup bug [not found] ` <20121125133953.AD1B2F0A-Rm0zKEqwvD4@public.gmane.org> @ 2012-11-25 13:02 ` Michal Hocko [not found] ` <20121125130208.GC10623-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread
From: Michal Hocko @ 2012-11-25 13:02 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist

On Sun 25-11-12 13:39:53, azurIt wrote:
> >Inlined at the end of the email. Please note I have compile tested
> >it. It might produce a lot of output.
>
> Thank you very much, i will install it ASAP (probably this night).

Please don't. If my analysis is correct, and I am almost 100% sure it is, then it would cause excessive logging. I am sorry I cannot come up with something else in the meantime.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 168+ messages in thread
[parent not found: <20121125130208.GC10623-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: memory-cgroup bug [not found] ` <20121125130208.GC10623-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-11-25 13:27 ` azurIt [not found] ` <20121125142709.19F4E8C2-Rm0zKEqwvD4@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2012-11-25 13:27 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist

>> Thank you very much, i will install it ASAP (probably this night).
>
>Please don't. If my analysis is correct which I am almost 100% sure it
>is then it would cause excessive logging. I am sorry I cannot come up
>with something else in the mean time.

Ok then. I will, meanwhile, try to contact Andrea Righi (the author of cgroup-task etc.) and ask him to send his opinion here about the relation between the freezes and his patches. Maybe it's some kind of bug in memcg which doesn't appear in the current vanilla code and is triggered by conditions created by, for example, cgroup-task. I noticed that the number of frozen processes always exactly matches the task limit set by cgroup-task (I already tried raising this limit AFTER the cgroup froze; it didn't change anything). I'm sure it's not a problem with cgroup-task alone; it's 100% related also to memcg (but maybe it takes the combination of both of them).

Thank you so far for your time!

azur

^ permalink raw reply	[flat|nested] 168+ messages in thread
[parent not found: <20121125142709.19F4E8C2-Rm0zKEqwvD4@public.gmane.org>]
* Re: memory-cgroup bug [not found] ` <20121125142709.19F4E8C2-Rm0zKEqwvD4@public.gmane.org> @ 2012-11-25 13:44 ` Michal Hocko 0 siblings, 0 replies; 168+ messages in thread
From: Michal Hocko @ 2012-11-25 13:44 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist

On Sun 25-11-12 14:27:09, azurIt wrote:
> >> Thank you very much, i will install it ASAP (probably this night).
> >
> >Please don't. If my analysis is correct which I am almost 100% sure it
> >is then it would cause excessive logging. I am sorry I cannot come up
> >with something else in the mean time.
>
> Ok then. I will, meanwhile, try to contact Andrea Righi (author of
> cgroup-task etc.) and ask him to send here his opinion about relation
> between freezes and his patches.

As I described in the other email, this seems to be a deadlock in the memcg OOM path, so I do not think the other patches influence it.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 168+ messages in thread
* Re: memory-cgroup bug [not found] ` <20121123100438.GF24698-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2012-11-23 14:59 ` azurIt @ 2012-11-25 0:10 ` azurIt 2012-11-25 12:05 ` Michal Hocko 1 sibling, 1 reply; 168+ messages in thread
From: azurIt @ 2012-11-25 0:10 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist

>Could you take few snapshots over time?

Here it is, now from a different server; a snapshot was taken every second for 10 minutes (hope that's enough):
www.watchdog.sk/lkml/memcg-bug-2.tar.gz

^ permalink raw reply	[flat|nested] 168+ messages in thread
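The archive above was apparently produced by copying each task's /proc/<pid>/stack once per second. A minimal sketch of such a snapshot loop follows; it is an assumption about how the data was gathered, and the cgroup mount path (/cgroup/<uid>) is taken from earlier in this thread, so adjust both for the real setup (reading /proc/<pid>/stack also requires root):

```python
# Sketch: take periodic snapshots of a memcg's failcnt and the kernel
# stacks of every task in the group. Hypothetical reconstruction of the
# data-collection step, not the script actually used in this thread.
import os
import shutil
import time

def snapshot_memcg(cg, out, seconds, interval=1.0):
    for i in range(1, seconds + 1):
        snap = os.path.join(out, str(i))
        os.makedirs(snap, exist_ok=True)
        try:
            shutil.copy(os.path.join(cg, "memory.failcnt"), snap)
        except OSError:
            pass  # counter file may be missing in a test environment
        with open(os.path.join(cg, "tasks")) as f:
            pids = [line.strip() for line in f if line.strip()]
        for pid in pids:
            pid_dir = os.path.join(snap, pid)
            os.makedirs(pid_dir, exist_ok=True)
            try:
                shutil.copy("/proc/%s/stack" % pid, pid_dir)  # needs root
            except OSError:
                pass  # task exited or stack not readable
        time.sleep(interval)

# Example (hypothetical paths):
# snapshot_memcg("/cgroup/1061", "./memcg-bug", seconds=600)
```

Ten minutes of one-second snapshots, as in the tarball, corresponds to `seconds=600`.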
* Re: memory-cgroup bug 2012-11-25 0:10 ` azurIt @ 2012-11-25 12:05 ` Michal Hocko 2012-11-25 12:36 ` azurIt 2012-11-25 13:55 ` Michal Hocko 0 siblings, 2 replies; 168+ messages in thread
From: Michal Hocko @ 2012-11-25 12:05 UTC (permalink / raw)
To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

[Adding Kamezawa into CC]

On Sun 25-11-12 01:10:47, azurIt wrote:
> >Could you take few snapshots over time?
>
> Here it is, now from different server, snapshot was taken every second
> for 10 minutes (hope it's enough):
> www.watchdog.sk/lkml/memcg-bug-2.tar.gz

Hmm, interesting:
$ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff<min) min=diff; sum+=diff; n++} prev=$1}END{printf "min:%d max:%d avg:%f\n", min, max, sum/n}'
min:16281 max:224048 avg:18818.943119

So there are a lot of allocation attempts which fail, every second! I will get to that later.

The number of tasks in the group is stable (20):
$ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c
    546 20

And no task has been killed or spawned:
$ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq
24495
24762
24774
24796
24798
24805
24813
24827
24831
24841
24842
24863
24892
24924
24931
25130
25131
25192
25193
25243

$ for stack in [0-9]*/[0-9]*
do
	head -n1 $stack/stack
done | sort | uniq -c
   9841 [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
    546 [<ffffffff811109b8>] do_truncate+0x58/0xa0
    533 [<ffffffffffffffff>] 0xffffffffffffffff

This tells us that the stacks are pretty much stable.
$ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c
    546 24495

So 24495 is stuck in do_truncate:
[<ffffffff811109b8>] do_truncate+0x58/0xa0
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

I suspect it is waiting for i_mutex. Who is holding that lock?
The other tasks are blocked on mem_cgroup_handle_oom, either coming from the page fault path, so i_mutex can be excluded, or from vfs_write (24796), and that one is interesting:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0	# takes &inode->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This smells like a deadlock, but a strange kind of one. The rapidly increasing failcnt suggests that somebody still tries to allocate, but who, when all of them hang in mem_cgroup_handle_oom? This can be explained, though.
The memcg OOM killer lets only one process (the one which is able to lock the hierarchy by mem_cgroup_oom_lock) call mem_cgroup_out_of_memory and kill a process, while the others are waiting on the wait queue.
Those retry the charge, in a hope there is some memory freed in the meantime which hasn't happened so they get into OOM again (and again and again). This all usually works out except in this particular case I would bet my hat that the OOM selected task is pid 24495 which is blocked on the mutex which is held by one of the oom killer task so it cannot finish - thus free a memory. It seems that the current Linus' tree is affected as well. I will have to think about a solution but it sounds really tricky. It is not just ext3 that is affected. I guess we need to tell mem_cgroup_cache_charge that it should never reach OOM from add_to_page_cache_locked. This sounds quite intrusive to me. On the other hand it is really weird that an excessive writer might trigger a memcg OOM killer. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: memory-cgroup bug 2012-11-25 12:05 ` Michal Hocko @ 2012-11-25 12:36 ` azurIt 2012-11-25 13:55 ` Michal Hocko 0 siblings, 0 replies; 168+ messages in thread
From: azurIt @ 2012-11-25 12:36 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

>So there is a lot of attempts to allocate which fail, every second!

Yes, as I said, the cgroup was taking 100% of its (allocated) CPU core(s). I'm not sure if all of the processes were using CPU, but a _few_ of them (not only one) were for sure.

^ permalink raw reply	[flat|nested] 168+ messages in thread
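The failcnt deltas Michal computed with awk in the previous message can be restated in a few lines of Python, which may be easier to follow than the one-liner. The sample readings below are invented for illustration:

```python
# Per-snapshot increments of a monotonically growing counter such as
# memory.failcnt; equivalent to the awk one-liner in the thread.
def failcnt_deltas(samples):
    diffs = [b - a for a, b in zip(samples, samples[1:])]
    return min(diffs), max(diffs), sum(diffs) / len(diffs)

# Invented one-per-second readings of memory.failcnt.
samples = [100, 16381, 40100, 264148]
lo, hi, avg = failcnt_deltas(samples)
print(lo, hi, avg)  # 16281 224048 88016.0
```

Pairing consecutive samples with `zip` implicitly skips the very first reading, matching the awk version's `if (prev>0)` guard.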
* Re: memory-cgroup bug 2012-11-25 12:05 ` Michal Hocko 2012-11-25 12:36 ` azurIt @ 2012-11-25 13:55 ` Michal Hocko 2012-11-26 0:38 ` azurIt 1 sibling, 1 reply; 168+ messages in thread From: Michal Hocko @ 2012-11-25 13:55 UTC (permalink / raw) To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Sun 25-11-12 13:05:24, Michal Hocko wrote: > [Adding Kamezawa into CC] > > On Sun 25-11-12 01:10:47, azurIt wrote: > > >Could you take few snapshots over time? > > > > > > Here it is, now from different server, snapshot was taken every second > > for 10 minutes (hope it's enough): > > www.watchdog.sk/lkml/memcg-bug-2.tar.gz > > Hmm, interesting: > $ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff<min) min=diff; sum+=diff; n++} prev=$1}END{printf "min:%d max:%d avg:%f\n", min, max, sum/n}' > min:16281 max:224048 avg:18818.943119 > > So there is a lot of attempts to allocate which fail, every second! > Will get to that later. > > The number of tasks in the group is stable (20): > $ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c > 546 20 > > And no task has been killed or spawned: > $ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq > 24495 > 24762 > 24774 > 24796 > 24798 > 24805 > 24813 > 24827 > 24831 > 24841 > 24842 > 24863 > 24892 > 24924 > 24931 > 25130 > 25131 > 25192 > 25193 > 25243 > > $ for stack in [0-9]*/[0-9]* > do > head -n1 $stack/stack > done | sort | uniq -c > 9841 [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > 546 [<ffffffff811109b8>] do_truncate+0x58/0xa0 > 533 [<ffffffffffffffff>] 0xffffffffffffffff > > Tells us that the stacks are pretty much stable. 
> $ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c > 546 24495 > > So 24495 is stuck in do_truncate > [<ffffffff811109b8>] do_truncate+0x58/0xa0 > [<ffffffff81121c90>] do_last+0x250/0xa30 > [<ffffffff81122547>] path_openat+0xd7/0x440 > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > [<ffffffff8110f950>] sys_open+0x20/0x30 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff > > I suspect it is waiting for i_mutex. Who is holding that lock? > Other tasks are blocked on the mem_cgroup_handle_oom either coming from > the page fault path so i_mutex can be exluded or vfs_write (24796) and > that one is interesting: > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes &inode->i_mutex > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > [<ffffffff81112381>] sys_write+0x51/0x90 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff > > This smells like a deadlock. But kind strange one. The rapidly > increasing failcnt suggests that somebody still tries to allocate but > who when all of them hung in the mem_cgroup_handle_oom. This can be > explained though. > Memcg OOM killer let's only one process (which is able to lock the > hierarchy by mem_cgroup_oom_lock) call mem_cgroup_out_of_memory and kill > a process, while others are waiting on the wait queue. 
> Once the killer
> is done it calls memcg_wakeup_oom which wakes up other tasks waiting on
> the queue. Those retry the charge, in a hope there is some memory freed
> in the meantime which hasn't happened so they get into OOM again (and
> again and again).
> This all usually works out except in this particular case I would bet
> my hat that the OOM selected task is pid 24495 which is blocked on the
> mutex which is held by one of the oom killer task so it cannot finish -
> thus free a memory.
>
> It seems that the current Linus' tree is affected as well.
>
> I will have to think about a solution but it sounds really tricky. It is
> not just ext3 that is affected.
>
> I guess we need to tell mem_cgroup_cache_charge that it should never
> reach OOM from add_to_page_cache_locked. This sounds quite intrusive to
> me. On the other hand it is really weird that an excessive writer might
> trigger a memcg OOM killer.

This is hackish but it should help you in this case. Kamezawa, what do you think about that? Should we generalize this and prepare something like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY automatically and use the function whenever we are in a locked context? To be honest I do not like this very much but nothing more sensible (without touching non-memcg paths) comes to my mind.
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..da50c83 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -448,7 +448,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(PageSwapBacked(page));
 
 	error = mem_cgroup_cache_charge(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+					(gfp_mask | __GFP_NORETRY) & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;

--
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 168+ messages in thread
* Re: memory-cgroup bug 2012-11-25 13:55 ` Michal Hocko @ 2012-11-26 0:38 ` azurIt [not found] ` <20121126013855.AF118F5E-Rm0zKEqwvD4@public.gmane.org> 2012-11-26 13:18 ` [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Michal Hocko 0 siblings, 2 replies; 168+ messages in thread
From: azurIt @ 2012-11-26 0:38 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

>This is hackish but it should help you in this case. Kamezawa, what do
>you think about that? Should we generalize this and prepare something
>like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
>automatically and use the function whenever we are in a locked context?
>To be honest I do not like this very much but nothing more sensible
>(without touching non-memcg paths) comes to my mind.

I installed a kernel with this patch; I will report back if the problem occurs again, OR in a few weeks if everything is OK. Thank you!

Btw, will this patch be backported to 3.2?

azur

^ permalink raw reply	[flat|nested] 168+ messages in thread
[parent not found: <20121126013855.AF118F5E-Rm0zKEqwvD4@public.gmane.org>]
* Re: memory-cgroup bug [not found] ` <20121126013855.AF118F5E-Rm0zKEqwvD4@public.gmane.org> @ 2012-11-26 7:57 ` Michal Hocko 0 siblings, 0 replies; 168+ messages in thread From: Michal Hocko @ 2012-11-26 7:57 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon 26-11-12 01:38:55, azurIt wrote: > >This is hackish but it should help you in this case. Kamezawa, what do > >you think about that? Should we generalize this and prepare something > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > >automatically and use the function whenever we are in a locked context? > >To be honest I do not like this very much but nothing more sensible > >(without touching non-memcg paths) comes to my mind. > > > I installed kernel with this patch, will report back if problem occurs > again OR in few weeks if everything will be ok. Thank you! Thanks! > Btw, will this patch be backported to 3.2? Once we agree on a proper solution it will be backported to the stable trees. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 0:38 ` azurIt [not found] ` <20121126013855.AF118F5E-Rm0zKEqwvD4@public.gmane.org> @ 2012-11-26 13:18 ` Michal Hocko [not found] ` <20121126131837.GC17860-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2012-11-26 17:46 ` [PATCH " Johannes Weiner 1 sibling, 2 replies; 168+ messages in thread
From: Michal Hocko @ 2012-11-26 13:18 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

[CCing also Johannes - the thread started here: https://lkml.org/lkml/2012/11/21/497]

On Mon 26-11-12 01:38:55, azurIt wrote:
> >This is hackish but it should help you in this case. Kamezawa, what do
> >you think about that? Should we generalize this and prepare something
> >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> >automatically and use the function whenever we are in a locked context?
> >To be honest I do not like this very much but nothing more sensible
> >(without touching non-memcg paths) comes to my mind.
>
> I installed kernel with this patch, will report back if problem occurs
> again OR in few weeks if everything will be ok. Thank you!

Now that I am looking at the patch more closely, it will not work, because it depends on another patch which is not merged yet, and even that one wouldn't help on its own because __GFP_NORETRY doesn't break the charge loop. Sorry, I missed that...

The patch below should help though. (It is based on top of the current -mm tree, but I will send a backport to 3.2 in the reply as well.)
---
From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

The memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents another task from terminating, because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap, or the swap limit is hit as well and all cache pages have been reclaimed already), and the process selected by the memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0	# takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock, though, because the administrator can still intervene and increase the limit on the group, which helps the writer to finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask, which then tells mem_cgroup_charge_common that OOM is not allowed for the charge.
No OOM from this path, apart from fixing the bug, also makes some sense as we really do not want to cause an OOM because of page cache usage. As a possibly visible result, add_to_page_cache_lru might fail more often with ENOMEM, but this is to be expected if the limit is set and it is preferable to the OOM killer IMO.

__GFP_NORETRY is abused for this memcg specific flag because it has been used to prevent OOM already (since the not-merged-yet "memcg: reclaim when more than one page needed"). The only difference is that the flag doesn't prevent reclaim anymore, which kind of makes sense because the global memory allocator triggers reclaim as well. The retry without any reclaim on __GFP_NORETRY doesn't make much sense anyway because it would be effectively a busy loop with OOM allowed in this path.

Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/gfp.h        |    3 +++
 include/linux/memcontrol.h |   12 ++++++++++++
 mm/filemap.c               |    8 +++++++-
 mm/memcontrol.c            |    5 +----
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 10e667f..aac9b21 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -152,6 +152,9 @@ struct vm_area_struct;
 /* 4GB DMA on some platforms */
 #define GFP_DMA32	__GFP_DMA32
 
+/* memcg oom killer is not allowed */
+#define GFP_MEMCG_NO_OOM	__GFP_NORETRY
+
 /* Convert GFP flags to their corresponding migrate type */
 static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..1ad4bc6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -65,6 +65,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
+}
+
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
@@ -215,6 +221,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
 	return 0;
 }
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef14351 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
+	/*
+	 * Cannot trigger OOM even if gfp_mask would allow that normally
+	 * because we might be called from a locked context and that
+	 * could lead to deadlocks if the killed process is waiting for
+	 * the same lock.
+	 */
+	error = mem_cgroup_cache_charge_no_oom(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..b4754ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2430,9 +2430,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
-	if (gfp_mask & __GFP_NORETRY)
-		return CHARGE_NOMEM;
-
 	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
@@ -3713,7 +3710,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 {
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
-	bool oom = true;
+	bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
 	int ret;
 
 	if (PageTransHuge(page)) {
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121126131837.GC17860-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-11-26 13:21 ` Michal Hocko 2012-11-26 21:28 ` azurIt ` (2 more replies) 2012-11-27 0:05 ` [PATCH -mm] " Kamezawa Hiroyuki 1 sibling, 3 replies; 168+ messages in thread From: Michal Hocko @ 2012-11-26 13:21 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner Here we go with the patch for 3.2.34. Could you test with this one, please? --- From 0d2d915c16f93918051b7ab8039d30b5a922049c Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> Date: Mon, 26 Nov 2012 11:47:57 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents other task to terminate because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it). 
Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0	# takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock, though, because the administrator can still intervene and increase the limit on the group, which helps the writer to finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask, which then tells mem_cgroup_charge_common that OOM is not allowed for the charge.

No OOM from this path, apart from fixing the bug, also makes some sense as we really do not want to cause an OOM because of page cache usage. As a possibly visible result, add_to_page_cache_lru might fail more often with ENOMEM, but this is to be expected if the limit is set and it is preferable to the OOM killer IMO.
__GFP_NORETRY is abused for this memcg specific flag because no user accounted allocation uses this flag except for THP, which has memcg oom disabled already.

Reported-by: azurIt <azurit-Rm0zKEqwvD4@public.gmane.org>
Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
---
 include/linux/gfp.h        |    3 +++
 include/linux/memcontrol.h |   13 +++++++++++++
 mm/filemap.c               |    8 +++++++-
 mm/memcontrol.c            |    2 +-
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..806fb54 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -146,6 +146,9 @@ struct vm_area_struct;
 /* 4GB DMA on some platforms */
 #define GFP_DMA32	__GFP_DMA32
 
+/* memcg oom killer is not allowed */
+#define GFP_MEMCG_NO_OOM	__GFP_NORETRY
+
 /* Convert GFP flags to their corresponding migrate type */
 static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 81572af..bf0e575 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
+}
+
 extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
@@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
 	return 0;
 }
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 556858c..ef182a9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
+	/*
+	 * Cannot trigger OOM even if gfp_mask would allow that normally
+	 * because we might be called from a locked context and that
+	 * could lead to deadlocks if the killed process is waiting for
+	 * the same lock.
+	 */
+	error = mem_cgroup_cache_charge_no_oom(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..1dbbe7f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2703,7 +2703,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	bool oom = true;
+	bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
 	int ret;
 
 	if (PageTransHuge(page)) {
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 13:21 ` [PATCH for 3.2.34] " Michal Hocko @ 2012-11-26 21:28 ` azurIt [not found] ` <20121126132149.GD17860-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2012-11-30 2:29 ` azurIt 2 siblings, 0 replies; 168+ messages in thread From: azurIt @ 2012-11-26 21:28 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Here we go with the patch for 3.2.34. Could you test with this one, >please? Michal, regarding to your conversation with Johannes Weiner, should i try this patch or not? azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121126132149.GD17860-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-11-30 1:45 ` azurIt 0 siblings, 0 replies; 168+ messages in thread From: azurIt @ 2012-11-30 1:45 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Here we go with the patch for 3.2.34. Could you test with this one, >please? I installed kernel with this patch, will report back if problem occurs again OR in few weeks if everything will be ok. Thank you! azurIt ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 13:21 ` [PATCH for 3.2.34] " Michal Hocko 2012-11-26 21:28 ` azurIt [not found] ` <20121126132149.GD17860-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-11-30 2:29 ` azurIt [not found] ` <20121130032918.59B3F780-Rm0zKEqwvD4@public.gmane.org> 2 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2012-11-30 2:29 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Here we go with the patch for 3.2.34. Could you test with this one, >please? Michal, unfortunately i had to boot to another kernel because the one with this patch keeps killing my MySQL server :( it was, probably, doing it on OOM in any cgroup - looks like OOM was not choosing processes only from cgroup which is out of memory. Here is the log from syslog: http://www.watchdog.sk/lkml/oom_mysqld Maybe i should mention that MySQL server has it's own cgroup (called 'mysql') but with no limits to any resources. azurIt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121130032918.59B3F780-Rm0zKEqwvD4@public.gmane.org> @ 2012-11-30 12:45 ` Michal Hocko [not found] ` <20121130124506.GH29317-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2012-11-30 13:44 ` azurIt 0 siblings, 2 replies; 168+ messages in thread From: Michal Hocko @ 2012-11-30 12:45 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 30-11-12 03:29:18, azurIt wrote:
> >Here we go with the patch for 3.2.34. Could you test with this one,
> >please?
>
> Michal, unfortunately i had to boot to another kernel because the one
> with this patch keeps killing my MySQL server :( it was, probably,
> doing it on OOM in any cgroup - looks like OOM was not choosing
> processes only from cgroup which is out of memory. Here is the log
> from syslog: http://www.watchdog.sk/lkml/oom_mysqld

You are also seeing a global OOM:
Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1
Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace:
Nov 30 02:53:56 server01 kernel: [ 818.233470] [<ffffffff810cc90e>] dump_header+0x7e/0x1e0
Nov 30 02:53:56 server01 kernel: [ 818.233600] [<ffffffff810cc80f>] ? find_lock_task_mm+0x2f/0x70
Nov 30 02:53:56 server01 kernel: [ 818.233721] [<ffffffff810ccdd5>] oom_kill_process+0x85/0x2a0
Nov 30 02:53:56 server01 kernel: [ 818.233842] [<ffffffff810cd485>] out_of_memory+0xe5/0x200
Nov 30 02:53:56 server01 kernel: [ 818.233963] [<ffffffff8102aa8f>] ? pte_alloc_one+0x3f/0x50
Nov 30 02:53:56 server01 kernel: [ 818.234082] [<ffffffff810cd65d>] pagefault_out_of_memory+0xbd/0x110
Nov 30 02:53:56 server01 kernel: [ 818.234204] [<ffffffff81026ec6>] mm_fault_error+0xb6/0x1a0
Nov 30 02:53:56 server01 kernel: [ 818.235886] [<ffffffff8102739e>] do_page_fault+0x3ee/0x460
Nov 30 02:53:56 server01 kernel: [ 818.236006] [<ffffffff810f3057>] ? vma_merge+0x1f7/0x2c0
Nov 30 02:53:56 server01 kernel: [ 818.236124] [<ffffffff810f35d7>] ? do_brk+0x267/0x400
Nov 30 02:53:56 server01 kernel: [ 818.236244] [<ffffffff812c9a92>] ? gr_learn_resource+0x42/0x1e0
Nov 30 02:53:56 server01 kernel: [ 818.236367] [<ffffffff815b547f>] page_fault+0x1f/0x30
[...]
Nov 30 02:53:56 server01 kernel: [ 818.356297] Out of memory: Kill process 2188 (mysqld) score 60 or sacrifice child
Nov 30 02:53:56 server01 kernel: [ 818.356493] Killed process 2188 (mysqld) total-vm:3330016kB, anon-rss:864176kB, file-rss:8072kB

Then you also have the memcg oom killer:
Nov 30 02:53:56 server01 kernel: [ 818.375717] Task in /1037/uid killed as a result of limit of /1037
Nov 30 02:53:56 server01 kernel: [ 818.375886] memory: usage 102400kB, limit 102400kB, failcnt 736
Nov 30 02:53:56 server01 kernel: [ 818.376008] memory+swap: usage 102400kB, limit 102400kB, failcnt 0

The messages are intermixed and I guess rate limiting jumped in as well, because I cannot associate all the oom messages to a specific OOM event. Anyway your system is under both global and local memory pressure. You didn't see apache going down previously because it was probably the one which was stuck and could be killed. Anyway you need to set up your system more carefully.

> Maybe i should mention that MySQL server has it's own cgroup (called
> 'mysql') but with no limits to any resources.

Where is that group in the hierarchy?
> azurIt

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121130124506.GH29317-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-11-30 12:53 ` azurIt 0 siblings, 0 replies; 168+ messages in thread From: azurIt @ 2012-11-30 12:53 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Anyway your system is under both global and local memory pressure. You >didn't see apache going down previously because it was probably the one >which was stuck and could be killed. >Anyway you need to setup your system more carefully. No, it wasn't, i'm 1000% sure (i was on SSH). Here is the memory usage graph from that system on that time: http://www.watchdog.sk/lkml/memory.png The blank part is rebooting into new kernel. MySQL server was killed several times, then i rebooted into previous kernel and problem was gone (not a single MySQL kill). You can see two MySQL kills there on 03:54 and 03:04:30. > >> Maybe i should mention that MySQL server has it's own cgroup (called >> 'mysql') but with no limits to any resources. > >Where is that group in the hierarchy? In root. ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 12:45 ` Michal Hocko [not found] ` <20121130124506.GH29317-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-11-30 13:44 ` azurIt [not found] ` <20121130144427.51A09169-Rm0zKEqwvD4@public.gmane.org> 1 sibling, 1 reply; 168+ messages in thread From: azurIt @ 2012-11-30 13:44 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Anyway your system is under both global and local memory pressure. You >didn't see apache going down previously because it was probably the one >which was stuck and could be killed. >Anyway you need to setup your system more carefully. There is, also, an evidence that system has enough of memory! :) Just take column 'rss' from process list in OOM message and sum it - you will get 2489911. It's probably in KB so it's about 2.4 GB. System has 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of 14. azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121130144427.51A09169-Rm0zKEqwvD4@public.gmane.org> @ 2012-11-30 14:44 ` Michal Hocko [not found] ` <20121130144431.GI29317-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2012-11-30 15:08 ` azurIt 0 siblings, 2 replies; 168+ messages in thread From: Michal Hocko @ 2012-11-30 14:44 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 30-11-12 14:44:27, azurIt wrote: > >Anyway your system is under both global and local memory pressure. You > >didn't see apache going down previously because it was probably the one > >which was stuck and could be killed. > >Anyway you need to setup your system more carefully. > > > There is, also, an evidence that system has enough of memory! :) Just > take column 'rss' from process list in OOM message and sum it - you > will get 2489911. It's probably in KB so it's about 2.4 GB. System has > 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of > 14. Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone is hardly touched: Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no The DMA32 zone usually fills up the first 4G unless your HW remaps the rest of the memory above 4G or you have a NUMA machine and the rest of the memory is at another node. Could you post your memory map printed during the boot?
(e820: BIOS-provided physical RAM map: and following lines) There is also ZONE_NORMAL which is also not used much Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no You have mentioned that you are comounting with cpuset. If this happens to be a NUMA machine have you made the access to all nodes available? Also what does /proc/sys/vm/zone_reclaim_mode says? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121130144431.GI29317-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-11-30 15:03 ` Michal Hocko 2012-11-30 15:37 ` Michal Hocko 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2012-11-30 15:03 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 30-11-12 15:44:31, Michal Hocko wrote: > On Fri 30-11-12 14:44:27, azurIt wrote: > > >Anyway your system is under both global and local memory pressure. You > > >didn't see apache going down previously because it was probably the one > > >which was stuck and could be killed. > > >Anyway you need to setup your system more carefully. > > > > > > There is, also, an evidence that system has enough of memory! :) Just > > take column 'rss' from process list in OOM message and sum it - you > > will get 2489911. It's probably in KB so it's about 2.4 GB. System has > > 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of > > 14. > > Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone > is hardly touched: > Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no > > DMA32 zone is usually fills up first 4G unless your HW remaps the rest > of the memory above 4G or you have a numa machine and the rest of the > memory is at other node. Could you post your memory map printed during > the boot? 
(e820: BIOS-provided physical RAM map: and following lines) > > There is also ZONE_NORMAL which is also not used much > Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no > > You have mentioned that you are comounting with cpuset. If this happens > to be a NUMA machine have you made the access to all nodes available? And now that I am looking at the oom message more closely I can see Nov 30 02:53:56 server01 kernel: [ 818.232812] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 Nov 30 02:53:56 server01 kernel: [ 818.233029] apache2 cpuset=uid mems_allowed=0 Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1 Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace: Nov 30 02:53:56 server01 kernel: [ 818.233470] [<ffffffff810cc90e>] dump_header+0x7e/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.233600] [<ffffffff810cc80f>] ? find_lock_task_mm+0x2f/0x70 Nov 30 02:53:56 server01 kernel: [ 818.233721] [<ffffffff810ccdd5>] oom_kill_process+0x85/0x2a0 Nov 30 02:53:56 server01 kernel: [ 818.233842] [<ffffffff810cd485>] out_of_memory+0xe5/0x200 Nov 30 02:53:56 server01 kernel: [ 818.233963] [<ffffffff8102aa8f>] ? 
pte_alloc_one+0x3f/0x50 Nov 30 02:53:56 server01 kernel: [ 818.234082] [<ffffffff810cd65d>] pagefault_out_of_memory+0xbd/0x110 Nov 30 02:53:56 server01 kernel: [ 818.234204] [<ffffffff81026ec6>] mm_fault_error+0xb6/0x1a0 Nov 30 02:53:56 server01 kernel: [ 818.235886] [<ffffffff8102739e>] do_page_fault+0x3ee/0x460 Nov 30 02:53:56 server01 kernel: [ 818.236006] [<ffffffff810f3057>] ? vma_merge+0x1f7/0x2c0 Nov 30 02:53:56 server01 kernel: [ 818.236124] [<ffffffff810f35d7>] ? do_brk+0x267/0x400 Nov 30 02:53:56 server01 kernel: [ 818.236244] [<ffffffff812c9a92>] ? gr_learn_resource+0x42/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.236367] [<ffffffff815b547f>] page_fault+0x1f/0x30 Which is interesting from 2 perspectives. Only the first node (Node-0) is allowed which would suggest that the cpuset controller is not configured to all nodes. It is still surprising Node 0 wouldn't have any memory (I would expect ZONE_DMA32 would be sitting there). Anyway, the more interesting thing is gfp_mask is GFP_NOWAIT allocation from the page fault? Huh this shouldn't happen - ever. > Also what does /proc/sys/vm/zone_reclaim_mode says? > -- > Michal Hocko > SUSE Labs > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 15:03 ` Michal Hocko @ 2012-11-30 15:37 ` Michal Hocko 0 siblings, 0 replies; 168+ messages in thread From: Michal Hocko @ 2012-11-30 15:37 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 30-11-12 16:03:47, Michal Hocko wrote: [...] > Anyway, the more interesting thing is gfp_mask is GFP_NOWAIT allocation > from the page fault? Huh this shouldn't happen - ever. OK, it starts making sense now. The message came from pagefault_out_of_memory which doesn't have gfp nor the required node information any longer. This suggests that VM_FAULT_OOM has been returned by the fault handler. So this hasn't been triggered by the page fault allocator. I am wondering whether this could be caused by the patch but the effect of that one should be limited to the write path (unlike the later version for the -mm tree which hooks into shmem as well). Will have to think about it some more. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 14:44 ` Michal Hocko [not found] ` <20121130144431.GI29317-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-11-30 15:08 ` azurIt 2012-11-30 15:39 ` Michal Hocko 1 sibling, 1 reply; 168+ messages in thread From: azurIt @ 2012-11-30 15:08 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >DMA32 zone is usually fills up first 4G unless your HW remaps the rest >of the memory above 4G or you have a numa machine and the rest of the >memory is at other node. Could you post your memory map printed during >the boot? (e820: BIOS-provided physical RAM map: and following lines) Here is the full boot log: www.watchdog.sk/lkml/kern.log >You have mentioned that you are comounting with cpuset. If this happens >to be a NUMA machine have you made the access to all nodes available? >Also what does /proc/sys/vm/zone_reclaim_mode says? Don't really know what NUMA means and which nodes are you talking about, sorry :( # cat /proc/sys/vm/zone_reclaim_mode cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 15:08 ` azurIt @ 2012-11-30 15:39 ` Michal Hocko [not found] ` <20121130153942.GL29317-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2012-11-30 15:39 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 30-11-12 16:08:11, azurIt wrote: > >The DMA32 zone usually fills up the first 4G unless your HW remaps the rest > >of the memory above 4G or you have a NUMA machine and the rest of the > >memory is at another node. Could you post your memory map printed during > >the boot? (e820: BIOS-provided physical RAM map: and following lines) > > > Here is the full boot log: > www.watchdog.sk/lkml/kern.log The log is not complete. Could you paste the complete dmesg output? Or even better, do you have logs from the previous run? > >You have mentioned that you are comounting with cpuset. If this happens > >to be a NUMA machine have you made the access to all nodes available? > >Also what does /proc/sys/vm/zone_reclaim_mode says? > > > Don't really know what NUMA means and which nodes are you talking > about, sorry :( http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access > # cat /proc/sys/vm/zone_reclaim_mode > cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory OK, so the NUMA is not enabled. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121130153942.GL29317-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-11-30 15:59 ` azurIt 2012-11-30 16:19 ` Michal Hocko 0 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2012-11-30 15:59 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >> Here is the full boot log: >> www.watchdog.sk/lkml/kern.log > >The log is not complete. Could you paste the comple dmesg output? Or >even better, do you have logs from the previous run? What is missing there? All kernel messages are logging into /var/log/kern.log (it's the same as dmesg), dmesg itself was already rewrited by other messages. I think it's all what that kernel printed. ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 15:59 ` azurIt
@ 2012-11-30 16:19 ` Michal Hocko
  2012-11-30 16:26 ` azurIt
  2012-12-03 15:16 ` Michal Hocko
  0 siblings, 2 replies; 168+ messages in thread
From: Michal Hocko @ 2012-11-30 16:19 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 30-11-12 16:59:37, azurIt wrote:
> >> Here is the full boot log:
> >> www.watchdog.sk/lkml/kern.log
> >
> >The log is not complete. Could you paste the comple dmesg output? Or
> >even better, do you have logs from the previous run?
>
>
> What is missing there? All kernel messages are logging into
> /var/log/kern.log (it's the same as dmesg), dmesg itself was already
> rewrited by other messages. I think it's all what that kernel printed.

Early boot messages are missing - so exactly the BIOS memory map I was
asking for. As the NUMA has been excluded it is probably not that
relevant anymore.

The important question is why you see VM_FAULT_OOM and whether memcg
charging failure can trigger that. I do not see how this could happen
right now because __GFP_NORETRY is not used for user pages (except for
THP which disables memcg OOM already), and file-backed page faults (aka
__do_fault) use mem_cgroup_newpage_charge which doesn't disable OOM.
This is a real head scratcher.

Could you also post your complete containers configuration, maybe there
is something strange in there (basically grep . -r YOUR_CGROUP_MNT
except for tasks files which are of no use right now).
--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 168+ messages in thread
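[For readers following the thread: the `grep . -r YOUR_CGROUP_MNT` dump Michal asks for can be collected with a short script along these lines. This is only a sketch; the `dump_cgroups` name is for illustration, and the mount point must be adjusted to wherever the cgroup hierarchy actually lives on your system.]

```shell
# Sketch: dump every cgroup control file under a given mount point as
# "path:value" lines (like "grep . -r" would), skipping the per-group
# "tasks" and "cgroup.procs" lists, which are large and change constantly.
dump_cgroups() {
    mnt=$1
    find "$mnt" -type f ! -name tasks ! -name cgroup.procs |
    while read -r f; do
        # Prefix each value with the file it came from.
        sed "s|^|$f:|" "$f" 2>/dev/null
    done
}

# Example usage (mount point is an assumption -- adjust as needed):
# dump_cgroups /sys/fs/cgroup/memory | gzip > cgroups.gz
```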
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 16:19 ` Michal Hocko
@ 2012-11-30 16:26 ` azurIt
  [not found] ` <20121130172651.B6917602-Rm0zKEqwvD4@public.gmane.org>
  2012-12-03 15:16 ` Michal Hocko
  1 sibling, 1 reply; 168+ messages in thread
From: azurIt @ 2012-11-30 16:26 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>Could you also post your complete containers configuration, maybe there
>is something strange in there (basically grep . -r YOUR_CGROUP_MNT
>except for tasks files which are of no use right now).

Here it is:
http://www.watchdog.sk/lkml/cgroups.gz

^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20121130172651.B6917602-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  [not found] ` <20121130172651.B6917602-Rm0zKEqwvD4@public.gmane.org>
@ 2012-11-30 16:53 ` Michal Hocko
  2012-11-30 20:43 ` azurIt
  0 siblings, 1 reply; 168+ messages in thread
From: Michal Hocko @ 2012-11-30 16:53 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 30-11-12 17:26:51, azurIt wrote:
> >Could you also post your complete containers configuration, maybe there
> >is something strange in there (basically grep . -r YOUR_CGROUP_MNT
> >except for tasks files which are of no use right now).
>
>
> Here it is:
> http://www.watchdog.sk/lkml/cgroups.gz

The only strange thing I noticed is that some groups have 0 limit. Is
this intentional?

grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c
      3 memory.limit_in_bytes:0
    254 memory.limit_in_bytes:104857600
    107 memory.limit_in_bytes:157286400
     68 memory.limit_in_bytes:209715200
     10 memory.limit_in_bytes:262144000
     28 memory.limit_in_bytes:314572800
      1 memory.limit_in_bytes:346030080
      1 memory.limit_in_bytes:524288000
      2 memory.limit_in_bytes:9223372036854775807

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 168+ messages in thread
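[The `uniq -c` histogram above is easier to eyeball with the raw byte limits converted to megabytes. A small helper along these lines could do it; this is a sketch, `summarize_limits` is a hypothetical name, and the input is assumed to be an unpacked "path:value" dump of the kind discussed here.]

```shell
# Sketch: count cgroups per memory.limit_in_bytes value, printing each
# limit in MB.  Input file: one "path:value" line per control file.
summarize_limits() {
    grep 'memory\.limit_in_bytes:' "$1" |
    sed 's@.*:@@' |          # keep only the byte value
    sort -n | uniq -c |
    while read -r count bytes; do
        printf '%s groups at %s MB\n' "$count" $((bytes / 1048576))
    done
}
```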
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 16:53 ` Michal Hocko
@ 2012-11-30 20:43 ` azurIt
  0 siblings, 0 replies; 168+ messages in thread
From: azurIt @ 2012-11-30 20:43 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>The only strange thing I noticed is that some groups have 0 limit. Is
>this intentional?
>grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c
>      3 memory.limit_in_bytes:0

These are users who are not allowed to run anything.

azur

^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 16:19 ` Michal Hocko
  2012-11-30 16:26 ` azurIt
@ 2012-12-03 15:16 ` Michal Hocko
  [not found] ` <20121203151601.GA17093-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  1 sibling, 1 reply; 168+ messages in thread
From: Michal Hocko @ 2012-12-03 15:16 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 30-11-12 17:19:23, Michal Hocko wrote:
[...]
> The important question is why you see VM_FAULT_OOM and whether memcg
> charging failure can trigger that. I don not see how this could happen
> right now because __GFP_NORETRY is not used for user pages (except for
> THP which disable memcg OOM already), file backed page faults (aka
> __do_fault) use mem_cgroup_newpage_charge which doesn't disable OOM.
> This is a real head scratcher.

The following should print the traces when we hand over ENOMEM to the
caller. It should catch all charge paths (migration is not covered but
that one is not important here). If we don't see any traces from here
and there is still global OOM striking then there must be something else
to trigger this.

Could you test this with the patch which aims at fixing your deadlock,
please? I realise that this is a production environment but I do not see
anything relevant in the code.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..9e5b56b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,6 +2397,7 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
+	__WARN();
 	return -ENOMEM;
 bypass:
 	*ptr = NULL;
--
Michal Hocko
SUSE Labs

^ permalink raw reply related [flat|nested] 168+ messages in thread
[parent not found: <20121203151601.GA17093-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  [not found] ` <20121203151601.GA17093-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2012-12-05 1:36 ` azurIt
  [not found] ` <20121205023644.18C3006B-Rm0zKEqwvD4@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2012-12-05 1:36 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>The following should print the traces when we hand over ENOMEM to the
>caller. It should catch all charge paths (migration is not covered but
>that one is not important here). If we don't see any traces from here
>and there is still global OOM striking then there must be something else
>to trigger this.
>Could you test this with the patch which aims at fixing your deadlock,
>please? I realise that this is a production environment but I do not see
>anything relevant in the code.

Michal,

i think/hope this is what you wanted:
http://www.watchdog.sk/lkml/oom_mysqld2

^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20121205023644.18C3006B-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  [not found] ` <20121205023644.18C3006B-Rm0zKEqwvD4@public.gmane.org>
@ 2012-12-05 14:17 ` Michal Hocko
  [not found] ` <20121205141722.GA9714-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: Michal Hocko @ 2012-12-05 14:17 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Wed 05-12-12 02:36:44, azurIt wrote:
> >The following should print the traces when we hand over ENOMEM to the
> >caller. It should catch all charge paths (migration is not covered but
> >that one is not important here). If we don't see any traces from here
> >and there is still global OOM striking then there must be something else
> >to trigger this.
> >Could you test this with the patch which aims at fixing your deadlock,
> >please? I realise that this is a production environment but I do not see
> >anything relevant in the code.
>
>
> Michal,
>
> i think/hope this is what you wanted:
> http://www.watchdog.sk/lkml/oom_mysqld2

Dec 5 02:20:48 server01 kernel: [ 380.995947] WARNING: at mm/memcontrol.c:2400 T.1146+0x2c1/0x5d0()
Dec 5 02:20:48 server01 kernel: [ 380.995950] Hardware name: S5000VSA
Dec 5 02:20:48 server01 kernel: [ 380.995952] Pid: 5351, comm: apache2 Not tainted 3.2.34-grsec #1
Dec 5 02:20:48 server01 kernel: [ 380.995954] Call Trace:
Dec 5 02:20:48 server01 kernel: [ 380.995960] [<ffffffff81054eaa>] warn_slowpath_common+0x7a/0xb0
Dec 5 02:20:48 server01 kernel: [ 380.995963] [<ffffffff81054efa>] warn_slowpath_null+0x1a/0x20
Dec 5 02:20:48 server01 kernel: [ 380.995965] [<ffffffff8110b2e1>] T.1146+0x2c1/0x5d0
Dec 5 02:20:48 server01 kernel: [ 380.995967] [<ffffffff8110ba83>] mem_cgroup_charge_common+0x53/0x90
Dec 5 02:20:48 server01 kernel: [ 380.995970] [<ffffffff8110bb05>] mem_cgroup_newpage_charge+0x45/0x50
Dec 5 02:20:48 server01 kernel: [ 380.995974] [<ffffffff810eddf9>] handle_pte_fault+0x609/0x940
Dec 5 02:20:48 server01 kernel: [ 380.995978] [<ffffffff8102aa8f>] ? pte_alloc_one+0x3f/0x50
Dec 5 02:20:48 server01 kernel: [ 380.995981] [<ffffffff810ee268>] handle_mm_fault+0x138/0x260
Dec 5 02:20:48 server01 kernel: [ 380.995983] [<ffffffff810270ed>] do_page_fault+0x13d/0x460
Dec 5 02:20:48 server01 kernel: [ 380.995986] [<ffffffff810f429c>] ? do_mmap_pgoff+0x3dc/0x430
Dec 5 02:20:48 server01 kernel: [ 380.995988] [<ffffffff810f197d>] ? remove_vma+0x5d/0x80
Dec 5 02:20:48 server01 kernel: [ 380.995992] [<ffffffff815b54ff>] page_fault+0x1f/0x30
Dec 5 02:20:48 server01 kernel: [ 380.995994] ---[ end trace 25bbb3e634c25b7f ]---
Dec 5 02:20:48 server01 kernel: [ 380.996373] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
Dec 5 02:20:48 server01 kernel: [ 380.996377] apache2 cpuset=uid mems_allowed=0
Dec 5 02:20:48 server01 kernel: [ 380.996379] Pid: 5351, comm: apache2 Tainted: G W 3.2.34-grsec #1
Dec 5 02:20:48 server01 kernel: [ 380.996380] Call Trace:
Dec 5 02:20:48 server01 kernel: [ 380.996384] [<ffffffff810cc91e>] dump_header+0x7e/0x1e0
Dec 5 02:20:48 server01 kernel: [ 380.996387] [<ffffffff810cc81f>] ? find_lock_task_mm+0x2f/0x70
Dec 5 02:20:48 server01 kernel: [ 380.996389] [<ffffffff810ccde5>] oom_kill_process+0x85/0x2a0
Dec 5 02:20:48 server01 kernel: [ 380.996392] [<ffffffff810cd495>] out_of_memory+0xe5/0x200
Dec 5 02:20:48 server01 kernel: [ 380.996394] [<ffffffff8102aa8f>] ? pte_alloc_one+0x3f/0x50
Dec 5 02:20:48 server01 kernel: [ 380.996397] [<ffffffff810cd66d>] pagefault_out_of_memory+0xbd/0x110
Dec 5 02:20:48 server01 kernel: [ 380.996399] [<ffffffff81026ec6>] mm_fault_error+0xb6/0x1a0
Dec 5 02:20:48 server01 kernel: [ 380.996401] [<ffffffff8102739e>] do_page_fault+0x3ee/0x460
Dec 5 02:20:48 server01 kernel: [ 380.996403] [<ffffffff810f429c>] ? do_mmap_pgoff+0x3dc/0x430
Dec 5 02:20:48 server01 kernel: [ 380.996405] [<ffffffff810f197d>] ? remove_vma+0x5d/0x80
Dec 5 02:20:48 server01 kernel: [ 380.996408] [<ffffffff815b54ff>] page_fault+0x1f/0x30

OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
This can only happen if this was an atomic allocation request
(!__GFP_WAIT) or if oom is not allowed which is the case only for
transparent huge page allocation.

The first case can be excluded (in the clean 3.2 stable kernel) because
all callers of mem_cgroup_newpage_charge use GFP_KERNEL.
The latter one should be OK because the page fault should fall back to a
regular page if the THP allocation/charge fails.

[/me goes to double check]

Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with
VM_FAULT_OOM without any fallback. We should use
do_huge_pmd_wp_page_fallback instead. This has been fixed in 3.5-rc1 by
1f1d06c3 (thp, memcg: split hugepage for memcg oom on cow) but it hasn't
been backported to 3.2. The patch applies to 3.2 without any further
modifications. I didn't have time to test it but if it helps you we
should push this to the stable tree.
---
From 765f5e0121c4410faa19c088e9ada75976bde178 Mon Sep 17 00:00:00 2001
From: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Date: Tue, 29 May 2012 15:06:23 -0700
Subject: [PATCH] thp, memcg: split hugepage for memcg oom on cow

On COW, a new hugepage is allocated and charged to the memcg. If the
system is oom or the charge to the memcg fails, however, the fault
handler will return VM_FAULT_OOM which results in an oom kill.

Instead, it's possible to fall back to splitting the hugepage so that
the COW results only in an order-0 page being allocated and charged to
the memcg which has a higher likelihood to succeed. This is expensive
because the hugepage must be split in the page fault handler, but it is
much better than unnecessarily oom killing a process.
Signed-off-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Andrea Arcangeli <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Johannes Weiner <jweiner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Signed-off-by: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Signed-off-by: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
(cherry picked from commit 1f1d06c34f7675026326cd9f39ff91e4555cf355)
---
 mm/huge_memory.c |  3 +++
 mm/memory.c      | 18 +++++++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f005e9..470cbb4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -921,6 +921,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 						   pmd, orig_pmd, page, haddr);
+		if (ret & VM_FAULT_OOM)
+			split_huge_page(page);
 		put_page(page);
 		goto out;
 	}
@@ -928,6 +930,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
 		put_page(new_page);
+		split_huge_page(page);
 		put_page(page);
 		ret |= VM_FAULT_OOM;
 		goto out;
diff --git a/mm/memory.c b/mm/memory.c
index 70f5daf..15e686a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3469,6 +3469,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
+retry:
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
 	if (!pud)
@@ -3482,13 +3483,24 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 					pmd, flags);
 	} else {
 		pmd_t orig_pmd = *pmd;
+		int ret;
+
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
 			if (flags & FAULT_FLAG_WRITE &&
 			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd))
-				return do_huge_pmd_wp_page(mm, vma, address,
-							   pmd, orig_pmd);
+			    !pmd_trans_splitting(orig_pmd)) {
+				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
+							  orig_pmd);
+				/*
+				 * If COW results in an oom, the huge pmd will
+				 * have been split, so retry the fault on the
+				 * pte for a smaller charge.
+				 */
+				if (unlikely(ret & VM_FAULT_OOM))
+					goto retry;
+				return ret;
+			}
 			return 0;
 		}
 	}
--
1.7.10.4

--
Michal Hocko
SUSE Labs

^ permalink raw reply related [flat|nested] 168+ messages in thread
[parent not found: <20121205141722.GA9714-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  [not found] ` <20121205141722.GA9714-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2012-12-06 0:29 ` azurIt
  [not found] ` <20121206012924.FE077FD7-Rm0zKEqwvD4@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2012-12-06 0:29 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
>This can only happen if this was an atomic allocation request
>(!__GFP_WAIT) or if oom is not allowed which is the case only for
>transparent huge page allocation.
>The first case can be excluded (in the clean 3.2 stable kernel) because
>all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The later one
>should be OK because the page fault should fallback to a regular page if
>THP allocation/charge fails.
>[/me goes to double check]
>Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with
>VM_FAULT_OOM without any fallback. We should do_huge_pmd_wp_page_fallback
>instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split
>hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The
>patch applies to 3.2 without any further modifications. I didn't have
>time to test it but if it helps you we should push this to the stable
>tree.

This, unfortunately, didn't fix the problem :(
http://www.watchdog.sk/lkml/oom_mysqld3

^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20121206012924.FE077FD7-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  [not found] ` <20121206012924.FE077FD7-Rm0zKEqwvD4@public.gmane.org>
@ 2012-12-06 9:54 ` Michal Hocko
  2012-12-06 10:12 ` azurIt
  [not found] ` <20121206095423.GB10931-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 2 replies; 168+ messages in thread
From: Michal Hocko @ 2012-12-06 9:54 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Thu 06-12-12 01:29:24, azurIt wrote:
> >OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
> >This can only happen if this was an atomic allocation request
> >(!__GFP_WAIT) or if oom is not allowed which is the case only for
> >transparent huge page allocation.
> >The first case can be excluded (in the clean 3.2 stable kernel) because
> >all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The later one
> >should be OK because the page fault should fallback to a regular page if
> >THP allocation/charge fails.
> >[/me goes to double check]
> >Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with
> >VM_FAULT_OOM without any fallback. We should do_huge_pmd_wp_page_fallback
> >instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split
> >hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The
> >patch applies to 3.2 without any further modifications. I didn't have
> >time to test it but if it helps you we should push this to the stable
> >tree.
>
>
> This, unfortunately, didn't fix the problem :(
> http://www.watchdog.sk/lkml/oom_mysqld3

Dohh. The very same stack mem_cgroup_newpage_charge called from the page
fault. The heavy inlining is not particularly helping here... So there
must be some other THP charge leaking out.
[/me is diving into the code again]

* do_huge_pmd_anonymous_page falls back to handle_pte_fault
* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
  charge the huge page
* do_huge_pmd_wp_page splits the huge page and retries with fallback to
  handle_pte_fault
* collapse_huge_page is not called in the page fault path
* do_wp_page, do_anonymous_page and __do_fault operate on a single page
  so the memcg charging cannot return ENOMEM

There are no other callers AFAICS so I am getting clueless. Maybe more
debugging will tell us something (the inlining has been reduced for thp
paths which can reduce performance in thp page fault heavy workloads but
this will give us better traces - I hope).

Anyway do you see the same problem if transparent huge pages are
disabled? (echo never > /sys/kernel/mm/transparent_hugepage/enabled)
---
From 93a30140b50d8474a047b91c698f4880149635db Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Date: Thu, 6 Dec 2012 10:40:17 +0100
Subject: [PATCH] more debugging

---
 mm/huge_memory.c | 6 +++---
 mm/memcontrol.c  | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 470cbb4..01a11f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       unsigned int flags)
 {
@@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
 	return pgtable;
 }
 
-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long address,
 					pmd_t *pmd, pmd_t orig_pmd,
@@ -883,7 +883,7 @@ out_free_pages:
 	goto out;
 }
 
-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
 {
 	int ret = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9e5b56b..1986c65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,7 +2397,7 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
-	__WARN();
+	__WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, nr_pages, oom, ret);
 	return -ENOMEM;
 bypass:
 	*ptr = NULL;
--
1.7.10.4

--
Michal Hocko
SUSE Labs

^ permalink raw reply related [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-06 9:54 ` Michal Hocko
@ 2012-12-06 10:12 ` azurIt
  2012-12-06 17:06 ` Michal Hocko
  0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2012-12-06 10:12 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>Dohh. The very same stack mem_cgroup_newpage_charge called from the page
>fault. The heavy inlining is not particularly helping here... So there
>must be some other THP charge leaking out.
>[/me is diving into the code again]
>
>* do_huge_pmd_anonymous_page falls back to handle_pte_fault
>* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
>  charge the huge page
>* do_huge_pmd_wp_page splits the huge page and retries with fallback to
>  handle_pte_fault
>* collapse_huge_page is not called in the page fault path
>* do_wp_page, do_anonymous_page and __do_fault operate on a single page
>  so the memcg charging cannot return ENOMEM
>
>There are no other callers AFAICS so I am getting clueless. Maybe more
>debugging will tell us something (the inlining has been reduced for thp
>paths which can reduce performance in thp page fault heavy workloads but
>this will give us better traces - I hope).

Should i apply all patches together? (fix for this bug, more log
messages, backported fix from 3.5 and this new one)

>Anyway do you see the same problem if transparent huge pages are
>disabled?
>echo never > /sys/kernel/mm/transparent_hugepage/enabled)

# cat /sys/kernel/mm/transparent_hugepage/enabled
cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory

^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-06 10:12 ` azurIt
@ 2012-12-06 17:06 ` Michal Hocko
  0 siblings, 0 replies; 168+ messages in thread
From: Michal Hocko @ 2012-12-06 17:06 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Thu 06-12-12 11:12:49, azurIt wrote:
> >Dohh. The very same stack mem_cgroup_newpage_charge called from the page
> >fault. The heavy inlining is not particularly helping here... So there
> >must be some other THP charge leaking out.
> >[/me is diving into the code again]
> >
> >* do_huge_pmd_anonymous_page falls back to handle_pte_fault
> >* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
> >  charge the huge page
> >* do_huge_pmd_wp_page splits the huge page and retries with fallback to
> >  handle_pte_fault
> >* collapse_huge_page is not called in the page fault path
> >* do_wp_page, do_anonymous_page and __do_fault operate on a single page
> >  so the memcg charging cannot return ENOMEM
> >
> >There are no other callers AFAICS so I am getting clueless. Maybe more
> >debugging will tell us something (the inlining has been reduced for thp
> >paths which can reduce performance in thp page fault heavy workloads but
> >this will give us better traces - I hope).
>
>
> Should i apply all patches togather? (fix for this bug, more log
> messages, backported fix from 3.5 and this new one)

Yes please
--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20121206095423.GB10931-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  [not found] ` <20121206095423.GB10931-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2012-12-10 1:20 ` azurIt
  [not found] ` <20121210022038.E6570D37-Rm0zKEqwvD4@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2012-12-10 1:20 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>There are no other callers AFAICS so I am getting clueless. Maybe more
>debugging will tell us something (the inlining has been reduced for thp
>paths which can reduce performance in thp page fault heavy workloads but
>this will give us better traces - I hope).

Michal,

this was printing so many debug messages to console that the whole
server hung and i had to hard reset it after several minutes :( Sorry
but i cannot test such things in production. There's no problem with
one soft reset which takes 4 minutes but this hard reset creates about
20 minutes of outage (mainly cos of disk quotas checking).

Last logged message:

Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0

^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20121210022038.E6570D37-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  [not found] ` <20121210022038.E6570D37-Rm0zKEqwvD4@public.gmane.org>
@ 2012-12-10 9:43 ` Michal Hocko
  [not found] ` <20121210094318.GA6777-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: Michal Hocko @ 2012-12-10 9:43 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Mon 10-12-12 02:20:38, azurIt wrote:
[...]
> Michal,

Hi,

> this was printing so many debug messages to console that the whole
> server hangs

Hmm, this is _really_ surprising. The latest patch didn't add any new
logging actually. It just enhanced messages which were already printed
out previously + changed few functions to be not inlined so they show up
in the traces. So the only explanation is that the workload has changed
or the patches got misapplied.

> and i had to hard reset it after several minutes :( Sorry
> but i cannot test such a things in production. There's no problem with
> one soft reset which takes 4 minutes but this hard reset creates about
> 20 minutes outage (mainly cos of disk quotas checking).

Understood.

> Last logged message:
>
> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0

This explains why you have seen your machine hung. I am not familiar
with grsec but stalling each fork 30s sounds really bad.

Anyway this will not help me much. Do you happen to still have any of
those logged traces from the last run?

Apart from that. If my current understanding is correct then this is
related to transparent huge pages (and leaking charge to the page fault
handler). Do you see the same problem if you disable THP before you
start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled)
--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 168+ messages in thread
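[Checking whether THP is even available before trying that echo can be wrapped up as follows. This is a sketch; `thp_mode` is a hypothetical helper and the path is parameterized so it can be exercised against an ordinary file. A missing sysfs file means the kernel was built without CONFIG_TRANSPARENT_HUGEPAGE, so there is nothing to disable.]

```shell
# Sketch: report the active THP mode, or note that THP is compiled out.
thp_mode() {
    f=${1:-/sys/kernel/mm/transparent_hugepage/enabled}
    if [ ! -e "$f" ]; then
        echo "not built in (CONFIG_TRANSPARENT_HUGEPAGE is off)"
        return
    fi
    # The active mode is the bracketed word, e.g. "always [madvise] never".
    sed 's/.*\[\(.*\)\].*/\1/' "$f"
}

# To disable THP where it is available:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
```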
[parent not found: <20121210094318.GA6777-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  [not found] ` <20121210094318.GA6777-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2012-12-10 10:18 ` azurIt
  [not found] ` <20121210111817.F697F53E-Rm0zKEqwvD4@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2012-12-10 10:18 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>Hmm, this is _really_ surprising. The latest patch didn't add any new
>logging actually. It just enahanced messages which were already printed
>out previously + changed few functions to be not inlined so they show up
>in the traces. So the only explanation is that the workload has changed
>or the patches got misapplied.

This time i installed 3.2.35, maybe some changes between .34 and .35
did this? Should i try .34?

>> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0
>
>This explains why you have seen your machine hung. I am not familiar
>with grsec but stalling each fork 30s sounds really bad.

Btw, i never ever saw such a message from grsecurity yet. Will write to
the grsec mailing list for an explanation.

>Anyway this will not help me much. Do you happen to still have any of
>those logged traces from the last run?

Unfortunately not, it didn't log anything and tons of messages were
printed only to console (i was logged in via IP-KVM). It looked like the
printing was infinite, i rebooted it after a few minutes.

>Apart from that. If my current understanding is correct then this is
>related to transparent huge pages (and leaking charge to the page fault
>handler). Do you see the same problem if you disable THP before you
>start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled)

# cat /sys/kernel/mm/transparent_hugepage/enabled
cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory

# ls -la /sys/kernel/mm
total 0
drwx------ 3 root root 0 Dec 10 11:11 .
drwx------ 5 root root 0 Dec 10 02:06 ..
drwx------ 2 root root 0 Dec 10 11:11 cleancache

^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20121210111817.F697F53E-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121210111817.F697F53E-Rm0zKEqwvD4@public.gmane.org> @ 2012-12-10 15:52 ` Michal Hocko 2012-12-10 17:18 ` azurIt 2012-12-17 1:34 ` azurIt 0 siblings, 2 replies; 168+ messages in thread From: Michal Hocko @ 2012-12-10 15:52 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Mon 10-12-12 11:18:17, azurIt wrote: > >Hmm, this is _really_ surprising. The latest patch didn't add any new > >logging actually. It just enahanced messages which were already printed > >out previously + changed few functions to be not inlined so they show up > >in the traces. So the only explanation is that the workload has changed > >or the patches got misapplied. > > > This time i installed 3.2.35, maybe some changes between .34 and .35 > did this? Should i try .34? I would try to limit changes to minimum. So the original kernel you were using + the first patch to prevent OOM from the write path + 2 debugging patches. > >> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 > > > >This explains why you have seen your machine hung. I am not familiar > >with grsec but stalling each fork 30s sounds really bad. > > > Btw, i never ever saw such a message from grsecurity yet. Will write to grsec mailing list about explanation. > > > >Anyway this will not help me much. Do you happen to still have any of > >those logged traces from the last run? 
> > > Unfortunately not, it didn't log anything and tons of messages were > printed only to console (i was logged via IP-KVM). It looked that > printing is infinite, i rebooted it after few minutes. But was it at least related to the debugging from the patch or it was rather a totally unrelated thing? > >Apart from that. If my current understanding is correct then this is > >related to transparent huge pages (and leaking charge to the page fault > >handler). Do you see the same problem if you disable THP before you > >start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled) > > # cat /sys/kernel/mm/transparent_hugepage/enabled > cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory Weee. Then it cannot be related to THP at all. Which makes this even bigger mystery. We really need to find out who is leaking that charge. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-10 15:52 ` Michal Hocko @ 2012-12-10 17:18 ` azurIt 2012-12-17 1:34 ` azurIt 1 sibling, 0 replies; 168+ messages in thread From: azurIt @ 2012-12-10 17:18 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >I would try to limit changes to minimum. So the original kernel you were >using + the first patch to prevent OOM from the write path + 2 debugging >patches. ok. >But was it at least related to the debugging from the patch or it was >rather a totally unrelated thing? I wasn't reading it much but i think it looks like a traces i was sending you before. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-10 15:52 ` Michal Hocko 2012-12-10 17:18 ` azurIt @ 2012-12-17 1:34 ` azurIt 2012-12-17 16:32 ` Michal Hocko 1 sibling, 1 reply; 168+ messages in thread From: azurIt @ 2012-12-17 1:34 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >I would try to limit changes to minimum. So the original kernel you were >using + the first patch to prevent OOM from the write path + 2 debugging >patches. It didn't take off the whole system this time (but i was prepared to record a video of console ;) ), here it is: http://www.watchdog.sk/lkml/oom_mysqld4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-17 1:34 ` azurIt @ 2012-12-17 16:32 ` Michal Hocko 2012-12-17 18:23 ` azurIt 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2012-12-17 16:32 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Mon 17-12-12 02:34:30, azurIt wrote: > >I would try to limit changes to minimum. So the original kernel you were > >using + the first patch to prevent OOM from the write path + 2 debugging > >patches. > > > It didn't take off the whole system this time (but i was > prepared to record a video of console ;) ), here it is: > http://www.watchdog.sk/lkml/oom_mysqld4 [...] [ 1248.059429] ------------[ cut here ]------------ [ 1248.059586] WARNING: at mm/memcontrol.c:2400 T.1146+0x2d9/0x610() [ 1248.059723] Hardware name: S5000VSA [ 1248.059855] gfp_mask:208 nr_pages:1 oom:0 ret:2 This is GFP_KERNEL allocation which is expected. It is also a simple page which is not that expected because we shouldn't return ENOMEM on those unless this was GFP_ATOMIC allocation (which it wasn't) or the caller told us to not trigger OOM which is the case only for THP pages (see mem_cgroup_charge_common). So the big question is how have we ended up with oom=false here... [Ohh, I am really an idiot. I screwed the first patch] - bool oom = true; + bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); Which obviously doesn't work. It should read !(gfp_mask &GFP_MEMCG_NO_OOM). No idea how I could have missed that. I am really sorry about that. 
--- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c04676d..1f35a74 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2704,7 +2704,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; - bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); int ret; if (PageTransHuge(page)) { -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-17 16:32 ` Michal Hocko @ 2012-12-17 18:23 ` azurIt [not found] ` <20121217192301.829A7020-Rm0zKEqwvD4@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2012-12-17 18:23 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >[Ohh, I am really an idiot. I screwed the first patch] >- bool oom = true; >+ bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); > >Which obviously doesn't work. It should read !(gfp_mask &GFP_MEMCG_NO_OOM). > No idea how I could have missed that. I am really sorry about that. :D no problem :) so, now it should really work as expected and completely fix my original problem? is it safe to apply it on 3.2.35? Thank you very much! azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20121217192301.829A7020-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121217192301.829A7020-Rm0zKEqwvD4@public.gmane.org> @ 2012-12-17 19:55 ` Michal Hocko [not found] ` <20121217195510.GA16375-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2012-12-17 19:55 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Mon 17-12-12 19:23:01, azurIt wrote: > >[Ohh, I am really an idiot. I screwed the first patch] > >- bool oom = true; > >+ bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); > > > >Which obviously doesn't work. It should read !(gfp_mask &GFP_MEMCG_NO_OOM). > > No idea how I could have missed that. I am really sorry about that. > > > :D no problem :) so, now it should really work as expected and > completely fix my original problem? It should mitigate the problem. The real fix shouldn't be that specific (as per discussion in other thread). The chance this will get upstream is not big and that means that it will not get to the stable tree either. > is it safe to apply it on 3.2.35? I didn't check what are the differences but I do not think there is anything to conflict with it. > Thank you very much! HTH -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20121217195510.GA16375-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121217195510.GA16375-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-12-18 14:22 ` azurIt 2012-12-18 15:20 ` Michal Hocko 0 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2012-12-18 14:22 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >It should mitigate the problem. The real fix shouldn't be that specific >(as per discussion in other thread). The chance this will get upstream >is not big and that means that it will not get to the stable tree >either. OOM is no longer killing processes outside target cgroups, so everything looks fine so far. Will report back when i will have more info. Thnks! azur ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-18 14:22 ` azurIt @ 2012-12-18 15:20 ` Michal Hocko [not found] ` <20121218152004.GA25208-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2012-12-18 15:20 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue 18-12-12 15:22:23, azurIt wrote: > >It should mitigate the problem. The real fix shouldn't be that specific > >(as per discussion in other thread). The chance this will get upstream > >is not big and that means that it will not get to the stable tree > >either. > > > OOM is no longer killing processes outside target cgroups, so > everything looks fine so far. Will report back when i will have more > info. Thnks! OK, good to hear and fingers crossed. I will try to get back to the original problem and a better solution sometimes early next year when all the things settle a bit. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20121218152004.GA25208-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121218152004.GA25208-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-12-24 13:25 ` azurIt [not found] ` <20121224142526.020165D3-Rm0zKEqwvD4@public.gmane.org> 2012-12-24 13:38 ` azurIt 1 sibling, 1 reply; 168+ messages in thread From: azurIt @ 2012-12-24 13:25 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >OK, good to hear and fingers crossed. I will try to get back to the >original problem and a better solution sometimes early next year when >all the things settle a bit. Michal, problem, unfortunately, happened again :( twice. When it happened first time (two days ago) i don't want to believe it so i recompiled the kernel and boot it again to be sure i really used your patch. Today it happened again, here is report: http://watchdog.sk/lkml/memcg-bug-3.tar.gz Here is patch which i used (kernel 3.2.35, i didn't use any other from your patches): http://watchdog.sk/lkml/5-memcg-fix.patch azur ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20121224142526.020165D3-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121224142526.020165D3-Rm0zKEqwvD4@public.gmane.org> @ 2012-12-28 16:22 ` Michal Hocko [not found] ` <20121228162209.GA1455-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2012-12-28 16:22 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Mon 24-12-12 14:25:26, azurIt wrote: > >OK, good to hear and fingers crossed. I will try to get back to the > >original problem and a better solution sometimes early next year when > >all the things settle a bit. > > > Michal, problem, unfortunately, happened again :( twice. When it > happened first time (two days ago) i don't want to believe it so i > recompiled the kernel and boot it again to be sure i really used your > patch. Today it happened again, here is report: > http://watchdog.sk/lkml/memcg-bug-3.tar.gz Hmm, 1356352982/1507/stack says [<ffffffff8110a971>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b55b>] T.1147+0x5ab/0x5c0 [<ffffffff8110c1de>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca20f>] add_to_page_cache_locked+0x4f/0x140 [<ffffffff810ca322>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810cac53>] find_or_create_page+0x73/0xb0 [<ffffffff8114340a>] __getblk+0xea/0x2c0 [<ffffffff811921ab>] ext3_getblk+0xeb/0x240 [<ffffffff81192319>] ext3_bread+0x19/0x90 [<ffffffff811967e3>] ext3_dx_find_entry+0x83/0x1e0 [<ffffffff81196c24>] ext3_find_entry+0x2e4/0x480 [<ffffffff8119750d>] ext3_lookup+0x4d/0x120 [<ffffffff8111cff5>] d_alloc_and_lookup+0x45/0x90 [<ffffffff8111d598>] do_lookup+0x278/0x390 [<ffffffff8111f11e>] path_lookupat+0xae/0x7e0 [<ffffffff8111f885>] do_path_lookup+0x35/0xe0 [<ffffffff8111fa19>] user_path_at_empty+0x59/0xb0 [<ffffffff8111fa81>] user_path_at+0x11/0x20 [<ffffffff811164d7>] vfs_fstatat+0x47/0x80 [<ffffffff8111657e>] 
vfs_lstat+0x1e/0x20 [<ffffffff811165a4>] sys_newlstat+0x24/0x50 [<ffffffff815b5a66>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff which suggests that the patch is incomplete and that I am blind :/ mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following follow-up patch on top of the one you already have (which should catch all the remaining cases). Sorry about that... --- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 89997ac..559a54d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2779,6 +2779,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask) { + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg = NULL; int ret; @@ -2791,7 +2792,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, mm = &init_mm; if (page_is_file_cache(page)) { - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); if (ret || !memcg) return ret; @@ -2827,6 +2828,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, struct mem_cgroup **ptr) { + bool oom = !(mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg; int ret; @@ -2849,13 +2851,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *ptr = memcg; - ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); css_put(&memcg->css); return ret; charge_cur_mm: if (unlikely(!mm)) mm = &init_mm; - return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); + return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); } static void -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 168+ messages in thread
[parent not found: <20121228162209.GA1455-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121228162209.GA1455-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-12-30 1:09 ` azurIt 2012-12-30 11:08 ` Michal Hocko 0 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2012-12-30 1:09 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >which suggests that the patch is incomplete and that I am blind :/ >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following >follow-up patch on top of the one you already have (which should catch >all the remaining cases). >Sorry about that... This was, again, killing my MySQL server (search for "(mysqld)"): http://www.watchdog.sk/lkml/oom_mysqld5 ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-30 1:09 ` azurIt @ 2012-12-30 11:08 ` Michal Hocko [not found] ` <20121230110815.GA12940-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2012-12-30 11:08 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Sun 30-12-12 02:09:47, azurIt wrote: > >which suggests that the patch is incomplete and that I am blind :/ > >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache > >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following > >follow-up patch on top of the one you already have (which should catch > >all the remaining cases). > >Sorry about that... > > > This was, again, killing my MySQL server (search for "(mysqld)"): > http://www.watchdog.sk/lkml/oom_mysqld5 grep "Kill process" oom_mysqld5 Dec 30 01:53:34 server01 kernel: [ 367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child Dec 30 01:53:35 server01 kernel: [ 367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child Dec 30 01:53:35 server01 kernel: [ 367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child Dec 30 01:53:36 server01 kernel: [ 368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child Dec 30 01:53:36 server01 kernel: [ 368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child Dec 30 01:53:36 server01 kernel: [ 368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child Dec 30 01:53:36 server01 kernel: [ 369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child Dec 30 01:53:37 server01 kernel: [ 369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child 
Dec 30 01:53:37 server01 kernel: [ 369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child Dec 30 01:53:37 server01 kernel: [ 369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child So your mysqld has been killed by the global OOM not memcg. But why when you seem to be perfectly fine regarding memory? I guess the following backtrace is relevant: Dec 30 01:53:36 server01 kernel: [ 368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB Dec 30 01:53:36 server01 kernel: [ 368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB Dec 30 01:53:36 server01 kernel: [ 368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB Dec 30 01:53:36 server01 kernel: [ 368.571906] 308964 total pagecache pages Dec 30 01:53:36 server01 kernel: [ 368.572023] 0 pages in swap cache Dec 30 01:53:36 server01 kernel: [ 368.572140] Swap cache stats: add 0, delete 0, find 0/0 Dec 30 01:53:36 server01 kernel: [ 368.572260] Free swap = 0kB Dec 30 01:53:36 server01 kernel: [ 368.572375] Total swap = 0kB Dec 30 01:53:36 server01 kernel: [ 368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 Dec 30 01:53:36 server01 kernel: [ 368.598034] apache2 cpuset=uid mems_allowed=0 Dec 30 01:53:36 server01 kernel: [ 368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1 Dec 30 01:53:36 server01 kernel: [ 368.598273] Call Trace: Dec 30 01:53:36 server01 kernel: [ 368.598396] [<ffffffff810cc89e>] dump_header+0x7e/0x1e0 Dec 30 01:53:36 server01 kernel: [ 368.598516] [<ffffffff810cc79f>] ? 
find_lock_task_mm+0x2f/0x70 Dec 30 01:53:36 server01 kernel: [ 368.598638] [<ffffffff810ccd65>] oom_kill_process+0x85/0x2a0 Dec 30 01:53:36 server01 kernel: [ 368.598759] [<ffffffff810cd415>] out_of_memory+0xe5/0x200 Dec 30 01:53:36 server01 kernel: [ 368.598880] [<ffffffff810cd5ed>] pagefault_out_of_memory+0xbd/0x110 Dec 30 01:53:36 server01 kernel: [ 368.599006] [<ffffffff81026e96>] mm_fault_error+0xb6/0x1a0 Dec 30 01:53:36 server01 kernel: [ 368.599127] [<ffffffff8102736e>] do_page_fault+0x3ee/0x460 Dec 30 01:53:36 server01 kernel: [ 368.599250] [<ffffffff81131ccf>] ? mntput+0x1f/0x30 Dec 30 01:53:36 server01 kernel: [ 368.599371] [<ffffffff811134e6>] ? fput+0x156/0x200 Dec 30 01:53:36 server01 kernel: [ 368.599496] [<ffffffff815b567f>] page_fault+0x1f/0x30 This would suggest that an unexpected ENOMEM leaked during page fault path. I do not see which one could that be because you said THP (CONFIG_TRANSPARENT_HUGEPAGE) are disabled (and the other patch I have mentioned in the thread should fix that issue - btw. the patch is already scheduled for stable tree). __do_fault, do_anonymous_page and do_wp_page call mem_cgroup_newpage_charge with GFP_KERNEL which means that we do memcg OOM and never return ENOMEM. do_swap_page calls mem_cgroup_try_charge_swapin with GFP_KERNEL as well. I might have missed something but I will not get to look closer before 2nd January. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20121230110815.GA12940-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20121230110815.GA12940-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-01-25 15:07 ` azurIt 2013-01-25 16:31 ` Michal Hocko 0 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2013-01-25 15:07 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner Any news? Thnx! azur ______________________________________________________________ > Od: "Michal Hocko" <mhocko-AlSwsSmVLrQ@public.gmane.org> > Komu: azurIt <azurit-Rm0zKEqwvD4@public.gmane.org> > Dátum: 30.12.2012 12:08 > Predmet: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked > > CC: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, "Johannes Weiner" <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> >On Sun 30-12-12 02:09:47, azurIt wrote: >> >which suggests that the patch is incomplete and that I am blind :/ >> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache >> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following >> >follow-up patch on top of the one you already have (which should catch >> >all the remaining cases). >> >Sorry about that... 
[quoted message trimmed; full text above] ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-01-25 15:07 ` azurIt @ 2013-01-25 16:31 ` Michal Hocko [not found] ` <20130125163130.GF4721-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-01-25 16:31 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 25-01-13 16:07:23, azurIt wrote: > Any news? Thnx! Sorry, but I didn't get to this one yet. > > azur > > > > ______________________________________________________________ > > Od: "Michal Hocko" <mhocko@suse.cz> > > Komu: azurIt <azurit@pobox.sk> > > Dátum: 30.12.2012 12:08 > > Predmet: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked > > > > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, "Johannes Weiner" <hannes@cmpxchg.org> > >On Sun 30-12-12 02:09:47, azurIt wrote: > >> >which suggests that the patch is incomplete and that I am blind :/ > >> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache > >> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following > >> >follow-up patch on top of the one you already have (which should catch > >> >all the remaining cases). > >> >Sorry about that... 
[quoted message trimmed; full text above] -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20130125163130.GF4721-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20130125163130.GF4721-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-02-05 13:49 ` Michal Hocko 2013-02-05 14:49 ` azurIt 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-02-05 13:49 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 25-01-13 17:31:30, Michal Hocko wrote: > On Fri 25-01-13 16:07:23, azurIt wrote: > > Any news? Thnx! > > Sorry, but I didn't get to this one yet. Sorry, to get back to this that late but I was busy as hell since the beginning of the year. Has the issue repeated since then? You said you didn't apply other than the above mentioned patch. Could you apply also debugging part of the patches I have sent? In case you don't have it handy then it should be this one: --- From 1623420d964e7e8bc88e2a6239563052df891bf7 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> Date: Mon, 3 Dec 2012 16:16:01 +0100 Subject: [PATCH] more debugging --- mm/huge_memory.c | 6 +++--- mm/memcontrol.c | 1 + 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 470cbb4..01a11f1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag) } #endif -int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, +noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, unsigned int flags) { @@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm) return pgtable; } -static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, +static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, pmd_t orig_pmd, @@ -883,7 +883,7 @@ 
out_free_pages: goto out; } -int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, +noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, pmd_t orig_pmd) { int ret = 0; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..1986c65 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2397,6 +2397,7 @@ done: return 0; nomem: *ptr = NULL; + __WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, nr_pages, oom, ret); return -ENOMEM; bypass: *ptr = NULL; -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 13:49 ` Michal Hocko @ 2013-02-05 14:49 ` azurIt 2013-02-05 16:09 ` Michal Hocko [not found] ` <20130205154947.CD6411E2-Rm0zKEqwvD4@public.gmane.org> 0 siblings, 2 replies; 168+ messages in thread From: azurIt @ 2013-02-05 14:49 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>Sorry, to get back to this that late but I was busy as hell since the
>beginning of the year.

Thank you for your time!

>Has the issue repeated since then?

Yes, it's happening all the time, but in the meantime I wrote a script which monitors the problem and kills frozen processes when it occurs. I don't like it much, though; it's not a real solution for me :( I also noticed that the problem always affects the whole server, not just the frozen cgroup. Depending on the number of frozen processes, sometimes it has almost no impact on the rest of the server, sometimes the whole server lags badly.

I have another old problem which may also be related to this. I hadn't connected the two before, but now I'm not sure. Two of our servers which are affected by this cgroup problem also freeze completely at random (a few times per month). These are the symptoms:
- the servers answer to ping
- it is possible to connect via SSH, but the connection freezes after sending the password
- it is possible to log in on the console, but it freezes after typing the login

These symptoms are very similar to HDD problems or HDD overload (but there is definitely no overload). The only way to fix it seems to be hard-rebooting the server (I didn't find any other way). What do you think - can this be related? Maybe the HDDs get locked in a similar way to the cgroups; we already found out that the cgroup freezing is also related to HDD activity. Maybe there is a small chance that the whole HDD subsystem ends up in a deadlock?
>You said you didn't apply other than the above mentioned patch. Could
>you apply also debugging part of the patches I have sent?
>In case you don't have it handy then it should be this one:

Just to be sure - am I supposed to apply these two patches?
http://watchdog.sk/lkml/patches/

azur
^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 14:49 ` azurIt @ 2013-02-05 16:09 ` Michal Hocko 2013-02-05 16:46 ` azurIt ` (2 more replies) [not found] ` <20130205154947.CD6411E2-Rm0zKEqwvD4@public.gmane.org> 1 sibling, 3 replies; 168+ messages in thread From: Michal Hocko @ 2013-02-05 16:09 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Tue 05-02-13 15:49:47, azurIt wrote:
[...]
> Just to be sure - am i supposed to apply this two patches?
> http://watchdog.sk/lkml/patches/

5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
mentioned in a follow-up email. Here is the full patch:
---
From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

The memcg OOM killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap, or the swap limit is hit as well and all cache
pages have been reclaimed already) and the process selected by the memcg
OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).
Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0	# takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0	# takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock, though, because the administrator can still
intervene and increase the limit on the group, which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom
helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask,
which then tells mem_cgroup_charge_common that OOM is not allowed for
the charge. No OOM from this path, apart from fixing the bug, also makes
some sense, as we really do not want to cause an OOM because of page
cache usage. As a possibly visible result, add_to_page_cache_lru might
fail more often with ENOMEM, but this is to be expected if the limit is
set, and it is preferable to the OOM killer IMO.
__GFP_NORETRY is abused for this memcg specific flag because no user accounted allocation use this flag except for THP which have memcg oom disabled already. Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/gfp.h | 3 +++ include/linux/memcontrol.h | 13 +++++++++++++ mm/filemap.c | 8 +++++++- mm/memcontrol.c | 10 ++++++---- 4 files changed, 29 insertions(+), 5 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 3a76faf..806fb54 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -146,6 +146,9 @@ struct vm_area_struct; /* 4GB DMA on some platforms */ #define GFP_DMA32 __GFP_DMA32 +/* memcg oom killer is not allowed */ +#define GFP_MEMCG_NO_OOM __GFP_NORETRY + /* Convert GFP flags to their corresponding migrate type */ static inline int allocflags_to_migratetype(gfp_t gfp_flags) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 81572af..bf0e575 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); + +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); +} + extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_rotate_reclaimable_page(struct page *page); @@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, return 0; } +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return 0; +} + static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr) { diff --git 
a/mm/filemap.c b/mm/filemap.c index 556858c..ef182a9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. + */ + error = mem_cgroup_cache_charge_no_oom(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1986c65..a68aa08 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2704,7 +2704,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; - bool oom = true; + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); int ret; if (PageTransHuge(page)) { @@ -2771,6 +2771,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask) { + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg = NULL; int ret; @@ -2783,7 +2784,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, mm = &init_mm; if (page_is_file_cache(page)) { - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); if (ret || !memcg) return ret; @@ -2819,6 +2820,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, struct mem_cgroup **ptr) { + bool oom = !(mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg; int ret; @@ -2841,13 +2843,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *ptr = memcg; - ret = __mem_cgroup_try_charge(NULL, 
mask, 1, ptr, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); css_put(&memcg->css); return ret; charge_cur_mm: if (unlikely(!mm)) mm = &init_mm; - return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); + return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); } static void -- 1.7.10.4 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 16:09 ` Michal Hocko @ 2013-02-05 16:46 ` azurIt 2013-02-05 16:48 ` Greg Thelen 2013-02-06 1:17 ` azurIt 2 siblings, 0 replies; 168+ messages in thread From: azurIt @ 2013-02-05 16:46 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
>mentioned in a follow up email.

Oh, it wasn't complete? I used it in my last test... sorry, I'm a little confused by all those patches. Will try it tonight and report back.
^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 16:09 ` Michal Hocko 2013-02-05 16:46 ` azurIt @ 2013-02-05 16:48 ` Greg Thelen 2013-02-05 17:46 ` Michal Hocko 2013-02-06 1:17 ` azurIt 2 siblings, 1 reply; 168+ messages in thread From: Greg Thelen @ 2013-02-05 16:48 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 15:49:47, azurIt wrote: > [...] >> Just to be sure - am i supposed to apply this two patches? >> http://watchdog.sk/lkml/patches/ > > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > mentioned in a follow up email. Here is the full patch: > --- > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > Process A > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > [<ffffffff81121c90>] do_last+0x250/0xa30 > [<ffffffff81122547>] path_openat+0xd7/0x440 > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > [<ffffffff8110f950>] sys_open+0x20/0x30 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff > > Process B > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > [<ffffffff81112381>] sys_write+0x51/0x90 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff It looks like grab_cache_page_write_begin() passes __GFP_FS into __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me think that this deadlock is also possible in the page allocator even before getting to add_to_page_cache_lru. no? Can callers holding fs resources (e.g. i_mutex) pass __GFP_FS into the page allocator? If __GFP_FS was avoided, then I think memcg user page charging would need a !__GFP_FS check to avoid invoking oom killer, but at least then we'd avoid both deadlocks and cover both page allocation and memcg page charging in similar fashion. Example from memcg_charge_kmem: may_oom = (gfp & __GFP_FS) && !(gfp & __GFP_NORETRY); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 16:48 ` Greg Thelen @ 2013-02-05 17:46 ` Michal Hocko 2013-02-05 18:09 ` Greg Thelen 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-02-05 17:46 UTC (permalink / raw) To: Greg Thelen Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue 05-02-13 08:48:23, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 15:49:47, azurIt wrote: > > [...] > >> Just to be sure - am i supposed to apply this two patches? > >> http://watchdog.sk/lkml/patches/ > > > > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > mentioned in a follow up email. Here is the full patch: > > --- > > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko <mhocko@suse.cz> > > Date: Mon, 26 Nov 2012 11:47:57 +0100 > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > > memcg oom killer might deadlock if the process which falls down to > > mem_cgroup_handle_oom holds a lock which prevents other task to > > terminate because it is blocked on the very same lock. > > This can happen when a write system call needs to allocate a page but > > the allocation hits the memcg hard limit and there is nothing to reclaim > > (e.g. there is no swap or swap limit is hit as well and all cache pages > > have been reclaimed already) and the process selected by memcg OOM > > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > > > Process A > > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > > [<ffffffff81121c90>] do_last+0x250/0xa30 > > [<ffffffff81122547>] path_openat+0xd7/0x440 > > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > > [<ffffffff8110f950>] sys_open+0x20/0x30 > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > > [<ffffffffffffffff>] 0xffffffffffffffff > > > > Process B > > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > > [<ffffffff81112381>] sys_write+0x51/0x90 > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > > [<ffffffffffffffff>] 0xffffffffffffffff > > It looks like grab_cache_page_write_begin() passes __GFP_FS into > __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > think that this deadlock is also possible in the page allocator even > before getting to add_to_page_cache_lru. no? I am not that familiar with VFS but i_mutex is a high level lock AFAIR and it shouldn't be called from the pageout path so __page_cache_alloc should be safe. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 17:46 ` Michal Hocko @ 2013-02-05 18:09 ` Greg Thelen [not found] ` <xr93a9ri4op6.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: Greg Thelen @ 2013-02-05 18:09 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 08:48:23, Greg Thelen wrote: >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: >> > [...] >> >> Just to be sure - am i supposed to apply this two patches? >> >> http://watchdog.sk/lkml/patches/ >> > >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >> > mentioned in a follow up email. Here is the full patch: >> > --- >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 >> > From: Michal Hocko <mhocko@suse.cz> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked >> > >> > memcg oom killer might deadlock if the process which falls down to >> > mem_cgroup_handle_oom holds a lock which prevents other task to >> > terminate because it is blocked on the very same lock. >> > This can happen when a write system call needs to allocate a page but >> > the allocation hits the memcg hard limit and there is nothing to reclaim >> > (e.g. there is no swap or swap limit is hit as well and all cache pages >> > have been reclaimed already) and the process selected by memcg OOM >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
>> > >> > Process A >> > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex >> > [<ffffffff81121c90>] do_last+0x250/0xa30 >> > [<ffffffff81122547>] path_openat+0xd7/0x440 >> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 >> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 >> > [<ffffffff8110f950>] sys_open+0x20/0x30 >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >> > [<ffffffffffffffff>] 0xffffffffffffffff >> > >> > Process B >> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 >> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 >> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 >> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 >> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 >> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 >> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 >> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 >> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >> > [<ffffffff8111156a>] do_sync_write+0xea/0x130 >> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 >> > [<ffffffff81112381>] sys_write+0x51/0x90 >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >> > [<ffffffffffffffff>] 0xffffffffffffffff >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me >> think that this deadlock is also possible in the page allocator even >> before getting to add_to_page_cache_lru. no? > > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > and it shouldn't be called from the pageout path so __page_cache_alloc > should be safe. I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. My concern is that __page_cache_alloc() will invoke the oom killer and select a victim which wants i_mutex. This victim will deadlock because the oom killer caller already holds i_mutex. 
The wild accusation I am making is that anyone who invokes the oom killer and waits on the victim to die is essentially grabbing all of the locks that any of the oom killer victims may grab (e.g. i_mutex). To avoid deadlock, the oom killer can only be called while holding no locks that the oom victim demands. I think some locks are grabbed in a way that allows the lock request to fail if the task has a fatal signal pending, so they are safe. But any lock acquisitions that cannot fail (e.g. mutex_lock) will deadlock with the oom-killing process. So the oom-killing process cannot hold any such locks which the victim will attempt to grab. Hopefully I'm missing something.
^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <xr93a9ri4op6.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <xr93a9ri4op6.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org> @ 2013-02-05 18:59 ` Michal Hocko 2013-02-08 4:27 ` Greg Thelen 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-02-05 18:59 UTC (permalink / raw) To: Greg Thelen Cc: azurIt, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue 05-02-13 10:09:57, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 08:48:23, Greg Thelen wrote: > >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> > >> > On Tue 05-02-13 15:49:47, azurIt wrote: > >> > [...] > >> >> Just to be sure - am i supposed to apply this two patches? > >> >> http://watchdog.sk/lkml/patches/ > >> > > >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >> > mentioned in a follow up email. Here is the full patch: > >> > --- > >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > >> > From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > >> > > >> > memcg oom killer might deadlock if the process which falls down to > >> > mem_cgroup_handle_oom holds a lock which prevents other task to > >> > terminate because it is blocked on the very same lock. > >> > This can happen when a write system call needs to allocate a page but > >> > the allocation hits the memcg hard limit and there is nothing to reclaim > >> > (e.g. there is no swap or swap limit is hit as well and all cache pages > >> > have been reclaimed already) and the process selected by memcg OOM > >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> >> > > >> > Process A > >> > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > >> > [<ffffffff81121c90>] do_last+0x250/0xa30 > >> > [<ffffffff81122547>] path_openat+0xd7/0x440 > >> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > >> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > >> > [<ffffffff8110f950>] sys_open+0x20/0x30 > >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > >> > [<ffffffffffffffff>] 0xffffffffffffffff > >> > > >> > Process B > >> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > >> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > >> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > >> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > >> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > >> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > >> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > >> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > >> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > >> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >> > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > >> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > >> > [<ffffffff81112381>] sys_write+0x51/0x90 > >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > >> > [<ffffffffffffffff>] 0xffffffffffffffff > >> > >> It looks like grab_cache_page_write_begin() passes __GFP_FS into > >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > >> think that this deadlock is also possible in the page allocator even > >> before getting to add_to_page_cache_lru. no? > > > > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > > and it shouldn't be called from the pageout path so __page_cache_alloc > > should be safe. > > I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. > My concern is that __page_cache_alloc() will invoke the oom killer and > select a victim which wants i_mutex. 
This victim will deadlock because > the oom killer caller already holds i_mutex. That would be true for the memcg oom because that one is blocking but the global oom just puts the allocator to sleep for a while and then the allocator should back off eventually (unless this is a NOFAIL allocation). I would need to look closer whether this is really the case - I haven't seen that allocator code path for a while... > The wild accusation I am making is that anyone who invokes the oom > killer and waits on the victim to die is essentially grabbing all of > the locks that any of the oom killer victims may grab (e.g. i_mutex). True. > To avoid deadlock the oom killer can only be called while holding > no locks that the oom victim demands. I think some locks are grabbed > in a way that allows the lock request to fail if the task has a fatal > signal pending, so they are safe. But any lock acquisitions that > cannot fail (e.g. mutex_lock) will deadlock with the oom killing > process. So the oom killing process cannot hold any such locks which > the victim will attempt to grab. Hopefully I'm missing something. Agreed. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 18:59 ` Michal Hocko @ 2013-02-08 4:27 ` Greg Thelen [not found] ` <xr93ip63ig6j.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: Greg Thelen @ 2013-02-08 4:27 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 10:09:57, Greg Thelen wrote: >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> > On Tue 05-02-13 08:48:23, Greg Thelen wrote: >> >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: >> >> > [...] >> >> >> Just to be sure - am i supposed to apply this two patches? >> >> >> http://watchdog.sk/lkml/patches/ >> >> > >> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >> >> > mentioned in a follow up email. Here is the full patch: >> >> > --- >> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 >> >> > From: Michal Hocko <mhocko@suse.cz> >> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 >> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked >> >> > >> >> > memcg oom killer might deadlock if the process which falls down to >> >> > mem_cgroup_handle_oom holds a lock which prevents other task to >> >> > terminate because it is blocked on the very same lock. >> >> > This can happen when a write system call needs to allocate a page but >> >> > the allocation hits the memcg hard limit and there is nothing to reclaim >> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages >> >> > have been reclaimed already) and the process selected by memcg OOM >> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
>> >> > >> >> > Process A >> >> > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex >> >> > [<ffffffff81121c90>] do_last+0x250/0xa30 >> >> > [<ffffffff81122547>] path_openat+0xd7/0x440 >> >> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 >> >> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 >> >> > [<ffffffff8110f950>] sys_open+0x20/0x30 >> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >> >> > [<ffffffffffffffff>] 0xffffffffffffffff >> >> > >> >> > Process B >> >> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >> >> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 >> >> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 >> >> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 >> >> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 >> >> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 >> >> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 >> >> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 >> >> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 >> >> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >> >> > [<ffffffff8111156a>] do_sync_write+0xea/0x130 >> >> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 >> >> > [<ffffffff81112381>] sys_write+0x51/0x90 >> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >> >> > [<ffffffffffffffff>] 0xffffffffffffffff >> >> >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into >> >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me >> >> think that this deadlock is also possible in the page allocator even >> >> before getting to add_to_page_cache_lru. no? >> > >> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR >> > and it shouldn't be called from the pageout path so __page_cache_alloc >> > should be safe. >> >> I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. 
>> My concern is that __page_cache_alloc() will invoke the oom killer and >> select a victim which wants i_mutex. This victim will deadlock because >> the oom killer caller already holds i_mutex. > > That would be true for the memcg oom because that one is blocking but > the global oom just puts the allocator into sleep for a while and then > the allocator should back off eventually (unless this is NOFAIL > allocation). I would need to look closer whether this is really the case > - I haven't seen that allocator code path for a while... I think the page allocator can loop forever waiting for an oom victim to terminate even without NOFAIL, especially if the oom victim wants a resource exclusively held by the allocating thread (e.g. i_mutex). It looks like the same deadlock you describe is also possible (though more rare) without memcg. If the looping thread is an eligible oom victim (i.e. not oom disabled, not a kernel thread, etc) then the page allocator can return NULL so long as NOFAIL is not used. So any allocator which is able to call the oom killer and is not oom disabled (kernel thread, etc) is already exposed to the possibility of page allocator failure. So if the page allocator could detect the deadlock, then it could safely return NULL. Maybe after looping N times without forward progress the page allocator should consider failing unless NOFAIL is given. Switching back to the memcg oom situation, can we similarly return NULL if memcg oom kill has been tried a reasonable number of times? Simply failing the memcg charge with ENOMEM seems easier to support than exceeding the limit (Kame's loan patch). ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <xr93ip63ig6j.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <xr93ip63ig6j.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org> @ 2013-02-08 16:29 ` Michal Hocko 2013-02-08 16:40 ` Michal Hocko 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-02-08 16:29 UTC (permalink / raw) To: Greg Thelen Cc: azurIt, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Thu 07-02-13 20:27:00, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 10:09:57, Greg Thelen wrote: > >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> > >> > On Tue 05-02-13 08:48:23, Greg Thelen wrote: > >> >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> >> > >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: > >> >> > [...] > >> >> >> Just to be sure - am i supposed to apply this two patches? > >> >> >> http://watchdog.sk/lkml/patches/ > >> >> > > >> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >> >> > mentioned in a follow up email. Here is the full patch: > >> >> > --- > >> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > >> >> > From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > >> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > >> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > >> >> > > >> >> > memcg oom killer might deadlock if the process which falls down to > >> >> > mem_cgroup_handle_oom holds a lock which prevents other task to > >> >> > terminate because it is blocked on the very same lock. > >> >> > This can happen when a write system call needs to allocate a page but > >> >> > the allocation hits the memcg hard limit and there is nothing to reclaim > >> >> > (e.g. 
there is no swap or swap limit is hit as well and all cache pages > >> >> > have been reclaimed already) and the process selected by memcg OOM > >> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). > >> >> > > >> >> > Process A > >> >> > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > >> >> > [<ffffffff81121c90>] do_last+0x250/0xa30 > >> >> > [<ffffffff81122547>] path_openat+0xd7/0x440 > >> >> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > >> >> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > >> >> > [<ffffffff8110f950>] sys_open+0x20/0x30 > >> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > >> >> > [<ffffffffffffffff>] 0xffffffffffffffff > >> >> > > >> >> > Process B > >> >> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > >> >> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > >> >> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > >> >> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > >> >> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > >> >> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > >> >> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > >> >> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > >> >> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > >> >> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >> >> > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > >> >> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > >> >> > [<ffffffff81112381>] sys_write+0x51/0x90 > >> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > >> >> > [<ffffffffffffffff>] 0xffffffffffffffff > >> >> > >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into > >> >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > >> >> think that this deadlock is also possible in the page allocator even > >> >> before getting to add_to_page_cache_lru. no? 
> >> > > >> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > >> > and it shouldn't be called from the pageout path so __page_cache_alloc > >> > should be safe. > >> > >> I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. > >> My concern is that __page_cache_alloc() will invoke the oom killer and > >> select a victim which wants i_mutex. This victim will deadlock because > >> the oom killer caller already holds i_mutex. > > > > That would be true for the memcg oom because that one is blocking but > > the global oom just puts the allocator into sleep for a while and then > > the allocator should back off eventually (unless this is NOFAIL > > allocation). I would need to look closer whether this is really the case > > - I haven't seen that allocator code path for a while... > > I think the page allocator can loop forever waiting for an oom victim to > terminate even without NOFAIL. Especially if the oom victim wants a > resource exclusively held by the allocating thread (e.g. i_mutex). It > looks like the same deadlock you describe is also possible (though more > rare) without memcg. OK, I have checked the allocator slow path and you are right, even GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g. an OOM-killed task blocked on down_write(mmap_sem) while the page fault handler holds mmap_sem for reading and allocates a new page without making any progress. Luckily there are memory reserves which the allocator falls back to eventually, so the allocation should be able to get some memory and release the lock. There is still a theoretical chance this would block, but this sounds like a corner case so I wouldn't care about it very much. > If the looping thread is an eligible oom victim (i.e. not oom disabled, > not an kernel thread, etc) then the page allocator can return NULL in so > long as NOFAIL is not used.
So any allocator which is able to call the > oom killer and is not oom disabled (kernel thread, etc) is already > exposed to the possibility of page allocator failure. So if the page > allocator could detect the deadlock, then it could safely return NULL. > Maybe after looping N times without forward progress the page allocator > should consider failing unless NOFAIL is given. The page allocator is quite tricky to touch and the chances of this deadlock are not that big. > if memcg oom kill has been tried a reasonable number of times. Simply > failing the memcg charge with ENOMEM seems easier to support than > exceeding limit (Kame's loan patch). We cannot do that in the page fault path because it would trigger the global oom killer. We would need to either retry the page fault or send KILL to the faulting process, but I do not like that much as it could lead to DoS attacks. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-08 16:29 ` Michal Hocko @ 2013-02-08 16:40 ` Michal Hocko 0 siblings, 0 replies; 168+ messages in thread From: Michal Hocko @ 2013-02-08 16:40 UTC (permalink / raw) To: Greg Thelen Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 08-02-13 17:29:18, Michal Hocko wrote: [...] > OK, I have checked the allocator slow path and you are right even > GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g. > OOM killed task blocked on down_write(mmap_sem) while the page fault > handler holding mmap_sem for reading and allocating a new page without > any progress. And now that I think about it some more, it sounds like it shouldn't be possible because the allocator would fail once it sees TIF_MEMDIE (the OOM killer kills all threads that share the same mm). There may be other locks that are dangerous, but I think that the risk is pretty low. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 16:09 ` Michal Hocko 2013-02-05 16:46 ` azurIt 2013-02-05 16:48 ` Greg Thelen @ 2013-02-06 1:17 ` azurIt [not found] ` <20130206021721.1AE9E3C7-Rm0zKEqwvD4@public.gmane.org> 2 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2013-02-06 1:17 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >5-memcg-fix-1.patch is not complete. It doesn't contain the followup I >mentioned in a follow up email. Here is the full patch: Here is the log where OOM, again, killed the MySQL server [search for "(mysqld)"]: http://www.watchdog.sk/lkml/oom_mysqld6 azur ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20130206021721.1AE9E3C7-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20130206021721.1AE9E3C7-Rm0zKEqwvD4@public.gmane.org> @ 2013-02-06 14:01 ` Michal Hocko [not found] ` <20130206140119.GD10254-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2013-02-07 11:01 ` [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Kamezawa Hiroyuki 0 siblings, 2 replies; 168+ messages in thread From: Michal Hocko @ 2013-02-06 14:01 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Wed 06-02-13 02:17:21, azurIt wrote: > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >mentioned in a follow up email. Here is the full patch: > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > http://www.watchdog.sk/lkml/oom_mysqld6 [...] WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() Hardware name: S5000VSA gfp_mask:4304 nr_pages:1 oom:0 ret:2 Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 Call Trace: [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 [<ffffffff810eab18>] __do_fault+0x78/0x5a0 [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 [<ffffffff810f2508>] ? vma_link+0x88/0xe0 [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 [<ffffffff8102709d>] do_page_fault+0x13d/0x460 [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 [<ffffffff815b61ff>] page_fault+0x1f/0x30 ---[ end trace 8817670349022007 ]--- apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 apache2 cpuset=uid mems_allowed=0 Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 Call Trace: [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 [<ffffffff815b61ff>] page_fault+0x1f/0x30 The first trace comes from the debugging WARN and it clearly points to a file fault path. __do_fault pre-charges a page in case we need to do CoW (copy-on-write) for the returned page. This one falls back to memcg OOM and never returns ENOMEM as I have mentioned earlier. However, the fs fault handler (filemap_fault here) can fall back to page_cache_read if the readahead (do_sync_mmap_readahead) fails to get the page into the page cache. And we can see this happening in the first trace. page_cache_read then calls add_to_page_cache_lru and eventually gets to add_to_page_cache_locked which calls mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should happen. This ENOMEM gets to the fault handler and kaboom. So the fix is really much more complex than I thought. Although add_to_page_cache_locked sounded like a good place, it turned out not to be. We apparently need something more clever. One way would be to stop misusing __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 bits for those flags in gfp_t so there should be some room there. Or we could do this with a per-task flag, the same as we do for NO_IO in the current -mm tree. The latter seems easier wrt. the gfp_mask passing horror - e.g.
__generic_file_aio_write doesn't pass flags and it can be called from unlocked contexts as well. I have to think about it some more. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20130206140119.GD10254-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked [not found] ` <20130206140119.GD10254-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-02-06 14:22 ` Michal Hocko [not found] ` <20130206142219.GF10254-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-02-06 14:22 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Wed 06-02-13 15:01:19, Michal Hocko wrote: > On Wed 06-02-13 02:17:21, azurIt wrote: > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > >mentioned in a follow up email. Here is the full patch: > > > > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > > http://www.watchdog.sk/lkml/oom_mysqld6 > > [...] > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > Hardware name: S5000VSA > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 > [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 > [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 > [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 > [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 > [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 > [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 > [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 > [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 > [<ffffffff810eab18>] __do_fault+0x78/0x5a0 > [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 > [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 > [<ffffffff810f2508>] ? vma_link+0x88/0xe0 > [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 > [<ffffffff8102709d>] do_page_fault+0x13d/0x460 > [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > ---[ end trace 8817670349022007 ]--- > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > apache2 cpuset=uid mems_allowed=0 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 > [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 > [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 > [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 > [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 > [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 > [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 > [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > > The first trace comes from the debugging WARN and it clearly points to > a file fault path. __do_fault pre-charges a page in case we need to > do CoW (copy-on-write) for the returned page. This one falls back to > memcg OOM and never returns ENOMEM as I have mentioned earlier. > However, the fs fault handler (filemap_fault here) can fallback to > page_cache_read if the readahead (do_sync_mmap_readahead) fails > to get page to the page cache. And we can see this happening in > the first trace. page_cache_read then calls add_to_page_cache_lru > and eventually gets to add_to_page_cache_locked which calls > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > happen. This ENOMEM gets to the fault handler and kaboom. > > So the fix is really much more complex than I thought. Although > add_to_page_cache_locked sounded like a good place it turned out to be > not in fact. > > We need something more clever appaerently. One way would be not misusing > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 > bits for those flags in gfp_t so there should be some room there. > Or we could do this per task flag, same we do for NO_IO in the current > -mm tree. > The later one seems easier wrt. 
gfp_mask passing horror - e.g. > __generic_file_aio_write doesn't pass flags and it can be called from > unlocked contexts as well. Ouch, the PF_ flags space seems to be drained already because task_struct::flags is just an unsigned int, so there is just one bit left. I am not sure this is the best use for it. This will be a real pain! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20130206142219.GF10254-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set [not found] ` <20130206142219.GF10254-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-02-06 16:00 ` Michal Hocko 2013-02-08 5:03 ` azurIt 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-02-06 16:00 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Wed 06-02-13 15:22:19, Michal Hocko wrote: > On Wed 06-02-13 15:01:19, Michal Hocko wrote: > > On Wed 06-02-13 02:17:21, azurIt wrote: > > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > > >mentioned in a follow up email. Here is the full patch: > > > > > > > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > > > http://www.watchdog.sk/lkml/oom_mysqld6 > > > > [...] > > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > > Hardware name: S5000VSA > > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 > > [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 > > [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 > > [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 > > [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 > > [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 > > [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 > > [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 > > [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 > > [<ffffffff810eab18>] __do_fault+0x78/0x5a0 > > [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 > > [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 > > [<ffffffff810f2508>] ? vma_link+0x88/0xe0 > > [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 > > [<ffffffff8102709d>] do_page_fault+0x13d/0x460 > > [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 > > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > > ---[ end trace 8817670349022007 ]--- > > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > > apache2 cpuset=uid mems_allowed=0 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 > > [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 > > [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 > > [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 > > [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 > > [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 > > [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 > > [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 > > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > > > > The first trace comes from the debugging WARN and it clearly points to > > a file fault path. __do_fault pre-charges a page in case we need to > > do CoW (copy-on-write) for the returned page. This one falls back to > > memcg OOM and never returns ENOMEM as I have mentioned earlier. > > However, the fs fault handler (filemap_fault here) can fallback to > > page_cache_read if the readahead (do_sync_mmap_readahead) fails > > to get page to the page cache. And we can see this happening in > > the first trace. page_cache_read then calls add_to_page_cache_lru > > and eventually gets to add_to_page_cache_locked which calls > > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > > happen. This ENOMEM gets to the fault handler and kaboom. > > > > So the fix is really much more complex than I thought. Although > > add_to_page_cache_locked sounded like a good place it turned out to be > > not in fact. > > > > We need something more clever appaerently. One way would be not misusing > > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 > > bits for those flags in gfp_t so there should be some room there. 
> > Or we could do this per task flag, same we do for NO_IO in the current > > -mm tree. > > The later one seems easier wrt. gfp_mask passing horror - e.g. > > __generic_file_aio_write doesn't pass flags and it can be called from > > unlocked contexts as well. > > Ouch, PF_ flags space seem to be drained already because > task_struct::flags is just unsigned int so there is just one bit left. I > am not sure this is the best use for it. This will be a real pain! OK, so this is something that should help you without any risk of false OOMs. I do not believe that something like this would be accepted upstream because it is really heavy; we will need to come up with something more clever for upstream. I have also added a warning which will trigger when the charge fails. If you see too many of those messages then there is something bad going on and the lack of OOM causes userspace to loop without making any progress. So there you go - your personal patch ;) You can drop all other patches. Please note I have only compile tested it, but it should be pretty trivial to check that it is correct --- From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> Date: Wed, 6 Feb 2013 16:45:07 +0100 Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set The memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents other tasks from terminating because they are blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or the swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by the memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).
Process A [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff Process B [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff This is not a hard deadlock though because administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from dangerous context. Memcg charging code has no way to find out whether it is called from a locked context we have to help it via process flags. PF_OOM_ORIGIN flag removed recently will be reused for PF_NO_MEMCG_OOM which signals that the memcg OOM killer could lead to a deadlock. Only locked callers of __generic_file_aio_write are currently marked. 
I am pretty sure there are more places (I didn't check shmem, hugetlb uses a fancy instantiation mutex during page fault, and filesystems might use some locks during the write) but I've ignored those, as this will probably be just a user-specific patch without any way to get upstream in its current form.

Reported-by: azurIt <azurit-Rm0zKEqwvD4@public.gmane.org>
Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
---
 drivers/staging/pohmelfs/inode.c |    2 ++
 include/linux/sched.h            |    1 +
 mm/filemap.c                     |    2 ++
 mm/memcontrol.c                  |   18 ++++++++++++++----
 4 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 7a19555..523de82e 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -921,7 +921,9 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf,
 	if (ret)
 		goto err_out_unlock;
 
+	current->flags |= PF_NO_MEMCG_OOM;
 	ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos);
+	current->flags &= ~PF_NO_MEMCG_OOM;
 	*ppos = kiocb.ki_pos;
 
 	mutex_unlock(&inode->i_mutex);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e86bb4..f275c8f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1781,6 +1781,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
 #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
 #define PF_KSWAPD	0x00040000	/* I am kswapd */
+#define PF_NO_MEMCG_OOM	0x00080000	/* Memcg OOM could lead to a deadlock */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
diff --git a/mm/filemap.c b/mm/filemap.c
index 556858c..58a316b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2617,7 +2617,9 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 
 	mutex_lock(&inode->i_mutex);
 	blk_start_plug(&plug);
+	current->flags |= PF_NO_MEMCG_OOM;
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
+	current->flags &= ~PF_NO_MEMCG_OOM;
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0 || ret == -EIOCBQUEUED) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..128b615 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,6 +2397,14 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
+	if (printk_ratelimit())
+		printk(KERN_WARNING"%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p."
+				" If this message shows up very often for the"
+				" same task then there is a risk that the"
+				" process is not able to make any progress"
+				" because of the current limit. Try to enlarge"
+				" the hard limit.\n", __FUNCTION__,
+				current->comm, current->pid, memcg);
 	return -ENOMEM;
 bypass:
 	*ptr = NULL;
@@ -2703,7 +2711,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	bool oom = true;
+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
 	int ret;
 
 	if (PageTransHuge(page)) {
@@ -2770,6 +2778,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
 	struct mem_cgroup *memcg = NULL;
 	int ret;
 
@@ -2782,7 +2791,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		mm = &init_mm;
 
 	if (page_is_file_cache(page)) {
-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true);
+		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom);
 		if (ret || !memcg)
 			return ret;
 
@@ -2818,6 +2827,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 				 struct page *page,
 				 gfp_t mask, struct mem_cgroup **ptr)
 {
+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
 	struct mem_cgroup *memcg;
 	int ret;
 
@@ -2840,13 +2850,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!memcg)
 		goto charge_cur_mm;
 	*ptr = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom);
 	css_put(&memcg->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, 1, ptr, true);
+	return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom);
 }
 
 static void
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 168+ messages in thread
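The manual intervention described in the patch changelog above — raising the group's hard limit so the blocked writer can finish its charge and drop i_mutex — can be sketched as a shell snippet. This is a minimal sketch: `memory.limit_in_bytes` is the real memcg v1 interface file, but the snippet below deliberately uses a throwaway mock directory instead of a live `/sys/fs/cgroup/memory/<group>` mount (whose path varies by distribution) so it can run anywhere.

```sh
#!/bin/sh
# Sketch of the manual unblocking step: double the memcg hard limit so a
# writer stuck charging under OOM can complete and release i_mutex.
# MEMCG would normally point at /sys/fs/cgroup/memory/<group>; here a
# mock directory stands in so the snippet is runnable on any machine.
MEMCG=$(mktemp -d)
echo 346030080 > "$MEMCG/memory.limit_in_bytes"	# the 330M limit from this thread

raise_limit() {
	# Double the hard limit of the group whose directory is $1.
	old=$(cat "$1/memory.limit_in_bytes")
	echo $((old * 2)) > "$1/memory.limit_in_bytes"
	echo "raised $1 from $old to $((old * 2))"
}

raise_limit "$MEMCG"
cat "$MEMCG/memory.limit_in_bytes"
```

On a real system the same write to the group's `memory.limit_in_bytes` is all the intervention consists of; no restart is needed.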
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-06 16:00 ` [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Michal Hocko
@ 2013-02-08  5:03   ` azurIt
  2013-02-08  9:44     ` Michal Hocko
  0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2013-02-08  5:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

Michal, thank you very much, but it just didn't work and broke everything :( This is what happened:

The problem started to occur really often immediately after booting the new kernel, every few minutes for one of my users. But everything else seemed to work fine, so i gave it a try for a day (which was a mistake). I grabbed some data for you and went to sleep:
http://watchdog.sk/lkml/memcg-bug-4.tar.gz

Few hours later i was woken up from my sweet sweet dreams by alert smses - Apache wasn't working and our system failed to restart it. When i observed the situation, two apache processes (of that user as above) were still running and it wasn't possible to kill them in any way. I grabbed some data for you:
http://watchdog.sk/lkml/memcg-bug-5.tar.gz

Then i logged in to the console and this was waiting for me:
http://watchdog.sk/lkml/error.jpg

Finally i rebooted into a different kernel, wrote this e-mail and went to my lovely bed ;)

______________________________________________________________
> Od: "Michal Hocko" <mhocko@suse.cz>
> Komu: azurIt <azurit@pobox.sk>
> Dátum: 06.02.2013 17:00
> Predmet: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
>
> CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, "Johannes Weiner" <hannes@cmpxchg.org>
>On Wed 06-02-13 15:22:19, Michal Hocko wrote:
>> On Wed 06-02-13 15:01:19, Michal Hocko wrote:
>> > On Wed 06-02-13 02:17:21, azurIt wrote:
>> > > >5-memcg-fix-1.patch is not complete. 
It doesn't contain the folloup I >> > > >mentioned in a follow up email. Here is the full patch: >> > > >> > > >> > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >> > > http://www.watchdog.sk/lkml/oom_mysqld6 >> > >> > [...] >> > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >> > Hardware name: S5000VSA >> > gfp_mask:4304 nr_pages:1 oom:0 ret:2 >> > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> > Call Trace: >> > [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 >> > [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 >> > [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 >> > [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 >> > [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 >> > [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 >> > [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 >> > [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 >> > [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 >> > [<ffffffff810eab18>] __do_fault+0x78/0x5a0 >> > [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 >> > [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 >> > [<ffffffff810f2508>] ? vma_link+0x88/0xe0 >> > [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 >> > [<ffffffff8102709d>] do_page_fault+0x13d/0x460 >> > [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 >> > [<ffffffff815b61ff>] page_fault+0x1f/0x30 >> > ---[ end trace 8817670349022007 ]--- >> > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >> > apache2 cpuset=uid mems_allowed=0 >> > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> > Call Trace: >> > [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 >> > [<ffffffff810ccc2f>] ? 
find_lock_task_mm+0x2f/0x70 >> > [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 >> > [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 >> > [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 >> > [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 >> > [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 >> > [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 >> > [<ffffffff815b61ff>] page_fault+0x1f/0x30 >> > >> > The first trace comes from the debugging WARN and it clearly points to >> > a file fault path. __do_fault pre-charges a page in case we need to >> > do CoW (copy-on-write) for the returned page. This one falls back to >> > memcg OOM and never returns ENOMEM as I have mentioned earlier. >> > However, the fs fault handler (filemap_fault here) can fallback to >> > page_cache_read if the readahead (do_sync_mmap_readahead) fails >> > to get page to the page cache. And we can see this happening in >> > the first trace. page_cache_read then calls add_to_page_cache_lru >> > and eventually gets to add_to_page_cache_locked which calls >> > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should >> > happen. This ENOMEM gets to the fault handler and kaboom. >> > >> > So the fix is really much more complex than I thought. Although >> > add_to_page_cache_locked sounded like a good place it turned out to be >> > not in fact. >> > >> > We need something more clever appaerently. One way would be not misusing >> > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 >> > bits for those flags in gfp_t so there should be some room there. >> > Or we could do this per task flag, same we do for NO_IO in the current >> > -mm tree. >> > The later one seems easier wrt. gfp_mask passing horror - e.g. >> > __generic_file_aio_write doesn't pass flags and it can be called from >> > unlocked contexts as well. >> >> Ouch, PF_ flags space seem to be drained already because >> task_struct::flags is just unsigned int so there is just one bit left. 
I >> am not sure this is the best use for it. This will be a real pain! > >OK, so this something that should help you without any risk of false >OOMs. I do not believe that something like that would be accepted >upstream because it is really heavy. We will need to come up with >something more clever for upstream. >I have also added a warning which will trigger when the charge fails. If >you see too many of those messages then there is something bad going on >and the lack of OOM causes userspace to loop without getting any >progress. > >So there you go - your personal patch ;) You can drop all other patches. >Please note I have just compile tested it. But it should be pretty >trivial to check it is correct >--- From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001 >From: Michal Hocko <mhocko@suse.cz> >Date: Wed, 6 Feb 2013 16:45:07 +0100 >Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set > >memcg oom killer might deadlock if the process which falls down to >mem_cgroup_handle_oom holds a lock which prevents other task to >terminate because it is blocked on the very same lock. >This can happen when a write system call needs to allocate a page but >the allocation hits the memcg hard limit and there is nothing to reclaim >(e.g. there is no swap or swap limit is hit as well and all cache pages >have been reclaimed already) and the process selected by memcg OOM >killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> >Process A >[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex >[<ffffffff81121c90>] do_last+0x250/0xa30 >[<ffffffff81122547>] path_openat+0xd7/0x440 >[<ffffffff811229c9>] do_filp_open+0x49/0xa0 >[<ffffffff8110f7d6>] do_sys_open+0x106/0x240 >[<ffffffff8110f950>] sys_open+0x20/0x30 >[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >[<ffffffffffffffff>] 0xffffffffffffffff > >Process B >[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 >[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 >[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 >[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 >[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 >[<ffffffff81193a18>] ext3_write_begin+0x88/0x270 >[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 >[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 >[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >[<ffffffff8111156a>] do_sync_write+0xea/0x130 >[<ffffffff81112183>] vfs_write+0xf3/0x1f0 >[<ffffffff81112381>] sys_write+0x51/0x90 >[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >[<ffffffffffffffff>] 0xffffffffffffffff > >This is not a hard deadlock though because administrator can still >intervene and increase the limit on the group which helps the writer to >finish the allocation and release the lock. > >This patch heals the problem by forbidding OOM from dangerous context. >Memcg charging code has no way to find out whether it is called from a >locked context we have to help it via process flags. PF_OOM_ORIGIN flag >removed recently will be reused for PF_NO_MEMCG_OOM which signals that >the memcg OOM killer could lead to a deadlock. >Only locked callers of __generic_file_aio_write are currently marked. 
I >am pretty sure there are more places (I didn't check shmem and hugetlb >uses fancy instantion mutex during page fault and filesystems might >use some locks during the write) but I've ignored those as this will >probably be just a user specific patch without any way to get upstream >in the current form. > >Reported-by: azurIt <azurit@pobox.sk> >Signed-off-by: Michal Hocko <mhocko@suse.cz> >--- > drivers/staging/pohmelfs/inode.c | 2 ++ > include/linux/sched.h | 1 + > mm/filemap.c | 2 ++ > mm/memcontrol.c | 18 ++++++++++++++---- > 4 files changed, 19 insertions(+), 4 deletions(-) > >diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c >index 7a19555..523de82e 100644 >--- a/drivers/staging/pohmelfs/inode.c >+++ b/drivers/staging/pohmelfs/inode.c >@@ -921,7 +921,9 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf, > if (ret) > goto err_out_unlock; > >+ current->flags |= PF_NO_MEMCG_OOM; > ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos); >+ current->flags &= ~PF_NO_MEMCG_OOM; > *ppos = kiocb.ki_pos; > > mutex_unlock(&inode->i_mutex); >diff --git a/include/linux/sched.h b/include/linux/sched.h >index 1e86bb4..f275c8f 100644 >--- a/include/linux/sched.h >+++ b/include/linux/sched.h >@@ -1781,6 +1781,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * > #define PF_FROZEN 0x00010000 /* frozen for system suspend */ > #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ > #define PF_KSWAPD 0x00040000 /* I am kswapd */ >+#define PF_NO_MEMCG_OOM 0x00080000 /* Memcg OOM could lead to a deadlock */ > #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ >diff --git a/mm/filemap.c b/mm/filemap.c >index 556858c..58a316b 100644 >--- a/mm/filemap.c >+++ b/mm/filemap.c >@@ -2617,7 +2617,9 @@ ssize_t 
generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, > > mutex_lock(&inode->i_mutex); > blk_start_plug(&plug); >+ current->flags |= PF_NO_MEMCG_OOM; > ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); >+ current->flags &= ~PF_NO_MEMCG_OOM; > mutex_unlock(&inode->i_mutex); > > if (ret > 0 || ret == -EIOCBQUEUED) { >diff --git a/mm/memcontrol.c b/mm/memcontrol.c >index c8425b1..128b615 100644 >--- a/mm/memcontrol.c >+++ b/mm/memcontrol.c >@@ -2397,6 +2397,14 @@ done: > return 0; > nomem: > *ptr = NULL; >+ if (printk_ratelimit()) >+ printk(KERN_WARNING"%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p." >+ " If this message shows up very often for the" >+ " same task then there is a risk that the" >+ " process is not able to make any progress" >+ " because of the current limit. Try to enlarge" >+ " the hard limit.\n", __FUNCTION__, >+ current->comm, current->pid, memcg); > return -ENOMEM; > bypass: > *ptr = NULL; >@@ -2703,7 +2711,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, > struct mem_cgroup *memcg = NULL; > unsigned int nr_pages = 1; > struct page_cgroup *pc; >- bool oom = true; >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > int ret; > > if (PageTransHuge(page)) { >@@ -2770,6 +2778,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > gfp_t gfp_mask) > { >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > struct mem_cgroup *memcg = NULL; > int ret; > >@@ -2782,7 +2791,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > mm = &init_mm; > > if (page_is_file_cache(page)) { >- ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); >+ ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); > if (ret || !memcg) > return ret; > >@@ -2818,6 +2827,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, > gfp_t mask, struct 
mem_cgroup **ptr) > { >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > struct mem_cgroup *memcg; > int ret; > >@@ -2840,13 +2850,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > if (!memcg) > goto charge_cur_mm; > *ptr = memcg; >- ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true); >+ ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); > css_put(&memcg->css); > return ret; > charge_cur_mm: > if (unlikely(!mm)) > mm = &init_mm; >- return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); >+ return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); > } > > static void >-- >1.7.10.4 > >-- >Michal Hocko >SUSE Labs > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-08 5:03 ` azurIt @ 2013-02-08 9:44 ` Michal Hocko 2013-02-08 11:02 ` azurIt 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-02-08 9:44 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 08-02-13 06:03:04, azurIt wrote: > Michal, thank you very much but it just didn't work and broke > everything :( I am sorry to hear that. The patch should help to solve the deadlock you have seen earlier. It in no way can solve side effects of failing writes and it also cannot help much if the oom is permanent. > This happened: > Problem started to occur really often immediately after booting the > new kernel, every few minutes for one of my users. But everything > other seems to work fine so i gave it a try for a day (which was a > mistake). I grabbed some data for you and go to sleep: > http://watchdog.sk/lkml/memcg-bug-4.tar.gz Do you have logs from that time period? I have only glanced through the stacks and most of the threads are waiting in the mem_cgroup_handle_oom (mostly from the page fault path where we do not have other options than waiting) which suggests that your memory limit is seriously underestimated. If you look at the number of charging failures (memory.failcnt per-group file) then you will get 9332083 failures in _average_ per group. This is a lot! Not all those failures end with OOM, of course. But it clearly signals that the workload need much more memory than the limit allows. > Few hours later i was woke up from my sweet sweet dreams by alerts > smses - Apache wasn't working and our system failed to restart > it. When i observed the situation, two apache processes (of that user > as above) were still running and it wasn't possible to kill them by > any way. 
I grabbed some data for you:
> http://watchdog.sk/lkml/memcg-bug-5.tar.gz

There are only 5 groups in this one and all of them have no memory
charged (so no OOM going on). All tasks are somewhere in the ptrace
code.

grep cache -r .
./1360297489/memory.stat:cache 0
./1360297489/memory.stat:total_cache 65642496
./1360297491/memory.stat:cache 0
./1360297491/memory.stat:total_cache 65642496
./1360297492/memory.stat:cache 0
./1360297492/memory.stat:total_cache 65642496
./1360297490/memory.stat:cache 0
./1360297490/memory.stat:total_cache 65642496
./1360297488/memory.stat:cache 0
./1360297488/memory.stat:total_cache 65642496

which suggests that this is a parent group and the memory is charged in
a child group. I guess all of those are under OOM, as the numbers suggest
they have a limit of 62M.

> Then I logged to the console and this was waiting for me:
> http://watchdog.sk/lkml/error.jpg

This is just a warning and it should be harmless. There is just one WARN
in ptrace_check_attach:
	WARN_ON_ONCE(task_is_stopped(child))
This has been introduced by
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=321fb561
and the commit description claims this shouldn't happen. I am not familiar
with this code, but it sounds like a bug in the tracing code which is not
related to the discussed issue.

> Finally i rebooted into different kernel, wrote this e-mail and go to
> my lovely bed ;)

-- 
Michal Hocko
SUSE Labs

-- 
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 168+ messages in thread
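The inference above — a group whose local `cache` counter is zero while `total_cache` is not must be holding its charges in a child group — can be expressed as a small mechanical check. This is a sketch assuming the memcg v1 `memory.stat` format (local counters plus hierarchical `total_*` counters); it builds a mock `memory.stat` with the values from memcg-bug-5.tar.gz rather than reading a live cgroup, so it runs anywhere.

```sh
#!/bin/sh
# Decide whether a memcg's page-cache charges are local or live in its
# children, by comparing the local counter ("cache") with the
# hierarchical one ("total_cache") in memory.stat.
# A mock memory.stat mirroring the thread's data stands in for
# /sys/fs/cgroup/memory/<group>/memory.stat.
G=$(mktemp -d)
cat > "$G/memory.stat" <<'EOF'
cache 0
rss 0
total_cache 65642496
total_rss 0
EOF

local_cache=$(awk '$1 == "cache" {print $2}' "$G/memory.stat")
total_cache=$(awk '$1 == "total_cache" {print $2}' "$G/memory.stat")

if [ "$local_cache" -eq 0 ] && [ "$total_cache" -gt 0 ]; then
	echo "charges live in a child group ($total_cache bytes below this group)"
else
	echo "charges are local to this group"
fi
```

Pointed at a real hierarchy, the same comparison immediately distinguishes a parent group (all charges in `total_*`) from the child actually hitting its limit.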
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-08 9:44 ` Michal Hocko @ 2013-02-08 11:02 ` azurIt [not found] ` <20130208120249.FD733220-Rm0zKEqwvD4@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2013-02-08 11:02 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner > >Do you have logs from that time period? > >I have only glanced through the stacks and most of the threads are >waiting in the mem_cgroup_handle_oom (mostly from the page fault path >where we do not have other options than waiting) which suggests that >your memory limit is seriously underestimated. If you look at the number >of charging failures (memory.failcnt per-group file) then you will get >9332083 failures in _average_ per group. This is a lot! >Not all those failures end with OOM, of course. But it clearly signals >that the workload need much more memory than the limit allows. What type of logs? I have all. Memory usage graph: http://www.watchdog.sk/lkml/memory2.png New kernel was booted about 1:15. Data in memcg-bug-4.tar.gz were taken about 2:35 and data in memcg-bug-5.tar.gz about 5:25. There was always lots of free memory. Higher memory consumption between 3:39 and 5:33 was caused by data backup and was completed few minutes before i restarted the server (this was just a coincidence). >There are only 5 groups in this one and all of them have no memory >charged (so no OOM going on). All tasks are somewhere in the ptrace >code. It's all from the same cgroup but from different time. >grep cache -r . 
>./1360297489/memory.stat:cache 0
>./1360297489/memory.stat:total_cache 65642496
>./1360297491/memory.stat:cache 0
>./1360297491/memory.stat:total_cache 65642496
>./1360297492/memory.stat:cache 0
>./1360297492/memory.stat:total_cache 65642496
>./1360297490/memory.stat:cache 0
>./1360297490/memory.stat:total_cache 65642496
>./1360297488/memory.stat:cache 0
>./1360297488/memory.stat:total_cache 65642496
>
>which suggests that this is a parent group and the memory is charged in
>a child group. I guess that all those are under OOM as the number seems
>like they have limit at 62M.

The cgroup has a limit of 330M (346030080 bytes). As i said, these two
processes were stuck and it was impossible to kill them. They were,
maybe, the processes which i was trying to 'strace' before - 'strace'
was frozen, as always when the cgroup has this problem, and i killed it
(i was just trying to see if it is the original cgroup problem).

-- 
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 168+ messages in thread
[parent not found: <20130208120249.FD733220-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set [not found] ` <20130208120249.FD733220-Rm0zKEqwvD4@public.gmane.org> @ 2013-02-08 12:38 ` Michal Hocko 2013-02-08 13:56 ` azurIt 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-02-08 12:38 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 08-02-13 12:02:49, azurIt wrote: > > > >Do you have logs from that time period? > > > >I have only glanced through the stacks and most of the threads are > >waiting in the mem_cgroup_handle_oom (mostly from the page fault path > >where we do not have other options than waiting) which suggests that > >your memory limit is seriously underestimated. If you look at the number > >of charging failures (memory.failcnt per-group file) then you will get > >9332083 failures in _average_ per group. This is a lot! > >Not all those failures end with OOM, of course. But it clearly signals > >that the workload need much more memory than the limit allows. > > > What type of logs? I have all. kernel log would be sufficient. > Memory usage graph: > http://www.watchdog.sk/lkml/memory2.png > > New kernel was booted about 1:15. Data in memcg-bug-4.tar.gz were taken about 2:35 and data in memcg-bug-5.tar.gz about 5:25. There was always lots of free memory. Higher memory consumption between 3:39 and 5:33 was caused by data backup and was completed few minutes before i restarted the server (this was just a coincidence). > > > > >There are only 5 groups in this one and all of them have no memory > >charged (so no OOM going on). All tasks are somewhere in the ptrace > >code. > > > It's all from the same cgroup but from different time. > > > > >grep cache -r . 
> >./1360297489/memory.stat:cache 0
> >./1360297489/memory.stat:total_cache 65642496
> >./1360297491/memory.stat:cache 0
> >./1360297491/memory.stat:total_cache 65642496
> >./1360297492/memory.stat:cache 0
> >./1360297492/memory.stat:total_cache 65642496
> >./1360297490/memory.stat:cache 0
> >./1360297490/memory.stat:total_cache 65642496
> >./1360297488/memory.stat:cache 0
> >./1360297488/memory.stat:total_cache 65642496
> >
> >which suggests that this is a parent group and the memory is charged in
> >a child group. I guess that all those are under OOM as the number seems
> >like they have limit at 62M.
>
> The cgroup has limit 330M (346030080 bytes).

This limit is for the top-level group, right? Those seem to be children,
which have 62MB charged - is that a limit for those children?

> As i said, these two processes

Which are those two processes?

> were stucked and was impossible to kill them. They were,
> maybe, the processes which i was trying to 'strace' before - 'strace'
> was freezed as always when the cgroup has this problem and i killed it
> (i was just trying if it is the original cgroup problem).

I have no idea what the role of strace is here.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 12:38 ` Michal Hocko
@ 2013-02-08 13:56   ` azurIt
  [not found]     ` <20130208145616.FB78CE24-Rm0zKEqwvD4@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2013-02-08 13:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>kernel log would be sufficient.

Full kernel log from the kernel with your newest patch:
http://watchdog.sk/lkml/kern2.log

>This limit is for top level groups, right? Those seem to children which
>have 62MB charged - is that a limit for those children?

It was the limit for the parent cgroup, and the processes were in one (the
same) child cgroup. The child cgroup has no memory limit set (so the limit
for the parent was also the limit for the child - 330 MB).

>Which are those two processes?

Data are inside memcg-bug-5.tar.gz in directories bug/<timestamp>/<pids>/

>I have no idea what is the strace role here.

I was stracing exactly two processes from that cgroup, and exactly two
processes were stuck later and it was impossible to kill them. Both of
them were waiting on 'ptrace_stop'. Maybe it's completely unrelated, just
guessing.

azur

-- 
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 168+ messages in thread
[parent not found: <20130208145616.FB78CE24-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set [not found] ` <20130208145616.FB78CE24-Rm0zKEqwvD4@public.gmane.org> @ 2013-02-08 14:47 ` Michal Hocko 2013-02-08 15:24 ` Michal Hocko 1 sibling, 0 replies; 168+ messages in thread From: Michal Hocko @ 2013-02-08 14:47 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 08-02-13 14:56:16, azurIt wrote: > Data are inside memcg-bug-5.tar.gz in directories bug/<timestamp>/<pids>/ ohh, I didn't get those were timestamp directories. It makes more sense now. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  [not found] ` <20130208145616.FB78CE24-Rm0zKEqwvD4@public.gmane.org>
  2013-02-08 14:47   ` Michal Hocko
@ 2013-02-08 15:24   ` Michal Hocko
  [not found]     ` <20130208152402.GD7557-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  1 sibling, 1 reply; 168+ messages in thread
From: Michal Hocko @ 2013-02-08 15:24 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist,
	KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 08-02-13 14:56:16, azurIt wrote:
> >kernel log would be sufficient.
>
> Full kernel log from kernel with you newest patch:
> http://watchdog.sk/lkml/kern2.log

OK, so the log says that there is a little slaughter on your yard:

$ grep "Memory cgroup out of memory:" kern2.log | wc -l
220
$ grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@' | sort -u | wc -l
220

Which means that the OOM killer didn't try to kill any task more than
once, which is good because it tells us that the killed task managed to
die before we triggered OOM again. So this is definitely not a deadlock.
You are just hitting OOM very often.

$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
      1 Task in /1091/uid killed as a result of limit of /1091
      1 Task in /1223/uid killed as a result of limit of /1223
      1 Task in /1229/uid killed as a result of limit of /1229
      1 Task in /1255/uid killed as a result of limit of /1255
      1 Task in /1424/uid killed as a result of limit of /1424
      1 Task in /1470/uid killed as a result of limit of /1470
      1 Task in /1567/uid killed as a result of limit of /1567
      2 Task in /1080/uid killed as a result of limit of /1080
      3 Task in /1381/uid killed as a result of limit of /1381
      4 Task in /1185/uid killed as a result of limit of /1185
      4 Task in /1289/uid killed as a result of limit of /1289
      4 Task in /1709/uid killed as a result of limit of /1709
      5 Task in /1279/uid killed as a result of limit of /1279
      6 Task in /1020/uid killed as a result of limit of /1020
      6 Task in /1527/uid killed as a result of limit of /1527
      9 Task in /1388/uid killed as a result of limit of /1388
     17 Task in /1281/uid killed as a result of limit of /1281
     22 Task in /1599/uid killed as a result of limit of /1599
     30 Task in /1155/uid killed as a result of limit of /1155
     31 Task in /1258/uid killed as a result of limit of /1258
     71 Task in /1293/uid killed as a result of limit of /1293

So the group 1293 suffers the most. I would check how much memory the
workload in the group really needs, because this level of OOM cannot
possibly be healthy.

The log also says that the deadlock prevention implemented by the patch
triggered and some writes really failed due to a potential OOM:

$ grep "If this message shows up" kern2.log
Feb  8 01:17:10 server01 kernel: [  431.033593] __mem_cgroup_try_charge: task:apache2 pid:6733 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb  8 01:22:52 server01 kernel: [  773.556782] __mem_cgroup_try_charge: task:apache2 pid:12092 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb  8 01:22:52 server01 kernel: [  773.567916] __mem_cgroup_try_charge: task:apache2 pid:12093 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb  8 01:29:00 server01 kernel: [ 1141.355693] __mem_cgroup_try_charge: task:apache2 pid:17734 got ENOMEM without OOM for memcg:ffff88036e956e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb  8 03:30:39 server01 kernel: [ 8440.346811] __mem_cgroup_try_charge: task:apache2 pid:8687 got ENOMEM without OOM for memcg:ffff8803654d6e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.

This doesn't look very unhealthy. I had expected that writes would fail
more often, but it seems that the biggest memory pressure comes from
mmaps and page faults, which have no way out other than OOM.

So my suggestion would be to reconsider the limits for the groups to
provide a more realistic environment.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 168+ messages in thread
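The grep/sed pipelines from the analysis above can be collected into one reusable script. A sketch, assuming the 3.2-era memcg OOM message formats quoted in this thread; it runs against a small embedded log sample (fabricated timestamps and PIDs, for illustration only) instead of the real kern2.log so the output is reproducible.

```sh
#!/bin/sh
# Summarize memcg OOM activity in a kernel log: total kills, unique
# killed PIDs (equal counts mean no task was killed twice, i.e. no
# deadlock), and per-group kill counts.
# The embedded sample stands in for a real log such as kern2.log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Feb  8 01:00:01 server01 kernel: [  100.0] Task in /1293/uid killed as a result of limit of /1293
Feb  8 01:00:01 server01 kernel: [  100.1] Memory cgroup out of memory: Kill process 1111 (apache2) score 5 or sacrifice child
Feb  8 01:05:02 server01 kernel: [  400.0] Task in /1293/uid killed as a result of limit of /1293
Feb  8 01:05:02 server01 kernel: [  400.1] Memory cgroup out of memory: Kill process 2222 (apache2) score 5 or sacrifice child
Feb  8 01:09:03 server01 kernel: [  640.0] Task in /1155/uid killed as a result of limit of /1155
Feb  8 01:09:03 server01 kernel: [  640.1] Memory cgroup out of memory: Kill process 3333 (apache2) score 5 or sacrifice child
EOF

echo "total OOM kills: $(grep -c 'Memory cgroup out of memory:' "$LOG")"
echo "unique killed pids: $(grep 'Memory cgroup out of memory:' "$LOG" \
	| sed 's@.*Kill process \([0-9]*\) .*@\1@' | sort -u | wc -l)"
echo "kills per group:"
grep 'killed as a result of limit' "$LOG" | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
```

Replacing `$LOG` with a real log path reproduces the exact pipelines used in the message above; if "total OOM kills" ever exceeds "unique killed pids", some task failed to die after being selected, which would point back at a deadlock.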
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  [not found] ` <20130208152402.GD7557-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-02-08 15:58   ` azurIt
  [not found]     ` <20130208165805.8908B143-Rm0zKEqwvD4@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2013-02-08 15:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
      linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist,
      KAMEZAWA Hiroyuki, Johannes Weiner

>Which means that the oom killer didn't try to kill any task more than
>once which is good because it tells us that the killed task manages to
>die before we trigger oom again. So this is definitely not a deadlock.
>You are just hitting OOM very often.
>
>$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
>      1 Task in /1091/uid killed as a result of limit of /1091
>      1 Task in /1223/uid killed as a result of limit of /1223
>      1 Task in /1229/uid killed as a result of limit of /1229
>      1 Task in /1255/uid killed as a result of limit of /1255
>      1 Task in /1424/uid killed as a result of limit of /1424
>      1 Task in /1470/uid killed as a result of limit of /1470
>      1 Task in /1567/uid killed as a result of limit of /1567
>      2 Task in /1080/uid killed as a result of limit of /1080
>      3 Task in /1381/uid killed as a result of limit of /1381
>      4 Task in /1185/uid killed as a result of limit of /1185
>      4 Task in /1289/uid killed as a result of limit of /1289
>      4 Task in /1709/uid killed as a result of limit of /1709
>      5 Task in /1279/uid killed as a result of limit of /1279
>      6 Task in /1020/uid killed as a result of limit of /1020
>      6 Task in /1527/uid killed as a result of limit of /1527
>      9 Task in /1388/uid killed as a result of limit of /1388
>     17 Task in /1281/uid killed as a result of limit of /1281
>     22 Task in /1599/uid killed as a result of limit of /1599
>     30 Task in /1155/uid killed as a result of limit of /1155
>     31 Task in /1258/uid killed as a result of limit of /1258
>     71 Task in /1293/uid killed as a result of limit of /1293
>
>So the group 1293 suffers the most. I would check how much memory the
>workload in the group really needs because this level of OOM cannot
>possibly be healthy.

I took the kernel log from yesterday from the same time frame:

$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
      1 Task in /1252/uid killed as a result of limit of /1252
      1 Task in /1709/uid killed as a result of limit of /1709
      2 Task in /1185/uid killed as a result of limit of /1185
      2 Task in /1388/uid killed as a result of limit of /1388
      2 Task in /1567/uid killed as a result of limit of /1567
      2 Task in /1650/uid killed as a result of limit of /1650
      3 Task in /1527/uid killed as a result of limit of /1527
      5 Task in /1552/uid killed as a result of limit of /1552
   1634 Task in /1258/uid killed as a result of limit of /1258

As you can see, there were many more OOMs in '1258' and no such problems
as tonight (well, there were never such problems before :) ). As I said,
cgroup 1258 was freezing every few minutes with your latest patch, so
there must be something wrong (it usually freezes about once per day).
And it was really frozen (I checked that); the symptoms were:
 - cannot strace any of the cgroup's processes
 - no new processes were started, still the same processes were 'running'
 - the kernel was unable to resolve this on its own
 - all processes together were taking 100% CPU
 - the whole memory limit was used
(see memcg-bug-4.tar.gz for more info)

Unfortunately I forgot to check whether killing only a few of the
processes would resolve it (I always killed them all yesterday night).
I don't know if it was in a deadlock or not, but the kernel was
definitely unable to resolve the problem. And there is still the
mystery of the two frozen processes which cannot be killed.

By the way, I KNOW that so much OOM is not healthy but the client simply
doesn't want to buy more memory.
He knows about the problem of the insufficient memory limit.

Thank you.

azur

^ permalink raw reply	[flat|nested] 168+ messages in thread
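Stracing a task that is blocked in the kernel can itself get stuck, as seen above; sampling /proc is a non-intrusive alternative, since reading these files does not stop the target the way ptrace attachment does. A hedged sketch (the pid is this shell's own, purely for illustration; `/proc/<pid>/stack` is usually readable only by root):

```shell
# Sample a suspected-stuck task's state from /proc without attaching:
# kernel stack, wait channel, and status.
pid=$$                      # illustrative; use the stuck task's pid
for f in stack wchan status; do
    echo "== /proc/$pid/$f =="
    cat "/proc/$pid/$f" 2>/dev/null || echo "(unreadable - may need root)"
    echo
done
```

Repeating this in a loop gives the same kind of snapshot series as the memcg-bug archives discussed later in the thread.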
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  [not found]     ` <20130208165805.8908B143-Rm0zKEqwvD4@public.gmane.org>
@ 2013-02-08 17:10       ` Michal Hocko
  2013-02-08 21:02         ` azurIt
  0 siblings, 1 reply; 168+ messages in thread
From: Michal Hocko @ 2013-02-08 17:10 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
      linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist,
      KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 08-02-13 16:58:05, azurIt wrote:
[...]
> I took the kernel log from yesterday from the same time frame:
>
> $ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
>       1 Task in /1252/uid killed as a result of limit of /1252
>       1 Task in /1709/uid killed as a result of limit of /1709
>       2 Task in /1185/uid killed as a result of limit of /1185
>       2 Task in /1388/uid killed as a result of limit of /1388
>       2 Task in /1567/uid killed as a result of limit of /1567
>       2 Task in /1650/uid killed as a result of limit of /1650
>       3 Task in /1527/uid killed as a result of limit of /1527
>       5 Task in /1552/uid killed as a result of limit of /1552
>    1634 Task in /1258/uid killed as a result of limit of /1258
>
> As you can see, there were many more OOMs in '1258' and no such
> problems as tonight (well, there were never such problems before :) ).

Well, all the patch does is prevent the deadlock we have seen earlier.
Previously the writer would block on the OOM wait queue; now it fails
with ENOMEM instead. The caller sees this as a short write which can be
retried (it is a question whether userspace can cope with that
properly). All other OOMs are preserved. I suspect that all the problems
you are seeing now are just side effects of the OOM conditions.

> As I said, cgroup 1258 was freezing every few minutes with your
> latest patch, so there must be something wrong (it usually freezes
> about once per day).
> And it was really frozen (I checked that); the
> symptoms were:

I assume you have checked that the killed processes eventually die,
right?

> - cannot strace any of the cgroup's processes
> - no new processes were started, still the same processes were 'running'
> - the kernel was unable to resolve this on its own
> - all processes together were taking 100% CPU
> - the whole memory limit was used
> (see memcg-bug-4.tar.gz for more info)

Well, I do not see anything suspicious during that time period (the
timestamps translate to between Fri Feb 8 02:34:05 and Fri Feb 8
02:36:48). The kernel log shows a lot of OOM during that time. All
killed processes die eventually.

> Unfortunately I forgot to check whether killing only a few of the
> processes would resolve it (I always killed them all yesterday
> night). I don't know if it was in a deadlock or not, but the kernel
> was definitely unable to resolve the problem.

Nothing shows it would be a deadlock so far. It is well possible that
the userspace went mad when seeing a lot of processes dying because it
doesn't expect it.

> And there is still the mystery of the two frozen processes which
> cannot be killed.
>
> By the way, I KNOW that so much OOM is not healthy but the client
> simply doesn't want to buy more memory. He knows about the problem of
> the insufficient memory limit.

Well, then you would see a permanent flood of OOM killing, I am afraid.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 168+ messages in thread
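The translation of bracketed dmesg uptime stamps into wall-clock times done in this thread is simply boot time plus the stamp. A hedged sketch (the boot epoch below is an invented illustrative value, not the real server's; GNU `date` with `-d @epoch` is assumed):

```shell
# Convert a dmesg "[ seconds.usec ]" uptime stamp to wall-clock time.
# boot_epoch is an assumed, made-up boot time; on a live machine it can
# be derived as `date +%s` minus the first field of /proc/uptime.
boot_epoch=1360281425               # illustrative epoch of boot
stamp=8440.346811                   # from "[ 8440.346811]" in the log
epoch=$(awk -v b="$boot_epoch" -v s="$stamp" 'BEGIN { printf "%d", b + s }')
date -u -d "@$epoch" '+%b %e %H:%M:%S'
```

awk is used for the addition because the stamp is fractional and POSIX shell arithmetic is integer-only.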
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 17:10       ` Michal Hocko
@ 2013-02-08 21:02         ` azurIt
  [not found]           ` <20130208220243.EDEE0825-Rm0zKEqwvD4@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2013-02-08 21:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist,
      KAMEZAWA Hiroyuki, Johannes Weiner

>I assume you have checked that the killed processes eventually die,
>right?

When I killed them by hand, yes, they disappeared from the process list
(I saw it). I don't know if they really died when the OOM killer killed
them.

>Well, I do not see anything suspicious during that time period
>(timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8
>02:36:48). The kernel log shows a lot of oom during that time. All
>killed processes die eventually.

No, they didn't die from OOM when the cgroup was frozen. Just check the
PIDs from memcg-bug-4.tar.gz and try to find them in the kernel log.
Why are all PIDs waiting in 'mem_cgroup_handle_oom' while there is no
OOM message in the log? The data in memcg-bug-4.tar.gz cover only 2
minutes, but I let it run for about 15-20 minutes and not a single
process was killed by OOM. I'm 100% sure that OOM was not killing them
(maybe it was trying to, but it didn't happen).

>Nothing shows it would be a deadlock so far. It is well possible that
>the userspace went mad when seeing a lot of processes dying because it
>doesn't expect it.

Lots of processes are dying also now, without your latest patch, and no
such things are happening. I'm sure there is something more to this;
maybe it revealed another bug?

azur

^ permalink raw reply	[flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  [not found]           ` <20130208220243.EDEE0825-Rm0zKEqwvD4@public.gmane.org>
@ 2013-02-10 15:03             ` Michal Hocko
  [not found]               ` <20130210150310.GA9504-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: Michal Hocko @ 2013-02-10 15:03 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
      linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist,
      KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 08-02-13 22:02:43, azurIt wrote:
> >I assume you have checked that the killed processes eventually die,
> >right?
>
> When I killed them by hand, yes, they disappeared from the process
> list (I saw it). I don't know if they really died when the OOM killer
> killed them.
>
> >Well, I do not see anything suspicious during that time period
> >(timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8
> >02:36:48). The kernel log shows a lot of oom during that time. All
> >killed processes die eventually.
>
> No, they didn't die from OOM when the cgroup was frozen. Just check
> the PIDs from memcg-bug-4.tar.gz and try to find them in the kernel
> log.

OK, you seem to be right. My initial examination showed that each
cgroup under OOM was able to move forward - in other words it was able
to send SIGKILL to somebody and we didn't loop on a single task which
cannot die for some reason.
Now when looking closer it seems we really have 2 tasks which didn't die
after being killed by the OOM killer:

$ for i in `grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`; do find bug -name $i; done | sed 's@.*/@@' | sort | uniq -c
    141 18211
    141 8102

$ md5sum bug/*/18211/stack | cut -d" " -f1 | uniq -c
    141 3b8ce17e82a065a24ee046112033e1e8

So all the stacks are the same:
[<ffffffff81069f94>] ptrace_stop+0x114/0x290
[<ffffffff8106a198>] ptrace_do_notify+0x88/0xa0
[<ffffffff8106a203>] ptrace_notify+0x53/0x70
[<ffffffff8100d168>] syscall_trace_enter+0xf8/0x1c0
[<ffffffff815b6983>] tracesys+0x71/0xd7
[<ffffffffffffffff>] 0xffffffffffffffff

stuck in the ptrace code. The other task is more interesting:

$ md5sum bug/*/8102/stack | cut -d" " -f1 | sort | uniq -c
    135 042e893c0e6657ed321ea9045e528f3e
      6 dc7e71ce73be2a5c73404b565926e709

All snapshots with 042e893c0e6657ed321ea9045e528f3e are in:
[<ffffffff8110ae51>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110ba83>] T.1149+0x5f3/0x600
[<ffffffff8110bf5c>] mem_cgroup_charge_common+0x6c/0xb0
[<ffffffff8110bfe5>] mem_cgroup_newpage_charge+0x45/0x50
[<ffffffff810ee2a9>] handle_pte_fault+0x609/0x940
[<ffffffff810ee718>] handle_mm_fault+0x138/0x260
[<ffffffff810270bd>] do_page_fault+0x13d/0x460
[<ffffffff815b633f>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

While the others do not show any stack:

$ cat 1360287257/8102/stack
[<ffffffffffffffff>] 0xffffffffffffffff

Which is quite interesting because we are talking about snapshots
starting at 1360287245 (which maps to 02:34:05), but kern2.log tells us
that this process was killed much earlier, at:

Feb 8 01:18:30 server01 kernel: [  511.139921] Task in /1293/uid killed as a result of limit of /1293
[...]
Feb 8 01:18:30 server01 kernel: [  511.229755] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Feb 8 01:18:30 server01 kernel: [  511.230146] [ 8102]  1293  8102   170258    65869   7       0             0 apache2
Feb 8 01:18:30 server01 kernel: [  511.230339] [ 8113]  1293  8113   163756    59442   5       0             0 apache2
Feb 8 01:18:30 server01 kernel: [  511.230528] [ 8116]  1293  8116   170094    65675   2       0             0 apache2
Feb 8 01:18:30 server01 kernel: [  511.230726] [ 8119]  1293  8119   170094    65675   6       0             0 apache2
Feb 8 01:18:30 server01 kernel: [  511.230924] [ 8123]  1293  8123   169070    64612   7       0             0 apache2
Feb 8 01:18:30 server01 kernel: [  511.231132] [ 8124]  1293  8124   170094    65675   5       0             0 apache2
Feb 8 01:18:30 server01 kernel: [  511.231321] [ 8125]  1293  8125   170094    65673   1       0             0 apache2
Feb 8 01:18:30 server01 kernel: [  511.231516] Memory cgroup out of memory: Kill process 8102 (apache2) score 1000 or sacrifice child

This would suggest that the task is hung and cannot be killed, but if
we have a look at the following OOM in the same group 1293, it was
_not_ present in the process list for that group:

Feb 8 01:18:33 server01 kernel: [  514.789550] Task in /1293/uid killed as a result of limit of /1293
[...]
Feb 8 01:18:33 server01 kernel: [  514.893198] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Feb 8 01:18:33 server01 kernel: [  514.893594] [ 8113]  1293  8113   168212    64036   1       0             0 apache2
Feb 8 01:18:33 server01 kernel: [  514.893786] [ 8116]  1293  8116   170258    65870   6       0             0 apache2
Feb 8 01:18:33 server01 kernel: [  514.893976] [ 8119]  1293  8119   170258    65870   7       0             0 apache2
Feb 8 01:18:33 server01 kernel: [  514.894166] [ 8123]  1293  8123   170158    65824   6       0             0 apache2
Feb 8 01:18:33 server01 kernel: [  514.894356] [ 8124]  1293  8124   170258    65870   5       0             0 apache2
Feb 8 01:18:33 server01 kernel: [  514.894547] [ 8125]  1293  8125   170158    65824   1       0             0 apache2
Feb 8 01:18:33 server01 kernel: [  514.894749] [ 8149]  1293  8149   163989    59647   7       0             0 apache2
Feb 8 01:18:33 server01 kernel: [  514.894944] Memory cgroup out of memory: Kill process 8113 (apache2) score 1000 or sacrifice child

This is all _before_ you started collecting stacks and it also says
that 8102 is gone. This all suggests that a) the stack unwinder which
displays /proc/<pid>/stack is somehow confused and doesn't show the
correct stack for this process, and b) the two processes cannot
terminate due to some issue related to ptrace (stracing) the dying
process. The above OOM list doesn't include any processes which have
already released their memory, which would explain why you can still
see it as a member of the group (when looking into the cgroup's tasks
file). My guess would be that there is a bug in ptrace which doesn't
free a reference to the task, so it cannot go away although it has
dropped all its resources already.

> Why are all PIDs waiting in 'mem_cgroup_handle_oom' while there is no
> OOM message in the log?

I am not sure what you mean here but there are

$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
16

OOM killer events during the time you were gathering memcg-bug-4 data.

> The data in memcg-bug-4.tar.gz cover only 2
> minutes, but I let it run for about 15-20 minutes and not a single
> process was killed by OOM.
I can see

$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
57

killed after 02:38:47 when you stopped gathering data for memcg-bug-4.

> I'm 100% sure that OOM was not killing them (maybe it was trying to,
> but it didn't happen).

OK, let's do a little exercise. The list of processes eligible for OOM
is printed before any task is killed. So if we collect both the pid
lists and the "Kill process" messages per pid, then no entry for a pid
should be present in any pid list after that specific pid was killed.

$ mkdir out
$ for i in `grep "Memory cgroup out of memory: Kill process" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`
do
	grep -e "Memory cgroup out of memory: Kill process $i" \
	     -e "\[ *\<$i\]" kern2.log > out/$i
done
$ for i in out/*
do
	tail -n1 $i | grep "Memory cgroup out of memory:" >/dev/null || echo "$i has already killed tasks"
done
out/6698 has already killed tasks
out/6703 has already killed tasks

OK, so there are two pids which were listed after they had been killed.
Let's have a look at them.
$ cat out/6698
Feb 8 01:17:04 server01 kernel: [  425.497924] [ 6698]  1293  6698   170258    65846   1       0             0 apache2
Feb 8 01:17:05 server01 kernel: [  426.079010] [ 6698]  1293  6698   170258    65846   1       0             0 apache2
Feb 8 01:17:10 server01 kernel: [  431.144460] [ 6698]  1293  6698   169358    65220   1       0             0 apache2
Feb 8 01:17:10 server01 kernel: [  431.146058] Memory cgroup out of memory: Kill process 6698 (apache2) score 1000 or sacrifice child
Feb 8 03:27:57 server01 kernel: [ 8278.439896] [ 6698]  1020  6698   168518    64219   0       0             0 apache2
Feb 8 03:27:57 server01 kernel: [ 8278.879439] [ 6698]  1020  6698   168518    64218   6       0             0 apache2
Feb 8 03:27:59 server01 kernel: [ 8280.023944] [ 6698]  1020  6698   168816    64540   7       0             0 apache2
Feb 8 03:28:02 server01 kernel: [ 8283.242282] [ 6698]  1020  6698   171953    67751   6       0             0 apache2

$ cat out/6703
Feb 8 01:17:04 server01 kernel: [  425.498118] [ 6703]  1293  6703   170258    65844   6       0             0 apache2
Feb 8 01:17:05 server01 kernel: [  426.079206] [ 6703]  1293  6703   170258    65844   6       0             0 apache2
Feb 8 01:17:10 server01 kernel: [  431.144653] [ 6703]  1293  6703   169358    65219   2       0             0 apache2
Feb 8 01:17:10 server01 kernel: [  431.258924] [ 6703]  1293  6703   169358    65219   5       0             0 apache2
Feb 8 01:17:10 server01 kernel: [  431.260282] Memory cgroup out of memory: Kill process 6703 (apache2) score 1000 or sacrifice child
Feb 8 03:27:57 server01 kernel: [ 8278.440043] [ 6703]  1020  6703   166286    61978   7       0             0 apache2
Feb 8 03:27:57 server01 kernel: [ 8278.879587] [ 6703]  1020  6703   166286    61977   7       0             0 apache2
Feb 8 03:27:59 server01 kernel: [ 8280.024091] [ 6703]  1020  6703   166484    62233   7       0             0 apache2
Feb 8 03:28:02 server01 kernel: [ 8283.242429] [ 6703]  1020  6703   167402    63118   0       0             0 apache2

The lists have the following columns:
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name

As we can see, the uid changed for both pids after they were killed
(from 1293 to 1020), which suggests that each pid was later reused for
a different user (a clear sign that those pids died) - and thus a
different group in your setup. So those two died as well, apparently.

> >Nothing shows it would be a deadlock so far. It is well possible
> >that the userspace went mad when seeing a lot of processes dying
> >because it doesn't expect it.
>
> Lots of processes are dying also now, without your latest patch, and
> no such things are happening. I'm sure there is something more to
> this; maybe it revealed another bug?

So far nothing shows that there would be anything broken wrt. the memcg
OOM killer. The ptrace issue sounds strange, all right, but that is
another story and worth a separate investigation. I would be interested
whether you still see anything wrong going on without strace in the
game. You can get a pretty nice overview of what is going on wrt. OOM
from the log.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 168+ messages in thread
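The pid-reuse argument above lends itself to a mechanical check: the same pid shown with two different uids across OOM task dumps means the first incarnation died and the pid was recycled. A hedged sketch - the two sample lines are copied from the out/6698 listing above, and the sed extraction assumes the "[ pid] uid ..." dump layout shown there:

```shell
# Flag pids whose uid changes between successive OOM task-dump lines.
# dump-sample.log holds two lines taken from the thread for illustration.
log=dump-sample.log
cat > "$log" <<'EOF'
Feb  8 01:17:04 server01 kernel: [  425.497924] [ 6698]  1293  6698   170258    65846   1       0             0 apache2
Feb  8 03:27:57 server01 kernel: [ 8278.439896] [ 6698]  1020  6698   168518    64219   0       0             0 apache2
EOF
# Extract "pid uid" pairs, then report any pid seen with a new uid.
sed -n 's@.*\] \[ *\([0-9]*\)\] *\([0-9]*\).*@\1 \2@p' "$log" \
    | awk '{
        if ($1 in seen && seen[$1] != $2)
            printf "pid %s reused: uid %s -> %s\n", $1, seen[$1], $2
        seen[$1] = $2
      }'
```

With the sample data this prints `pid 6698 reused: uid 1293 -> 1020`, matching the conclusion drawn above.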
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  [not found]               ` <20130210150310.GA9504-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-02-10 16:46                 ` azurIt
  2013-02-11 11:22                   ` Michal Hocko
  0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2013-02-10 16:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
      linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist,
      KAMEZAWA Hiroyuki, Johannes Weiner

>stuck in the ptrace code.

But this happens _after_ the cgroup was frozen and I tried to strace
one of its processes (to see what was happening):

Feb 8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0

>> Why are all PIDs waiting in 'mem_cgroup_handle_oom' while there is
>> no OOM message in the log?
>
>I am not sure what you mean here but there are
>$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
>16
>
>OOM killer events during the time you were gathering memcg-bug-4 data.
>
>> The data in memcg-bug-4.tar.gz cover only 2
>> minutes, but I let it run for about 15-20 minutes and not a single
>> process was killed by OOM.
>
>I can see
>$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
>57
>
>killed after 02:38:47 when you stopped gathering data for memcg-bug-4

I meant that not a single process was killed inside cgroup 1258 (the
data from this cgroup are in memcg-bug-4.tar.gz).

Just take the data from memcg-bug-4.tar.gz, which were collected from
cgroup 1258. Almost all processes are in 'mem_cgroup_handle_oom', so
the cgroup is under OOM. I assume that this is supposed to take only a
few seconds while the kernel finds a process and kills it (and maybe
does it again until enough memory is freed).
I was gathering the data for about two and a half minutes and NO SINGLE
process was killed (just compare the list of PIDs from the first and
the last directory inside memcg-bug-4.tar.gz). Even more, not a single
process was killed in cgroup 1258 even after I stopped gathering the
data. You can also take the list of PIDs from memcg-bug-4.tar.gz and
you will find only 18211 and 8102 (which are the two stuck processes).

So my question is: why was no process killed inside cgroup 1258 while
it was under OOM? It was under OOM for at least two and a half minutes
while I was gathering the data (then I let it run for an additional
~10 minutes and then killed the processes by hand, but I cannot prove
this). Why didn't the kernel kill any process for so long and end the
OOM?

Btw, the processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping
between these two stacks (I pasted only the first line of each stack):
mem_cgroup_handle_oom+0x241/0x3b0
0xffffffffffffffff

Some of them are in 'poll_schedule_timeout' and then they start to loop
as above. Is this correct behavior? For example, run (first line of the
stack of process 7710 from all timestamps):

for i in */7710/stack; do head -n1 $i; done

^ permalink raw reply	[flat|nested] 168+ messages in thread
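Snapshot archives like memcg-bug-4 (one directory per timestamp, one stack file per pid) can be collected with a short loop. A hedged sketch - the pid list here is just this shell's own pid and the output directory name is made up; in practice the pids would be read from the cgroup's tasks file as root:

```shell
# Periodically snapshot /proc/<pid>/stack for a set of pids,
# mirroring the memcg-bug layout: <outdir>/<timestamp>/<pid>/stack.
# pids="$$" is illustrative; normally something like:
#   pids=$(cat /cgroup/1258/tasks)
pids="$$"
outdir=bug-sample
for n in 1 2; do                    # bounded to two rounds for the demo
    ts=$(date +%s.%N)               # GNU date; %N keeps names unique
    for pid in $pids; do
        mkdir -p "$outdir/$ts/$pid"
        cat "/proc/$pid/stack" > "$outdir/$ts/$pid/stack" 2>/dev/null \
            || echo "(stack unreadable - needs root)" > "$outdir/$ts/$pid/stack"
    done
    sleep 0.2
done
ls "$outdir" | wc -l
```

Unlike strace, this never attaches to the target tasks, so it cannot wedge them the way the ptrace attempt above did.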
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-10 16:46                 ` azurIt
@ 2013-02-11 11:22                   ` Michal Hocko
  [not found]                     ` <20130211112240.GC19922-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2013-02-22 12:00                     ` [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set azurIt
  0 siblings, 2 replies; 168+ messages in thread
From: Michal Hocko @ 2013-02-11 11:22 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist,
      KAMEZAWA Hiroyuki, Johannes Weiner

On Sun 10-02-13 17:46:19, azurIt wrote:
> >stuck in the ptrace code.
>
> But this happens _after_ the cgroup was frozen and I tried to strace
> one of its processes (to see what was happening):
>
> Feb 8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0

Hmmm,

Feb 8 01:39:16 server01 kernel: [ 1757.266678] Memory cgroup out of memory: Kill process 18211 (apache2) score 725 or sacrifice child

So the process was killed 10 minutes after that, and this was really
the last OOM event for group /1258:

$ grep "Task in /1258/uid killed" kern2.log | tail -n2
Feb 8 01:39:16 server01 kernel: [ 1757.045021] Task in /1258/uid killed as a result of limit of /1258
Feb 8 01:39:16 server01 kernel: [ 1757.167984] Task in /1258/uid killed as a result of limit of /1258

But this was still before you started collecting the data for
memcg-bug-4 (2:34), so unfortunately we do not know what the previous
stack was.

> >> Why are all PIDs waiting in 'mem_cgroup_handle_oom' while there is
> >> no OOM message in the log?
> >
> >I am not sure what you mean here but there are
> >$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
> >16
> >
> >OOM killer events during the time you were gathering memcg-bug-4 data.
> >> The data in memcg-bug-4.tar.gz cover only 2
> >> minutes, but I let it run for about 15-20 minutes and not a single
> >> process was killed by OOM.
> >
> >I can see
> >$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
> >57
> >
> >killed after 02:38:47 when you stopped gathering data for memcg-bug-4
>
> I meant that not a single process was killed inside cgroup 1258 (the
> data from this cgroup are in memcg-bug-4.tar.gz).
>
> Just take the data from memcg-bug-4.tar.gz, which were collected from
> cgroup 1258.

Are you sure about that? When I extracted all the pids from the
timestamp directories and grepped them in the log, I got this:

$ for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log ; done
Feb 8 01:31:02 server01 kernel: [ 1263.429212] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:31:15 server01 kernel: [ 1276.655241] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:32:29 server01 kernel: [ 1350.797835] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:32:42 server01 kernel: [ 1363.662242] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:32:46 server01 kernel: [ 1367.181798] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:32:46 server01 kernel: [ 1367.381627] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:32:46 server01 kernel: [ 1367.490896] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:33:02 server01 kernel: [ 1383.709652] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:36:26 server01 kernel: [ 1587.458967] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:36:26 server01 kernel: [ 1587.558419] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:36:26 server01 kernel: [ 1587.652474] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:39:02 server01 kernel: [ 1743.107086] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:39:16 server01 kernel: [ 1757.015359] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:39:16 server01 kernel: [ 1757.133998] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:39:16 server01 kernel: [ 1757.262992] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb 8 01:18:12 server01 kernel: [  493.156641] [ 7888]  1293  7888   169326    64876   3       0             0 apache2
Feb 8 01:18:12 server01 kernel: [  493.269129] [ 7888]  1293  7888   169390    64876   4       0             0 apache2
Feb 8 01:18:21 server01 kernel: [  502.384221] [ 8011]  1293  8011   170094    65675   5       0             0 apache2
Feb 8 01:18:24 server01 kernel: [  505.052600] [ 8011]  1293  8011   170260    65854   2       0             0 apache2
Feb 8 01:18:24 server01 kernel: [  505.200454] [ 8011]  1293  8011   170260    65854   2       0             0 apache2
Feb 8 01:18:33 server01 kernel: [  514.538637] [ 8054]  1258  8054   164404    60618   1       0             0 apache2
Feb 8 01:18:30 server01 kernel: [  511.230146] [ 8102]  1293  8102   170258    65869   7       0             0 apache2

So at least 7888, 8011 and 8102 were from a different group (1293). The
others were never listed in any eligible process list, which is a bit
unexpected. It is also unfortunate because I cannot match them to their
groups from the log.

$ for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log >/dev/null || echo "$i not listed" ; done
7265 not listed
7474 not listed
7710 not listed
7969 not listed
7988 not listed
7997 not listed
8000 not listed
8014 not listed
8016 not listed
8019 not listed
8057 not listed
8058 not listed
8059 not listed
8063 not listed
8064 not listed
8066 not listed
8067 not listed
8069 not listed
8070 not listed
8071 not listed
8072 not listed
8075 not listed
8091 not listed
8092 not listed
8094 not listed
8098 not listed
8099 not listed
8100 not listed

Are you sure all of them belong to the 1258 group?

> Almost all processes are in 'mem_cgroup_handle_oom', so the cgroup
> is under OOM.

You are right, almost all of them are waiting in mem_cgroup_handle_oom,
which suggests that they should be listed in the per-group eligible
task list. One way this might happen is when the process which manages
to take the oom_lock has a fatal signal pending. Then we wouldn't get
to oom_kill_process and no OOM messages would get printed.
This is correct because such a task would terminate soon anyway and all
the waiters would wake up eventually. If not enough memory were freed,
another task would take the oom_lock and this one would trigger the OOM
(unless it had a fatal signal pending as well).

Another option would be that no task could be selected - e.g. because
select_bad_process sees a TIF_MEMDIE-marked task - one already killed
by the OOM killer but that wasn't able to terminate for some reason.
18211 could be such a task, but we do not know what was going on with
it before strace attached to it.

Finally, it is possible that the OOM header (everything up to "Kill
process") was suppressed because of rate limiting. But

$ grep -B1 "Kill process" kern2.log
Feb 8 01:15:02 server01 kernel: [  304.000402] [ 4969]  1258  4969   163761    59554   6       0             0 apache2
Feb 8 01:15:02 server01 kernel: [  304.000649] Memory cgroup out of memory: Kill process 4816 (apache2) score 1000 or sacrifice child
--
Feb 8 01:15:51 server01 kernel: [  352.924573] [ 5847]  1709  5847   163433    58952   6       0             0 apache2
Feb 8 01:15:51 server01 kernel: [  352.924761] Memory cgroup out of memory: Kill process 5212 (apache2) score 1000 or sacrifice child
[...]

says that the message was always preceded by a process list, so we can
exclude rate limiting.

> I assume that this is supposed to take only a few seconds while the
> kernel finds a process and kills it (and maybe does it again until
> enough memory is freed). I was gathering the data for about two and a
> half minutes and NO SINGLE process was killed (just compare the list
> of PIDs from the first and the last directory inside
> memcg-bug-4.tar.gz). Even more, not a single process was killed in
> cgroup 1258 even after I stopped gathering the data. You can also
> take the list of PIDs from memcg-bug-4.tar.gz and you will find only
> 18211 and 8102 (which are the two stuck processes).
>
> So my question is: why was no process killed inside cgroup 1258 while
> it was under OOM?
I would bet that there is something weird going on with pid 18211, but
I do not have enough information to find out what and why.

> It was under OOM for at least two and a half minutes while I was
> gathering the data (then I let it run for an additional ~10 minutes
> and then killed the processes by hand, but I cannot prove this). Why
> didn't the kernel kill any process for so long and end the OOM?

As already mentioned above, select_bad_process doesn't select any task
if there is one which is on its way out. Maybe this is what is going on
here.

> Btw, the processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping
> between these two stacks (I pasted only the first line of each stack):
> mem_cgroup_handle_oom+0x241/0x3b0
> 0xffffffffffffffff

0xffffffffffffffff is just a bogus entry. No idea why this happens.

> Some of them are in 'poll_schedule_timeout' and then they start to
> loop as above. Is this correct behavior?
> For example, run (first line of the stack of process 7710 from all
> timestamps): for i in */7710/stack; do head -n1 $i; done

Yes, this is perfectly OK, because that task starts with:

$ cat bug/1360287245/7710/stack
[<ffffffff81125eb9>] poll_schedule_timeout+0x49/0x70
[<ffffffff8112675b>] do_sys_poll+0x54b/0x680
[<ffffffff81126b4c>] sys_poll+0x7c/0xf0
[<ffffffff815b6866>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

and then later on it gets into the OOM path because of a page fault:

$ cat bug/1360287250/7710/stack
[<ffffffff8110ae51>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110ba83>] T.1149+0x5f3/0x600
[<ffffffff8110bf5c>] mem_cgroup_charge_common+0x6c/0xb0
[<ffffffff8110bfe5>] mem_cgroup_newpage_charge+0x45/0x50
[<ffffffff810eca1e>] do_wp_page+0x14e/0x800
[<ffffffff810edf04>] handle_pte_fault+0x264/0x940
[<ffffffff810ee718>] handle_mm_fault+0x138/0x260
[<ffffffff810270bd>] do_page_fault+0x13d/0x460
[<ffffffff815b633f>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

And it loops in it until the end, which is possible as well if the
group is
under permanent OOM condition and the task is not selected to be killed. Unfortunately I am not able to reproduce this behavior even if I try to hammer OOM like mad so I am afraid I cannot help you much without further debugging patches. I do realize that experimenting in your environment is a problem but I do not have many options left. Please do not use strace and rather collect /proc/pid/stack instead. It would also be helpful to get the group/tasks file to have a full list of tasks in the group. --- From 1139745d43cc8c56bc79c219291d1e5281799dd4 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Mon, 11 Feb 2013 12:18:36 +0100 Subject: [PATCH] oom: debug skipping killing --- mm/oom_kill.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..3d759f0 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -329,6 +329,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, if (test_tsk_thread_flag(p, TIF_MEMDIE)) { if (unlikely(frozen(p))) thaw_process(p); + printk(KERN_WARNING"XXX: pid:%d (flags:%u) is TIF_MEMDIE. Waiting for it\n", + p->pid, p->flags); return ERR_PTR(-1UL); } if (!p->mm) @@ -353,8 +355,11 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, * then wait for it to finish before killing * some other task unnecessarily. */ - if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) + if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) { + printk(KERN_WARNING"XXX: pid:%d (flags:%u) is PF_EXITING. Waiting for it\n", + p->pid, p->flags); return ERR_PTR(-1UL); + } } } @@ -494,6 +499,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * its children or threads, just set TIF_MEMDIE so it can die quickly */ if (p->flags & PF_EXITING) { + printk(KERN_WARNING"XXX: pid:%d (flags:%u).
Not killing PF_EXITING\n", p->pid, p->flags); set_tsk_thread_flag(p, TIF_MEMDIE); return 0; } @@ -567,6 +573,8 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) * its memory. */ if (fatal_signal_pending(current)) { + printk(KERN_WARNING"XXX: pid:%d (flags:%u) has fatal_signal_pending. Waiting for it\n", + p->pid, p->flags); set_thread_flag(TIF_MEMDIE); return; } -- 1.7.10.4 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 168+ messages in thread
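The data collection requested above (the group's tasks file plus each task's /proc/<pid>/stack, instead of strace) can be scripted roughly like this. This is only a sketch: the cgroup v1 mount point /sys/fs/cgroup/memory and the group id are assumptions, not taken from azurIt's actual setup.

```shell
#!/bin/sh
# Sketch: copy the group's full task list, then snapshot the kernel
# stack of every task via /proc/<pid>/stack. Unlike strace, this does
# not attach to the target tasks.
collect_stacks() {
    group="$1"    # e.g. /sys/fs/cgroup/memory/1258 (assumed v1 layout)
    out="$2"      # e.g. bug/$(date +%s)
    mkdir -p "$out"
    cp "$group/tasks" "$out/tasks"           # full list of tasks in the group
    while read -r pid; do
        mkdir -p "$out/$pid"
        # the task may exit between the list and the read, hence the fallback
        cat "/proc/$pid/stack" > "$out/$pid/stack" 2>/dev/null || true
    done < "$out/tasks"
}
```

Invoked once per second while the group is stuck, e.g. `collect_stacks /sys/fs/cgroup/memory/1258 bug/$(date +%s)`, this produces the timestamped per-pid layout seen in the memcg-bug tarballs.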
[parent not found: <20130211112240.GC19922-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set [not found] ` <20130211112240.GC19922-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-02-22 8:23 ` azurIt [not found] ` <20130222092332.4001E4B6-Rm0zKEqwvD4@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2013-02-22 8:23 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Unfortunately I am not able to reproduce this behavior even if I try >to hammer OOM like mad so I am afraid I cannot help you much without >further debugging patches. >I do realize that experimenting in your environment is a problem but I >do not many options left. Please do not use strace and rather collect >/proc/pid/stack instead. It would be also helpful to get group/tasks >file to have a full list of tasks in the group Hi Michal, sorry that I didn't respond for a while. Today I installed the kernel with your two patches and I'm running it now. I'm still having problems with the OOM handling, which is not able to handle low memory and is not killing processes. Here is some info: - data from cgroup 1258 while it was under OOM and no processes were killed (so the OOM handling didn't stop and the cgroup was frozen) http://watchdog.sk/lkml/memcg-bug-6.tar.gz I noticed the problem at about 8:39 and waited until 8:57 (nothing happened). Then I killed process 19864, which seemed to help, and the other processes probably exited and the cgroup started to work. But the problem occurred again about 20 seconds later, so I killed all processes at 8:58. The problem has been occurring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs.
azur ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20130222092332.4001E4B6-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set [not found] ` <20130222092332.4001E4B6-Rm0zKEqwvD4@public.gmane.org> @ 2013-02-22 12:52 ` Michal Hocko 2013-02-22 12:54 ` azurIt 2013-06-06 16:04 ` Michal Hocko 1 sibling, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-02-22 12:52 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner Hi, On Fri 22-02-13 09:23:32, azurIt wrote: [...] > sorry that i didn't response for a while. Today i installed kernel > with your two patches and i'm running it now. I am not sure how much time I'll have for this today but just to make sure we are on the same page, could you point me to the two patches you have applied in the mean time? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-22 12:52 ` Michal Hocko @ 2013-02-22 12:54 ` azurIt [not found] ` <20130222135442.ADFFF498-Rm0zKEqwvD4@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2013-02-22 12:54 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >I am not sure how much time I'll have for this today but just to make >sure we are on the same page, could you point me to the two patches you >have applied in the mean time? Here: http://watchdog.sk/lkml/patches2 ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20130222135442.ADFFF498-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set [not found] ` <20130222135442.ADFFF498-Rm0zKEqwvD4@public.gmane.org> @ 2013-02-22 13:00 ` Michal Hocko 0 siblings, 0 replies; 168+ messages in thread From: Michal Hocko @ 2013-02-22 13:00 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 22-02-13 13:54:42, azurIt wrote: > >I am not sure how much time I'll have for this today but just to make > >sure we are on the same page, could you point me to the two patches you > >have applied in the mean time? > > > Here: > http://watchdog.sk/lkml/patches2 OK, looks correct. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set [not found] ` <20130222092332.4001E4B6-Rm0zKEqwvD4@public.gmane.org> 2013-02-22 12:52 ` Michal Hocko @ 2013-06-06 16:04 ` Michal Hocko 2013-06-06 16:16 ` azurIt 1 sibling, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-06-06 16:04 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner Hi, I am really sorry it took so long but I was constantly preempted by other stuff. I hope I have good news for you, though. Johannes has found a nice way to overcome the deadlock issues from memcg OOM which might help you. Would you be willing to test with his patch (http://permalink.gmane.org/gmane.linux.kernel.mm/101437)? Unlike my patch, which handles just the i_mutex case, his patch covers all possible locks. I can backport the patch for your kernel (are you still using the 3.2 kernel or have you moved to a newer one?). On Fri 22-02-13 09:23:32, azurIt wrote: > >Unfortunately I am not able to reproduce this behavior even if I try > >to hammer OOM like mad so I am afraid I cannot help you much without > >further debugging patches. > >I do realize that experimenting in your environment is a problem but I > >do not many options left. Please do not use strace and rather collect > >/proc/pid/stack instead. It would be also helpful to get group/tasks > >file to have a full list of tasks in the group > > > > Hi Michal, > > > sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. Here is some info: > > - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM don't stop and cgroup was freezed) > http://watchdog.sk/lkml/memcg-bug-6.tar.gz > > I noticed problem about on 8:39 and waited until 8:57 (nothing happend).
Then i killed process 19864 which seems to help and other processes probably ends and cgroup started to work. But problem accoured again about 20 seconds later, so i killed all processes at 8:58. The problem is occuring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs. > > > - kernel log from boot until now > http://watchdog.sk/lkml/kern3.gz > > > Btw, something probably happened also at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100). > > > > azur > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-06-06 16:04 ` Michal Hocko @ 2013-06-06 16:16 ` azurIt 2013-06-07 13:11 ` [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Michal Hocko 0 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2013-06-06 16:16 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner Hello Michal, nice to read you! :) Yes, i'm still on 3.2. Could you be so kind and try to backport it? Thank you very much! azur ______________________________________________________________ > Od: "Michal Hocko" <mhocko@suse.cz> > Komu: azurIt <azurit@pobox.sk> > Dátum: 06.06.2013 18:04 > Predmet: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set > > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, "Johannes Weiner" <hannes@cmpxchg.org> >Hi, > >I am really sorry it took so long but I was constantly preempted by >other stuff. I hope I have a good news for you, though. Johannes has >found a nice way how to overcome deadlock issues from memcg OOM which >might help you. Would you be willing to test with his patch >(http://permalink.gmane.org/gmane.linux.kernel.mm/101437). Unlike my >patch which handles just the i_mutex case his patch solved all possible >locks. > >I can backport the patch for your kernel (are you still using 3.2 kernel >or you have moved to a newer one?). > >On Fri 22-02-13 09:23:32, azurIt wrote: >> >Unfortunately I am not able to reproduce this behavior even if I try >> >to hammer OOM like mad so I am afraid I cannot help you much without >> >further debugging patches. >> >I do realize that experimenting in your environment is a problem but I >> >do not many options left. Please do not use strace and rather collect >> >/proc/pid/stack instead. 
It would be also helpful to get group/tasks >> >file to have a full list of tasks in the group >> >> >> >> Hi Michal, >> >> >> sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. Here is some info: >> >> - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM don't stop and cgroup was freezed) >> http://watchdog.sk/lkml/memcg-bug-6.tar.gz >> >> I noticed problem about on 8:39 and waited until 8:57 (nothing happend). Then i killed process 19864 which seems to help and other processes probably ends and cgroup started to work. But problem accoured again about 20 seconds later, so i killed all processes at 8:58. The problem is occuring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs. >> >> >> - kernel log from boot until now >> http://watchdog.sk/lkml/kern3.gz >> >> >> Btw, something probably happened also at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100). >> >> >> >> azur >> -- >> To unsubscribe from this list: send the line "unsubscribe cgroups" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >-- >Michal Hocko >SUSE Labs > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 168+ messages in thread
* [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-06-06 16:16 ` azurIt @ 2013-06-07 13:11 ` Michal Hocko 2013-06-17 10:21 ` azurIt 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-06-07 13:11 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Thu 06-06-13 18:16:33, azurIt wrote: > Hello Michal, > > nice to read you! :) Yes, i'm still on 3.2. Could you be so kind and > try to backport it? Thank you very much! Here we go. I hope I didn't screw anything (Johannes might double check) because there were quite some changes in the area since 3.2. Nothing earth shattering though. Please note that I have only compile tested this. Also make sure you remove the previous patches you have from me. --- From 9d2801c1f53147ca9134cc5f76ab28d505a37a54 Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@cmpxchg.org> Date: Fri, 7 Jun 2013 13:52:42 +0200 Subject: [PATCH] memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. 
The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff OOM kill victim: [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. 
For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting an OOM and makes sure nobody loops or sleeps on OOM with locks held: 1. When OOMing in a system call (buffered IO and friends), invoke the OOM killer but just return -ENOMEM, never sleep on a OOM waitqueue. Userspace should be able to handle this and it prevents anybody from looping or waiting with locks held. 2. When OOMing in a page fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 3. When detecting an OOM in a page fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. While reworking the OOM routine, also remove a needless OOM waitqueue wakeup when invoking the killer. Only uncharges and limit increases, things that actually change the memory situation, should do wakeups. 
Reported-by: azurIt <azurit@pobox.sk> Debugged-by: Michal Hocko <mhocko@suse.cz> Reported-by: David Rientjes <rientjes@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 22 +++++++ include/linux/mm.h | 1 + include/linux/sched.h | 6 ++ mm/ksm.c | 2 +- mm/memcontrol.c | 149 ++++++++++++++++++++++++++++---------------- mm/memory.c | 40 ++++++++---- mm/oom_kill.c | 2 + 7 files changed, 156 insertions(+), 66 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b87068a..56bfc39 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -120,6 +120,15 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline void mem_cgroup_set_userfault(struct task_struct *p) +{ + p->memcg_oom.in_userfault = 1; +} +static inline void mem_cgroup_clear_userfault(struct task_struct *p) +{ + p->memcg_oom.in_userfault = 0; +} +bool mem_cgroup_oom_synchronize(void); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif @@ -333,6 +342,19 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline void mem_cgroup_set_userfault(struct task_struct *p) +{ +} + +static inline void mem_cgroup_clear_userfault(struct task_struct *p) +{ +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..91380ef 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The
fault task is in SIGKILL killable region */ +#define FAULT_FLAG_KERNEL 0x80 /* kernel-triggered fault (get_user_pages etc.) */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c4f3e9..d521a70 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1568,6 +1568,12 @@ struct task_struct { unsigned long nr_pages; /* uncharged usage */ unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } memcg_batch; + struct memcg_oom_info { + unsigned int in_userfault:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; diff --git a/mm/ksm.c b/mm/ksm.c index 310544a..3295a3b 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma->vm_mm, vma, addr, - FAULT_FLAG_WRITE); + FAULT_FLAG_KERNEL | FAULT_FLAG_WRITE); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..67189b4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -249,6 +249,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; atomic_t refcnt; @@ -1846,6 +1847,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -1857,55 +1859,109 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
+ * try to call OOM killer */ -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) { - struct oom_wait_info owait; - bool locked, need_to_kill; - - owait.mem = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + bool locked, need_to_kill = true; /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || memcg->oom_kill_disable) need_to_kill = false; if (locked) mem_cgroup_oom_notify(memcg); spin_unlock(&memcg_oom_lock); - if (need_to_kill) { - finish_wait(&memcg_oom_waitq, &owait.wait); - mem_cgroup_out_of_memory(memcg, mask); - } else { - schedule(); - finish_wait(&memcg_oom_waitq, &owait.wait); + /* + * A system call can just return -ENOMEM, but if this is a + * page fault and somebody else is handling the OOM already, + * we need to sleep on the OOM waitqueue for this memcg until + * the situation is resolved. Which can take some time + * because it might be handled by a userspace task. + * + * However, this is the charge context, which means that we + * may sit on a large call stack and hold various filesystem + * locks, the mmap_sem etc. and we don't want the OOM handler + * to deadlock on them while we sit here and wait. Store the + * current OOM context in the task_struct, then return + * -ENOMEM. At the end of the page fault handler, with the + * stack unwound, pagefault_out_of_memory() will check back + * with us by calling mem_cgroup_oom_synchronize(), possibly + * putting the task to sleep. 
+ */ + if (current->memcg_oom.in_userfault) { + current->memcg_oom.in_memcg_oom = 1; + /* + * Somebody else is handling the situation. Make sure + * no wakeups are missed between now and going to + * sleep at the end of the page fault. + */ + if (!need_to_kill) { + mem_cgroup_mark_under_oom(memcg); + current->memcg_oom.wakeups = + atomic_read(&memcg->oom_wakeups); + css_get(&memcg->css); + current->memcg_oom.wait_on_memcg = memcg; + } } - spin_lock(&memcg_oom_lock); - if (locked) + + if (need_to_kill) + mem_cgroup_out_of_memory(memcg, mask); + + if (locked) { + spin_lock(&memcg_oom_lock); mem_cgroup_oom_unlock(memcg); - memcg_wakeup_oom(memcg); - spin_unlock(&memcg_oom_lock); + /* + * Sleeping tasks might have been killed, make sure + * they get scheduled so they can exit. + */ + if (need_to_kill) + memcg_oom_recover(memcg); + spin_unlock(&memcg_oom_lock); + } +} - mem_cgroup_unmark_under_oom(memcg); +bool mem_cgroup_oom_synchronize(void) +{ + struct oom_wait_info owait; + struct mem_cgroup *memcg; - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + /* OOM is global, do not handle */ + if (!current->memcg_oom.in_memcg_oom) return false; - /* Give chance to dying process */ - schedule_timeout_uninterruptible(1); + + /* + * We invoked the OOM killer but there is a chance that a kill + * did not free up any charges. Everybody else might already + * be sleeping, so restart the fault and keep the rampage + * going until some charges are released. 
+ */ + memcg = current->memcg_oom.wait_on_memcg; + if (!memcg) + goto out; + + if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + goto out_put; + + owait.mem = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.task_list); + + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + schedule(); + finish_wait(&memcg_oom_waitq, &owait.wait); +out_put: + mem_cgroup_unmark_under_oom(memcg); + css_put(&memcg->css); + current->memcg_oom.wait_on_memcg = NULL; +out: + current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2195,11 +2251,10 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ - CHARGE_OOM_DIE, /* the current is killed because of OOM */ }; static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages, bool oom_check) + unsigned int nr_pages, bool invoke_oom) { unsigned long csize = nr_pages * PAGE_SIZE; struct mem_cgroup *mem_over_limit; @@ -2257,14 +2312,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_wait_acct_move(mem_over_limit)) return CHARGE_RETRY; - /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) - return CHARGE_NOMEM; - /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask)) - return CHARGE_OOM_DIE; + if (invoke_oom) + mem_cgroup_oom(mem_over_limit, gfp_mask); - return CHARGE_RETRY; + return CHARGE_NOMEM; } /* @@ -2349,7 +2400,7 @@ again: } do { - bool oom_check; + bool invoke_oom = oom && !nr_oom_retries; /* If killed, bypass charge */ if (fatal_signal_pending(current)) { @@ -2357,13 +2408,7 @@ again: goto bypass; } - oom_check = false; - if (oom && 
!nr_oom_retries) { - oom_check = true; - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES; - } - - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check); + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom); switch (ret) { case CHARGE_OK: break; @@ -2376,16 +2421,12 @@ again: css_put(&memcg->css); goto nomem; case CHARGE_NOMEM: /* OOM routine works */ - if (!oom) { + if (!oom || invoke_oom) { css_put(&memcg->css); goto nomem; } - /* If oom, we never return -ENOMEM */ nr_oom_retries--; break; - case CHARGE_OOM_DIE: /* Killed by OOM Killer */ - css_put(&memcg->css); - goto bypass; } } while (ret != CHARGE_OK); diff --git a/mm/memory.c b/mm/memory.c index 829d437..bee177c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1720,7 +1720,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, cond_resched(); while (!(page = follow_page(vma, start, foll_flags))) { int ret; - unsigned int fault_flags = 0; + unsigned int fault_flags = FAULT_FLAG_KERNEL; /* For mlock, just skip the stack guard page. */ if (foll_flags & FOLL_MLOCK) { @@ -1842,6 +1842,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, if (!vma || address < vma->vm_start) return -EFAULT; + fault_flags |= FAULT_FLAG_KERNEL; ret = handle_mm_fault(mm, vma, address, fault_flags); if (ret & VM_FAULT_ERROR) { if (ret & VM_FAULT_OOM) @@ -3439,22 +3440,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; - __set_current_state(TASK_RUNNING); - - count_vm_event(PGFAULT); - mem_cgroup_count_vm_event(mm, PGFAULT); - - /* do counter updates before entering really critical section. 
*/ - check_sync_rss_stat(current); - if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); @@ -3503,6 +3496,31 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) +{ + int in_userfault = !(flags & FAULT_FLAG_KERNEL); + int ret; + + __set_current_state(TASK_RUNNING); + + count_vm_event(PGFAULT); + mem_cgroup_count_vm_event(mm, PGFAULT); + + /* do counter updates before entering really critical section. */ + check_sync_rss_stat(current); + + if (in_userfault) + mem_cgroup_set_userfault(current); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (in_userfault) + mem_cgroup_clear_userfault(current); + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..aa60863 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -785,6 +785,8 @@ out: */ void pagefault_out_of_memory(void) { + if (mem_cgroup_oom_synchronize()) + return; if (try_set_system_oom()) { out_of_memory(NULL, 0, 0, NULL); clear_system_oom(); -- 1.7.10.4 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 168+ messages in thread
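For anyone trying to reproduce the memcg OOM pressure Michal mentions hammering, a setup along these lines can be used. This is a hypothetical sketch, not the thread's actual test: it assumes root, a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory, and an arbitrary group name; the worker charges tmpfs pages (which are accounted to the group) well past the limit.

```shell
#!/bin/sh
# Not invoked here: needs root and a v1 memory cgroup mount.
oom_hammer() {
    g=/sys/fs/cgroup/memory/oomtest            # arbitrary group name
    mkdir -p "$g"
    echo $((16 * 1024 * 1024)) > "$g/memory.limit_in_bytes"   # 16M hard limit
    for i in 1 2 3 4; do
        (
            echo $$ > "$g/tasks"               # move this worker into the group
            # tmpfs pages are charged to the memcg: 64M against a 16M limit
            dd if=/dev/zero of=/dev/shm/hog.$i bs=1M count=64 2>/dev/null
        ) &
    done
    wait
    rm -f /dev/shm/hog.*
}
```

Running several such rounds in a loop keeps the group permanently over its limit, which is the condition under which the old charge path would loop in mem_cgroup_handle_oom.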
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-06-07 13:11 ` [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Michal Hocko @ 2013-06-17 10:21 ` azurIt 2013-06-19 13:26 ` Michal Hocko 0 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2013-06-17 10:21 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Here we go. I hope I didn't screw anything (Johannes might double check) >because there were quite some changes in the area since 3.2. Nothing >earth shattering though. Please note that I have only compile tested >this. Also make sure you remove the previous patches you have from me. Hi Michal, it, unfortunately, didn't work. Everything was working fine but original problem is still occuring. I'm unable to send you stacks or more info because problem is taking down the whole server for some time now (don't know what exactly caused it to start happening, maybe newer versions of 3.2.x). But i'm sure of one thing - when problem occurs, nothing is able to access hard drives (every process which tries it is freezed until problem is resolved or server is rebooted). Problem is fixed after killing processes from cgroup which caused it and everything immediatelly starts to work normally. I find this out by keeping terminal opened from another server to one where my problem is occuring quite often and running several apps there (htop, iotop, etc.). When problem occurs, all apps which wasn't working with HDD was ok. The htop proved to be very usefull here because it's only reading proc filesystem and is also able to send KILL signals - i was able to resolve the problem with it without rebooting the server. I created a special daemon (about month ago) which is able to detect and fix the problem so i'm not having server outages now. 
The point was to NOT access anything stored on the HDDs; the daemon only reads info from the cgroup filesystem and sends KILL signals to processes. Maybe i should also read the stack files before killing, i will try it. Btw, which vanilla kernel includes this patch? Thank you and everyone involved very much for your time and help. azur ^ permalink raw reply [flat|nested] 168+ messages in thread
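azurIt's daemon itself is not posted in the thread, but its described behaviour - read only the cgroup filesystem, never touch the disks, and SIGKILL the tasks of any group stuck under OOM - can be sketched roughly as follows. This is a hypothetical illustration, not his actual code; the cgroup-v1 mount point and file layout are assumptions:

```python
import os
import signal

def parse_oom_control(text):
    # memory.oom_control (cgroup v1) looks like:
    #   oom_kill_disable 0
    #   under_oom 1
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.rpartition(" ")
        fields[key] = int(value)
    return fields

def pids_to_kill(oom_control_text, tasks_text):
    # Kill everything in the group, but only if the group is under OOM.
    if parse_oom_control(oom_control_text).get("under_oom") != 1:
        return []
    return [int(pid) for pid in tasks_text.split()]

def sweep(root="/sys/fs/cgroup/memory"):
    # Walk the hierarchy using only cgroupfs reads, so the watchdog
    # keeps working even while disk access is hanging.
    for dirpath, _dirs, _files in os.walk(root):
        try:
            with open(os.path.join(dirpath, "memory.oom_control")) as f:
                oom = f.read()
            with open(os.path.join(dirpath, "tasks")) as f:
                tasks = f.read()
        except OSError:
            continue
        for pid in pids_to_kill(oom, tasks):
            os.kill(pid, signal.SIGKILL)
```

A real daemon would presumably wait for a grace period before killing anything, and the memory controller mount point varies between setups.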
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-06-17 10:21 ` azurIt @ 2013-06-19 13:26 ` Michal Hocko 2013-06-22 20:09 ` azurIt 2013-06-24 16:48 ` azurIt 0 siblings, 2 replies; 168+ messages in thread From: Michal Hocko @ 2013-06-19 13:26 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Mon 17-06-13 12:21:34, azurIt wrote: > >Here we go. I hope I didn't screw anything (Johannes might double check) > >because there were quite some changes in the area since 3.2. Nothing > >earth shattering though. Please note that I have only compile tested > >this. Also make sure you remove the previous patches you have from me. > > > Hi Michal, > > it, unfortunately, didn't work. Everything was working fine but > original problem is still occuring. This would be more than surprising because tasks blocked at memcg OOM don't hold any locks anymore. Maybe I have messed something up during backport but I cannot spot anything. > I'm unable to send you stacks or more info because problem is taking > down the whole server for some time now (don't know what exactly > caused it to start happening, maybe newer versions of 3.2.x). So you are not testing with the same kernel with just the old patch replaced by the new one? > But i'm sure of one thing - when problem occurs, nothing is able to > access hard drives (every process which tries it is freezed until > problem is resolved or server is rebooted). It would be really interesting to see what those tasks are blocked on. > Problem is fixed after killing processes from cgroup which > caused it and everything immediatelly starts to work normally. I > find this out by keeping terminal opened from another server to one > where my problem is occuring quite often and running several apps > there (htop, iotop, etc.). When problem occurs, all apps which wasn't > working with HDD was ok.
The htop proved to be very usefull here > because it's only reading proc filesystem and is also able to send > KILL signals - i was able to resolve the problem with it > without rebooting the server. sysrq+t will give you the list of all tasks and their traces. > I created a special daemon (about month ago) which is able to detect > and fix the problem so i'm not having server outages now. The point > was to NOT access anything which is stored on HDDs, the daemon is > only reading info from cgroup filesystem and sending KILL signals to > processes. Maybe i should be able to also read stack files before > killing, i will try it. > > Btw, which vanilla kernel includes this patch? None yet. But I hope it will be merged to 3.11 and backported to the stable trees. > Thank you and everyone involved very much for time and help. > > azur -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-06-19 13:26 ` Michal Hocko @ 2013-06-22 20:09 ` azurIt [not found] ` <20130622220958.D10567A4-Rm0zKEqwvD4@public.gmane.org> 2013-06-24 16:48 ` azurIt 1 sibling, 1 reply; 168+ messages in thread From: azurIt @ 2013-06-22 20:09 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner Michal, >> I'm unable to send you stacks or more info because problem is taking >> down the whole server for some time now (don't know what exactly >> caused it to start happening, maybe newer versions of 3.2.x). > >So you are not testing with the same kernel with just the old patch >replaced by the new one? No, i'm not testing with the same kernel, but all are 3.2.x. I can't even install an older 3.2.x because grsecurity is only available for the newest kernel and there is no archive of older versions (at least i don't know about any). >> But i'm sure of one thing - when problem occurs, nothing is able to >> access hard drives (every process which tries it is freezed until >> problem is resolved or server is rebooted). > >I would be really interesting to see what those tasks are blocked on. I'm trying to get it, stay tuned :) Today i noticed one bug; i'm not 100% sure it is related to 'your' patch, but i hadn't seen this before. I noticed that i have lots of cgroups which cannot be removed - if i do 'rmdir <cgroup_directory>', it just hangs and never completes. Even worse, it's not possible to access the whole cgroup filesystem until i kill that rmdir (anything which tries just hangs). All unremovable cgroups have this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 And, yes, the 'tasks' file is empty. azur
^ permalink raw reply [flat|nested] 168+ messages in thread
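The symptom azurIt reports has a crisp signature: `under_oom 1` together with an empty `tasks` file, which is exactly the state in which `rmdir` can wait forever. A small hedged sketch of a check for that signature (file contents are passed in as strings; the field layout is an assumption based on the output quoted above):

```python
def is_stuck_under_oom(oom_control_text, tasks_text):
    # A group that is still marked under OOM but has no tasks left
    # can never clear the flag by itself, so rmdir on it would hang.
    under_oom = False
    for line in oom_control_text.strip().splitlines():
        key, _, value = line.rpartition(" ")
        if key == "under_oom":
            under_oom = (value == "1")
    return under_oom and not tasks_text.strip()
```

Feeding it the contents of `memory.oom_control` and `tasks` for each group would flag the unremovable cgroups azurIt describes.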
[parent not found: <20130622220958.D10567A4-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM [not found] ` <20130622220958.D10567A4-Rm0zKEqwvD4@public.gmane.org> @ 2013-06-24 20:13 ` Johannes Weiner [not found] ` <20130624201345.GA21822-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: Johannes Weiner @ 2013-06-24 20:13 UTC (permalink / raw) To: azurIt Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki Hi guys, On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > >> But i'm sure of one thing - when problem occurs, nothing is able to > >> access hard drives (every process which tries it is freezed until > >> problem is resolved or server is rebooted). > > > >I would be really interesting to see what those tasks are blocked on. > > I'm trying to get it, stay tuned :) > > Today i noticed one bug, not 100% sure it is related to 'your' patch > but i didn't seen this before. I noticed that i have lots of cgroups > which cannot be removed - if i do 'rmdir <cgroup_directory>', it > just hangs and never complete. Even more, it's not possible to > access the whole cgroup filesystem until i kill that rmdir > (anything, which tries it, just hangs). All unremoveable cgroups has > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 Somebody acquires the OOM wait reference to the memcg and marks it under oom but then does not call into mem_cgroup_oom_synchronize() to clean up. That's why under_oom is set and the rmdir waits for outstanding references. > And, yes, 'tasks' file is empty. It's not a kernel thread that does it because all kernel-context handle_mm_fault() are annotated properly, which means the task must be userspace and, since tasks is empty, have exited before synchronizing. Can you try with the following patch on top? 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 5db0490..9a0b152 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -846,17 +846,6 @@ static noinline int mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) { - /* - * Pagefault was interrupted by SIGKILL. We have no reason to - * continue pagefault. - */ - if (fatal_signal_pending(current)) { - if (!(fault & VM_FAULT_RETRY)) - up_read(&current->mm->mmap_sem); - if (!(error_code & PF_USER)) - no_context(regs, error_code, address); - return 1; - } if (!(fault & VM_FAULT_ERROR)) return 0; ^ permalink raw reply related [flat|nested] 168+ messages in thread
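Johannes's diagnosis above - someone takes the OOM wait reference and sets under_oom, but exits without ever reaching mem_cgroup_oom_synchronize() - can be pictured with a toy refcount model. This is a simulation for illustration only, not kernel code, and the names are invented:

```python
class ToyMemcg:
    def __init__(self):
        self.oom_refs = 0  # outstanding OOM wait references

    @property
    def under_oom(self):
        return 1 if self.oom_refs else 0

    def fault_hits_oom(self):
        # A charger that hits the limit takes a reference and
        # marks the group under OOM ...
        self.oom_refs += 1

    def oom_synchronize(self):
        # ... and is expected to drop it when the fault finishes.
        self.oom_refs -= 1

    def rmdir_blocks(self):
        # Group teardown waits for all outstanding references.
        return self.oom_refs > 0

memcg = ToyMemcg()
memcg.fault_hits_oom()  # a task OOMs during a page fault
# The buggy early exit in mm_fault_error() skips oom_synchronize(),
# the task dies, and the group stays under_oom with rmdir hanging:
leaked = memcg.under_oom == 1 and memcg.rmdir_blocks()
```

The patch above closes exactly this escape hatch, so the fault path always falls through to the point where the reference is dropped.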
[parent not found: <20130624201345.GA21822-druUgvl0LCNAfugRpC6u6w@public.gmane.org>]
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM [not found] ` <20130624201345.GA21822-druUgvl0LCNAfugRpC6u6w@public.gmane.org> @ 2013-06-28 10:06 ` azurIt [not found] ` <20130628120613.6D6CAD21-Rm0zKEqwvD4@public.gmane.org> 2013-07-09 13:00 ` Michal Hocko 1 sibling, 1 reply; 168+ messages in thread From: azurIt @ 2013-06-28 10:06 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki >It's not a kernel thread that does it because all kernel-context >handle_mm_fault() are annotated properly, which means the task must be >userspace and, since tasks is empty, have exited before synchronizing. > >Can you try with the following patch on top? Michal and Johannes, i have some observations: the original patch from Johannes was really fixing something, but definitely not everything, and it was introducing new problems. I've been running an unpatched kernel since i sent my last message, and problems with freezing cgroups are occurring very often (several times per day) - they were, on the other hand, quite rare with the patch from Johannes. Johannes, i didn't try your last patch yet. I would like to wait until you or Michal look at my last message, which contained detailed information about the freezing of cgroups on a kernel running your original patch (which was supposed to fix it for good). Even more, i would like to hear your opinion about those stuck processes which were holding the web server port and which forced me to reboot a production server in the middle of the day :( more information is in my last message. Thank you very much for your time. azur ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20130628120613.6D6CAD21-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM [not found] ` <20130628120613.6D6CAD21-Rm0zKEqwvD4@public.gmane.org> @ 2013-07-05 18:17 ` Johannes Weiner 2013-07-05 19:02 ` azurIt 0 siblings, 1 reply; 168+ messages in thread From: Johannes Weiner @ 2013-07-05 18:17 UTC (permalink / raw) To: azurIt Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki Hi azurIt, On Fri, Jun 28, 2013 at 12:06:13PM +0200, azurIt wrote: > >It's not a kernel thread that does it because all kernel-context > >handle_mm_fault() are annotated properly, which means the task must be > >userspace and, since tasks is empty, have exited before synchronizing. > > > >Can you try with the following patch on top? > > > Michal and Johannes, > > i have some observations which i made: Original patch from Johannes > was really fixing something but definitely not everything and was > introducing new problems. I'm running unpatched kernel from time i > send my last message and problems with freezing cgroups are occuring > very often (several times per day) - they were, on the other hand, > quite rare with patch from Johannes. That's good! > Johannes, i didn't try your last patch yet. I would like to wait > until you or Michal look at my last message which contained detailed > information about freezing of cgroups on kernel running your > original patch (which was suppose to fix it for good). Even more, i > would like to hear your opinion about that stucked processes which > was holding web server port and which forced me to reboot production > server at the middle of the day :( more information was in my last > message. Thank you very much for your time. I looked at your debug messages but could not find anything that would hint at a deadlock. All tasks are stuck in the refrigerator, so I assume you use the freezer cgroup and enabled it somehow? 
Sorry about your production server locking up, but from the stacks I don't see any connection to the OOM problems you were having... :/ ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-05 18:17 ` Johannes Weiner @ 2013-07-05 19:02 ` azurIt [not found] ` <20130705210246.11D2135A-Rm0zKEqwvD4@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2013-07-05 19:02 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki >I looked at your debug messages but could not find anything that would >hint at a deadlock. All tasks are stuck in the refrigerator, so I >assume you use the freezer cgroup and enabled it somehow? Yes, i'm really using the freezer cgroup, BUT i was checking whether it was causing problems - unfortunately, several days have passed since then and now i don't fully remember if i checked it for both cases (the unremovable cgroups and the frozen processes holding the web server port). I'm 100% sure i checked it for the unremovable cgroups, but i'm not so sure for the other problem (i had to act quickly in that case). Are you sure (from the stacks) that the freezer cgroup was enabled there? Btw, what about those other stacks? I mean this file: http://watchdog.sk/lkml/memcg-bug-7.tar.gz It was taken while running the kernel with your patch and from a cgroup which was under unresolvable OOM (just like my very original problem). Thank you! azur ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20130705210246.11D2135A-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM [not found] ` <20130705210246.11D2135A-Rm0zKEqwvD4@public.gmane.org> @ 2013-07-05 19:18 ` Johannes Weiner 2013-07-07 23:42 ` azurIt 2013-07-14 17:07 ` azurIt 0 siblings, 2 replies; 168+ messages in thread From: Johannes Weiner @ 2013-07-05 19:18 UTC (permalink / raw) To: azurIt Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: > >I looked at your debug messages but could not find anything that would > >hint at a deadlock. All tasks are stuck in the refrigerator, so I > >assume you use the freezer cgroup and enabled it somehow? > > > Yes, i'm really using freezer cgroup BUT i was checking if it's not > doing problems - unfortunately, several days passed from that day > and now i don't fully remember if i was checking it for both cases > (unremoveabled cgroups and these freezed processes holding web > server port). I'm 100% sure i was checking it for unremoveable > cgroups but not so sure for the other problem (i had to act quickly > in that case). Are you sure (from stacks) that freezer cgroup was > enabled there? Yeah, all the traces without exception look like this: 1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160 1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540 1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750 1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80 1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17 1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff so the freezer was already enabled when you took the backtraces. > Btw, what about that other stacks? 
I mean this file: > http://watchdog.sk/lkml/memcg-bug-7.tar.gz > > It was taken while running the kernel with your patch and from > cgroup which was under unresolveable OOM (just like my very original > problem). I looked at these traces too, but none of the tasks are stuck in rmdir or the OOM path. Some /are/ in the page fault path, but they are happily doing reclaim and don't appear to be stuck. So I'm having a hard time matching this data to what you otherwise observed. However, based on what you reported the most likely explanation for the continued hangs is the unfinished OOM handling for which I sent the followup patch for arch/x86/mm/fault.c. ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-05 19:18 ` Johannes Weiner @ 2013-07-07 23:42 ` azurIt 2013-07-09 13:10 ` Michal Hocko 2013-07-14 17:07 ` azurIt 1 sibling, 1 reply; 168+ messages in thread From: azurIt @ 2013-07-07 23:42 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki > CC: "Michal Hocko" <mhocko@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: >> >I looked at your debug messages but could not find anything that would >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I >> >assume you use the freezer cgroup and enabled it somehow? >> >> >> Yes, i'm really using freezer cgroup BUT i was checking if it's not >> doing problems - unfortunately, several days passed from that day >> and now i don't fully remember if i was checking it for both cases >> (unremoveabled cgroups and these freezed processes holding web >> server port). I'm 100% sure i was checking it for unremoveable >> cgroups but not so sure for the other problem (i had to act quickly >> in that case). Are you sure (from stacks) that freezer cgroup was >> enabled there? > >Yeah, all the traces without exception look like this: > >1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160 >1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540 >1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750 >1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80 >1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17 >1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff > >so the freezer was already enabled when you took the backtraces. > >> Btw, what about that other stacks? 
I mean this file: >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz >> >> It was taken while running the kernel with your patch and from >> cgroup which was under unresolveable OOM (just like my very original >> problem). > >I looked at these traces too, but none of the tasks are stuck in rmdir >or the OOM path. Some /are/ in the page fault path, but they are >happily doing reclaim and don't appear to be stuck. So I'm having a >hard time matching this data to what you otherwise observed. > >However, based on what you reported the most likely explanation for >the continued hangs is the unfinished OOM handling for which I sent >the followup patch for arch/x86/mm/fault.c. > Johannes, today I tested both of your patches but the problem with unremovable cgroups, unfortunately, persists. azur ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-07 23:42 ` azurIt @ 2013-07-09 13:10 ` Michal Hocko 2013-07-09 13:19 ` azurIt 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-07-09 13:10 UTC (permalink / raw) To: azurIt Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon 08-07-13 01:42:24, azurIt wrote: > > CC: "Michal Hocko" <mhocko@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> > >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: > >> >I looked at your debug messages but could not find anything that would > >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I > >> >assume you use the freezer cgroup and enabled it somehow? > >> > >> > >> Yes, i'm really using freezer cgroup BUT i was checking if it's not > >> doing problems - unfortunately, several days passed from that day > >> and now i don't fully remember if i was checking it for both cases > >> (unremoveabled cgroups and these freezed processes holding web > >> server port). I'm 100% sure i was checking it for unremoveable > >> cgroups but not so sure for the other problem (i had to act quickly > >> in that case). Are you sure (from stacks) that freezer cgroup was > >> enabled there? > > > >Yeah, all the traces without exception look like this: > > > >1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160 > >1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540 > >1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750 > >1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80 > >1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17 > >1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff > > > >so the freezer was already enabled when you took the backtraces. 
> > > >> Btw, what about that other stacks? I mean this file: > >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz > >> > >> It was taken while running the kernel with your patch and from > >> cgroup which was under unresolveable OOM (just like my very original > >> problem). > > > >I looked at these traces too, but none of the tasks are stuck in rmdir > >or the OOM path. Some /are/ in the page fault path, but they are > >happily doing reclaim and don't appear to be stuck. So I'm having a > >hard time matching this data to what you otherwise observed. Agreed. > >However, based on what you reported the most likely explanation for > >the continued hangs is the unfinished OOM handling for which I sent > >the followup patch for arch/x86/mm/fault.c. > > Johannes, > > today I tested both of your patches but problem with unremovable > cgroups, unfortunately, persists. Is the group empty again, with under_oom marked? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-09 13:10 ` Michal Hocko @ 2013-07-09 13:19 ` azurIt [not found] ` <20130709151921.5160C199-Rm0zKEqwvD4@public.gmane.org> 0 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2013-07-09 13:19 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki >On Mon 08-07-13 01:42:24, azurIt wrote: >> > CC: "Michal Hocko" <mhocko@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> >> >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: >> >> >I looked at your debug messages but could not find anything that would >> >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I >> >> >assume you use the freezer cgroup and enabled it somehow? >> >> >> >> >> >> Yes, i'm really using freezer cgroup BUT i was checking if it's not >> >> doing problems - unfortunately, several days passed from that day >> >> and now i don't fully remember if i was checking it for both cases >> >> (unremoveabled cgroups and these freezed processes holding web >> >> server port). I'm 100% sure i was checking it for unremoveable >> >> cgroups but not so sure for the other problem (i had to act quickly >> >> in that case). Are you sure (from stacks) that freezer cgroup was >> >> enabled there? 
>> > >> >Yeah, all the traces without exception look like this: >> > >> >1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160 >> >1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540 >> >1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750 >> >1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80 >> >1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17 >> >1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff >> > >> >so the freezer was already enabled when you took the backtraces. >> > >> >> Btw, what about that other stacks? I mean this file: >> >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz >> >> >> >> It was taken while running the kernel with your patch and from >> >> cgroup which was under unresolveable OOM (just like my very original >> >> problem). >> > >> >I looked at these traces too, but none of the tasks are stuck in rmdir >> >or the OOM path. Some /are/ in the page fault path, but they are >> >happily doing reclaim and don't appear to be stuck. So I'm having a >> >hard time matching this data to what you otherwise observed. > >Agreed. > >> >However, based on what you reported the most likely explanation for >> >the continued hangs is the unfinished OOM handling for which I sent >> >the followup patch for arch/x86/mm/fault.c. >> >> Johannes, >> >> today I tested both of your patches but problem with unremovable >> cgroups, unfortunately, persists. > >Is the group empty again with marked under_oom? Now i realized that i forgot to remove UID from that cgroup before trying to remove it, so cgroup cannot be removed anyway (we are using third party cgroup called cgroup-uid from Andrea Righi, which is able to associate all user's processes with target cgroup). Look here for cgroup-uid patch: https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was permanently '1'. 
azur ^ permalink raw reply [flat|nested] 168+ messages in thread
[parent not found: <20130709151921.5160C199-Rm0zKEqwvD4@public.gmane.org>]
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM [not found] ` <20130709151921.5160C199-Rm0zKEqwvD4@public.gmane.org> @ 2013-07-09 13:54 ` Michal Hocko 2013-07-10 16:25 ` azurIt 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-07-09 13:54 UTC (permalink / raw) To: azurIt Cc: Johannes Weiner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki On Tue 09-07-13 15:19:21, azurIt wrote: [...] > Now i realized that i forgot to remove UID from that cgroup before > trying to remove it, so cgroup cannot be removed anyway (we are using > third party cgroup called cgroup-uid from Andrea Righi, which is able > to associate all user's processes with target cgroup). Look here for > cgroup-uid patch: > https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > permanently '1'. This is really strange. Could you post the whole diff against stable tree you are using (except for grsecurity stuff and the above cgroup-uid patch)? Btw. the below patch might help us point to the exit path which leaves wait_on_memcg without mem_cgroup_oom_synchronize: --- diff --git a/kernel/exit.c b/kernel/exit.c index e6e01b9..ad472e0 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -895,6 +895,7 @@ NORET_TYPE void do_exit(long code) profile_task_exit(tsk); + WARN_ON(current->memcg_oom.wait_on_memcg); WARN_ON(blk_needs_flush_plug(tsk)); if (unlikely(in_interrupt())) -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-09 13:54 ` Michal Hocko @ 2013-07-10 16:25 ` azurIt 2013-07-11 7:25 ` Michal Hocko 0 siblings, 1 reply; 168+ messages in thread From: azurIt @ 2013-07-10 16:25 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea >> Now i realized that i forgot to remove UID from that cgroup before >> trying to remove it, so cgroup cannot be removed anyway (we are using >> third party cgroup called cgroup-uid from Andrea Righi, which is able >> to associate all user's processes with target cgroup). Look here for >> cgroup-uid patch: >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >> permanently '1'. > >This is really strange. Could you post the whole diff against stable >tree you are using (except for grsecurity stuff and the above cgroup-uid >patch)? Here are all the patches which i applied to kernel 3.2.48 in my last test: http://watchdog.sk/lkml/patches3/ Patches marked as 7-* are from Johannes. I'm applying them in order, except the grsecurity one - it goes first. azur >Btw. the bellow patch might help us to point to the exit path which >leaves wait_on_memcg without mem_cgroup_oom_synchronize: >--- >diff --git a/kernel/exit.c b/kernel/exit.c >index e6e01b9..ad472e0 100644 >--- a/kernel/exit.c >+++ b/kernel/exit.c >@@ -895,6 +895,7 @@ NORET_TYPE void do_exit(long code) > > profile_task_exit(tsk); > >+ WARN_ON(current->memcg_oom.wait_on_memcg); > WARN_ON(blk_needs_flush_plug(tsk)); > > if (unlikely(in_interrupt())) >-- >Michal Hocko >SUSE Labs >
^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-10 16:25 ` azurIt @ 2013-07-11 7:25 ` Michal Hocko 2013-07-13 23:26 ` azurIt 0 siblings, 1 reply; 168+ messages in thread From: Michal Hocko @ 2013-07-11 7:25 UTC (permalink / raw) To: azurIt Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea On Wed 10-07-13 18:25:06, azurIt wrote: > >> Now i realized that i forgot to remove UID from that cgroup before > >> trying to remove it, so cgroup cannot be removed anyway (we are using > >> third party cgroup called cgroup-uid from Andrea Righi, which is able > >> to associate all user's processes with target cgroup). Look here for > >> cgroup-uid patch: > >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > >> > >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > >> permanently '1'. > > > >This is really strange. Could you post the whole diff against stable > >tree you are using (except for grsecurity stuff and the above cgroup-uid > >patch)? > > > > Here are all patches which i applied to kernel 3.2.48 in my last test: > http://watchdog.sk/lkml/patches3/ The two patches from Johannes seem correct. From a quick look even grsecurity patchset shouldn't interfere as it doesn't seem to put any code between handle_mm_fault and mm_fault_error and there also doesn't seem to be any new handle_mm_fault call sites. But I cannot tell there aren't other code paths which would lead to a memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 168+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-11  7:25             ` Michal Hocko
@ 2013-07-13 23:26               ` azurIt
  2013-07-13 23:51                 ` azurIt
  0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2013-07-13 23:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
>On Wed 10-07-13 18:25:06, azurIt wrote:
>> >> Now i realized that i forgot to remove UID from that cgroup before
>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
>> >> to associate all user's processes with target cgroup). Look here for
>> >> cgroup-uid patch:
>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
>> >>
>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
>> >> permanently '1'.
>> >
>> >This is really strange. Could you post the whole diff against stable
>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
>> >patch)?
>>
>> Here are all patches which i applied to kernel 3.2.48 in my last test:
>> http://watchdog.sk/lkml/patches3/
>
>The two patches from Johannes seem correct.
>
>From a quick look even grsecurity patchset shouldn't interfere as it
>doesn't seem to put any code between handle_mm_fault and mm_fault_error
>and there also doesn't seem to be any new handle_mm_fault call sites.
>
>But I cannot tell there aren't other code paths which would lead to a
>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.


Michal,

now i can definitely confirm that the problem with unremovable cgroups persists. What info do you need from me? I applied also your little 'WARN_ON' patch.

azur
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-13 23:26               ` azurIt
@ 2013-07-13 23:51                 ` azurIt
       [not found]                   ` <20130714015112.FFCB7AF7-Rm0zKEqwvD4@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: azurIt @ 2013-07-13 23:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
>> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
>>On Wed 10-07-13 18:25:06, azurIt wrote:
>>> >> Now i realized that i forgot to remove UID from that cgroup before
>>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
>>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
>>> >> to associate all user's processes with target cgroup). Look here for
>>> >> cgroup-uid patch:
>>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
>>> >>
>>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
>>> >> permanently '1'.
>>> >
>>> >This is really strange. Could you post the whole diff against stable
>>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
>>> >patch)?
>>>
>>> Here are all patches which i applied to kernel 3.2.48 in my last test:
>>> http://watchdog.sk/lkml/patches3/
>>
>>The two patches from Johannes seem correct.
>>
>>From a quick look even grsecurity patchset shouldn't interfere as it
>>doesn't seem to put any code between handle_mm_fault and mm_fault_error
>>and there also doesn't seem to be any new handle_mm_fault call sites.
>>
>>But I cannot tell there aren't other code paths which would lead to a
>>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
>
>
>Michal,
>
>now i can definitely confirm that problem with unremovable cgroups persists. What info do you need from me? I applied also your little 'WARN_ON' patch.
>
>azur

Ok, i think you want this:
http://watchdog.sk/lkml/kern4.log
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
       [not found]               ` <20130714015112.FFCB7AF7-Rm0zKEqwvD4@public.gmane.org>
@ 2013-07-15 15:41                 ` Michal Hocko
       [not found]                   ` <20130715154119.GA32435-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: Michal Hocko @ 2013-07-15 15:41 UTC (permalink / raw)
  To: azurIt
  Cc: Johannes Weiner, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Sun 14-07-13 01:51:12, azurIt wrote:
> > CC: "Johannes Weiner" <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> >> CC: "Johannes Weiner" <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> >>On Wed 10-07-13 18:25:06, azurIt wrote:
> >>> >> Now i realized that i forgot to remove UID from that cgroup before
> >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> >>> >> to associate all user's processes with target cgroup). Look here for
> >>> >> cgroup-uid patch:
> >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> >>> >>
> >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> >>> >> permanently '1'.
> >>> >
> >>> >This is really strange. Could you post the whole diff against stable
> >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> >>> >patch)?
> >>>
> >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> >>> http://watchdog.sk/lkml/patches3/
> >>
> >>The two patches from Johannes seem correct.
> >>
> >>From a quick look even grsecurity patchset shouldn't interfere as it
> >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> >>and there also doesn't seem to be any new handle_mm_fault call sites.
> >>
> >>But I cannot tell there aren't other code paths which would lead to a
> >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> >
> >
> >Michal,
> >
> >now i can definitely confirm that problem with unremovable cgroups
> >persists. What info do you need from me? I applied also your little
> >'WARN_ON' patch.
>
> Ok, i think you want this:
> http://watchdog.sk/lkml/kern4.log

Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------
Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA
Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace:
Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870
Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0
Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---

OK, so you had an OOM which has been handled by the in-kernel OOM handler
(it killed 12021), and 12037 was in the same group. The warning tells us
that it went through mem_cgroup_oom as well (otherwise it wouldn't have
memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
it exited on userspace request (by the exit syscall).

I do not see any way how this could happen, though. If mem_cgroup_oom
is called then we always return CHARGE_NOMEM, which turns into ENOMEM
returned by __mem_cgroup_try_charge (invoke_oom must have been set to
true). So if nobody screwed up the return value on the way up to the page
fault handler then there is no way to escape.

I will check the code.
--
Michal Hocko
SUSE Labs
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
       [not found]                   ` <20130715154119.GA32435-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-07-15 16:00                     ` Michal Hocko
       [not found]                       ` <20130715160006.GB32435-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: Michal Hocko @ 2013-07-15 16:00 UTC (permalink / raw)
  To: azurIt
  Cc: Johannes Weiner, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Mon 15-07-13 17:41:19, Michal Hocko wrote:
> On Sun 14-07-13 01:51:12, azurIt wrote:
> > > CC: "Johannes Weiner" <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> > >> CC: "Johannes Weiner" <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> > >>On Wed 10-07-13 18:25:06, azurIt wrote:
> > >>> >> Now i realized that i forgot to remove UID from that cgroup before
> > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> > >>> >> to associate all user's processes with target cgroup). Look here for
> > >>> >> cgroup-uid patch:
> > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> > >>> >>
> > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> > >>> >> permanently '1'.
> > >>> >
> > >>> >This is really strange. Could you post the whole diff against stable
> > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> > >>> >patch)?
> > >>>
> > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > >>> http://watchdog.sk/lkml/patches3/
> > >>
> > >>The two patches from Johannes seem correct.
> > >>
> > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > >>
> > >>But I cannot tell there aren't other code paths which would lead to a
> > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > >
> > >
> > >Michal,
> > >
> > >now i can definitely confirm that problem with unremovable cgroups
> > >persists. What info do you need from me? I applied also your little
> > >'WARN_ON' patch.
> >
> > Ok, i think you want this:
> > http://watchdog.sk/lkml/kern4.log
>
> Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
> Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------
> Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA
> Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace:
> Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870
> Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---
>
> OK, so you had an OOM which has been handled by in-kernel oom handler
> (it killed 12021) and 12037 was in the same group. The warning tells us
> that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> it exited on the userspace request (by exit syscall).
>
> I do not see any way how, this could happen though. If mem_cgroup_oom
> is called then we always return CHARGE_NOMEM which turns into ENOMEM
> returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> true). So if nobody screwed the return value on the way up to page
> fault handler then there is no way to escape.
>
> I will check the code.

OK, I guess I found it:
__do_fault
  fault = filemap_fault
    do_async_mmap_readahead
      page_cache_async_readahead
        ondemand_readahead
          __do_page_cache_readahead
            read_pages
              readpages = ext3_readpages
                mpage_readpages		# Doesn't propagate ENOMEM
                  add_to_page_cache_lru
                    add_to_page_cache
                      add_to_page_cache_locked
                        mem_cgroup_cache_charge

So it is the readahead, most probably. Again! Duhhh. I will try to think
about a fix for this. One obvious place is mpage_readpages, but
__do_page_cache_readahead ignores the read_pages return value as well,
and page_cache_async_readahead, even worse, is just void and exported as
such.

So this smells like a hard to fix bugger. One possible, and really ugly,
way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
doesn't return VM_FAULT_ERROR, but that is a crude hack.
--
Michal Hocko
SUSE Labs
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
       [not found]                       ` <20130715160006.GB32435-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-07-16 15:35                         ` Johannes Weiner
       [not found]                           ` <20130716153544.GX17812-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  0 siblings, 1 reply; 168+ messages in thread
From: Johannes Weiner @ 2013-07-16 15:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote:
> On Mon 15-07-13 17:41:19, Michal Hocko wrote:
> > On Sun 14-07-13 01:51:12, azurIt wrote:
> > > > CC: "Johannes Weiner" <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> > > >> CC: "Johannes Weiner" <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, "cgroups mailinglist" <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> > > >>On Wed 10-07-13 18:25:06, azurIt wrote:
> > > >>> >> Now i realized that i forgot to remove UID from that cgroup before
> > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> > > >>> >> to associate all user's processes with target cgroup). Look here for
> > > >>> >> cgroup-uid patch:
> > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> > > >>> >>
> > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> > > >>> >> permanently '1'.
> > > >>> >
> > > >>> >This is really strange. Could you post the whole diff against stable
> > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> > > >>> >patch)?
> > > >>>
> > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > > >>> http://watchdog.sk/lkml/patches3/
> > > >>
> > > >>The two patches from Johannes seem correct.
> > > >>
> > > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > > >>
> > > >>But I cannot tell there aren't other code paths which would lead to a
> > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > > >
> > > >
> > > >Michal,
> > > >
> > > >now i can definitely confirm that problem with unremovable cgroups
> > > >persists. What info do you need from me? I applied also your little
> > > >'WARN_ON' patch.
> > >
> > > Ok, i think you want this:
> > > http://watchdog.sk/lkml/kern4.log
> >
> > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
> > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------
> > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA
> > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace:
> > Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> > Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> > Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870
> > Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> > Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> > Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> > Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---
> >
> > OK, so you had an OOM which has been handled by in-kernel oom handler
> > (it killed 12021) and 12037 was in the same group. The warning tells us
> > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> > it exited on the userspace request (by exit syscall).
> >
> > I do not see any way how, this could happen though. If mem_cgroup_oom
> > is called then we always return CHARGE_NOMEM which turns into ENOMEM
> > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> > true). So if nobody screwed the return value on the way up to page
> > fault handler then there is no way to escape.
> >
> > I will check the code.
>
> OK, I guess I found it:
> __do_fault
>   fault = filemap_fault
>     do_async_mmap_readahead
>       page_cache_async_readahead
>         ondemand_readahead
>           __do_page_cache_readahead
>             read_pages
>               readpages = ext3_readpages
>                 mpage_readpages		# Doesn't propagate ENOMEM
>                   add_to_page_cache_lru
>                     add_to_page_cache
>                       add_to_page_cache_locked
>                         mem_cgroup_cache_charge
>
> So the read ahead most probably. Again! Duhhh. I will try to think
> about a fix for this. One obvious place is mpage_readpages but
> __do_page_cache_readahead ignores read_pages return value as well and
> page_cache_async_readahead, even worse, is just void and exported as
> such.
>
> So this smells like a hard to fix bugger. One possible, and really ugly
> way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> doesn't return VM_FAULT_ERROR, but that is a crude hack.

Ouch, good spot. I don't think we need to handle an OOM from the
readahead code. If readahead does not produce the desired page, we
retry synchronously in page_cache_read() and handle the OOM properly.
We should not signal an OOM for optional pages anyway.

So either we pass a flag from the readahead code down to
add_to_page_cache and mem_cgroup_cache_charge that tells the charge code
to ignore OOM conditions and not set up an OOM context. Or we DO call
mem_cgroup_oom_synchronize() from read_cache_pages, with an argument
that makes it only clean up the context and not wait. It would not be
completely outlandish to place it there, since it's right next to where
an error from add_to_page_cache() is not further propagated back through
the fault stack.

I'm travelling right now, I'll send a patch when I get back (Thursday).
Unless you beat me to it :)